Virtual Memory
- designed to overcome memory limitations cost-effectively
- allow a process to use CPU with only part of its address space in memory
- VM manager's task is to infer what parts of a process' address space are
  needed at any time and make sure those parts are in primary memory
- as different parts of address space will be loaded into different
areas of primary memory, must be able to do dynamic address binding
Paging:
- process address space partitioned into equal-sized pages
- primary memory partitioned into equal-sized page frames
- pages of a process need not be in contiguous frames
- a page table associates process pages to memory frames
   Page Table          Main Memory        Process A       Process B
   ----------          -----------        ---------       ---------
   page  frame         frame  page        pgs 0,1,2,3     pgs 4,5
     0     0             0     0
     1     1             1     1
     2     3             2     4
     3     -             3     2
     4     2
     5     -
- given this breakdown of memory into pages, addresses can be specified as
PAGE # | OFFSET
- provided the page size is a power of 2, this representation works as
  an actual bit-format address: e.g. with a 64K page size, hex address
  0B FF01 -> page 0x0B = 11, offset 0xFF01 = 65281
- address translation: mapping virtual address (contained in program)
  to physical memory address (at run time); a small C sketch follows this list
- notice that processes may be broken into pieces (in physical memory)
- we can exploit this -> don't need to keep entire program in memory at once!
- increase # of processes in memory
- increase max allowable process size: may be larger than entire RAM
- options: keep one huge page table for whole system, or
keep a separate page table for each process:
page #, frame #, mapped (page present/empty)
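A minimal C sketch of the single-level translation just described; the 16-bit
address / 4K page split and the page table contents (taken from the example
table earlier) are chosen only for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 12
    #define OFFSET_MASK ((1u << OFFSET_BITS) - 1)
    #define NUM_PAGES   16        /* 2^(16 - 12) pages in a 16-bit address space */

    /* per-process page table: frame # for each page, -1 = not resident;
       pages 0..5 mirror the example table above */
    static int page_table[NUM_PAGES] = { 0, 1, 3, -1, 2, -1,
                                        -1,-1,-1,-1,-1,-1,-1,-1,-1,-1 };

    int translate(uint16_t vaddr, uint32_t *paddr)
    {
        unsigned page   = vaddr >> OFFSET_BITS;   /* PAGE #  */
        unsigned offset = vaddr & OFFSET_MASK;    /* OFFSET  */
        if (page_table[page] < 0)
            return -1;                            /* page fault: not resident */
        *paddr = ((uint32_t)page_table[page] << OFFSET_BITS) | offset;
        return 0;
    }

    int main(void)
    {
        uint32_t paddr;
        if (translate(0x2ABC, &paddr) == 0)       /* page 2, offset 0xABC */
            printf("physical address = 0x%X\n", paddr);  /* page 2 -> frame 3 -> 0x3ABC */
        return 0;
    }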
     virtual address:   PAGE # | OFFSET
                           |        |
      page # indexes       |        |  offset is copied unchanged
      the register array   v        v
                      |--------|
                      | base 3 | -----> physical address:  FRAME # | OFFSET
                      |--------|        (frame i)
                      | base 4 |
                      |--------|
                      |        |
                     base registers
Option A) Maintain array of high speed registers, one for each virtual page.
- motivation: speed
- page mapping must be extremely fast, since it is done on every reference
- OS loads registers with process page table on each context switch
- no extra memory references, but context switches are expensive
- e.g. 16 bit address space with 4K page size -> need 12 bits for offset
leaves 4 bits for page # -> 2^4 = 16 pages -> 16 page base registers
- each context switch requires replacing those registers; not a big deal
- but what happens with increasing VM size? high cost of store/load
- Page table may be extremely large
   (e.g. 32 bit address space with 4K page size -> need 12 bits for offset,
   leaves 20 bits for page # -> 2^20 pages, i.e. about 1 million)
 - do we want to devote 1 million registers (each log_2[# frames] bits wide)
   just to store this map? and what about the cost of a process switch?
     virtual address:   PAGE # | OFFSET
                           |        |
     base register ------> +        |  offset is copied unchanged
     (page table start     |        v
      address in memory)   v
                      |--------|
                      | base 3 | -----> physical address:  FRAME # | OFFSET
                      |--------|        (frame i)
                      | base 4 |
                      |--------|
                      |        |
                     page table (in main memory)
Option B) Entire page table stored in main memory.
- single register holds starting address of page table in physical memory
- If page table will not fit in a single page, use multi-level table
- context switches are cheap (only one register update), but at least
one extra memory reference is required for every memory translation
- i.e. we have doubled our memory access requirements (and memory
is usually the bottleneck in most programs!)
Multi-level page table:
 - virtual address =  | PT1 index | PT2 index | OFFSET |
                           |            |          |
                           v            v          v
                     index into     index into    offset into
                     top-level      2nd-level     the memory frame
                     page table,    page table,
                     yielding       yielding
                     frame # of     frame # of
                     the 2nd-level  the referenced
                     page table     page
      PT1 (1K entries)            PT2 in frame 93 (1K entries)
      ----------------            ----------------------------
      0    35                     0    763
      1    52                     1    353
      2    17            |----->  2   9933  -----> referenced page is in frame 9933
      3    93  ----------|        3   8763
                                  4   3211
 virtual address <3,2,ff31>:  PT1[3] = 93 (frame holding PT2),
                              PT2[2] = 9933 (frame of the page), offset = ff31
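A toy C sketch of this two-level walk for <3,2,ff31>; the table contents mirror
the figure, and the "memory" layout (plain arrays, a frame_base helper, 64K
pages so that ff31 fits in the offset) is assumed purely for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 16                     /* assumed 64K pages               */
    #define ENTRIES     1024                   /* 1K entries per table, as above  */

    /* toy "memory": the two page tables from the figure, as plain arrays */
    static uint32_t pt1[ENTRIES]         = { 35, 52, 17, 93 };
    static uint32_t pt2_frame93[ENTRIES] = { 763, 353, 9933, 8763, 3211 };

    /* in a real system this would map a frame # to its location in memory;
       here we only know about the frame holding PT2 */
    static uint32_t *frame_base(uint32_t frame) {
        return (frame == 93) ? pt2_frame93 : 0;
    }

    int main(void)
    {
        uint32_t i1 = 3, i2 = 2, off = 0xff31;     /* virtual address <3,2,ff31>        */
        uint32_t *pt2  = frame_base(pt1[i1]);      /* PT1[3] = 93 -> PT2 is in frame 93 */
        uint32_t frame = pt2[i2];                  /* PT2[2] = 9933 -> frame of page    */
        uint32_t paddr = (frame << OFFSET_BITS) | off;
        printf("frame %u, offset 0x%x, physical address 0x%x\n", frame, off, paddr);
        return 0;
    }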
Option C) Page Table Cache: Associative Memory
- store page base addresses of last few lookups in processor registers
(a register 'cache')
 - exploits locality: most of the time, we will find the mapping we need
- this cache is called the Translation Lookaside Buffer (TLB)
- special hardware can search the TLB entries in parallel
- if TLB hit, go directly to page
- otherwise, TLB miss, lookup entry in page table and update TLB entry
 - assume hit rate is 90% and TLB lookup time = 0.1 of a page table lookup
 - then a TLB-assisted memory access requires, on average,
   .9 * .1 + .1 * (.1 + 1) = .09 + .11 = .20 of the non-TLB-assisted cost
TLB:
----
  virtual    valid    modified    protection        page
  page #     bit      bit         (access bits)     frame #
- memory manager checks TLB - if hit, go straight to indicated frame
- if miss, copy 'modified' bit of TLB entry to page table and update
 - what to do with multiple processes? (after a context switch, TLB entries
   still refer to the old process' pages)
a) flush TLB on context switch
b) add process ID field to TLB
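A rough C sketch of the TLB check just described; the entry fields follow the
table above, but the TLB size, the miss handling, and the eviction choice are
assumptions for illustration (real hardware compares all entries in parallel):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_SIZE  16
    #define NUM_PAGES 64

    struct tlb_entry {
        uint32_t vpage;       /* virtual page #  */
        bool     valid;       /* valid bit       */
        bool     modified;    /* modified bit    */
        uint8_t  protection;  /* access bits     */
        uint32_t frame;       /* page frame #    */
    };

    static struct tlb_entry tlb[TLB_SIZE];
    static uint32_t page_table[NUM_PAGES];    /* toy single-level page table */

    uint32_t translate(uint32_t vpage)
    {
        /* TLB hit: hardware searches all entries in parallel; a loop stands in here */
        for (int i = 0; i < TLB_SIZE; i++)
            if (tlb[i].valid && tlb[i].vpage == vpage)
                return tlb[i].frame;          /* go straight to the frame */

        /* TLB miss: one extra memory reference to the page table, then refill */
        uint32_t frame  = page_table[vpage];
        int      victim = (int)(vpage % TLB_SIZE);   /* arbitrary eviction choice */
        /* a real refill would first copy the old entry's 'modified' bit back
           to the page table, as noted above */
        tlb[victim] = (struct tlb_entry){ vpage, true, false, 0, frame };
        return frame;
    }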
Paging policies:
- fetch policy: when should a page be loaded into primary memory
- DEMAND PAGING: system loads a page into main memory only
when a reference is made to a non-resident page (page fault)
 - process is blocked, OS retrieves page from disk, swapping a page
   (possibly from another process) out to disk if space is needed
- won't this be very slow? (think principle of locality)
- when OS decides to remove a page from memory, does it need to
write it back to disk? (no, can discard if it hasn't been written)
- PREFETCH: system anticipates when a page will be needed and
loads it in advance; expensive; must have detailed knowledge
of program behaviour
- replacement policy: if primary memory is full, which page should be removed
- RANDOM REPLACEMENT: yikes, poor performance
- OPTIMAL REPLACEMENT: choose page that will not be needed for
longest time (how do we know this? in general, we don't)
 - LFU: keep track of how many times each page was used recently;
   replace the least frequently used
- FIFO: replace page that has been in memory the longest
- simple, but not well suited to program behaviour
- LRU: take advantage of principle of locality
- "past is best predictor of the future"
- in fact, it's excellent: within 30-40% of optimal
- problem: expensive to record time of every page access
 - most references are satisfied directly by the translation lookaside
   buffer, so the OS never even sees them
- how to search for LRU? also expensive!
- simple implementation: approximation of LRU
- use single 'referenced' bit to indicate access
- set by hardware on any page access
- choose any page with 0 referenced bit and reset all
- not very accurate: can extend to multiple bits...
- shift right one bit every n clock cycles
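One way to picture this multi-bit extension ("aging"): each resident page keeps
a small history value that is shifted right every n ticks with the referenced
bit ORed into the top; a C sketch, with sizes and names chosen for illustration:

    #include <stdint.h>

    #define NPAGES 64

    static uint8_t history[NPAGES];      /* aging counters, one per resident page */
    static uint8_t referenced[NPAGES];   /* set to 1 by hardware on any access    */

    void age_tick(void)                  /* run every n clock cycles */
    {
        for (int p = 0; p < NPAGES; p++) {
            history[p] = (uint8_t)((history[p] >> 1) | (referenced[p] << 7));
            referenced[p] = 0;           /* reset for the next interval */
        }
    }

    int pick_victim(void)                /* smallest counter ~ least recently used */
    {
        int best = 0;
        for (int p = 1; p < NPAGES; p++)
            if (history[p] < history[best])
                best = p;
        return best;
    }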
- Clock Algorithms:
- use circular linked list of page frames
- keep pointer to last page replaced
- every page access sets "used" bit to 1
 - on page fault, advance pointer, clearing "used" bits (1 -> 0) as pages
   are passed over, until a page with used = 0 is found
 - replace that page
- very similar to 1-bit approx. of LRU, but imposes total order
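A minimal sketch of the clock scan in C (frame count and names are illustrative):

    #define NFRAMES 8

    static int used[NFRAMES];   /* set to 1 by hardware on any access to the frame */
    static int hand = 0;        /* points just past the last page replaced         */

    int clock_victim(void)
    {
        for (;;) {
            if (used[hand] == 0) {           /* not used since last sweep: evict */
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            used[hand] = 0;                  /* give it a second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }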
Replacement issue:
- number of pages allocated per process should be a
compromise between process locality (low page fault rate)
and available main memory (must share with other processes)
- fixed allocation: fixed # pages/process
- replace only within process
- variable allocation: # pages/process grows based on need
- how is 'need' measured? look at faulting behaviour
- avoid thrashing
- should pages be replaced locally or globally?
- local may be optimal but more difficult/overhead
- placement policy: where should fetched page be loaded into primary memory
- for most strategies, placement is simply into the replaced frame
- only exception is when memory is not full (why bother?)
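Putting the fetch, replacement, and placement policies together, a hedged
sketch of a demand-paging fault handler; every helper it calls
(find_free_frame, choose_victim, frame_owner, disk_read, disk_write) is
hypothetical, named only to show the flow:

    struct pte { int present, used, dirty, frame; };

    extern struct pte page_table[];
    extern int  find_free_frame(void);               /* -1 if memory is full       */
    extern int  choose_victim(void);                 /* replacement policy         */
    extern int  frame_owner(int frame);              /* which page occupies a frame */
    extern void disk_read(int page, int frame);
    extern void disk_write(int page, int frame);

    void page_fault(int page)
    {
        int frame = find_free_frame();               /* placement is trivial when  */
        if (frame < 0) {                             /* memory is not yet full     */
            frame = choose_victim();
            struct pte *old = &page_table[frame_owner(frame)];
            if (old->dirty)
                disk_write(frame_owner(frame), frame);  /* only dirty pages cost a write */
            old->present = 0;
        }
        disk_read(page, frame);                      /* process blocks while page loads */
        page_table[page] = (struct pte){ 1, 1, 0, frame };
    }

Note the dirty check: a clean victim can simply be discarded, which is the
motivation for the clean/dirty distinction discussed next.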
Question: On a page fault, is it wise to actually "replace" a page right away?
 - Recall: if the page was not modified, no write-to-disk is required...
- Also: disk buffering: more efficient to write several blocks at once
- Therefore: maintain two lists: "clean" and "dirty" pages
- choose to replace clean page over dirty page -> less work
Page Table
-----------
page present used dirty frame #
--------------------------------------
0 1 0 1 0
1 1 0 0 1
2 1 1 0 3
3 0 0 0 x
4 1 1 0 2
5 0 0 0 x
Modified Clock Replacement Algorithm:
 - four cases: 1) u=0,d=0   2) u=0,d=1   3) u=1,d=0   4) u=1,d=1
 - pages with d=1 are more expensive to replace
 Scan 1: look for <u=0, d=0> (don't change any bits); if found, replace
 Scan 2: look for <u=0, d=1>, resetting u: 1->0 on pages passed over;
         if found, replace
 Scan 3: if Scan 2 failed, every used bit is now 0; go back to the
         beginning and repeat from Scan 1 (a victim is now guaranteed)
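A sketch of these scans in C; the frame count and helper structure are
illustrative, and the hand position persists across calls as in the basic
clock above:

    #define NFRAMES 8

    static int used[NFRAMES], dirty[NFRAMES];
    static int hand = 0;

    static int scan(int want_dirty, int clear_used)
    {
        for (int i = 0; i < NFRAMES; i++) {
            int f = (hand + i) % NFRAMES;
            if (used[f] == 0 && dirty[f] == want_dirty) {
                hand = (f + 1) % NFRAMES;
                return f;                       /* victim found */
            }
            if (clear_used)
                used[f] = 0;                    /* Scan 2 resets u: 1 -> 0 */
        }
        return -1;
    }

    int modified_clock_victim(void)
    {
        for (;;) {
            int f;
            if ((f = scan(0, 0)) >= 0) return f;    /* Scan 1: <u=0, d=0> */
            if ((f = scan(1, 1)) >= 0) return f;    /* Scan 2: <u=0, d=1> */
            /* Scan 3: all used bits are now 0; repeat from Scan 1 */
        }
    }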
Cost of VM
----------
- assume avg. instruction requires 200 ns
- assume page fault replacement costs 20 ms (to read in 4K from disk)
- page fault rate = r = avg # of faults per instruction
n instructions take: n*200 ns + nr*20 ms
avg instruction takes: 200 ns + r * 20 ms
- with r = 10^-6, slowdown factor = 1.1
- with r = 10^-5, slowdown factor = 2
- with r = 10^-4, slowdown factor = 11
- with r = 10^-3, slowdown factor = 101
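The same arithmetic as a small C program (values taken from the assumptions
above); it reproduces the slowdown factors listed:

    #include <stdio.h>

    int main(void)
    {
        const double instr_ns = 200.0;          /* avg instruction time          */
        const double fault_ns = 20e6;           /* 20 ms page fault, in ns       */
        const double rates[]  = { 1e-6, 1e-5, 1e-4, 1e-3 };

        for (int i = 0; i < 4; i++) {
            double r   = rates[i];
            double eff = instr_ns + r * fault_ns;     /* 200 ns + r * 20 ms */
            printf("r = %g  ->  slowdown = %.1f\n", r, eff / instr_ns);
        }
        return 0;
    }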