Virtual Memory

- designed to overcome memory limitations cost-effectively
- allow a process to use CPU with only part of its address space in memory
- the VM manager's task is to infer which parts of a process's address space
  are needed at any time and make sure these are in primary memory
- as different parts of address space will be loaded into different
  areas of primary memory, must be able to do dynamic address binding 

Paging:
- process address space partitioned into equal-sized pages
- primary memory partitioned into equal-sized page frames
- pages of a process need not be in contiguous frames
- a page table associates process pages to memory frames


Page Table		Main Memory		Process A	Process B
----------		-----------		---------	---------
page	frame		frame 	page		pgs 0,1,2,3	pgs 4,5
0	 0		 0	 0
1	 1		 1	 1
2	 3		 2	 4
3	 -		 3 	 2
4	 2		 
5	 -		

- given this breakdown of memory into pages, addresses can be specified as
  PAGE # | OFFSET
- provided page size is a power of 2, this representation works as
  an actual bit-format address: 0x0BFF01 -> page 11 (0x0B), offset 65281 (0xFF01),
  assuming a 16-bit offset (64K pages)
- address translation: mapping virtual address (contained in program)
  to physical memory address (at run time)
- notice that processes may be broken into pieces (in physical memory)
- we can exploit this -> don't need to keep entire program in memory at once!
	- increase # of processes in memory
	- increase max allowable process size: may be larger than entire RAM
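Since the page size is a power of 2, the page#/offset split is just a shift
and a mask. A quick sketch (the function name is illustrative, and the 64K-page
reading of the example address is an assumption):

```python
def split_address(vaddr, offset_bits):
    """Split a virtual address into (page #, offset).

    Works only because the page size is a power of two: the low
    offset_bits bits are the offset, the rest is the page number.
    """
    page = vaddr >> offset_bits
    offset = vaddr & ((1 << offset_bits) - 1)
    return page, offset

# the example address, read as hex with a 16-bit offset (64K pages):
print(split_address(0x0BFF01, 16))   # -> (11, 65281)

# a 32-bit address with 4K pages (12-bit offset):
print(split_address(0xDEADBEEF, 12))
```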

- options: keep one huge page table for whole system, or
	   keep a separate page table for each process: 
		page #, frame #, mapped (page present/empty)

								
             |--------------------------------------------    
 	     |				                 |
  PAGE # | OFFSET	|	|		  	 v
   |			|-------|   Frame i	   
   |------------------->| base 3| ---------->  FRAME # | OFFSET
			|-------|
			| base 4|
			|-------|
			|	|
			base registers

Option A)  Maintain array of high speed registers, one for each virtual page. 
- motivation: speed
- page mapping must be extremely fast, since it is done on every reference
- OS loads registers with process page table on each context switch
- no extra memory references, but context switches are expensive
- e.g. 16 bit address space with 4K page size -> need 12 bits for offset
   leaves 4 bits for page # -> 2^4 = 16 pages -> 16 page base registers
- each context switch requires replacing those registers; not a big deal
- but what happens with increasing VM size?  high cost of store/load
- Page table may be extremely large 
  (e.g. 32 bit address space with 4K page size -> need 12 bits for offset
   leaves 20 bits for page # -> 2^20, or about 1 million pages)
- do we want to devote 1 million registers (each log_2[# frames] bits wide)
  just to store this map?  what about the cost of a process switch?
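The page-count arithmetic in the two examples can be checked mechanically
(a small sketch; the function name is illustrative):

```python
def num_pages(addr_bits, page_size):
    """How many pages (and hence base registers) option A needs,
    for a power-of-two page size."""
    offset_bits = page_size.bit_length() - 1   # log2 of page size
    return 1 << (addr_bits - offset_bits)

print(num_pages(16, 4096))   # -> 16
print(num_pages(32, 4096))   # -> 1048576 (~1 million)
```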

             |--------------------------------------------    
 	     |				                 |
  PAGE # | OFFSET	|	|		  	 v
   |			|-------|   Frame i	   
   v			| base 3| ---------->  FRAME # | OFFSET
   + ------------------>|-------|
   ^			| base 4|
   |			|-------|
 Base register		|	|
			page table

Option B) Entire page table stored in main memory.
- single register holds starting address of page table in physical memory
- if the page table will not fit in a single page, use a multi-level table
- context switches are cheap (only one register update), but at least 
  one extra memory reference is required for every memory translation
- i.e. we have doubled our memory access requirements (and memory
  is usually the bottleneck in most programs!)

Multi-level page table:

- virtual address = 
		      |	  		 |	 	   |
		      v    		 v     		   v
		index into top-		index into	offset 
		level page table	2nd level 	into memory frame
		yielding frame #	page table
		of 2nd level page 	yielding frame
		table 			# of referenced
					page

PT1 (1K entries)		PT2 in frame 93 (1K entries)
----------------		-----------------------------		|	
 0	35			0	763				||
 1	52			1	353				||
 2	17		|---->	2	9933 ----------------------> 	||
 3	93 --------------	3	8763				||
				4	3211

virtual address <3,2,ff31>
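A minimal sketch of the two-level walk above, with the table contents
hard-coded from the example (the dict layout is illustrative, not how
hardware stores page tables):

```python
# top-level table: index -> frame # of the 2nd-level table
PT1 = {0: 35, 1: 52, 2: 17, 3: 93}
# 2nd-level tables, keyed by the frame that holds them
PT2 = {93: {0: 763, 1: 353, 2: 9933, 3: 8763, 4: 3211}}

def translate(top, second, offset):
    """Walk both levels for virtual address <top, second, offset>."""
    pt2_frame = PT1[top]             # frame holding the 2nd-level table
    frame = PT2[pt2_frame][second]   # frame # of the referenced page
    return frame, offset

print(translate(3, 2, 0xFF31))   # -> (9933, 65329)
```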

Option C) Page Table Cache: Associative Memory
- store page base addresses of last few lookups in processor registers
  (a register 'cache')
- exploits locality: most of time, we will find the address we need
- this cache is called the Translation Lookaside Buffer (TLB) 
- special hardware can search the TLB entries in parallel
- if TLB hit, go directly to page
- otherwise, TLB miss, lookup entry in page table and update TLB entry
- assume hit rate is 90% and TLB lookup time = 0.1 page table lookup
- then TLB-assisted memory translation requires, on average,
	.9 * .1 + .1 * (.1 + 1) = .09 + .11 = .20 of the non-TLB-assisted cost
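The same expected-cost computation, in units of one page-table lookup
(function and parameter names are illustrative):

```python
def tlb_cost(hit_rate, tlb=0.1, table=1.0):
    """Expected translation cost, in units of one page-table lookup.
    Hit: TLB lookup only.  Miss: TLB lookup + full table lookup."""
    return hit_rate * tlb + (1 - hit_rate) * (tlb + table)

print(round(tlb_cost(0.9), 2))   # -> 0.2
```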

TLB:
----
virtual		valid	modified	protection	frame 
page #     	bit    	bit		(access bits)	#


- memory manager checks TLB - if hit, go straight to indicated frame
- if miss, evict a TLB entry (copying its 'modified' bit back to the
  page table), then load the new entry from the page table
- what to do with multiprogramming (many processes share the TLB)?
	a) flush TLB on context switch
	b) add process ID field to TLB


Paging policies:
- fetch policy: when should a page be loaded into primary memory
	- DEMAND PAGING: system loads a page into main memory only 
	  when a reference is made to a non-resident page (page fault)
	- process is blocked, OS retrieves page from disk, swapping a page
  	  from another process to disk if space is needed
	- won't this be very slow? (think principle of locality)
	- when OS decides to remove a page from memory, does it need to 
	  write it back to disk?  (no, can discard if it hasn't been written)

	- PREFETCH: system anticipates when a page will be needed and
	  loads it in advance; expensive; must have detailed knowledge
	  of program behaviour

- replacement policy: if primary memory is full, which page should be removed
	- RANDOM REPLACEMENT: yikes, poor performance
	- OPTIMAL REPLACEMENT: choose page that will not be needed for
	  longest time (how do we know this?  in general, we don't)
	- LFU: replace the page used least frequently (track a use count per page)
	- FIFO: replace page that has been in memory the longest
		- simple, but not well suited to program behaviour
	- LRU: take advantage of principle of locality
		- "past is best predictor of the future"
		- in fact, it's excellent: within 30-40% of optimal
		- problem: expensive to record time of every page access
		- translation lookaside buffer
		- how to search for LRU?  also expensive!
		- simple implementation: approximation of LRU 
			- use single 'referenced' bit to indicate access 
			- set by hardware on any page access
			- choose any page with 0 referenced bit and reset all
		 	- not very accurate: can extend to multiple bits...
			- shift right one bit every n clock cycles
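The multiple-bit extension is the classic "aging" approximation; a sketch,
where AGE_BITS, the tick interface, and the page names are all illustrative:

```python
AGE_BITS = 8   # width of each per-page aging counter

def tick(counters, referenced):
    """Every n clock cycles: shift each counter right one bit and OR
    the page's referenced bit into the top bit.  counters maps
    page -> counter; referenced is the set of pages touched since
    the last tick."""
    for page in counters:
        counters[page] >>= 1
        if page in referenced:
            counters[page] |= 1 << (AGE_BITS - 1)

def victim(counters):
    # smallest counter ~ least recently used
    return min(counters, key=counters.get)

counters = {0: 0, 1: 0, 2: 0}
tick(counters, {0, 2})   # pages 0 and 2 referenced this interval
tick(counters, {0})      # only page 0 referenced
print(victim(counters))  # -> 1 (never referenced)
```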

	- Clock Algorithms:
		- use circular linked list of page frames
		- keep pointer to last page replaced
		- every page access sets "used" bit to 1
		- on page fault, advance pointer until "used = 0" found
		- replace that page; reset all bits to 0
		- very similar to 1-bit approx. of LRU, but imposes total order
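A sketch of the clock scan (this version clears used bits one at a time as
the hand passes them, the common formulation, rather than resetting all bits
after a replacement; the class interface is illustrative):

```python
class Clock:
    """Minimal clock replacement over a fixed set of frames."""
    def __init__(self, nframes):
        self.frames = [None] * nframes   # page held by each frame
        self.used = [0] * nframes        # "used" bits
        self.hand = 0                    # pointer into the circular list

    def access(self, page):
        """Touch a page; on a fault, return the evicted page (or None)."""
        if page in self.frames:          # hit: just set the used bit
            self.used[self.frames.index(page)] = 1
            return None
        # page fault: advance the hand, clearing used bits,
        # until a frame with used == 0 is found
        n = len(self.frames)
        while self.used[self.hand]:
            self.used[self.hand] = 0
            self.hand = (self.hand + 1) % n
        evicted = self.frames[self.hand]
        self.frames[self.hand] = page
        self.used[self.hand] = 1
        self.hand = (self.hand + 1) % n
        return evicted

c = Clock(2)
c.access(1); c.access(2)
print(c.access(3))   # -> 1 (page 1 evicted)
```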

	Replacement issue:
		- number of pages allocated per process should be a 
		compromise between process locality (low page fault rate) 
		and available main memory (must share with other processes)
		- fixed allocation: fixed # pages/process
			- replace only within process
		- variable allocation: # pages/process grows based on need
			- how is 'need' measured? look at faulting behaviour
			- avoid thrashing
			- should pages be replaced locally or globally?
			- local may be optimal but more difficult/overhead

- placement policy: where should fetched page be loaded into primary memory
	- for most strategies, placement is simply into the replaced frame
	- only exception is when memory is not full (why bother?)

Question: On a page fault, is it wise to actually "replace" a page right away?
	- Recall: if the page was not modified, no write-to-disk is required...
	- Also: disk buffering: more efficient to write several blocks at once
	- Therefore: maintain two lists: "clean" and "dirty" pages
	- choose to replace clean page over dirty page -> less work

Page Table
-----------		
page  present	used	dirty  frame #
--------------------------------------
0	1	 0	 1	 0	
1	1	 0	 0	 1
2	1	 1	 0	 3
3	0	 0	 0	 x
4	1	 1	 0	 2
5	0	 0	 0	 x

Modified Clock Replacement Algorithm:

- four cases: 1) u=0,d=0  2) u=0,d=1  3) u=1,d=0  4) u=1,d=1
- pages with d=1 are more expensive to replace

	Scan 1: look for <u=0, d=0>; if found, replace
	Scan 2: look for <u=0, d=1>, clearing u (1->0) on each page passed;
		if found, replace
	Scan 3: if Scan 2 failed, every u bit is now 0
		=> repeat from Scan 1 (guaranteed to succeed)
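The multi-scan policy can be sketched as a victim-selection function (the
list-of-dicts layout is illustrative; a real implementation scans frames
circularly from the hand position):

```python
def modified_clock_victim(pages):
    """pages: frames in clock order starting at the hand, each a dict
    with 'u' (used) and 'd' (dirty) bits; returns the victim's index."""
    # Scan 1: look for <u=0, d=0>; no bits are changed
    for i, p in enumerate(pages):
        if p['u'] == 0 and p['d'] == 0:
            return i
    # Scan 2: look for <u=0, d=1>, clearing u on each frame passed
    for i, p in enumerate(pages):
        if p['u'] == 0 and p['d'] == 1:
            return i
        p['u'] = 0
    # Scan 3: every u bit is now 0, so a rescan must succeed
    return modified_clock_victim(pages)

print(modified_clock_victim(
    [{'u': 1, 'd': 1}, {'u': 1, 'd': 0}, {'u': 0, 'd': 1}]))  # -> 2
```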

Cost of VM
----------
- assume avg. instruction requires 200 ns
- assume page fault replacement costs 20 ms (to read in 4K from disk)
- page fault rate = r = avg # of faults per instruction

	n instructions take: n*200 ns + nr*20 ms
	avg instruction takes: 200 ns + r * 20 ms

- with r = 10^-6, slowdown factor = 1.1
- with r = 10^-5, slowdown factor = 2
- with r = 10^-4, slowdown factor = 11
- with r = 10^-3, slowdown factor = 101
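The slowdown figures follow directly from the formula (constant names are
illustrative):

```python
INSTR_NS = 200            # avg instruction time, ns
FAULT_NS = 20 * 10**6     # 20 ms page-fault service time, in ns

def slowdown(r):
    """r = avg page faults per instruction."""
    return (INSTR_NS + r * FAULT_NS) / INSTR_NS

for r in (1e-6, 1e-5, 1e-4, 1e-3):
    print(f"r = {r:g}: slowdown {slowdown(r):g}")
```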