UVM Documentation
Initially I am going to use this page to post the various documentation scattered through the UVM source code such as structure descriptions and locking protocols. Later I hope to paint a clearer picture of how the UVM works once I get familiar with the code. Eventually it may become a valuable reference for anyone working on the UVM.
Other sources
- NetBSD's documentation page http://www.netbsd.org/Documentation/kernel/uvm.html
- Chuck Cranor's dissertation http://www.ccrc.wustl.edu/pub/chuck/psgz/diss.ps.gz
- Local mirror of Chuck's dissertation (converted to PDF) UVM_Dissertation.pdf
New bringup Code for i386 (arch/i386/init.c)
Some important defines for the VM
- In arch/i386/pmap.h
  - #define PDSLOT_KERN ( KERNBASE / NBPD ), which is 0. This is the first slot in the Page Directory occupied by the kernel's virtual address space.
  - #define PDSLOT_APTE ( PDSLOT_PTE - 1 ), which is 510. This is an alternate recursive mapping that the pmap layer uses to access Page Table pages in other processes' Page Directories. This slot maps 0x7F800000 - 0x7FBFFFFF.
  - #define PDSLOT_PTE ( PDSLOT_USER - 1 ), which is 511. This is a recursive mapping of the Page Directory. Every entry in the Page Directory that contains the address of a Page Table page is now also interpreted as a Page Table entry, which maps every Page Table page into a linear 4MB space. The pmap layer uses this to access Page Table pages. This slot maps 0x7FC00000 - 0x7FFFFFFF.
  - #define PDSLOT_USER ( 512 ). This is the first slot in the Page Directory occupied by user virtual address space.
  - #define NKPTP ( 1 ). This is the number of Page Directory entries the kernel occupies on boot. This can be changed after boot with pmap_growkernel(). The kernel and the modules loaded by the bootloader must fit in physical memory between KERNBASE and KERNBASE + NKPTP * NBPD. The kernel is not likely to grow beyond 4MiB, but nothing is stopping the user from loading huge modules. We are also assuming that the bootloader loads the modules close to the kernel; nothing in the multiboot spec guarantees this.
- In arch/i386/vmparam.h
  - #define VM_MIN_KERNEL_ADDRESS ((vaddr_t)(PDSLOT_KERN << PDSHIFT)), which is 0x00000000. This is the inclusive minimum page address in kernel virtual address space.
  - #define VM_MAX_KERNEL_ADDRESS ((vaddr_t)(PDSLOT_APTE << PDSHIFT)), which is 0x7F800000. This is the exclusive maximum page address in kernel virtual address space.
  - #define VM_MIN_USER_ADDRESS ((vaddr_t)(PDSLOT_USER << PDSHIFT)), which is 0x80000000. This is the inclusive minimum page address in user virtual address space.
  - #define VM_MAX_USER_ADDRESS ((vaddr_t)(((NPTEPD - 1) << PDSHIFT) + ((NPTPG - 1) << PAGE_SHIFT))), which is 0xFFFFF000. This is the exclusive maximum page address in user virtual address space (actually inclusive, but it's nice to match with VM_MAX_KERNEL_ADDRESS).
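The slot numbers and address ranges quoted above can be checked with a few shifts. The following is a minimal sketch, assuming KERNBASE is 0, 4 KiB pages (PAGE_SHIFT of 12), 4 MiB per Page Directory slot (PDSHIFT of 22), and 1024 entries per table. Those constants are inferred from the values quoted on this page rather than copied from the headers, and slot_base(), slot_of() and vm_max_user_address() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, inferred from the figures quoted in the text. */
#define PAGE_SHIFT   12
#define PDSHIFT      22                 /* 4 MiB per Page Directory slot */
#define NBPD         (1u << PDSHIFT)
#define KERNBASE     0x00000000u
#define NPTEPD       1024u              /* entries per Page Directory */
#define NPTPG        1024u              /* entries per Page Table page */

#define PDSLOT_USER  512u
#define PDSLOT_PTE   (PDSLOT_USER - 1)
#define PDSLOT_APTE  (PDSLOT_PTE - 1)
#define PDSLOT_KERN  (KERNBASE / NBPD)

/* First virtual address covered by a Page Directory slot (hypothetical helper). */
static uint32_t slot_base(uint32_t slot)
{
    assert(slot < NPTEPD);
    return slot << PDSHIFT;
}

/* Page Directory slot that covers a virtual address (hypothetical helper). */
static uint32_t slot_of(uint32_t va)
{
    return va >> PDSHIFT;
}

/* Same arithmetic as VM_MAX_USER_ADDRESS, done in unsigned 32-bit math. */
static uint32_t vm_max_user_address(void)
{
    return ((NPTEPD - 1) << PDSHIFT) + ((NPTPG - 1) << PAGE_SHIFT);
}
```

Under these assumptions, slot_base(PDSLOT_APTE) is the same value as VM_MAX_KERNEL_ADDRESS and slot_base(PDSLOT_USER) is the same value as VM_MIN_USER_ADDRESS.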
Initial Page Directory (cr3) layout
Page Directory:
| Page Directory Slot | Virtual Address Range | Description |
|---|---|---|
| 0 | 0x00000000 - 0x003FFFFF | Maps the first 4MB of virtual address space to the first 4MB of physical memory. This contains the kernel, boot modules, and the boot heap. If we need more space for this we can just increase NKPTP. |
| 1 - 509 | 0x00400000 - 0x7F7FFFFF | Filled in with valid page tables, but no valid addresses yet. pmap_bootstrap() carves a few page table entries out of this range for utility mappings. |
| 510 | 0x7F800000 - 0x7FBFFFFF | Nothing here yet; managed with pmap_map_ptes() and pmap_unmap_ptes(). |
| 511 | 0x7FC00000 - 0x7FFFFFFF | Recursive map of page directory. |
| 512 - 1023 | 0x80000000 - 0xFFFFFFFF | User space, nothing here yet. |
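The recursive slot is easier to follow with the address arithmetic written out. The sketch below assumes 32-bit, 4-byte entries (no PAE) and the slot number 511 quoted earlier; pte_vaddr() and pde_vaddr() are hypothetical names used only for illustration, not functions from the pmap layer.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PDSHIFT     22
#define PDSLOT_PTE  511u
#define ENTRY_SIZE  4u                        /* assumed: 32-bit entries */

/* Base of the 4MB window in which every Page Table page appears. */
#define PTE_BASE    (PDSLOT_PTE << PDSHIFT)   /* 0x7FC00000 */

/* Virtual address at which the PTE that maps 'va' can be read (hypothetical helper). */
static uint32_t pte_vaddr(uint32_t va)
{
    return PTE_BASE + (va >> PAGE_SHIFT) * ENTRY_SIZE;
}

/*
 * Virtual address of the Page Directory entry that covers 'va'.
 * The Page Directory is the "Page Table page" that maps the window
 * itself, so it shows up at pte_vaddr(PTE_BASE).
 */
static uint32_t pde_vaddr(uint32_t va)
{
    uint32_t pd_base = pte_vaddr(PTE_BASE);
    assert(pd_base == PTE_BASE + PDSLOT_PTE * (1u << PAGE_SHIFT));
    return pd_base + (va >> PDSHIFT) * ENTRY_SIZE;
}
```

With these numbers the Page Directory itself is visible at 0x7FDFF000, one page inside the 0x7FC00000 - 0x7FFFFFFF window.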
i386_init()
Initialization steps relevant to the VM:
Set the heap pointer to the end of the kernel image (earlier in the file):

    char *heap = (void *)_end;

Load memory segment and boot module information from the multiboot header and push the heap pointer past the last module:

    /* After this call the heap is pushed past the last module */
    multiboot_read_config(mbi);

    /* Apply sanity checks to values our boot loader provided. */
    assert(size_base > MIN_MEM_BASE, "need 640K base mem");
    assert(size_ext >= MIN_MEM_EXTENDED, "need 1M extended mem");
    memsegs[0].m_base = 0;
    memsegs[0].m_len = size_base;
    memsegs[1].m_base = MIN_MEM_EXTENDED; // badly named
    memsegs[1].m_len = size_ext;

    /*
     * Multiboot will have deposited zero or more (likely more) modules.
     * We must manually construct processes for them so that they will be
     * scheduled when we start running processes. This technique is used
     * to avoid having to embed boot drivers/ filesystems/etc. into the
     * microkernel. We're not ready to do full task creation, but now is a
     * good time to tabulate them.
     */
    if (nboot_task != 0) {
        struct multiboot_module *m;

        /* When we have the heap, get our boot_tasks table */
        b = boot_tasks = (void *)heap;
        heap += sizeof(struct boot_task) * nboot_task;

        /* Walk the multiboot modules */
        for (y = 0, m = mod_ptr; y < nboot_task; ++y, ++m, ++b) {
            /* Convert from a.out-ese into a more generic
             * representation. */
            a = (struct aout *)m->mod_start;

            /* Record next entry */
            b->b_pc = a->a_entry;
            b->b_textaddr = (char *)PAGE_SIZE;
            b->b_text = PFN_UP(a->a_text + sizeof(struct aout));
            b->b_dataaddr = (void *)(PAGE_SIZE*1024);
            b->b_data = PFN_UP(a->a_data);
            b->b_bss = PFN_UP(a->a_bss);
            b->b_pfn = PFN_DOWN(a);

            /* Patch in the arguments. */
            patch_args(a, (char *)m->string);
        }
    }

Create our page directory page:

    heap = (char *)roundup(heap, PAGE_SIZE);
    cr3 = (pd_entry_t *)heap;
    heap += PAGE_SIZE;
    bzero(cr3, PAGE_SIZE);

Map in the first NKPTP page table pages:

    pt = (pt_entry_t *)heap;
    heap += PAGE_SIZE;
    y = PDSLOT_KERN;
    cr3[y++] = (paddr_t)pt | PT_V | PT_W;
    pt[0] = 0;
    for (x = 1; x < NPTPG * NKPTP; ++x) {
        if ((x % NPTPG) == 0) {
            cr3[y++] = (paddr_t)(&pt[x]) | PT_V | PT_W;
            heap += PAGE_SIZE;
        }
        pt[x] = (x << PT_PFNSHIFT) | PT_V | PT_W;
    }

Set up PDSLOT_PTE:

    cr3[PDSLOT_PTE] = (paddr_t)cr3 | PT_V | PT_W;

Enable paging:

    /* Switch to our own PTEs */
    set_cr3((ulong)cr3);
    cr0 = get_cr0();
    if ( !(cr0 & CR0_PE) ) {
        printk("Protected mode disabled?\n\r");
        printk("CR0 = 0x%x\n\r", (unsigned)cr0);
        panic("System halted\n\r");
    }

    /* Turn on paging mode */
    if ( !(cr0 & CR0_PG) ) {
        cr0 |= CR0_PG;
        set_cr0(cr0);
        printk("Paging enabled: PBD @ 0x%x\n\r", (unsigned)cr3);
    }

Bootstrap the pmap layer:

    pmap_bootstrap((vaddr_t)heap);

Add free physical memory to the UVM page handling routines:

    /* Base memory goes into the 16M freelist */
    uvm_page_physload(atop(memsegs[0].m_base),
        atop(memsegs[0].m_base + memsegs[0].m_len),
        atop(memsegs[0].m_base),
        atop(memsegs[0].m_base + memsegs[0].m_len),
        VM_FREELIST_FIRST16);

    /*
     * Boot Modules get vm_page's but aren't marked as available. This
     * is so we can map out these pages to the actual boot servers
     * in bootproc().
     */
    uvm_page_physload(atop(start_boot_mods), atop(end_boot_mods),
        0, 0, VM_FREELIST_FIRST16);

    /* Split the rest of memory at 16M */
    seg_start = NKPTP * NBPD;
    seg_end = memsegs[1].m_base + memsegs[1].m_len;
    if (seg_start < 0x01000000 && seg_end > 0x01000000) {
        uvm_page_physload(atop(seg_start), atop(0x01000000),
            atop(seg_start), atop(0x01000000),
            VM_FREELIST_FIRST16);
        uvm_page_physload(atop(0x01000000), atop(seg_end),
            atop(0x01000000), atop(seg_end),
            VM_FREELIST_DEFAULT);
    } else {
        uvm_page_physload(atop(seg_start), atop(seg_end),
            atop(seg_start), atop(seg_end),
            VM_FREELIST_DEFAULT);
    }

Finally, initialize UVM:

    uvm_init();
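The 16M split in the physload step can be exercised on its own. The sketch below mirrors the branch structure of the excerpt, including the fact that a segment lying entirely below 16M still lands on the default freelist. The freelist enum, struct range and split_at_16m() are hypothetical stand-ins; nothing here calls uvm_page_physload(), and atop() is reimplemented locally under the assumption of 4 KiB pages.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define LIMIT_16M   0x01000000u

/* Hypothetical stand-ins for VM_FREELIST_DEFAULT / VM_FREELIST_FIRST16. */
enum freelist { FREELIST_DEFAULT, FREELIST_FIRST16 };

struct range {
    uint32_t start_pfn;   /* first page frame, inclusive */
    uint32_t end_pfn;     /* last page frame, exclusive */
    enum freelist list;
};

/* Address to page frame number, assuming 4 KiB pages. */
static uint32_t atop(uint32_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Fill 'out' with the ranges to load and return how many there are (1 or 2). */
static int split_at_16m(uint32_t seg_start, uint32_t seg_end, struct range out[2])
{
    assert(seg_start < seg_end);

    if (seg_start < LIMIT_16M && seg_end > LIMIT_16M) {
        /* Segment straddles 16M: low part first, then the remainder. */
        out[0] = (struct range){ atop(seg_start), atop(LIMIT_16M), FREELIST_FIRST16 };
        out[1] = (struct range){ atop(LIMIT_16M), atop(seg_end), FREELIST_DEFAULT };
        return 2;
    }

    /* Same fallback as the excerpt: everything goes to the default list. */
    out[0] = (struct range){ atop(seg_start), atop(seg_end), FREELIST_DEFAULT };
    return 1;
}
```

For example, a segment from 4MB to 64MB yields page frames 0x400 - 0x1000 on the 16M list and 0x1000 - 0x4000 on the default list.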
uvm_map.h
Maps are doubly-linked lists of map entries, kept sorted by address. A single hint is provided to start searches again from the last successful search, insertion, or removal.
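To illustrate how a hint helps on a sorted list, here is a minimal sketch. It is not the uvm_map_lookup_entry() implementation: struct entry, struct map and map_lookup() are simplified, hypothetical stand-ins, and the sketch only updates the hint on a successful search (the real code also updates it on insertion and removal).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a map entry covering [start, end). */
struct entry {
    struct entry *prev, *next;
    uint32_t start, end;
};

/* Simplified stand-in for a map: sorted list head plus one hint. */
struct map {
    struct entry *first;
    struct entry *hint;    /* last entry found, or NULL */
};

/* Return the entry containing 'addr', or NULL if it falls in a hole. */
static struct entry *map_lookup(struct map *m, uint32_t addr)
{
    struct entry *e = m->hint;

    /*
     * The list is sorted and entries do not overlap, so if the hint
     * starts at or below 'addr' nothing before it can contain 'addr'.
     * Otherwise fall back to a scan from the head of the list.
     */
    if (e == NULL || addr < e->start)
        e = m->first;

    for (; e != NULL && e->start <= addr; e = e->next) {
        if (addr < e->end) {
            m->hint = e;   /* remember where the search succeeded */
            return e;
        }
    }
    return NULL;
}
```

Lookups that move forward through the address space, which is the common pattern when walking a range, then cost a step or two instead of a scan from the head.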
Locking Protocol Notes
VM map locking is a little complicated. There are both shared and exclusive locks on maps. However, it is sometimes required to downgrade an exclusive lock to a shared lock, and upgrade to an exclusive lock again (to perform error recovery). In that window, another thread must not queue itself to receive an exclusive lock before we upgrade back to exclusive; otherwise the error recovery becomes extremely difficult, if not impossible.
In order to prevent this scenario, we introduce the notion of a busy map. A busy map is read-locked, but other threads attempting to write-lock wait for this flag to clear before entering the lock manager. A map may only be marked busy when the map is write-locked (and then the map must be downgraded to read-locked), and may only be marked unbusy by the thread which marked it busy (holding either a read-lock or a write-lock, the latter being gained by an upgrade).
Access to the map flags member is controlled by the flags_lock simple lock. Note that some flags are static (set once at map creation time, and never changed), so checking them requires no locking. All flags which are r/w must be set or cleared while flags_lock is asserted. Additional locking requirements are:
- VM_MAP_PAGEABLE - r/o static flag; no locking required
- VM_MAP_INTRSAFE - r/o static flag; no locking required
- VM_MAP_WIREFUTURE - r/w; may only be set or cleared when map is write-locked. May be tested without asserting flags_lock.
- VM_MAP_BUSY - r/w; may only be set when map is write-locked, may only be cleared by thread which set it, map read-locked or write-locked. Must be tested while flags_lock is asserted.
- VM_MAP_WANTLOCK - r/w; may only be set when the map is busy, and thread is attempting to write-lock. Must be tested while flags_lock is asserted.
- VM_MAP_DYING - r/o; set when a vmspace is being destroyed to indicate that updates to the pmap can be skipped.
- VM_MAP_TOPDOWN - r/o; set when the vmspace is created if the unspecified map allocations are to be arranged in a "top down" manner.
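The busy-map rules can be written down as a small state model. This is a single-threaded sketch of the rules as stated above, not a lock implementation: there is no sleeping or waking, the flag bit values are made up, and struct map_model and the model_*() functions are hypothetical names. In the real code the flags field would be touched only with flags_lock held.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag bit values are illustrative, not the ones in uvm_map.h. */
#define VM_MAP_BUSY     0x01
#define VM_MAP_WANTLOCK 0x02

enum lock_state { UNLOCKED, READ_LOCKED, WRITE_LOCKED };

struct map_model {
    enum lock_state lock;   /* state of the map lock itself */
    int flags;              /* r/w flags; guarded by flags_lock in the real code */
    int busy_owner;         /* id of the thread that set VM_MAP_BUSY, or -1 */
};

/*
 * Mark the map busy. Legal only while write-locked; the caller must then
 * downgrade to a read lock, which this model folds into the same step.
 */
static bool model_busy(struct map_model *m, int tid)
{
    if (m->lock != WRITE_LOCKED || (m->flags & VM_MAP_BUSY))
        return false;
    m->flags |= VM_MAP_BUSY;
    m->busy_owner = tid;
    m->lock = READ_LOCKED;
    return true;
}

/*
 * A thread asks for the write lock. While the map is busy it may not even
 * queue in the lock manager: it records VM_MAP_WANTLOCK and has to wait.
 */
static bool model_write_lock(struct map_model *m)
{
    if (m->flags & VM_MAP_BUSY) {
        m->flags |= VM_MAP_WANTLOCK;
        return false;
    }
    if (m->lock != UNLOCKED)
        return false;
    m->lock = WRITE_LOCKED;
    return true;
}

/*
 * Clear the busy flag. Only the thread that set it may do so, and it must
 * hold the map read- or write-locked. Returns true if waiters need a wakeup.
 */
static bool model_unbusy(struct map_model *m, int tid)
{
    assert((m->flags & VM_MAP_BUSY) && m->busy_owner == tid);
    assert(m->lock != UNLOCKED);

    bool wake = (m->flags & VM_MAP_WANTLOCK) != 0;
    m->flags &= ~(VM_MAP_BUSY | VM_MAP_WANTLOCK);
    m->busy_owner = -1;
    return wake;
}
```

The point of the model is the ordering: a would-be writer is turned away before it reaches the lock manager, so the busy thread's later upgrade cannot end up queued behind it.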
uvm_page.h - Resident memory system definitions.
Management of resident (logical) pages.
A small structure is kept for each resident page, indexed by page number. Each structure is an element of several lists:
- A hash table bucket used to quickly perform object/offset lookups
- A list of all pages for a given object, so they can be quickly deactivated at time of deallocation.
- An ordered list of pages due for pageout.
In addition, the structure contains the object and offset to which this page belongs (for pageout), and sundry status bits. Fields in this structure are locked either by the lock on the object that the page belongs to (O) or by the lock on the page queues (P) or both:

    struct vm_page {
        TAILQ_ENTRY(vm_page) pageq;     /* queue info for FIFO
                                         * queue or free list (P) */
        TAILQ_ENTRY(vm_page) hashq;     /* hash table links (O)*/
        TAILQ_ENTRY(vm_page) listq;     /* pages in same object (O)*/

        struct vm_anon *uanon;          /* anon (O,P) */
        struct uvm_object *uobject;     /* object (O,P) */
        voff_t offset;                  /* offset into object (O,P) */
        uint16_t flags;                 /* object flags [O] */
        uint16_t loan_count;            /* number of active loans
                                         * to read: [O or P]
                                         * to modify: [O _and_ P] */
        uint16_t wire_count;            /* wired down map refs [P] */
        uint16_t pqflags;               /* page queue flags [P] */
        paddr_t phys_addr;              /* physical address of page */

    #ifdef __HAVE_VM_PAGE_MD
        struct vm_page_md mdpage;       /* pmap-specific data */
    #endif

    #if defined(UVM_PAGE_TRKOWN)
        /* debugging fields to track page ownership */
        pid_t owner;                    /* proc that set PG_BUSY */
        char *owner_tag;                /* why it was set busy */
    #endif
    };

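The object/offset lookup through the hash links can be sketched in a few lines. This is only an illustration of the idea: it uses a singly linked chain instead of TAILQ, a made-up hash function and bucket count, and no locking, whereas the fields above are governed by the (O) and (P) locks. struct page, struct object, page_insert() and page_lookup() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS    64      /* illustrative; must be a power of two */
#define PAGE_SHIFT  12

struct object { int unused; };   /* stand-in for struct uvm_object */

/* Cut-down stand-in for struct vm_page: only what the lookup needs. */
struct page {
    struct page *hash_next;      /* plays the role of hashq */
    struct object *uobject;
    uint64_t offset;
};

static struct page *buckets[NBUCKETS];

/* Made-up hash: mix the object pointer with the page index of the offset. */
static size_t page_hash(const struct object *obj, uint64_t off)
{
    uintptr_t h = (uintptr_t)obj + (uintptr_t)(off >> PAGE_SHIFT);
    return (size_t)(h & (NBUCKETS - 1));
}

/* Link a page into the bucket for its (object, offset) pair. */
static void page_insert(struct page *pg)
{
    size_t i = page_hash(pg->uobject, pg->offset);
    pg->hash_next = buckets[i];
    buckets[i] = pg;
}

/* Find the resident page for (obj, off), or NULL if it is not resident. */
static struct page *page_lookup(const struct object *obj, uint64_t off)
{
    struct page *pg;

    for (pg = buckets[page_hash(obj, off)]; pg != NULL; pg = pg->hash_next) {
        if (pg->uobject == obj && pg->offset == off)
            return pg;
    }
    return NULL;
}
```

A NULL result here corresponds to the case where the page is not resident and would have to be brought in.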
comments:
licensing --amatus, Sat, 30 Apr 2005 17:03:55 -0500
It occurred to me that even comments are under the source file's license, so I'm obligated to point out that the reader should take a look at the aforementioned files to see the license.