14.7. File System I/O

Two distinct methods perform file system I/O:

- Memory mapped I/O, by means of the mmap() system call
- The read() and write() system calls
Both methods are implemented in a similar way: Pages of a file are mapped into an address space, and then paged I/O is performed on the pages within the mapped address space. Although it may be obvious that memory mapping is performed when we memory-map a file into a process's address space, it is less obvious that the read() and write() system calls also map a file before reading or writing it. The major differences between these two methods lie in where the file is mapped and who does the mapping: a process calls mmap() to map the file into its address space for memory mapped I/O, whereas the kernel maps the file into the kernel's address space for read and write. The two methods are contrasted in Figure 14.9.

Figure 14.9. The read()/write() vs. mmap() Methods for File I/O

14.7.1. Memory Mapped I/O

A request to memory-map a file into an address space is handled by the file system vnode method vop_map() and the seg_vn memory segment driver (see Section 14.7.4). A process requests that a file be mapped into its address space. Once the mapping is established, the address space represented by the file appears as regular memory, and the file system can perform I/O by simply accessing that memory.

Memory mapping of files hides the real work of reading and writing the file because the seg_vn memory segment driver quietly works with the file system to perform the I/O without the need for process-initiated system calls. I/O is performed, in units of pages, upon reference to the pages mapped into the address space: reads are initiated by a memory access; writes are initiated as the VM system finds dirty pages in the mapped address space.

The mmap() system call invokes the file system for the requested file with the vnode's vop_map() method. In turn, the file system calls the address space map function for the current address space, and the mapping is created. The protection flags passed into the mmap() system call are reduced to the subset allowed by the file permissions.
If mandatory locking is set for the file, then mmap() returns an error. Once the file mapping is created in the process's address space, file pages are read when a fault occurs in the address space. A fault occurs the first time a memory address within the mapped segment is accessed because, at this point, no physical page of memory is at that location. The memory management unit causes a hardware trap for that memory segment; the memory segment calls its fault function to handle the I/O for that address. The segvn_fault() routine handles a fault for a file mapping in a process address space and then calls the file system to read in the page for the faulted address, as shown below.

segvn_fault(hat, seg, addr, len, type, rw)
{
        for (page = all pages in region) {
                advise = lookup_advise(page);   /* Look up madvise settings for page */
                if (advise == MADV_SEQUENTIAL)
                        free_all_pages_up_to(page);
                /* Segvn will read at most 64k ahead */
                if (len > PVN_GETPAGE_SZ)
                        len = PVN_GETPAGE_SZ;
                vp = segvp(seg);
                vpoff = segoff(seg);
                /* Read 64k at a time if the next page is not in memory,
                 * else just a page */
                if (hat_probe(addr + PAGESIZE) == TRUE)
                        len = PAGESIZE;
                /* Ask the file system for the next 64k of pages */
                VOP_GETPAGE(vp, vp_off, len, &vpprot, plp, plsz,
                    seg, addr + (vp_off - off), arw, cred)
        }
}
                                                See usr/src/uts/common/vm/seg_vn.c

For each page fault, seg_vn reads in an 8-Kbyte page at the fault location. In addition, seg_vn initiates a read-ahead of the next eight pages at each 64-Kbyte boundary. Memory mapped read-ahead uses the file system cluster size (used by the read() and write() system calls) unless the segment is mapped MA_SHARED or memory advice MADV_RANDOM is set. Recall that you can provide paging advice to the pages within a memory mapped segment by using the madvise system call. As in the example above, the advice information is used to decide when to free behind as the file is read.
Modified pages remain unwritten to disk until the fsflush daemon passes over the page, at which point they are written out to disk. You can also use the memcntl() system call to initiate a synchronous or asynchronous write of pages.

14.7.2. read() and write() System Calls

The vnode's vop_read() and vop_write() methods implement reading and writing with the read() and write() system calls. As shown in Figure 14.10, during the read() and write() system calls the seg_map segment driver directly accesses a page by means of the seg_kpm mapping of the system's physical pages within the kernel's address space. The read and write system calls copy data to or from the process through a portion of the file that is mapped into the kernel's address space by seg_kpm. The seg_map driver maintains a cache of addresses between the vnode/offset and the virtual address where the page is mapped.

Figure 14.10. File System Data Movement with seg_map/seg_kpm

14.7.3. The seg_kpm Driver

The seg_kpm driver provides a fast mapping for physical pages within the kernel's address space. It is used by file systems to provide a virtual address when copying data to and from the user's address space for file system I/O. The seg_kpm mapping facility is new in Solaris 10. Since the available virtual address range in a 64-bit kernel is always larger than the physical memory size, the entire physical memory can be mapped into the kernel. This eliminates the need to map and unmap pages every time they are accessed through segmap, significantly reducing the code path and the need for TLB shoot-downs. In addition, seg_kpm can use large TLB mappings to minimize TLB miss overhead.

14.7.4. The seg_map Driver

The seg_map driver maintains the relationship between pieces of files and the kernel address space and is used only by the file systems.
Every time a read or write system call occurs, the seg_map segment driver locates the virtual address space where the page of the file can be mapped. The system call can then copy the data to or from the user address space.

The seg_map segment provides a full set of segment driver interfaces (see Section 9.5); however, the file system directly uses a small subset of these interfaces without going through the generic segment interface. This subset handles the bulk of the work that is done by the seg_map segment for file read and write operations. The functions used by the file systems are shown on page 714.

The seg_map segment driver divides the segment into block-sized slots that represent blocks in the files it maps. The seg_map block size for the Solaris kernel is 8,192 bytes. A 128-Mbyte segkmap segment would, for example, be divided into 128 MB / 8 KB slots, or 16,384 slots. The seg_map segment driver maintains a hash list of its page mappings, keyed on file and offset, so that it can easily locate existing blocks. One list entry exists for each slot in the segkmap segment. The structure for each slot in a seg_map segment is defined in the <vm/segmap.h> header file, shown below.

/*
 * Machine independent per instance kpm mapping structure
 */
struct kpme {
        struct kpme     *kpe_next;
        struct kpme     *kpe_prev;
        struct page     *kpe_page;      /* back pointer to (start) page */
};
                                                See usr/src/uts/common/vm/kpm.h

/*
 * Each smap struct represents a MAXBSIZE sized mapping to the
 * <sm_vp, sm_off> given in the structure. The location of the
 * structure in the array gives the virtual address of the
 * mapping. Structure rearranged for 64bit sm_off.
 */
struct smap {
        kmutex_t        sm_mtx;         /* protect non-list fields */
        struct vnode    *sm_vp;         /* vnode pointer (if mapped) */
        struct smap     *sm_hash;       /* hash pointer */
        struct smap     *sm_next;       /* next pointer */
        struct smap     *sm_prev;       /* previous pointer */
        u_offset_t      sm_off;         /* file offset for mapping */
        ushort_t        sm_bitmap;      /* bit map for locked translations */
        ushort_t        sm_refcnt;      /* reference count for uses */
        ushort_t        sm_flags;       /* smap flags */
        ushort_t        sm_free_ndx;    /* freelist */
#ifdef SEGKPM_SUPPORT
        struct kpme     sm_kpme;        /* segkpm */
#endif
};
                                                See usr/src/uts/common/vm/segmap.h
The important fields in the smap structure are the file and offset fields, sm_vp and sm_off. These fields identify which page of a file is represented by each slot in the segment. An example of the interaction between a file system read and segmap is shown in Figure 14.11.

Figure 14.11. vop_read() segmap Interaction
A read system call invokes the file-system-dependent vop_read() method. The vop_read() method calls into the seg_map segment with the segmap_getmapflt() function to locate a virtual address in the kernel address space, via segkpm, for the file and offset requested. The seg_map driver determines whether it already has a slot for the page of the file at the given offset by looking into its hashed list of mapping slots. Once a slot is located or created, an address for the page is located, and segmap then calls back into the file system with vop_getpage() to soft-initiate a page fault to read in a page at the virtual address of the seg_map slot. While the segmap_getmapflt() routine is still running, the page fault is initiated by a call to segmap_fault(), which in turn calls back into the file system with vop_getpage().

The file system's vop_getpage() routine handles the task of bringing the requested range of the file (vnode, offset, and length) from disk into the virtual address and length passed into the vop_getpage() function. Once the page is read by the file system, the requested range is copied back to the user by the uiomove() function. Then, the file system releases the slot associated with that block of the file with the segmap_release() function. At this point, the slot is not removed from the segment, because we may need the same file and offset later (effectively caching the virtual address location); instead, it is added onto a seg_map free list so that it can be reclaimed or reused later.

Writing is a similar process. Again, segmap_getmap() is called to retrieve or create a mapping for the file and offset, the I/O is done, and the segmap slot is released. An additional step is involved if the file is being extended or a new page is being created within a hole of a file. This additional step calls the segmap_pagecreate() function to create and lock the new pages, then calls segmap_pageunlock() to unlock the pages that were locked during the page_create().
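The read path just described can be condensed into a simplified sketch. This is illustrative kernel-style pseudocode loosely modeled on the ufs_read()/segmap interaction, not the actual implementation; error handling and the segmap_fault() callback are elided, and MAXBOFFSET is MAXBSIZE - 1.

```c
/* Simplified, hypothetical sketch of a vop_read() implementation
 * using the segmap interfaces described above. */
static int
sketch_read(struct vnode *vp, struct uio *uio)
{
        while (uio->uio_resid > 0) {
                u_offset_t blkoff = uio->uio_loffset & ~(u_offset_t)MAXBOFFSET;
                u_offset_t off    = uio->uio_loffset &  (u_offset_t)MAXBOFFSET;
                size_t n = MIN(MAXBSIZE - off, uio->uio_resid);

                /* Locate or create the kernel mapping for this file
                 * block; this may call back into VOP_GETPAGE() to
                 * fault the page in from disk. */
                caddr_t base = segmap_getmapflt(segkmap, vp, blkoff,
                    MAXBSIZE, 1, S_READ);

                /* Copy from the kernel mapping to the user's buffer */
                int error = uiomove(base + off, n, UIO_READ, uio);

                /* Put the slot on the free list; its vnode/offset
                 * identity is kept so a later read can reclaim it. */
                error = segmap_release(segkmap, base, SM_FREE);
                if (error)
                        return (error);
        }
        return (0);
}
```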
The key segmap functions are shown below.

caddr_t segmap_getmapflt(struct seg *seg, struct vnode *vp, u_offset_t off,
    size_t len, int forcefault, enum seg_rw rw);

        Retrieves an address in the kernel's address space for a range of
        the file at the given offset and length. segmap_getmap() allocates
        a MAXBSIZE-sized slot to map the vnode vp in the range
        <off, off + len). off doesn't need to be MAXBSIZE aligned. The
        return address is always MAXBSIZE aligned. If forcefault is
        nonzero and the MMU translations haven't yet been created,
        segmap_getmap() calls segmap_fault(..., F_INVAL, rw) to create
        them.

int segmap_release(struct seg *seg, caddr_t addr, uint_t flags);

        Releases the mapping for a given file at a given address.

int segmap_pagecreate(struct seg *seg, caddr_t addr, size_t len, int softlock);

        Creates new page(s) of memory and slots in the seg_map segment for
        a given file. Used for extending files or writing to holes during
        a write. This function creates pages (without using VOP_GETPAGE)
        and loads up translations to them. If softlock is TRUE, then it
        sets things up so that it looks like a call to segmap_fault() with
        F_SOFTLOCK. Returns 1 if a page is created by calling
        page_create_va(), or 0 otherwise. All fields in the generic
        segment (struct seg) are considered to be read-only for "segmap",
        even though the kernel address space (kas) may not be locked;
        hence, no lock is needed to access them.

void segmap_pageunlock(struct seg *seg, caddr_t addr, size_t len,
    enum seg_rw rw);

        Unlocks pages in the segment that were locked during
        segmap_pagecreate().

                                                See usr/src/uts/common/vm/segmap.h

We can observe the seg_map slot activity with the kstat statistics that are collected for the seg_map segment driver. These statistics are visible with the kstat command, as shown below.
sol10$ kstat -n segmap
module: unix                            instance: 0
name:   segmap                          class:    vm
        crtime                          42.268896913
        fault                           352197
        faulta                          0
        free                            1123987
        free_dirty                      50836
        free_notfree                    2073
        get_nofree                      0
        get_nomtx                       0
        get_reclaim                     5644590
        get_reuse                       1356990
        get_unused                      0
        get_use                         386
        getmap                          7005644
        pagecreate                      1375991
        rel_abort                       0
        rel_async                       291640
        rel_dontneed                    291640
        rel_free                        7054
        rel_write                       304570
        release                         6694020
        snaptime                        1177936.33212098
        stolen                          0

Table 14.5 describes the segmap statistics.
14.7.5. Interaction between segmap and segkpm

The following three examples show the code flow through the file system into segmap for three important cases:
Hit in page cache and segmap:

  -> ufs_read                   read() entry point into UFS
    -> segmap_getmapflt         Locate the segmap slot for the vnode/off
      -> hat_kpm_page2va        Identify the virtual address for the vnode/off
      <- hat_kpm_page2va
    <- segmap_getmapflt
    -> uiomove                  Copy the data from the segkpm address to userland
    <- uiomove
    -> segmap_release           Release the segmap slot
      -> hat_kpm_vaddr2page     Locate the page by looking up its address
      <- hat_kpm_vaddr2page
      -> segmap_smapadd         Add the segmap slot to the reuse pool
      <- segmap_smapadd
    <- segmap_release
  <- ufs_read

                                        See examples/segkpm.d

Hit in page cache, miss in segmap:

  -> ufs_read                   read() entry point into UFS
    -> segmap_getmapflt         Locate the segmap slot for the vnode/off
      -> get_free_smp           Find a segmap slot that can be reused
        -> grab_smp             Flush out the old segmap slot identity
          -> segmap_hashout
          <- segmap_hashout
          -> hat_kpm_page2va    Identify the virtual address for the vnode/off
          <- hat_kpm_page2va
        <- grab_smp
        -> segmap_pagefree      Put the page back on the cachelist
        <- segmap_pagefree
      <- get_free_smp
      -> segmap_hashin          Set up the segmap slot for the new vnode/off
      <- segmap_hashin
      -> segkpm_create_va       Create a virtual address for this vnode/off
      <- segkpm_create_va
      -> ufs_getpage            Find the page already in the page cache
      <- ufs_getpage
      -> hat_kpm_mapin          Reuse a mapping for the page in segkpm
      <- hat_kpm_mapin
    <- segmap_getmapflt
    -> uiomove                  Copy the data from the segkpm address to userland
    <- uiomove
    -> segmap_release           Add the segmap slot to the reuse pool
      -> hat_kpm_vaddr2page
      <- hat_kpm_vaddr2page
      -> segmap_smapadd
      <- segmap_smapadd
    <- segmap_release
  <- ufs_read

                                        See examples/segkpm.d

Miss in page cache, miss in segmap:

  -> ufs_read                   read() entry point into UFS
    -> segmap_getmapflt         Locate the segmap slot for the vnode/off
      -> get_free_smp           Find a segmap slot that can be reused
        -> grab_smp             Flush out the old segmap slot identity
          -> segmap_hashout
          <- segmap_hashout
          -> hat_kpm_page2va    Identify the virtual address for the vnode/off
          <- hat_kpm_page2va
          -> hat_kpm_mapout     Unmap the old slot's page(s)
          <- hat_kpm_mapout
        <- grab_smp
        -> segmap_pagefree
        <- segmap_pagefree
      <- get_free_smp
      -> segmap_hashin          Set up the segmap slot for the new vnode/off
      <- segmap_hashin
      -> segkpm_create_va       Create a virtual address for this vnode/off
      <- segkpm_create_va
      -> ufs_getpage            Call the file system getpage() to read in the page
        -> bdev_strategy        Initiate the physical read
        <- bdev_strategy
      <- ufs_getpage
      -> hat_kpm_mapin          Create a mapping for the page in segkpm
        -> sfmmu_kpm_mapin
          -> sfmmu_kpm_getvaddr
          <- sfmmu_kpm_getvaddr
        <- sfmmu_kpm_mapin
        -> sfmmu_kpme_lookup
        <- sfmmu_kpme_lookup
        -> sfmmu_kpme_add
        <- sfmmu_kpme_add
      <- hat_kpm_mapin
    <- segmap_getmapflt
    -> uiomove                  Copy the data from the segkpm address to userland
    <- uiomove
    -> segmap_release           Add the segmap slot to the reuse pool
      -> get_smap_kpm
        -> hat_kpm_vaddr2page
        <- hat_kpm_vaddr2page
      <- get_smap_kpm
      -> segmap_smapadd
      <- segmap_smapadd
    <- segmap_release
  <- ufs_read

                                        See examples/segkpm.d