14.7. File System I/O

Two distinct methods perform file system I/O:

- Memory mapped I/O, by means of the mmap() system call
- The read() and write() system calls
Both methods are implemented in a similar way: Pages of a file are mapped into an address space, and then paged I/O is performed on the pages within the mapped address space. Although it may be obvious that memory mapping is performed when we memory-map a file into a process's address space, it is less obvious that the read() and write() system calls also map a file before reading or writing it. The major differences between these two methods lie in where the file is mapped and who does the mapping: a process calls mmap() to map the file into its address space for memory mapped I/O, whereas the kernel maps the file into the kernel's address space for read and write. The two methods are contrasted in Figure 14.9.

Figure 14.9. The read()/write() vs. mmap() Methods for File I/O

14.7.1. Memory Mapped I/O

A request to memory-map a file into an address space is handled by the file system vnode method vop_map() and the seg_vn memory segment driver (see Section 14.7.4). A process requests that a file be mapped into its address space. Once the mapping is established, the address space represented by the file appears as regular memory, and the file system can perform I/O by simply accessing that memory.

Memory mapping of files hides the real work of reading and writing the file because the seg_vn memory segment driver quietly works with the file system to perform the I/O without the need for process-initiated system calls. I/O is performed, in units of pages, upon reference to the pages mapped into the address space: reads are initiated by a memory access; writes are initiated as the VM system finds dirty pages in the mapped address space.

The mmap() system call invokes the file system for the requested file with the vnode's vop_map() method. In turn, the file system calls the address space map function for the current address space, and the mapping is created. The protection flags passed into the mmap() system call are reduced to the subset allowed by the file permissions.
If mandatory locking is set for the file, then mmap() returns an error. Once the file mapping is created in the process's address space, file pages are read when a fault occurs in the address space. A fault occurs the first time a memory address within the mapped segment is accessed because, at this point, no physical page of memory is at that location. The memory management unit causes a hardware trap for that memory segment; the memory segment calls its fault function to handle the I/O for that address. The segvn_fault() routine handles a fault for a file mapping in a process address space and then calls the file system to read in the page for the faulted address, as shown below.

segvn_fault(hat, seg, addr, len, type, rw)
{
        for (page = all pages in region) {
                advise = lookup_advise(page);   /* Look up madvise settings for page */
                if (advise == MADV_SEQUENTIAL)
                        free_all_pages_up_to(page);
                /* Segvn will read at most 64k ahead */
                if (len > PVN_GETPAGE_SZ)
                        len = PVN_GETPAGE_SZ;
                vp = segvp(seg);
                vpoff = segoff(seg);
                /* Read 64k at a time if the next page is not in memory,
                 * else just a page */
                if (hat_probe(addr + PAGESIZE) == TRUE)
                        len = PAGESIZE;
                /* Ask the file system for the next 64k of pages */
                VOP_GETPAGE(vp, vp_off, len, &vpprot, plp, plsz,
                    seg, addr + (vp_off - off), arw, cred)
        }
}
                                                See usr/src/uts/common/vm/seg_vn.c

For each page fault, seg_vn reads in an 8-Kbyte page at the fault location. In addition, seg_vn initiates a read-ahead of the next eight pages at each 64-Kbyte boundary. Memory mapped read-ahead uses the file system cluster size (used by the read() and write() system calls) unless the segment is mapped MA_SHARED or memory advice MADV_RANDOM is set. Recall that you can provide paging advice to the pages within a memory mapped segment by using the madvise system call. As in the example above, the advice information is used to decide when to free behind as the file is read.
Modified pages remain unwritten to disk until the fsflush daemon passes over the page, at which point they are written out to disk. You can also use the memcntl() system call to initiate a synchronous or asynchronous write of pages.

14.7.2. read() and write() System Calls

The vnode's vop_read() and vop_write() methods implement reading and writing with the read() and write() system calls. As shown in Figure 14.10, during the read() and write() system calls the seg_map segment driver directly accesses a page by means of the seg_kpm mapping of the system's physical pages within the kernel's address space. The read and write system calls copy data to or from the process through a portion of the file that is mapped into the kernel's address space by seg_kpm. The seg_map driver maintains a cache of addresses between the vnode/offset and the virtual address where the page is mapped.

Figure 14.10. File System Data Movement with seg_map/seg_kpm

14.7.3. The seg_kpm Driver

The seg_kpm driver provides a fast mapping for physical pages within the kernel's address space. It is used by file systems to provide a virtual address when copying data to and from the user's address space for file system I/O. The seg_kpm mapping facility is new in Solaris 10. Since the available virtual address range in a 64-bit kernel is always larger than the physical memory size, the entire physical memory can be mapped into the kernel. This eliminates the need to map and unmap pages every time they are accessed through segmap, significantly reducing the code path and the need for TLB shoot-downs. In addition, seg_kpm can use large TLB mappings to minimize TLB miss overhead.

14.7.4. The seg_map Driver

The seg_map driver maintains the relationship between pieces of files and the kernel address space and is used only by the file systems.
Every time a read or write system call occurs, the seg_map segment driver locates the virtual address space where the page of the file can be mapped. The system call can then copy the data to or from the user address space.

The seg_map segment provides a full set of segment driver interfaces (see Section 9.5); however, the file system directly uses a small subset of these interfaces without going through the generic segment interface. This subset handles the bulk of the work that is done by the seg_map segment for file read and write operations. The functions used by the file systems are shown on page 714.

The seg_map segment driver divides the segment into block-sized slots that represent blocks in the files it maps. The seg_map block size for the Solaris kernel is 8,192 bytes. A 128-Mbyte segkmap segment would, for example, be divided into 128 MB / 8 KB slots, or 16,384 slots. The seg_map segment driver maintains a hash list of its page mappings, keyed on file and offset, so that it can easily locate existing blocks. One list entry exists for each slot in the segkmap segment. The structure for each slot in a seg_map segment is defined in the <vm/segmap.h> header file, shown below.

/*
 * Machine independent per instance kpm mapping structure
 */
struct kpme {
        struct kpme     *kpe_next;
        struct kpme     *kpe_prev;
        struct page     *kpe_page;      /* back pointer to (start) page */
};
                                                See usr/src/uts/common/vm/kpm.h

/*
 * Each smap struct represents a MAXBSIZE sized mapping to the
 * <sm_vp, sm_off> given in the structure. The location of the
 * structure in the array gives the virtual address of the
 * mapping. Structure rearranged for 64bit sm_off.
 */
struct smap {
        kmutex_t        sm_mtx;         /* protect non-list fields */
        struct vnode    *sm_vp;         /* vnode pointer (if mapped) */
        struct smap     *sm_hash;       /* hash pointer */
        struct smap     *sm_next;       /* next pointer */
        struct smap     *sm_prev;       /* previous pointer */
        u_offset_t      sm_off;         /* file offset for mapping */
        ushort_t        sm_bitmap;      /* bit map for locked translations */
        ushort_t        sm_refcnt;      /* reference count for uses */
        ushort_t        sm_flags;       /* smap flags */
        ushort_t        sm_free_ndx;    /* freelist */
#ifdef SEGKPM_SUPPORT
        struct kpme     sm_kpme;        /* segkpm */
#endif
};
                                                See usr/src/uts/common/vm/segmap.h
The important fields in the smap structure are the file and offset fields, sm_vp and sm_off. These fields identify which page of a file is represented by each slot in the segment. An example of the interaction between a file system read and segmap is shown in Figure 14.11.

Figure 14.11. vop_read() segmap Interaction
A read system call invokes the file-system-dependent vop_read() method. The vop_read() method calls into the seg_map segment with the segmap_getmapflt() function to locate a virtual address in the kernel address space, via segkpm, for the file and offset requested. The seg_map driver determines whether it already has a slot for the page of the file at the given offset by looking into its hashed list of mapping slots. Once a slot is located or created, an address for the page is located, and segmap then calls back into the file system with vop_getpage() to soft-initiate a page fault to read in a page at the virtual address of the seg_map slot. While the segmap_getmapflt() routine is still running, the page fault is initiated by a call to segmap_fault(), which in turn calls back into the file system with vop_getpage().

The file system's vop_getpage() routine handles the task of bringing the requested range of the file (vnode, offset, and length) from disk into the virtual address and length passed into the vop_getpage() function. Once the page is read by the file system, the requested range is copied back to the user by the uiomove() function. Then, the file system releases the slot associated with that block of the file with the segmap_release() function. At this point, the slot is not removed from the segment, because we may need the same file and offset later (effectively caching the virtual address location); instead, it is added onto a seg_map free list so that it can be reclaimed or reused later.

Writing is a similar process. Again, segmap_getmap() is called to retrieve or create a mapping for the file and offset, the I/O is done, and the segmap slot is released. An additional step is involved if the file is being extended or a new page is being created within a hole of a file. This additional step calls the segmap_pagecreate() function to create and lock the new pages, then calls segmap_pageunlock() to unlock the pages that were locked during the page_create().
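The read path just described can be condensed into a simplified sketch. This is illustrative kernel-style pseudocode loosely modeled on the ufs_read()/segmap interaction, not the actual implementation; error handling and the segmap_fault() callback are elided, and MAXBOFFSET is MAXBSIZE - 1.

```c
/* Simplified, hypothetical sketch of a vop_read() implementation
 * using the segmap interfaces described above. */
static int
sketch_read(struct vnode *vp, struct uio *uio)
{
        while (uio->uio_resid > 0) {
                u_offset_t blkoff = uio->uio_loffset & ~(u_offset_t)MAXBOFFSET;
                u_offset_t off    = uio->uio_loffset &  (u_offset_t)MAXBOFFSET;
                size_t n = MIN(MAXBSIZE - off, uio->uio_resid);

                /* Locate or create the kernel mapping for this file
                 * block; this may call back into VOP_GETPAGE() to
                 * fault the page in from disk. */
                caddr_t base = segmap_getmapflt(segkmap, vp, blkoff,
                    MAXBSIZE, 1, S_READ);

                /* Copy from the kernel mapping to the user's buffer */
                int error = uiomove(base + off, n, UIO_READ, uio);

                /* Put the slot on the free list; its vnode/offset
                 * identity is kept so a later read can reclaim it. */
                error = segmap_release(segkmap, base, SM_FREE);
                if (error)
                        return (error);
        }
        return (0);
}
```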
The key segmap functions are shown below.

caddr_t segmap_getmapflt(struct seg *seg, struct vnode *vp, u_offset_t off,
    size_t len, int forcefault, enum seg_rw rw);

        Retrieves an address in the kernel's address space for a range of
        the file at the given offset and length. segmap_getmap() allocates
        a MAXBSIZE-sized slot to map the vnode vp in the range
        <off, off + len). off doesn't need to be MAXBSIZE aligned. The
        return address is always MAXBSIZE aligned. If forcefault is
        nonzero and the MMU translations haven't yet been created,
        segmap_getmap() calls segmap_fault(..., F_INVAL, rw) to create
        them.

int segmap_release(struct seg *seg, caddr_t addr, uint_t flags);

        Releases the mapping for a given file at a given address.

int segmap_pagecreate(struct seg *seg, caddr_t addr, size_t len, int softlock);

        Creates new page(s) of memory and slots in the seg_map segment for
        a given file. Used for extending files or writing to holes during
        a write. This function creates pages (without using VOP_GETPAGE)
        and loads up translations to them. If softlock is TRUE, then it
        sets things up so that it looks like a call to segmap_fault() with
        F_SOFTLOCK. Returns 1 if a page is created by calling
        page_create_va(), or 0 otherwise. All fields in the generic
        segment (struct seg) are considered to be read-only for "segmap",
        even though the kernel address space (kas) may not be locked;
        hence, no lock is needed to access them.

void segmap_pageunlock(struct seg *seg, caddr_t addr, size_t len,
    enum seg_rw rw);

        Unlocks pages in the segment that were locked during
        segmap_pagecreate().

                                                See usr/src/uts/common/vm/segmap.h

We can observe the seg_map slot activity with the kstat statistics that are collected for the seg_map segment driver. These statistics are visible with the kstat command, as shown below.
sol10$ kstat -n segmap
module: unix                            instance: 0
name:   segmap                          class:    vm
        crtime                          42.268896913
        fault                           352197
        faulta                          0
        free                            1123987
        free_dirty                      50836
        free_notfree                    2073
        get_nofree                      0
        get_nomtx                       0
        get_reclaim                     5644590
        get_reuse                       1356990
        get_unused                      0
        get_use                         386
        getmap                          7005644
        pagecreate                      1375991
        rel_abort                       0
        rel_async                       291640
        rel_dontneed                    291640
        rel_free                        7054
        rel_write                       304570
        release                         6694020
        snaptime                        1177936.33212098
        stolen                          0

Table 14.5 describes the segmap statistics.
14.7.5. Interaction between segmap and segkpm

The following three examples show the code flow through the file system into segmap for three important cases:
Hit in page cache and segmap:

  -> ufs_read                   read() entry point into UFS
    -> segmap_getmapflt         Locate the segmap slot for the vnode/off
      -> hat_kpm_page2va        Identify the virtual address for the vnode/off
      <- hat_kpm_page2va
    <- segmap_getmapflt
    -> uiomove                  Copy the data from the segkpm address to userland
    <- uiomove
    -> segmap_release           Release the segmap slot
      -> hat_kpm_vaddr2page     Locate the page by looking up its address
      <- hat_kpm_vaddr2page
      -> segmap_smapadd         Add the segmap slot to the reuse pool
      <- segmap_smapadd
    <- segmap_release
  <- ufs_read

                                        See examples/segkpm.d

Hit in page cache, miss in segmap:

  -> ufs_read                   read() entry point into UFS
    -> segmap_getmapflt         Locate the segmap slot for the vnode/off
      -> get_free_smp           Find a segmap slot that can be reused
        -> grab_smp             Flush out the old segmap slot identity
          -> segmap_hashout
          <- segmap_hashout
          -> hat_kpm_page2va    Identify the virtual address for the vnode/off
          <- hat_kpm_page2va
        <- grab_smp
        -> segmap_pagefree      Put the page back on the cachelist
        <- segmap_pagefree
      <- get_free_smp
      -> segmap_hashin          Set up the segmap slot for the new vnode/off
      <- segmap_hashin
      -> segkpm_create_va       Create a virtual address for this vnode/off
      <- segkpm_create_va
      -> ufs_getpage            Find the page already in the page cache
      <- ufs_getpage
      -> hat_kpm_mapin          Reuse a mapping for the page in segkpm
      <- hat_kpm_mapin
    <- segmap_getmapflt
    -> uiomove                  Copy the data from the segkpm address to userland
    <- uiomove
    -> segmap_release           Add the segmap slot to the reuse pool
      -> hat_kpm_vaddr2page
      <- hat_kpm_vaddr2page
      -> segmap_smapadd
      <- segmap_smapadd
    <- segmap_release
  <- ufs_read

                                        See examples/segkpm.d

Miss in page cache, miss in segmap:

  -> ufs_read                   read() entry point into UFS
    -> segmap_getmapflt         Locate the segmap slot for the vnode/off
      -> get_free_smp           Find a segmap slot that can be reused
        -> grab_smp             Flush out the old segmap slot identity
          -> segmap_hashout
          <- segmap_hashout
          -> hat_kpm_page2va    Identify the virtual address for the vnode/off
          <- hat_kpm_page2va
          -> hat_kpm_mapout     Unmap the old slot's page(s)
          <- hat_kpm_mapout
        <- grab_smp
        -> segmap_pagefree
        <- segmap_pagefree
      <- get_free_smp
      -> segmap_hashin          Set up the segmap slot for the new vnode/off
      <- segmap_hashin
      -> segkpm_create_va       Create a virtual address for this vnode/off
      <- segkpm_create_va
      -> ufs_getpage            Call the file system getpage() to read in the page
        -> bdev_strategy        Initiate the physical read
        <- bdev_strategy
      <- ufs_getpage
      -> hat_kpm_mapin          Create a mapping for the page in segkpm
        -> sfmmu_kpm_mapin
          -> sfmmu_kpm_getvaddr
          <- sfmmu_kpm_getvaddr
        <- sfmmu_kpm_mapin
        -> sfmmu_kpme_lookup
        <- sfmmu_kpme_lookup
        -> sfmmu_kpme_add
        <- sfmmu_kpme_add
      <- hat_kpm_mapin
    <- segmap_getmapflt
    -> uiomove                  Copy the data from the segkpm address to userland
    <- uiomove
    -> segmap_release           Add the segmap slot to the reuse pool
      -> get_smap_kpm
        -> hat_kpm_vaddr2page
        <- hat_kpm_vaddr2page
      <- get_smap_kpm
      -> segmap_smapadd
      <- segmap_smapadd
    <- segmap_release
  <- ufs_read

                                        See examples/segkpm.d