diff options
Diffstat (limited to 'Documentation/filesystems/iomap/design.rst')
-rw-r--r-- | Documentation/filesystems/iomap/design.rst | 441 |
1 files changed, 441 insertions, 0 deletions
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst new file mode 100644 index 000000000000..f8ee3427bc1a --- /dev/null +++ b/Documentation/filesystems/iomap/design.rst @@ -0,0 +1,441 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. _iomap_design: + +.. + Dumb style notes to maintain the author's sanity: + Please try to start sentences on separate lines so that + sentence changes don't bleed colors in diff. + Heading decorations are documented in sphinx.rst. + +============== +Library Design +============== + +.. contents:: Table of Contents + :local: + +Introduction +============ + +iomap is a filesystem library for handling common file operations. +The library has two layers: + + 1. A lower layer that provides an iterator over ranges of file offsets. + This layer tries to obtain mappings of each file ranges to storage + from the filesystem, but the storage information is not necessarily + required. + + 2. An upper layer that acts upon the space mappings provided by the + lower layer iterator. + +The iteration can involve mappings of file's logical offset ranges to +physical extents, but the storage layer information is not necessarily +required, e.g. for walking cached file information. +The library exports various APIs for implementing file operations such +as: + + * Pagecache reads and writes + * Folio write faults to the pagecache + * Writeback of dirty folios + * Direct I/O reads and writes + * fsdax I/O reads, writes, loads, and stores + * FIEMAP + * lseek ``SEEK_DATA`` and ``SEEK_HOLE`` + * swapfile activation + +This origins of this library is the file I/O path that XFS once used; it +has now been extended to cover several other operations. + +Who Should Read This? +===================== + +The target audience for this document are filesystem, storage, and +pagecache programmers and code reviewers. + +If you are working on PCI, machine architectures, or device drivers, you +are most likely in the wrong place. + +How Is This Better? +=================== + +Unlike the classic Linux I/O model which breaks file I/O into small +units (generally memory pages or blocks) and looks up space mappings on +the basis of that unit, the iomap model asks the filesystem for the +largest space mappings that it can create for a given file operation and +initiates operations on that basis. +This strategy improves the filesystem's visibility into the size of the +operation being performed, which enables it to combat fragmentation with +larger space allocations when possible. +Larger space mappings improve runtime performance by amortizing the cost +of mapping function calls into the filesystem across a larger amount of +data. + +At a high level, an iomap operation `looks like this +<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_: + +1. For each byte in the operation range... + + 1. Obtain a space mapping via ``->iomap_begin`` + + 2. For each sub-unit of work... + + 1. Revalidate the mapping and go back to (1) above, if necessary. + So far only the pagecache operations need to do this. + + 2. Do the work + + 3. Increment operation cursor + + 4. Release the mapping via ``->iomap_end``, if necessary + +Each iomap operation will be covered in more detail below. +This library was covered previously by an `LWN article +<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page +<https://kernelnewbies.org/KernelProjects/iomap>`_. + +The goal of this document is to provide a brief discussion of the +design and capabilities of iomap, followed by a more detailed catalog +of the interfaces presented by iomap. +If you change iomap, please update this design document. + +File Range Iterator +=================== + +Definitions +----------- + + * **buffer head**: Shattered remnants of the old buffer cache. + + * ``fsblock``: The block size of a file, also known as ``i_blocksize``. + + * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore. + Processes hold this in shared mode to read file state and contents. + Some filesystems may allow shared mode for writes. + Processes often hold this in exclusive mode to change file state and + contents. + + * ``invalidate_lock``: The pagecache ``struct address_space`` + rwsemaphore that protects against folio insertion and removal for + filesystems that support punching out folios below EOF. + Processes wishing to insert folios must hold this lock in shared + mode to prevent removal, though concurrent insertion is allowed. + Processes wishing to remove folios must hold this lock in exclusive + mode to prevent insertions. + Concurrent removals are not allowed. + + * ``dax_read_lock``: The RCU read lock that dax takes to prevent a + device pre-shutdown hook from returning before other threads have + released resources. + + * **filesystem mapping lock**: This synchronization primitive is + internal to the filesystem and must protect the file mapping data + from updates while a mapping is being sampled. + The filesystem author must determine how this coordination should + happen; it does not need to be an actual lock. + + * **iomap internal operation lock**: This is a general term for + synchronization primitives that iomap functions take while holding a + mapping. + A specific example would be taking the folio lock while reading or + writing the pagecache. + + * **pure overwrite**: A write operation that does not require any + metadata or zeroing operations to perform during either submission + or completion. + This implies that the fileystem must have already allocated space + on disk as ``IOMAP_MAPPED`` and the filesystem must not place any + constaints on IO alignment or size. + The only constraints on I/O alignment are device level (minimum I/O + size and alignment, typically sector size). + +``struct iomap`` +---------------- + +The filesystem communicates to the iomap iterator the mapping of +byte ranges of a file to byte ranges of a storage device with the +structure below: + +.. code-block:: c + + struct iomap { + u64 addr; + loff_t offset; + u64 length; + u16 type; + u16 flags; + struct block_device *bdev; + struct dax_device *dax_dev; + voidw *inline_data; + void *private; + const struct iomap_folio_ops *folio_ops; + u64 validity_cookie; + }; + +The fields are as follows: + + * ``offset`` and ``length`` describe the range of file offsets, in + bytes, covered by this mapping. + These fields must always be set by the filesystem. + + * ``type`` describes the type of the space mapping: + + * **IOMAP_HOLE**: No storage has been allocated. + This type must never be returned in response to an ``IOMAP_WRITE`` + operation because writes must allocate and map space, and return + the mapping. + The ``addr`` field must be set to ``IOMAP_NULL_ADDR``. + iomap does not support writing (whether via pagecache or direct + I/O) to a hole. + + * **IOMAP_DELALLOC**: A promise to allocate space at a later time + ("delayed allocation"). + If the filesystem returns IOMAP_F_NEW here and the write fails, the + ``->iomap_end`` function must delete the reservation. + The ``addr`` field must be set to ``IOMAP_NULL_ADDR``. + + * **IOMAP_MAPPED**: The file range maps to specific space on the + storage device. + The device is returned in ``bdev`` or ``dax_dev``. + The device address, in bytes, is returned via ``addr``. + + * **IOMAP_UNWRITTEN**: The file range maps to specific space on the + storage device, but the space has not yet been initialized. + The device is returned in ``bdev`` or ``dax_dev``. + The device address, in bytes, is returned via ``addr``. + Reads from this type of mapping will return zeroes to the caller. + For a write or writeback operation, the ioend should update the + mapping to MAPPED. + Refer to the sections about ioends for more details. + + * **IOMAP_INLINE**: The file range maps to the memory buffer + specified by ``inline_data``. + For write operation, the ``->iomap_end`` function presumably + handles persisting the data. + The ``addr`` field must be set to ``IOMAP_NULL_ADDR``. + + * ``flags`` describe the status of the space mapping. + These flags should be set by the filesystem in ``->iomap_begin``: + + * **IOMAP_F_NEW**: The space under the mapping is newly allocated. + Areas that will not be written to must be zeroed. + If a write fails and the mapping is a space reservation, the + reservation must be deleted. + + * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed + to access any data written. + fdatasync is required to commit these changes to persistent + storage. + This needs to take into account metadata changes that *may* be made + at I/O completion, such as file size updates from direct I/O. + + * **IOMAP_F_SHARED**: The space under the mapping is shared. + Copy on write is necessary to avoid corrupting other file data. + + * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer + heads for pagecache operations. + Do not add more uses of this. + + * **IOMAP_F_MERGED**: Multiple contiguous block mappings were + coalesced into this single mapping. + This is only useful for FIEMAP. + + * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not + regular file data. + This is only useful for FIEMAP. + + * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can + be set by the filesystem for its own purposes. + + These flags can be set by iomap itself during file operations. + The filesystem should supply an ``->iomap_end`` function if it needs + to observe these flags: + + * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of + using this mapping. + + * **IOMAP_F_STALE**: The mapping was found to be stale. + iomap will call ``->iomap_end`` on this mapping and then + ``->iomap_begin`` to obtain a new mapping. + + Currently, these flags are only set by pagecache operations. + + * ``addr`` describes the device address, in bytes. + + * ``bdev`` describes the block device for this mapping. + This only needs to be set for mapped or unwritten operations. + + * ``dax_dev`` describes the DAX device for this mapping. + This only needs to be set for mapped or unwritten operations, and + only for a fsdax operation. + + * ``inline_data`` points to a memory buffer for I/O involving + ``IOMAP_INLINE`` mappings. + This value is ignored for all other mapping types. + + * ``private`` is a pointer to `filesystem-private information + <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_. + This value will be passed unchanged to ``->iomap_end``. + + * ``folio_ops`` will be covered in the section on pagecache operations. + + * ``validity_cookie`` is a magic freshness value set by the filesystem + that should be used to detect stale mappings. + For pagecache operations this is critical for correct operation + because page faults can occur, which implies that filesystem locks + should not be held between ``->iomap_begin`` and ``->iomap_end``. + Filesystems with completely static mappings need not set this value. + Only pagecache operations revalidate mappings; see the section about + ``iomap_valid`` for details. + +``struct iomap_ops`` +-------------------- + +Every iomap function requires the filesystem to pass an operations +structure to obtain a mapping and (optionally) to release the mapping: + +.. code-block:: c + + struct iomap_ops { + int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length, + unsigned flags, struct iomap *iomap, + struct iomap *srcmap); + + int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length, + ssize_t written, unsigned flags, + struct iomap *iomap); + }; + +``->iomap_begin`` +~~~~~~~~~~~~~~~~~ + +iomap operations call ``->iomap_begin`` to obtain one file mapping for +the range of bytes specified by ``pos`` and ``length`` for the file +``inode``. +This mapping should be returned through the ``iomap`` pointer. +The mapping must cover at least the first byte of the supplied file +range, but it does not need to cover the entire requested range. + +Each iomap operation describes the requested operation through the +``flags`` argument. +The exact value of ``flags`` will be documented in the +operation-specific sections below. +These flags can, at least in principle, apply generally to iomap +operations: + + * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to + block storage. + + * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to + memory-like storage. + + * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best + effort attempt to avoid any operation that would result in blocking + the submitting task. + This is similar in intent to ``O_NONBLOCK`` for network APIs - it is + intended for asynchronous applications to keep doing other work + instead of waiting for the specific unavailable filesystem resource + to become available. + Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use + trylock algorithms. + They need to be able to satisfy the entire I/O request range with a + single iomap mapping. + They need to avoid reading or writing metadata synchronously. + They need to avoid blocking memory allocations. + They need to avoid waiting on transaction reservations to allow + modifications to take place. + They probably should not be allocating new space. + And so on. + If there is any doubt in the filesystem developer's mind as to + whether any specific ``IOMAP_NOWAIT`` operation may end up blocking, + then they should return ``-EAGAIN`` as early as possible rather than + start the operation and force the submitting task to block. + ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or + ``RWF_NOWAIT``. + +If it is necessary to read existing file contents from a `different +<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_ +device or address range on a device, the filesystem should return that +information via ``srcmap``. +Only pagecache and fsdax operations support reading from one mapping and +writing to another. + +``->iomap_end`` +~~~~~~~~~~~~~~~ + +After the operation completes, the ``->iomap_end`` function, if present, +is called to signal that iomap is finished with a mapping. +Typically, implementations will use this function to tear down any +context that were set up in ``->iomap_begin``. +For example, a write might wish to commit the reservations for the bytes +that were operated upon and unreserve any space that was not operated +upon. +``written`` might be zero if no bytes were touched. +``flags`` will contain the same value passed to ``->iomap_begin``. +iomap ops for reads are not likely to need to supply this function. + +Both functions should return a negative errno code on error, or zero on +success. + +Preparing for File Operations +============================= + +iomap only handles mapping and I/O. +Filesystems must still call out to the VFS to check input parameters +and file state before initiating an I/O operation. +It does not handle obtaining filesystem freeze protection, updating of +timestamps, stripping privileges, or access control. + +Locking Hierarchy +================= + +iomap requires that filesystems supply their own locking model. +There are three categories of synchronization primitives, as far as +iomap is concerned: + + * The **upper** level primitive is provided by the filesystem to + coordinate access to different iomap operations. + The exact primitive is specifc to the filesystem and operation, + but is often a VFS inode, pagecache invalidation, or folio lock. + For example, a filesystem might take ``i_rwsem`` before calling + ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent + these two file operations from clobbering each other. + Pagecache writeback may lock a folio to prevent other threads from + accessing the folio until writeback is underway. + + * The **lower** level primitive is taken by the filesystem in the + ``->iomap_begin`` and ``->iomap_end`` functions to coordinate + access to the file space mapping information. + The fields of the iomap object should be filled out while holding + this primitive. + The upper level synchronization primitive, if any, remains held + while acquiring the lower level synchronization primitive. + For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem`` + while sampling mappings. + Filesystems with immutable mapping information may not require + synchronization here. + + * The **operation** primitive is taken by an iomap operation to + coordinate access to its own internal data structures. + The upper level synchronization primitive, if any, remains held + while acquiring this primitive. + The lower level primitive is not held while acquiring this + primitive. + For example, pagecache write operations will obtain a file mapping, + then grab and lock a folio to copy new contents. + It may also lock an internal folio state object to update metadata. + +The exact locking requirements are specific to the filesystem; for +certain operations, some of these locks can be elided. +All further mention of locking are *recommendations*, not mandates. +Each filesystem author must figure out the locking for themself. + +Bugs and Limitations +==================== + + * No support for fscrypt. + * No support for compression. + * No support for fsverity yet. + * Strong assumptions that IO should work the way it does on XFS. + * Does iomap *actually* work for non-regular file data? + +Patches welcome! |