author     Linus Torvalds <torvalds@linux-foundation.org>    2023-04-29 10:44:27 -0700
committer  Linus Torvalds <torvalds@linux-foundation.org>    2023-04-29 10:44:27 -0700
commit     56c455b38dba47ae9cb48d71b2a106d769d1a694
tree       a34645bb8a4067855a8affacc7a158fbcb742107
parent     bedf1495271bc2ea57903762b722f339ea680d0d
parent     9419092fb2630c30e4ffeb9ef61007ef0c61827a
Merge tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Dave Chinner:
"This consists mainly of online scrub functionality and the design
documentation for the upcoming online repair functionality built on
top of the scrub code:
- Added detailed design documentation for the upcoming online repair
feature
- major update to online scrub to complete the reverse mapping
cross-referencing infrastructure enabling us to fully validate
allocated metadata against owner records. This is the last piece of
scrub infrastructure needed before we can start merging online
repair functionality.
- Fixes for the ascii-ci hashing issues
- deprecation of the ascii-ci functionality
- on-disk format verification bug fixes
- various random bug fixes for syzbot and other bug reports"
* tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (107 commits)
xfs: fix livelock in delayed allocation at ENOSPC
xfs: Extend table marker on deprecated mount options table
xfs: fix duplicate includes
xfs: fix BUG_ON in xfs_getbmap()
xfs: verify buffer contents when we skip log replay
xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done
xfs: remove WARN when dquot cache insertion fails
xfs: don't consider future format versions valid
xfs: deprecate the ascii-ci feature
xfs: test the ascii case-insensitive hash
xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation
xfs: cross-reference rmap records with refcount btrees
xfs: cross-reference rmap records with inode btrees
xfs: cross-reference rmap records with free space btrees
xfs: cross-reference rmap records with ag btrees
xfs: introduce bitmap type for AG blocks
xfs: convert xbitmap to interval tree
xfs: drop the _safe behavior from the xbitmap foreach macro
xfs: don't load local xattr values during scrub
xfs: remove the for_each_xbitmap_ helpers
...
85 files changed, 10520 insertions, 1890 deletions
diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index e2561416391c..3a9c041d7f6c 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -236,13 +236,14 @@ the dates listed above. Deprecated Mount Options ======================== -=========================== ================ +============================ ================ Name Removal Schedule -=========================== ================ +============================ ================ Mounting with V4 filesystem September 2030 +Mounting ascii-ci filesystem September 2030 ikeep/noikeep September 2025 attr2/noattr2 September 2025 -=========================== ================ +============================ ================ Removed Mount Options diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index bee63d42e5ec..fbb2b5ada95b 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -123,4 +123,5 @@ Documentation for filesystem implementations. vfat xfs-delayed-logging-design xfs-self-describing-metadata + xfs-online-fsck-design zonefs diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst new file mode 100644 index 000000000000..791ab264b77e --- /dev/null +++ b/Documentation/filesystems/xfs-online-fsck-design.rst @@ -0,0 +1,5315 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. _xfs_online_fsck_design: + +.. + Mapping of heading styles within this document: + Heading 1 uses "====" above and below + Heading 2 uses "====" + Heading 3 uses "----" + Heading 4 uses "````" + Heading 5 uses "^^^^" + Heading 6 uses "~~~~" + Heading 7 uses "...." + + Sections are manually numbered because apparently that's what everyone + does in the kernel. + +====================== +XFS Online Fsck Design +====================== + +This document captures the design of the online filesystem check feature for +XFS. +The purpose of this document is threefold: + +- To help kernel distributors understand exactly what the XFS online fsck + feature is, and issues about which they should be aware. + +- To help people reading the code to familiarize themselves with the relevant + concepts and design points before they start digging into the code. + +- To help developers maintaining the system by capturing the reasons + supporting higher level decision making. + +As the online fsck code is merged, the links in this document to topic branches +will be replaced with links to code. + +This document is licensed under the terms of the GNU Public License, v2. +The primary author is Darrick J. Wong. + +This design document is split into seven parts. +Part 1 defines what fsck tools are and the motivations for writing a new one. +Parts 2 and 3 present a high level overview of how online fsck process works +and how it is tested to ensure correct functionality. +Part 4 discusses the user interface and the intended usage modes of the new +program. +Parts 5 and 6 show off the high level components and how they fit together, and +then present case studies of how each repair function actually works. +Part 7 sums up what has been discussed so far and speculates about what else +might be built atop online fsck. + +.. contents:: Table of Contents + :local: + +1. What is a Filesystem Check? 
+============================== + +A Unix filesystem has four main responsibilities: + +- Provide a hierarchy of names through which application programs can associate + arbitrary blobs of data for any length of time, + +- Virtualize physical storage media across those names, and + +- Retrieve the named data blobs at any time. + +- Examine resource usage. + +Metadata directly supporting these functions (e.g. files, directories, space +mappings) are sometimes called primary metadata. +Secondary metadata (e.g. reverse mapping and directory parent pointers) support +operations internal to the filesystem, such as internal consistency checking +and reorganization. +Summary metadata, as the name implies, condense information contained in +primary metadata for performance reasons. + +The filesystem check (fsck) tool examines all the metadata in a filesystem +to look for errors. +In addition to looking for obvious metadata corruptions, fsck also +cross-references different types of metadata records with each other to look +for inconsistencies. +People do not like losing data, so most fsck tools also contains some ability +to correct any problems found. +As a word of caution -- the primary goal of most Linux fsck tools is to restore +the filesystem metadata to a consistent state, not to maximize the data +recovered. +That precedent will not be challenged here. + +Filesystems of the 20th century generally lacked any redundancy in the ondisk +format, which means that fsck can only respond to errors by erasing files until +errors are no longer detected. +More recent filesystem designs contain enough redundancy in their metadata that +it is now possible to regenerate data structures when non-catastrophic errors +occur; this capability aids both strategies. + ++--------------------------------------------------------------------------+ +| **Note**: | ++--------------------------------------------------------------------------+ +| System administrators avoid data loss by increasing the number of | +| separate storage systems through the creation of backups; and they avoid | +| downtime by increasing the redundancy of each storage system through the | +| creation of RAID arrays. | +| fsck tools address only the first problem. | ++--------------------------------------------------------------------------+ + +TLDR; Show Me the Code! +----------------------- + +Code is posted to the kernel.org git trees as follows: +`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_, +`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and +`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_. +Each kernel patchset adding an online repair function will use the same branch +name across the kernel, xfsprogs, and fstests git repos. + +Existing Tools +-------------- + +The online fsck tool described here will be the third tool in the history of +XFS (on Linux) to check and repair filesystems. +Two programs precede it: + +The first program, ``xfs_check``, was created as part of the XFS debugger +(``xfs_db``) and can only be used with unmounted filesystems. +It walks all metadata in the filesystem looking for inconsistencies in the +metadata, though it lacks any ability to repair what it finds. +Due to its high memory requirements and inability to repair things, this +program is now deprecated and will not be discussed further. 
+ +The second program, ``xfs_repair``, was created to be faster and more robust +than the first program. +Like its predecessor, it can only be used with unmounted filesystems. +It uses extent-based in-memory data structures to reduce memory consumption, +and tries to schedule readahead IO appropriately to reduce I/O waiting time +while it scans the metadata of the entire filesystem. +The most important feature of this tool is its ability to respond to +inconsistencies in file metadata and directory tree by erasing things as needed +to eliminate problems. +Space usage metadata are rebuilt from the observed file metadata. + +Problem Statement +----------------- + +The current XFS tools leave several problems unsolved: + +1. **User programs** suddenly **lose access** to the filesystem when unexpected + shutdowns occur as a result of silent corruptions in the metadata. + These occur **unpredictably** and often without warning. + +2. **Users** experience a **total loss of service** during the recovery period + after an **unexpected shutdown** occurs. + +3. **Users** experience a **total loss of service** if the filesystem is taken + offline to **look for problems** proactively. + +4. **Data owners** cannot **check the integrity** of their stored data without + reading all of it. + This may expose them to substantial billing costs when a linear media scan + performed by the storage system administrator might suffice. + +5. **System administrators** cannot **schedule** a maintenance window to deal + with corruptions if they **lack the means** to assess filesystem health + while the filesystem is online. + +6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem + health when doing so requires **manual intervention** and downtime. + +7. **Users** can be tricked into **doing things they do not desire** when + malicious actors **exploit quirks of Unicode** to place misleading names + in directories. + +Given this definition of the problems to be solved and the actors who would +benefit, the proposed solution is a third fsck tool that acts on a running +filesystem. + +This new third program has three components: an in-kernel facility to check +metadata, an in-kernel facility to repair metadata, and a userspace driver +program to drive fsck activity on a live filesystem. +``xfs_scrub`` is the name of the driver program. +The rest of this document presents the goals and use cases of the new fsck +tool, describes its major design points in connection to those goals, and +discusses the similarities and differences with existing tools. + ++--------------------------------------------------------------------------+ +| **Note**: | ++--------------------------------------------------------------------------+ +| Throughout this document, the existing offline fsck tool can also be | +| referred to by its current name "``xfs_repair``". | +| The userspace driver program for the new online fsck tool can be | +| referred to as "``xfs_scrub``". | +| The kernel portion of online fsck that validates metadata is called | +| "online scrub", and portion of the kernel that fixes metadata is called | +| "online repair". | ++--------------------------------------------------------------------------+ + +The naming hierarchy is broken up into objects known as directories and files +and the physical space is split into pieces known as allocation groups. +Sharding enables better performance on highly parallel systems and helps to +contain the damage when corruptions occur. 
+The division of the filesystem into principal objects (allocation groups and +inodes) means that there are ample opportunities to perform targeted checks and +repairs on a subset of the filesystem. + +While this is going on, other parts continue processing IO requests. +Even if a piece of filesystem metadata can only be regenerated by scanning the +entire system, the scan can still be done in the background while other file +operations continue. + +In summary, online fsck takes advantage of resource sharding and redundant +metadata to enable targeted checking and repair operations while the system +is running. +This capability will be coupled to automatic system management so that +autonomous self-healing of XFS maximizes service availability. + +2. Theory of Operation +====================== + +Because it is necessary for online fsck to lock and scan live metadata objects, +online fsck consists of three separate code components. +The first is the userspace driver program ``xfs_scrub``, which is responsible +for identifying individual metadata items, scheduling work items for them, +reacting to the outcomes appropriately, and reporting results to the system +administrator. +The second and third are in the kernel, which implements functions to check +and repair each type of online fsck work item. + ++------------------------------------------------------------------+ +| **Note**: | ++------------------------------------------------------------------+ +| For brevity, this document shortens the phrase "online fsck work | +| item" to "scrub item". | ++------------------------------------------------------------------+ + +Scrub item types are delineated in a manner consistent with the Unix design +philosophy, which is to say that each item should handle one aspect of a +metadata structure, and handle it well. + +Scope +----- + +In principle, online fsck should be able to check and to repair everything that +the offline fsck program can handle. +However, online fsck cannot be running 100% of the time, which means that +latent errors may creep in after a scrub completes. +If these errors cause the next mount to fail, offline fsck is the only +solution. +This limitation means that maintenance of the offline fsck tool will continue. +A second limitation of online fsck is that it must follow the same resource +sharing and lock acquisition rules as the regular filesystem. +This means that scrub cannot take *any* shortcuts to save time, because doing +so could lead to concurrency problems. +In other words, online fsck is not a complete replacement for offline fsck, and +a complete run of online fsck may take longer than online fsck. +However, both of these limitations are acceptable tradeoffs to satisfy the +different motivations of online fsck, which are to **minimize system downtime** +and to **increase predictability of operation**. + +.. _scrubphases: + +Phases of Work +-------------- + +The userspace driver program ``xfs_scrub`` splits the work of checking and +repairing an entire filesystem into seven phases. +Each phase concentrates on checking specific types of scrub items and depends +on the success of all previous phases. +The seven phases are as follows: + +1. Collect geometry information about the mounted filesystem and computer, + discover the online fsck capabilities of the kernel, and open the + underlying storage devices. + +2. Check allocation group metadata, all realtime volume metadata, and all quota + files. + Each metadata structure is scheduled as a separate scrub item. 
+ If corruption is found in the inode header or inode btree and ``xfs_scrub`` + is permitted to perform repairs, then those scrub items are repaired to + prepare for phase 3. + Repairs are implemented by using the information in the scrub item to + resubmit the kernel scrub call with the repair flag enabled; this is + discussed in the next section. + Optimizations and all other repairs are deferred to phase 4. + +3. Check all metadata of every file in the filesystem. + Each metadata structure is also scheduled as a separate scrub item. + If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, + and there were no problems detected during phase 2, then those scrub items + are repaired immediately. + Optimizations, deferred repairs, and unsuccessful repairs are deferred to + phase 4. + +4. All remaining repairs and scheduled optimizations are performed during this + phase, if the caller permits them. + Before starting repairs, the summary counters are checked and any necessary + repairs are performed so that subsequent repairs will not fail the resource + reservation step due to wildly incorrect summary counters. + Unsuccesful repairs are requeued as long as forward progress on repairs is + made somewhere in the filesystem. + Free space in the filesystem is trimmed at the end of phase 4 if the + filesystem is clean. + +5. By the start of this phase, all primary and secondary filesystem metadata + must be correct. + Summary counters such as the free space counts and quota resource counts + are checked and corrected. + Directory entry names and extended attribute names are checked for + suspicious entries such as control characters or confusing Unicode sequences + appearing in names. + +6. If the caller asks for a media scan, read all allocated and written data + file extents in the filesystem. + The ability to use hardware-assisted data file integrity checking is new + to online fsck; neither of the previous tools have this capability. + If media errors occur, they will be mapped to the owning files and reported. + +7. Re-check the summary counters and presents the caller with a summary of + space usage and file counts. + +This allocation of responsibilities will be :ref:`revisited <scrubcheck>` +later in this document. + +Steps for Each Scrub Item +------------------------- + +The kernel scrub code uses a three-step strategy for checking and repairing +the one aspect of a metadata object represented by a scrub item: + +1. The scrub item of interest is checked for corruptions; opportunities for + optimization; and for values that are directly controlled by the system + administrator but look suspicious. + If the item is not corrupt or does not need optimization, resource are + released and the positive scan results are returned to userspace. + If the item is corrupt or could be optimized but the caller does not permit + this, resources are released and the negative scan results are returned to + userspace. + Otherwise, the kernel moves on to the second step. + +2. The repair function is called to rebuild the data structure. + Repair functions generally choose rebuild a structure from other metadata + rather than try to salvage the existing structure. + If the repair fails, the scan results from the first step are returned to + userspace. + Otherwise, the kernel moves on to the third step. + +3. In the third step, the kernel runs the same checks over the new metadata + item to assess the efficacy of the repairs. + The results of the reassessment are returned to userspace. 
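+
+As an illustration of the three steps above, here is a minimal sketch -- not
+the actual ``xfs_scrub`` source -- of how a userspace driver program might
+submit a single per-AG scrub item and, if the caller permits repairs,
+resubmit it with the repair flag set.
+It assumes that the xfsprogs development headers (``<xfs/xfs.h>``) supply
+``XFS_IOC_SCRUB_METADATA``, ``struct xfs_scrub_metadata``, and the
+``XFS_SCRUB_*`` flag definitions; the ``scrub_one`` helper and the
+``NEEDS_ACTION`` mask are invented for this example.
+
+.. code-block:: c
+
+    #include <stdbool.h>
+    #include <stdint.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <xfs/xfs.h>    /* assumed to provide the scrub ioctl definitions */
+
+    /* Outcome flags that mean "repair or optimization is warranted". */
+    #define NEEDS_ACTION    (XFS_SCRUB_OFLAG_CORRUPT | \
+                             XFS_SCRUB_OFLAG_XCORRUPT | \
+                             XFS_SCRUB_OFLAG_PREEN)
+
+    /* Check one per-AG scrub item; resubmit it for repair if permitted. */
+    static int scrub_one(int fsfd, uint32_t type, uint32_t agno, bool repair)
+    {
+            struct xfs_scrub_metadata sm;
+
+            /* Step 1: check the metadata object and collect outcome flags. */
+            memset(&sm, 0, sizeof(sm));
+            sm.sm_type = type;
+            sm.sm_agno = agno;
+            if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
+                    return -1;
+            if (!(sm.sm_flags & NEEDS_ACTION) || !repair)
+                    return sm.sm_flags;
+
+            /*
+             * Step 2: resubmit with the repair flag set; the kernel rebuilds
+             * the structure and (step 3) runs the same checks again before
+             * returning, so the flags below reflect that reassessment.
+             */
+            sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
+            if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
+                    return -1;
+            return sm.sm_flags;
+    }
+
+Per-AG scrub types (one of the ``XFS_SCRUB_TYPE_*`` constants) identify their
+target with ``sm_agno`` as shown here; file-scoped scrub types use ``sm_ino``
+and ``sm_gen`` instead, and ``fsfd`` may be any open file descriptor within
+the filesystem being checked.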
+ +Classification of Metadata +-------------------------- + +Each type of metadata object (and therefore each type of scrub item) is +classified as follows: + +Primary Metadata +```````````````` + +Metadata structures in this category should be most familiar to filesystem +users either because they are directly created by the user or they index +objects created by the user +Most filesystem objects fall into this class: + +- Free space and reference count information + +- Inode records and indexes + +- Storage mapping information for file data + +- Directories + +- Extended attributes + +- Symbolic links + +- Quota limits + +Scrub obeys the same rules as regular filesystem accesses for resource and lock +acquisition. + +Primary metadata objects are the simplest for scrub to process. +The principal filesystem object (either an allocation group or an inode) that +owns the item being scrubbed is locked to guard against concurrent updates. +The check function examines every record associated with the type for obvious +errors and cross-references healthy records against other metadata to look for +inconsistencies. +Repairs for this class of scrub item are simple, since the repair function +starts by holding all the resources acquired in the previous step. +The repair function scans available metadata as needed to record all the +observations needed to complete the structure. +Next, it stages the observations in a new ondisk structure and commits it +atomically to complete the repair. +Finally, the storage from the old data structure are carefully reaped. + +Because ``xfs_scrub`` locks a primary object for the duration of the repair, +this is effectively an offline repair operation performed on a subset of the +filesystem. +This minimizes the complexity of the repair code because it is not necessary to +handle concurrent updates from other threads, nor is it necessary to access +any other part of the filesystem. +As a result, indexed structures can be rebuilt very quickly, and programs +trying to access the damaged structure will be blocked until repairs complete. +The only infrastructure needed by the repair code are the staging area for +observations and a means to write new structures to disk. +Despite these limitations, the advantage that online repair holds is clear: +targeted work on individual shards of the filesystem avoids total loss of +service. + +This mechanism is described in section 2.1 ("Off-Line Algorithm") of +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction +Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_, +*Extending Database Technology*, pp. 293-309, 1992. + +Most primary metadata repair functions stage their intermediate results in an +in-memory array prior to formatting the new ondisk structure, which is very +similar to the list-based algorithm discussed in section 2.3 ("List-Based +Algorithms") of Srinivasan. +However, any data structure builder that maintains a resource lock for the +duration of the repair is *always* an offline algorithm. + +.. _secondary_metadata: + +Secondary Metadata +`````````````````` + +Metadata structures in this category reflect records found in primary metadata, +but are only needed for online fsck or for reorganization of the filesystem. 
+ +Secondary metadata include: + +- Reverse mapping information + +- Directory parent pointers + +This class of metadata is difficult for scrub to process because scrub attaches +to the secondary object but needs to check primary metadata, which runs counter +to the usual order of resource acquisition. +Frequently, this means that full filesystems scans are necessary to rebuild the +metadata. +Check functions can be limited in scope to reduce runtime. +Repairs, however, require a full scan of primary metadata, which can take a +long time to complete. +Under these conditions, ``xfs_scrub`` cannot lock resources for the entire +duration of the repair. + +Instead, repair functions set up an in-memory staging structure to store +observations. +Depending on the requirements of the specific repair function, the staging +index will either have the same format as the ondisk structure or a design +specific to that repair function. +The next step is to release all locks and start the filesystem scan. +When the repair scanner needs to record an observation, the staging data are +locked long enough to apply the update. +While the filesystem scan is in progress, the repair function hooks the +filesystem so that it can apply pending filesystem updates to the staging +information. +Once the scan is done, the owning object is re-locked, the live data is used to +write a new ondisk structure, and the repairs are committed atomically. +The hooks are disabled and the staging staging area is freed. +Finally, the storage from the old data structure are carefully reaped. + +Introducing concurrency helps online repair avoid various locking problems, but +comes at a high cost to code complexity. +Live filesystem code has to be hooked so that the repair function can observe +updates in progress. +The staging area has to become a fully functional parallel structure so that +updates can be merged from the hooks. +Finally, the hook, the filesystem scan, and the inode locking model must be +sufficiently well integrated that a hook event can decide if a given update +should be applied to the staging structure. + +In theory, the scrub implementation could apply these same techniques for +primary metadata, but doing so would make it massively more complex and less +performant. +Programs attempting to access the damaged structures are not blocked from +operation, which may cause application failure or an unplanned filesystem +shutdown. + +Inspiration for the secondary metadata repair strategy was drawn from section +2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File") +and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for +Creating Indexes for Very Large Tables Without Quiescing Updates" +<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992. + +The sidecar index mentioned above bears some resemblance to the side file +method mentioned in Srinivasan and Mohan. +Their method consists of an index builder that extracts relevant record data to +build the new structure as quickly as possible; and an auxiliary structure that +captures all updates that would be committed to the index by other threads were +the new index already online. +After the index building scan finishes, the updates recorded in the side file +are applied to the new index. +To avoid conflicts between the index builder and other writer threads, the +builder maintains a publicly visible cursor that tracks the progress of the +scan through the record space. 
+To avoid duplication of work between the side file and the index builder, side +file updates are elided when the record ID for the update is greater than the +cursor position within the record ID space. + +To minimize changes to the rest of the codebase, XFS online repair keeps the +replacement index hidden until it's completely ready to go. +In other words, there is no attempt to expose the keyspace of the new index +while repair is running. +The complexity of such an approach would be very high and perhaps more +appropriate to building *new* indices. + +**Future Work Question**: Can the full scan and live update code used to +facilitate a repair also be used to implement a comprehensive check? + +*Answer*: In theory, yes. Check would be much stronger if each scrub function +employed these live scans to build a shadow copy of the metadata and then +compared the shadow records to the ondisk records. +However, doing that is a fair amount more work than what the checking functions +do now. +The live scans and hooks were developed much later. +That in turn increases the runtime of those scrub functions. + +Summary Information +``````````````````` + +Metadata structures in this last category summarize the contents of primary +metadata records. +These are often used to speed up resource usage queries, and are many times +smaller than the primary metadata which they represent. + +Examples of summary information include: + +- Summary counts of free space and inodes + +- File link counts from directories + +- Quota resource usage counts + +Check and repair require full filesystem scans, but resource and lock +acquisition follow the same paths as regular filesystem accesses. + +The superblock summary counters have special requirements due to the underlying +implementation of the incore counters, and will be treated separately. +Check and repair of the other types of summary counters (quota resource counts +and file link counts) employ the same filesystem scanning and hooking +techniques as outlined above, but because the underlying data are sets of +integer counters, the staging data need not be a fully functional mirror of the +ondisk structure. + +Inspiration for quota and file link count repair strategies were drawn from +sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View +Maintenace") of G. Graefe, `"Concurrent Queries and Updates in Summary Views +and Their Indexes" +<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011. + +Since quotas are non-negative integer counts of resource usage, online +quotacheck can use the incremental view deltas described in section 2.14 to +track pending changes to the block and inode usage counts in each transaction, +and commit those changes to a dquot side file when the transaction commits. +Delta tracking is necessary for dquots because the index builder scans inodes, +whereas the data structure being rebuilt is an index of dquots. +Link count checking combines the view deltas and commit step into one because +it sets attributes of the objects being scanned instead of writing them to a +separate data structure. +Each online fsck function will be discussed as case studies later in this +document. + +Risk Management +--------------- + +During the development of online fsck, several risk factors were identified +that may make the feature unsuitable for certain distributors and users. +Steps can be taken to mitigate or eliminate those risks, though at a cost to +functionality. 
+ +- **Decreased performance**: Adding metadata indices to the filesystem + increases the time cost of persisting changes to disk, and the reverse space + mapping and directory parent pointers are no exception. + System administrators who require the maximum performance can disable the + reverse mapping features at format time, though this choice dramatically + reduces the ability of online fsck to find inconsistencies and repair them. + +- **Incorrect repairs**: As with all software, there might be defects in the + software that result in incorrect repairs being written to the filesystem. + Systematic fuzz testing (detailed in the next section) is employed by the + authors to find bugs early, but it might not catch everything. + The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB`` + and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to + accept this risk. + The xfsprogs build system has a configure option (``--enable-scrub=no``) that + disables building of the ``xfs_scrub`` binary, though this is not a risk + mitigation if the kernel functionality remains enabled. + +- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be + repairable. + If the keyspaces of several metadata indices overlap in some manner but a + coherent narrative cannot be formed from records collected, then the repair + fails. + To reduce the chance that a repair will fail with a dirty transaction and + render the filesystem unusable, the online repair functions have been + designed to stage and validate all new records before committing the new + structure. + +- **Misbehavior**: Online fsck requires many privileges -- raw IO to block + devices, opening files by handle, ignoring Unix discretionary access control, + and the ability to perform administrative changes. + Running this automatically in the background scares people, so the systemd + background service is configured to run with only the privileges required. + Obviously, this cannot address certain problems like the kernel crashing or + deadlocking, but it should be sufficient to prevent the scrub process from + escaping and reconfiguring the system. + The cron job does not have this protection. + +- **Fuzz Kiddiez**: There are many people now who seem to think that running + automated fuzz testing of ondisk artifacts to find mischevious behavior and + spraying exploit code onto the public mailing list for instant zero-day + disclosure is somehow of some social benefit. + In the view of this author, the benefit is realized only when the fuzz + operators help to **fix** the flaws, but this opinion apparently is not + widely shared among security "researchers". + The XFS maintainers' continuing ability to manage these events presents an + ongoing risk to the stability of the development process. + Automated testing should front-load some of the risk while the feature is + considered EXPERIMENTAL. + +Many of these risks are inherent to software programming. +Despite this, it is hoped that this new functionality will prove useful in +reducing unexpected downtime. + +3. Testing Plan +=============== + +As stated before, fsck tools have three main goals: + +1. Detect inconsistencies in the metadata; + +2. Eliminate those inconsistencies; and + +3. Minimize further loss of data. + +Demonstrations of correct operation are necessary to build users' confidence +that the software behaves within expectations. 
+Unfortunately, it was not really feasible to perform regular exhaustive testing +of every aspect of a fsck tool until the introduction of low-cost virtual +machines with high-IOPS storage. +With ample hardware availability in mind, the testing strategy for the online +fsck project involves differential analysis against the existing fsck tools and +systematic testing of every attribute of every type of metadata object. +Testing can be split into four major categories, as discussed below. + +Integrated Testing with fstests +------------------------------- + +The primary goal of any free software QA effort is to make testing as +inexpensive and widespread as possible to maximize the scaling advantages of +community. +In other words, testing should maximize the breadth of filesystem configuration +scenarios and hardware setups. +This improves code quality by enabling the authors of online fsck to find and +fix bugs early, and helps developers of new features to find integration +issues earlier in their development effort. + +The Linux filesystem community shares a common QA testing suite, +`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for +functional and regression testing. +Even before development work began on online fsck, fstests (when run on XFS) +would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and +scratch filesystems between each test. +This provides a level of assurance that the kernel and the fsck tools stay in +alignment about what constitutes consistent metadata. +During development of the online checking code, fstests was modified to run +``xfs_scrub -n`` between each test to ensure that the new checking code +produces the same results as the two existing fsck tools. + +To start development of online repair, fstests was modified to run +``xfs_repair`` to rebuild the filesystem's metadata indices between tests. +This ensures that offline repair does not crash, leave a corrupt filesystem +after it exists, or trigger complaints from the online check. +This also established a baseline for what can and cannot be repaired offline. +To complete the first phase of development of online repair, fstests was +modified to be able to run ``xfs_scrub`` in a "force rebuild" mode. +This enables a comparison of the effectiveness of online repair as compared to +the existing offline repair tools. + +General Fuzz Testing of Metadata Blocks +--------------------------------------- + +XFS benefits greatly from having a very robust debugging tool, ``xfs_db``. + +Before development of online fsck even began, a set of fstests were created +to test the rather common fault that entire metadata blocks get corrupted. +This required the creation of fstests library code that can create a filesystem +containing every possible type of metadata object. +Next, individual test cases were created to create a test filesystem, identify +a single block of a specific type of metadata object, trash it with the +existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a +particular metadata validation strategy. + +This earlier test suite enabled XFS developers to test the ability of the +in-kernel validation functions and the ability of the offline fsck tool to +detect and eliminate the inconsistent metadata. +This part of the test suite was extended to cover online fsck in exactly the +same manner. 
+ +In other words, for a given fstests filesystem configuration: + +* For each metadata object existing on the filesystem: + + * Write garbage to it + + * Test the reactions of: + + 1. The kernel verifiers to stop obviously bad metadata + 2. Offline repair (``xfs_repair``) to detect and fix + 3. Online repair (``xfs_scrub``) to detect and fix + +Targeted Fuzz Testing of Metadata Records +----------------------------------------- + +The testing plan for online fsck includes extending the existing fs testing +infrastructure to provide a much more powerful facility: targeted fuzz testing +of every metadata field of every metadata object in the filesystem. +``xfs_db`` can modify every field of every metadata structure in every +block in the filesystem to simulate the effects of memory corruption and +software bugs. +Given that fstests already contains the ability to create a filesystem +containing every metadata format known to the filesystem, ``xfs_db`` can be +used to perform exhaustive fuzz testing! + +For a given fstests filesystem configuration: + +* For each metadata object existing on the filesystem... + + * For each record inside that metadata object... + + * For each field inside that record... + + * For each conceivable type of transformation that can be applied to a bit field... + + 1. Clear all bits + 2. Set all bits + 3. Toggle the most significant bit + 4. Toggle the middle bit + 5. Toggle the least significant bit + 6. Add a small quantity + 7. Subtract a small quantity + 8. Randomize the contents + + * ...test the reactions of: + + 1. The kernel verifiers to stop obviously bad metadata + 2. Offline checking (``xfs_repair -n``) + 3. Offline repair (``xfs_repair``) + 4. Online checking (``xfs_scrub -n``) + 5. Online repair (``xfs_scrub``) + 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) + +This is quite the combinatoric explosion! + +Fortunately, having this much test coverage makes it easy for XFS developers to +check the responses of XFS' fsck tools. +Since the introduction of the fuzz testing framework, these tests have been +used to discover incorrect repair code and missing functionality for entire +classes of metadata objects in ``xfs_repair``. +The enhanced testing was used to finalize the deprecation of ``xfs_check`` by +confirming that ``xfs_repair`` could detect at least as many corruptions as +the older tool. + +These tests have been very valuable for ``xfs_scrub`` in the same ways -- they +allow the online fsck developers to compare online fsck against offline fsck, +and they enable XFS developers to find deficiencies in the code base. + +Proposed patchsets include +`general fuzzer improvements +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_, +`fuzzing baselines +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_, +and `improvements in fuzz testing comprehensiveness +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_. + +Stress Testing +-------------- + +A unique requirement to online fsck is the ability to operate on a filesystem +concurrently with regular workloads. +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable +impact on the running system, the online repair code should never introduce +inconsistencies into the filesystem metadata, and regular workloads should +never notice resource starvation. 
+To verify that these conditions are being met, fstests has been enhanced in +the following ways: + +* For each scrub item type, create a test to exercise checking that item type + while running ``fsstress``. +* For each scrub item type, create a test to exercise repairing that item type + while running ``fsstress``. +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole + filesystem doesn't cause problems. +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that + force-repairing the whole filesystem doesn't cause problems. +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while + freezing and thawing the filesystem. +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while + remounting the filesystem read-only and read-write. +* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?) + +Success is defined by the ability to run all of these tests without observing +any unexpected filesystem shutdowns due to corrupted metadata, kernel hang +check warnings, or any other sort of mischief. + +Proposed patchsets include `general stress testing +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_ +and the `evolution of existing per-function stress testing +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_. + +4. User Interface +================= + +The primary user of online fsck is the system administrator, just like offline +repair. +Online fsck presents two modes of operation to administrators: +A foreground CLI process for online fsck on demand, and a background service +that performs autonomous checking and repair. + +Checking on Demand +------------------ + +For administrators who want the absolute freshest information about the +metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on +a command line. +The program checks every piece of metadata in the filesystem while the +administrator waits for the results to be reported, just like the existing +``xfs_repair`` tool. +Both tools share a ``-n`` option to perform a read-only scan, and a ``-v`` +option to increase the verbosity of the information reported. + +A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error +correction capabilities of the hardware to check data file contents. +The media scan is not enabled by default because it may dramatically increase +program runtime and consume a lot of bandwidth on older storage hardware. + +The output of a foreground invocation is captured in the system log. + +The ``xfs_scrub_all`` program walks the list of mounted filesystems and +initiates ``xfs_scrub`` for each of them in parallel. +It serializes scans for any filesystems that resolve to the same top level +kernel block device to prevent resource overconsumption. + +Background Service +------------------ + +To reduce the workload of system administrators, the ``xfs_scrub`` package +provides a suite of `systemd <https://systemd.io/>`_ timers and services that +run online fsck automatically on weekends by default. +The background service configures scrub to run with as little privilege as +possible, the lowest CPU and IO priority, and in a CPU-constrained single +threaded mode. +This can be tuned by the systemd administrator at any time to suit the latency +and throughput requirements of customer workloads. + +The output of the background service is also captured in the system log. 
+If desired, reports of failures (either due to inconsistencies or mere runtime +errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment +variable in the following service files: + +* ``xfs_scrub_fail@.service`` +* ``xfs_scrub_media_fail@.service`` +* ``xfs_scrub_all_fail.service`` + +The decision to enable the background scan is left to the system administrator. +This can be done by enabling either of the following services: + +* ``xfs_scrub_all.timer`` on systemd systems +* ``xfs_scrub_all.cron`` on non-systemd systems + +This automatic weekly scan is configured out of the box to perform an +additional media scan of all file data once per month. +This is less foolproof than, say, storing file data block checksums, but much +more performant if application software provides its own integrity checking, +redundancy can be provided elsewhere above the filesystem, or the storage +device's integrity guarantees are deemed sufficient. + +The systemd unit file definitions have been subjected to a security audit +(as of systemd 249) to ensure that the xfs_scrub processes have as little +access to the rest of the system as possible. +This was performed via ``systemd-analyze security``, after which privileges +were restricted to the minimum required, sandboxing was set up to the maximal +extent possible with sandboxing and system call filtering; and access to the +filesystem tree was restricted to the minimum needed to start the program and +access the filesystem being scanned. +The service definition files restrict CPU usage to 80% of one CPU core, and +apply as nice of a priority to IO and CPU scheduling as possible. +This measure was taken to minimize delays in the rest of the filesystem. +No such hardening has been performed for the cron job. + +Proposed patchset: +`Enabling the xfs_scrub background service +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_. + +Health Reporting +---------------- + +XFS caches a summary of each filesystem's health status in memory. +The information is updated whenever ``xfs_scrub`` is run, or whenever +inconsistencies are detected in the filesystem metadata during regular +operations. +System administrators should use the ``health`` command of ``xfs_spaceman`` to +download this information into a human-readable format. +If problems have been observed, the administrator can schedule a reduced +service window to run the online repair tool to correct the problem. +Failing that, the administrator can decide to schedule a maintenance window to +run the traditional offline repair tool to correct the problem. + +**Future Work Question**: Should the health reporting integrate with the new +inotify fs error notification system? +Would it be helpful for sysadmins to have a daemon to listen for corruption +notifications and initiate a repair? + +*Answer*: These questions remain unanswered, but should be a part of the +conversation with early adopters and potential downstream users of XFS. + +Proposed patchsets include +`wiring up health reports to correction returns +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_ +and +`preservation of sickness info during memory reclaim +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_. + +5. 
Kernel Algorithms and Data Structures +======================================== + +This section discusses the key algorithms and data structures of the kernel +code that provide the ability to check and repair metadata while the system +is running. +The first chapters in this section reveal the pieces that provide the +foundation for checking metadata. +The remainder of this section presents the mechanisms through which XFS +regenerates itself. + +Self Describing Metadata +------------------------ + +Starting with XFS version 5 in 2012, XFS updated the format of nearly every +ondisk block header to record a magic number, a checksum, a universally +"unique" identifier (UUID), an owner code, the ondisk address of the block, +and a log sequence number. +When loading a block buffer from disk, the magic number, UUID, owner, and +ondisk address confirm that the retrieved block matches the specific owner of +the current filesystem, and that the information contained in the block is +supposed to be found at the ondisk address. +The first three components enable checking tools to disregard alleged metadata +that doesn't belong to the filesystem, and the fourth component enables the +filesystem to detect lost writes. + +Whenever a file system operation modifies a block, the change is submitted +to the log as part of a transaction. +The log then processes these transactions marking them done once they are +safely persisted to storage. +The logging code maintains the checksum and the log sequence number of the last +transactional update. +Checksums are useful for detecting torn writes and other discrepancies that can +be introduced between the computer and its storage devices. +Sequence number tracking enables log recovery to avoid applying out of date +log updates to the filesystem. + +These two features improve overall runtime resiliency by providing a means for +the filesystem to detect obvious corruption when reading metadata blocks from +disk, but these buffer verifiers cannot provide any consistency checking +between metadata structures. + +For more information, please see the documentation for +Documentation/filesystems/xfs-self-describing-metadata.rst + +Reverse Mapping +--------------- + +The original design of XFS (circa 1993) is an improvement upon 1980s Unix +filesystem design. +In those days, storage density was expensive, CPU time was scarce, and +excessive seek time could kill performance. +For performance reasons, filesystem authors were reluctant to add redundancy to +the filesystem, even at the cost of data integrity. +Filesystems designers in the early 21st century choose different strategies to +increase internal redundancy -- either storing nearly identical copies of +metadata, or more space-efficient encoding techniques. + +For XFS, a different redundancy strategy was chosen to modernize the design: +a secondary space usage index that maps allocated disk extents back to their +owners. +By adding a new index, the filesystem retains most of its ability to scale +well to heavily threaded workloads involving large datasets, since the primary +file metadata (the directory tree, the file block map, and the allocation +groups) remain unchanged. +Like any system that improves redundancy, the reverse-mapping feature increases +overhead costs for space mapping activities. 
+However, it has two critical advantages: first, the reverse index is key to +enabling online fsck and other requested functionality such as free space +defragmentation, better media failure reporting, and filesystem shrinking. +Second, the different ondisk storage format of the reverse mapping btree +defeats device-level deduplication because the filesystem requires real +redundancy. + ++--------------------------------------------------------------------------+ +| **Sidebar**: | ++--------------------------------------------------------------------------+ +| A criticism of adding the secondary index is that it does nothing to | +| improve the robustness of user data storage itself. | +| This is a valid point, but adding a new index for file data block | +| checksums increases write amplification by turning data overwrites into | +| copy-writes, which age the filesystem prematurely. | +| In keeping with thirty years of precedent, users who want file data | +| integrity can supply as powerful a solution as they require. | +| As for metadata, the complexity of adding a new secondary index of space | +| usage is much less than adding volume management and storage device | +| mirroring to XFS itself. | +| Perfection of RAID and volume management are best left to existing | +| layers in the kernel. | ++--------------------------------------------------------------------------+ + +The information captured in a reverse space mapping record is as follows: + +.. code-block:: c + + struct xfs_rmap_irec { + xfs_agblock_t rm_startblock; /* extent start block */ + xfs_extlen_t rm_blockcount; /* extent length */ + uint64_t rm_owner; /* extent owner */ + uint64_t rm_offset; /* offset within the owner */ + unsigned int rm_flags; /* state flags */ + }; + +The first two fields capture the location and size of the physical space, +in units of filesystem blocks. +The owner field tells scrub which metadata structure or file inode have been +assigned this space. +For space allocated to files, the offset field tells scrub where the space was +mapped within the file fork. +Finally, the flags field provides extra information about the space usage -- +is this an attribute fork extent? A file mapping btree extent? Or an +unwritten data extent? + +Online filesystem checking judges the consistency of each primary metadata +record by comparing its information against all other space indices. +The reverse mapping index plays a key role in the consistency checking process +because it contains a centralized alternate copy of all space allocation +information. +Program runtime and ease of resource acquisition are the only real limits to +what online checking can consult. +For example, a file data extent mapping can be checked against: + +* The absence of an entry in the free space information. +* The absence of an entry in the inode index. +* The absence of an entry in the reference count data if the file is not + marked as having shared extents. +* The correspondence of an entry in the reverse mapping information. + +There are several observations to make about reverse mapping indices: + +1. Reverse mappings can provide a positive affirmation of correctness if any of + the above primary metadata are in doubt. + The checking code for most primary metadata follows a path similar to the + one outlined above. + +2. Proving the consistency of secondary metadata with the primary metadata is + difficult because that requires a full scan of all primary space metadata, + which is very time intensive. 
+ For example, checking a reverse mapping record for a file extent mapping + btree block requires locking the file and searching the entire btree to + confirm the block. + Instead, scrub relies on rigorous cross-referencing during the primary space + mapping structure checks. + +3. Consistency scans must use non-blocking lock acquisition primitives if the + required locking order is not the same order used by regular filesystem + operations. + For example, if the filesystem normally takes a file ILOCK before taking + the AGF buffer lock but scrub wants to take a file ILOCK while holding + an AGF buffer lock, scrub cannot block on that second acquisition. + This means that forward progress during this part of a scan of the reverse + mapping data cannot be guaranteed if system load is heavy. + +In summary, reverse mappings play a key role in reconstruction of primary +metadata. +The details of how these records are staged, written to disk, and committed +into the filesystem are covered in subsequent sections. + +Checking and Cross-Referencing +------------------------------ + +The first step of checking a metadata structure is to examine every record +contained within the structure and its relationship with the rest of the +system. +XFS contains multiple layers of checking to try to prevent inconsistent +metadata from wreaking havoc on the system. +Each of these layers contributes information that helps the kernel to make +three decisions about the health of a metadata structure: + +- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ? +- Is this structure inconsistent with the rest of the system + (``XFS_SCRUB_OFLAG_XCORRUPT``) ? +- Is there so much damage around the filesystem that cross-referencing is not + possible (``XFS_SCRUB_OFLAG_XFAIL``) ? +- Can the structure be optimized to improve performance or reduce the size of + metadata (``XFS_SCRUB_OFLAG_PREEN``) ? +- Does the structure contain data that is not inconsistent but deserves review + by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ? + +The following sections describe how the metadata scrubbing process works. + +Metadata Buffer Verification +```````````````````````````` + +The lowest layer of metadata protection in XFS are the metadata verifiers built +into the buffer cache. +These functions perform inexpensive internal consistency checking of the block +itself, and answer these questions: + +- Does the block belong to this filesystem? + +- Does the block belong to the structure that asked for the read? + This assumes that metadata blocks only have one owner, which is always true + in XFS. + +- Is the type of data stored in the block within a reasonable range of what + scrub is expecting? + +- Does the physical location of the block match the location it was read from? + +- Does the block checksum match the data? + +The scope of the protections here are very limited -- verifiers can only +establish that the filesystem code is reasonably free of gross corruption bugs +and that the storage system is reasonably competent at retrieval. +Corruption problems observed at runtime cause the generation of health reports, +failed system calls, and in the extreme case, filesystem shutdowns if the +corrupt metadata force the cancellation of a dirty transaction. + +Every online fsck scrubbing function is expected to read every ondisk metadata +block of a structure in the course of checking the structure. 
+Corruption problems observed during a check are immediately reported to +userspace as corruption; during a cross-reference, they are reported as a +failure to cross-reference once the full examination is complete. +Reads satisfied by a buffer already in cache (and hence already verified) +bypass these checks. + +Internal Consistency Checks +``````````````````````````` + +After the buffer cache, the next level of metadata protection is the internal +record verification code built into the filesystem. +These checks are split between the buffer verifiers, the in-filesystem users of +the buffer cache, and the scrub code itself, depending on the amount of higher +level context required. +The scope of checking is still internal to the block. +These higher level checking functions answer these questions: + +- Does the type of data stored in the block match what scrub is expecting? + +- Does the block belong to the owning structure that asked for the read? + +- If the block contains records, do the records fit within the block? + +- If the block tracks internal free space information, is it consistent with + the record areas? + +- Are the records contained inside the block free of obvious corruptions? + +Record checks in this category are more rigorous and more time-intensive. +For example, block pointers and inumbers are checked to ensure that they point +within the dynamically allocated parts of an allocation group and within +the filesystem. +Names are checked for invalid characters, and flags are checked for invalid +combinations. +Other record attributes are checked for sensible values. +Btree records spanning an interval of the btree keyspace are checked for +correct order and lack of mergeability (except for file fork mappings). +For performance reasons, regular code may skip some of these checks unless +debugging is enabled or a write is about to occur. +Scrub functions, of course, must check all possible problems. + +Validation of Userspace-Controlled Record Attributes +```````````````````````````````````````````````````` + +Various pieces of filesystem metadata are directly controlled by userspace. +Because of this nature, validation work cannot be more precise than checking +that a value is within the possible range. +These fields include: + +- Superblock fields controlled by mount options +- Filesystem labels +- File timestamps +- File permissions +- File size +- File flags +- Names present in directory entries, extended attribute keys, and filesystem + labels +- Extended attribute key namespaces +- Extended attribute values +- File data block contents +- Quota limits +- Quota timer expiration (if resource usage exceeds the soft limit) + +Cross-Referencing Space Metadata +```````````````````````````````` + +After internal block checks, the next higher level of checking is +cross-referencing records between metadata structures. +For regular runtime code, the cost of these checks is considered to be +prohibitively expensive, but as scrub is dedicated to rooting out +inconsistencies, it must pursue all avenues of inquiry. +The exact set of cross-referencing is highly dependent on the context of the +data structure being checked. + +The XFS btree code has keyspace scanning functions that online fsck uses to +cross reference one structure with another. +Specifically, scrub can scan the key space of an index to determine if that +keyspace is fully, sparsely, or not at all mapped to records. 
+For the reverse mapping btree, it is possible to mask parts of the key for the
+purposes of performing a keyspace scan so that scrub can decide if the rmap
+btree contains records mapping a certain extent of physical space without the
+sparseness of the rest of the rmap keyspace getting in the way.
+
+Btree blocks undergo the following checks before cross-referencing:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the btree point to valid block addresses for the type
+  of btree?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each node block record, does the record key accurately reflect the
+  contents of the child block?
+
+Space allocation records are cross-referenced as follows:
+
+1. Any space mentioned by any metadata structure is cross-referenced as
+   follows:
+
+   - Does the reverse mapping index list only the appropriate owner as the
+     owner of each block?
+
+   - Are none of the blocks claimed as free space?
+
+   - If these aren't file data blocks, are none of the blocks claimed as space
+     shared by different owners?
+
+2. Btree blocks are cross-referenced as follows:
+
+   - Everything in class 1 above.
+
+   - If there's a parent node block, do the keys listed for this block match the
+     keyspace of this block?
+
+   - Do the sibling pointers point to valid blocks? Of the same level?
+
+   - Do the child pointers point to valid blocks? Of the next level down?
+
+3. Free space btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Does the reverse mapping index list no owners of this space?
+
+   - Is this space not claimed by the inode index for inodes?
+
+   - Is it not mentioned by the reference count index?
+
+   - Is there a matching record in the other free space btree?
+
+4. Inode btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is there a matching record in the free inode btree?
+
+   - Do cleared bits in the holemask correspond with inode clusters?
+
+   - Do set bits in the freemask correspond with inode records with zero link
+     count?
+
+5. Inode records are cross-referenced as follows:
+
+   - Everything in class 1.
+
+   - Do all the fields that summarize information about the file forks actually
+     match those forks?
+
+   - Does each inode with zero link count correspond to a record in the free
+     inode btree?
+
+6. File fork space mapping records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is this space not mentioned by the inode btrees?
+
+   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
+     reference count btree?
+
+7. Reference count records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Within the space subkeyspace of the rmap btree (that is to say, all
+     records mapped to a particular space extent and ignoring the owner info),
+     are there the same number of reverse mapping records for each block as the
+     reference count record claims?
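+
+To illustrate the keyspace scanning idea described above, the following sketch
+cross-references a single file data extent against the free space and reverse
+mapping information.
+The ``xchk_xref_scan_keyspace`` helper, the ``struct xchk_context`` type, and
+the result codes are hypothetical names used only for illustration; they are
+not the kernel's actual scrub interfaces.
+
+.. code-block:: c
+
+    /* Illustrative result of scanning a btree keyspace for an extent. */
+    enum xchk_keyspace_result {
+        XCHK_KEYSPACE_EMPTY,    /* no records overlap the extent */
+        XCHK_KEYSPACE_SPARSE,   /* part of the extent is mapped */
+        XCHK_KEYSPACE_FULL,     /* the whole extent is mapped */
+    };
+
+    /*
+     * Cross-reference one file data extent: it must not appear in the
+     * free space btrees, and the rmap btree must account for every block.
+     */
+    static void
+    xchk_xref_file_data_extent(
+        struct xchk_context     *sc,
+        xfs_agblock_t           agbno,
+        xfs_extlen_t            len)
+    {
+        enum xchk_keyspace_result res;
+
+        res = xchk_xref_scan_keyspace(sc->bnobt_cur, agbno, len);
+        if (res != XCHK_KEYSPACE_EMPTY)
+            xchk_mark_xcorrupt(sc);     /* allocated blocks marked free */
+
+        res = xchk_xref_scan_keyspace(sc->rmapbt_cur, agbno, len);
+        if (res != XCHK_KEYSPACE_FULL)
+            xchk_mark_xcorrupt(sc);     /* no owner claims these blocks */
+    }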
+
+Proposed patchsets are the series to find gaps in
+`refcount btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
+`inode btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
+`rmap btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
+to find
+`mergeable records
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
+and to
+`improve cross referencing with rmap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
+before starting a repair.
+
+Checking Extended Attributes
+````````````````````````````
+
+Extended attributes implement a key-value store that enables fragments of data
+to be attached to any file.
+Both the kernel and userspace can access the keys and values, subject to
+namespace and privilege restrictions.
+Most typically these fragments are metadata about the file -- origins, security
+contexts, user-supplied labels, indexing information, etc.
+
+Names can be as long as 255 bytes and can exist in several different
+namespaces.
+Values can be as large as 64KB.
+A file's extended attributes are stored in blocks mapped by the attr fork.
+The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
+Block 0 in the attribute fork is always the top of the structure, but otherwise
+each of the three types of blocks can be found at any offset in the attr fork.
+Leaf blocks contain attribute key records that point to the name and the value.
+Names are always stored elsewhere in the same leaf block.
+Values that are less than 3/4 the size of a filesystem block are also stored
+elsewhere in the same leaf block.
+Remote value blocks contain values that are too large to fit inside a leaf.
+If the leaf information exceeds a single filesystem block, a dabtree (also
+rooted at block 0) is created to map hashes of the attribute names to leaf
+blocks in the attr fork.
+
+Checking an extended attribute structure is not so straightforward due to the
+lack of separation between attr blocks and index blocks.
+Scrub must read each block mapped by the attr fork and ignore the non-leaf
+blocks:
+
+1. Walk the dabtree in the attr fork (if present) to ensure that there are no
+   irregularities in the blocks or dabtree mappings that do not point to
+   attr leaf blocks.
+
+2. Walk the blocks of the attr fork looking for leaf blocks.
+   For each entry inside a leaf:
+
+   a. Validate that the name does not contain invalid characters.
+
+   b. Read the attr value.
+      This performs a named lookup of the attr name to ensure the correctness
+      of the dabtree.
+      If the value is stored in a remote block, this also validates the
+      integrity of the remote value block.
+
+Checking and Cross-Referencing Directories
+``````````````````````````````````````````
+
+The filesystem directory tree is a directed acyclic graph structure, with files
+constituting the nodes, and directory entries (dirents) constituting the edges.
+Directories are a special type of file containing a set of mappings from a
+255-byte sequence (name) to an inumber.
+These are called directory entries, or dirents for short.
+Each directory file must have exactly one directory pointing to the file.
+A root directory points to itself.
+Directory entries point to files of any type.
+Each non-directory file may have multiple directories pointing to it.
+
+In XFS, directories are implemented as a file containing up to three 32GB
+partitions.
+The first partition contains directory entry data blocks.
+Each data block contains variable-sized records associating a user-provided
+name with an inumber and, optionally, a file type.
+If the directory entry data grows beyond one block, the second partition (which
+exists as post-EOF extents) is populated with a block containing free space
+information and an index that maps hashes of the dirent names to directory data
+blocks in the first partition.
+This makes directory name lookups very fast.
+If this second partition grows beyond one block, the third partition is
+populated with a linear array of free space information for faster
+expansions.
+If the free space has been separated and the second partition grows again
+beyond one block, then a dabtree is used to map hashes of dirent names to
+directory data blocks.
+
+Checking a directory is pretty straightforward:
+
+1. Walk the dabtree in the second partition (if present) to ensure that there
+   are no irregularities in the blocks or dabtree mappings that do not point to
+   dirent blocks.
+
+2. Walk the blocks of the first partition looking for directory entries.
+   Each dirent is checked as follows:
+
+   a. Does the name contain no invalid characters?
+
+   b. Does the inumber correspond to an actual, allocated inode?
+
+   c. Does the child inode have a nonzero link count?
+
+   d. If a file type is included in the dirent, does it match the type of the
+      inode?
+
+   e. If the child is a subdirectory, does the child's dotdot pointer point
+      back to the parent?
+
+   f. If the directory has a second partition, perform a named lookup of the
+      dirent name to ensure the correctness of the dabtree.
+
+3. Walk the free space list in the third partition (if present) to ensure that
+   the free spaces it describes are really unused.
+
+Checking operations involving :ref:`parents <dirparent>` and
+:ref:`file link counts <nlinks>` are discussed in more detail in later
+sections.
+
+Checking Directory/Attribute Btrees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As stated in previous sections, the directory/attribute btree (dabtree) index
+maps user-provided names to improve lookup times by avoiding linear scans.
+Internally, it maps a 32-bit hash of the name to a block offset within the
+appropriate file fork.
+
+The internal structure of a dabtree closely resembles the btrees that record
+fixed-size metadata records -- each dabtree block contains a magic number, a
+checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
+The format of leaf and node records is the same -- each entry points to the
+next level down in the hierarchy, with dabtree node records pointing to dabtree
+leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
+in the fork.
+
+Checking and cross-referencing the dabtree is very similar to what is done for
+space btrees:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the dabtree point to valid fork offsets for dabtree
+  blocks?
+
+- Do leaf pointers within the dabtree point to valid fork offsets for directory
+  or attr leaf blocks?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each dabtree node record, does the record key accurately reflect the
+  contents of the child dabtree block?
+
+- For each dabtree leaf record, does the record key accurately reflect the
+  contents of the directory or attr block?
+
+Cross-Referencing Summary Counters
+``````````````````````````````````
+
+XFS maintains three classes of summary counters: available resources, quota
+resource usage, and file link counts.
+
+In theory, the amount of available resources (data blocks, inodes, realtime
+extents) can be found by walking the entire filesystem.
+This would make for very slow reporting, so a transactional filesystem can
+maintain summaries of this information in the superblock.
+Cross-referencing these values against the filesystem metadata should be a
+simple matter of walking the free space and inode metadata in each AG and the
+realtime bitmap, but there are complications that will be discussed in
+:ref:`more detail <fscounters>` later.
+
+:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
+checking are sufficiently complicated to warrant separate sections.
+
+Post-Repair Reverification
+``````````````````````````
+
+After performing a repair, the checking code is run a second time to validate
+the new structure, and the results of the health assessment are recorded
+internally and returned to the calling process.
+This step is critical for enabling system administrators to monitor the status
+of the filesystem and the progress of any repairs.
+For developers, it is a useful means to judge the efficacy of error detection
+and correction in the online and offline checking tools.
+
+Eventual Consistency vs. Online Fsck
+------------------------------------
+
+Complex operations can make modifications to multiple per-AG data structures
+with a chain of transactions.
+These chains, once committed to the log, are restarted during log recovery if
+the system crashes while processing the chain.
+Because the AG header buffers are unlocked between transactions within a chain,
+online checking must coordinate with chained operations that are in progress to
+avoid incorrectly detecting inconsistencies due to pending chains.
+Furthermore, online repair must not run when operations are pending because
+the metadata are temporarily inconsistent with each other, and rebuilding is
+not possible.
+
+Only online fsck has this requirement of total consistency of AG metadata, and
+such checks should be relatively rare compared to filesystem change operations.
+Online fsck coordinates with transaction chains as follows:
+
+* For each AG, maintain a count of intent items targeting that AG.
+  The count should be bumped whenever a new item is added to the chain.
+  The count should be dropped when the filesystem has locked the AG header
+  buffers and finished the work.
+
+* When online fsck wants to examine an AG, it should lock the AG header
+  buffers to quiesce all transaction chains that want to modify that AG.
+  If the count is zero, proceed with the checking operation.
+  If it is nonzero, cycle the buffer locks to allow the chain to make forward
+  progress.
+
+This may lead to online fsck taking a long time to complete, but regular
+filesystem updates take precedence over background checking activity.
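+
+Expressed as a sketch, the scrub-side coordination loop looks something like
+the following.
+The ``xchk_ag_*`` helpers shown here are hypothetical stand-ins for the drain
+primitives described later in this chapter, not the actual kernel interfaces.
+
+.. code-block:: c
+
+    /*
+     * Sketch of the scrub-side coordination loop.  The xchk_ag_* helpers
+     * are hypothetical; the real drain primitives are described in the
+     * "Intent Drains" section below.
+     */
+    static int
+    xchk_ag_quiesce(
+        struct xchk_ag          *sa)
+    {
+        int                     error;
+
+        for (;;) {
+            /* Lock the AG headers so no new chain can start on this AG. */
+            error = xchk_ag_lock_headers(sa);
+            if (error)
+                return error;
+
+            /* No intent items target this AG?  Proceed with checking. */
+            if (xchk_ag_intent_count(sa) == 0)
+                return 0;
+
+            /* Cycle the locks so the pending chain can finish. */
+            xchk_ag_unlock_headers(sa);
+            error = xchk_ag_wait_for_intents(sa);
+            if (error)
+                return error;
+        }
+    }
+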
+Details about the discovery of this situation are presented in the
+:ref:`next section <chain_coordination>`, and details about the solution
+are presented :ref:`after that<intent_drains>`.
+
+.. _chain_coordination:
+
+Discovery of the Problem
+````````````````````````
+
+Midway through the development of online scrubbing, the fsstress tests
+uncovered a misinteraction between online fsck and compound transaction chains
+created by other writer threads that resulted in false reports of metadata
+inconsistency.
+The root cause of these reports is the eventual consistency model introduced by
+the expansion of deferred work items and compound transaction chains when
+reverse mapping and reflink were introduced.
+
+Originally, transaction chains were added to XFS to avoid deadlocks when
+unmapping space from files.
+Deadlock avoidance rules require that AGs only be locked in increasing order,
+which makes it impossible (say) to use a single transaction to free a space
+extent in AG 7 and then try to free a now superfluous block mapping btree block
+in AG 3.
+To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
+items to commit to freeing some space in one transaction while deferring the
+actual metadata updates to a fresh transaction.
+The transaction sequence looks like this:
+
+1. The first transaction contains a physical update to the file's block mapping
+   structures to remove the mapping from the btree blocks.
+   It then attaches to the in-memory transaction an action item to schedule
+   deferred freeing of space.
+   Concretely, each transaction maintains a list of ``struct
+   xfs_defer_pending`` objects, each of which maintains a list of ``struct
+   xfs_extent_free_item`` objects.
+   Returning to the example above, the action item tracks the freeing of both
+   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
+   AG 3.
+   Deferred frees recorded in this manner are committed in the log by creating
+   an EFI log item from the ``struct xfs_extent_free_item`` object and
+   attaching the log item to the transaction.
+   When the log is persisted to disk, the EFI item is written into the ondisk
+   transaction record.
+   EFIs can list up to 16 extents to free, all sorted in AG order.
+
+2. The second transaction contains a physical update to the free space btrees
+   of AG 3 to release the former BMBT block and a second physical update to the
+   free space btrees of AG 7 to release the unmapped file space.
+   Observe that the physical updates are resequenced in the correct order
+   when possible.
+   Attached to the transaction is an extent free done (EFD) log item.
+   The EFD contains a pointer to the EFI logged in transaction #1 so that log
+   recovery can tell if the EFI needs to be replayed.
+
+If the system goes down after transaction #1 is written back to the filesystem
+but before #2 is committed, a scan of the filesystem metadata would show
+inconsistent filesystem metadata because there would not appear to be any owner
+of the unmapped space.
+Happily, log recovery corrects this inconsistency for us -- when recovery finds
+an intent log item but does not find a corresponding intent done item, it will
+reconstruct the incore state of the intent item and finish it.
+In the example above, the log must replay both frees described in the recovered
+EFI to complete the recovery phase.
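+
+A condensed sketch of the two-transaction chain described above follows.
+The ``example_*`` helpers are illustrative stand-ins for the real unmap and
+deferred-free interfaces, not the actual kernel functions.
+
+.. code-block:: c
+
+    /*
+     * Sketch of the two-transaction chain described above.  The helper
+     * names are illustrative; only the EFI/EFD behavior in the comments
+     * comes from the design.
+     */
+    static int
+    example_unmap_and_free(
+        struct xfs_trans        *tp,
+        struct xfs_inode        *ip,
+        xfs_fileoff_t           off,
+        xfs_filblks_t           len)
+    {
+        /*
+         * Transaction 1: update the block mapping btree, then queue
+         * deferred frees for the unmapped blocks (AG 7) and the freed
+         * BMBT block (AG 3).  Committing writes an EFI covering both.
+         */
+        example_unmap_extent(tp, ip, off, len);
+        example_defer_free_unmapped_blocks(tp);
+        example_defer_free_bmbt_block(tp);
+
+        /*
+         * Finishing the deferred work rolls to transaction 2, which
+         * updates the AG 3 and AG 7 free space btrees and logs an EFD
+         * pointing back at the EFI from transaction 1.
+         */
+        return example_trans_commit(tp);
+    }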
+ +There are subtleties to XFS' transaction chaining strategy to consider: + +* Log items must be added to a transaction in the correct order to prevent + conflicts with principal objects that are not held by the transaction. + In other words, all per-AG metadata updates for an unmapped block must be + completed before the last update to free the extent, and extents should not + be reallocated until that last update commits to the log. + +* AG header buffers are released between each transaction in a chain. + This means that other threads can observe an AG in an intermediate state, + but as long as the first subtlety is handled, this should not affect the + correctness of filesystem operations. + +* Unmounting the filesystem flushes all pending work to disk, which means that + offline fsck never sees the temporary inconsistencies caused by deferred + work item processing. + +In this manner, XFS employs a form of eventual consistency to avoid deadlocks +and increase parallelism. + +During the design phase of the reverse mapping and reflink features, it was +decided that it was impractical to cram all the reverse mapping updates for a +single filesystem change into a single transaction because a single file +mapping operation can explode into many small updates: + +* The block mapping update itself +* A reverse mapping update for the block mapping update +* Fixing the freelist +* A reverse mapping update for the freelist fix + +* A shape change to the block mapping btree +* A reverse mapping update for the btree update +* Fixing the freelist (again) +* A reverse mapping update for the freelist fix + +* An update to the reference counting information +* A reverse mapping update for the refcount update +* Fixing the freelist (a third time) +* A reverse mapping update for the freelist fix + +* Freeing any space that was unmapped and not owned by any other file +* Fixing the freelist (a fourth time) +* A reverse mapping update for the freelist fix + +* Freeing the space used by the block mapping btree +* Fixing the freelist (a fifth time) +* A reverse mapping update for the freelist fix + +Free list fixups are not usually needed more than once per AG per transaction +chain, but it is theoretically possible if space is very tight. +For copy-on-write updates this is even worse, because this must be done once to +remove the space from a staging area and again to map it into the file! + +To deal with this explosion in a calm manner, XFS expands its use of deferred +work items to cover most reverse mapping updates and all refcount updates. +This reduces the worst case size of transaction reservations by breaking the +work into a long chain of small updates, which increases the degree of eventual +consistency in the system. +Again, this generally isn't a problem because XFS orders its deferred work +items carefully to avoid resource reuse conflicts between unsuspecting threads. + +However, online fsck changes the rules -- remember that although physical +updates to per-AG structures are coordinated by locking the buffers for AG +headers, buffer locks are dropped between transactions. +Once scrub acquires resources and takes locks for a data structure, it must do +all the validation work without releasing the lock. +If the main lock for a space btree is an AG header buffer lock, scrub may have +interrupted another thread that is midway through finishing a chain. 
+For example, if a thread performing a copy-on-write has completed a reverse +mapping update but not the corresponding refcount update, the two AG btrees +will appear inconsistent to scrub and an observation of corruption will be +recorded. This observation will not be correct. +If a repair is attempted in this state, the results will be catastrophic! + +Several other solutions to this problem were evaluated upon discovery of this +flaw and rejected: + +1. Add a higher level lock to allocation groups and require writer threads to + acquire the higher level lock in AG order before making any changes. + This would be very difficult to implement in practice because it is + difficult to determine which locks need to be obtained, and in what order, + without simulating the entire operation. + Performing a dry run of a file operation to discover necessary locks would + make the filesystem very slow. + +2. Make the deferred work coordinator code aware of consecutive intent items + targeting the same AG and have it hold the AG header buffers locked across + the transaction roll between updates. + This would introduce a lot of complexity into the coordinator since it is + only loosely coupled with the actual deferred work items. + It would also fail to solve the problem because deferred work items can + generate new deferred subtasks, but all subtasks must be complete before + work can start on a new sibling task. + +3. Teach online fsck to walk all transactions waiting for whichever lock(s) + protect the data structure being scrubbed to look for pending operations. + The checking and repair operations must factor these pending operations into + the evaluations being performed. + This solution is a nonstarter because it is *extremely* invasive to the main + filesystem. + +.. _intent_drains: + +Intent Drains +````````````` + +Online fsck uses an atomic intent item counter and lock cycling to coordinate +with transaction chains. +There are two key properties to the drain mechanism. +First, the counter is incremented when a deferred work item is *queued* to a +transaction, and it is decremented after the associated intent done log item is +*committed* to another transaction. +The second property is that deferred work can be added to a transaction without +holding an AG header lock, but per-AG work items cannot be marked done without +locking that AG header buffer to log the physical updates and the intent done +log item. +The first property enables scrub to yield to running transaction chains, which +is an explicit deprioritization of online fsck to benefit file operations. +The second property of the drain is key to the correct coordination of scrub, +since scrub will always be able to decide if a conflict is possible. + +For regular filesystem code, the drain works as follows: + +1. Call the appropriate subsystem function to add a deferred work item to a + transaction. + +2. The function calls ``xfs_defer_drain_bump`` to increase the counter. + +3. When the deferred item manager wants to finish the deferred work item, it + calls ``->finish_item`` to complete it. + +4. The ``->finish_item`` implementation logs some changes and calls + ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads + waiting on the drain. + +5. The subtransaction commits, which unlocks the resource associated with the + intent item. + +For scrub, the drain works as follows: + +1. Lock the resource(s) associated with the metadata being scrubbed. 
+ For example, a scan of the refcount btree would lock the AGI and AGF header + buffers. + +2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no + chains in progress and the operation may proceed. + +3. Otherwise, release the resources grabbed in step 1. + +4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go + back to step 1 unless a signal has been caught. + +To avoid polling in step 4, the drain provides a waitqueue for scrub threads to +be woken up whenever the intent count drops to zero. + +The proposed patchset is the +`scrub intent drain series +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_. + +.. _jump_labels: + +Static Keys (aka Jump Label Patching) +````````````````````````````````````` + +Online fsck for XFS separates the regular filesystem from the checking and +repair code as much as possible. +However, there are a few parts of online fsck (such as the intent drains, and +later, live update hooks) where it is useful for the online fsck code to know +what's going on in the rest of the filesystem. +Since it is not expected that online fsck will be constantly running in the +background, it is very important to minimize the runtime overhead imposed by +these hooks when online fsck is compiled into the kernel but not actively +running on behalf of userspace. +Taking locks in the hot path of a writer thread to access a data structure only +to find that no further action is necessary is expensive -- on the author's +computer, this have an overhead of 40-50ns per access. +Fortunately, the kernel supports dynamic code patching, which enables XFS to +replace a static branch to hook code with ``nop`` sleds when online fsck isn't +running. +This sled has an overhead of however long it takes the instruction decoder to +skip past the sled, which seems to be on the order of less than 1ns and +does not access memory outside of instruction fetching. + +When online fsck enables the static key, the sled is replaced with an +unconditional branch to call the hook code. +The switchover is quite expensive (~22000ns) but is paid entirely by the +program that invoked online fsck, and can be amortized if multiple threads +enter online fsck at the same time, or if multiple filesystems are being +checked at the same time. +Changing the branch direction requires taking the CPU hotplug lock, and since +CPU initialization requires memory allocation, online fsck must be careful not +to change a static key while holding any locks or resources that could be +accessed in the memory reclaim paths. +To minimize contention on the CPU hotplug lock, care should be taken not to +enable or disable static keys unnecessarily. + +Because static keys are intended to minimize hook overhead for regular +filesystem operations when xfs_scrub is not running, the intended usage +patterns are as follows: + +- The hooked part of XFS should declare a static-scoped static key that + defaults to false. + The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this. + The static key itself should be declared as a ``static`` variable. + +- When deciding to invoke code that's only used by scrub, the regular + filesystem should call the ``static_branch_unlikely`` predicate to avoid the + scrub-only hook code if the static key is not enabled. + +- The regular filesystem should export helper functions that call + ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the + static key. 
+ Wrapper functions make it easy to compile out the relevant code if the kernel + distributor turns off online fsck at build time. + +- Scrub functions wanting to turn on scrub-only XFS functionality should call + the ``xchk_fsgates_enable`` from the setup function to enable a specific + hook. + This must be done before obtaining any resources that are used by memory + reclaim. + Callers had better be sure they really need the functionality gated by the + static key; the ``TRY_HARDER`` flag is useful here. + +Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to +handle locking AGI and AGF buffers for all scrubber functions. +If it detects a conflict between scrub and the running transactions, it will +try to wait for intents to complete. +If the caller of the helper has not enabled the static key, the helper will +return -EDEADLOCK, which should result in the scrub being restarted with the +``TRY_HARDER`` flag set. +The scrub setup function should detect that flag, enable the static key, and +try the scrub again. +Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. + +For more information, please see the kernel documentation of +Documentation/staging/static-keys.rst. + +.. _xfile: + +Pageable Kernel Memory +---------------------- + +Some online checking functions work by scanning the filesystem to build a +shadow copy of an ondisk metadata structure in memory and comparing the two +copies. +For online repair to rebuild a metadata structure, it must compute the record +set that will be stored in the new structure before it can persist that new +structure to disk. +Ideally, repairs complete with a single atomic commit that introduces +a new data structure. +To meet these goals, the kernel needs to collect a large amount of information +in a place that doesn't require the correct operation of the filesystem. + +Kernel memory isn't suitable because: + +* Allocating a contiguous region of memory to create a C array is very + difficult, especially on 32-bit systems. + +* Linked lists of records introduce double pointer overhead which is very high + and eliminate the possibility of indexed lookups. + +* Kernel memory is pinned, which can drive the system into OOM conditions. + +* The system might not have sufficient memory to stage all the information. + +At any given time, online fsck does not need to keep the entire record set in +memory, which means that individual records can be paged out if necessary. +Continued development of online fsck demonstrated that the ability to perform +indexed data storage would also be very useful. +Fortunately, the Linux kernel already has a facility for byte-addressable and +pageable storage: tmpfs. +In-kernel graphics drivers (most notably i915) take advantage of tmpfs files +to store intermediate data that doesn't need to be in memory at all times, so +that usage precedent is already established. +Hence, the ``xfile`` was born! + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| The first edition of online repair inserted records into a new btree as | +| it found them, which failed because filesystem could shut down with a | +| built data structure, which would be live after recovery finished. | +| | +| The second edition solved the half-rebuilt structure problem by storing | +| everything in memory, but frequently ran the system out of memory. 
| +| | +| The third edition solved the OOM problem by using linked lists, but the | +| memory overhead of the list pointers was extreme. | ++--------------------------------------------------------------------------+ + +xfile Access Models +``````````````````` + +A survey of the intended uses of xfiles suggested these use cases: + +1. Arrays of fixed-sized records (space management btrees, directory and + extended attribute entries) + +2. Sparse arrays of fixed-sized records (quotas and link counts) + +3. Large binary objects (BLOBs) of variable sizes (directory and extended + attribute names and values) + +4. Staging btrees in memory (reverse mapping btrees) + +5. Arbitrary contents (realtime space management) + +To support the first four use cases, high level data structures wrap the xfile +to share functionality between online fsck functions. +The rest of this section discusses the interfaces that the xfile presents to +four of those five higher level data structures. +The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case +study. + +The most general storage interface supported by the xfile enables the reading +and writing of arbitrary quantities of data at arbitrary offsets in the xfile. +This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions, +which behave similarly to their userspace counterparts. +XFS is very record-based, which suggests that the ability to load and store +complete records is important. +To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store`` +functions are provided to read and persist objects into an xfile. +They are internally the same as pread and pwrite, except that they treat any +error as an out of memory error. +For online repair, squashing error conditions in this manner is an acceptable +behavior because the only reaction is to abort the operation back to userspace. +All five xfile usecases can be serviced by these four functions. + +However, no discussion of file access idioms is complete without answering the +question, "But what about mmap?" +It is convenient to access storage directly with pointers, just like userspace +code does with regular memory. +Online fsck must not drive the system into OOM conditions, which means that +xfiles must be responsive to memory reclamation. +tmpfs can only push a pagecache folio to the swap cache if the folio is neither +pinned nor locked, which means the xfile must not pin too many folios. + +Short term direct access to xfile contents is done by locking the pagecache +folio and mapping it into kernel address space. +Programmatic access (e.g. pread and pwrite) uses this mechanism. +Folio locks are not supposed to be held for long periods of time, so long +term direct access to xfile contents is done by bumping the folio refcount, +mapping it into kernel address space, and dropping the folio lock. +These long term users *must* be responsive to memory reclaim by hooking into +the shrinker infrastructure to know when to release folios. + +The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to +retrieve the (locked) folio that backs part of an xfile and to release it. +The only code to use these folio lease functions are the xfarray +:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory +btrees<xfbtree>`. + +xfile Access Coordination +````````````````````````` + +For security reasons, xfiles must be owned privately by the kernel. 
+They are marked ``S_PRIVATE`` to prevent interference from the security system, +must never be mapped into process file descriptor tables, and their pages must +never be mapped into userspace processes. + +To avoid locking recursion issues with the VFS, all accesses to the shmfs file +are performed by manipulating the page cache directly. +xfile writers call the ``->write_begin`` and ``->write_end`` functions of the +xfile's address space to grab writable pages, copy the caller's buffer into the +page, and release the pages. +xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly +before copying the contents into the caller's buffer. +In other words, xfiles ignore the VFS read and write code paths to avoid +having to create a dummy ``struct kiocb`` and to avoid taking inode and +freeze locks. +tmpfs cannot be frozen, and xfiles must not be exposed to userspace. + +If an xfile is shared between threads to stage repairs, the caller must provide +its own locks to coordinate access. +For example, if a scrub function stores scan results in an xfile and needs +other threads to provide updates to the scanned data, the scrub function must +provide a lock for all threads to share. + +.. _xfarray: + +Arrays of Fixed-Sized Records +````````````````````````````` + +In XFS, each type of indexed space metadata (free space, inodes, reference +counts, file fork space, and reverse mappings) consists of a set of fixed-size +records indexed with a classic B+ tree. +Directories have a set of fixed-size dirent records that point to the names, +and extended attributes have a set of fixed-size attribute keys that point to +names and values. +Quota counters and file link counters index records with numbers. +During a repair, scrub needs to stage new records during the gathering step and +retrieve them during the btree building step. + +Although this requirement can be satisfied by calling the read and write +methods of the xfile directly, it is simpler for callers for there to be a +higher level abstraction to take care of computing array offsets, to provide +iterator functions, and to deal with sparse records and sorting. +The ``xfarray`` abstraction presents a linear array for fixed-size records atop +the byte-accessible xfile. + +.. _xfarray_access_patterns: + +Array Access Patterns +^^^^^^^^^^^^^^^^^^^^^ + +Array access patterns in online fsck tend to fall into three categories. +Iteration of records is assumed to be necessary for all cases and will be +covered in the next section. + +The first type of caller handles records that are indexed by position. +Gaps may exist between records, and a record may be updated multiple times +during the collection step. +In other words, these callers want a sparse linearly addressed table file. +The typical use case are quota records or file link count records. +Access to array elements is performed programmatically via ``xfarray_load`` and +``xfarray_store`` functions, which wrap the similarly-named xfile functions to +provide loading and storing of array elements at arbitrary array indices. +Gaps are defined to be null records, and null records are defined to be a +sequence of all zero bytes. +Null records are detected by calling ``xfarray_element_is_null``. +They are created either by calling ``xfarray_unset`` to null out an existing +record or by never storing anything to an array index. + +The second type of caller handles records that are not indexed by position +and do not require multiple updates to a record. 
+The typical use case here is rebuilding space btrees and key/value btrees. +These callers can add records to the array without caring about array indices +via the ``xfarray_append`` function, which stores a record at the end of the +array. +For callers that require records to be presentable in a specific order (e.g. +rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted +records; this function will be covered later. + +The third type of caller is a bag, which is useful for counting records. +The typical use case here is constructing space extent reference counts from +reverse mapping information. +Records can be put in the bag in any order, they can be removed from the bag +at any time, and uniqueness of records is left to callers. +The ``xfarray_store_anywhere`` function is used to insert a record in any +null record slot in the bag; and the ``xfarray_unset`` function removes a +record from the bag. + +The proposed patchset is the +`big in-memory array +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_. + +Iterating Array Elements +^^^^^^^^^^^^^^^^^^^^^^^^ + +Most users of the xfarray require the ability to iterate the records stored in +the array. +Callers can probe every possible array index with the following: + +.. code-block:: c + + xfarray_idx_t i; + foreach_xfarray_idx(array, i) { + xfarray_load(array, i, &rec); + + /* do something with rec */ + } + +All users of this idiom must be prepared to handle null records or must already +know that there aren't any. + +For xfarray users that want to iterate a sparse array, the ``xfarray_iter`` +function ignores indices in the xfarray that have never been written to by +calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas +of the array that are not populated with memory pages. +Once it finds a page, it will skip the zeroed areas of the page. + +.. code-block:: c + + xfarray_idx_t i = XFARRAY_CURSOR_INIT; + while ((ret = xfarray_iter(array, &i, &rec)) == 1) { + /* do something with rec */ + } + +.. _xfarray_sort: + +Sorting Array Elements +^^^^^^^^^^^^^^^^^^^^^^ + +During the fourth demonstration of online repair, a community reviewer remarked +that for performance reasons, online repair ought to load batches of records +into btree record blocks instead of inserting records into a new btree one at a +time. +The btree insertion code in XFS is responsible for maintaining correct ordering +of the records, so naturally the xfarray must also support sorting the record +set prior to bulk loading. + +Case Study: Sorting xfarrays +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The sorting algorithm used in the xfarray is actually a combination of adaptive +quicksort and a heapsort subalgorithm in the spirit of +`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and +`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux +kernel. +To sort records in a reasonably short amount of time, ``xfarray`` takes +advantage of the binary subpartitioning offered by quicksort, but it also uses +heapsort to hedge aginst performance collapse if the chosen quicksort pivots +are poor. +Both algorithms are (in general) O(n * lg(n)), but there is a wide performance +gulf between the two implementations. + +The Linux kernel already contains a reasonably fast implementation of heapsort. +It only operates on regular C arrays, which limits the scope of its usefulness. 
+There are two key places where the xfarray uses it:
+
+* Sorting any record subset backed by a single xfile page.
+
+* Loading a small number of xfarray records from potentially disparate parts
+  of the xfarray into a memory buffer, and sorting the buffer.
+
+In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
+quicksort, thereby mitigating quicksort's worst runtime behavior.
+
+Choosing a quicksort pivot is a tricky business.
+A good pivot splits the set to sort in half, leading to the divide and conquer
+behavior that is crucial to O(n * lg(n)) performance.
+A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
+runtime.
+The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
+records into a memory buffer and using the kernel heapsort to identify the
+median of the nine.
+
+Most modern quicksort implementations employ Tukey's "ninther" to select a
+pivot from a classic C array.
+Typical ninther implementations pick three unique triads of records, sort each
+of the triads, and then sort the middle value of each triad to determine the
+ninther value.
+As stated previously, however, xfile accesses are not entirely cheap.
+It turned out to be much more performant to read the nine elements into a
+memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
+the 4th element of that buffer as the pivot.
+Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
+low-effort robust (resistant) location in large samples`, in *Contributions to
+Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
+1978), pp. 251–257.
+
+The partitioning of quicksort is fairly textbook -- rearrange the record
+subset around the pivot, then set up the current and next stack frames to
+sort with the larger and the smaller halves of the pivot, respectively.
+This keeps the stack space requirements to log2(record count).
+
+As a final performance optimization, the hi and lo scanning phase of quicksort
+keeps examined xfile pages mapped in the kernel for as long as possible to
+reduce map/unmap cycles.
+Surprisingly, this reduces overall sort runtime by nearly half again after
+accounting for the application of heapsort directly onto xfile pages.
+
+.. _xfblob:
+
+Blob Storage
+````````````
+
+Extended attributes and directories add an additional requirement for staging
+records: arbitrary byte sequences of finite length.
+Each directory entry record needs to store the entry name,
+and each extended attribute needs to store both the attribute name and value.
+The names, keys, and values can consume a large amount of memory, so the
+``xfblob`` abstraction was created to simplify management of these blobs
+atop an xfile.
+
+Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
+and persist objects.
+The store function returns a magic cookie for every object that it persists.
+Later, callers provide this cookie to ``xfblob_load`` to recall the object.
+The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
+function frees them all because compaction is not needed.
+
+The details of repairing directories and extended attributes will be discussed
+in a subsequent section about atomic extent swapping.
+However, it should be noted that these repair functions only use blob storage
+to cache a small number of entries before adding them to a temporary ondisk
+file, which is why compaction is not required.
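+
+As a rough illustration, staging a directory entry name in an xfblob might
+look like the following sketch.
+The cookie type, the record layout, and the exact argument lists are
+assumptions made for illustration; only the ``xfblob_store`` and
+``xfblob_load`` names come from the design above.
+
+.. code-block:: c
+
+    /* Hypothetical staged dirent record; layout is illustrative only. */
+    struct example_dirent_rec {
+        xfblob_cookie           name_cookie;    /* where the name is stashed */
+        xfs_ino_t               ino;
+        uint8_t                 namelen;
+        uint8_t                 ftype;
+    };
+
+    static int
+    example_stash_dirent(
+        struct xfblob           *blobs,
+        struct example_dirent_rec *rec,
+        const unsigned char     *name,
+        unsigned int            namelen)
+    {
+        /* Persist the name and remember the cookie in the staged record. */
+        rec->namelen = namelen;
+        return xfblob_store(blobs, &rec->name_cookie, name, namelen);
+    }
+
+    static int
+    example_recall_dirent_name(
+        struct xfblob           *blobs,
+        struct example_dirent_rec *rec,
+        unsigned char           *namebuf)
+    {
+        /* Much later, recall the name to add it to the temporary file. */
+        return xfblob_load(blobs, rec->name_cookie, namebuf, rec->namelen);
+    }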
+ +The proposed patchset is at the start of the +`extended attribute repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series. + +.. _xfbtree: + +In-Memory B+Trees +````````````````` + +The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that +checking and repairing of secondary metadata commonly requires coordination +between a live metadata scan of the filesystem and writer threads that are +updating that metadata. +Keeping the scan data up to date requires requires the ability to propagate +metadata updates from the filesystem into the data being collected by the scan. +This *can* be done by appending concurrent updates into a separate log file and +applying them before writing the new metadata to disk, but this leads to +unbounded memory consumption if the rest of the system is very busy. +Another option is to skip the side-log and commit live updates from the +filesystem directly into the scan data, which trades more overhead for a lower +maximum memory requirement. +In both cases, the data structure holding the scan results must support indexed +access to perform well. + +Given that indexed lookups of scan data is required for both strategies, online +fsck employs the second strategy of committing live updates directly into +scan data. +Because xfarrays are not indexed and do not enforce record ordering, they +are not suitable for this task. +Conveniently, however, XFS has a library to create and maintain ordered reverse +mapping records: the existing rmap btree code! +If only there was a means to create one in memory. + +Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a +regular file, which means that the kernel can create byte or block addressable +virtual address spaces at will. +The XFS buffer cache specializes in abstracting IO to block-oriented address +spaces, which means that adaptation of the buffer cache to interface with +xfiles enables reuse of the entire btree library. +Btrees built atop an xfile are collectively known as ``xfbtrees``. +The next few sections describe how they actually work. + +The proposed patchset is the +`in-memory btree +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_ +series. + +Using xfiles as a Buffer Cache Target +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Two modifications are necessary to support xfiles as a buffer cache target. +The first is to make it possible for the ``struct xfs_buftarg`` structure to +host the ``struct xfs_buf`` rhashtable, because normally those are held by a +per-AG structure. +The second change is to modify the buffer ``ioapply`` function to "read" cached +pages from the xfile and "write" cached pages back to the xfile. +Multiple access to individual buffers is controlled by the ``xfs_buf`` lock, +since the xfile does not provide any locking on its own. +With this adaptation in place, users of the xfile-backed buffer cache use +exactly the same APIs as users of the disk-backed buffer cache. +The separation between xfile and buffer cache implies higher memory usage since +they do not share pages, but this property could some day enable transactional +updates to an in-memory btree. +Today, however, it simply eliminates the need for new code. + +Space Management with an xfbtree +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Space management for an xfile is very simple -- each btree block is one memory +page in size. 
+These blocks use the same header format as an on-disk btree, but the in-memory +block verifiers ignore the checksums, assuming that xfile memory is no more +corruption-prone than regular DRAM. +Reusing existing code here is more important than absolute memory efficiency. + +The very first block of an xfile backing an xfbtree contains a header block. +The header describes the owner, height, and the block number of the root +xfbtree block. + +To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file. +If there are no gaps, create one by extending the length of the xfile. +Preallocate space for the block with ``xfile_prealloc``, and hand back the +location. +To free an xfbtree block, use ``xfile_discard`` (which internally uses +``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile. + +Populating an xfbtree +^^^^^^^^^^^^^^^^^^^^^ + +An online fsck function that wants to create an xfbtree should proceed as +follows: + +1. Call ``xfile_create`` to create an xfile. + +2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure + pointing to the xfile. + +3. Pass the buffer cache target, buffer ops, and other information to + ``xfbtree_create`` to write an initial tree header and root block to the + xfile. + Each btree type should define a wrapper that passes necessary arguments to + the creation function. + For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of + all the necessary details for callers. + A ``struct xfbtree`` object will be returned. + +4. Pass the xfbtree object to the btree cursor creation function for the + btree type. + Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this + for callers. + +5. Pass the btree cursor to the regular btree functions to make queries against + and to update the in-memory btree. + For example, a btree cursor for an rmap xfbtree can be passed to the + ``xfs_rmap_*`` functions just like any other btree cursor. + See the :ref:`next section<xfbtree_commit>` for information on dealing with + xfbtree updates that are logged to a transaction. + +6. When finished, delete the btree cursor, destroy the xfbtree object, free the + buffer target, and the destroy the xfile to release all resources. + +.. _xfbtree_commit: + +Committing Logged xfbtree Buffers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Although it is a clever hack to reuse the rmap btree code to handle the staging +structure, the ephemeral nature of the in-memory btree block storage presents +some challenges of its own. +The XFS transaction manager must not commit buffer log items for buffers backed +by an xfile because the log format does not understand updates for devices +other than the data device. +An ephemeral xfbtree probably will not exist by the time the AIL checkpoints +log transactions back into the filesystem, and certainly won't exist during +log recovery. +For these reasons, any code updating an xfbtree in transaction context must +remove the buffer log items from the transaction and write the updates into the +backing xfile before committing or cancelling the transaction. + +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement +this functionality as follows: + +1. Find each buffer log item whose buffer targets the xfile. + +2. Record the dirty/ordered status of the log item. + +3. Detach the log item from the buffer. + +4. Queue the buffer to a special delwri list. + +5. Clear the transaction dirty flag if the only dirty log items were the ones + that were detached in step 3. 
+ +6. Submit the delwri list to commit the changes to the xfile, if the updates + are being committed. + +After removing xfile logged buffers from the transaction in this manner, the +transaction can be committed or cancelled. + +Bulk Loading of Ondisk B+Trees +------------------------------ + +As mentioned previously, early iterations of online repair built new btree +structures by creating a new btree and adding observations individually. +Loading a btree one record at a time had a slight advantage of not requiring +the incore records to be sorted prior to commit, but was very slow and leaked +blocks if the system went down during a repair. +Loading records one at a time also meant that repair could not control the +loading factor of the blocks in the new btree. + +Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for +rebuilding a btree index from a collection of records -- bulk btree loading. +This was implemented rather inefficiently code-wise, since ``xfs_repair`` +had separate copy-pasted implementations for each btree type. + +To prepare for online fsck, each of the four bulk loaders were studied, notes +were taken, and the four were refactored into a single generic btree bulk +loading mechanism. +Those notes in turn have been refreshed and are presented below. + +Geometry Computation +```````````````````` + +The zeroth step of bulk loading is to assemble the entire record set that will +be stored in the new btree, and sort the records. +Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the +btree from the record set, the type of btree, and any load factor preferences. +This information is required for resource reservation. + +First, the geometry computation computes the minimum and maximum records that +will fit in a leaf block from the size of a btree block and the size of the +block header. +Roughly speaking, the maximum number of records is:: + + maxrecs = (block_size - header_size) / record_size + +The XFS design specifies that btree blocks should be merged when possible, +which means the minimum number of records is half of maxrecs:: + + minrecs = maxrecs / 2 + +The next variable to determine is the desired loading factor. +This must be at least minrecs and no more than maxrecs. +Choosing minrecs is undesirable because it wastes half the block. +Choosing maxrecs is also undesirable because adding a single record to each +newly rebuilt leaf block will cause a tree split, which causes a noticeable +drop in performance immediately afterwards. +The default loading factor was chosen to be 75% of maxrecs, which provides a +reasonably compact structure without any immediate split penalties:: + + default_load_factor = (maxrecs + minrecs) / 2 + +If space is tight, the loading factor will be set to maxrecs to try to avoid +running out of space:: + + leaf_load_factor = enough space ? default_load_factor : maxrecs + +Load factor is computed for btree node blocks using the combined size of the +btree key and pointer as the record size:: + + maxrecs = (block_size - header_size) / (key_size + ptr_size) + minrecs = maxrecs / 2 + node_load_factor = enough space ? default_load_factor : maxrecs + +Once that's done, the number of leaf blocks required to store the record set +can be computed as:: + + leaf_blocks = ceil(record_count / leaf_load_factor) + +The number of node blocks needed to point to the next level down in the tree +is computed as:: + + n_blocks = (n == 0 ? 
leaf_blocks : node_blocks[n])
+    node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
+
+The entire computation is performed recursively until the current level only
+needs one block.
+The resulting geometry is as follows:
+
+- For AG-rooted btrees, this level is the root level, so the height of the new
+  tree is ``level + 1`` and the space needed is the summation of the number of
+  blocks on each level.
+
+- For inode-rooted btrees where the records in the top level do not fit in the
+  inode fork area, the height is ``level + 2``, the space needed is the
+  summation of the number of blocks on each level, and the inode fork points to
+  the root block.
+
+- For inode-rooted btrees where the records in the top level can be stored in
+  the inode fork area, then the root block can be stored in the inode, the
+  height is ``level + 1``, and the space needed is one less than the summation
+  of the number of blocks on each level.
+  This only becomes relevant when non-bmap btrees gain the ability to root in
+  an inode, which is a future patchset and only included here for completeness.
+
+.. _newbt:
+
+Reserving New B+Tree Blocks
+```````````````````````````
+
+Once repair knows the number of blocks needed for the new btree, it allocates
+those blocks using the free space information.
+Each reserved extent is tracked separately by the btree builder state data.
+To improve crash resilience, the reservation code also logs an Extent Freeing
+Intent (EFI) item in the same transaction as each space allocation and attaches
+its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
+If the system goes down, log recovery will use the unfinished EFIs to free the
+unused space, leaving the filesystem unchanged.
+
+Each time the btree builder claims a block for the btree from a reserved
+extent, it updates the in-memory reservation to reflect the claimed space.
+Block reservation tries to allocate as much contiguous space as possible to
+reduce the number of EFIs in play.
+
+While repair is writing these new btree blocks, the EFIs created for the space
+reservations pin the tail of the ondisk log.
+It's possible that other parts of the system will remain busy and push the head
+of the log towards the pinned tail.
+To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
+for too long.
+To alleviate this problem, the dynamic relogging capability of the deferred ops
+mechanism is reused here to commit a transaction at the log head containing an
+EFD for the old EFI and new EFI at the head.
+This enables the log to release the old EFI to keep the log moving forwards.
+
+EFIs have a role to play during the commit and reaping phases; please see the
+next section and the section about :ref:`reaping<reaping>` for more details.
+
+Proposed patchsets are the
+`bitmap rework
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
+and the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
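+
+Before moving on to writing the new tree, it may help to see the geometry
+computation above worked through in isolation.
+The following is a minimal user space sketch of that arithmetic only; the
+block, header, key, pointer, and record sizes are invented example values,
+and this is not the kernel's ``xfs_btree_bload_compute_geometry``::
+
+    /* Standalone recap of the geometry arithmetic; example sizes only. */
+    #include <stdio.h>
+
+    static unsigned long long ceil_div(unsigned long long a,
+                                       unsigned long long b)
+    {
+            return (a + b - 1) / b;
+    }
+
+    /* Default load factor: halfway between minrecs and maxrecs (75%). */
+    static unsigned long long load_factor(unsigned long long maxrecs)
+    {
+            return (maxrecs + maxrecs / 2) / 2;
+    }
+
+    int main(void)
+    {
+            unsigned long long block_size = 4096, header_size = 56;
+            unsigned long long record_size = 16, key_size = 8, ptr_size = 4;
+            unsigned long long record_count = 1000000;
+            unsigned long long leaf_lf, node_lf, nblocks, total, level;
+
+            leaf_lf = load_factor((block_size - header_size) / record_size);
+            node_lf = load_factor((block_size - header_size) /
+                                  (key_size + ptr_size));
+
+            /* Leaf level first, then node levels until one block remains. */
+            nblocks = ceil_div(record_count, leaf_lf);
+            total = nblocks;
+            level = 0;
+            while (nblocks > 1) {
+                    nblocks = ceil_div(nblocks, node_lf);
+                    total += nblocks;
+                    level++;
+            }
+            printf("height %llu, %llu blocks total\n", level + 1, total);
+            return 0;
+    }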
+
+
+Writing the New Tree
+````````````````````
+
+This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
+a block from the reserved list, writes the new btree block header, fills the
+rest of the block with records, and adds the new leaf block to a list of
+written blocks::
+
+    ┌────┐
+    │leaf│
+    │RRR │
+    └────┘
+
+Sibling pointers are set every time a new block is added to the level::
+
+    ┌────┐ ┌────┐ ┌────┐ ┌────┐
+    │leaf│→│leaf│→│leaf│→│leaf│
+    │RRR │←│RRR │←│RRR │←│RRR │
+    └────┘ └────┘ └────┘ └────┘
+
+When it finishes writing the record leaf blocks, it moves on to the node
+blocks.
+To fill a node block, it walks each block in the next level down in the tree
+to compute the relevant keys and write them into the parent node::
+
+        ┌────┐        ┌────┐
+        │node│───────→│node│
+        │PP  │←───────│PP  │
+        └────┘        └────┘
+       ↙      ↘      ↙      ↘
+    ┌────┐ ┌────┐ ┌────┐ ┌────┐
+    │leaf│→│leaf│→│leaf│→│leaf│
+    │RRR │←│RRR │←│RRR │←│RRR │
+    └────┘ └────┘ └────┘ └────┘
+
+When it reaches the root level, it is ready to commit the new btree!::
+
+                 ┌─────────┐
+                 │  root   │
+                 │   PP    │
+                 └─────────┘
+                ↙           ↘
+        ┌────┐        ┌────┐
+        │node│───────→│node│
+        │PP  │←───────│PP  │
+        └────┘        └────┘
+       ↙      ↘      ↙      ↘
+    ┌────┐ ┌────┐ ┌────┐ ┌────┐
+    │leaf│→│leaf│→│leaf│→│leaf│
+    │RRR │←│RRR │←│RRR │←│RRR │
+    └────┘ └────┘ └────┘ └────┘
+
+The first step to commit the new btree is to persist the btree blocks to disk
+synchronously.
+This is a little complicated because a new btree block could have been freed
+in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
+remove the (stale) buffer from the AIL list before it can write the new blocks
+to disk.
+Blocks are queued for IO using a delwri list and written in one large batch
+with ``xfs_buf_delwri_submit``.
+
+Once the new blocks have been persisted to disk, control returns to the
+individual repair function that called the bulk loader.
+The repair function must log the location of the new root in a transaction,
+clean up the space reservations that were made for the new btree, and reap the
+old metadata blocks:
+
+1. Commit the location of the new btree root.
+
+2. For each incore reservation:
+
+   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
+      by the btree builder. The new EFDs must point to the EFIs attached to
+      the reservation to prevent log recovery from freeing the new blocks.
+
+   b. For unclaimed portions of incore reservations, create a regular deferred
+      extent free work item to free the unused space later in the
+      transaction chain.
+
+   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
+      reservation of the committing transaction.
+      If the btree loading code suspects this might be about to happen, it must
+      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
+      fresh transaction.
+
+3. Clear out the deferred work a second time to finish the commit and clean
+   the repair transaction.
+
+The transaction rolling in steps 2c and 3 represents a weakness in the repair
+algorithm, because a log flush and a crash before the end of the reap step can
+result in space leaking.
+Online repair functions minimize the chances of this occurring by using very
+large transactions, each of which can accommodate many thousands of block
+freeing instructions.
+Repair moves on to reaping the old blocks, which will be presented in a
+subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
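+
+Before the case studies, here is a rough sketch of how a repair function
+drives the bulk loader end to end.
+Every ``xrep_*`` helper named here is a placeholder invented for illustration;
+only ``xfs_btree_bload_compute_geometry`` and ``xfs_btree_bload`` refer to the
+real bulk loading entry points, and even those are shown with simplified
+signatures and error handling::
+
+    /* Illustrative control flow only; the xrep_* helpers are hypothetical. */
+    int
+    xrep_rebuild_btree_sketch(struct xfs_scrub *sc, struct xfarray *records,
+                              unsigned long long nrecords)
+    {
+            struct xfs_btree_bload  bload = { };
+            struct xfs_btree_cur    *cur;
+            int                     error;
+
+            /* A staging cursor that is not yet connected to the AG headers. */
+            cur = xrep_staging_cursor(sc);
+
+            /* Size the new btree from the sorted record count. */
+            error = xfs_btree_bload_compute_geometry(cur, &bload, nrecords);
+            if (error)
+                    return error;
+
+            /* Reserve blocks for the new btree and log EFIs covering them. */
+            error = xrep_newbt_reserve(sc, bload.nr_blocks);
+            if (error)
+                    return error;
+
+            /* Write leaf and node blocks from the xfarray records. */
+            error = xfs_btree_bload(cur, &bload, records);
+            if (error)
+                    return error;
+
+            /* Flush the new blocks to disk, then commit the new root. */
+            error = xrep_newbt_commit_root(sc, cur);
+            if (error)
+                    return error;
+
+            /* Log EFDs, return unused reservations, and reap old blocks. */
+            return xrep_newbt_cleanup_and_reap(sc);
+    }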
+
+Case Study: Rebuilding the Inode Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild the inode index btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
+   records from the inode chunk information and a bitmap of the old inode btree
+   blocks.
+
+2. Append the records to an xfarray in inode order.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for the inode btree.
+   If the free space inode btree is enabled, call it again to estimate the
+   geometry of the finobt.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks.
+   If the free space inode btree is enabled, call it again to load the finobt.
+
+6. Commit the location of the new btree root block(s) to the AGI.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows.
+
+The inode btree maps inumbers to the ondisk location of the associated
+inode records, which means that the inode btrees can be rebuilt from the
+reverse mapping information.
+Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
+location of the old inode btree blocks.
+Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
+location of at least one inode cluster buffer.
+A cluster is the smallest number of ondisk inodes that can be allocated or
+freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
+
+For the space represented by each inode cluster, ensure that there are no
+records in the free space btrees nor any records in the reference count btree.
+If there are, the space metadata inconsistencies are reason enough to abort the
+operation.
+Otherwise, read each cluster buffer to check that its contents appear to be
+ondisk inodes and to decide if the file is allocated
+(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
+Accumulate the results of successive inode cluster buffer reads until there is
+enough information to fill a single inode chunk record, which is 64 consecutive
+numbers in the inumber keyspace.
+If the chunk is sparse, the chunk record may include holes.
+
+Once the repair function accumulates one chunk's worth of data, it calls
+``xfarray_append`` to add the inode btree record to the xfarray.
+This xfarray is walked twice during the btree creation step -- once to populate
+the inode btree with all inode chunk records, and a second time to populate the
+free inode btree with records for chunks that have free non-sparse inodes.
+The number of records for the inode btree is the number of xfarray records,
+but the record count for the free inode btree has to be computed as inode chunk
+records are stored in the xfarray.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the Space Reference Counts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Reverse mapping records are used to rebuild the reference count information.
+Reference counts are required for correct operation of copy on write for shared
+file data.
+Imagine the reverse mapping entries as rectangles representing extents of
+physical blocks, and that the rectangles can be laid down to allow them to
+overlap each other.
+From the diagram below, it is apparent that a reference count record must start +or end wherever the height of the stack changes. +In other words, the record emission stimulus is level-triggered:: + + █ ███ + ██ █████ ████ ███ ██████ + ██ ████ ███████████ ████ █████████ + ████████████████████████████████ ███████████ + ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + 2 1 23 21 3 43 234 2123 1 01 2 3 0 + +The ondisk reference count btree does not store the refcount == 0 cases because +the free space btree already records which blocks are free. +Extents being used to stage copy-on-write operations should be the only records +with refcount == 1. +Single-owner file blocks aren't recorded in either the free space or the +reference count btrees. + +The high level process to rebuild the reference count btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec`` + records for any space having more than one reverse mapping and add them to + the xfarray. + Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray + because these are extents allocated to stage a copy on write operation and + are tracked in the refcount btree. + + Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old + refcount btree blocks. + +2. Sort the records in physical extent order, putting the CoW staging extents + at the end of the xfarray. + This matches the sorting order of records in the refcount btree. + +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the new tree. + +4. Allocate the number of blocks computed in the previous step. + +5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + +6. Commit the location of new btree root block to the AGF. + +7. Reap the old btree blocks using the bitmap created in step 1. + +Details are as follows; the same algorithm is used by ``xfs_repair`` to +generate refcount information from reverse mapping records. + +- Until the reverse mapping btree runs out of records: + + - Retrieve the next record from the btree and put it in a bag. + + - Collect all records with the same starting block from the btree and put + them in the bag. + + - While the bag isn't empty: + + - Among the mappings in the bag, compute the lowest block number where the + reference count changes. + This position will be either the starting block number of the next + unprocessed reverse mapping or the next block after the shortest mapping + in the bag. + + - Remove all mappings from the bag that end at this position. + + - Collect all reverse mappings that start at this position from the btree + and put them in the bag. + + - If the size of the bag changed and is greater than one, create a new + refcount record associating the block number range that we just walked to + the size of the bag. + +The bag-like structure in this case is a type 2 xfarray as discussed in the +:ref:`xfarray access patterns<xfarray_access_patterns>` section. +Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and +removed via ``xfarray_unset``. +Bag members are examined through ``xfarray_iter`` loops. + +The proposed patchset is the +`AG btree repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ +series. + +Case Study: Rebuilding File Fork Mapping Indices +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The high level process to rebuild a data/attr fork mapping btree is: + +1. 
Walk the reverse mapping records to generate ``struct xfs_bmbt_rec`` + records from the reverse mapping records for that inode and fork. + Append these records to an xfarray. + Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` + records. + +2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the new tree. + +3. Sort the records in file offset order. + +4. If the extent records would fit in the inode fork immediate area, commit the + records to that immediate area and skip to step 8. + +5. Allocate the number of blocks computed in the previous step. + +6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + +7. Commit the new btree root block to the inode fork immediate area. + +8. Reap the old btree blocks using the bitmap created in step 1. + +There are some complications here: +First, it's possible to move the fork offset to adjust the sizes of the +immediate areas if the data and attr forks are not both in BMBT format. +Second, if there are sufficiently few fork mappings, it may be possible to use +EXTENTS format instead of BMBT, which may require a conversion. +Third, the incore extent map must be reloaded carefully to avoid disturbing +any delayed allocation extents. + +The proposed patchset is the +`file mapping repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_ +series. + +.. _reaping: + +Reaping Old Metadata Blocks +--------------------------- + +Whenever online fsck builds a new data structure to replace one that is +suspect, there is a question of how to find and dispose of the blocks that +belonged to the old structure. +The laziest method of course is not to deal with them at all, but this slowly +leads to service degradations as space leaks out of the filesystem. +Hopefully, someone will schedule a rebuild of the free space information to +plug all those leaks. +Offline repair rebuilds all space metadata after recording the usage of +the files and directories that it decides not to clear, hence it can build new +structures in the discovered free space and avoid the question of reaping. + +As part of a repair, online fsck relies heavily on the reverse mapping records +to find space that is owned by the corresponding rmap owner yet truly free. +Cross referencing rmap records with other rmap records is necessary because +there may be other data structures that also think they own some of those +blocks (e.g. crosslinked trees). +Permitting the block allocator to hand them out again will not push the system +towards consistency. + +For space metadata, the process of finding extents to dispose of generally +follows this format: + +1. Create a bitmap of space used by data structures that must be preserved. + The space reservations used to create the new metadata can be used here if + the same rmap owner code is used to denote all of the objects being rebuilt. + +2. Survey the reverse mapping data to create a bitmap of space owned by the + same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. + +3. Use the bitmap disunion operator to subtract (1) from (2). + The remaining set bits represent candidate extents that could be freed. + The process moves on to step 4 below. 
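+
+The disunion in step 3 is plain set subtraction.
+In terms of simple word bitmaps it might look like the following toy sketch;
+note that the kernel's xbitmap is an interval tree keyed by block ranges, not
+a fixed word array, and the mask values below are invented::
+
+    /* Toy illustration of step 3: candidates = owned & ~preserved. */
+    #include <stdint.h>
+    #include <stdio.h>
+
+    #define NWORDS  4       /* 256 blocks, purely for illustration */
+
+    static void
+    bitmap_disunion(const uint64_t *owned, const uint64_t *preserved,
+                    uint64_t *candidates)
+    {
+            for (int i = 0; i < NWORDS; i++)
+                    candidates[i] = owned[i] & ~preserved[i];
+    }
+
+    int main(void)
+    {
+            /* (2) blocks the rmap data assigns to this XFS_RMAP_OWN_* owner */
+            uint64_t owned[NWORDS]     = { 0x0000ff00, 0, 0, 0 };
+            /* (1) blocks used by the structures that must be preserved */
+            uint64_t preserved[NWORDS] = { 0x00000f00, 0, 0, 0 };
+            uint64_t candidates[NWORDS];
+
+            bitmap_disunion(owned, preserved, candidates);
+            printf("candidates for freeing: 0x%llx\n",
+                   (unsigned long long)candidates[0]);
+            return 0;
+    }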
+ +Repairs for file-based metadata such as extended attributes, directories, +symbolic links, quota files and realtime bitmaps are performed by building a +new structure attached to a temporary file and swapping the forks. +Afterward, the mappings in the old file fork are the candidate blocks for +disposal. + +The process for disposing of old extents is as follows: + +4. For each candidate extent, count the number of reverse mapping records for + the first block in that extent that do not have the same rmap owner for the + data structure being repaired. + + - If zero, the block has a single owner and can be freed. + + - If not, the block is part of a crosslinked structure and must not be + freed. + +5. Starting with the next block in the extent, figure out how many more blocks + have the same zero/nonzero other owner status as that first block. + +6. If the region is crosslinked, delete the reverse mapping entry for the + structure being repaired and move on to the next region. + +7. If the region is to be freed, mark any corresponding buffers in the buffer + cache as stale to prevent log writeback. + +8. Free the region and move on. + +However, there is one complication to this procedure. +Transactions are of finite size, so the reaping process must be careful to roll +the transactions to avoid overruns. +Overruns come from two sources: + +a. EFIs logged on behalf of space that is no longer occupied + +b. Log items for buffer invalidations + +This is also a window in which a crash during the reaping process can leak +blocks. +As stated earlier, online repair functions use very large transactions to +minimize the chances of this occurring. + +The proposed patchset is the +`preparation for bulk loading btrees +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_ +series. + +Case Study: Reaping After a Regular Btree Repair +```````````````````````````````````````````````` + +Old reference count and inode btrees are the easiest to reap because they have +rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount +btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. +Creating a list of extents to reap the old btree blocks is quite simple, +conceptually: + +1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees. + +2. For each reverse mapping record with an rmap owner corresponding to the + metadata structure being rebuilt, set the corresponding range in a bitmap. + +3. Walk the current data structures that have the same rmap owner. + For each block visited, clear that range in the above bitmap. + +4. Each set bit in the bitmap represents a block that could be a block from the + old data structures and hence is a candidate for reaping. + In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` + are the blocks that might be freeable. + +If it is possible to maintain the AGF lock throughout the repair (which is the +common case), then step 2 can be performed at the same time as the reverse +mapping record walk that creates the records for the new btree. + +Case Study: Rebuilding the Free Space Indices +````````````````````````````````````````````` + +The high level process to rebuild the free space indices is: + +1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore`` + records from the gaps in the reverse mapping btree. + +2. Append the records to an xfarray. + +3. 
Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for each new tree.
+
+4. Allocate the number of blocks computed in the previous step from the free
+   space information collected.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks for the free space by length index.
+   Call it again for the free space by block number index.
+
+6. Commit the locations of the new btree root blocks to the AGF.
+
+7. Reap the old btree blocks by looking for space that is not recorded by the
+   reverse mapping btree, the new free space btrees, or the AGFL.
+
+Repairing the free space btrees has three key complications over a regular
+btree repair:
+
+First, free space is not explicitly tracked in the reverse mapping records.
+Hence, the new free space records must be inferred from gaps in the physical
+space component of the keyspace of the reverse mapping btree.
+
+Second, free space repairs cannot use the common btree reservation code because
+new blocks are reserved out of the free space btrees.
+This is impossible when repairing the free space btrees themselves.
+However, repair holds the AGF buffer lock for the duration of the free space
+index reconstruction, so it can use the collected free space information to
+supply the blocks for the new free space btrees.
+It is not necessary to back each reserved extent with an EFI because the new
+free space btrees are constructed in what the ondisk filesystem thinks is
+unowned space.
+However, if reserving blocks for the new btrees from the collected free space
+information changes the number of free space records, repair must re-estimate
+the new free space btree geometry with the new record count until the
+reservation is sufficient.
+As part of committing the new btrees, repair must ensure that reverse mappings
+are created for the reserved blocks and that unused reserved blocks are
+inserted into the free space btrees.
+Deferred rmap and freeing operations are used to ensure that this transition
+is atomic, similar to the other btree repair functions.
+
+Third, finding the blocks to reap after the repair is not overly
+straightforward.
+Blocks for the free space btrees and the reverse mapping btrees are supplied by
+the AGFL.
+Blocks put onto the AGFL have reverse mapping records with the owner
+``XFS_RMAP_OWN_AG``.
+This ownership is retained when blocks move from the AGFL into the free space
+btrees or the reverse mapping btrees.
+When repair walks reverse mapping records to synthesize free space records, it
+creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
+``XFS_RMAP_OWN_AG`` records.
+The repair context maintains a second bitmap corresponding to the rmap btree
+blocks and the AGFL blocks (``rmap_agfl_bitmap``).
+When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
+~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
+btrees.
+These blocks can then be reaped using the methods outlined above.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+.. _rmap_reap:
+
+Case Study: Reaping After Repairing Reverse Mapping Btrees
+``````````````````````````````````````````````````````````
+
+Old reverse mapping btrees are less difficult to reap after a repair.
+As mentioned in the previous section, blocks on the AGFL, the two free space
+btree blocks, and the reverse mapping btree blocks all have reverse mapping
+records with ``XFS_RMAP_OWN_AG`` as the owner.
+The full process of gathering reverse mapping records and building a new btree
+is described in the case study of
+:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
+discussion is that the new rmap btree will not contain any records for the old
+rmap btree, nor will the old btree blocks be tracked in the free space btrees.
+The list of candidate reaping blocks is computed by setting the bits
+corresponding to the gaps in the new rmap btree records, and then clearing the
+bits corresponding to extents in the free space btrees and the current AGFL
+blocks.
+The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` is reaped using the
+methods outlined above.
+
+The rest of the process of rebuilding the reverse mapping btree is discussed
+in a separate :ref:`case study<rmap_repair>`.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the AGFL
+```````````````````````````````
+
+The allocation group free block list (AGFL) is repaired as follows:
+
+1. Create a bitmap for all the space that the reverse mapping data claims is
+   owned by ``XFS_RMAP_OWN_AG``.
+
+2. Subtract the space used by the two free space btrees and the rmap btree.
+
+3. Subtract any space that the reverse mapping data claims is owned by any
+   other owner, to avoid re-adding crosslinked blocks to the AGFL.
+
+4. Once the AGFL is full, reap any leftover blocks.
+
+5. The next operation to fix the freelist will right-size the list.
+
+See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use
+specialized resource acquisition functions that return either the in-memory
+representation *or* a lock on whichever object is necessary to prevent any
+update to the ondisk location.
+
+The only repairs that should be made to the ondisk inode buffers are whatever
+is necessary to get the in-core structure loaded.
+This means fixing whatever is caught by the inode cluster buffer and inode fork
+verifiers, and retrying the ``iget`` operation.
+If the second ``iget`` fails, the repair has failed.
+
+Once the in-memory representation is loaded, repair can lock the inode and can
+subject it to comprehensive checks, repairs, and optimizations.
+Most inode attributes are easy to check and constrain, or are user-controlled
+arbitrary bit patterns; these are both easy to fix.
+Dealing with the data and attr fork extent counts and the file block counts is
+more complicated, because computing the correct value requires traversing the
+forks, or if that fails, leaving the fields invalid and waiting for the fork
+fsck functions to run.
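+
+The "repair only enough to iget, then retry" flow described above might be
+sketched as follows.
+The ``xrep_*`` and ``xchk_*_sketch`` helper names are placeholders invented
+for this illustration and are not actual kernel functions; only the locking,
+``xfs_irele``, and error codes correspond to real XFS interfaces::
+
+    /* Hypothetical sketch of loading and then checking a damaged inode. */
+    int
+    xrep_inode_sketch(struct xfs_scrub *sc, xfs_ino_t ino)
+    {
+            struct xfs_inode        *ip;
+            int                     error;
+
+            error = xchk_iget_sketch(sc, ino, &ip);
+            if (error == -EFSCORRUPTED || error == -EFSBADCRC) {
+                    /* Fix only what the dinode and fork verifiers reject... */
+                    error = xrep_fix_ondisk_dinode(sc, ino);
+                    if (error)
+                            return error;
+                    /* ...then retry; a second iget failure ends the repair. */
+                    error = xchk_iget_sketch(sc, ino, &ip);
+            }
+            if (error)
+                    return error;
+
+            /* With the incore inode loaded, run the comprehensive checks. */
+            xfs_ilock(ip, XFS_ILOCK_EXCL);
+            error = xrep_fix_incore_fields(sc, ip);
+            xfs_iunlock(ip, XFS_ILOCK_EXCL);
+            xfs_irele(ip);
+            return error;
+    }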
+ +The proposed patchset is the +`inode +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_ +repair series. + +Quota Record Repairs +-------------------- + +Similar to inodes, quota records ("dquots") also have both ondisk records and +an in-memory representation, and hence are subject to the same cache coherency +issues. +Somewhat confusingly, both are known as dquots in the XFS codebase. + +The only repairs that should be made to the ondisk quota record buffers are +whatever is necessary to get the in-core structure loaded. +Once the in-memory representation is loaded, the only attributes needing +checking are obviously bad limits and timer values. + +Quota usage counters are checked, repaired, and discussed separately in the +section about :ref:`live quotacheck <quotacheck>`. + +The proposed patchset is the +`quota +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_ +repair series. + +.. _fscounters: + +Freezing to Fix Summary Counters +-------------------------------- + +Filesystem summary counters track availability of filesystem resources such +as free blocks, free inodes, and allocated inodes. +This information could be compiled by walking the free space and inode indexes, +but this is a slow process, so XFS maintains a copy in the ondisk superblock +that should reflect the ondisk metadata, at least when the filesystem has been +unmounted cleanly. +For performance reasons, XFS also maintains incore copies of those counters, +which are key to enabling resource reservations for active transactions. +Writer threads reserve the worst-case quantities of resources from the +incore counter and give back whatever they don't use at commit time. +It is therefore only necessary to serialize on the superblock when the +superblock is being committed to disk. + +The lazy superblock counter feature introduced in XFS v5 took this even further +by training log recovery to recompute the summary counters from the AG headers, +which eliminated the need for most transactions even to touch the superblock. +The only time XFS commits the summary counters is at filesystem unmount. +To reduce contention even further, the incore counter is implemented as a +percpu counter, which means that each CPU is allocated a batch of blocks from a +global incore counter and can satisfy small allocations from the local batch. + +The high-performance nature of the summary counters makes it difficult for +online fsck to check them, since there is no way to quiesce a percpu counter +while the system is running. +Although online fsck can read the filesystem metadata to compute the correct +values of the summary counters, there's no way to hold the value of a percpu +counter stable, so it's quite possible that the counter will be out of date by +the time the walk is complete. +Earlier versions of online scrub would return to userspace with an incomplete +scan flag, but this is not a satisfying outcome for a system administrator. +For repairs, the in-memory counters must be stabilized while walking the +filesystem metadata to get an accurate reading and install it in the percpu +counter. + +To satisfy this requirement, online fsck must prevent other programs in the +system from initiating new writes to the filesystem, it must disable background +garbage collection threads, and it must wait for existing writer programs to +exit the kernel. 
+Once that has been established, scrub can walk the AG free space indexes, the +inode btrees, and the realtime bitmap to compute the correct value of all +four summary counters. +This is very similar to a filesystem freeze, though not all of the pieces are +necessary: + +- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to + prevent other threads from thawing the filesystem, or other scrub threads + from initiating another fscounters freeze. + +- It does not quiesce the log. + +With this code in place, it is now possible to pause the filesystem for just +long enough to check and correct the summary counters. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| The initial implementation used the actual VFS filesystem freeze | +| mechanism to quiesce filesystem activity. | +| With the filesystem frozen, it is possible to resolve the counter values | +| with exact precision, but there are many problems with calling the VFS | +| methods directly: | +| | +| - Other programs can unfreeze the filesystem without our knowledge. | +| This leads to incorrect scan results and incorrect repairs. | +| | +| - Adding an extra lock to prevent others from thawing the filesystem | +| required the addition of a ``->freeze_super`` function to wrap | +| ``freeze_fs()``. | +| This in turn caused other subtle problems because it turns out that | +| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the | +| last reference to the VFS superblock, and any subsequent access | +| becomes a UAF bug! | +| This can happen if the filesystem is unmounted while the underlying | +| block device has frozen the filesystem. | +| This problem could be solved by grabbing extra references to the | +| superblock, but it felt suboptimal given the other inadequacies of | +| this approach. | +| | +| - The log need not be quiesced to check the summary counters, but a VFS | +| freeze initiates one anyway. | +| This adds unnecessary runtime to live fscounter fsck operations. | +| | +| - Quiescing the log means that XFS flushes the (possibly incorrect) | +| counters to disk as part of cleaning the log. | +| | +| - A bug in the VFS meant that freeze could complete even when | +| sync_filesystem fails to flush the filesystem and returns an error. | +| This bug was fixed in Linux 5.17. | ++--------------------------------------------------------------------------+ + +The proposed patchset is the +`summary counter cleanup +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_ +series. + +Full Filesystem Scans +--------------------- + +Certain types of metadata can only be checked by walking every file in the +entire filesystem to record observations and comparing the observations against +what's recorded on disk. +Like every other type of online repair, repairs are made by writing those +observations to disk in a replacement structure and committing it atomically. +However, it is not practical to shut down the entire filesystem to examine +hundreds of billions of files because the downtime would be excessive. +Therefore, online fsck must build the infrastructure to manage a live scan of +all the files in the filesystem. +There are two questions that need to be solved to perform a live walk: + +- How does scrub manage the scan while it is collecting data? 
+
+- How does the scan keep abreast of changes being made to the system by other
+  threads?
+
+.. _iscan:
+
+Coordinated Inode Scans
+```````````````````````
+
+In the original Unix filesystems of the 1970s, each directory entry contained
+an index number (*inumber*) which was used as an index into an ondisk array
+(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
+its data block mapping.
+This system is described by J. Lions, `"inode (5659)"
+<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
+UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
+Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
+`"Implementation of the File System"
+<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
+Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
+1913-4.
+
+XFS retains most of this design, except now inumbers are search keys over all
+the space in the data section filesystem.
+They form a continuous keyspace that can be expressed as a 64-bit integer,
+though the inodes themselves are sparsely distributed within the keyspace.
+Scans proceed in a linear fashion across the inumber keyspace, starting from
+``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
+Naturally, a scan through a keyspace requires a scan cursor object to track the
+scan progress.
+Because this keyspace is sparse, this cursor contains two parts.
+The first part of this scan cursor object tracks the inode that will be
+examined next; call this the examination cursor.
+Somewhat less obviously, the scan cursor object must also track which parts of
+the keyspace have already been visited, which is critical for deciding if a
+concurrent filesystem update needs to be incorporated into the scan data.
+Call this the visited inode cursor.
+
+Advancing the scan cursor is a multi-step process encapsulated in
+``xchk_iscan_iter``:
+
+1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
+   inode cursor.
+   This guarantees that inodes in this AG cannot be allocated or freed while
+   advancing the cursor.
+
+2. Use the per-AG inode btree to look up the next inumber after the one that
+   was just visited, since it may not be keyspace adjacent.
+
+3. If there are no more inodes left in this AG:
+
+   a. Move the examination cursor to the point of the inumber keyspace that
+      corresponds to the start of the next AG.
+
+   b. Adjust the visited inode cursor to indicate that it has "visited" the
+      last possible inode in the current AG's inode keyspace.
+      XFS inumbers are segmented, so the cursor needs to be marked as having
+      visited the entire keyspace up to just before the start of the next AG's
+      inode keyspace.
+
+   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
+      filesystem.
+
+   d. If there are no more AGs to examine, set both cursors to the end of the
+      inumber keyspace.
+      The scan is now complete.
+
+4. Otherwise, there is at least one more inode to scan in this AG:
+
+   a. Move the examination cursor ahead to the next inode marked as allocated
+      by the inode btree.
+
+   b. Adjust the visited inode cursor to point to the inode just prior to where
+      the examination cursor is now.
+      Because the scanner holds the AGI buffer lock, no inodes could have been
+      created in the part of the inode keyspace that the visited inode cursor
+      just advanced.
+
+5. Get the incore inode for the inumber of the examination cursor.
+   By maintaining the AGI buffer lock until this point, the scanner knows that
+   it was safe to advance the examination cursor across the entire keyspace,
+   and that it has stabilized this next inode so that it cannot disappear from
+   the filesystem until the scan releases the incore inode.
+
+6. Drop the AGI lock and return the incore inode to the caller.
+
+Online fsck functions scan all files in the filesystem as follows:
+
+1. Start a scan by calling ``xchk_iscan_start``.
+
+2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
+   If one is provided:
+
+   a. Lock the inode to prevent updates during the scan.
+
+   b. Scan the inode.
+
+   c. While still holding the inode lock, adjust the visited inode cursor
+      (``xchk_iscan_mark_visited``) to point to this inode.
+
+   d. Unlock and release the inode.
+
+3. Call ``xchk_iscan_teardown`` to complete the scan.
+
+There are subtleties with the inode cache that complicate grabbing the incore
+inode for the caller.
+Obviously, it is an absolute requirement that the inode metadata be consistent
+enough to load it into the inode cache.
+Second, if the incore inode is stuck in some intermediate state, the scan
+coordinator must release the AGI and push the main filesystem to get the inode
+back into a loadable state.
+
+The proposed patches are the
+`inode scanner
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
+series.
+The first user of the new functionality is the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
+series.
+
+Inode Management
+````````````````
+
+In regular filesystem code, references to allocated XFS incore inodes are
+always obtained (``xfs_iget``) outside of transaction context because the
+creation of the incore context for an existing file does not require metadata
+updates.
+However, it is important to note that references to incore inodes obtained as
+part of file creation must be performed in transaction context because the
+filesystem must ensure the atomicity of the ondisk inode btree index updates
+and the initialization of the actual ondisk inode.
+
+References to incore inodes are always released (``xfs_irele``) outside of
+transaction context because there are a handful of activities that might
+require ondisk updates:
+
+- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
+  release.
+
+- Speculative preallocations need to be unreserved.
+
+- An unlinked file may have lost its last reference, in which case the entire
+  file must be inactivated, which involves releasing all of its resources in
+  the ondisk metadata and freeing the inode.
+
+These activities are collectively called inode inactivation.
+Inactivation has two parts -- the VFS part, which initiates writeback on all
+dirty file pages, and the XFS part, which cleans up XFS-specific information
+and frees the inode if it was unlinked.
+If the inode is unlinked (or unconnected after a file handle operation), the
+kernel drops the inode into the inactivation machinery immediately.
+
+During normal operation, resource acquisition for an update follows this order
+to avoid deadlocks:
+
+1. Inode reference (``iget``).
+
+2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
+
+3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
+
+4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
+   can update page cache mappings.
+
+5. Log feature enablement.
+
+6. 
Transaction log space grant. + +7. Space on the data and realtime devices for the transaction. + +8. Incore dquot references, if a file is being repaired. + Note that they are not locked, merely acquired. + +9. Inode ``ILOCK`` for file metadata updates. + +10. AG header buffer locks / Realtime metadata inode ILOCK. + +11. Realtime metadata buffer locks, if applicable. + +12. Extent mapping btree blocks, if applicable. + +Resources are often released in the reverse order, though this is not required. +However, online fsck differs from regular XFS operations because it may examine +an object that normally is acquired in a later stage of the locking order, and +then decide to cross-reference the object with an object that is acquired +earlier in the order. +The next few sections detail the specific ways in which online fsck takes care +to avoid deadlocks. + +iget and irele During a Scrub +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An inode scan performed on behalf of a scrub operation runs in transaction +context, and possibly with resources already locked and bound to it. +This isn't much of a problem for ``iget`` since it can operate in the context +of an existing transaction, as long as all of the bound resources are acquired +before the inode reference in the regular filesystem. + +When the VFS ``iput`` function is given a linked inode with no other +references, it normally puts the inode on an LRU list in the hope that it can +save time if another process re-opens the file before the system runs out +of memory and frees it. +Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE`` +flag on the inode to cause the kernel to try to drop the inode into the +inactivation machinery immediately. + +In the past, inactivation was always done from the process that dropped the +inode, which was a problem for scrub because scrub may already hold a +transaction, and XFS does not support nesting transactions. +On the other hand, if there is no scrub transaction, it is desirable to drop +otherwise unused inodes immediately to avoid polluting caches. +To capture these nuances, the online fsck code has a separate ``xchk_irele`` +function to set or clear the ``DONTCACHE`` flag to get the required release +behavior. + +Proposed patchsets include fixing +`scrub iget usage +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and +`dir iget usage +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_. + +.. _ilocking: + +Locking Inodes +^^^^^^^^^^^^^^ + +In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks +in a well-known order: parent → child when updating the directory tree, and +in numerical order of the addresses of their ``struct inode`` object otherwise. +For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page +faults. +If two MMAPLOCKs must be acquired, they are acquired in numerical order of +the addresses of their ``struct address_space`` objects. +Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be +acquired before transactions are allocated. +If two ILOCKs must be acquired, they are acquired in inumber order. + +Inode lock acquisition must be done carefully during a coordinated inode scan. +Online fsck cannot abide these conventions, because for a directory tree +scanner, the scrub process holds the IOLOCK of the file being scanned and it +needs to take the IOLOCK of the file at the other end of the directory link. 
+If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
+cannot use the regular inode locking functions and still avoid becoming trapped
+in an ABBA deadlock.
+
+Solving both of these problems is straightforward -- any time online fsck
+needs to take a second lock of the same class, it uses trylock to avoid an ABBA
+deadlock.
+If the trylock fails, scrub drops all inode locks and uses trylock loops to
+(re)acquire all necessary resources.
+Trylock loops enable scrub to check for pending fatal signals, which is how
+scrub avoids deadlocking the filesystem or becoming an unresponsive process.
+However, trylock loops mean that online fsck must be prepared to measure the
+resource being scrubbed before and after the lock cycle to detect changes and
+react accordingly.
+
+.. _dirparent:
+
+Case Study: Finding a Directory Parent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Consider the directory parent pointer repair code as an example.
+Online fsck must verify that the dotdot dirent of a directory points up to a
+parent directory, and that the parent directory contains exactly one dirent
+pointing down to the child directory.
+Fully validating this relationship (and repairing it if possible) requires a
+walk of every directory on the filesystem while holding the child locked, and
+while updates to the directory tree are being made.
+The coordinated inode scan provides a way to walk the filesystem without the
+possibility of missing an inode.
+The child directory is kept locked to prevent updates to the dotdot dirent, but
+if the scanner fails to lock a parent, it can drop and relock both the child
+and the prospective parent.
+If the dotdot entry changes while the directory is unlocked, then a move or
+rename operation must have changed the child's parentage, and the scan can
+exit early.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+.. _fshooks:
+
+Filesystem Hooks
+`````````````````
+
+The second piece of support that online fsck functions need during a full
+filesystem scan is the ability to stay informed about updates being made by
+other threads in the filesystem, since comparisons against the past are useless
+in a dynamic environment.
+Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
+filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
+
+Filesystem hooks convey information about an ongoing filesystem operation to
+a downstream consumer.
+In this case, the downstream consumer is always an online fsck function.
+Because multiple fsck functions can run in parallel, online fsck uses the Linux
+notifier call chain facility to dispatch updates to any number of interested
+fsck processes.
+Call chains are a dynamic list, which means that they can be configured at
+run time.
+Because these hooks are private to the XFS module, the information passed along
+contains exactly what the checking function needs to update its observations.
+
+The current implementation of XFS hooks uses SRCU notifier chains to reduce the
+impact to highly threaded workloads.
+Regular blocking notifier chains use a rwsem and seem to have a much lower
+overhead for single-threaded applications.
+However, it may turn out that the combination of blocking chains and static
+keys is more performant; more study is needed here.
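+
+As a rough illustration, stripped of the type-safe wrappers and the static
+keys described below, a hook built directly on the stock Linux SRCU notifier
+API might look like the following sketch.
+The ``example_*`` names and the payload structure are invented for this
+illustration and are not part of the XFS patches::
+
+    /* Minimal SRCU-notifier-based hook; illustrative names only. */
+    #include <linux/notifier.h>
+
+    struct example_hook_data {
+            unsigned long   action;         /* what happened */
+            void            *object;        /* which object it happened to */
+    };
+
+    static struct srcu_notifier_head example_hook_chain;
+
+    void example_hooks_init(void)
+    {
+            srcu_init_notifier_head(&example_hook_chain);
+    }
+
+    /* Called by regular filesystem code, adjacent to committing the update. */
+    void example_hooks_call(struct example_hook_data *data)
+    {
+            srcu_notifier_call_chain(&example_hook_chain, data->action, data);
+    }
+
+    /* A scrubber registers its callback before starting a scan... */
+    int example_hooks_add(struct notifier_block *nb)
+    {
+            return srcu_notifier_chain_register(&example_hook_chain, nb);
+    }
+
+    /* ...and unregisters it once the scan is complete. */
+    int example_hooks_del(struct notifier_block *nb)
+    {
+            return srcu_notifier_chain_unregister(&example_hook_chain, nb);
+    }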
+ +The following pieces are necessary to hook a certain point in the filesystem: + +- A ``struct xfs_hooks`` object must be embedded in a convenient place such as + a well-known incore filesystem object. + +- Each hook must define an action code and a structure containing more context + about the action. + +- Hook providers should provide appropriate wrapper functions and structs + around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type + checking to ensure correct usage. + +- A callsite in the regular filesystem code must be chosen to call + ``xfs_hooks_call`` with the action code and data structure. + This place should be adjacent to (and not earlier than) the place where + the filesystem update is committed to the transaction. + In general, when the filesystem calls a hook chain, it should be able to + handle sleeping and should not be vulnerable to memory reclaim or locking + recursion. + However, the exact requirements are very dependent on the context of the hook + caller and the callee. + +- The online fsck function should define a structure to hold scan data, a lock + to coordinate access to the scan data, and a ``struct xfs_hook`` object. + The scanner function and the regular filesystem code must acquire resources + in the same order; see the next section for details. + +- The online fsck code must contain a C function to catch the hook action code + and data structure. + If the object being updated has already been visited by the scan, then the + hook information must be applied to the scan data. + +- Prior to unlocking inodes to start the scan, online fsck must call + ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and + ``xfs_hooks_add`` to enable the hook. + +- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is + complete. + +The number of hooks should be kept to a minimum to reduce complexity. +Static keys are used to reduce the overhead of filesystem hooks to nearly +zero when online fsck is not running. + +.. _liveupdate: + +Live Updates During a Scan +`````````````````````````` + +The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>` +filesystem code look like this:: + + other program + ↓ + inode lock ←────────────────────┐ + ↓ │ + AG header lock │ + ↓ │ + filesystem function │ + ↓ │ + notifier call chain │ same + ↓ ├─── inode + scrub hook function │ lock + ↓ │ + scan data mutex ←──┐ same │ + ↓ ├─── scan │ + update scan data │ lock │ + ↑ │ │ + scan data mutex ←──┘ │ + ↑ │ + inode lock ←────────────────────┘ + ↑ + scrub function + ↑ + inode scanner + ↑ + xfs_scrub + +These rules must be followed to ensure correct interactions between the +checking code and the code making an update to the filesystem: + +- Prior to invoking the notifier call chain, the filesystem function being + hooked must acquire the same lock that the scrub scanning function acquires + to scan the inode. + +- The scanning function and the scrub hook function must coordinate access to + the scan data by acquiring a lock on the scan data. + +- Scrub hook function must not add the live update information to the scan + observations unless the inode being updated has already been scanned. + The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``) + for this. + +- Scrub hook functions must not change the caller's state, including the + transaction that it is running. + They must not acquire any resources that might conflict with the filesystem + function being hooked. 
+ +- The hook function can abort the inode scan to avoid breaking the other rules. + +The inode scan APIs are pretty simple: + +- ``xchk_iscan_start`` starts a scan + +- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or + returns zero if there is nothing left to scan + +- ``xchk_iscan_want_live_update`` to decide if an inode has already been + visited in the scan. + This is critical for hook functions to decide if they need to update the + in-memory scan information. + +- ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the + scan + +- ``xchk_iscan_teardown`` to finish the scan + +This functionality is also a part of the +`inode scanner +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_ +series. + +.. _quotacheck: + +Case Study: Quota Counter Checking +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +It is useful to compare the mount time quotacheck code to the online repair +quotacheck code. +Mount time quotacheck does not have to contend with concurrent operations, so +it does the following: + +1. Make sure the ondisk dquots are in good enough shape that all the incore + dquots will actually load, and zero the resource usage counters in the + ondisk buffer. + +2. Walk every inode in the filesystem. + Add each file's resource usage to the incore dquot. + +3. Walk each incore dquot. + If the incore dquot is not being flushed, add the ondisk buffer backing the + incore dquot to a delayed write (delwri) list. + +4. Write the buffer list to disk. + +Like most online fsck functions, online quotacheck can't write to regular +filesystem objects until the newly collected metadata reflect all filesystem +state. +Therefore, online quotacheck records file resource usage to a shadow dquot +index implemented with a sparse ``xfarray``, and only writes to the real dquots +once the scan is complete. +Handling transactional updates is tricky because quota resource usage updates +are handled in phases to minimize contention on dquots: + +1. The inodes involved are joined and locked to a transaction. + +2. For each dquot attached to the file: + + a. The dquot is locked. + + b. A quota reservation is added to the dquot's resource usage. + The reservation is recorded in the transaction. + + c. The dquot is unlocked. + +3. Changes in actual quota usage are tracked in the transaction. + +4. At transaction commit time, each dquot is examined again: + + a. The dquot is locked again. + + b. Quota usage changes are logged and unused reservation is given back to + the dquot. + + c. The dquot is unlocked. + +For online quotacheck, hooks are placed in steps 2 and 4. +The step 2 hook creates a shadow version of the transaction dquot context +(``dqtrx``) that operates in a similar manner to the regular code. +The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots. +Notice that both hooks are called with the inode locked, which is how the +live update coordinates with the inode scanner. + +The quotacheck scan looks like this: + +1. Set up a coordinated inode scan. + +2. For each inode returned by the inode scan iterator: + + a. Grab and lock the inode. + + b. Determine that inode's resource usage (data blocks, inode counts, + realtime blocks) and add that to the shadow dquots for the user, group, + and project ids associated with the inode. + + c. Unlock and release the inode. + +3. For each dquot in the system: + + a. Grab and lock the dquot. + + b. 
Check the dquot against the shadow dquots created by the scan and updated + by the live hooks. + +Live updates are key to being able to walk every quota record without +needing to hold any locks for a long duration. +If repairs are desired, the real and shadow dquots are locked and their +resource counts are set to the values in the shadow dquot. + +The proposed patchset is the +`online quotacheck +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_ +series. + +.. _nlinks: + +Case Study: File Link Count Checking +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +File link count checking also uses live update hooks. +The coordinated inode scanner is used to visit all directories on the +filesystem, and per-file link count records are stored in a sparse ``xfarray`` +indexed by inumber. +During the scanning phase, each entry in a directory generates observation +data as follows: + +1. If the entry is a dotdot (``'..'``) entry of the root directory, the + directory's parent link count is bumped because the root directory's dotdot + entry is self referential. + +2. If the entry is a dotdot entry of a subdirectory, the parent's backref + count is bumped. + +3. If the entry is neither a dot nor a dotdot entry, the target file's parent + count is bumped. + +4. If the target is a subdirectory, the parent's child link count is bumped. + +A crucial point to understand about how the link count inode scanner interacts +with the live update hooks is that the scan cursor tracks which *parent* +directories have been scanned. +In other words, the live updates ignore any update about ``A → B`` when A has +not been scanned, even if B has been scanned. +Furthermore, a subdirectory A with a dotdot entry pointing back to B is +accounted as a backref counter in the shadow data for A, since child dotdot +entries affect the parent's link count. +Live update hooks are carefully placed in all parts of the filesystem that +create, change, or remove directory entries, since those operations involve +bumplink and droplink. + +For any file, the correct link count is the number of parents plus the number +of child subdirectories. +Non-directories never have children of any kind. +The backref information is used to detect inconsistencies in the number of +links pointing to child subdirectories and the number of dotdot entries +pointing back. + +After the scan completes, the link count of each file can be checked by locking +both the inode and the shadow data, and comparing the link counts. +A second coordinated inode scan cursor is used for comparisons. +Live updates are key to being able to walk every inode without needing to hold +any locks between inodes. +If repairs are desired, the inode's link count is set to the value in the +shadow information. +If no parents are found, the file must be :ref:`reparented <orphanage>` to the +orphanage to prevent the file from being lost forever. + +The proposed patchset is the +`file link count repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_ +series. + +.. _rmap_repair: + +Case Study: Rebuilding Reverse Mapping Records +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Most repair functions follow the same pattern: lock filesystem resources, +walk the surviving ondisk metadata looking for replacement metadata records, +and use an :ref:`in-memory array <xfarray>` to store the gathered observations. 
+The primary advantage of this approach is the simplicity and modularity of the +repair code -- code and data are entirely contained within the scrub module, +do not require hooks in the main filesystem, and are usually the most efficient +in memory use. +A secondary advantage of this repair approach is atomicity -- once the kernel +decides a structure is corrupt, no other threads can access the metadata until +the kernel finishes repairing and revalidating the metadata. + +For repairs going on within a shard of the filesystem, these advantages +outweigh the delays inherent in locking the shard while repairing parts of the +shard. +Unfortunately, repairs to the reverse mapping btree cannot use the "standard" +btree repair strategy because it must scan every space mapping of every fork of +every file in the filesystem, and the filesystem cannot stop. +Therefore, rmap repair foregoes atomicity between scrub and repair. +It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks +<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the +scan for reverse mapping records. + +1. Set up an xfbtree to stage rmap records. + +2. While holding the locks on the AGI and AGF buffers acquired during the + scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW + staging extents, and the internal log. + +3. Set up an inode scanner. + +4. Hook into rmap updates for the AG being repaired so that the live scan data + can receive updates to the rmap btree from the rest of the filesystem during + the file scan. + +5. For each space mapping found in either fork of each file scanned, + decide if the mapping matches the AG of interest. + If so: + + a. Create a btree cursor for the in-memory btree. + + b. Use the rmap code to add the record to the in-memory btree. + + c. Use the :ref:`special commit function <xfbtree_commit>` to write the + xfbtree changes to the xfile. + +6. For each live update received via the hook, decide if the owner has already + been scanned. + If so, apply the live update into the scan data: + + a. Create a btree cursor for the in-memory btree. + + b. Replay the operation into the in-memory btree. + + c. Use the :ref:`special commit function <xfbtree_commit>` to write the + xfbtree changes to the xfile. + This is performed with an empty transaction to avoid changing the + caller's state. + +7. When the inode scan finishes, create a new scrub transaction and relock the + two AG headers. + +8. Compute the new btree geometry using the number of rmap records in the + shadow btree, like all other btree rebuilding functions. + +9. Allocate the number of blocks computed in the previous step. + +10. Perform the usual btree bulk loading and commit to install the new rmap + btree. + +11. Reap the old rmap btree blocks as discussed in the case study about how + to :ref:`reap after rmap btree repair <rmap_reap>`. + +12. Free the xfbtree now that it not needed. + +The proposed patchset is the +`rmap repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_ +series. + +Staging Repairs with Temporary Files on Disk +-------------------------------------------- + +XFS stores a substantial amount of metadata in file forks: directories, +extended attributes, symbolic link targets, free space bitmaps and summary +information for the realtime volume, and quota records. 
+File forks map 64-bit logical file fork space extents to physical storage space +extents, similar to how a memory management unit maps 64-bit virtual addresses +to physical memory addresses. +Therefore, file-based tree structures (such as directories and extended +attributes) use blocks mapped in the file fork offset address space that point +to other blocks mapped within that same address space, and file-based linear +structures (such as bitmaps and quota records) compute array element offsets in +the file fork offset address space. + +Because file forks can consume as much space as the entire filesystem, repairs +cannot be staged in memory, even when a paging scheme is available. +Therefore, online repair of file-based metadata createas a temporary file in +the XFS filesystem, writes a new structure at the correct offsets into the +temporary file, and atomically swaps the fork mappings (and hence the fork +contents) to commit the repair. +Once the repair is complete, the old fork can be reaped as necessary; if the +system goes down during the reap, the iunlink code will delete the blocks +during log recovery. + +**Note**: All space usage and inode indices in the filesystem *must* be +consistent to use a temporary file safely! +This dependency is the reason why online repair can only use pageable kernel +memory to stage ondisk space usage information. + +Swapping metadata extents with a temporary file requires the owner field of the +block headers to match the file being repaired and not the temporary file. The +directory, extended attribute, and symbolic link functions were all modified to +allow callers to specify owner numbers explicitly. + +There is a downside to the reaping process -- if the system crashes during the +reap phase and the fork extents are crosslinked, the iunlink processing will +fail because freeing space will find the extra reverse mappings and abort. + +Temporary files created for repair are similar to ``O_TMPFILE`` files created +by userspace. +They are not linked into a directory and the entire file will be reaped when +the last reference to the file is lost. +The key differences are that these files must have no access permission outside +the kernel at all, they must be specially marked to prevent them from being +opened by handle, and they must never be linked into the directory tree. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| In the initial iteration of file metadata repair, the damaged metadata | +| blocks would be scanned for salvageable data; the extents in the file | +| fork would be reaped; and then a new structure would be built in its | +| place. | +| This strategy did not survive the introduction of the atomic repair | +| requirement expressed earlier in this document. | +| | +| The second iteration explored building a second structure at a high | +| offset in the fork from the salvage data, reaping the old extents, and | +| using a ``COLLAPSE_RANGE`` operation to slide the new extents into | +| place. | +| | +| This had many drawbacks: | +| | +| - Array structures are linearly addressed, and the regular filesystem | +| codebase does not have the concept of a linear offset that could be | +| applied to the record offset computation to build an alternate copy. | +| | +| - Extended attributes are allowed to use the entire attr fork offset | +| address space. 
| +| | +| - Even if repair could build an alternate copy of a data structure in a | +| different part of the fork address space, the atomic repair commit | +| requirement means that online repair would have to be able to perform | +| a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old | +| structure was completely replaced. | +| | +| - A crash after construction of the secondary tree but before the range | +| collapse would leave unreachable blocks in the file fork. | +| This would likely confuse things further. | +| | +| - Reaping blocks after a repair is not a simple operation, and | +| initiating a reap operation from a restarted range collapse operation | +| during log recovery is daunting. | +| | +| - Directory entry blocks and quota records record the file fork offset | +| in the header area of each block. | +| An atomic range collapse operation would have to rewrite this part of | +| each block header. | +| Rewriting a single field in block headers is not a huge problem, but | +| it's something to be aware of. | +| | +| - Each block in a directory or extended attributes btree index contains | +| sibling and child block pointers. | +| Were the atomic commit to use a range collapse operation, each block | +| would have to be rewritten very carefully to preserve the graph | +| structure. | +| Doing this as part of a range collapse means rewriting a large number | +| of blocks repeatedly, which is not conducive to quick repairs. | +| | +| This lead to the introduction of temporary file staging. | ++--------------------------------------------------------------------------+ + +Using a Temporary File +`````````````````````` + +Online repair code should use the ``xrep_tempfile_create`` function to create a +temporary file inside the filesystem. +This allocates an inode, marks the in-core inode private, and attaches it to +the scrub context. +These files are hidden from userspace, may not be added to the directory tree, +and must be kept private. + +Temporary files only use two inode locks: the IOLOCK and the ILOCK. +The MMAPLOCK is not needed here, because there must not be page faults from +userspace for data fork blocks. +The usage patterns of these two locks are the same as for any other XFS file -- +access to file data are controlled via the IOLOCK, and access to file metadata +are controlled via the ILOCK. +Locking helpers are provided so that the temporary file and its lock state can +be cleaned up by the scrub context. +To comply with the nested locking strategy laid out in the :ref:`inode +locking<ilocking>` section, it is recommended that scrub functions use the +xrep_tempfile_ilock*_nowait lock helpers. + +Data can be written to a temporary file by two means: + +1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular + temporary file from an xfile. + +2. The regular directory, symbolic link, and extended attribute functions can + be used to write to the temporary file. + +Once a good copy of a data file has been constructed in a temporary file, it +must be conveyed to the file being repaired, which is the topic of the next +section. + +The proposed patches are in the +`repair temporary files +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_ +series. + +Atomic Extent Swapping +---------------------- + +Once repair builds a temporary file with a new data structure written into +it, it must commit the new changes into the existing file. 
+It is not possible to swap the inumbers of two files, so instead the new +metadata must replace the old. +This suggests the need for the ability to swap extents, but the existing extent +swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient +for online repair because: + +a. When the reverse-mapping btree is enabled, the swap code must keep the + reverse mapping information up to date with every exchange of mappings. + Therefore, it can only exchange one mapping per transaction, and each + transaction is independent. + +b. Reverse-mapping is critical for the operation of online fsck, so the old + defragmentation code (which swapped entire extent forks in a single + operation) is not useful here. + +c. Defragmentation is assumed to occur between two files with identical + contents. + For this use case, an incomplete exchange will not result in a user-visible + change in file contents, even if the operation is interrupted. + +d. Online repair needs to swap the contents of two files that are by definition + *not* identical. + For directory and xattr repairs, the user-visible contents might be the + same, but the contents of individual blocks may be very different. + +e. Old blocks in the file may be cross-linked with another structure and must + not reappear if the system goes down mid-repair. + +These problems are overcome by creating a new deferred operation and a new type +of log intent item to track the progress of an operation to exchange two file +ranges. +The new deferred operation type chains together the same transactions used by +the reverse-mapping extent swap code. +The new log item records the progress of the exchange to ensure that once an +exchange begins, it will always run to completion, even there are +interruptions. +The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag +in the superblock protects these new log item records from being replayed on +old kernels. + +The proposed patchset is the +`atomic extent swap +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_ +series. + ++--------------------------------------------------------------------------+ +| **Sidebar: Using Log-Incompatible Feature Flags** | ++--------------------------------------------------------------------------+ +| Starting with XFS v5, the superblock contains a | +| ``sb_features_log_incompat`` field to indicate that the log contains | +| records that might not readable by all kernels that could mount this | +| filesystem. | +| In short, log incompat features protect the log contents against kernels | +| that will not understand the contents. | +| Unlike the other superblock feature bits, log incompat bits are | +| ephemeral because an empty (clean) log does not need protection. | +| The log cleans itself after its contents have been committed into the | +| filesystem, either as part of an unmount or because the system is | +| otherwise idle. | +| Because upper level code can be working on a transaction at the same | +| time that the log cleans itself, it is necessary for upper level code to | +| communicate to the log when it is going to use a log incompatible | +| feature. | +| | +| The log coordinates access to incompatible features through the use of | +| one ``struct rw_semaphore`` for each feature. | +| The log cleaning code tries to take this rwsem in exclusive mode to | +| clear the bit; if the lock attempt fails, the feature bit remains set. 
| +| Filesystem code signals its intention to use a log incompat feature in a | +| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem | +| in shared mode. | +| The code supporting a log incompat feature should create wrapper | +| functions to obtain the log feature and call | +| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary | +| superblock. | +| The superblock update is performed transactionally, so the wrapper to | +| obtain log assistance must be called just prior to the creation of the | +| transaction that uses the functionality. | +| For a file operation, this step must happen after taking the IOLOCK | +| and the MMAPLOCK, but before allocating the transaction. | +| When the transaction is complete, the ``xlog_drop_incompat_feat`` | +| function is called to release the feature. | +| The feature bit will not be cleared from the superblock until the log | +| becomes clean. | +| | +| Log-assisted extended attribute updates and atomic extent swaps both use | +| log incompat features and provide convenience wrappers around the | +| functionality. | ++--------------------------------------------------------------------------+ + +Mechanics of an Atomic Extent Swap +`````````````````````````````````` + +Swapping entire file forks is a complex task. +The goal is to exchange all file fork mappings between two file fork offset +ranges. +There are likely to be many extent mappings in each fork, and the edges of +the mappings aren't necessarily aligned. +Furthermore, there may be other updates that need to happen after the swap, +such as exchanging file sizes, inode flags, or conversion of fork data to local +format. +This is roughly the format of the new deferred extent swap work item: + +.. code-block:: c + + struct xfs_swapext_intent { + /* Inodes participating in the operation. */ + struct xfs_inode *sxi_ip1; + struct xfs_inode *sxi_ip2; + + /* File offset range information. */ + xfs_fileoff_t sxi_startoff1; + xfs_fileoff_t sxi_startoff2; + xfs_filblks_t sxi_blockcount; + + /* Set these file sizes after the operation, unless negative. */ + xfs_fsize_t sxi_isize1; + xfs_fsize_t sxi_isize2; + + /* XFS_SWAP_EXT_* log operation flags */ + uint64_t sxi_flags; + }; + +The new log intent item contains enough information to track two logical fork +offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2, +blockcount)``. +Each step of a swap operation exchanges the largest file range mapping possible +from one file to the other. +After each step in the swap operation, the two startoff fields are incremented +and the blockcount field is decremented to reflect the progress made. +The flags field captures behavioral parameters such as swapping the attr fork +instead of the data fork and other work to be done after the extent swap. +The two isize fields are used to swap the file size at the end of the operation +if the file data fork is the target of the swap operation. + +When the extent swap is initiated, the sequence of operations is as follows: + +1. Create a deferred work item for the extent swap. + At the start, it should contain the entirety of the file ranges to be + swapped. + +2. Call ``xfs_defer_finish`` to process the exchange. + This is encapsulated in ``xrep_tempswap_contents`` for scrub operations. + This will log an extent swap intent item to the transaction for the deferred + extent swap work item. + +3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero, + + a. 
Read the block maps of both file ranges starting at ``sxi_startoff1`` and
      ``sxi_startoff2``, respectively, and compute the longest extent that can
      be swapped in a single step.
      This is the minimum of the two ``br_blockcount`` values in the mappings.
      Keep advancing through the file forks until at least one of the mappings
      contains written blocks.
      Mutual holes, unwritten extents, and extent mappings to the same physical
      space are not exchanged.

      For the next few steps, this document will refer to the mapping that came
      from file 1 as "map1", and the mapping that came from file 2 as "map2".

   b. Create a deferred block mapping update to unmap map1 from file 1.

   c. Create a deferred block mapping update to unmap map2 from file 2.

   d. Create a deferred block mapping update to map map1 into file 2.

   e. Create a deferred block mapping update to map map2 into file 1.

   f. Log the block, quota, and extent count updates for both files.

   g. Extend the ondisk size of either file if necessary.

   h. Log an extent swap done log item for the extent swap intent log item
      that was read at the start of step 3.

   i. Compute the amount of file range that has just been covered.
      This quantity is ``(map1.br_startoff + map1.br_blockcount -
      sxi_startoff1)``, because step 3a could have skipped holes.

   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
      by the number of blocks computed in the previous step, and decrease
      ``sxi_blockcount`` by the same quantity.
      This advances the cursor.

   k. Log a new extent swap intent log item reflecting the advanced state of
      the work item.

   l. Return the proper error code (EAGAIN) to the deferred operation manager
      to inform it that there is more work to be done.
      The operation manager completes the deferred work in steps 3b-3e before
      moving back to the start of step 3.

4. Perform any post-processing.
   This will be discussed in more detail in subsequent sections.

If the filesystem goes down in the middle of an operation, log recovery will
find the most recent unfinished extent swap log intent item and restart from
there.
This is how extent swapping guarantees that an outside observer will either see
the old broken structure or the new one, and never a mishmash of both.

Preparation for Extent Swapping
```````````````````````````````

There are a few things that need to be taken care of before initiating an
atomic extent swap operation.
First, regular files require the page cache to be flushed to disk before the
operation begins, and directio writes to be quiesced.
Like any filesystem operation, extent swapping must determine the maximum
amount of disk space and quota that can be consumed on behalf of both files in
the operation, and reserve that quantity of resources to avoid an unrecoverable
out of space failure once it starts dirtying metadata.
The preparation step scans the ranges of both files to estimate:

- Data device blocks needed to handle the repeated updates to the fork
  mappings.
- Change in data and realtime block counts for both files.
- Increase in quota usage for both files, if the two files do not share the
  same set of quota ids.
- The number of extent mappings that will be added to each file.
- Whether or not there are partially written realtime extents.
  User programs must never be able to access a realtime file extent that maps
  to different extents on the realtime volume, which could happen if the
  operation fails to run to completion.
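
One way to reason about these estimates is to count the exchange steps that the
loop in the previous section would perform for a given pair of fork mappings.
The sketch below is a simplified userspace model, not the kernel
implementation: the ``mapping`` structure and both helper functions are
invented for illustration, holes and unwritten extents are ignored, and only
the ``br_startoff`` and ``br_blockcount`` names mirror real fields.

.. code-block:: c

    /* Simplified stand-in for a file fork mapping. */
    struct mapping {
            unsigned long long br_startoff;   /* file offset, in blocks */
            unsigned long long br_blockcount; /* length, in blocks */
    };

    /* Find the mapping containing @off; assumes @maps covers the range. */
    static const struct mapping *
    find_mapping(const struct mapping *maps, int nr, unsigned long long off)
    {
            for (int i = 0; i < nr; i++)
                    if (off >= maps[i].br_startoff &&
                        off < maps[i].br_startoff + maps[i].br_blockcount)
                            return &maps[i];
            return NULL;
    }

    /*
     * Count the exchange steps needed to swap @len blocks between two files,
     * mirroring the "minimum of the two br_blockcounts" rule from step 3a.
     */
    static unsigned int
    count_swap_steps(const struct mapping *f1, int nr1,
                     const struct mapping *f2, int nr2,
                     unsigned long long startoff1,
                     unsigned long long startoff2,
                     unsigned long long len)
    {
            unsigned int steps = 0;

            while (len > 0) {
                    const struct mapping *m1 = find_mapping(f1, nr1, startoff1);
                    const struct mapping *m2 = find_mapping(f2, nr2, startoff2);
                    unsigned long long left1, left2, step;

                    if (!m1 || !m2)
                            break;

                    /* Largest quantum both mappings can cover from here. */
                    left1 = m1->br_startoff + m1->br_blockcount - startoff1;
                    left2 = m2->br_startoff + m2->br_blockcount - startoff2;
                    step = left1 < left2 ? left1 : left2;
                    if (step > len)
                            step = len;

                    /* Advance the cursor, as in steps 3i and 3j. */
                    startoff1 += step;
                    startoff2 += step;
                    len -= step;
                    steps++;
            }

            return steps;
    }

A count like this feeds naturally into the estimates listed above, since every
step dirties the fork mappings of both files and logs another intent item.
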
+ +The need for precise estimation increases the run time of the swap operation, +but it is very important to maintain correct accounting. +The filesystem must not run completely out of free space, nor can the extent +swap ever add more extent mappings to a fork than it can support. +Regular users are required to abide the quota limits, though metadata repairs +may exceed quota to resolve inconsistent metadata elsewhere. + +Special Features for Swapping Metadata File Extents +``````````````````````````````````````````````````` + +Extended attributes, symbolic links, and directories can set the fork format to +"local" and treat the fork as a literal area for data storage. +Metadata repairs must take extra steps to support these cases: + +- If both forks are in local format and the fork areas are large enough, the + swap is performed by copying the incore fork contents, logging both forks, + and committing. + The atomic extent swap mechanism is not necessary, since this can be done + with a single transaction. + +- If both forks map blocks, then the regular atomic extent swap is used. + +- Otherwise, only one fork is in local format. + The contents of the local format fork are converted to a block to perform the + swap. + The conversion to block format must be done in the same transaction that + logs the initial extent swap intent log item. + The regular atomic extent swap is used to exchange the mappings. + Special flags are set on the swap operation so that the transaction can be + rolled one more time to convert the second file's fork back to local format + so that the second file will be ready to go as soon as the ILOCK is dropped. + +Extended attributes and directories stamp the owning inode into every block, +but the buffer verifiers do not actually check the inode number! +Although there is no verification, it is still important to maintain +referential integrity, so prior to performing the extent swap, online repair +builds every block in the new data structure with the owner field of the file +being repaired. + +After a successful swap operation, the repair operation must reap the old fork +blocks by processing each fork mapping through the standard :ref:`file extent +reaping <reaping>` mechanism that is done post-repair. +If the filesystem should go down during the reap part of the repair, the +iunlink processing at the end of recovery will free both the temporary file and +whatever blocks were not reaped. +However, this iunlink processing omits the cross-link detection of online +repair, and is not completely foolproof. + +Swapping Temporary File Extents +``````````````````````````````` + +To repair a metadata file, online repair proceeds as follows: + +1. Create a temporary repair file. + +2. Use the staging data to write out new contents into the temporary repair + file. + The same fork must be written to as is being repaired. + +3. Commit the scrub transaction, since the swap estimation step must be + completed before transaction reservations are made. + +4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with + the appropriate resource reservations, locks, and fill out a ``struct + xfs_swapext_req`` with the details of the swap operation. + +5. Call ``xrep_tempswap_contents`` to swap the contents. + +6. Commit the transaction to complete the repair. + +.. 
_rtsummary: + +Case Study: Repairing the Realtime Summary File +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In the "realtime" section of an XFS filesystem, free space is tracked via a +bitmap, similar to Unix FFS. +Each bit in the bitmap represents one realtime extent, which is a multiple of +the filesystem block size between 4KiB and 1GiB in size. +The realtime summary file indexes the number of free extents of a given size to +the offset of the block within the realtime free space bitmap where those free +extents begin. +In other words, the summary file helps the allocator find free extents by +length, similar to what the free space by count (cntbt) btree does for the data +section. + +The summary file itself is a flat file (with no block headers or checksums!) +partitioned into ``log2(total rt extents)`` sections containing enough 32-bit +counters to match the number of blocks in the rt bitmap. +Each counter records the number of free extents that start in that bitmap block +and can satisfy a power-of-two allocation request. + +To check the summary file against the bitmap: + +1. Take the ILOCK of both the realtime bitmap and summary files. + +2. For each free space extent recorded in the bitmap: + + a. Compute the position in the summary file that contains a counter that + represents this free extent. + + b. Read the counter from the xfile. + + c. Increment it, and write it back to the xfile. + +3. Compare the contents of the xfile against the ondisk file. + +To repair the summary file, write the xfile contents into the temporary file +and use atomic extent swap to commit the new contents. +The temporary file is then reaped. + +The proposed patchset is the +`realtime summary repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_ +series. + +Case Study: Salvaging Extended Attributes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In XFS, extended attributes are implemented as a namespaced name-value store. +Values are limited in size to 64KiB, but there is no limit in the number of +names. +The attribute fork is unpartitioned, which means that the root of the attribute +structure is always in logical block zero, but attribute leaf blocks, dabtree +index blocks, and remote value blocks are intermixed. +Attribute leaf blocks contain variable-sized records that associate +user-provided names with the user-provided values. +Values larger than a block are allocated separate extents and written there. +If the leaf information expands beyond a single block, a directory/attribute +btree (``dabtree``) is created to map hashes of attribute names to entries +for fast lookup. + +Salvaging extended attributes is done as follows: + +1. Walk the attr fork mappings of the file being repaired to find the attribute + leaf blocks. + When one is found, + + a. Walk the attr leaf block to find candidate keys. + When one is found, + + 1. Check the name for problems, and ignore the name if there are. + + 2. Retrieve the value. + If that succeeds, add the name and value to the staging xfarray and + xfblob. + +2. If the memory usage of the xfarray and xfblob exceed a certain amount of + memory or there are no more attr fork blocks to examine, unlock the file and + add the staged extended attributes to the temporary file. + +3. Use atomic extent swapping to exchange the new and old extended attribute + structures. + The old attribute blocks are now attached to the temporary file. + +4. Reap the temporary file. 
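
The staging structures in steps 1 and 2 can be pictured as an append-only blob
of name and value bytes plus a flat array of fixed-size records that index into
it.
The following sketch is a rough userspace approximation of that arrangement;
the structure and function names are invented for illustration and do not
correspond to the actual xfarray or xfblob interfaces, and the fixed-capacity
buffers stand in for the memory threshold described in step 2.

.. code-block:: c

    #include <stddef.h>
    #include <string.h>

    /* One salvaged attribute; offsets point into the blob area. */
    struct stashed_attr {
            size_t name_off, name_len;
            size_t value_off, value_len;
    };

    /* Append-only byte store plus a flat record index. */
    struct attr_stage {
            char *blob;                     /* name and value bytes */
            size_t blob_used, blob_cap;
            struct stashed_attr *recs;
            size_t nr_recs, cap_recs;
    };

    /* Copy @len bytes into the blob and return their offset (a "cookie"). */
    static int blob_append(struct attr_stage *st, const void *p, size_t len,
                           size_t *cookie)
    {
            if (st->blob_used + len > st->blob_cap)
                    return -1;      /* caller must flush and retry */
            memcpy(st->blob + st->blob_used, p, len);
            *cookie = st->blob_used;
            st->blob_used += len;
            return 0;
    }

    /* Stash one candidate name/value pair found in an attr leaf block. */
    static int stage_attr(struct attr_stage *st,
                          const char *name, size_t namelen,
                          const void *value, size_t valuelen)
    {
            struct stashed_attr rec = {
                    .name_len  = namelen,
                    .value_len = valuelen,
            };

            if (st->nr_recs == st->cap_recs ||
                blob_append(st, name, namelen, &rec.name_off) ||
                blob_append(st, value, valuelen, &rec.value_off))
                    return -1;      /* out of staging space */
            st->recs[st->nr_recs++] = rec;
            return 0;
    }

A failed stash corresponds to the threshold in step 2: the caller unlocks the
file, writes the staged attributes into the temporary file, empties the staging
area, and resumes the walk.
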
+ +The proposed patchset is the +`extended attribute repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ +series. + +Fixing Directories +------------------ + +Fixing directories is difficult with currently available filesystem features, +since directory entries are not redundant. +The offline repair tool scans all inodes to find files with nonzero link count, +and then it scans all directories to establish parentage of those linked files. +Damaged files and directories are zapped, and files with no parent are +moved to the ``/lost+found`` directory. +It does not try to salvage anything. + +The best that online repair can do at this time is to read directory data +blocks and salvage any dirents that look plausible, correct link counts, and +move orphans back into the directory tree. +The salvage process is discussed in the case study at the end of this section. +The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts +and moving orphans to the ``/lost+found`` directory. + +Case Study: Salvaging Directories +````````````````````````````````` + +Unlike extended attributes, directory blocks are all the same size, so +salvaging directories is straightforward: + +1. Find the parent of the directory. + If the dotdot entry is not unreadable, try to confirm that the alleged + parent has a child entry pointing back to the directory being repaired. + Otherwise, walk the filesystem to find it. + +2. Walk the first partition of data fork of the directory to find the directory + entry data blocks. + When one is found, + + a. Walk the directory data block to find candidate entries. + When an entry is found: + + i. Check the name for problems, and ignore the name if there are. + + ii. Retrieve the inumber and grab the inode. + If that succeeds, add the name, inode number, and file type to the + staging xfarray and xblob. + +3. If the memory usage of the xfarray and xfblob exceed a certain amount of + memory or there are no more directory data blocks to examine, unlock the + directory and add the staged dirents into the temporary directory. + Truncate the staging files. + +4. Use atomic extent swapping to exchange the new and old directory structures. + The old directory blocks are now attached to the temporary file. + +5. Reap the temporary file. + +**Future Work Question**: Should repair revalidate the dentry cache when +rebuilding a directory? + +*Answer*: Yes, it should. + +In theory it is necessary to scan all dentry cache entries for a directory to +ensure that one of the following apply: + +1. The cached dentry reflects an ondisk dirent in the new directory. + +2. The cached dentry no longer has a corresponding ondisk dirent in the new + directory and the dentry can be purged from the cache. + +3. The cached dentry no longer has an ondisk dirent but the dentry cannot be + purged. + This is the problem case. + +Unfortunately, the current dentry cache design doesn't provide a means to walk +every child dentry of a specific directory, which makes this a hard problem. +There is no known solution. + +The proposed patchset is the +`directory repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_ +series. + +Parent Pointers +``````````````` + +A parent pointer is a piece of file metadata that enables a user to locate the +file's parent directory without having to traverse the directory tree from the +root. 
+Without them, reconstruction of directory trees is hindered in much the same +way that the historic lack of reverse space mapping information once hindered +reconstruction of filesystem space metadata. +The parent pointer feature, however, makes total directory reconstruction +possible. + +XFS parent pointers include the dirent name and location of the entry within +the parent directory. +In other words, child files use extended attributes to store pointers to +parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. +The directory checking process can be strengthened to ensure that the target of +each dirent also contains a parent pointer pointing back to the dirent. +Likewise, each parent pointer can be checked by ensuring that the target of +each parent pointer is a directory and that it contains a dirent matching +the parent pointer. +Both online and offline repair can use this strategy. + +**Note**: The ondisk format of parent pointers is not yet finalized. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| Directory parent pointers were first proposed as an XFS feature more | +| than a decade ago by SGI. | +| Each link from a parent directory to a child file is mirrored with an | +| extended attribute in the child that could be used to identify the | +| parent directory. | +| Unfortunately, this early implementation had major shortcomings and was | +| never merged into Linux XFS: | +| | +| 1. The XFS codebase of the late 2000s did not have the infrastructure to | +| enforce strong referential integrity in the directory tree. | +| It did not guarantee that a change in a forward link would always be | +| followed up with the corresponding change to the reverse links. | +| | +| 2. Referential integrity was not integrated into offline repair. | +| Checking and repairs were performed on mounted filesystems without | +| taking any kernel or inode locks to coordinate access. | +| It is not clear how this actually worked properly. | +| | +| 3. The extended attribute did not record the name of the directory entry | +| in the parent, so the SGI parent pointer implementation cannot be | +| used to reconnect the directory tree. | +| | +| 4. Extended attribute forks only support 65,536 extents, which means | +| that parent pointer attribute creation is likely to fail at some | +| point before the maximum file link count is achieved. | +| | +| The original parent pointer design was too unstable for something like | +| a file system repair to depend on. | +| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a | +| second implementation that solves all shortcomings of the first. | +| During 2022, Allison introduced log intent items to track physical | +| manipulations of the extended attribute structures. | +| This solves the referential integrity problem by making it possible to | +| commit a dirent update and a parent pointer update in the same | +| transaction. | +| Chandan increased the maximum extent counts of both data and attribute | +| forks, thereby ensuring that the extended attribute structure can grow | +| to handle the maximum hardlink count of any file. 
| ++--------------------------------------------------------------------------+ + +Case Study: Repairing Directories with Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and +a :ref:`directory entry live update hook <liveupdate>` as follows: + +1. Set up a temporary directory for generating the new directory structure, + an xfblob for storing entry names, and an xfarray for stashing directory + updates. + +2. Set up an inode scanner and hook into the directory entry code to receive + updates on directory operations. + +3. For each parent pointer found in each file scanned, decide if the parent + pointer references the directory of interest. + If so: + + a. Stash an addname entry for this dirent in the xfarray for later. + + b. When finished scanning that file, flush the stashed updates to the + temporary directory. + +4. For each live directory update received via the hook, decide if the child + has already been scanned. + If so: + + a. Stash an addname or removename entry for this dirent update in the + xfarray for later. + We cannot write directly to the temporary directory because hook + functions are not allowed to modify filesystem metadata. + Instead, we stash updates in the xfarray and rely on the scanner thread + to apply the stashed updates to the temporary directory. + +5. When the scan is complete, atomically swap the contents of the temporary + directory and the directory being repaired. + The temporary directory now contains the damaged directory structure. + +6. Reap the temporary directory. + +7. Update the dirent position field of parent pointers as necessary. + This may require the queuing of a substantial number of xattr log intent + items. + +The proposed patchset is the +`parent pointers directory repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_ +series. + +**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields +match in the reconstructed directory? + +*Answer*: There are a few ways to solve this problem: + +1. The field could be designated advisory, since the other three values are + sufficient to find the entry in the parent. + However, this makes indexed key lookup impossible while repairs are ongoing. + +2. We could allow creating directory entries at specified offsets, which solves + the referential integrity problem but runs the risk that dirent creation + will fail due to conflicts with the free space in the directory. + + These conflicts could be resolved by appending the directory entry and + amending the xattr code to support updating an xattr key and reindexing the + dabtree, though this would have to be performed with the parent directory + still locked. + +3. Same as above, but remove the old parent pointer entry and add a new one + atomically. + +4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``, + which would provide the attr name uniqueness that we require, without + forcing repair code to update the dirent position. + Unfortunately, this requires changes to the xattr code to support attr + names as long as 263 bytes. + +5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → + (name, parent_gen)``. + If the hash is sufficiently resistant to collisions (e.g. sha256) then + this should provide the attr name uniqueness that we require. + Names shorter than 247 bytes could be stored directly. 
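
To illustrate what option 5 might look like, here is a sketch of such a key and
value.
The ondisk format of parent pointers is not finalized, so every detail below is
hypothetical; in particular, a real format would use a collision-resistant hash
such as sha256, whereas this sketch substitutes a trivial placeholder hash.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical parent pointer attr key and value from option 5:
     * (parent_inum, hash(name)) -> (name, parent_gen).
     */
    struct pptr_key {
            uint64_t parent_inum;
            uint64_t name_hash;     /* hash of the dirent name */
    };

    struct pptr_value {
            uint32_t parent_gen;
            uint16_t namelen;
            uint8_t name[];         /* dirent name bytes */
    };

    /*
     * Placeholder 64-bit FNV-1a hash.  A production format would use
     * something much stronger to make collisions implausible.
     */
    static uint64_t pptr_hash_name(const uint8_t *name, size_t namelen)
    {
            uint64_t hash = 0xcbf29ce484222325ULL;

            for (size_t i = 0; i < namelen; i++) {
                    hash ^= name[i];
                    hash *= 0x100000001b3ULL;
            }
            return hash;
    }

    /* Build the attr key for one (parent, name) link. */
    static void pptr_make_key(struct pptr_key *key, uint64_t parent_inum,
                              const uint8_t *name, size_t namelen)
    {
            key->parent_inum = parent_inum;
            key->name_hash = pptr_hash_name(name, namelen);
    }

Hashing the name keeps the attr key a fixed size regardless of the name length,
which is what would provide the required uniqueness without recording the
dirent position.
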
+ +Discussion is ongoing under the `parent pointers patch deluge +<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_. + +Case Study: Repairing Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Online reconstruction of a file's parent pointer information works similarly to +directory reconstruction: + +1. Set up a temporary file for generating a new extended attribute structure, + an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for + stashing parent pointer updates. + +2. Set up an inode scanner and hook into the directory entry code to receive + updates on directory operations. + +3. For each directory entry found in each directory scanned, decide if the + dirent references the file of interest. + If so: + + a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray + for later. + + b. When finished scanning the directory, flush the stashed updates to the + temporary directory. + +4. For each live directory update received via the hook, decide if the parent + has already been scanned. + If so: + + a. Stash an addpptr or removepptr entry for this dirent update in the + xfarray for later. + We cannot write parent pointers directly to the temporary file because + hook functions are not allowed to modify filesystem metadata. + Instead, we stash updates in the xfarray and rely on the scanner thread + to apply the stashed parent pointer updates to the temporary file. + +5. Copy all non-parent pointer extended attributes to the temporary file. + +6. When the scan is complete, atomically swap the attribute fork of the + temporary file and the file being repaired. + The temporary file now contains the damaged extended attribute structure. + +7. Reap the temporary file. + +The proposed patchset is the +`parent pointers repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_ +series. + +Digression: Offline Checking of Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Examining parent pointers in offline repair works differently because corrupt +files are erased long before directory tree connectivity checks are performed. +Parent pointer checks are therefore a second pass to be added to the existing +connectivity checks: + +1. After the set of surviving files has been established (i.e. phase 6), + walk the surviving directories of each AG in the filesystem. + This is already performed as part of the connectivity checks. + +2. For each directory entry found, record the name in an xfblob, and store + ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a + per-AG in-memory slab. + +3. For each AG in the filesystem, + + a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and + dirent_pos. + + b. For each inode in the AG, + + 1. Scan the inode for parent pointers. + Record the names in a per-file xfblob, and store ``(parent_inum, + parent_gen, dirent_pos)`` tuples in a per-file slab. + + 2. Sort the per-file tuples in order of parent_inum, and dirent_pos. + + 3. Position one slab cursor at the start of the inode's records in the + per-AG tuple slab. + This should be trivial since the per-AG tuples are in child inumber + order. + + 4. Position a second slab cursor at the start of the per-file tuple slab. + + 5. Iterate the two cursors in lockstep, comparing the parent_ino and + dirent_pos fields of the records under each cursor. + + a. Tuples in the per-AG list but not the per-file list are missing and + need to be written to the inode. + + b. 
Tuples in the per-file list but not the per-AG list are dangling + and need to be removed from the inode. + + c. For tuples in both lists, update the parent_gen and name components + of the parent pointer if necessary. + +4. Move on to examining link counts, as we do today. + +The proposed patchset is the +`offline parent pointers repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_ +series. + +Rebuilding directories from parent pointers in offline repair is very +challenging because it currently uses a single-pass scan of the filesystem +during phase 3 to decide which files are corrupt enough to be zapped. +This scan would have to be converted into a multi-pass scan: + +1. The first pass of the scan zaps corrupt inodes, forks, and attributes + much as it does now. + Corrupt directories are noted but not zapped. + +2. The next pass records parent pointers pointing to the directories noted + as being corrupt in the first pass. + This second pass may have to happen after the phase 4 scan for duplicate + blocks, if phase 4 is also capable of zapping directories. + +3. The third pass resets corrupt directories to an empty shortform directory. + Free space metadata has not been ensured yet, so repair cannot yet use the + directory building code in libxfs. + +4. At the start of phase 6, space metadata have been rebuilt. + Use the parent pointer information recorded during step 2 to reconstruct + the dirents and add them to the now-empty directories. + +This code has not yet been constructed. + +.. _orphanage: + +The Orphanage +------------- + +Filesystems present files as a directed, and hopefully acyclic, graph. +In other words, a tree. +The root of the filesystem is a directory, and each entry in a directory points +downwards either to more subdirectories or to non-directory files. +Unfortunately, a disruption in the directory graph pointers result in a +disconnected graph, which makes files impossible to access via regular path +resolution. + +Without parent pointers, the directory parent pointer online scrub code can +detect a dotdot entry pointing to a parent directory that doesn't have a link +back to the child directory and the file link count checker can detect a file +that isn't pointed to by any directory in the filesystem. +If such a file has a positive link count, the file is an orphan. + +With parent pointers, directories can be rebuilt by scanning parent pointers +and parent pointers can be rebuilt by scanning directories. +This should reduce the incidence of files ending up in ``/lost+found``. + +When orphans are found, they should be reconnected to the directory tree. +Offline fsck solves the problem by creating a directory ``/lost+found`` to +serve as an orphanage, and linking orphan files into the orphanage by using the +inumber as the name. +Reparenting a file to the orphanage does not reset any of its permissions or +ACLs. + +This process is more involved in the kernel than it is in userspace. +The directory and file link count repair setup functions must use the regular +VFS mechanisms to create the orphanage directory with all the necessary +security attributes and dentry cache entries, just like a regular directory +tree modification. + +Orphaned files are adopted by the orphanage as follows: + +1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function + to try to ensure that the lost and found directory actually exists. + This also attaches the orphanage directory to the scrub context. + +2. 
If the decision is made to reconnect a file, take the IOLOCK of both the
   orphanage and the file being reattached.
   The ``xrep_orphanage_iolock_two`` function follows the inode locking
   strategy discussed earlier.

3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
   to compute the new name in the orphanage and the block reservation required.

4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
   transaction.

5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
   and found, and update the kernel dentry cache.

The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.

6. Userspace Algorithms and Data Structures
===========================================

This section discusses the key algorithms and data structures of the userspace
program, ``xfs_scrub``, that provide the ability to drive metadata checks and
repairs in the kernel, verify file data, and look for other potential problems.

.. _scrubcheck:

Checking Metadata
-----------------

Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
That structure follows naturally from the data dependencies designed into the
filesystem from its beginnings in 1993.
In XFS, there are several groups of metadata dependencies:

a. Filesystem summary counts depend on consistency within the inode indices,
   the allocation group space btrees, and the realtime volume space
   information.

b. Quota resource counts depend on consistency within the quota file data
   forks, inode indices, inode records, and the forks of every file on the
   system.

c. The naming hierarchy depends on consistency within the directory and
   extended attribute structures.
   This includes file link counts.

d. Directories, extended attributes, and file data depend on consistency within
   the file forks that map directory and extended attribute data to physical
   storage media.

e. The file forks depend on consistency within inode records and the space
   metadata indices of the allocation groups and the realtime volume.
   This includes quota and realtime metadata files.

f. Inode records depend on consistency within the inode metadata indices.

g. Realtime space metadata depend on the inode records and data forks of the
   realtime metadata inodes.

h. The allocation group metadata indices (free space, inodes, reference count,
   and reverse mapping btrees) depend on consistency within the AG headers and
   between all the AG metadata btrees.

i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
   for online fsck functionality.

Therefore, a metadata dependency graph is a convenient way to schedule checking
operations in the ``xfs_scrub`` program:

- Phase 1 checks that the provided path maps to an XFS filesystem and detects
  the kernel's scrubbing abilities, which validates group (i).

- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.

- Phase 3 scans inodes in parallel.
  For each inode, groups (f), (e), and (d) are checked, in that order.

- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
  may run reliably.

- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
  to checking names.
+

- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
  to read them, and to report which blocks of which files are affected.

- Phase 7 checks group (a), having validated everything else.

Notice that the data dependencies between groups are enforced by the structure
of the program flow.

Parallel Inode Scans
--------------------

An XFS filesystem can easily contain hundreds of millions of inodes.
Given that XFS targets installations with large high-performance storage,
it is desirable to scrub inodes in parallel to minimize runtime, particularly
if the program has been invoked manually from a command line.
This requires careful scheduling to keep the threads as evenly loaded as
possible.

Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
workqueue and scheduled a single workqueue item per AG.
Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
information to construct file handles.
The file handle was then passed to a function to generate scrub items for each
metadata object of each inode.
This simple algorithm leads to thread balancing problems in phase 3 if the
filesystem contains one AG with a few large sparse files and the rest of the
AGs contain many smaller files.
The inode scan dispatch function was not sufficiently granular; it should have
been dispatching at the level of individual inodes, or, to constrain memory
consumption, inode btree records.

Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
avoid this problem with ease by adding a second workqueue.
Just like before, the first workqueue is seeded with one workqueue item per AG,
and it uses INUMBERS to find inode btree chunks.
The second workqueue, however, is configured with an upper bound on the number
of items that can be waiting to be run.
Each inode btree chunk found by the first workqueue's workers is queued to the
second workqueue, and it is this second workqueue that queries BULKSTAT,
creates a file handle, and passes it to a function to generate scrub items for
each metadata object of each inode.
If the second workqueue is too full, the workqueue add function blocks the
first workqueue's workers until the backlog eases.
This doesn't completely solve the balancing problem, but reduces it enough to
move on to more pressing issues.

The proposed patchsets are the scrub
`performance tweaks
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
and the
`inode scan rebalance
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
series.

.. _scrubrepair:

Scheduling Repairs
------------------

During phase 2, corruptions and inconsistencies reported in any AGI header or
inode btree are repaired immediately, because phase 3 relies on proper
functioning of the inode indices to find inodes to scan.
Failed repairs are rescheduled to phase 4.
Problems reported in any other space metadata are deferred to phase 4.
Optimization opportunities are always deferred to phase 4, no matter their
origin.

During phase 3, corruptions and inconsistencies reported in any part of a
file's metadata are repaired immediately if all space metadata were validated
during phase 2.
Repairs that fail or cannot be repaired immediately are scheduled for phase 4.
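
Expressed as code, this scheduling policy reduces to a small classification
function.
The sketch below is illustrative only; the enumeration, structure, and field
names are invented and do not correspond to actual ``xfs_scrub`` internals.

.. code-block:: c

    enum repair_disposition {
            REPAIR_NOW,             /* attempt the repair immediately */
            DEFER_TO_PHASE4,        /* queue the repair for phase 4 */
    };

    /* Invented description of one scrub finding. */
    struct finding {
            int phase;              /* phase that produced the finding */
            int is_optimization;    /* optimization, not a corruption */
            int is_inode_index;     /* AGI header or inode btree */
            int space_metadata_ok;  /* did phase 2 validate space metadata? */
    };

    static enum repair_disposition schedule_repair(const struct finding *f)
    {
            /* Optimizations are never urgent. */
            if (f->is_optimization)
                    return DEFER_TO_PHASE4;

            /* Phase 3 cannot find inodes without working inode indices. */
            if (f->phase == 2)
                    return f->is_inode_index ? REPAIR_NOW : DEFER_TO_PHASE4;

            /* File metadata is fixed in place only on a sound foundation. */
            if (f->phase == 3)
                    return f->space_metadata_ok ? REPAIR_NOW : DEFER_TO_PHASE4;

            return DEFER_TO_PHASE4;
    }
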
+ +In the original design of ``xfs_scrub``, it was thought that repairs would be +so infrequent that the ``struct xfs_scrub_metadata`` objects used to +communicate with the kernel could also be used as the primary object to +schedule repairs. +With recent increases in the number of optimizations possible for a given +filesystem object, it became much more memory-efficient to track all eligible +repairs for a given filesystem object with a single repair item. +Each repair item represents a single lockable object -- AGs, metadata files, +individual inodes, or a class of summary information. + +Phase 4 is responsible for scheduling a lot of repair work in as quick a +manner as is practical. +The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which +means that ``xfs_scrub`` must try to complete the repair work scheduled by +phase 2 before trying repair work scheduled by phase 3. +The repair process is as follows: + +1. Start a round of repair with a workqueue and enough workers to keep the CPUs + as busy as the user desires. + + a. For each repair item queued by phase 2, + + i. Ask the kernel to repair everything listed in the repair item for a + given filesystem object. + + ii. Make a note if the kernel made any progress in reducing the number + of repairs needed for this object. + + iii. If the object no longer requires repairs, revalidate all metadata + associated with this object. + If the revalidation succeeds, drop the repair item. + If not, requeue the item for more repairs. + + b. If any repairs were made, jump back to 1a to retry all the phase 2 items. + + c. For each repair item queued by phase 3, + + i. Ask the kernel to repair everything listed in the repair item for a + given filesystem object. + + ii. Make a note if the kernel made any progress in reducing the number + of repairs needed for this object. + + iii. If the object no longer requires repairs, revalidate all metadata + associated with this object. + If the revalidation succeeds, drop the repair item. + If not, requeue the item for more repairs. + + d. If any repairs were made, jump back to 1c to retry all the phase 3 items. + +2. If step 1 made any repair progress of any kind, jump back to step 1 to start + another round of repair. + +3. If there are items left to repair, run them all serially one more time. + Complain if the repairs were not successful, since this is the last chance + to repair anything. + +Corruptions and inconsistencies encountered during phases 5 and 7 are repaired +immediately. +Corrupt file data blocks reported by phase 6 cannot be recovered by the +filesystem. + +The proposed patchsets are the +`repair warning improvements +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_, +refactoring of the +`repair data dependency +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_ +and +`object tracking +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_, +and the +`repair scheduling +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_ +improvement series. + +Checking Names for Confusable Unicode Sequences +----------------------------------------------- + +If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of +phase 4, it moves on to phase 5, which checks for suspicious looking names in +the filesystem. 
+These names consist of the filesystem label, names in directory entries, and +the names of extended attributes. +Like most Unix filesystems, XFS imposes the sparest of constraints on the +contents of a name: + +- Slashes and null bytes are not allowed in directory entries. + +- Null bytes are not allowed in userspace-visible extended attributes. + +- Null bytes are not allowed in the filesystem label. + +Directory entries and attribute keys store the length of the name explicitly +ondisk, which means that nulls are not name terminators. +For this section, the term "naming domain" refers to any place where names are +presented together -- all the names in a directory, or all the attributes of a +file. + +Although the Unix naming constraints are very permissive, the reality of most +modern-day Linux systems is that programs work with Unicode character code +points to support international languages. +These programs typically encode those code points in UTF-8 when interfacing +with the C library because the kernel expects null-terminated names. +In the common case, therefore, names found in an XFS filesystem are actually +UTF-8 encoded Unicode data. + +To maximize its expressiveness, the Unicode standard defines separate control +points for various characters that render similarly or identically in writing +systems around the world. +For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders +identically to "Latin Small Letter A" U+0061 "a". + +The standard also permits characters to be constructed in multiple ways -- +either by using a defined code point, or by combining one code point with +various combining marks. +For example, the character "Angstrom Sign U+212B "Å" can also be expressed +as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above" +U+030A "◌̊". +Both sequences render identically. + +Like the standards that preceded it, Unicode also defines various control +characters to alter the presentation of text. +For example, the character "Right-to-Left Override" U+202E can trick some +programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png". +A second category of rendering problems involves whitespace characters. +If the character "Zero Width Space" U+200B is encountered in a file name, the +name will render identically to a name that does not have the zero width +space. + +If two names within a naming domain have different byte sequences but render +identically, a user may be confused by it. +The kernel, in its indifference to upper level encoding schemes, permits this. +Most filesystem drivers persist the byte sequence names that are given to them +by the VFS. + +Techniques for detecting confusable names are explained in great detail in +sections 4 and 5 of the +`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_ +document. +When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the +Unicode normalization form NFD in conjunction with the confusable name +detection component of +`libicu <https://github.com/unicode-org/icu>`_ +to identify names with a directory or within a file's extended attributes that +could be confused for each other. +Names are also checked for control characters, non-rendering characters, and +mixing of bidirectional characters. +All of these potential issues are reported to the system administrator during +phase 5. 
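
``xfs_scrub`` relies on libicu for normalization and confusable name detection,
which covers far more ground than a short example can show.
The standalone sketch below illustrates only the simplest end of the problem:
decoding UTF-8 and flagging control characters, a zero width space, and the
directional override characters mentioned above.
The decoder and the list of suspicious code points are deliberately minimal and
are not the checks that ``xfs_scrub`` actually performs.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one UTF-8 sequence; returns bytes consumed, or 0 if malformed. */
    static size_t utf8_decode(const uint8_t *s, size_t len, uint32_t *cp)
    {
            if (len >= 1 && s[0] < 0x80) {
                    *cp = s[0];
                    return 1;
            }
            if (len >= 2 && (s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
                    *cp = ((uint32_t)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
                    return *cp >= 0x80 ? 2 : 0;     /* reject overlong forms */
            }
            if (len >= 3 && (s[0] & 0xf0) == 0xe0 && (s[1] & 0xc0) == 0x80 &&
                (s[2] & 0xc0) == 0x80) {
                    *cp = ((uint32_t)(s[0] & 0x0f) << 12) |
                          ((uint32_t)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
                    return *cp >= 0x800 ? 3 : 0;
            }
            if (len >= 4 && (s[0] & 0xf8) == 0xf0 && (s[1] & 0xc0) == 0x80 &&
                (s[2] & 0xc0) == 0x80 && (s[3] & 0xc0) == 0x80) {
                    *cp = ((uint32_t)(s[0] & 0x07) << 18) |
                          ((uint32_t)(s[1] & 0x3f) << 12) |
                          ((uint32_t)(s[2] & 0x3f) << 6) | (s[3] & 0x3f);
                    return (*cp >= 0x10000 && *cp <= 0x10ffff) ? 4 : 0;
            }
            return 0;
    }

    /* Return nonzero if @name contains a code point worth warning about. */
    static int name_is_suspicious(const uint8_t *name, size_t len)
    {
            while (len > 0) {
                    uint32_t cp;
                    size_t adv = utf8_decode(name, len, &cp);

                    if (adv == 0)
                            return 1;       /* not valid UTF-8 */
                    if (cp < 0x20 || cp == 0x7f)
                            return 1;       /* C0 controls and DEL */
                    if (cp == 0x200b)
                            return 1;       /* zero width space */
                    if (cp == 0x202d || cp == 0x202e)
                            return 1;       /* directional overrides */
                    name += adv;
                    len -= adv;
            }
            return 0;
    }
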
+
Media Verification of File Data Extents
---------------------------------------

The system administrator can elect to initiate a media scan of all file data
blocks.
This scan runs after validation of all filesystem metadata (except for the
summary counters) as phase 6.
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
to find areas that are allocated to file data fork extents.
Gaps between data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
device.

If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow the failure down to the specific region of the media and records the
results.
When it has finished issuing verification requests, it again uses the space
mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.

7. Conclusion and Future Work
=============================

It is hoped that the reader has followed the designs laid out in this document
and now has some familiarity with how XFS performs online rebuilding of its
metadata indices, and how filesystem users can interact with that
functionality.
Although the scope of this work is daunting, it is hoped that this guide will
make it easier for code readers to understand what has been built, for whom it
has been built, and why.
Please feel free to contact the XFS mailing list with questions.

FIEXCHANGE_RANGE
----------------

As discussed earlier, a second frontend to the atomic extent swap mechanism is
a new ioctl call that userspace programs can use to commit updates to files
atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.

Extent Swapping with Regular User Files
```````````````````````````````````````

As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
The earliest form of this was the fork swap mechanism, where the entire
contents of data forks could be exchanged between two files by exchanging the
raw bytes in each inode fork's immediate area.
When XFS v5 came along with self-describing metadata, this old mechanism grew
some log support to continue rewriting the owner fields of BMBT blocks during
log recovery.
When the reverse mapping btree was later added to XFS, the only way to maintain
the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
the new tracking items, because the atomic extent swap mechanism is an
iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.
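
For reference, the sketch below shows roughly the call sequence a
defragmenter such as ``xfs_fsr`` uses to drive the legacy swapext interface.
It is added here for illustration only and assumes the userspace ioctl
definitions installed by xfsprogs (``struct xfs_swapext``,
``XFS_IOC_SWAPEXT``, and ``XFS_IOC_FSBULKSTAT_SINGLE`` from ``<xfs/xfs.h>``);
donor file preparation and error cleanup are omitted, so treat it as an
outline rather than a working defragmenter.

.. code-block:: c

    /*
     * Sketch of the legacy extent swap call sequence (assumes xfsprogs
     * userspace headers; donor file setup and error cleanup omitted).
     */
    #include <xfs/xfs.h>
    #include <linux/types.h>
    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>

    static int swap_data_forks(const char *target, const char *donor)
    {
            struct xfs_fsop_bulkreq breq;
            struct xfs_swapext      sx;
            struct stat             st;
            __u64                   ino;
            int                     fd, tfd;

            fd = open(target, O_RDWR);
            tfd = open(donor, O_RDWR);
            if (fd < 0 || tfd < 0)
                    return -1;
            if (fstat(fd, &st) < 0)
                    return -1;

            /*
             * Snapshot the target's bulkstat information; the kernel
             * compares it at swap time to detect concurrent modification.
             */
            memset(&sx, 0, sizeof(sx));
            memset(&breq, 0, sizeof(breq));
            ino = st.st_ino;
            breq.lastip = &ino;
            breq.icount = 1;
            breq.ubuffer = &sx.sx_stat;
            if (ioctl(fd, XFS_IOC_FSBULKSTAT_SINGLE, &breq) < 0)
                    return -1;

            /* Ask the kernel to exchange the data fork mappings. */
            sx.sx_version = XFS_SX_VERSION;
            sx.sx_fdtarget = fd;
            sx.sx_fdtmp = tfd;
            sx.sx_offset = 0;
            sx.sx_length = st.st_size;
            return ioctl(fd, XFS_IOC_SWAPEXT, &sx);
    }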
+ +Atomic extent swapping is much more flexible than the existing swapext +implementations because it can guarantee that the caller never sees a mix of +old and new contents even after a crash, and it can operate on two arbitrary +file fork ranges. +The extra flexibility enables several new use cases: + +- **Atomic commit of file writes**: A userspace process opens a file that it + wants to update. + Next, it opens a temporary file and calls the file clone operation to reflink + the first file's contents into the temporary file. + Writes to the original file should instead be written to the temporary file. + Finally, the process calls the atomic extent swap system call + (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all + of the updates to the original file, or none of them. + +.. _swapext_if_unchanged: + +- **Transactional file updates**: The same mechanism as above, but the caller + only wants the commit to occur if the original file's contents have not + changed. + To make this happen, the calling process snapshots the file modification and + change timestamps of the original file before reflinking its data to the + temporary file. + When the program is ready to commit the changes, it passes the timestamps + into the kernel as arguments to the atomic extent swap system call. + The kernel only commits the changes if the provided timestamps match the + original file. + +- **Emulation of atomic block device writes**: Export a block device with a + logical sector size matching the filesystem block size to force all writes + to be aligned to the filesystem block size. + Stage all writes to a temporary file, and when that is complete, call the + atomic extent swap system call with a flag to indicate that holes in the + temporary file should be ignored. + This emulates an atomic device write in software, and can support arbitrary + scattered writes. + +Vectorized Scrub +---------------- + +As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned +earlier was a catalyst for enabling a vectorized scrub system call. +Since 2018, the cost of making a kernel call has increased considerably on some +systems to mitigate the effects of speculative execution attacks. +This incentivizes program authors to make as few system calls as possible to +reduce the number of times an execution path crosses a security boundary. + +With vectorized scrub, userspace pushes to the kernel the identity of a +filesystem object, a list of scrub types to run against that object, and a +simple representation of the data dependencies between the selected scrub +types. +The kernel executes as much of the caller's plan as it can until it hits a +dependency that cannot be satisfied due to a corruption, and tells userspace +how much was accomplished. +It is hoped that ``io_uring`` will pick up enough of this functionality that +online fsck can use that instead of adding a separate vectored scrub system +call to XFS. + +The relevant patchsets are the +`kernel vectorized scrub +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_ +and +`userspace vectorized scrub +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_ +series. + +Quality of Service Targets for Scrub +------------------------------------ + +One serious shortcoming of the online fsck code is that the amount of time that +it can spend in the kernel holding resource locks is basically unbounded. 
+
Userspace is allowed to send a fatal signal to the process, which will cause
``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
for userspace to provide a time budget to the kernel.
Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
be too much work to allow userspace to specify a timeout for a scrub/repair
operation and abort the operation if it exceeds its budget.
However, most repair functions have the property that once they begin to touch
ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
timeout is no longer useful.

Defragmenting Free Space
------------------------

Over the years, many XFS users have requested the creation of a program to
clear a portion of the physical storage underlying a filesystem so that it
becomes a contiguous chunk of free space.
Call this free space defragmenter ``clearspace`` for short.

The first piece the ``clearspace`` program needs is the ability to read the
reverse mapping index from userspace.
This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
The second piece it needs is a new fallocate mode
(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
maps it to a file.
Call this file the "space collector" file.
The third piece is the ability to force an online repair.

To clear all the metadata out of a portion of physical storage, clearspace
uses the new fallocate map-freespace call to map any free space in that region
to the space collector file.
Next, clearspace finds all metadata blocks in that region by way of
``GETFSMAP`` and issues forced repair requests on the data structure.
This often results in the metadata being rebuilt somewhere that is not being
cleared.
After each relocation, clearspace calls the "map free space" function again to
collect any newly freed space in the region being cleared.

To clear all the file data out of a portion of the physical storage, clearspace
uses the FSMAP information to find relevant file data blocks.
Having identified a good target, it uses the ``FICLONERANGE`` call on that part
of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being
cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
<swapext_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable.

There are further optimizations that could apply to the above algorithm.
To clear a piece of physical storage that has a high sharing factor, it is
strongly desirable to retain this sharing factor.
In fact, these extents should be moved first to maximize sharing factor after
the operation completes.
To make this work smoothly, clearspace needs a new ioctl
(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
With the refcount information exposed, clearspace can quickly find the longest,
most shared data extents in the filesystem, and target them first.

**Future Work Question**: How might the filesystem move inode chunks?
+ +*Answer*: To move inode chunks, Dave Chinner constructed a prototype program +that creates a new file with the old contents and then locklessly runs around +the filesystem updating directory entries. +The operation cannot complete if the filesystem goes down. +That problem isn't totally insurmountable: create an inode remapping table +hidden behind a jump label, and a log item that tracks the kernel walking the +filesystem to update directory entries. +The trouble is, the kernel can't do anything about open files, since it cannot +revoke them. + +**Future Work Question**: Can static keys be used to minimize the cost of +supporting ``revoke()`` on XFS files? + +*Answer*: Yes. +Until the first revocation, the bailout code need not be in the call path at +all. + +The relevant patchsets are the +`kernel freespace defrag +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_ +and +`userspace freespace defrag +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_ +series. + +Shrinking Filesystems +--------------------- + +Removing the end of the filesystem ought to be a simple matter of evacuating +the data and metadata at the end of the filesystem, and handing the freed space +to the shrink code. +That requires an evacuation of the space at end of the filesystem, which is a +use of free space defragmentation! diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst index b79dbf36dc94..a10c4ae6955e 100644 --- a/Documentation/filesystems/xfs-self-describing-metadata.rst +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst @@ -1,4 +1,5 @@ .. SPDX-License-Identifier: GPL-2.0 +.. _xfs_self_describing_metadata: ============================ XFS Self Describing Metadata diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index 9fac5ea8d0e4..52e1823241fb 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -47,6 +47,33 @@ config XFS_SUPPORT_V4 To continue supporting the old V4 format (crc=0), say Y. To close off an attack surface, say N. +config XFS_SUPPORT_ASCII_CI + bool "Support deprecated case-insensitive ascii (ascii-ci=1) format" + depends on XFS_FS + default y + help + The ASCII case insensitivity filesystem feature only works correctly + on systems that have been coerced into using ISO 8859-1, and it does + not work on extended attributes. The kernel has no visibility into + the locale settings in userspace, so it corrupts UTF-8 names. + Enabling this feature makes XFS vulnerable to mixed case sensitivity + attacks. Because of this, the feature is deprecated. All users + should upgrade by backing up their files, reformatting, and restoring + from the backup. + + Administrators and users can detect such a filesystem by running + xfs_info against a filesystem mountpoint and checking for a string + beginning with "ascii-ci=". If the string "ascii-ci=1" is found, the + filesystem is a case-insensitive filesystem. If no such string is + found, please upgrade xfsprogs to the latest version and try again. + + This option will become default N in September 2025. Support for the + feature will be removed entirely in September 2030. Distributors + can say N here to withdraw support earlier. + + To continue supporting case-insensitivity (ascii-ci=1), say Y. + To close off an attack surface, say N. + config XFS_QUOTA bool "XFS Quota support" depends on XFS_FS @@ -93,10 +120,15 @@ config XFS_RT If unsure, say N. 
+config XFS_DRAIN_INTENTS + bool + select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL + config XFS_ONLINE_SCRUB bool "XFS online metadata check support" default n depends on XFS_FS + select XFS_DRAIN_INTENTS help If you say Y here you will be able to check metadata on a mounted XFS filesystem. This feature is intended to reduce diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 92d88dc3c9f7..16e4eb431230 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -136,6 +136,8 @@ ifeq ($(CONFIG_MEMORY_FAILURE),y) xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o endif +xfs-$(CONFIG_XFS_DRAIN_INTENTS) += xfs_drain.o + # online scrub/repair ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y) @@ -146,6 +148,7 @@ xfs-y += $(addprefix scrub/, \ agheader.o \ alloc.o \ attr.o \ + bitmap.o \ bmap.o \ btree.o \ common.o \ @@ -156,6 +159,7 @@ xfs-y += $(addprefix scrub/, \ ialloc.o \ inode.o \ parent.o \ + readdir.o \ refcount.o \ rmap.o \ scrub.o \ @@ -169,7 +173,6 @@ xfs-$(CONFIG_XFS_QUOTA) += scrub/quota.o ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y) xfs-y += $(addprefix scrub/, \ agheader_repair.o \ - bitmap.o \ repair.o \ ) endif diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index 86696a1c6891..1b078bbbf225 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -81,6 +81,19 @@ xfs_perag_get_tag( return pag; } +/* Get a passive reference to the given perag. */ +struct xfs_perag * +xfs_perag_hold( + struct xfs_perag *pag) +{ + ASSERT(atomic_read(&pag->pag_ref) > 0 || + atomic_read(&pag->pag_active_ref) > 0); + + trace_xfs_perag_hold(pag, _RET_IP_); + atomic_inc(&pag->pag_ref); + return pag; +} + void xfs_perag_put( struct xfs_perag *pag) @@ -247,6 +260,7 @@ xfs_free_perag( spin_unlock(&mp->m_perag_lock); ASSERT(pag); XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0); + xfs_defer_drain_free(&pag->pag_intents_drain); cancel_delayed_work_sync(&pag->pag_blockgc_work); xfs_buf_hash_destroy(pag); @@ -372,6 +386,7 @@ xfs_initialize_perag( spin_lock_init(&pag->pag_state_lock); INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker); INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC); + xfs_defer_drain_init(&pag->pag_intents_drain); init_waitqueue_head(&pag->pagb_wait); init_waitqueue_head(&pag->pag_active_wq); pag->pagb_count = 0; @@ -408,6 +423,7 @@ xfs_initialize_perag( return 0; out_remove_pag: + xfs_defer_drain_free(&pag->pag_intents_drain); radix_tree_delete(&mp->m_perag_tree, index); out_free_pag: kmem_free(pag); @@ -418,6 +434,7 @@ out_unwind_new_pags: if (!pag) break; xfs_buf_hash_destroy(pag); + xfs_defer_drain_free(&pag->pag_intents_drain); kmem_free(pag); } return error; @@ -1043,10 +1060,8 @@ xfs_ag_extend_space( if (error) return error; - error = xfs_free_extent(tp, XFS_AGB_TO_FSB(pag->pag_mount, pag->pag_agno, - be32_to_cpu(agf->agf_length) - len), - len, &XFS_RMAP_OINFO_SKIP_UPDATE, - XFS_AG_RESV_NONE); + error = xfs_free_extent(tp, pag, be32_to_cpu(agf->agf_length) - len, + len, &XFS_RMAP_OINFO_SKIP_UPDATE, XFS_AG_RESV_NONE); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 5e18536dfdce..2e0aef87d633 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -101,6 +101,14 @@ struct xfs_perag { /* background prealloc block trimming */ struct delayed_work pag_blockgc_work; + /* + * We use xfs_drain to track the number of deferred log intent items + * that have been queued (but not yet processed) so that waiters (e.g. 
+ * scrub) will not lock resources when other threads are in the middle + * of processing a chain of intent items only to find momentary + * inconsistencies. + */ + struct xfs_defer_drain pag_intents_drain; #endif /* __KERNEL__ */ }; @@ -134,6 +142,7 @@ void xfs_free_perag(struct xfs_mount *mp); struct xfs_perag *xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno); struct xfs_perag *xfs_perag_get_tag(struct xfs_mount *mp, xfs_agnumber_t agno, unsigned int tag); +struct xfs_perag *xfs_perag_hold(struct xfs_perag *pag); void xfs_perag_put(struct xfs_perag *pag); /* Active AG references */ diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 203f16c48c19..fdfa08cbf4db 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -233,6 +233,52 @@ xfs_alloc_update( return xfs_btree_update(cur, &rec); } +/* Convert the ondisk btree record to its incore representation. */ +void +xfs_alloc_btrec_to_irec( + const union xfs_btree_rec *rec, + struct xfs_alloc_rec_incore *irec) +{ + irec->ar_startblock = be32_to_cpu(rec->alloc.ar_startblock); + irec->ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount); +} + +/* Simple checks for free space records. */ +xfs_failaddr_t +xfs_alloc_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_alloc_rec_incore *irec) +{ + struct xfs_perag *pag = cur->bc_ag.pag; + + if (irec->ar_blockcount == 0) + return __this_address; + + /* check for valid extent range, including overflow */ + if (!xfs_verify_agbext(pag, irec->ar_startblock, irec->ar_blockcount)) + return __this_address; + + return NULL; +} + +static inline int +xfs_alloc_complain_bad_rec( + struct xfs_btree_cur *cur, + xfs_failaddr_t fa, + const struct xfs_alloc_rec_incore *irec) +{ + struct xfs_mount *mp = cur->bc_mp; + + xfs_warn(mp, + "%s Freespace BTree record corruption in AG %d detected at %pS!", + cur->bc_btnum == XFS_BTNUM_BNO ? "Block" : "Size", + cur->bc_ag.pag->pag_agno, fa); + xfs_warn(mp, + "start block 0x%x block count 0x%x", irec->ar_startblock, + irec->ar_blockcount); + return -EFSCORRUPTED; +} + /* * Get the data from the pointed-to record. */ @@ -243,35 +289,23 @@ xfs_alloc_get_rec( xfs_extlen_t *len, /* output: length of extent */ int *stat) /* output: success/failure */ { - struct xfs_mount *mp = cur->bc_mp; - struct xfs_perag *pag = cur->bc_ag.pag; + struct xfs_alloc_rec_incore irec; union xfs_btree_rec *rec; + xfs_failaddr_t fa; int error; error = xfs_btree_get_rec(cur, &rec, stat); if (error || !(*stat)) return error; - *bno = be32_to_cpu(rec->alloc.ar_startblock); - *len = be32_to_cpu(rec->alloc.ar_blockcount); - - if (*len == 0) - goto out_bad_rec; - - /* check for valid extent range, including overflow */ - if (!xfs_verify_agbext(pag, *bno, *len)) - goto out_bad_rec; + xfs_alloc_btrec_to_irec(rec, &irec); + fa = xfs_alloc_check_irec(cur, &irec); + if (fa) + return xfs_alloc_complain_bad_rec(cur, fa, &irec); + *bno = irec.ar_startblock; + *len = irec.ar_blockcount; return 0; - -out_bad_rec: - xfs_warn(mp, - "%s Freespace BTree record corruption in AG %d detected!", - cur->bc_btnum == XFS_BTNUM_BNO ? 
"Block" : "Size", - pag->pag_agno); - xfs_warn(mp, - "start block 0x%x block count 0x%x", *bno, *len); - return -EFSCORRUPTED; } /* @@ -2405,6 +2439,7 @@ xfs_defer_agfl_block( trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1); + xfs_extent_free_get_group(mp, xefi); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_AGFL_FREE, &xefi->xefi_list); } @@ -2421,8 +2456,8 @@ __xfs_free_extent_later( bool skip_discard) { struct xfs_extent_free_item *xefi; -#ifdef DEBUG struct xfs_mount *mp = tp->t_mountp; +#ifdef DEBUG xfs_agnumber_t agno; xfs_agblock_t agbno; @@ -2456,9 +2491,11 @@ __xfs_free_extent_later( } else { xefi->xefi_owner = XFS_RMAP_OWN_NULL; } - trace_xfs_bmap_free_defer(tp->t_mountp, + trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(tp->t_mountp, bno), 0, XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len); + + xfs_extent_free_get_group(mp, xefi); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list); } @@ -3596,7 +3633,8 @@ xfs_free_extent_fix_freelist( int __xfs_free_extent( struct xfs_trans *tp, - xfs_fsblock_t bno, + struct xfs_perag *pag, + xfs_agblock_t agbno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type, @@ -3604,12 +3642,9 @@ __xfs_free_extent( { struct xfs_mount *mp = tp->t_mountp; struct xfs_buf *agbp; - xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, bno); - xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(mp, bno); struct xfs_agf *agf; int error; unsigned int busy_flags = 0; - struct xfs_perag *pag; ASSERT(len != 0); ASSERT(type != XFS_AG_RESV_AGFL); @@ -3618,10 +3653,9 @@ __xfs_free_extent( XFS_ERRTAG_FREE_EXTENT)) return -EIO; - pag = xfs_perag_get(mp, agno); error = xfs_free_extent_fix_freelist(tp, pag, &agbp); if (error) - goto err; + return error; agf = agbp->b_addr; if (XFS_IS_CORRUPT(mp, agbno >= mp->m_sb.sb_agblocks)) { @@ -3635,20 +3669,18 @@ __xfs_free_extent( goto err_release; } - error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, oinfo, type); + error = xfs_free_ag_extent(tp, agbp, pag->pag_agno, agbno, len, oinfo, + type); if (error) goto err_release; if (skip_discard) busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD; xfs_extent_busy_insert(tp, pag, agbno, len, busy_flags); - xfs_perag_put(pag); return 0; err_release: xfs_trans_brelse(tp, agbp); -err: - xfs_perag_put(pag); return error; } @@ -3666,9 +3698,13 @@ xfs_alloc_query_range_helper( { struct xfs_alloc_query_range_info *query = priv; struct xfs_alloc_rec_incore irec; + xfs_failaddr_t fa; + + xfs_alloc_btrec_to_irec(rec, &irec); + fa = xfs_alloc_check_irec(cur, &irec); + if (fa) + return xfs_alloc_complain_bad_rec(cur, fa, &irec); - irec.ar_startblock = be32_to_cpu(rec->alloc.ar_startblock); - irec.ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount); return query->fn(cur, &irec, query->priv); } @@ -3709,13 +3745,16 @@ xfs_alloc_query_all( return xfs_btree_query_all(cur, xfs_alloc_query_range_helper, &query); } -/* Is there a record covering a given extent? */ +/* + * Scan part of the keyspace of the free space and tell us if the area has no + * records, is fully mapped by records, or is partially filled. 
+ */ int -xfs_alloc_has_record( +xfs_alloc_has_records( struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, - bool *exists) + enum xbtree_recpacking *outcome) { union xfs_btree_irec low; union xfs_btree_irec high; @@ -3725,7 +3764,7 @@ xfs_alloc_has_record( memset(&high, 0xFF, sizeof(high)); high.a.ar_startblock = bno + len - 1; - return xfs_btree_has_record(cur, &low, &high, exists); + return xfs_btree_has_records(cur, &low, &high, NULL, outcome); } /* diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 2b246d74c189..5dbb25546d0b 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -141,7 +141,8 @@ int xfs_alloc_vextent_first_ag(struct xfs_alloc_arg *args, int /* error */ __xfs_free_extent( struct xfs_trans *tp, /* transaction pointer */ - xfs_fsblock_t bno, /* starting block number of extent */ + struct xfs_perag *pag, + xfs_agblock_t agbno, xfs_extlen_t len, /* length of extent */ const struct xfs_owner_info *oinfo, /* extent owner */ enum xfs_ag_resv_type type, /* block reservation type */ @@ -150,12 +151,13 @@ __xfs_free_extent( static inline int xfs_free_extent( struct xfs_trans *tp, - xfs_fsblock_t bno, + struct xfs_perag *pag, + xfs_agblock_t agbno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type) { - return __xfs_free_extent(tp, bno, len, oinfo, type, false); + return __xfs_free_extent(tp, pag, agbno, len, oinfo, type, false); } int /* error */ @@ -179,6 +181,12 @@ xfs_alloc_get_rec( xfs_extlen_t *len, /* output: length of extent */ int *stat); /* output: success/failure */ +union xfs_btree_rec; +void xfs_alloc_btrec_to_irec(const union xfs_btree_rec *rec, + struct xfs_alloc_rec_incore *irec); +xfs_failaddr_t xfs_alloc_check_irec(struct xfs_btree_cur *cur, + const struct xfs_alloc_rec_incore *irec); + int xfs_read_agf(struct xfs_perag *pag, struct xfs_trans *tp, int flags, struct xfs_buf **agfbpp); int xfs_alloc_read_agf(struct xfs_perag *pag, struct xfs_trans *tp, int flags, @@ -205,8 +213,8 @@ int xfs_alloc_query_range(struct xfs_btree_cur *cur, int xfs_alloc_query_all(struct xfs_btree_cur *cur, xfs_alloc_query_range_fn fn, void *priv); -int xfs_alloc_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno, - xfs_extlen_t len, bool *exist); +int xfs_alloc_has_records(struct xfs_btree_cur *cur, xfs_agblock_t bno, + xfs_extlen_t len, enum xbtree_recpacking *outcome); typedef int (*xfs_agfl_walk_fn)(struct xfs_mount *mp, xfs_agblock_t bno, void *priv); @@ -235,9 +243,13 @@ struct xfs_extent_free_item { uint64_t xefi_owner; xfs_fsblock_t xefi_startblock;/* starting fs block number */ xfs_extlen_t xefi_blockcount;/* number of blocks in extent */ + struct xfs_perag *xefi_pag; unsigned int xefi_flags; }; +void xfs_extent_free_get_group(struct xfs_mount *mp, + struct xfs_extent_free_item *xefi); + #define XFS_EFI_SKIP_DISCARD (1U << 0) /* don't issue discard */ #define XFS_EFI_ATTR_FORK (1U << 1) /* freeing attr fork block */ #define XFS_EFI_BMBT_BLOCK (1U << 2) /* freeing bmap btree block */ diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c index 0f29c7b1b39f..c65228efed4a 100644 --- a/fs/xfs/libxfs/xfs_alloc_btree.c +++ b/fs/xfs/libxfs/xfs_alloc_btree.c @@ -260,20 +260,27 @@ STATIC int64_t xfs_bnobt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->alloc.ar_startblock); + return (int64_t)be32_to_cpu(k1->alloc.ar_startblock) 
- - be32_to_cpu(k2->alloc.ar_startblock); + be32_to_cpu(k2->alloc.ar_startblock); } STATIC int64_t xfs_cntbt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { int64_t diff; + ASSERT(!mask || (mask->alloc.ar_blockcount && + mask->alloc.ar_startblock)); + diff = be32_to_cpu(k1->alloc.ar_blockcount) - be32_to_cpu(k2->alloc.ar_blockcount); if (diff) @@ -423,6 +430,19 @@ xfs_cntbt_recs_inorder( be32_to_cpu(r2->alloc.ar_startblock)); } +STATIC enum xbtree_key_contig +xfs_allocbt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + ASSERT(!mask || mask->alloc.ar_startblock); + + return xbtree_key_contig(be32_to_cpu(key1->alloc.ar_startblock), + be32_to_cpu(key2->alloc.ar_startblock)); +} + static const struct xfs_btree_ops xfs_bnobt_ops = { .rec_len = sizeof(xfs_alloc_rec_t), .key_len = sizeof(xfs_alloc_key_t), @@ -443,6 +463,7 @@ static const struct xfs_btree_ops xfs_bnobt_ops = { .diff_two_keys = xfs_bnobt_diff_two_keys, .keys_inorder = xfs_bnobt_keys_inorder, .recs_inorder = xfs_bnobt_recs_inorder, + .keys_contiguous = xfs_allocbt_keys_contiguous, }; static const struct xfs_btree_ops xfs_cntbt_ops = { @@ -465,6 +486,7 @@ static const struct xfs_btree_ops xfs_cntbt_ops = { .diff_two_keys = xfs_cntbt_diff_two_keys, .keys_inorder = xfs_cntbt_keys_inorder, .recs_inorder = xfs_cntbt_recs_inorder, + .keys_contiguous = NULL, /* not needed right now */ }; /* Allocate most of a new allocation btree cursor. */ @@ -492,9 +514,7 @@ xfs_allocbt_init_common( cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_abtb_2); } - /* take a reference for the cursor */ - atomic_inc(&pag->pag_ref); - cur->bc_ag.pag = pag; + cur->bc_ag.pag = xfs_perag_hold(pag); if (xfs_has_crc(mp)) cur->bc_flags |= XFS_BTREE_CRC_BLOCKS; diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 34de6e6898c4..b512de0540d5 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -1083,6 +1083,34 @@ struct xfs_iread_state { xfs_extnum_t loaded; }; +int +xfs_bmap_complain_bad_rec( + struct xfs_inode *ip, + int whichfork, + xfs_failaddr_t fa, + const struct xfs_bmbt_irec *irec) +{ + struct xfs_mount *mp = ip->i_mount; + const char *forkname; + + switch (whichfork) { + case XFS_DATA_FORK: forkname = "data"; break; + case XFS_ATTR_FORK: forkname = "attr"; break; + case XFS_COW_FORK: forkname = "CoW"; break; + default: forkname = "???"; break; + } + + xfs_warn(mp, + "Bmap BTree record corruption in inode 0x%llx %s fork detected at %pS!", + ip->i_ino, forkname, fa); + xfs_warn(mp, + "Offset 0x%llx, start block 0x%llx, block count 0x%llx state 0x%x", + irec->br_startoff, irec->br_startblock, irec->br_blockcount, + irec->br_state); + + return -EFSCORRUPTED; +} + /* Stuff every bmbt record from this block into the incore extent map. 
*/ static int xfs_iread_bmbt_block( @@ -1125,7 +1153,8 @@ xfs_iread_bmbt_block( xfs_inode_verifier_error(ip, -EFSCORRUPTED, "xfs_iread_extents(2)", frp, sizeof(*frp), fa); - return -EFSCORRUPTED; + return xfs_bmap_complain_bad_rec(ip, whichfork, fa, + &new); } xfs_iext_insert(ip, &ir->icur, &new, xfs_bmap_fork_to_state(whichfork)); @@ -1171,6 +1200,12 @@ xfs_iread_extents( goto out; } ASSERT(ir.loaded == xfs_iext_count(ifp)); + /* + * Use release semantics so that we can use acquire semantics in + * xfs_need_iread_extents and be guaranteed to see a valid mapping tree + * after that load. + */ + smp_store_release(&ifp->if_needextents, 0); return 0; out: xfs_iext_destroy(ifp); @@ -3505,7 +3540,6 @@ xfs_bmap_btalloc_at_eof( * original non-aligned state so the caller can proceed on allocation * failure as if this function was never called. */ - args->fsbno = ap->blkno; args->alignment = 1; return 0; } @@ -6075,6 +6109,7 @@ __xfs_bmap_add( bi->bi_whichfork = whichfork; bi->bi_bmap = *bmap; + xfs_bmap_update_get_group(tp->t_mountp, bi); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_BMAP, &bi->bi_list); return 0; } diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index dd08361ca5a6..e33470e39728 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -145,7 +145,7 @@ static inline int xfs_bmapi_whichfork(uint32_t bmapi_flags) { BMAP_COWFORK, "COW" } /* Return true if the extent is an allocated extent, written or not. */ -static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec) +static inline bool xfs_bmap_is_real_extent(const struct xfs_bmbt_irec *irec) { return irec->br_startblock != HOLESTARTBLOCK && irec->br_startblock != DELAYSTARTBLOCK && @@ -238,9 +238,13 @@ struct xfs_bmap_intent { enum xfs_bmap_intent_type bi_type; int bi_whichfork; struct xfs_inode *bi_owner; + struct xfs_perag *bi_pag; struct xfs_bmbt_irec bi_bmap; }; +void xfs_bmap_update_get_group(struct xfs_mount *mp, + struct xfs_bmap_intent *bi); + int xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_bmap_intent *bi); void xfs_bmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip, struct xfs_bmbt_irec *imap); @@ -261,6 +265,8 @@ static inline uint32_t xfs_bmap_fork_to_state(int whichfork) xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork, struct xfs_bmbt_irec *irec); +int xfs_bmap_complain_bad_rec(struct xfs_inode *ip, int whichfork, + xfs_failaddr_t fa, const struct xfs_bmbt_irec *irec); int xfs_bmapi_remap(struct xfs_trans *tp, struct xfs_inode *ip, xfs_fileoff_t bno, xfs_filblks_t len, xfs_fsblock_t startblock, diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c index b8ad95050c9b..1b40e5f8b1ec 100644 --- a/fs/xfs/libxfs/xfs_bmap_btree.c +++ b/fs/xfs/libxfs/xfs_bmap_btree.c @@ -382,11 +382,14 @@ STATIC int64_t xfs_bmbt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { uint64_t a = be64_to_cpu(k1->bmbt.br_startoff); uint64_t b = be64_to_cpu(k2->bmbt.br_startoff); + ASSERT(!mask || mask->bmbt.br_startoff); + /* * Note: This routine previously casted a and b to int64 and subtracted * them to generate a result. 
This lead to problems if b was the @@ -500,6 +503,19 @@ xfs_bmbt_recs_inorder( xfs_bmbt_disk_get_startoff(&r2->bmbt); } +STATIC enum xbtree_key_contig +xfs_bmbt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + ASSERT(!mask || mask->bmbt.br_startoff); + + return xbtree_key_contig(be64_to_cpu(key1->bmbt.br_startoff), + be64_to_cpu(key2->bmbt.br_startoff)); +} + static const struct xfs_btree_ops xfs_bmbt_ops = { .rec_len = sizeof(xfs_bmbt_rec_t), .key_len = sizeof(xfs_bmbt_key_t), @@ -520,6 +536,7 @@ static const struct xfs_btree_ops xfs_bmbt_ops = { .buf_ops = &xfs_bmbt_buf_ops, .keys_inorder = xfs_bmbt_keys_inorder, .recs_inorder = xfs_bmbt_recs_inorder, + .keys_contiguous = xfs_bmbt_keys_contiguous, }; /* diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index c4649cc624e1..6a6503ab0cd7 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -2067,8 +2067,7 @@ xfs_btree_get_leaf_keys( for (n = 2; n <= xfs_btree_get_numrecs(block); n++) { rec = xfs_btree_rec_addr(cur, n, block); cur->bc_ops->init_high_key_from_rec(&hkey, rec); - if (cur->bc_ops->diff_two_keys(cur, &hkey, &max_hkey) - > 0) + if (xfs_btree_keycmp_gt(cur, &hkey, &max_hkey)) max_hkey = hkey; } @@ -2096,7 +2095,7 @@ xfs_btree_get_node_keys( max_hkey = xfs_btree_high_key_addr(cur, 1, block); for (n = 2; n <= xfs_btree_get_numrecs(block); n++) { hkey = xfs_btree_high_key_addr(cur, n, block); - if (cur->bc_ops->diff_two_keys(cur, hkey, max_hkey) > 0) + if (xfs_btree_keycmp_gt(cur, hkey, max_hkey)) max_hkey = hkey; } @@ -2183,8 +2182,8 @@ __xfs_btree_updkeys( nlkey = xfs_btree_key_addr(cur, ptr, block); nhkey = xfs_btree_high_key_addr(cur, ptr, block); if (!force_all && - !(cur->bc_ops->diff_two_keys(cur, nlkey, lkey) != 0 || - cur->bc_ops->diff_two_keys(cur, nhkey, hkey) != 0)) + xfs_btree_keycmp_eq(cur, nlkey, lkey) && + xfs_btree_keycmp_eq(cur, nhkey, hkey)) break; xfs_btree_copy_keys(cur, nlkey, lkey, 1); xfs_btree_log_keys(cur, bp, ptr, ptr); @@ -4716,7 +4715,6 @@ xfs_btree_simple_query_range( { union xfs_btree_rec *recp; union xfs_btree_key rec_key; - int64_t diff; int stat; bool firstrec = true; int error; @@ -4746,20 +4744,17 @@ xfs_btree_simple_query_range( if (error || !stat) break; - /* Skip if high_key(rec) < low_key. */ + /* Skip if low_key > high_key(rec). */ if (firstrec) { cur->bc_ops->init_high_key_from_rec(&rec_key, recp); firstrec = false; - diff = cur->bc_ops->diff_two_keys(cur, low_key, - &rec_key); - if (diff > 0) + if (xfs_btree_keycmp_gt(cur, low_key, &rec_key)) goto advloop; } - /* Stop if high_key < low_key(rec). */ + /* Stop if low_key(rec) > high_key. */ cur->bc_ops->init_key_from_rec(&rec_key, recp); - diff = cur->bc_ops->diff_two_keys(cur, &rec_key, high_key); - if (diff > 0) + if (xfs_btree_keycmp_gt(cur, &rec_key, high_key)) break; /* Callback */ @@ -4813,8 +4808,6 @@ xfs_btree_overlapped_query_range( union xfs_btree_key *hkp; union xfs_btree_rec *recp; struct xfs_btree_block *block; - int64_t ldiff; - int64_t hdiff; int level; struct xfs_buf *bp; int i; @@ -4854,25 +4847,23 @@ pop_up: block); cur->bc_ops->init_high_key_from_rec(&rec_hkey, recp); - ldiff = cur->bc_ops->diff_two_keys(cur, &rec_hkey, - low_key); - cur->bc_ops->init_key_from_rec(&rec_key, recp); - hdiff = cur->bc_ops->diff_two_keys(cur, high_key, - &rec_key); /* + * If (query's high key < record's low key), then there + * are no more interesting records in this block. 
Pop + * up to the leaf level to find more record blocks. + * * If (record's high key >= query's low key) and * (query's high key >= record's low key), then * this record overlaps the query range; callback. */ - if (ldiff >= 0 && hdiff >= 0) { + if (xfs_btree_keycmp_lt(cur, high_key, &rec_key)) + goto pop_up; + if (xfs_btree_keycmp_ge(cur, &rec_hkey, low_key)) { error = fn(cur, recp, priv); if (error) break; - } else if (hdiff < 0) { - /* Record is larger than high key; pop. */ - goto pop_up; } cur->bc_levels[level].ptr++; continue; @@ -4884,15 +4875,18 @@ pop_up: block); pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block); - ldiff = cur->bc_ops->diff_two_keys(cur, hkp, low_key); - hdiff = cur->bc_ops->diff_two_keys(cur, high_key, lkp); - /* + * If (query's high key < pointer's low key), then there are no + * more interesting keys in this block. Pop up one leaf level + * to continue looking for records. + * * If (pointer's high key >= query's low key) and * (query's high key >= pointer's low key), then * this record overlaps the query range; follow pointer. */ - if (ldiff >= 0 && hdiff >= 0) { + if (xfs_btree_keycmp_lt(cur, high_key, lkp)) + goto pop_up; + if (xfs_btree_keycmp_ge(cur, hkp, low_key)) { level--; error = xfs_btree_lookup_get_block(cur, level, pp, &block); @@ -4907,9 +4901,6 @@ pop_up: #endif cur->bc_levels[level].ptr = 1; continue; - } else if (hdiff < 0) { - /* The low key is larger than the upper range; pop. */ - goto pop_up; } cur->bc_levels[level].ptr++; } @@ -4937,6 +4928,19 @@ out: return error; } +static inline void +xfs_btree_key_from_irec( + struct xfs_btree_cur *cur, + union xfs_btree_key *key, + const union xfs_btree_irec *irec) +{ + union xfs_btree_rec rec; + + cur->bc_rec = *irec; + cur->bc_ops->init_rec_from_cur(cur, &rec); + cur->bc_ops->init_key_from_rec(key, &rec); +} + /* * Query a btree for all records overlapping a given interval of keys. The * supplied function will be called with each record found; return one of the @@ -4951,21 +4955,15 @@ xfs_btree_query_range( xfs_btree_query_range_fn fn, void *priv) { - union xfs_btree_rec rec; union xfs_btree_key low_key; union xfs_btree_key high_key; /* Find the keys of both ends of the interval. */ - cur->bc_rec = *high_rec; - cur->bc_ops->init_rec_from_cur(cur, &rec); - cur->bc_ops->init_key_from_rec(&high_key, &rec); + xfs_btree_key_from_irec(cur, &high_key, high_rec); + xfs_btree_key_from_irec(cur, &low_key, low_rec); - cur->bc_rec = *low_rec; - cur->bc_ops->init_rec_from_cur(cur, &rec); - cur->bc_ops->init_key_from_rec(&low_key, &rec); - - /* Enforce low key < high key. */ - if (cur->bc_ops->diff_two_keys(cur, &low_key, &high_key) > 0) + /* Enforce low key <= high key. */ + if (!xfs_btree_keycmp_le(cur, &low_key, &high_key)) return -EINVAL; if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) @@ -5027,34 +5025,132 @@ xfs_btree_diff_two_ptrs( return (int64_t)be32_to_cpu(a->s) - be32_to_cpu(b->s); } -/* If there's an extent, we're done. */ +struct xfs_btree_has_records { + /* Keys for the start and end of the range we want to know about. */ + union xfs_btree_key start_key; + union xfs_btree_key end_key; + + /* Mask for key comparisons, if desired. */ + const union xfs_btree_key *key_mask; + + /* Highest record key we've seen so far. 
*/ + union xfs_btree_key high_key; + + enum xbtree_recpacking outcome; +}; + STATIC int -xfs_btree_has_record_helper( +xfs_btree_has_records_helper( struct xfs_btree_cur *cur, const union xfs_btree_rec *rec, void *priv) { - return -ECANCELED; + union xfs_btree_key rec_key; + union xfs_btree_key rec_high_key; + struct xfs_btree_has_records *info = priv; + enum xbtree_key_contig key_contig; + + cur->bc_ops->init_key_from_rec(&rec_key, rec); + + if (info->outcome == XBTREE_RECPACKING_EMPTY) { + info->outcome = XBTREE_RECPACKING_SPARSE; + + /* + * If the first record we find does not overlap the start key, + * then there is a hole at the start of the search range. + * Classify this as sparse and stop immediately. + */ + if (xfs_btree_masked_keycmp_lt(cur, &info->start_key, &rec_key, + info->key_mask)) + return -ECANCELED; + } else { + /* + * If a subsequent record does not overlap with the any record + * we've seen so far, there is a hole in the middle of the + * search range. Classify this as sparse and stop. + * If the keys overlap and this btree does not allow overlap, + * signal corruption. + */ + key_contig = cur->bc_ops->keys_contiguous(cur, &info->high_key, + &rec_key, info->key_mask); + if (key_contig == XBTREE_KEY_OVERLAP && + !(cur->bc_flags & XFS_BTREE_OVERLAPPING)) + return -EFSCORRUPTED; + if (key_contig == XBTREE_KEY_GAP) + return -ECANCELED; + } + + /* + * If high_key(rec) is larger than any other high key we've seen, + * remember it for later. + */ + cur->bc_ops->init_high_key_from_rec(&rec_high_key, rec); + if (xfs_btree_masked_keycmp_gt(cur, &rec_high_key, &info->high_key, + info->key_mask)) + info->high_key = rec_high_key; /* struct copy */ + + return 0; } -/* Is there a record covering a given range of keys? */ +/* + * Scan part of the keyspace of a btree and tell us if that keyspace does not + * map to any records; is fully mapped to records; or is partially mapped to + * records. This is the btree record equivalent to determining if a file is + * sparse. + * + * For most btree types, the record scan should use all available btree key + * fields to compare the keys encountered. These callers should pass NULL for + * @mask. However, some callers (e.g. scanning physical space in the rmapbt) + * want to ignore some part of the btree record keyspace when performing the + * comparison. These callers should pass in a union xfs_btree_key object with + * the fields that *should* be a part of the comparison set to any nonzero + * value, and the rest zeroed. + */ int -xfs_btree_has_record( +xfs_btree_has_records( struct xfs_btree_cur *cur, const union xfs_btree_irec *low, const union xfs_btree_irec *high, - bool *exists) + const union xfs_btree_key *mask, + enum xbtree_recpacking *outcome) { + struct xfs_btree_has_records info = { + .outcome = XBTREE_RECPACKING_EMPTY, + .key_mask = mask, + }; int error; - error = xfs_btree_query_range(cur, low, high, - &xfs_btree_has_record_helper, NULL); - if (error == -ECANCELED) { - *exists = true; - return 0; + /* Not all btrees support this operation. 
*/ + if (!cur->bc_ops->keys_contiguous) { + ASSERT(0); + return -EOPNOTSUPP; } - *exists = false; - return error; + + xfs_btree_key_from_irec(cur, &info.start_key, low); + xfs_btree_key_from_irec(cur, &info.end_key, high); + + error = xfs_btree_query_range(cur, low, high, + xfs_btree_has_records_helper, &info); + if (error == -ECANCELED) + goto out; + if (error) + return error; + + if (info.outcome == XBTREE_RECPACKING_EMPTY) + goto out; + + /* + * If the largest high_key(rec) we saw during the walk is greater than + * the end of the search range, classify this as full. Otherwise, + * there is a hole at the end of the search range. + */ + if (xfs_btree_masked_keycmp_ge(cur, &info.high_key, &info.end_key, + mask)) + info.outcome = XBTREE_RECPACKING_FULL; + +out: + *outcome = info.outcome; + return 0; } /* Are there more records in this btree? */ diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index 29c4b4ccb909..a2aa36b23e25 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -90,6 +90,27 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum); #define XFS_BTREE_STATS_ADD(cur, stat, val) \ XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val) +enum xbtree_key_contig { + XBTREE_KEY_GAP = 0, + XBTREE_KEY_CONTIGUOUS, + XBTREE_KEY_OVERLAP, +}; + +/* + * Decide if these two numeric btree key fields are contiguous, overlapping, + * or if there's a gap between them. @x should be the field from the high + * key and @y should be the field from the low key. + */ +static inline enum xbtree_key_contig xbtree_key_contig(uint64_t x, uint64_t y) +{ + x++; + if (x < y) + return XBTREE_KEY_GAP; + if (x == y) + return XBTREE_KEY_CONTIGUOUS; + return XBTREE_KEY_OVERLAP; +} + struct xfs_btree_ops { /* size of the key and record structures */ size_t key_len; @@ -140,11 +161,14 @@ struct xfs_btree_ops { /* * Difference between key2 and key1 -- positive if key1 > key2, - * negative if key1 < key2, and zero if equal. + * negative if key1 < key2, and zero if equal. If the @mask parameter + * is non NULL, each key field to be used in the comparison must + * contain a nonzero value. */ int64_t (*diff_two_keys)(struct xfs_btree_cur *cur, const union xfs_btree_key *key1, - const union xfs_btree_key *key2); + const union xfs_btree_key *key2, + const union xfs_btree_key *mask); const struct xfs_buf_ops *buf_ops; @@ -157,6 +181,22 @@ struct xfs_btree_ops { int (*recs_inorder)(struct xfs_btree_cur *cur, const union xfs_btree_rec *r1, const union xfs_btree_rec *r2); + + /* + * Are these two btree keys immediately adjacent? + * + * Given two btree keys @key1 and @key2, decide if it is impossible for + * there to be a third btree key K satisfying the relationship + * @key1 < K < @key2. To determine if two btree records are + * immediately adjacent, @key1 should be the high key of the first + * record and @key2 should be the low key of the second record. + * If the @mask parameter is non NULL, each key field to be used in the + * comparison must contain a nonzero value. 
+ */ + enum xbtree_key_contig (*keys_contiguous)(struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask); }; /* @@ -540,12 +580,105 @@ void xfs_btree_get_keys(struct xfs_btree_cur *cur, struct xfs_btree_block *block, union xfs_btree_key *key); union xfs_btree_key *xfs_btree_high_key_from_key(struct xfs_btree_cur *cur, union xfs_btree_key *key); -int xfs_btree_has_record(struct xfs_btree_cur *cur, +typedef bool (*xfs_btree_key_gap_fn)(struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2); + +int xfs_btree_has_records(struct xfs_btree_cur *cur, const union xfs_btree_irec *low, - const union xfs_btree_irec *high, bool *exists); + const union xfs_btree_irec *high, + const union xfs_btree_key *mask, + enum xbtree_recpacking *outcome); + bool xfs_btree_has_more_records(struct xfs_btree_cur *cur); struct xfs_ifork *xfs_btree_ifork_ptr(struct xfs_btree_cur *cur); +/* Key comparison helpers */ +static inline bool +xfs_btree_keycmp_lt( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) < 0; +} + +static inline bool +xfs_btree_keycmp_gt( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) > 0; +} + +static inline bool +xfs_btree_keycmp_eq( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) == 0; +} + +static inline bool +xfs_btree_keycmp_le( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return !xfs_btree_keycmp_gt(cur, key1, key2); +} + +static inline bool +xfs_btree_keycmp_ge( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return !xfs_btree_keycmp_lt(cur, key1, key2); +} + +static inline bool +xfs_btree_keycmp_ne( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2) +{ + return !xfs_btree_keycmp_eq(cur, key1, key2); +} + +/* Masked key comparison helpers */ +static inline bool +xfs_btree_masked_keycmp_lt( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2, mask) < 0; +} + +static inline bool +xfs_btree_masked_keycmp_gt( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + return cur->bc_ops->diff_two_keys(cur, key1, key2, mask) > 0; +} + +static inline bool +xfs_btree_masked_keycmp_ge( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + return !xfs_btree_masked_keycmp_lt(cur, key1, key2, mask); +} + /* Does this cursor point to the last block in the given level? 
*/ static inline bool xfs_btree_islastblock( diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index 5a321b783398..bcfb6a4203cd 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -397,6 +397,7 @@ xfs_defer_cancel_list( list_for_each_safe(pwi, n, &dfp->dfp_work) { list_del(pwi); dfp->dfp_count--; + trace_xfs_defer_cancel_item(mp, dfp, pwi); ops->cancel_item(pwi); } ASSERT(dfp->dfp_count == 0); @@ -476,6 +477,7 @@ xfs_defer_finish_one( list_for_each_safe(li, n, &dfp->dfp_work) { list_del(li); dfp->dfp_count--; + trace_xfs_defer_finish_item(tp->t_mountp, dfp, li); error = ops->finish_item(tp, dfp->dfp_done, li, &state); if (error == -EAGAIN) { int ret; @@ -623,7 +625,7 @@ xfs_defer_add( struct list_head *li) { struct xfs_defer_pending *dfp = NULL; - const struct xfs_defer_op_type *ops; + const struct xfs_defer_op_type *ops = defer_op_types[type]; ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); BUILD_BUG_ON(ARRAY_SIZE(defer_op_types) != XFS_DEFER_OPS_TYPE_MAX); @@ -636,7 +638,6 @@ xfs_defer_add( if (!list_empty(&tp->t_dfops)) { dfp = list_last_entry(&tp->t_dfops, struct xfs_defer_pending, dfp_list); - ops = defer_op_types[dfp->dfp_type]; if (dfp->dfp_type != type || (ops->max_items && dfp->dfp_count >= ops->max_items)) dfp = NULL; @@ -653,6 +654,7 @@ xfs_defer_add( } list_add_tail(li, &dfp->dfp_work); + trace_xfs_defer_add_item(tp->t_mountp, dfp, li); dfp->dfp_count++; } diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c index 92bac3373f1f..f5462fd582d5 100644 --- a/fs/xfs/libxfs/xfs_dir2.c +++ b/fs/xfs/libxfs/xfs_dir2.c @@ -64,7 +64,7 @@ xfs_ascii_ci_hashname( int i; for (i = 0, hash = 0; i < name->len; i++) - hash = tolower(name->name[i]) ^ rol32(hash, 7); + hash = xfs_ascii_ci_xfrm(name->name[i]) ^ rol32(hash, 7); return hash; } @@ -85,7 +85,8 @@ xfs_ascii_ci_compname( for (i = 0; i < len; i++) { if (args->name[i] == name[i]) continue; - if (tolower(args->name[i]) != tolower(name[i])) + if (xfs_ascii_ci_xfrm(args->name[i]) != + xfs_ascii_ci_xfrm(name[i])) return XFS_CMP_DIFFERENT; result = XFS_CMP_CASE; } diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h index dd39f17dd9a9..19af22a16c41 100644 --- a/fs/xfs/libxfs/xfs_dir2.h +++ b/fs/xfs/libxfs/xfs_dir2.h @@ -248,4 +248,35 @@ unsigned int xfs_dir3_data_end_offset(struct xfs_da_geometry *geo, struct xfs_dir2_data_hdr *hdr); bool xfs_dir2_namecheck(const void *name, size_t length); +/* + * The "ascii-ci" feature was created to speed up case-insensitive lookups for + * a Samba product. Because of the inherent problems with CI and UTF-8 + * encoding, etc, it was decided that Samba would be configured to export + * latin1/iso 8859-1 encodings as that covered >90% of the target markets for + * the product. Hence the "ascii-ci" casefolding code could be encoded into + * the XFS directory operations and remove all the overhead of casefolding from + * Samba. + * + * To provide consistent hashing behavior between the userspace and kernel, + * these functions prepare names for hashing by transforming specific bytes + * to other bytes. Robustness with other encodings is not guaranteed. 
+ */ +static inline bool xfs_ascii_ci_need_xfrm(unsigned char c) +{ + if (c >= 0x41 && c <= 0x5a) /* A-Z */ + return true; + if (c >= 0xc0 && c <= 0xd6) /* latin A-O with accents */ + return true; + if (c >= 0xd8 && c <= 0xde) /* latin O-Y with accents */ + return true; + return false; +} + +static inline unsigned char xfs_ascii_ci_xfrm(unsigned char c) +{ + if (xfs_ascii_ci_need_xfrm(c)) + c -= 'A' - 'a'; + return c; +} + #endif /* __XFS_DIR2_H__ */ diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index 7ee292aecbeb..a16d5de16933 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -95,33 +95,25 @@ xfs_inobt_btrec_to_irec( irec->ir_free = be64_to_cpu(rec->inobt.ir_free); } -/* - * Get the data from the pointed-to record. - */ -int -xfs_inobt_get_rec( - struct xfs_btree_cur *cur, - struct xfs_inobt_rec_incore *irec, - int *stat) +/* Simple checks for inode records. */ +xfs_failaddr_t +xfs_inobt_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_inobt_rec_incore *irec) { - struct xfs_mount *mp = cur->bc_mp; - union xfs_btree_rec *rec; - int error; uint64_t realfree; - error = xfs_btree_get_rec(cur, &rec, stat); - if (error || *stat == 0) - return error; - - xfs_inobt_btrec_to_irec(mp, rec, irec); - + /* Record has to be properly aligned within the AG. */ if (!xfs_verify_agino(cur->bc_ag.pag, irec->ir_startino)) - goto out_bad_rec; + return __this_address; + if (!xfs_verify_agino(cur->bc_ag.pag, + irec->ir_startino + XFS_INODES_PER_CHUNK - 1)) + return __this_address; if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT || irec->ir_count > XFS_INODES_PER_CHUNK) - goto out_bad_rec; + return __this_address; if (irec->ir_freecount > XFS_INODES_PER_CHUNK) - goto out_bad_rec; + return __this_address; /* if there are no holes, return the first available offset */ if (!xfs_inobt_issparse(irec->ir_holemask)) @@ -129,15 +121,23 @@ xfs_inobt_get_rec( else realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec); if (hweight64(realfree) != irec->ir_freecount) - goto out_bad_rec; + return __this_address; - return 0; + return NULL; +} + +static inline int +xfs_inobt_complain_bad_rec( + struct xfs_btree_cur *cur, + xfs_failaddr_t fa, + const struct xfs_inobt_rec_incore *irec) +{ + struct xfs_mount *mp = cur->bc_mp; -out_bad_rec: xfs_warn(mp, - "%s Inode BTree record corruption in AG %d detected!", + "%s Inode BTree record corruption in AG %d detected at %pS!", cur->bc_btnum == XFS_BTNUM_INO ? "Used" : "Free", - cur->bc_ag.pag->pag_agno); + cur->bc_ag.pag->pag_agno, fa); xfs_warn(mp, "start inode 0x%x, count 0x%x, free 0x%x freemask 0x%llx, holemask 0x%x", irec->ir_startino, irec->ir_count, irec->ir_freecount, @@ -146,6 +146,32 @@ out_bad_rec: } /* + * Get the data from the pointed-to record. + */ +int +xfs_inobt_get_rec( + struct xfs_btree_cur *cur, + struct xfs_inobt_rec_incore *irec, + int *stat) +{ + struct xfs_mount *mp = cur->bc_mp; + union xfs_btree_rec *rec; + xfs_failaddr_t fa; + int error; + + error = xfs_btree_get_rec(cur, &rec, stat); + if (error || *stat == 0) + return error; + + xfs_inobt_btrec_to_irec(mp, rec, irec); + fa = xfs_inobt_check_irec(cur, irec); + if (fa) + return xfs_inobt_complain_bad_rec(cur, fa, irec); + + return 0; +} + +/* * Insert a single inobt record. Cursor must already point to desired location. 
*/ int @@ -1952,8 +1978,6 @@ xfs_difree_inobt( */ if (!xfs_has_ikeep(mp) && rec.ir_free == XFS_INOBT_ALL_FREE && mp->m_sb.sb_inopblock <= XFS_INODES_PER_CHUNK) { - struct xfs_perag *pag = agbp->b_pag; - xic->deleted = true; xic->first_ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, rec.ir_startino); @@ -2617,44 +2641,50 @@ xfs_ialloc_read_agi( return 0; } -/* Is there an inode record covering a given range of inode numbers? */ -int -xfs_ialloc_has_inode_record( - struct xfs_btree_cur *cur, - xfs_agino_t low, - xfs_agino_t high, - bool *exists) +/* How many inodes are backed by inode clusters ondisk? */ +STATIC int +xfs_ialloc_count_ondisk( + struct xfs_btree_cur *cur, + xfs_agino_t low, + xfs_agino_t high, + unsigned int *allocated) { struct xfs_inobt_rec_incore irec; - xfs_agino_t agino; - uint16_t holemask; - int has_record; - int i; - int error; + unsigned int ret = 0; + int has_record; + int error; - *exists = false; error = xfs_inobt_lookup(cur, low, XFS_LOOKUP_LE, &has_record); - while (error == 0 && has_record) { + if (error) + return error; + + while (has_record) { + unsigned int i, hole_idx; + error = xfs_inobt_get_rec(cur, &irec, &has_record); - if (error || irec.ir_startino > high) + if (error) + return error; + if (irec.ir_startino > high) break; - agino = irec.ir_startino; - holemask = irec.ir_holemask; - for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; holemask >>= 1, - i++, agino += XFS_INODES_PER_HOLEMASK_BIT) { - if (holemask & 1) + for (i = 0; i < XFS_INODES_PER_CHUNK; i++) { + if (irec.ir_startino + i < low) continue; - if (agino + XFS_INODES_PER_HOLEMASK_BIT > low && - agino <= high) { - *exists = true; - return 0; - } + if (irec.ir_startino + i > high) + break; + + hole_idx = i / XFS_INODES_PER_HOLEMASK_BIT; + if (!(irec.ir_holemask & (1U << hole_idx))) + ret++; } error = xfs_btree_increment(cur, 0, &has_record); + if (error) + return error; } - return error; + + *allocated = ret; + return 0; } /* Is there an inode record covering a given extent? 
*/ @@ -2663,15 +2693,27 @@ xfs_ialloc_has_inodes_at_extent( struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, - bool *exists) + enum xbtree_recpacking *outcome) { - xfs_agino_t low; - xfs_agino_t high; + xfs_agino_t agino; + xfs_agino_t last_agino; + unsigned int allocated; + int error; + + agino = XFS_AGB_TO_AGINO(cur->bc_mp, bno); + last_agino = XFS_AGB_TO_AGINO(cur->bc_mp, bno + len) - 1; - low = XFS_AGB_TO_AGINO(cur->bc_mp, bno); - high = XFS_AGB_TO_AGINO(cur->bc_mp, bno + len) - 1; + error = xfs_ialloc_count_ondisk(cur, agino, last_agino, &allocated); + if (error) + return error; - return xfs_ialloc_has_inode_record(cur, low, high, exists); + if (allocated == 0) + *outcome = XBTREE_RECPACKING_EMPTY; + else if (allocated == last_agino - agino + 1) + *outcome = XBTREE_RECPACKING_FULL; + else + *outcome = XBTREE_RECPACKING_SPARSE; + return 0; } struct xfs_ialloc_count_inodes { @@ -2688,8 +2730,13 @@ xfs_ialloc_count_inodes_rec( { struct xfs_inobt_rec_incore irec; struct xfs_ialloc_count_inodes *ci = priv; + xfs_failaddr_t fa; xfs_inobt_btrec_to_irec(cur->bc_mp, rec, &irec); + fa = xfs_inobt_check_irec(cur, &irec); + if (fa) + return xfs_inobt_complain_bad_rec(cur, fa, &irec); + ci->count += irec.ir_count; ci->freecount += irec.ir_freecount; diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h index ab8c30b4ec22..fe824bb04a09 100644 --- a/fs/xfs/libxfs/xfs_ialloc.h +++ b/fs/xfs/libxfs/xfs_ialloc.h @@ -93,10 +93,11 @@ union xfs_btree_rec; void xfs_inobt_btrec_to_irec(struct xfs_mount *mp, const union xfs_btree_rec *rec, struct xfs_inobt_rec_incore *irec); +xfs_failaddr_t xfs_inobt_check_irec(struct xfs_btree_cur *cur, + const struct xfs_inobt_rec_incore *irec); int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur, - xfs_agblock_t bno, xfs_extlen_t len, bool *exists); -int xfs_ialloc_has_inode_record(struct xfs_btree_cur *cur, xfs_agino_t low, - xfs_agino_t high, bool *exists); + xfs_agblock_t bno, xfs_extlen_t len, + enum xbtree_recpacking *outcome); int xfs_ialloc_count_inodes(struct xfs_btree_cur *cur, xfs_agino_t *count, xfs_agino_t *freecount); int xfs_inobt_insert_rec(struct xfs_btree_cur *cur, uint16_t holemask, diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index 9b28211d5a4c..5a945ae21b5d 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -156,9 +156,12 @@ __xfs_inobt_free_block( struct xfs_buf *bp, enum xfs_ag_resv_type resv) { + xfs_fsblock_t fsbno; + xfs_inobt_mod_blockcount(cur, -1); - return xfs_free_extent(cur->bc_tp, - XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)), 1, + fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)); + return xfs_free_extent(cur->bc_tp, cur->bc_ag.pag, + XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1, &XFS_RMAP_OINFO_INOBT, resv); } @@ -266,10 +269,13 @@ STATIC int64_t xfs_inobt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->inobt.ir_startino); + return (int64_t)be32_to_cpu(k1->inobt.ir_startino) - - be32_to_cpu(k2->inobt.ir_startino); + be32_to_cpu(k2->inobt.ir_startino); } static xfs_failaddr_t @@ -380,6 +386,19 @@ xfs_inobt_recs_inorder( be32_to_cpu(r2->inobt.ir_startino); } +STATIC enum xbtree_key_contig +xfs_inobt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + ASSERT(!mask || 
mask->inobt.ir_startino); + + return xbtree_key_contig(be32_to_cpu(key1->inobt.ir_startino), + be32_to_cpu(key2->inobt.ir_startino)); +} + static const struct xfs_btree_ops xfs_inobt_ops = { .rec_len = sizeof(xfs_inobt_rec_t), .key_len = sizeof(xfs_inobt_key_t), @@ -399,6 +418,7 @@ static const struct xfs_btree_ops xfs_inobt_ops = { .diff_two_keys = xfs_inobt_diff_two_keys, .keys_inorder = xfs_inobt_keys_inorder, .recs_inorder = xfs_inobt_recs_inorder, + .keys_contiguous = xfs_inobt_keys_contiguous, }; static const struct xfs_btree_ops xfs_finobt_ops = { @@ -420,6 +440,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = { .diff_two_keys = xfs_inobt_diff_two_keys, .keys_inorder = xfs_inobt_keys_inorder, .recs_inorder = xfs_inobt_recs_inorder, + .keys_contiguous = xfs_inobt_keys_contiguous, }; /* @@ -447,9 +468,7 @@ xfs_inobt_init_common( if (xfs_has_crc(mp)) cur->bc_flags |= XFS_BTREE_CRC_BLOCKS; - /* take a reference for the cursor */ - atomic_inc(&pag->pag_ref); - cur->bc_ag.pag = pag; + cur->bc_ag.pag = xfs_perag_hold(pag); return cur; } @@ -607,7 +626,7 @@ xfs_iallocbt_maxlevels_ondisk(void) */ uint64_t xfs_inobt_irec_to_allocmask( - struct xfs_inobt_rec_incore *rec) + const struct xfs_inobt_rec_incore *rec) { uint64_t bitmap = 0; uint64_t inodespbit; diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h index e859a6e05230..3262c3fe5ebe 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.h +++ b/fs/xfs/libxfs/xfs_ialloc_btree.h @@ -53,7 +53,7 @@ struct xfs_btree_cur *xfs_inobt_stage_cursor(struct xfs_perag *pag, extern int xfs_inobt_maxrecs(struct xfs_mount *, int, int); /* ir_holemask to inode allocation bitmap conversion */ -uint64_t xfs_inobt_irec_to_allocmask(struct xfs_inobt_rec_incore *); +uint64_t xfs_inobt_irec_to_allocmask(const struct xfs_inobt_rec_incore *irec); #if defined(DEBUG) || defined(XFS_WARN) int xfs_inobt_rec_check_count(struct xfs_mount *, diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c index 6b21760184d9..5a2e7ddfa76d 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.c +++ b/fs/xfs/libxfs/xfs_inode_fork.c @@ -140,7 +140,8 @@ xfs_iformat_extents( xfs_inode_verifier_error(ip, -EFSCORRUPTED, "xfs_iformat_extents(2)", dp, sizeof(*dp), fa); - return -EFSCORRUPTED; + return xfs_bmap_complain_bad_rec(ip, whichfork, + fa, &new); } xfs_iext_insert(ip, &icur, &new, state); @@ -226,10 +227,15 @@ xfs_iformat_data_fork( /* * Initialize the extent count early, as the per-format routines may - * depend on it. + * depend on it. Use release semantics to set needextents /after/ we + * set the format. This ensures that we can use acquire semantics on + * needextents in xfs_need_iread_extents() and be guaranteed to see a + * valid format value after that load. */ ip->i_df.if_format = dip->di_format; ip->i_df.if_nextents = xfs_dfork_data_extents(dip); + smp_store_release(&ip->i_df.if_needextents, + ip->i_df.if_format == XFS_DINODE_FMT_BTREE ? 1 : 0); switch (inode->i_mode & S_IFMT) { case S_IFIFO: @@ -282,8 +288,17 @@ xfs_ifork_init_attr( enum xfs_dinode_fmt format, xfs_extnum_t nextents) { + /* + * Initialize the extent count early, as the per-format routines may + * depend on it. Use release semantics to set needextents /after/ we + * set the format. This ensures that we can use acquire semantics on + * needextents in xfs_need_iread_extents() and be guaranteed to see a + * valid format value after that load. 
+ */ ip->i_af.if_format = format; ip->i_af.if_nextents = nextents; + smp_store_release(&ip->i_af.if_needextents, + ip->i_af.if_format == XFS_DINODE_FMT_BTREE ? 1 : 0); } void diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h index d3943d6ad0b9..96d307784c85 100644 --- a/fs/xfs/libxfs/xfs_inode_fork.h +++ b/fs/xfs/libxfs/xfs_inode_fork.h @@ -24,6 +24,7 @@ struct xfs_ifork { xfs_extnum_t if_nextents; /* # of extents in this fork */ short if_broot_bytes; /* bytes allocated for root */ int8_t if_format; /* format of this fork */ + uint8_t if_needextents; /* extents have not been read */ }; /* @@ -260,9 +261,10 @@ int xfs_iext_count_upgrade(struct xfs_trans *tp, struct xfs_inode *ip, uint nr_to_add); /* returns true if the fork has extents but they are not read in yet. */ -static inline bool xfs_need_iread_extents(struct xfs_ifork *ifp) +static inline bool xfs_need_iread_extents(const struct xfs_ifork *ifp) { - return ifp->if_format == XFS_DINODE_FMT_BTREE && ifp->if_height == 0; + /* see xfs_iformat_{data,attr}_fork() for needextents semantics */ + return smp_load_acquire(&ifp->if_needextents) != 0; } #endif /* __XFS_INODE_FORK_H__ */ diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c index bcf46aa0d08b..c1c65774dcc2 100644 --- a/fs/xfs/libxfs/xfs_refcount.c +++ b/fs/xfs/libxfs/xfs_refcount.c @@ -120,45 +120,41 @@ xfs_refcount_btrec_to_irec( irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount); } -/* - * Get the data from the pointed-to record. - */ -int -xfs_refcount_get_rec( +/* Simple checks for refcount records. */ +xfs_failaddr_t +xfs_refcount_check_irec( struct xfs_btree_cur *cur, - struct xfs_refcount_irec *irec, - int *stat) + const struct xfs_refcount_irec *irec) { - struct xfs_mount *mp = cur->bc_mp; struct xfs_perag *pag = cur->bc_ag.pag; - union xfs_btree_rec *rec; - int error; - - error = xfs_btree_get_rec(cur, &rec, stat); - if (error || !*stat) - return error; - xfs_refcount_btrec_to_irec(rec, irec); if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN) - goto out_bad_rec; + return __this_address; if (!xfs_refcount_check_domain(irec)) - goto out_bad_rec; + return __this_address; /* check for valid extent range, including overflow */ if (!xfs_verify_agbext(pag, irec->rc_startblock, irec->rc_blockcount)) - goto out_bad_rec; + return __this_address; if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT) - goto out_bad_rec; + return __this_address; - trace_xfs_refcount_get(cur->bc_mp, pag->pag_agno, irec); - return 0; + return NULL; +} + +static inline int +xfs_refcount_complain_bad_rec( + struct xfs_btree_cur *cur, + xfs_failaddr_t fa, + const struct xfs_refcount_irec *irec) +{ + struct xfs_mount *mp = cur->bc_mp; -out_bad_rec: xfs_warn(mp, - "Refcount BTree record corruption in AG %d detected!", - pag->pag_agno); + "Refcount BTree record corruption in AG %d detected at %pS!", + cur->bc_ag.pag->pag_agno, fa); xfs_warn(mp, "Start block 0x%x, block count 0x%x, references 0x%x", irec->rc_startblock, irec->rc_blockcount, irec->rc_refcount); @@ -166,6 +162,32 @@ out_bad_rec: } /* + * Get the data from the pointed-to record. 
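
The comment above spells out the ordering contract: the plain if_format store has to be visible before any reader observes if_needextents, so the flag is published with smp_store_release() and consumed with smp_load_acquire() in xfs_need_iread_extents(). Here is a small userspace analogue of that publish/consume pairing using C11 atomics, which carry the same release/acquire semantics; the field names are only for illustration.

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	struct fork_state {
		int format;		/* plain store, must be seen by readers */
		atomic_int needextents;	/* publication flag */
	};

	static struct fork_state f;

	static void *writer(void *arg)
	{
		f.format = 3;	/* pretend this is the btree format */
		/* release: orders the format store before the flag store */
		atomic_store_explicit(&f.needextents, 1, memory_order_release);
		return NULL;
	}

	static void *reader(void *arg)
	{
		/* acquire: anyone who sees the flag also sees a valid format */
		while (!atomic_load_explicit(&f.needextents, memory_order_acquire))
			;
		printf("format=%d\n", f.format);
		return NULL;
	}

	int main(void)
	{
		pthread_t w, r;

		pthread_create(&r, NULL, reader, NULL);
		pthread_create(&w, NULL, writer, NULL);
		pthread_join(w, NULL);
		pthread_join(r, NULL);
		return 0;
	}
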
+ */ +int +xfs_refcount_get_rec( + struct xfs_btree_cur *cur, + struct xfs_refcount_irec *irec, + int *stat) +{ + union xfs_btree_rec *rec; + xfs_failaddr_t fa; + int error; + + error = xfs_btree_get_rec(cur, &rec, stat); + if (error || !*stat) + return error; + + xfs_refcount_btrec_to_irec(rec, irec); + fa = xfs_refcount_check_irec(cur, irec); + if (fa) + return xfs_refcount_complain_bad_rec(cur, fa, irec); + + trace_xfs_refcount_get(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec); + return 0; +} + +/* * Update the record referred to by cur to the value given * by [bno, len, refcount]. * This either works (return 0) or gets an EFSCORRUPTED error. @@ -1332,26 +1354,22 @@ xfs_refcount_finish_one( xfs_agblock_t bno; unsigned long nr_ops = 0; int shape_changes = 0; - struct xfs_perag *pag; - pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock)); bno = XFS_FSB_TO_AGBNO(mp, ri->ri_startblock); trace_xfs_refcount_deferred(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock), ri->ri_type, XFS_FSB_TO_AGBNO(mp, ri->ri_startblock), ri->ri_blockcount); - if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REFCOUNT_FINISH_ONE)) { - error = -EIO; - goto out_drop; - } + if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REFCOUNT_FINISH_ONE)) + return -EIO; /* * If we haven't gotten a cursor or the cursor AG doesn't match * the startblock, get one now. */ rcur = *pcur; - if (rcur != NULL && rcur->bc_ag.pag != pag) { + if (rcur != NULL && rcur->bc_ag.pag != ri->ri_pag) { nr_ops = rcur->bc_ag.refc.nr_ops; shape_changes = rcur->bc_ag.refc.shape_changes; xfs_refcount_finish_one_cleanup(tp, rcur, 0); @@ -1359,12 +1377,12 @@ xfs_refcount_finish_one( *pcur = NULL; } if (rcur == NULL) { - error = xfs_alloc_read_agf(pag, tp, XFS_ALLOC_FLAG_FREEING, - &agbp); + error = xfs_alloc_read_agf(ri->ri_pag, tp, + XFS_ALLOC_FLAG_FREEING, &agbp); if (error) - goto out_drop; + return error; - rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, pag); + rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, ri->ri_pag); rcur->bc_ag.refc.nr_ops = nr_ops; rcur->bc_ag.refc.shape_changes = shape_changes; } @@ -1375,7 +1393,7 @@ xfs_refcount_finish_one( error = xfs_refcount_adjust(rcur, &bno, &ri->ri_blockcount, XFS_REFCOUNT_ADJUST_INCREASE); if (error) - goto out_drop; + return error; if (ri->ri_blockcount > 0) error = xfs_refcount_continue_op(rcur, ri, bno); break; @@ -1383,31 +1401,29 @@ xfs_refcount_finish_one( error = xfs_refcount_adjust(rcur, &bno, &ri->ri_blockcount, XFS_REFCOUNT_ADJUST_DECREASE); if (error) - goto out_drop; + return error; if (ri->ri_blockcount > 0) error = xfs_refcount_continue_op(rcur, ri, bno); break; case XFS_REFCOUNT_ALLOC_COW: error = __xfs_refcount_cow_alloc(rcur, bno, ri->ri_blockcount); if (error) - goto out_drop; + return error; ri->ri_blockcount = 0; break; case XFS_REFCOUNT_FREE_COW: error = __xfs_refcount_cow_free(rcur, bno, ri->ri_blockcount); if (error) - goto out_drop; + return error; ri->ri_blockcount = 0; break; default: ASSERT(0); - error = -EFSCORRUPTED; + return -EFSCORRUPTED; } if (!error && ri->ri_blockcount > 0) - trace_xfs_refcount_finish_one_leftover(mp, pag->pag_agno, + trace_xfs_refcount_finish_one_leftover(mp, ri->ri_pag->pag_agno, ri->ri_type, bno, ri->ri_blockcount); -out_drop: - xfs_perag_put(pag); return error; } @@ -1435,6 +1451,7 @@ __xfs_refcount_add( ri->ri_startblock = startblock; ri->ri_blockcount = blockcount; + xfs_refcount_update_get_group(tp->t_mountp, ri); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_REFCOUNT, &ri->ri_list); } @@ -1876,7 +1893,8 @@ xfs_refcount_recover_extent( INIT_LIST_HEAD(&rr->rr_list); 
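
The same shape repeats across the inobt, refcount, and rmap btrees in this series: a pure xfs_*_check_irec() returning an xfs_failaddr_t that points at the failed check (or NULL), a small complain helper that logs the record together with the failure address and returns -EFSCORRUPTED, and a get_rec path reduced to decode, check, complain. A compact sketch of that split follows, with a plain string standing in for xfs_failaddr_t/__this_address; the record fields and error value are simplified stand-ins.

	#include <stdio.h>
	#include <stddef.h>
	#include <stdint.h>

	#define EFSCORRUPTED	117	/* stand-in for the kernel errno value */

	struct irec {
		uint32_t startblock;
		uint32_t blockcount;
	};

	/* Pure validation: return a marker for the failed check, or NULL. */
	static const char *check_irec(const struct irec *irec)
	{
		if (irec->blockcount == 0)
			return "zero blockcount";
		if (irec->startblock + irec->blockcount < irec->startblock)
			return "extent overflows";
		return NULL;
	}

	static int complain_bad_rec(const char *fa, const struct irec *irec)
	{
		fprintf(stderr,
			"record corruption detected at %s! start 0x%x len 0x%x\n",
			fa, irec->startblock, irec->blockcount);
		return -EFSCORRUPTED;
	}

	/* get_rec-style caller: decode (elided), check, complain on failure. */
	static int get_rec(const struct irec *ondisk, struct irec *irec)
	{
		const char *fa;

		*irec = *ondisk;	/* stands in for btrec_to_irec */
		fa = check_irec(irec);
		if (fa)
			return complain_bad_rec(fa, irec);
		return 0;
	}

	int main(void)
	{
		struct irec bad = { .startblock = 10, .blockcount = 0 };
		struct irec out;

		return get_rec(&bad, &out) ? 1 : 0;
	}

Splitting the check out this way is what lets scrub validate records it decodes itself (as xfs_ialloc_count_inodes_rec and the rmap query helper now do) without duplicating the verification logic.
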
xfs_refcount_btrec_to_irec(rec, &rr->rr_rrec); - if (XFS_IS_CORRUPT(cur->bc_mp, + if (xfs_refcount_check_irec(cur, &rr->rr_rrec) != NULL || + XFS_IS_CORRUPT(cur->bc_mp, rr->rr_rrec.rc_domain != XFS_REFC_DOMAIN_COW)) { kfree(rr); return -EFSCORRUPTED; @@ -1980,14 +1998,17 @@ out_free: return error; } -/* Is there a record covering a given extent? */ +/* + * Scan part of the keyspace of the refcount records and tell us if the area + * has no records, is fully mapped by records, or is partially filled. + */ int -xfs_refcount_has_record( +xfs_refcount_has_records( struct xfs_btree_cur *cur, enum xfs_refc_domain domain, xfs_agblock_t bno, xfs_extlen_t len, - bool *exists) + enum xbtree_recpacking *outcome) { union xfs_btree_irec low; union xfs_btree_irec high; @@ -1998,7 +2019,7 @@ xfs_refcount_has_record( high.rc.rc_startblock = bno + len - 1; low.rc.rc_domain = high.rc.rc_domain = domain; - return xfs_btree_has_record(cur, &low, &high, exists); + return xfs_btree_has_records(cur, &low, &high, NULL, outcome); } int __init diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h index c633477ce3ce..783cd89ca195 100644 --- a/fs/xfs/libxfs/xfs_refcount.h +++ b/fs/xfs/libxfs/xfs_refcount.h @@ -50,6 +50,7 @@ enum xfs_refcount_intent_type { struct xfs_refcount_intent { struct list_head ri_list; + struct xfs_perag *ri_pag; enum xfs_refcount_intent_type ri_type; xfs_extlen_t ri_blockcount; xfs_fsblock_t ri_startblock; @@ -67,6 +68,9 @@ xfs_refcount_check_domain( return true; } +void xfs_refcount_update_get_group(struct xfs_mount *mp, + struct xfs_refcount_intent *ri); + void xfs_refcount_increase_extent(struct xfs_trans *tp, struct xfs_bmbt_irec *irec); void xfs_refcount_decrease_extent(struct xfs_trans *tp, @@ -107,12 +111,14 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp, */ #define XFS_REFCOUNT_ITEM_OVERHEAD 32 -extern int xfs_refcount_has_record(struct xfs_btree_cur *cur, +extern int xfs_refcount_has_records(struct xfs_btree_cur *cur, enum xfs_refc_domain domain, xfs_agblock_t bno, - xfs_extlen_t len, bool *exists); + xfs_extlen_t len, enum xbtree_recpacking *outcome); union xfs_btree_rec; extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_refcount_irec *irec); +xfs_failaddr_t xfs_refcount_check_irec(struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *irec); extern int xfs_refcount_insert(struct xfs_btree_cur *cur, struct xfs_refcount_irec *irec, int *stat); diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c index f3b860970b26..d4afc5f4e6a5 100644 --- a/fs/xfs/libxfs/xfs_refcount_btree.c +++ b/fs/xfs/libxfs/xfs_refcount_btree.c @@ -112,8 +112,9 @@ xfs_refcountbt_free_block( XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1); be32_add_cpu(&agf->agf_refcount_blocks, -1); xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS); - error = xfs_free_extent(cur->bc_tp, fsbno, 1, &XFS_RMAP_OINFO_REFC, - XFS_AG_RESV_METADATA); + error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag, + XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1, + &XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA); if (error) return error; @@ -201,10 +202,13 @@ STATIC int64_t xfs_refcountbt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { + ASSERT(!mask || mask->refc.rc_startblock); + return (int64_t)be32_to_cpu(k1->refc.rc_startblock) - - be32_to_cpu(k2->refc.rc_startblock); + be32_to_cpu(k2->refc.rc_startblock); } STATIC 
xfs_failaddr_t @@ -299,6 +303,19 @@ xfs_refcountbt_recs_inorder( be32_to_cpu(r2->refc.rc_startblock); } +STATIC enum xbtree_key_contig +xfs_refcountbt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + ASSERT(!mask || mask->refc.rc_startblock); + + return xbtree_key_contig(be32_to_cpu(key1->refc.rc_startblock), + be32_to_cpu(key2->refc.rc_startblock)); +} + static const struct xfs_btree_ops xfs_refcountbt_ops = { .rec_len = sizeof(struct xfs_refcount_rec), .key_len = sizeof(struct xfs_refcount_key), @@ -318,6 +335,7 @@ static const struct xfs_btree_ops xfs_refcountbt_ops = { .diff_two_keys = xfs_refcountbt_diff_two_keys, .keys_inorder = xfs_refcountbt_keys_inorder, .recs_inorder = xfs_refcountbt_recs_inorder, + .keys_contiguous = xfs_refcountbt_keys_contiguous, }; /* @@ -339,10 +357,7 @@ xfs_refcountbt_init_common( cur->bc_flags |= XFS_BTREE_CRC_BLOCKS; - /* take a reference for the cursor */ - atomic_inc(&pag->pag_ref); - cur->bc_ag.pag = pag; - + cur->bc_ag.pag = xfs_perag_hold(pag); cur->bc_ag.refc.nr_ops = 0; cur->bc_ag.refc.shape_changes = 0; cur->bc_ops = &xfs_refcountbt_ops; diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c index df720041cd3d..f4dc23b3b837 100644 --- a/fs/xfs/libxfs/xfs_rmap.c +++ b/fs/xfs/libxfs/xfs_rmap.c @@ -193,7 +193,7 @@ done: } /* Convert an internal btree record to an rmap record. */ -int +xfs_failaddr_t xfs_rmap_btrec_to_irec( const union xfs_btree_rec *rec, struct xfs_rmap_irec *irec) @@ -205,51 +205,74 @@ xfs_rmap_btrec_to_irec( irec); } -/* - * Get the data from the pointed-to record. - */ -int -xfs_rmap_get_rec( - struct xfs_btree_cur *cur, - struct xfs_rmap_irec *irec, - int *stat) +/* Simple checks for rmap records. */ +xfs_failaddr_t +xfs_rmap_check_irec( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *irec) { - struct xfs_mount *mp = cur->bc_mp; - struct xfs_perag *pag = cur->bc_ag.pag; - union xfs_btree_rec *rec; - int error; - - error = xfs_btree_get_rec(cur, &rec, stat); - if (error || !*stat) - return error; - - if (xfs_rmap_btrec_to_irec(rec, irec)) - goto out_bad_rec; + struct xfs_mount *mp = cur->bc_mp; + bool is_inode; + bool is_unwritten; + bool is_bmbt; + bool is_attr; if (irec->rm_blockcount == 0) - goto out_bad_rec; + return __this_address; if (irec->rm_startblock <= XFS_AGFL_BLOCK(mp)) { if (irec->rm_owner != XFS_RMAP_OWN_FS) - goto out_bad_rec; + return __this_address; if (irec->rm_blockcount != XFS_AGFL_BLOCK(mp) + 1) - goto out_bad_rec; + return __this_address; } else { /* check for valid extent range, including overflow */ - if (!xfs_verify_agbext(pag, irec->rm_startblock, - irec->rm_blockcount)) - goto out_bad_rec; + if (!xfs_verify_agbext(cur->bc_ag.pag, irec->rm_startblock, + irec->rm_blockcount)) + return __this_address; } if (!(xfs_verify_ino(mp, irec->rm_owner) || (irec->rm_owner <= XFS_RMAP_OWN_FS && irec->rm_owner >= XFS_RMAP_OWN_MIN))) - goto out_bad_rec; + return __this_address; + + /* Check flags. 
*/ + is_inode = !XFS_RMAP_NON_INODE_OWNER(irec->rm_owner); + is_bmbt = irec->rm_flags & XFS_RMAP_BMBT_BLOCK; + is_attr = irec->rm_flags & XFS_RMAP_ATTR_FORK; + is_unwritten = irec->rm_flags & XFS_RMAP_UNWRITTEN; + + if (is_bmbt && irec->rm_offset != 0) + return __this_address; + + if (!is_inode && irec->rm_offset != 0) + return __this_address; + + if (is_unwritten && (is_bmbt || !is_inode || is_attr)) + return __this_address; + + if (!is_inode && (is_bmbt || is_unwritten || is_attr)) + return __this_address; + + /* Check for a valid fork offset, if applicable. */ + if (is_inode && !is_bmbt && + !xfs_verify_fileext(mp, irec->rm_offset, irec->rm_blockcount)) + return __this_address; + + return NULL; +} + +static inline int +xfs_rmap_complain_bad_rec( + struct xfs_btree_cur *cur, + xfs_failaddr_t fa, + const struct xfs_rmap_irec *irec) +{ + struct xfs_mount *mp = cur->bc_mp; - return 0; -out_bad_rec: xfs_warn(mp, - "Reverse Mapping BTree record corruption in AG %d detected!", - pag->pag_agno); + "Reverse Mapping BTree record corruption in AG %d detected at %pS!", + cur->bc_ag.pag->pag_agno, fa); xfs_warn(mp, "Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x", irec->rm_owner, irec->rm_flags, irec->rm_startblock, @@ -257,6 +280,32 @@ out_bad_rec: return -EFSCORRUPTED; } +/* + * Get the data from the pointed-to record. + */ +int +xfs_rmap_get_rec( + struct xfs_btree_cur *cur, + struct xfs_rmap_irec *irec, + int *stat) +{ + union xfs_btree_rec *rec; + xfs_failaddr_t fa; + int error; + + error = xfs_btree_get_rec(cur, &rec, stat); + if (error || !*stat) + return error; + + fa = xfs_rmap_btrec_to_irec(rec, irec); + if (!fa) + fa = xfs_rmap_check_irec(cur, irec); + if (fa) + return xfs_rmap_complain_bad_rec(cur, fa, irec); + + return 0; +} + struct xfs_find_left_neighbor_info { struct xfs_rmap_irec high; struct xfs_rmap_irec *irec; @@ -2320,11 +2369,14 @@ xfs_rmap_query_range_helper( { struct xfs_rmap_query_range_info *query = priv; struct xfs_rmap_irec irec; - int error; + xfs_failaddr_t fa; + + fa = xfs_rmap_btrec_to_irec(rec, &irec); + if (!fa) + fa = xfs_rmap_check_irec(cur, &irec); + if (fa) + return xfs_rmap_complain_bad_rec(cur, fa, &irec); - error = xfs_rmap_btrec_to_irec(rec, &irec); - if (error) - return error; return query->fn(cur, &irec, query->priv); } @@ -2394,7 +2446,6 @@ xfs_rmap_finish_one( struct xfs_btree_cur **pcur) { struct xfs_mount *mp = tp->t_mountp; - struct xfs_perag *pag; struct xfs_btree_cur *rcur; struct xfs_buf *agbp = NULL; int error = 0; @@ -2402,26 +2453,22 @@ xfs_rmap_finish_one( xfs_agblock_t bno; bool unwritten; - pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock)); bno = XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock); - trace_xfs_rmap_deferred(mp, pag->pag_agno, ri->ri_type, bno, + trace_xfs_rmap_deferred(mp, ri->ri_pag->pag_agno, ri->ri_type, bno, ri->ri_owner, ri->ri_whichfork, ri->ri_bmap.br_startoff, ri->ri_bmap.br_blockcount, ri->ri_bmap.br_state); - if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RMAP_FINISH_ONE)) { - error = -EIO; - goto out_drop; - } - + if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RMAP_FINISH_ONE)) + return -EIO; /* * If we haven't gotten a cursor or the cursor AG doesn't match * the startblock, get one now. */ rcur = *pcur; - if (rcur != NULL && rcur->bc_ag.pag != pag) { + if (rcur != NULL && rcur->bc_ag.pag != ri->ri_pag) { xfs_rmap_finish_one_cleanup(tp, rcur, 0); rcur = NULL; *pcur = NULL; @@ -2432,15 +2479,13 @@ xfs_rmap_finish_one( * rmapbt, because a shape change could cause us to * allocate blocks. 
*/ - error = xfs_free_extent_fix_freelist(tp, pag, &agbp); + error = xfs_free_extent_fix_freelist(tp, ri->ri_pag, &agbp); if (error) - goto out_drop; - if (XFS_IS_CORRUPT(tp->t_mountp, !agbp)) { - error = -EFSCORRUPTED; - goto out_drop; - } + return error; + if (XFS_IS_CORRUPT(tp->t_mountp, !agbp)) + return -EFSCORRUPTED; - rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, pag); + rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, ri->ri_pag); } *pcur = rcur; @@ -2480,8 +2525,7 @@ xfs_rmap_finish_one( ASSERT(0); error = -EFSCORRUPTED; } -out_drop: - xfs_perag_put(pag); + return error; } @@ -2526,6 +2570,7 @@ __xfs_rmap_add( ri->ri_whichfork = whichfork; ri->ri_bmap = *bmap; + xfs_rmap_update_get_group(tp->t_mountp, ri); xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_RMAP, &ri->ri_list); } @@ -2664,14 +2709,21 @@ xfs_rmap_compare( return 0; } -/* Is there a record covering a given extent? */ +/* + * Scan the physical storage part of the keyspace of the reverse mapping index + * and tell us if the area has no records, is fully mapped by records, or is + * partially filled. + */ int -xfs_rmap_has_record( +xfs_rmap_has_records( struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, - bool *exists) + enum xbtree_recpacking *outcome) { + union xfs_btree_key mask = { + .rmap.rm_startblock = cpu_to_be32(-1U), + }; union xfs_btree_irec low; union xfs_btree_irec high; @@ -2680,68 +2732,144 @@ xfs_rmap_has_record( memset(&high, 0xFF, sizeof(high)); high.r.rm_startblock = bno + len - 1; - return xfs_btree_has_record(cur, &low, &high, exists); + return xfs_btree_has_records(cur, &low, &high, &mask, outcome); } -/* - * Is there a record for this owner completely covering a given physical - * extent? If so, *has_rmap will be set to true. If there is no record - * or the record only covers part of the range, we set *has_rmap to false. - * This function doesn't perform range lookups or offset checks, so it is - * not suitable for checking data fork blocks. - */ -int -xfs_rmap_record_exists( - struct xfs_btree_cur *cur, +struct xfs_rmap_ownercount { + /* Owner that we're looking for. */ + struct xfs_rmap_irec good; + + /* rmap search keys */ + struct xfs_rmap_irec low; + struct xfs_rmap_irec high; + + struct xfs_rmap_matches *results; + + /* Stop early if we find a nonmatch? */ + bool stop_on_nonmatch; +}; + +/* Does this rmap represent space that can have multiple owners? 
*/ +static inline bool +xfs_rmap_shareable( + struct xfs_mount *mp, + const struct xfs_rmap_irec *rmap) +{ + if (!xfs_has_reflink(mp)) + return false; + if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner)) + return false; + if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK | + XFS_RMAP_BMBT_BLOCK)) + return false; + return true; +} + +static inline void +xfs_rmap_ownercount_init( + struct xfs_rmap_ownercount *roc, xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, - bool *has_rmap) + struct xfs_rmap_matches *results) { - uint64_t owner; - uint64_t offset; - unsigned int flags; - int has_record; - struct xfs_rmap_irec irec; - int error; + memset(roc, 0, sizeof(*roc)); + roc->results = results; + + roc->low.rm_startblock = bno; + memset(&roc->high, 0xFF, sizeof(roc->high)); + roc->high.rm_startblock = bno + len - 1; + + memset(results, 0, sizeof(*results)); + roc->good.rm_startblock = bno; + roc->good.rm_blockcount = len; + roc->good.rm_owner = oinfo->oi_owner; + roc->good.rm_offset = oinfo->oi_offset; + if (oinfo->oi_flags & XFS_OWNER_INFO_ATTR_FORK) + roc->good.rm_flags |= XFS_RMAP_ATTR_FORK; + if (oinfo->oi_flags & XFS_OWNER_INFO_BMBT_BLOCK) + roc->good.rm_flags |= XFS_RMAP_BMBT_BLOCK; +} - xfs_owner_info_unpack(oinfo, &owner, &offset, &flags); - ASSERT(XFS_RMAP_NON_INODE_OWNER(owner) || - (flags & XFS_RMAP_BMBT_BLOCK)); +/* Figure out if this is a match for the owner. */ +STATIC int +xfs_rmap_count_owners_helper( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + struct xfs_rmap_ownercount *roc = priv; + struct xfs_rmap_irec check = *rec; + unsigned int keyflags; + bool filedata; + int64_t delta; + + filedata = !XFS_RMAP_NON_INODE_OWNER(check.rm_owner) && + !(check.rm_flags & XFS_RMAP_BMBT_BLOCK); + + /* Trim the part of check that comes before the comparison range. */ + delta = (int64_t)roc->good.rm_startblock - check.rm_startblock; + if (delta > 0) { + check.rm_startblock += delta; + check.rm_blockcount -= delta; + if (filedata) + check.rm_offset += delta; + } - error = xfs_rmap_lookup_le(cur, bno, owner, offset, flags, &irec, - &has_record); - if (error) - return error; - if (!has_record) { - *has_rmap = false; - return 0; + /* Trim the part of check that comes after the comparison range. */ + delta = (check.rm_startblock + check.rm_blockcount) - + (roc->good.rm_startblock + roc->good.rm_blockcount); + if (delta > 0) + check.rm_blockcount -= delta; + + /* Don't care about unwritten status for establishing ownership. */ + keyflags = check.rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK); + + if (check.rm_startblock == roc->good.rm_startblock && + check.rm_blockcount == roc->good.rm_blockcount && + check.rm_owner == roc->good.rm_owner && + check.rm_offset == roc->good.rm_offset && + keyflags == roc->good.rm_flags) { + roc->results->matches++; + } else { + roc->results->non_owner_matches++; + if (xfs_rmap_shareable(cur->bc_mp, &roc->good) ^ + xfs_rmap_shareable(cur->bc_mp, &check)) + roc->results->bad_non_owner_matches++; } - *has_rmap = (irec.rm_owner == owner && irec.rm_startblock <= bno && - irec.rm_startblock + irec.rm_blockcount >= bno + len); + if (roc->results->non_owner_matches && roc->stop_on_nonmatch) + return -ECANCELED; + return 0; } -struct xfs_rmap_key_state { - uint64_t owner; - uint64_t offset; - unsigned int flags; -}; - -/* For each rmap given, figure out if it doesn't match the key we want. */ -STATIC int -xfs_rmap_has_other_keys_helper( +/* Count the number of owners and non-owners of this range of blocks. 
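
xfs_rmap_count_owners_helper trims each candidate rmap record to the query window before comparing it with the expected owner, so a record that covers more than the window still counts as an exact owner match. A standalone sketch of just that trimming arithmetic; the struct is a simplified stand-in for struct xfs_rmap_irec.

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	struct rec {
		uint32_t startblock;
		uint32_t blockcount;
		uint64_t offset;	/* file offset, meaningful for file data only */
	};

	/*
	 * Clamp *check to the window [start, start + len), adjusting the file
	 * offset by however much we chop off the front.
	 */
	static void trim_to_window(struct rec *check, uint32_t start, uint32_t len,
				   bool filedata)
	{
		int64_t delta;

		delta = (int64_t)start - check->startblock;
		if (delta > 0) {
			check->startblock += delta;
			check->blockcount -= delta;
			if (filedata)
				check->offset += delta;
		}

		delta = (int64_t)(check->startblock + check->blockcount) -
			((int64_t)start + len);
		if (delta > 0)
			check->blockcount -= delta;
	}

	int main(void)
	{
		struct rec r = { .startblock = 100, .blockcount = 50, .offset = 0 };

		trim_to_window(&r, 110, 20, true);
		printf("start=%u count=%u offset=%llu\n", r.startblock,
		       r.blockcount, (unsigned long long)r.offset);	/* 110 20 10 */
		return 0;
	}
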
*/ +int +xfs_rmap_count_owners( struct xfs_btree_cur *cur, - const struct xfs_rmap_irec *rec, - void *priv) + xfs_agblock_t bno, + xfs_extlen_t len, + const struct xfs_owner_info *oinfo, + struct xfs_rmap_matches *results) { - struct xfs_rmap_key_state *rks = priv; + struct xfs_rmap_ownercount roc; + int error; - if (rks->owner == rec->rm_owner && rks->offset == rec->rm_offset && - ((rks->flags & rec->rm_flags) & XFS_RMAP_KEY_FLAGS) == rks->flags) - return 0; - return -ECANCELED; + xfs_rmap_ownercount_init(&roc, bno, len, oinfo, results); + error = xfs_rmap_query_range(cur, &roc.low, &roc.high, + xfs_rmap_count_owners_helper, &roc); + if (error) + return error; + + /* + * There can't be any non-owner rmaps that conflict with the given + * owner if we didn't find any rmaps matching the owner. + */ + if (!results->matches) + results->bad_non_owner_matches = 0; + + return 0; } /* @@ -2754,28 +2882,26 @@ xfs_rmap_has_other_keys( xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, - bool *has_rmap) + bool *has_other) { - struct xfs_rmap_irec low = {0}; - struct xfs_rmap_irec high; - struct xfs_rmap_key_state rks; + struct xfs_rmap_matches res; + struct xfs_rmap_ownercount roc; int error; - xfs_owner_info_unpack(oinfo, &rks.owner, &rks.offset, &rks.flags); - *has_rmap = false; - - low.rm_startblock = bno; - memset(&high, 0xFF, sizeof(high)); - high.rm_startblock = bno + len - 1; + xfs_rmap_ownercount_init(&roc, bno, len, oinfo, &res); + roc.stop_on_nonmatch = true; - error = xfs_rmap_query_range(cur, &low, &high, - xfs_rmap_has_other_keys_helper, &rks); + error = xfs_rmap_query_range(cur, &roc.low, &roc.high, + xfs_rmap_count_owners_helper, &roc); if (error == -ECANCELED) { - *has_rmap = true; + *has_other = true; return 0; } + if (error) + return error; - return error; + *has_other = false; + return 0; } const struct xfs_owner_info XFS_RMAP_OINFO_SKIP_UPDATE = { diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h index 2dac88cea28d..3c98d9d50afb 100644 --- a/fs/xfs/libxfs/xfs_rmap.h +++ b/fs/xfs/libxfs/xfs_rmap.h @@ -62,13 +62,14 @@ xfs_rmap_irec_offset_pack( return x; } -static inline int +static inline xfs_failaddr_t xfs_rmap_irec_offset_unpack( __u64 offset, struct xfs_rmap_irec *irec) { if (offset & ~(XFS_RMAP_OFF_MASK | XFS_RMAP_OFF_FLAGS)) - return -EFSCORRUPTED; + return __this_address; + irec->rm_offset = XFS_RMAP_OFF(offset); irec->rm_flags = 0; if (offset & XFS_RMAP_OFF_ATTR_FORK) @@ -77,7 +78,7 @@ xfs_rmap_irec_offset_unpack( irec->rm_flags |= XFS_RMAP_BMBT_BLOCK; if (offset & XFS_RMAP_OFF_UNWRITTEN) irec->rm_flags |= XFS_RMAP_UNWRITTEN; - return 0; + return NULL; } static inline void @@ -162,8 +163,12 @@ struct xfs_rmap_intent { int ri_whichfork; uint64_t ri_owner; struct xfs_bmbt_irec ri_bmap; + struct xfs_perag *ri_pag; }; +void xfs_rmap_update_get_group(struct xfs_mount *mp, + struct xfs_rmap_intent *ri); + /* functions for updating the rmapbt based on bmbt map/unmap operations */ void xfs_rmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip, int whichfork, struct xfs_bmbt_irec *imap); @@ -188,16 +193,31 @@ int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno, int xfs_rmap_compare(const struct xfs_rmap_irec *a, const struct xfs_rmap_irec *b); union xfs_btree_rec; -int xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec, +xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec, struct xfs_rmap_irec *irec); -int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno, - xfs_extlen_t len, bool 
*exists); -int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno, +xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *irec); + +int xfs_rmap_has_records(struct xfs_btree_cur *cur, xfs_agblock_t bno, + xfs_extlen_t len, enum xbtree_recpacking *outcome); + +struct xfs_rmap_matches { + /* Number of owner matches. */ + unsigned long long matches; + + /* Number of non-owner matches. */ + unsigned long long non_owner_matches; + + /* Number of non-owner matches that conflict with the owner matches. */ + unsigned long long bad_non_owner_matches; +}; + +int xfs_rmap_count_owners(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, - bool *has_rmap); + struct xfs_rmap_matches *rmatch); int xfs_rmap_has_other_keys(struct xfs_btree_cur *cur, xfs_agblock_t bno, xfs_extlen_t len, const struct xfs_owner_info *oinfo, - bool *has_rmap); + bool *has_other); int xfs_rmap_map_raw(struct xfs_btree_cur *cur, struct xfs_rmap_irec *rmap); extern const struct xfs_owner_info XFS_RMAP_OINFO_SKIP_UPDATE; diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c index d3285684bb5e..6c81b20e97d2 100644 --- a/fs/xfs/libxfs/xfs_rmap_btree.c +++ b/fs/xfs/libxfs/xfs_rmap_btree.c @@ -156,6 +156,16 @@ xfs_rmapbt_get_maxrecs( return cur->bc_mp->m_rmap_mxr[level != 0]; } +/* + * Convert the ondisk record's offset field into the ondisk key's offset field. + * Fork and bmbt are significant parts of the rmap record key, but written + * status is merely a record attribute. + */ +static inline __be64 ondisk_rec_offset_to_key(const union xfs_btree_rec *rec) +{ + return rec->rmap.rm_offset & ~cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN); +} + STATIC void xfs_rmapbt_init_key_from_rec( union xfs_btree_key *key, @@ -163,7 +173,7 @@ xfs_rmapbt_init_key_from_rec( { key->rmap.rm_startblock = rec->rmap.rm_startblock; key->rmap.rm_owner = rec->rmap.rm_owner; - key->rmap.rm_offset = rec->rmap.rm_offset; + key->rmap.rm_offset = ondisk_rec_offset_to_key(rec); } /* @@ -186,7 +196,7 @@ xfs_rmapbt_init_high_key_from_rec( key->rmap.rm_startblock = rec->rmap.rm_startblock; be32_add_cpu(&key->rmap.rm_startblock, adj); key->rmap.rm_owner = rec->rmap.rm_owner; - key->rmap.rm_offset = rec->rmap.rm_offset; + key->rmap.rm_offset = ondisk_rec_offset_to_key(rec); if (XFS_RMAP_NON_INODE_OWNER(be64_to_cpu(rec->rmap.rm_owner)) || XFS_RMAP_IS_BMBT_BLOCK(be64_to_cpu(rec->rmap.rm_offset))) return; @@ -219,6 +229,16 @@ xfs_rmapbt_init_ptr_from_cur( ptr->s = agf->agf_roots[cur->bc_btnum]; } +/* + * Mask the appropriate parts of the ondisk key field for a key comparison. + * Fork and bmbt are significant parts of the rmap record key, but written + * status is merely a record attribute. 
+ */ +static inline uint64_t offset_keymask(uint64_t offset) +{ + return offset & ~XFS_RMAP_OFF_UNWRITTEN; +} + STATIC int64_t xfs_rmapbt_key_diff( struct xfs_btree_cur *cur, @@ -240,8 +260,8 @@ xfs_rmapbt_key_diff( else if (y > x) return -1; - x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset)); - y = rec->rm_offset; + x = offset_keymask(be64_to_cpu(kp->rm_offset)); + y = offset_keymask(xfs_rmap_irec_offset_pack(rec)); if (x > y) return 1; else if (y > x) @@ -253,31 +273,43 @@ STATIC int64_t xfs_rmapbt_diff_two_keys( struct xfs_btree_cur *cur, const union xfs_btree_key *k1, - const union xfs_btree_key *k2) + const union xfs_btree_key *k2, + const union xfs_btree_key *mask) { const struct xfs_rmap_key *kp1 = &k1->rmap; const struct xfs_rmap_key *kp2 = &k2->rmap; int64_t d; __u64 x, y; + /* Doesn't make sense to mask off the physical space part */ + ASSERT(!mask || mask->rmap.rm_startblock); + d = (int64_t)be32_to_cpu(kp1->rm_startblock) - - be32_to_cpu(kp2->rm_startblock); + be32_to_cpu(kp2->rm_startblock); if (d) return d; - x = be64_to_cpu(kp1->rm_owner); - y = be64_to_cpu(kp2->rm_owner); - if (x > y) - return 1; - else if (y > x) - return -1; + if (!mask || mask->rmap.rm_owner) { + x = be64_to_cpu(kp1->rm_owner); + y = be64_to_cpu(kp2->rm_owner); + if (x > y) + return 1; + else if (y > x) + return -1; + } + + if (!mask || mask->rmap.rm_offset) { + /* Doesn't make sense to allow offset but not owner */ + ASSERT(!mask || mask->rmap.rm_owner); + + x = offset_keymask(be64_to_cpu(kp1->rm_offset)); + y = offset_keymask(be64_to_cpu(kp2->rm_offset)); + if (x > y) + return 1; + else if (y > x) + return -1; + } - x = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset)); - y = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset)); - if (x > y) - return 1; - else if (y > x) - return -1; return 0; } @@ -387,8 +419,8 @@ xfs_rmapbt_keys_inorder( return 1; else if (a > b) return 0; - a = XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset)); - b = XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset)); + a = offset_keymask(be64_to_cpu(k1->rmap.rm_offset)); + b = offset_keymask(be64_to_cpu(k2->rmap.rm_offset)); if (a <= b) return 1; return 0; @@ -417,13 +449,33 @@ xfs_rmapbt_recs_inorder( return 1; else if (a > b) return 0; - a = XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset)); - b = XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset)); + a = offset_keymask(be64_to_cpu(r1->rmap.rm_offset)); + b = offset_keymask(be64_to_cpu(r2->rmap.rm_offset)); if (a <= b) return 1; return 0; } +STATIC enum xbtree_key_contig +xfs_rmapbt_keys_contiguous( + struct xfs_btree_cur *cur, + const union xfs_btree_key *key1, + const union xfs_btree_key *key2, + const union xfs_btree_key *mask) +{ + ASSERT(!mask || mask->rmap.rm_startblock); + + /* + * We only support checking contiguity of the physical space component. + * If any callers ever need more specificity than that, they'll have to + * implement it here. 
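
The point of offset_keymask is that two rmap keys differing only in written state must compare equal, otherwise a lookup could miss a record whose unwritten flag flipped. A tiny illustration of that masking; the bit position chosen here is hypothetical, the real XFS_RMAP_OFF_UNWRITTEN definition lives in fs/xfs/libxfs/xfs_format.h.

	#include <stdint.h>
	#include <stdio.h>

	/* Illustrative flag bit only; see xfs_format.h for the real layout. */
	#define OFF_UNWRITTEN	(1ULL << 61)

	static uint64_t offset_keymask(uint64_t offset)
	{
		return offset & ~OFF_UNWRITTEN;
	}

	int main(void)
	{
		uint64_t written = 1234;
		uint64_t unwritten = 1234 | OFF_UNWRITTEN;

		/*
		 * Written state is a record attribute, not part of the lookup
		 * key, so the two offsets must compare equal once masked.
		 */
		printf("%s\n",
		       offset_keymask(written) == offset_keymask(unwritten) ?
		       "keys compare equal" : "keys differ");
		return 0;
	}
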
+ */ + ASSERT(!mask || (!mask->rmap.rm_owner && !mask->rmap.rm_offset)); + + return xbtree_key_contig(be32_to_cpu(key1->rmap.rm_startblock), + be32_to_cpu(key2->rmap.rm_startblock)); +} + static const struct xfs_btree_ops xfs_rmapbt_ops = { .rec_len = sizeof(struct xfs_rmap_rec), .key_len = 2 * sizeof(struct xfs_rmap_key), @@ -443,6 +495,7 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = { .diff_two_keys = xfs_rmapbt_diff_two_keys, .keys_inorder = xfs_rmapbt_keys_inorder, .recs_inorder = xfs_rmapbt_recs_inorder, + .keys_contiguous = xfs_rmapbt_keys_contiguous, }; static struct xfs_btree_cur * @@ -460,10 +513,7 @@ xfs_rmapbt_init_common( cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2); cur->bc_ops = &xfs_rmapbt_ops; - /* take a reference for the cursor */ - atomic_inc(&pag->pag_ref); - cur->bc_ag.pag = pag; - + cur->bc_ag.pag = xfs_perag_hold(pag); return cur; } diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c index 99cc03a298e2..ba0f17bc1dc0 100644 --- a/fs/xfs/libxfs/xfs_sb.c +++ b/fs/xfs/libxfs/xfs_sb.c @@ -72,7 +72,8 @@ xfs_sb_validate_v5_features( } /* - * We support all XFS versions newer than a v4 superblock with V2 directories. + * We current support XFS v5 formats with known features and v4 superblocks with + * at least V2 directories. */ bool xfs_sb_good_version( @@ -86,16 +87,16 @@ xfs_sb_good_version( if (xfs_sb_is_v5(sbp)) return xfs_sb_validate_v5_features(sbp); + /* versions prior to v4 are not supported */ + if (XFS_SB_VERSION_NUM(sbp) != XFS_SB_VERSION_4) + return false; + /* We must not have any unknown v4 feature bits set */ if ((sbp->sb_versionnum & ~XFS_SB_VERSION_OKBITS) || ((sbp->sb_versionnum & XFS_SB_VERSION_MOREBITSBIT) && (sbp->sb_features2 & ~XFS_SB_VERSION2_OKBITS))) return false; - /* versions prior to v4 are not supported */ - if (XFS_SB_VERSION_NUM(sbp) < XFS_SB_VERSION_4) - return false; - /* V4 filesystems need v2 directories and unwritten extents */ if (!(sbp->sb_versionnum & XFS_SB_VERSION_DIRV2BIT)) return false; diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h index 5ebdda7e1078..851220021484 100644 --- a/fs/xfs/libxfs/xfs_types.h +++ b/fs/xfs/libxfs/xfs_types.h @@ -204,6 +204,18 @@ enum xfs_ag_resv_type { XFS_AG_RESV_RMAPBT, }; +/* Results of scanning a btree keyspace to check occupancy. */ +enum xbtree_recpacking { + /* None of the keyspace maps to records. */ + XBTREE_RECPACKING_EMPTY = 0, + + /* Some, but not all, of the keyspace maps to records. */ + XBTREE_RECPACKING_SPARSE, + + /* The entire keyspace maps to records. */ + XBTREE_RECPACKING_FULL, +}; + /* * Type verifier functions */ diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c index 4dd52b15f09c..6c6e5eba42c8 100644 --- a/fs/xfs/scrub/agheader.c +++ b/fs/xfs/scrub/agheader.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -18,6 +18,15 @@ #include "scrub/scrub.h" #include "scrub/common.h" +int +xchk_setup_agheader( + struct xfs_scrub *sc) +{ + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + return xchk_setup_fs(sc); +} + /* Superblock */ /* Cross-reference with the other btrees. 
*/ @@ -42,8 +51,9 @@ xchk_superblock_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_xref_is_not_inode_chunk(sc, agbno, 1); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); /* scrub teardown will take care of sc->sa for us */ } @@ -505,9 +515,10 @@ xchk_agf_xref( xchk_agf_xref_freeblks(sc); xchk_agf_xref_cntbt(sc); xchk_xref_is_not_inode_chunk(sc, agbno, 1); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_agf_xref_btreeblks(sc); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); xchk_agf_xref_refcblks(sc); /* scrub teardown will take care of sc->sa for us */ @@ -633,8 +644,9 @@ xchk_agfl_block_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_xref_is_not_inode_chunk(sc, agbno, 1); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_AG); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_AG); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); } /* Scrub an AGFL block. */ @@ -689,8 +701,9 @@ xchk_agfl_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_xref_is_not_inode_chunk(sc, agbno, 1); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); /* * Scrub teardown will take care of sc->sa for us. Leave sc->sa @@ -844,8 +857,9 @@ xchk_agi_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_xref_is_not_inode_chunk(sc, agbno, 1); xchk_agi_xref_icounts(sc); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); xchk_agi_xref_fiblocks(sc); /* scrub teardown will take care of sc->sa for us */ diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index c37e6d72760b..bbaa65422c4f 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -487,10 +487,11 @@ xrep_agfl_walk_rmap( /* Strike out the blocks that are cross-linked according to the rmapbt. */ STATIC int xrep_agfl_check_extent( - struct xrep_agfl *ra, uint64_t start, - uint64_t len) + uint64_t len, + void *priv) { + struct xrep_agfl *ra = priv; xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(ra->sc->mp, start); xfs_agblock_t last_agbno = agbno + len - 1; int error; @@ -538,7 +539,6 @@ xrep_agfl_collect_blocks( struct xrep_agfl ra; struct xfs_mount *mp = sc->mp; struct xfs_btree_cur *cur; - struct xbitmap_range *br, *n; int error; ra.sc = sc; @@ -579,11 +579,7 @@ xrep_agfl_collect_blocks( /* Strike out the blocks that are cross-linked. 
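
xrep_agfl_check_extent is reshaped here into an xbitmap_walk callback: the bitmap code owns the iteration and hands each (start, len) extent to the callback along with an opaque priv pointer, and a nonzero return stops the walk early. A sketch of that walk-with-callback shape over a plain array of extents; in the kernel xbitmap is an interval tree, the array below is just for illustration.

	#include <stdint.h>
	#include <stdio.h>

	struct extent {
		uint64_t start;
		uint64_t len;
	};

	typedef int (*walk_fn)(uint64_t start, uint64_t len, void *priv);

	/* Walk every extent, stopping as soon as a callback returns nonzero. */
	static int extent_walk(const struct extent *ext, unsigned int nr,
			       walk_fn fn, void *priv)
	{
		unsigned int i;
		int error;

		for (i = 0; i < nr; i++) {
			error = fn(ext[i].start, ext[i].len, priv);
			if (error)
				return error;
		}
		return 0;
	}

	/* Example callback: take blocks until we have enough, then cancel. */
	static int take_blocks(uint64_t start, uint64_t len, void *priv)
	{
		uint64_t *needed = priv;

		*needed -= (len < *needed) ? len : *needed;
		printf("took extent at %llu len %llu\n",
		       (unsigned long long)start, (unsigned long long)len);
		return *needed == 0 ? -1 /* stand-in for -ECANCELED */ : 0;
	}

	int main(void)
	{
		struct extent map[] = { { 10, 4 }, { 20, 8 }, { 40, 2 } };
		uint64_t needed = 10;

		extent_walk(map, 3, take_blocks, &needed);
		return 0;
	}

The same shape shows up again below in xrep_agfl_fill, which stops the walk with -ECANCELED once the AGFL is full and lets xbitmap_disunion strip the consumed blocks afterwards.
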
*/ ra.rmap_cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.pag); - for_each_xbitmap_extent(br, n, agfl_extents) { - error = xrep_agfl_check_extent(&ra, br->start, br->len); - if (error) - break; - } + error = xbitmap_walk(agfl_extents, xrep_agfl_check_extent, &ra); xfs_btree_del_cursor(ra.rmap_cur, error); if (error) goto out_bmp; @@ -629,21 +625,58 @@ xrep_agfl_update_agf( XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT); } +struct xrep_agfl_fill { + struct xbitmap used_extents; + struct xfs_scrub *sc; + __be32 *agfl_bno; + xfs_agblock_t flcount; + unsigned int fl_off; +}; + +/* Fill the AGFL with whatever blocks are in this extent. */ +static int +xrep_agfl_fill( + uint64_t start, + uint64_t len, + void *priv) +{ + struct xrep_agfl_fill *af = priv; + struct xfs_scrub *sc = af->sc; + xfs_fsblock_t fsbno = start; + int error; + + while (fsbno < start + len && af->fl_off < af->flcount) + af->agfl_bno[af->fl_off++] = + cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, fsbno++)); + + trace_xrep_agfl_insert(sc->mp, sc->sa.pag->pag_agno, + XFS_FSB_TO_AGBNO(sc->mp, start), len); + + error = xbitmap_set(&af->used_extents, start, fsbno - 1); + if (error) + return error; + + if (af->fl_off == af->flcount) + return -ECANCELED; + + return 0; +} + /* Write out a totally new AGFL. */ -STATIC void +STATIC int xrep_agfl_init_header( struct xfs_scrub *sc, struct xfs_buf *agfl_bp, struct xbitmap *agfl_extents, xfs_agblock_t flcount) { + struct xrep_agfl_fill af = { + .sc = sc, + .flcount = flcount, + }; struct xfs_mount *mp = sc->mp; - __be32 *agfl_bno; - struct xbitmap_range *br; - struct xbitmap_range *n; struct xfs_agfl *agfl; - xfs_agblock_t agbno; - unsigned int fl_off; + int error; ASSERT(flcount <= xfs_agfl_size(mp)); @@ -662,36 +695,18 @@ xrep_agfl_init_header( * blocks than fit in the AGFL, they will be freed in a subsequent * step. */ - fl_off = 0; - agfl_bno = xfs_buf_to_agfl_bno(agfl_bp); - for_each_xbitmap_extent(br, n, agfl_extents) { - agbno = XFS_FSB_TO_AGBNO(mp, br->start); - - trace_xrep_agfl_insert(mp, sc->sa.pag->pag_agno, agbno, - br->len); - - while (br->len > 0 && fl_off < flcount) { - agfl_bno[fl_off] = cpu_to_be32(agbno); - fl_off++; - agbno++; - - /* - * We've now used br->start by putting it in the AGFL, - * so bump br so that we don't reap the block later. - */ - br->start++; - br->len--; - } - - if (br->len) - break; - list_del(&br->list); - kfree(br); - } + xbitmap_init(&af.used_extents); + af.agfl_bno = xfs_buf_to_agfl_bno(agfl_bp), + xbitmap_walk(agfl_extents, xrep_agfl_fill, &af); + error = xbitmap_disunion(agfl_extents, &af.used_extents); + if (error) + return error; /* Write new AGFL to disk. */ xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF); xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1); + xbitmap_destroy(&af.used_extents); + return 0; } /* Repair the AGFL. */ @@ -744,7 +759,9 @@ xrep_agfl( * buffers until we know that part works. */ xrep_agfl_update_agf(sc, agf_bp, flcount); - xrep_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount); + error = xrep_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount); + if (error) + goto err; /* * Ok, the AGFL should be ready to go now. Roll the transaction to diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c index 3b38f4e2a537..279af72b1671 100644 --- a/fs/xfs/scrub/alloc.c +++ b/fs/xfs/scrub/alloc.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. 
Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -24,10 +24,19 @@ int xchk_setup_ag_allocbt( struct xfs_scrub *sc) { + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + return xchk_setup_ag_btree(sc, false); } /* Free space btree scrubber. */ + +struct xchk_alloc { + /* Previous free space extent. */ + struct xfs_alloc_rec_incore prev; +}; + /* * Ensure there's a corresponding cntbt/bnobt record matching this * bnobt/cntbt record, respectively. @@ -75,9 +84,11 @@ xchk_allocbt_xref_other( STATIC void xchk_allocbt_xref( struct xfs_scrub *sc, - xfs_agblock_t agbno, - xfs_extlen_t len) + const struct xfs_alloc_rec_incore *irec) { + xfs_agblock_t agbno = irec->ar_startblock; + xfs_extlen_t len = irec->ar_blockcount; + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) return; @@ -85,25 +96,44 @@ xchk_allocbt_xref( xchk_xref_is_not_inode_chunk(sc, agbno, len); xchk_xref_has_no_owner(sc, agbno, len); xchk_xref_is_not_shared(sc, agbno, len); + xchk_xref_is_not_cow_staging(sc, agbno, len); +} + +/* Flag failures for records that could be merged. */ +STATIC void +xchk_allocbt_mergeable( + struct xchk_btree *bs, + struct xchk_alloc *ca, + const struct xfs_alloc_rec_incore *irec) +{ + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return; + + if (ca->prev.ar_blockcount > 0 && + ca->prev.ar_startblock + ca->prev.ar_blockcount == irec->ar_startblock && + ca->prev.ar_blockcount + irec->ar_blockcount < (uint32_t)~0U) + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + + memcpy(&ca->prev, irec, sizeof(*irec)); } /* Scrub a bnobt/cntbt record. */ STATIC int xchk_allocbt_rec( - struct xchk_btree *bs, - const union xfs_btree_rec *rec) + struct xchk_btree *bs, + const union xfs_btree_rec *rec) { - struct xfs_perag *pag = bs->cur->bc_ag.pag; - xfs_agblock_t bno; - xfs_extlen_t len; + struct xfs_alloc_rec_incore irec; + struct xchk_alloc *ca = bs->private; - bno = be32_to_cpu(rec->alloc.ar_startblock); - len = be32_to_cpu(rec->alloc.ar_blockcount); - - if (!xfs_verify_agbext(pag, bno, len)) + xfs_alloc_btrec_to_irec(rec, &irec); + if (xfs_alloc_check_irec(bs->cur, &irec) != NULL) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return 0; + } - xchk_allocbt_xref(bs->sc, bno, len); + xchk_allocbt_mergeable(bs, ca, &irec); + xchk_allocbt_xref(bs->sc, &irec); return 0; } @@ -114,10 +144,11 @@ xchk_allocbt( struct xfs_scrub *sc, xfs_btnum_t which) { + struct xchk_alloc ca = { }; struct xfs_btree_cur *cur; cur = which == XFS_BTNUM_BNO ? 
sc->sa.bno_cur : sc->sa.cnt_cur; - return xchk_btree(sc, cur, xchk_allocbt_rec, &XFS_RMAP_OINFO_AG, NULL); + return xchk_btree(sc, cur, xchk_allocbt_rec, &XFS_RMAP_OINFO_AG, &ca); } int @@ -141,15 +172,15 @@ xchk_xref_is_used_space( xfs_agblock_t agbno, xfs_extlen_t len) { - bool is_freesp; + enum xbtree_recpacking outcome; int error; if (!sc->sa.bno_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_alloc_has_record(sc->sa.bno_cur, agbno, len, &is_freesp); + error = xfs_alloc_has_records(sc->sa.bno_cur, agbno, len, &outcome); if (!xchk_should_check_xref(sc, &error, &sc->sa.bno_cur)) return; - if (is_freesp) + if (outcome != XBTREE_RECPACKING_EMPTY) xchk_btree_xref_set_corrupt(sc, sc->sa.bno_cur, 0); } diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 31529b9bf389..6c16d9530cca 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -15,11 +15,51 @@ #include "xfs_da_btree.h" #include "xfs_attr.h" #include "xfs_attr_leaf.h" +#include "xfs_attr_sf.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/dabtree.h" #include "scrub/attr.h" +/* Free the buffers linked from the xattr buffer. */ +static void +xchk_xattr_buf_cleanup( + void *priv) +{ + struct xchk_xattr_buf *ab = priv; + + kvfree(ab->freemap); + ab->freemap = NULL; + kvfree(ab->usedmap); + ab->usedmap = NULL; + kvfree(ab->value); + ab->value = NULL; + ab->value_sz = 0; +} + +/* + * Allocate the free space bitmap if we're trying harder; there are leaf blocks + * in the attr fork; or we can't tell if there are leaf blocks. + */ +static inline bool +xchk_xattr_want_freemap( + struct xfs_scrub *sc) +{ + struct xfs_ifork *ifp; + + if (sc->flags & XCHK_TRY_HARDER) + return true; + + if (!sc->ip) + return true; + + ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK); + if (!ifp) + return false; + + return xfs_ifork_has_extents(ifp); +} + /* * Allocate enough memory to hold an attr value and attr block bitmaps, * reallocating the buffer if necessary. Buffer contents are not preserved @@ -28,41 +68,49 @@ static int xchk_setup_xattr_buf( struct xfs_scrub *sc, - size_t value_size, - gfp_t flags) + size_t value_size) { - size_t sz; + size_t bmp_sz; struct xchk_xattr_buf *ab = sc->buf; + void *new_val; - /* - * We need enough space to read an xattr value from the file or enough - * space to hold three copies of the xattr free space bitmap. We don't - * need the buffer space for both purposes at the same time. - */ - sz = 3 * sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); - sz = max_t(size_t, sz, value_size); + bmp_sz = sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); - /* - * If there's already a buffer, figure out if we need to reallocate it - * to accommodate a larger size. - */ - if (ab) { - if (sz <= ab->sz) - return 0; - kvfree(ab); - sc->buf = NULL; - } + if (ab) + goto resize_value; - /* - * Don't zero the buffer upon allocation to avoid runtime overhead. - * All users must be careful never to read uninitialized contents. 
- */ - ab = kvmalloc(sizeof(*ab) + sz, flags); + ab = kvzalloc(sizeof(struct xchk_xattr_buf), XCHK_GFP_FLAGS); if (!ab) return -ENOMEM; - - ab->sz = sz; sc->buf = ab; + sc->buf_cleanup = xchk_xattr_buf_cleanup; + + ab->usedmap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS); + if (!ab->usedmap) + return -ENOMEM; + + if (xchk_xattr_want_freemap(sc)) { + ab->freemap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS); + if (!ab->freemap) + return -ENOMEM; + } + +resize_value: + if (ab->value_sz >= value_size) + return 0; + + if (ab->value) { + kvfree(ab->value); + ab->value = NULL; + ab->value_sz = 0; + } + + new_val = kvmalloc(value_size, XCHK_GFP_FLAGS); + if (!new_val) + return -ENOMEM; + + ab->value = new_val; + ab->value_sz = value_size; return 0; } @@ -79,8 +127,7 @@ xchk_setup_xattr( * without the inode lock held, which means we can sleep. */ if (sc->flags & XCHK_TRY_HARDER) { - error = xchk_setup_xattr_buf(sc, XATTR_SIZE_MAX, - XCHK_GFP_FLAGS); + error = xchk_setup_xattr_buf(sc, XATTR_SIZE_MAX); if (error) return error; } @@ -111,11 +158,24 @@ xchk_xattr_listent( int namelen, int valuelen) { + struct xfs_da_args args = { + .op_flags = XFS_DA_OP_NOTIME, + .attr_filter = flags & XFS_ATTR_NSP_ONDISK_MASK, + .geo = context->dp->i_mount->m_attr_geo, + .whichfork = XFS_ATTR_FORK, + .dp = context->dp, + .name = name, + .namelen = namelen, + .hashval = xfs_da_hashname(name, namelen), + .trans = context->tp, + .valuelen = valuelen, + }; + struct xchk_xattr_buf *ab; struct xchk_xattr *sx; - struct xfs_da_args args = { NULL }; int error = 0; sx = container_of(context, struct xchk_xattr, context); + ab = sx->sc->buf; if (xchk_should_terminate(sx->sc, &error)) { context->seen_enough = error; @@ -128,18 +188,32 @@ xchk_xattr_listent( return; } + /* Only one namespace bit allowed. */ + if (hweight32(flags & XFS_ATTR_NSP_ONDISK_MASK) > 1) { + xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno); + goto fail_xref; + } + /* Does this name make sense? */ if (!xfs_attr_namecheck(name, namelen)) { xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno); - return; + goto fail_xref; } /* + * Local xattr values are stored in the attr leaf block, so we don't + * need to retrieve the value from a remote block to detect corruption + * problems. + */ + if (flags & XFS_ATTR_LOCAL) + goto fail_xref; + + /* * Try to allocate enough memory to extrat the attr value. If that * doesn't work, we overload the seen_enough variable to convey * the error message back to the main scrub function. 
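
The setup path now keeps the block bitmaps for the life of the scrub and only ever grows the value buffer: if the cached buffer is already large enough it is reused, otherwise it is freed and replaced by a bigger one, with the contents not preserved. A sketch of that grow-only buffer idiom in plain C, with malloc/free standing in for kvmalloc/kvfree.

	#include <stdio.h>
	#include <stdlib.h>

	struct value_buf {
		void	*value;
		size_t	value_sz;
	};

	/* Make sure buf->value can hold at least want bytes; never shrink. */
	static int value_buf_grow(struct value_buf *buf, size_t want)
	{
		void *new_val;

		if (buf->value_sz >= want)
			return 0;

		free(buf->value);
		buf->value = NULL;
		buf->value_sz = 0;

		new_val = malloc(want);
		if (!new_val)
			return -1;	/* stand-in for -ENOMEM */

		buf->value = new_val;
		buf->value_sz = want;
		return 0;
	}

	int main(void)
	{
		struct value_buf buf = { 0 };

		value_buf_grow(&buf, 64);
		value_buf_grow(&buf, 32);	/* reuses the 64-byte buffer */
		value_buf_grow(&buf, 4096);	/* replaced with a bigger one */
		printf("final capacity: %zu\n", buf.value_sz);
		free(buf.value);
		return 0;
	}
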
*/ - error = xchk_setup_xattr_buf(sx->sc, valuelen, XCHK_GFP_FLAGS); + error = xchk_setup_xattr_buf(sx->sc, valuelen); if (error == -ENOMEM) error = -EDEADLOCK; if (error) { @@ -147,17 +221,7 @@ xchk_xattr_listent( return; } - args.op_flags = XFS_DA_OP_NOTIME; - args.attr_filter = flags & XFS_ATTR_NSP_ONDISK_MASK; - args.geo = context->dp->i_mount->m_attr_geo; - args.whichfork = XFS_ATTR_FORK; - args.dp = context->dp; - args.name = name; - args.namelen = namelen; - args.hashval = xfs_da_hashname(args.name, args.namelen); - args.trans = context->tp; - args.value = xchk_xattr_valuebuf(sx->sc); - args.valuelen = valuelen; + args.value = ab->value; error = xfs_attr_get_ilocked(&args); /* ENODATA means the hash lookup failed and the attr is bad */ @@ -213,25 +277,23 @@ xchk_xattr_set_map( STATIC bool xchk_xattr_check_freemap( struct xfs_scrub *sc, - unsigned long *map, struct xfs_attr3_icleaf_hdr *leafhdr) { - unsigned long *freemap = xchk_xattr_freemap(sc); - unsigned long *dstmap = xchk_xattr_dstmap(sc); + struct xchk_xattr_buf *ab = sc->buf; unsigned int mapsize = sc->mp->m_attr_geo->blksize; int i; /* Construct bitmap of freemap contents. */ - bitmap_zero(freemap, mapsize); + bitmap_zero(ab->freemap, mapsize); for (i = 0; i < XFS_ATTR_LEAF_MAPSIZE; i++) { - if (!xchk_xattr_set_map(sc, freemap, + if (!xchk_xattr_set_map(sc, ab->freemap, leafhdr->freemap[i].base, leafhdr->freemap[i].size)) return false; } /* Look for bits that are set in freemap and are marked in use. */ - return bitmap_and(dstmap, freemap, map, mapsize) == 0; + return !bitmap_intersects(ab->freemap, ab->usedmap, mapsize); } /* @@ -251,7 +313,7 @@ xchk_xattr_entry( __u32 *last_hashval) { struct xfs_mount *mp = ds->state->mp; - unsigned long *usedmap = xchk_xattr_usedmap(ds->sc); + struct xchk_xattr_buf *ab = ds->sc->buf; char *name_end; struct xfs_attr_leaf_name_local *lentry; struct xfs_attr_leaf_name_remote *rentry; @@ -291,7 +353,7 @@ xchk_xattr_entry( if (name_end > buf_end) xchk_da_set_corrupt(ds, level); - if (!xchk_xattr_set_map(ds->sc, usedmap, nameidx, namesize)) + if (!xchk_xattr_set_map(ds->sc, ab->usedmap, nameidx, namesize)) xchk_da_set_corrupt(ds, level); if (!(ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) *usedbytes += namesize; @@ -311,35 +373,26 @@ xchk_xattr_block( struct xfs_attr_leafblock *leaf = bp->b_addr; struct xfs_attr_leaf_entry *ent; struct xfs_attr_leaf_entry *entries; - unsigned long *usedmap; + struct xchk_xattr_buf *ab = ds->sc->buf; char *buf_end; size_t off; __u32 last_hashval = 0; unsigned int usedbytes = 0; unsigned int hdrsize; int i; - int error; if (*last_checked == blk->blkno) return 0; - /* Allocate memory for block usage checking. */ - error = xchk_setup_xattr_buf(ds->sc, 0, XCHK_GFP_FLAGS); - if (error == -ENOMEM) - return -EDEADLOCK; - if (error) - return error; - usedmap = xchk_xattr_usedmap(ds->sc); - *last_checked = blk->blkno; - bitmap_zero(usedmap, mp->m_attr_geo->blksize); + bitmap_zero(ab->usedmap, mp->m_attr_geo->blksize); /* Check all the padding. 
*/ if (xfs_has_crc(ds->sc->mp)) { - struct xfs_attr3_leafblock *leaf = bp->b_addr; + struct xfs_attr3_leafblock *leaf3 = bp->b_addr; - if (leaf->hdr.pad1 != 0 || leaf->hdr.pad2 != 0 || - leaf->hdr.info.hdr.pad != 0) + if (leaf3->hdr.pad1 != 0 || leaf3->hdr.pad2 != 0 || + leaf3->hdr.info.hdr.pad != 0) xchk_da_set_corrupt(ds, level); } else { if (leaf->hdr.pad1 != 0 || leaf->hdr.info.pad != 0) @@ -356,7 +409,7 @@ xchk_xattr_block( xchk_da_set_corrupt(ds, level); if (leafhdr.firstused < hdrsize) xchk_da_set_corrupt(ds, level); - if (!xchk_xattr_set_map(ds->sc, usedmap, 0, hdrsize)) + if (!xchk_xattr_set_map(ds->sc, ab->usedmap, 0, hdrsize)) xchk_da_set_corrupt(ds, level); if (ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) @@ -370,7 +423,7 @@ xchk_xattr_block( for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) { /* Mark the leaf entry itself. */ off = (char *)ent - (char *)leaf; - if (!xchk_xattr_set_map(ds->sc, usedmap, off, + if (!xchk_xattr_set_map(ds->sc, ab->usedmap, off, sizeof(xfs_attr_leaf_entry_t))) { xchk_da_set_corrupt(ds, level); goto out; @@ -384,7 +437,7 @@ xchk_xattr_block( goto out; } - if (!xchk_xattr_check_freemap(ds->sc, usedmap, &leafhdr)) + if (!xchk_xattr_check_freemap(ds->sc, &leafhdr)) xchk_da_set_corrupt(ds, level); if (leafhdr.usedbytes != usedbytes) @@ -468,38 +521,115 @@ out: return error; } +/* Check space usage of shortform attrs. */ +STATIC int +xchk_xattr_check_sf( + struct xfs_scrub *sc) +{ + struct xchk_xattr_buf *ab = sc->buf; + struct xfs_attr_shortform *sf; + struct xfs_attr_sf_entry *sfe; + struct xfs_attr_sf_entry *next; + struct xfs_ifork *ifp; + unsigned char *end; + int i; + int error = 0; + + ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK); + + bitmap_zero(ab->usedmap, ifp->if_bytes); + sf = (struct xfs_attr_shortform *)sc->ip->i_af.if_u1.if_data; + end = (unsigned char *)ifp->if_u1.if_data + ifp->if_bytes; + xchk_xattr_set_map(sc, ab->usedmap, 0, sizeof(sf->hdr)); + + sfe = &sf->list[0]; + if ((unsigned char *)sfe > end) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + return 0; + } + + for (i = 0; i < sf->hdr.count; i++) { + unsigned char *name = sfe->nameval; + unsigned char *value = &sfe->nameval[sfe->namelen]; + + if (xchk_should_terminate(sc, &error)) + return error; + + next = xfs_attr_sf_nextentry(sfe); + if ((unsigned char *)next > end) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + break; + } + + if (!xchk_xattr_set_map(sc, ab->usedmap, + (char *)sfe - (char *)sf, + sizeof(struct xfs_attr_sf_entry))) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + break; + } + + if (!xchk_xattr_set_map(sc, ab->usedmap, + (char *)name - (char *)sf, + sfe->namelen)) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + break; + } + + if (!xchk_xattr_set_map(sc, ab->usedmap, + (char *)value - (char *)sf, + sfe->valuelen)) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0); + break; + } + + sfe = next; + } + + return 0; +} + /* Scrub the extended attribute metadata. */ int xchk_xattr( struct xfs_scrub *sc) { - struct xchk_xattr sx; + struct xchk_xattr sx = { + .sc = sc, + .context = { + .dp = sc->ip, + .tp = sc->tp, + .resynch = 1, + .put_listent = xchk_xattr_listent, + .allow_incomplete = true, + }, + }; xfs_dablk_t last_checked = -1U; int error = 0; if (!xfs_inode_hasattr(sc->ip)) return -ENOENT; - memset(&sx, 0, sizeof(sx)); - /* Check attribute tree structure */ - error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec, - &last_checked); + /* Allocate memory for xattr checking. 
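 * (Passing a value_size of 0 here allocates just the block bitmaps; the
 * value buffer is grown on demand in xchk_xattr_listent() as each
 * attribute is examined.)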
*/ + error = xchk_setup_xattr_buf(sc, 0); + if (error == -ENOMEM) + return -EDEADLOCK; if (error) - goto out; + return error; - if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - goto out; + /* Check the physical structure of the xattr. */ + if (sc->ip->i_af.if_format == XFS_DINODE_FMT_LOCAL) + error = xchk_xattr_check_sf(sc); + else + error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec, + &last_checked); + if (error) + return error; - /* Check that every attr key can also be looked up by hash. */ - sx.context.dp = sc->ip; - sx.context.resynch = 1; - sx.context.put_listent = xchk_xattr_listent; - sx.context.tp = sc->tp; - sx.context.allow_incomplete = true; - sx.sc = sc; + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return 0; /* - * Look up every xattr in this file by name. + * Look up every xattr in this file by name and hash. * * Use the backend implementation of xfs_attr_list to call * xchk_xattr_listent on every attribute key in this inode. @@ -516,11 +646,11 @@ xchk_xattr( */ error = xfs_attr_list_ilocked(&sx.context); if (!xchk_fblock_process_error(sc, XFS_ATTR_FORK, 0, &error)) - goto out; + return error; /* Did our listent function try to return any errors? */ if (sx.context.seen_enough < 0) - error = sx.context.seen_enough; -out: - return error; + return sx.context.seen_enough; + + return 0; } diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index 3590e10e3e62..48fd9402c432 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -1,7 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0-or-later */ /* - * Copyright (C) 2019 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2019-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_ATTR_H__ #define __XFS_SCRUB_ATTR_H__ @@ -10,59 +10,15 @@ * Temporary storage for online scrub and repair of extended attributes. */ struct xchk_xattr_buf { - /* Size of @buf, in bytes. */ - size_t sz; + /* Bitmap of used space in xattr leaf blocks and shortform forks. */ + unsigned long *usedmap; - /* - * Memory buffer -- either used for extracting attr values while - * walking the attributes; or for computing attr block bitmaps when - * checking the attribute tree. - * - * Each bitmap contains enough bits to track every byte in an attr - * block (rounded up to the size of an unsigned long). The attr block - * used space bitmap starts at the beginning of the buffer; the free - * space bitmap follows immediately after; and we have a third buffer - * for storing intermediate bitmap results. - */ - uint8_t buf[]; -}; - -/* A place to store attribute values. */ -static inline uint8_t * -xchk_xattr_valuebuf( - struct xfs_scrub *sc) -{ - struct xchk_xattr_buf *ab = sc->buf; - - return ab->buf; -} - -/* A bitmap of space usage computed by walking an attr leaf block. */ -static inline unsigned long * -xchk_xattr_usedmap( - struct xfs_scrub *sc) -{ - struct xchk_xattr_buf *ab = sc->buf; + /* Bitmap of free space in xattr leaf blocks. */ + unsigned long *freemap; - return (unsigned long *)ab->buf; -} - -/* A bitmap of free space computed by walking attr leaf block free info. */ -static inline unsigned long * -xchk_xattr_freemap( - struct xfs_scrub *sc) -{ - return xchk_xattr_usedmap(sc) + - BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); -} - -/* A bitmap used to hold temporary results. 
*/ -static inline unsigned long * -xchk_xattr_dstmap( - struct xfs_scrub *sc) -{ - return xchk_xattr_freemap(sc) + - BITS_TO_LONGS(sc->mp->m_attr_geo->blksize); -} + /* Memory buffer used to extract xattr values. */ + void *value; + size_t value_sz; +}; #endif /* __XFS_SCRUB_ATTR_H__ */ diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c index a255f09e9f0a..0c959be396ea 100644 --- a/fs/xfs/scrub/bitmap.c +++ b/fs/xfs/scrub/bitmap.c @@ -1,11 +1,12 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" #include "xfs_shared.h" +#include "xfs_bit.h" #include "xfs_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" @@ -13,27 +14,160 @@ #include "scrub/scrub.h" #include "scrub/bitmap.h" +#include <linux/interval_tree_generic.h> + +struct xbitmap_node { + struct rb_node bn_rbnode; + + /* First set bit of this interval and subtree. */ + uint64_t bn_start; + + /* Last set bit of this interval. */ + uint64_t bn_last; + + /* Last set bit of this subtree. Do not touch this. */ + uint64_t __bn_subtree_last; +}; + +/* Define our own interval tree type with uint64_t parameters. */ + +#define START(node) ((node)->bn_start) +#define LAST(node) ((node)->bn_last) + /* - * Set a range of this bitmap. Caller must ensure the range is not set. - * - * This is the logical equivalent of bitmap |= mask(start, len). + * These functions are defined by the INTERVAL_TREE_DEFINE macro, but we'll + * forward-declare them anyway for clarity. */ +static inline void +xbitmap_tree_insert(struct xbitmap_node *node, struct rb_root_cached *root); + +static inline void +xbitmap_tree_remove(struct xbitmap_node *node, struct rb_root_cached *root); + +static inline struct xbitmap_node * +xbitmap_tree_iter_first(struct rb_root_cached *root, uint64_t start, + uint64_t last); + +static inline struct xbitmap_node * +xbitmap_tree_iter_next(struct xbitmap_node *node, uint64_t start, + uint64_t last); + +INTERVAL_TREE_DEFINE(struct xbitmap_node, bn_rbnode, uint64_t, + __bn_subtree_last, START, LAST, static inline, xbitmap_tree) + +/* Iterate each interval of a bitmap. Do not change the bitmap. */ +#define for_each_xbitmap_extent(bn, bitmap) \ + for ((bn) = rb_entry_safe(rb_first(&(bitmap)->xb_root.rb_root), \ + struct xbitmap_node, bn_rbnode); \ + (bn) != NULL; \ + (bn) = rb_entry_safe(rb_next(&(bn)->bn_rbnode), \ + struct xbitmap_node, bn_rbnode)) + +/* Clear a range of this bitmap. 
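 *
 * This is the logical equivalent of bitmap &= ~mask(start, len).  To sketch
 * the cases handled below: clearing [5, 10] from a tree holding the single
 * extent [0, 20] trims that node to [0, 4] and inserts a new node for
 * [11, 20]; clearing [15, 30] from [0, 20] merely trims the node to [0, 14];
 * and clearing a range that covers a node completely deletes that node.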
*/ +int +xbitmap_clear( + struct xbitmap *bitmap, + uint64_t start, + uint64_t len) +{ + struct xbitmap_node *bn; + struct xbitmap_node *new_bn; + uint64_t last = start + len - 1; + + while ((bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last))) { + if (bn->bn_start < start && bn->bn_last > last) { + uint64_t old_last = bn->bn_last; + + /* overlaps with the entire clearing range */ + xbitmap_tree_remove(bn, &bitmap->xb_root); + bn->bn_last = start - 1; + xbitmap_tree_insert(bn, &bitmap->xb_root); + + /* add an extent */ + new_bn = kmalloc(sizeof(struct xbitmap_node), + XCHK_GFP_FLAGS); + if (!new_bn) + return -ENOMEM; + new_bn->bn_start = last + 1; + new_bn->bn_last = old_last; + xbitmap_tree_insert(new_bn, &bitmap->xb_root); + } else if (bn->bn_start < start) { + /* overlaps with the left side of the clearing range */ + xbitmap_tree_remove(bn, &bitmap->xb_root); + bn->bn_last = start - 1; + xbitmap_tree_insert(bn, &bitmap->xb_root); + } else if (bn->bn_last > last) { + /* overlaps with the right side of the clearing range */ + xbitmap_tree_remove(bn, &bitmap->xb_root); + bn->bn_start = last + 1; + xbitmap_tree_insert(bn, &bitmap->xb_root); + break; + } else { + /* in the middle of the clearing range */ + xbitmap_tree_remove(bn, &bitmap->xb_root); + kfree(bn); + } + } + + return 0; +} + +/* Set a range of this bitmap. */ int xbitmap_set( struct xbitmap *bitmap, uint64_t start, uint64_t len) { - struct xbitmap_range *bmr; + struct xbitmap_node *left; + struct xbitmap_node *right; + uint64_t last = start + len - 1; + int error; - bmr = kmalloc(sizeof(struct xbitmap_range), XCHK_GFP_FLAGS); - if (!bmr) - return -ENOMEM; + /* Is this whole range already set? */ + left = xbitmap_tree_iter_first(&bitmap->xb_root, start, last); + if (left && left->bn_start <= start && left->bn_last >= last) + return 0; + + /* Clear out everything in the range we want to set. */ + error = xbitmap_clear(bitmap, start, len); + if (error) + return error; + + /* Do we have a left-adjacent extent? */ + left = xbitmap_tree_iter_first(&bitmap->xb_root, start - 1, start - 1); + ASSERT(!left || left->bn_last + 1 == start); + + /* Do we have a right-adjacent extent? 
*/ + right = xbitmap_tree_iter_first(&bitmap->xb_root, last + 1, last + 1); + ASSERT(!right || right->bn_start == last + 1); - INIT_LIST_HEAD(&bmr->list); - bmr->start = start; - bmr->len = len; - list_add_tail(&bmr->list, &bitmap->list); + if (left && right) { + /* combine left and right adjacent extent */ + xbitmap_tree_remove(left, &bitmap->xb_root); + xbitmap_tree_remove(right, &bitmap->xb_root); + left->bn_last = right->bn_last; + xbitmap_tree_insert(left, &bitmap->xb_root); + kfree(right); + } else if (left) { + /* combine with left extent */ + xbitmap_tree_remove(left, &bitmap->xb_root); + left->bn_last = last; + xbitmap_tree_insert(left, &bitmap->xb_root); + } else if (right) { + /* combine with right extent */ + xbitmap_tree_remove(right, &bitmap->xb_root); + right->bn_start = start; + xbitmap_tree_insert(right, &bitmap->xb_root); + } else { + /* add an extent */ + left = kmalloc(sizeof(struct xbitmap_node), XCHK_GFP_FLAGS); + if (!left) + return -ENOMEM; + left->bn_start = start; + left->bn_last = last; + xbitmap_tree_insert(left, &bitmap->xb_root); + } return 0; } @@ -43,12 +177,11 @@ void xbitmap_destroy( struct xbitmap *bitmap) { - struct xbitmap_range *bmr; - struct xbitmap_range *n; + struct xbitmap_node *bn; - for_each_xbitmap_extent(bmr, n, bitmap) { - list_del(&bmr->list); - kfree(bmr); + while ((bn = xbitmap_tree_iter_first(&bitmap->xb_root, 0, -1ULL))) { + xbitmap_tree_remove(bn, &bitmap->xb_root); + kfree(bn); } } @@ -57,27 +190,7 @@ void xbitmap_init( struct xbitmap *bitmap) { - INIT_LIST_HEAD(&bitmap->list); -} - -/* Compare two btree extents. */ -static int -xbitmap_range_cmp( - void *priv, - const struct list_head *a, - const struct list_head *b) -{ - struct xbitmap_range *ap; - struct xbitmap_range *bp; - - ap = container_of(a, struct xbitmap_range, list); - bp = container_of(b, struct xbitmap_range, list); - - if (ap->start > bp->start) - return 1; - if (ap->start < bp->start) - return -1; - return 0; + bitmap->xb_root = RB_ROOT_CACHED; } /* @@ -94,118 +207,26 @@ xbitmap_range_cmp( * * This is the logical equivalent of bitmap &= ~sub. */ -#define LEFT_ALIGNED (1 << 0) -#define RIGHT_ALIGNED (1 << 1) int xbitmap_disunion( struct xbitmap *bitmap, struct xbitmap *sub) { - struct list_head *lp; - struct xbitmap_range *br; - struct xbitmap_range *new_br; - struct xbitmap_range *sub_br; - uint64_t sub_start; - uint64_t sub_len; - int state; - int error = 0; + struct xbitmap_node *bn; + int error; - if (list_empty(&bitmap->list) || list_empty(&sub->list)) + if (xbitmap_empty(bitmap) || xbitmap_empty(sub)) return 0; - ASSERT(!list_empty(&sub->list)); - - list_sort(NULL, &bitmap->list, xbitmap_range_cmp); - list_sort(NULL, &sub->list, xbitmap_range_cmp); - - /* - * Now that we've sorted both lists, we iterate bitmap once, rolling - * forward through sub and/or bitmap as necessary until we find an - * overlap or reach the end of either list. We do not reset lp to the - * head of bitmap nor do we reset sub_br to the head of sub. The - * list traversal is similar to merge sort, but we're deleting - * instead. In this manner we avoid O(n^2) operations. - */ - sub_br = list_first_entry(&sub->list, struct xbitmap_range, - list); - lp = bitmap->list.next; - while (lp != &bitmap->list) { - br = list_entry(lp, struct xbitmap_range, list); - - /* - * Advance sub_br and/or br until we find a pair that - * intersect or we run out of extents. 
- */ - while (sub_br->start + sub_br->len <= br->start) { - if (list_is_last(&sub_br->list, &sub->list)) - goto out; - sub_br = list_next_entry(sub_br, list); - } - if (sub_br->start >= br->start + br->len) { - lp = lp->next; - continue; - } - /* trim sub_br to fit the extent we have */ - sub_start = sub_br->start; - sub_len = sub_br->len; - if (sub_br->start < br->start) { - sub_len -= br->start - sub_br->start; - sub_start = br->start; - } - if (sub_len > br->len) - sub_len = br->len; - - state = 0; - if (sub_start == br->start) - state |= LEFT_ALIGNED; - if (sub_start + sub_len == br->start + br->len) - state |= RIGHT_ALIGNED; - switch (state) { - case LEFT_ALIGNED: - /* Coincides with only the left. */ - br->start += sub_len; - br->len -= sub_len; - break; - case RIGHT_ALIGNED: - /* Coincides with only the right. */ - br->len -= sub_len; - lp = lp->next; - break; - case LEFT_ALIGNED | RIGHT_ALIGNED: - /* Total overlap, just delete ex. */ - lp = lp->next; - list_del(&br->list); - kfree(br); - break; - case 0: - /* - * Deleting from the middle: add the new right extent - * and then shrink the left extent. - */ - new_br = kmalloc(sizeof(struct xbitmap_range), - XCHK_GFP_FLAGS); - if (!new_br) { - error = -ENOMEM; - goto out; - } - INIT_LIST_HEAD(&new_br->list); - new_br->start = sub_start + sub_len; - new_br->len = br->start + br->len - new_br->start; - list_add(&new_br->list, &br->list); - br->len = sub_start - br->start; - lp = lp->next; - break; - default: - ASSERT(0); - break; - } + for_each_xbitmap_extent(bn, sub) { + error = xbitmap_clear(bitmap, bn->bn_start, + bn->bn_last - bn->bn_start + 1); + if (error) + return error; } -out: - return error; + return 0; } -#undef LEFT_ALIGNED -#undef RIGHT_ALIGNED /* * Record all btree blocks seen while iterating all records of a btree. @@ -242,6 +263,38 @@ out: * For the 300th record we just exit, with the list being [1, 4, 2, 3]. */ +/* Mark a btree block to the agblock bitmap. */ +STATIC int +xagb_bitmap_visit_btblock( + struct xfs_btree_cur *cur, + int level, + void *priv) +{ + struct xagb_bitmap *bitmap = priv; + struct xfs_buf *bp; + xfs_fsblock_t fsbno; + xfs_agblock_t agbno; + + xfs_btree_get_block(cur, level, &bp); + if (!bp) + return 0; + + fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)); + agbno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno); + + return xagb_bitmap_set(bitmap, agbno, 1); +} + +/* Mark all (per-AG) btree blocks in the agblock bitmap. */ +int +xagb_bitmap_set_btblocks( + struct xagb_bitmap *bitmap, + struct xfs_btree_cur *cur) +{ + return xfs_btree_visit_blocks(cur, xagb_bitmap_visit_btblock, + XFS_BTREE_VISIT_ALL, bitmap); +} + /* * Record all the buffers pointed to by the btree cursor. Callers already * engaged in a btree walk should call this function to capture the list of @@ -304,12 +357,97 @@ uint64_t xbitmap_hweight( struct xbitmap *bitmap) { - struct xbitmap_range *bmr; - struct xbitmap_range *n; + struct xbitmap_node *bn; uint64_t ret = 0; - for_each_xbitmap_extent(bmr, n, bitmap) - ret += bmr->len; + for_each_xbitmap_extent(bn, bitmap) + ret += bn->bn_last - bn->bn_start + 1; return ret; } + +/* Call a function for every run of set bits in this bitmap. 
*/ +int +xbitmap_walk( + struct xbitmap *bitmap, + xbitmap_walk_fn fn, + void *priv) +{ + struct xbitmap_node *bn; + int error = 0; + + for_each_xbitmap_extent(bn, bitmap) { + error = fn(bn->bn_start, bn->bn_last - bn->bn_start + 1, priv); + if (error) + break; + } + + return error; +} + +struct xbitmap_walk_bits { + xbitmap_walk_bits_fn fn; + void *priv; +}; + +/* Walk all the bits in a run. */ +static int +xbitmap_walk_bits_in_run( + uint64_t start, + uint64_t len, + void *priv) +{ + struct xbitmap_walk_bits *wb = priv; + uint64_t i; + int error = 0; + + for (i = start; i < start + len; i++) { + error = wb->fn(i, wb->priv); + if (error) + break; + } + + return error; +} + +/* Call a function for every set bit in this bitmap. */ +int +xbitmap_walk_bits( + struct xbitmap *bitmap, + xbitmap_walk_bits_fn fn, + void *priv) +{ + struct xbitmap_walk_bits wb = {.fn = fn, .priv = priv}; + + return xbitmap_walk(bitmap, xbitmap_walk_bits_in_run, &wb); +} + +/* Does this bitmap have no bits set at all? */ +bool +xbitmap_empty( + struct xbitmap *bitmap) +{ + return bitmap->xb_root.rb_root.rb_node == NULL; +} + +/* Is the start of the range set or clear? And for how long? */ +bool +xbitmap_test( + struct xbitmap *bitmap, + uint64_t start, + uint64_t *len) +{ + struct xbitmap_node *bn; + uint64_t last = start + *len - 1; + + bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last); + if (!bn) + return false; + if (bn->bn_start <= start) { + if (bn->bn_last < last) + *len = bn->bn_last - start + 1; + return true; + } + *len = bn->bn_start - start; + return false; +} diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h index 900646b72de1..84981724ecaf 100644 --- a/fs/xfs/scrub/bitmap.h +++ b/fs/xfs/scrub/bitmap.h @@ -1,31 +1,19 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_BITMAP_H__ #define __XFS_SCRUB_BITMAP_H__ -struct xbitmap_range { - struct list_head list; - uint64_t start; - uint64_t len; -}; - struct xbitmap { - struct list_head list; + struct rb_root_cached xb_root; }; void xbitmap_init(struct xbitmap *bitmap); void xbitmap_destroy(struct xbitmap *bitmap); -#define for_each_xbitmap_extent(bex, n, bitmap) \ - list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) - -#define for_each_xbitmap_block(b, bex, n, bitmap) \ - list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) \ - for ((b) = (bex)->start; (b) < (bex)->start + (bex)->len; (b)++) - +int xbitmap_clear(struct xbitmap *bitmap, uint64_t start, uint64_t len); int xbitmap_set(struct xbitmap *bitmap, uint64_t start, uint64_t len); int xbitmap_disunion(struct xbitmap *bitmap, struct xbitmap *sub); int xbitmap_set_btcur_path(struct xbitmap *bitmap, @@ -34,4 +22,93 @@ int xbitmap_set_btblocks(struct xbitmap *bitmap, struct xfs_btree_cur *cur); uint64_t xbitmap_hweight(struct xbitmap *bitmap); +/* + * Return codes for the bitmap iterator functions are 0 to continue iterating, + * and non-zero to stop iterating. Any non-zero value will be passed up to the + * iteration caller. The special value -ECANCELED can be used to stop + * iteration, because neither bitmap iterator ever generates that error code on + * its own. Callers must not modify the bitmap while walking it. 
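 *
 * A minimal usage sketch (illustrative only; the helper name is made up and
 * not part of this interface):
 *
 *	static int xexample_count_blocks(uint64_t start, uint64_t len,
 *			void *priv)
 *	{
 *		uint64_t *nr_blocks = priv;
 *
 *		*nr_blocks += len;
 *		return 0;
 *	}
 *
 *	error = xbitmap_walk(bitmap, xexample_count_blocks, &nr_blocks);
 *
 * xbitmap_walk_bits() is similar, except that its xbitmap_walk_bits_fn
 * callback is invoked once for every set bit rather than once per run.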
+ */ +typedef int (*xbitmap_walk_fn)(uint64_t start, uint64_t len, void *priv); +int xbitmap_walk(struct xbitmap *bitmap, xbitmap_walk_fn fn, + void *priv); + +typedef int (*xbitmap_walk_bits_fn)(uint64_t bit, void *priv); +int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn, + void *priv); + +bool xbitmap_empty(struct xbitmap *bitmap); +bool xbitmap_test(struct xbitmap *bitmap, uint64_t start, uint64_t *len); + +/* Bitmaps, but for type-checked for xfs_agblock_t */ + +struct xagb_bitmap { + struct xbitmap agbitmap; +}; + +static inline void xagb_bitmap_init(struct xagb_bitmap *bitmap) +{ + xbitmap_init(&bitmap->agbitmap); +} + +static inline void xagb_bitmap_destroy(struct xagb_bitmap *bitmap) +{ + xbitmap_destroy(&bitmap->agbitmap); +} + +static inline int xagb_bitmap_clear(struct xagb_bitmap *bitmap, + xfs_agblock_t start, xfs_extlen_t len) +{ + return xbitmap_clear(&bitmap->agbitmap, start, len); +} +static inline int xagb_bitmap_set(struct xagb_bitmap *bitmap, + xfs_agblock_t start, xfs_extlen_t len) +{ + return xbitmap_set(&bitmap->agbitmap, start, len); +} + +static inline bool +xagb_bitmap_test( + struct xagb_bitmap *bitmap, + xfs_agblock_t start, + xfs_extlen_t *len) +{ + uint64_t biglen = *len; + bool ret; + + ret = xbitmap_test(&bitmap->agbitmap, start, &biglen); + + if (start + biglen >= UINT_MAX) { + ASSERT(0); + biglen = UINT_MAX - start; + } + + *len = biglen; + return ret; +} + +static inline int xagb_bitmap_disunion(struct xagb_bitmap *bitmap, + struct xagb_bitmap *sub) +{ + return xbitmap_disunion(&bitmap->agbitmap, &sub->agbitmap); +} + +static inline uint32_t xagb_bitmap_hweight(struct xagb_bitmap *bitmap) +{ + return xbitmap_hweight(&bitmap->agbitmap); +} +static inline bool xagb_bitmap_empty(struct xagb_bitmap *bitmap) +{ + return xbitmap_empty(&bitmap->agbitmap); +} + +static inline int xagb_bitmap_walk(struct xagb_bitmap *bitmap, + xbitmap_walk_fn fn, void *priv) +{ + return xbitmap_walk(&bitmap->agbitmap, fn, priv); +} + +int xagb_bitmap_set_btblocks(struct xagb_bitmap *bitmap, + struct xfs_btree_cur *cur); + #endif /* __XFS_SCRUB_BITMAP_H__ */ diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c index dbbc7037074c..87ab9f95a487 100644 --- a/fs/xfs/scrub/bmap.c +++ b/fs/xfs/scrub/bmap.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -31,12 +31,15 @@ xchk_setup_inode_bmap( { int error; - error = xchk_get_inode(sc); + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + + error = xchk_iget_for_scrubbing(sc); if (error) goto out; - sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; - xfs_ilock(sc->ip, sc->ilock_flags); + sc->ilock_flags = XFS_IOLOCK_EXCL; + xfs_ilock(sc->ip, XFS_IOLOCK_EXCL); /* * We don't want any ephemeral data fork updates sitting around @@ -47,6 +50,9 @@ xchk_setup_inode_bmap( sc->sm->sm_type == XFS_SCRUB_TYPE_BMBTD) { struct address_space *mapping = VFS_I(sc->ip)->i_mapping; + sc->ilock_flags |= XFS_MMAPLOCK_EXCL; + xfs_ilock(sc->ip, XFS_MMAPLOCK_EXCL); + inode_dio_wait(VFS_I(sc->ip)); /* @@ -90,11 +96,23 @@ out: struct xchk_bmap_info { struct xfs_scrub *sc; + + /* Incore extent tree cursor */ struct xfs_iext_cursor icur; - xfs_fileoff_t lastoff; + + /* Previous fork mapping that we examined */ + struct xfs_bmbt_irec prev_rec; + + /* Is this a realtime fork? */ bool is_rt; + + /* May mappings point to shared space? */ bool is_shared; + + /* Was the incore extent tree loaded? */ bool was_loaded; + + /* Which inode fork are we checking? */ int whichfork; }; @@ -147,49 +165,7 @@ xchk_bmap_get_rmap( return has_rmap; } -static inline bool -xchk_bmap_has_prev( - struct xchk_bmap_info *info, - struct xfs_bmbt_irec *irec) -{ - struct xfs_bmbt_irec got; - struct xfs_ifork *ifp; - - ifp = xfs_ifork_ptr(info->sc->ip, info->whichfork); - - if (!xfs_iext_peek_prev_extent(ifp, &info->icur, &got)) - return false; - if (got.br_startoff + got.br_blockcount != irec->br_startoff) - return false; - if (got.br_startblock + got.br_blockcount != irec->br_startblock) - return false; - if (got.br_state != irec->br_state) - return false; - return true; -} - -static inline bool -xchk_bmap_has_next( - struct xchk_bmap_info *info, - struct xfs_bmbt_irec *irec) -{ - struct xfs_bmbt_irec got; - struct xfs_ifork *ifp; - - ifp = xfs_ifork_ptr(info->sc->ip, info->whichfork); - - if (!xfs_iext_peek_next_extent(ifp, &info->icur, &got)) - return false; - if (irec->br_startoff + irec->br_blockcount != got.br_startoff) - return false; - if (irec->br_startblock + irec->br_blockcount != got.br_startblock) - return false; - if (got.br_state != irec->br_state) - return false; - return true; -} - -/* Make sure that we have rmapbt records for this extent. */ +/* Make sure that we have rmapbt records for this data/attr fork extent. */ STATIC void xchk_bmap_xref_rmap( struct xchk_bmap_info *info, @@ -198,41 +174,39 @@ xchk_bmap_xref_rmap( { struct xfs_rmap_irec rmap; unsigned long long rmap_end; - uint64_t owner; + uint64_t owner = info->sc->ip->i_ino; if (!info->sc->sa.rmap_cur || xchk_skip_xref(info->sc->sm)) return; - if (info->whichfork == XFS_COW_FORK) - owner = XFS_RMAP_OWN_COW; - else - owner = info->sc->ip->i_ino; - /* Find the rmap record for this irec. */ if (!xchk_bmap_get_rmap(info, irec, agbno, owner, &rmap)) return; - /* Check the rmap. */ + /* + * The rmap must be an exact match for this incore file mapping record, + * which may have arisen from multiple ondisk records. 
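 * (For example, two ondisk bmbt records mapping file blocks [0, 100) and
 * [100, 200) of one physically contiguous allocation are combined by
 * xchk_bmap_iext_iter() into a single incore mapping for [0, 200), and
 * the rmap record must then cover exactly that combined range.)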
+ */ + if (rmap.rm_startblock != agbno) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + rmap_end = (unsigned long long)rmap.rm_startblock + rmap.rm_blockcount; - if (rmap.rm_startblock > agbno || - agbno + irec->br_blockcount > rmap_end) + if (rmap_end != agbno + irec->br_blockcount) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); - /* - * Check the logical offsets if applicable. CoW staging extents - * don't track logical offsets since the mappings only exist in - * memory. - */ - if (info->whichfork != XFS_COW_FORK) { - rmap_end = (unsigned long long)rmap.rm_offset + - rmap.rm_blockcount; - if (rmap.rm_offset > irec->br_startoff || - irec->br_startoff + irec->br_blockcount > rmap_end) - xchk_fblock_xref_set_corrupt(info->sc, - info->whichfork, irec->br_startoff); - } + /* Check the logical offsets. */ + if (rmap.rm_offset != irec->br_startoff) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + rmap_end = (unsigned long long)rmap.rm_offset + rmap.rm_blockcount; + if (rmap_end != irec->br_startoff + irec->br_blockcount) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + + /* Check the owner */ if (rmap.rm_owner != owner) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); @@ -244,8 +218,7 @@ xchk_bmap_xref_rmap( * records because the blocks are owned (on-disk) by the refcountbt, * which doesn't track unwritten state. */ - if (owner != XFS_RMAP_OWN_COW && - !!(irec->br_state == XFS_EXT_UNWRITTEN) != + if (!!(irec->br_state == XFS_EXT_UNWRITTEN) != !!(rmap.rm_flags & XFS_RMAP_UNWRITTEN)) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); @@ -257,34 +230,60 @@ xchk_bmap_xref_rmap( if (rmap.rm_flags & XFS_RMAP_BMBT_BLOCK) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); +} + +/* Make sure that we have rmapbt records for this COW fork extent. */ +STATIC void +xchk_bmap_xref_rmap_cow( + struct xchk_bmap_info *info, + struct xfs_bmbt_irec *irec, + xfs_agblock_t agbno) +{ + struct xfs_rmap_irec rmap; + unsigned long long rmap_end; + uint64_t owner = XFS_RMAP_OWN_COW; + + if (!info->sc->sa.rmap_cur || xchk_skip_xref(info->sc->sm)) + return; + + /* Find the rmap record for this irec. */ + if (!xchk_bmap_get_rmap(info, irec, agbno, owner, &rmap)) + return; /* - * If the rmap starts before this bmbt record, make sure there's a bmbt - * record for the previous offset that is contiguous with this mapping. - * Skip this for CoW fork extents because the refcount btree (and not - * the inode) is the ondisk owner for those extents. + * CoW staging extents are owned by the refcount btree, so the rmap + * can start before and end after the physical space allocated to this + * mapping. There are no offsets to check. 
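 * (The checks below therefore only require containment: an OWN_COW rmap
 * for agblocks [50, 200) is an acceptable match for a CoW fork mapping
 * that covers agblocks [80, 120).)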
*/ - if (info->whichfork != XFS_COW_FORK && rmap.rm_startblock < agbno && - !xchk_bmap_has_prev(info, irec)) { + if (rmap.rm_startblock > agbno) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + + rmap_end = (unsigned long long)rmap.rm_startblock + rmap.rm_blockcount; + if (rmap_end < agbno + irec->br_blockcount) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + + /* Check the owner */ + if (rmap.rm_owner != owner) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); - return; - } /* - * If the rmap ends after this bmbt record, make sure there's a bmbt - * record for the next offset that is contiguous with this mapping. - * Skip this for CoW fork extents because the refcount btree (and not - * the inode) is the ondisk owner for those extents. + * No flags allowed. Note that the (in-memory) CoW fork distinguishes + * between unwritten and written extents, but we don't track that in + * the rmap records because the blocks are owned (on-disk) by the + * refcountbt, which doesn't track unwritten state. */ - rmap_end = (unsigned long long)rmap.rm_startblock + rmap.rm_blockcount; - if (info->whichfork != XFS_COW_FORK && - rmap_end > agbno + irec->br_blockcount && - !xchk_bmap_has_next(info, irec)) { + if (rmap.rm_flags & XFS_RMAP_ATTR_FORK) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + if (rmap.rm_flags & XFS_RMAP_BMBT_BLOCK) + xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + if (rmap.rm_flags & XFS_RMAP_UNWRITTEN) xchk_fblock_xref_set_corrupt(info->sc, info->whichfork, irec->br_startoff); - return; - } } /* Cross-reference a single rtdev extent record. */ @@ -305,6 +304,7 @@ xchk_bmap_iextent_xref( struct xchk_bmap_info *info, struct xfs_bmbt_irec *irec) { + struct xfs_owner_info oinfo; struct xfs_mount *mp = info->sc->mp; xfs_agnumber_t agno; xfs_agblock_t agbno; @@ -322,17 +322,35 @@ xchk_bmap_iextent_xref( xchk_xref_is_used_space(info->sc, agbno, len); xchk_xref_is_not_inode_chunk(info->sc, agbno, len); - xchk_bmap_xref_rmap(info, irec, agbno); switch (info->whichfork) { case XFS_DATA_FORK: - if (xfs_is_reflink_inode(info->sc->ip)) - break; - fallthrough; + xchk_bmap_xref_rmap(info, irec, agbno); + if (!xfs_is_reflink_inode(info->sc->ip)) { + xfs_rmap_ino_owner(&oinfo, info->sc->ip->i_ino, + info->whichfork, irec->br_startoff); + xchk_xref_is_only_owned_by(info->sc, agbno, + irec->br_blockcount, &oinfo); + xchk_xref_is_not_shared(info->sc, agbno, + irec->br_blockcount); + } + xchk_xref_is_not_cow_staging(info->sc, agbno, + irec->br_blockcount); + break; case XFS_ATTR_FORK: + xchk_bmap_xref_rmap(info, irec, agbno); + xfs_rmap_ino_owner(&oinfo, info->sc->ip->i_ino, + info->whichfork, irec->br_startoff); + xchk_xref_is_only_owned_by(info->sc, agbno, irec->br_blockcount, + &oinfo); xchk_xref_is_not_shared(info->sc, agbno, irec->br_blockcount); + xchk_xref_is_not_cow_staging(info->sc, agbno, + irec->br_blockcount); break; case XFS_COW_FORK: + xchk_bmap_xref_rmap_cow(info, irec, agbno); + xchk_xref_is_only_owned_by(info->sc, agbno, irec->br_blockcount, + &XFS_RMAP_OINFO_COW); xchk_xref_is_cow_staging(info->sc, agbno, irec->br_blockcount); xchk_xref_is_not_shared(info->sc, agbno, @@ -382,7 +400,8 @@ xchk_bmap_iextent( * Check for out-of-order extents. This record could have come * from the incore list, for which there is no ordering check. 
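 * (For example, if the previous mapping covered file blocks [10, 15),
 * this record must not start before file block 15.)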
*/ - if (irec->br_startoff < info->lastoff) + if (irec->br_startoff < info->prev_rec.br_startoff + + info->prev_rec.br_blockcount) xchk_fblock_set_corrupt(info->sc, info->whichfork, irec->br_startoff); @@ -392,15 +411,7 @@ xchk_bmap_iextent( xchk_bmap_dirattr_extent(ip, info, irec); - /* There should never be a "hole" extent in either extent list. */ - if (irec->br_startblock == HOLESTARTBLOCK) - xchk_fblock_set_corrupt(info->sc, info->whichfork, - irec->br_startoff); - /* Make sure the extent points to a valid place. */ - if (irec->br_blockcount > XFS_MAX_BMBT_EXTLEN) - xchk_fblock_set_corrupt(info->sc, info->whichfork, - irec->br_startoff); if (info->is_rt && !xfs_verify_rtext(mp, irec->br_startblock, irec->br_blockcount)) xchk_fblock_set_corrupt(info->sc, info->whichfork, @@ -468,6 +479,12 @@ xchk_bmapbt_rec( return 0; xfs_bmbt_disk_get_all(&rec->bmbt, &irec); + if (xfs_bmap_validate_extent(ip, info->whichfork, &irec) != NULL) { + xchk_fblock_set_corrupt(bs->sc, info->whichfork, + irec.br_startoff); + return 0; + } + if (!xfs_iext_lookup_extent(ip, ifp, irec.br_startoff, &icur, &iext_irec) || irec.br_startoff != iext_irec.br_startoff || @@ -618,45 +635,57 @@ xchk_bmap_check_ag_rmaps( return error; } -/* Make sure each rmap has a corresponding bmbt entry. */ -STATIC int -xchk_bmap_check_rmaps( - struct xfs_scrub *sc, - int whichfork) +/* + * Decide if we want to walk every rmap btree in the fs to make sure that each + * rmap for this file fork has corresponding bmbt entries. + */ +static bool +xchk_bmap_want_check_rmaps( + struct xchk_bmap_info *info) { - struct xfs_ifork *ifp = xfs_ifork_ptr(sc->ip, whichfork); - struct xfs_perag *pag; - xfs_agnumber_t agno; - bool zero_size; - int error; + struct xfs_scrub *sc = info->sc; + struct xfs_ifork *ifp; - if (!xfs_has_rmapbt(sc->mp) || - whichfork == XFS_COW_FORK || - (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) - return 0; + if (!xfs_has_rmapbt(sc->mp)) + return false; + if (info->whichfork == XFS_COW_FORK) + return false; + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return false; /* Don't support realtime rmap checks yet. */ - if (XFS_IS_REALTIME_INODE(sc->ip) && whichfork == XFS_DATA_FORK) - return 0; - - ASSERT(xfs_ifork_ptr(sc->ip, whichfork) != NULL); + if (info->is_rt) + return false; /* - * Only do this for complex maps that are in btree format, or for - * situations where we would seem to have a size but zero extents. - * The inode repair code can zap broken iforks, which means we have - * to flag this bmap as corrupt if there are rmaps that need to be - * reattached. + * The inode repair code zaps broken inode forks by resetting them back + * to EXTENTS format and zero extent records. If we encounter a fork + * in this state along with evidence that the fork isn't supposed to be + * empty, we need to scan the reverse mappings to decide if we're going + * to rebuild the fork. Data forks with nonzero file size are scanned. + * xattr forks are never empty of content, so they are always scanned. */ + ifp = xfs_ifork_ptr(sc->ip, info->whichfork); + if (ifp->if_format == XFS_DINODE_FMT_EXTENTS && ifp->if_nextents == 0) { + if (info->whichfork == XFS_DATA_FORK && + i_size_read(VFS_I(sc->ip)) == 0) + return false; - if (whichfork == XFS_DATA_FORK) - zero_size = i_size_read(VFS_I(sc->ip)) == 0; - else - zero_size = false; + return true; + } - if (ifp->if_format != XFS_DINODE_FMT_BTREE && - (zero_size || ifp->if_nextents > 0)) - return 0; + return false; +} + +/* Make sure each rmap has a corresponding bmbt entry. 
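 * (This walks every AG's rmap btree via xchk_bmap_check_ag_rmaps(), so the
 * caller only attempts it when xchk_bmap_want_check_rmaps() decides the
 * cost is justified.)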
*/ +STATIC int +xchk_bmap_check_rmaps( + struct xfs_scrub *sc, + int whichfork) +{ + struct xfs_perag *pag; + xfs_agnumber_t agno; + int error; for_each_perag(sc->mp, agno, pag) { error = xchk_bmap_check_ag_rmaps(sc, whichfork, pag); @@ -683,7 +712,8 @@ xchk_bmap_iextent_delalloc( * Check for out-of-order extents. This record could have come * from the incore list, for which there is no ordering check. */ - if (irec->br_startoff < info->lastoff) + if (irec->br_startoff < info->prev_rec.br_startoff + + info->prev_rec.br_blockcount) xchk_fblock_set_corrupt(info->sc, info->whichfork, irec->br_startoff); @@ -697,6 +727,101 @@ xchk_bmap_iextent_delalloc( irec->br_startoff); } +/* Decide if this individual fork mapping is ok. */ +static bool +xchk_bmap_iext_mapping( + struct xchk_bmap_info *info, + const struct xfs_bmbt_irec *irec) +{ + /* There should never be a "hole" extent in either extent list. */ + if (irec->br_startblock == HOLESTARTBLOCK) + return false; + if (irec->br_blockcount > XFS_MAX_BMBT_EXTLEN) + return false; + return true; +} + +/* Are these two mappings contiguous with each other? */ +static inline bool +xchk_are_bmaps_contiguous( + const struct xfs_bmbt_irec *b1, + const struct xfs_bmbt_irec *b2) +{ + /* Don't try to combine unallocated mappings. */ + if (!xfs_bmap_is_real_extent(b1)) + return false; + if (!xfs_bmap_is_real_extent(b2)) + return false; + + /* Does b2 come right after b1 in the logical and physical range? */ + if (b1->br_startoff + b1->br_blockcount != b2->br_startoff) + return false; + if (b1->br_startblock + b1->br_blockcount != b2->br_startblock) + return false; + if (b1->br_state != b2->br_state) + return false; + return true; +} + +/* + * Walk the incore extent records, accumulating consecutive contiguous records + * into a single incore mapping. Returns true if @irec has been set to a + * mapping or false if there are no more mappings. Caller must ensure that + * @info.icur is zeroed before the first call. + */ +static int +xchk_bmap_iext_iter( + struct xchk_bmap_info *info, + struct xfs_bmbt_irec *irec) +{ + struct xfs_bmbt_irec got; + struct xfs_ifork *ifp; + xfs_filblks_t prev_len; + + ifp = xfs_ifork_ptr(info->sc->ip, info->whichfork); + + /* Advance to the next iextent record and check the mapping. */ + xfs_iext_next(ifp, &info->icur); + if (!xfs_iext_get_extent(ifp, &info->icur, irec)) + return false; + + if (!xchk_bmap_iext_mapping(info, irec)) { + xchk_fblock_set_corrupt(info->sc, info->whichfork, + irec->br_startoff); + return false; + } + + /* + * Iterate subsequent iextent records and merge them with the one + * that we just read, if possible. + */ + prev_len = irec->br_blockcount; + while (xfs_iext_peek_next_extent(ifp, &info->icur, &got)) { + if (!xchk_are_bmaps_contiguous(irec, &got)) + break; + + if (!xchk_bmap_iext_mapping(info, &got)) { + xchk_fblock_set_corrupt(info->sc, info->whichfork, + got.br_startoff); + return false; + } + + /* + * Notify the user of mergeable records in the data or attr + * forks. CoW forks only exist in memory so we ignore them. + */ + if (info->whichfork != XFS_COW_FORK && + prev_len + got.br_blockcount > BMBT_BLOCKCOUNT_MASK) + xchk_ino_set_preen(info->sc, info->sc->ip->i_ino); + + irec->br_blockcount += got.br_blockcount; + prev_len = got.br_blockcount; + xfs_iext_next(ifp, &info->icur); + } + + return true; +} + /* * Scrub an inode fork's block mappings. * @@ -776,10 +901,15 @@ xchk_bmap( if (!xchk_fblock_process_error(sc, whichfork, 0, &error)) goto out; - /* Scrub extent records. 
*/ - info.lastoff = 0; - ifp = xfs_ifork_ptr(ip, whichfork); - for_each_xfs_iext(ifp, &info.icur, &irec) { + /* + * Scrub extent records. We use a special iterator function here that + * combines adjacent mappings if they are logically and physically + * contiguous. For large allocations that require multiple bmbt + * records, this reduces the number of cross-referencing calls, which + * reduces runtime. Cross referencing with the rmap is simpler because + * the rmap must match the combined mapping exactly. + */ + while (xchk_bmap_iext_iter(&info, &irec)) { if (xchk_should_terminate(sc, &error) || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) goto out; @@ -794,12 +924,14 @@ xchk_bmap( xchk_bmap_iextent_delalloc(ip, &info, &irec); else xchk_bmap_iextent(ip, &info, &irec); - info.lastoff = irec.br_startoff + irec.br_blockcount; + memcpy(&info.prev_rec, &irec, sizeof(struct xfs_bmbt_irec)); } - error = xchk_bmap_check_rmaps(sc, whichfork); - if (!xchk_fblock_xref_process_error(sc, whichfork, 0, &error)) - goto out; + if (xchk_bmap_want_check_rmaps(&info)) { + error = xchk_bmap_check_rmaps(sc, whichfork); + if (!xchk_fblock_xref_process_error(sc, whichfork, 0, &error)) + goto out; + } out: return error; } diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c index 0fd36d5b4646..1935b9ce1885 100644 --- a/fs/xfs/scrub/btree.c +++ b/fs/xfs/scrub/btree.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -36,6 +36,7 @@ __xchk_btree_process_error( switch (*error) { case -EDEADLOCK: + case -ECHRNG: /* Used to restart an op with deadlock avoidance. */ trace_xchk_deadlock_retry(sc->ip, sc->sm, *error); break; @@ -118,6 +119,16 @@ xchk_btree_xref_set_corrupt( __return_address); } +void +xchk_btree_set_preen( + struct xfs_scrub *sc, + struct xfs_btree_cur *cur, + int level) +{ + __xchk_btree_set_corrupt(sc, cur, level, XFS_SCRUB_OFLAG_PREEN, + __return_address); +} + /* * Make sure this record is in order and doesn't stray outside of the parent * keys. @@ -140,29 +151,30 @@ xchk_btree_rec( trace_xchk_btree_rec(bs->sc, cur, 0); - /* If this isn't the first record, are they in order? */ - if (cur->bc_levels[0].ptr > 1 && + /* Are all records across all record blocks in order? */ + if (bs->lastrec_valid && !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec)) xchk_btree_set_corrupt(bs->sc, cur, 0); memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len); + bs->lastrec_valid = true; if (cur->bc_nlevels == 1) return; - /* Is this at least as large as the parent low key? */ + /* Is low_key(rec) at least as large as the parent low key? */ cur->bc_ops->init_key_from_rec(&key, rec); keyblock = xfs_btree_get_block(cur, 1, &bp); keyp = xfs_btree_key_addr(cur, cur->bc_levels[1].ptr, keyblock); - if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0) + if (xfs_btree_keycmp_lt(cur, &key, keyp)) xchk_btree_set_corrupt(bs->sc, cur, 1); if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) return; - /* Is this no larger than the parent high key? */ + /* Is high_key(rec) no larger than the parent high key? 
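 * (Only overlapping btrees such as the rmapbt carry high keys; the function
 * already returned above if this cursor is not one of them.  For those trees
 * the parent's high key must be at least as large as the high key of every
 * record in the child block, so a record whose high key exceeds it has
 * strayed outside the parent's keyspace.)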
*/ cur->bc_ops->init_high_key_from_rec(&hkey, rec); keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[1].ptr, keyblock); - if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0) + if (xfs_btree_keycmp_lt(cur, keyp, &hkey)) xchk_btree_set_corrupt(bs->sc, cur, 1); } @@ -187,29 +199,30 @@ xchk_btree_key( trace_xchk_btree_key(bs->sc, cur, level); - /* If this isn't the first key, are they in order? */ - if (cur->bc_levels[level].ptr > 1 && - !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key)) + /* Are all low keys across all node blocks in order? */ + if (bs->lastkey[level - 1].valid && + !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1].key, key)) xchk_btree_set_corrupt(bs->sc, cur, level); - memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len); + memcpy(&bs->lastkey[level - 1].key, key, cur->bc_ops->key_len); + bs->lastkey[level - 1].valid = true; if (level + 1 >= cur->bc_nlevels) return; - /* Is this at least as large as the parent low key? */ + /* Is this block's low key at least as large as the parent low key? */ keyblock = xfs_btree_get_block(cur, level + 1, &bp); keyp = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock); - if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0) + if (xfs_btree_keycmp_lt(cur, key, keyp)) xchk_btree_set_corrupt(bs->sc, cur, level); if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) return; - /* Is this no larger than the parent high key? */ + /* Is this block's high key no larger than the parent high key? */ key = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block); keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock); - if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0) + if (xfs_btree_keycmp_lt(cur, keyp, key)) xchk_btree_set_corrupt(bs->sc, cur, level); } @@ -389,7 +402,7 @@ xchk_btree_check_block_owner( if (!bs->sc->sa.bno_cur && btnum == XFS_BTNUM_BNO) bs->cur = NULL; - xchk_xref_is_owned_by(bs->sc, agbno, 1, bs->oinfo); + xchk_xref_is_only_owned_by(bs->sc, agbno, 1, bs->oinfo); if (!bs->sc->sa.rmap_cur && btnum == XFS_BTNUM_RMAP) bs->cur = NULL; @@ -519,6 +532,48 @@ xchk_btree_check_minrecs( } /* + * If this btree block has a parent, make sure that the parent's keys capture + * the keyspace contained in this block. + */ +STATIC void +xchk_btree_block_check_keys( + struct xchk_btree *bs, + int level, + struct xfs_btree_block *block) +{ + union xfs_btree_key block_key; + union xfs_btree_key *block_high_key; + union xfs_btree_key *parent_low_key, *parent_high_key; + struct xfs_btree_cur *cur = bs->cur; + struct xfs_btree_block *parent_block; + struct xfs_buf *bp; + + if (level == cur->bc_nlevels - 1) + return; + + xfs_btree_get_keys(cur, block, &block_key); + + /* Make sure the low key of this block matches the parent. */ + parent_block = xfs_btree_get_block(cur, level + 1, &bp); + parent_low_key = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, + parent_block); + if (xfs_btree_keycmp_ne(cur, &block_key, parent_low_key)) { + xchk_btree_set_corrupt(bs->sc, bs->cur, level); + return; + } + + if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) + return; + + /* Make sure the high key of this block matches the parent. */ + parent_high_key = xfs_btree_high_key_addr(cur, + cur->bc_levels[level + 1].ptr, parent_block); + block_high_key = xfs_btree_high_key_from_key(cur, &block_key); + if (xfs_btree_keycmp_ne(cur, block_high_key, parent_high_key)) + xchk_btree_set_corrupt(bs->sc, bs->cur, level); +} + +/* * Grab and scrub a btree block given a btree pointer. 
Returns block * and buffer pointers (if applicable) if they're ok to use. */ @@ -569,7 +624,12 @@ xchk_btree_get_block( * Check the block's siblings; this function absorbs error codes * for us. */ - return xchk_btree_block_check_siblings(bs, *pblock); + error = xchk_btree_block_check_siblings(bs, *pblock); + if (error) + return error; + + xchk_btree_block_check_keys(bs, level, *pblock); + return 0; } /* @@ -601,7 +661,7 @@ xchk_btree_block_keys( parent_keys = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, parent_block); - if (cur->bc_ops->diff_two_keys(cur, &block_keys, parent_keys) != 0) + if (xfs_btree_keycmp_ne(cur, &block_keys, parent_keys)) xchk_btree_set_corrupt(bs->sc, cur, 1); if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING)) @@ -612,7 +672,7 @@ xchk_btree_block_keys( high_pk = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr, parent_block); - if (cur->bc_ops->diff_two_keys(cur, high_bk, high_pk) != 0) + if (xfs_btree_keycmp_ne(cur, high_bk, high_pk)) xchk_btree_set_corrupt(bs->sc, cur, 1); } diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h index da61a53a0b61..9d7b9ee8bef4 100644 --- a/fs/xfs/scrub/btree.h +++ b/fs/xfs/scrub/btree.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_BTREE_H__ #define __XFS_SCRUB_BTREE_H__ @@ -19,6 +19,8 @@ bool xchk_btree_xref_process_error(struct xfs_scrub *sc, /* Check for btree corruption. */ void xchk_btree_set_corrupt(struct xfs_scrub *sc, struct xfs_btree_cur *cur, int level); +void xchk_btree_set_preen(struct xfs_scrub *sc, struct xfs_btree_cur *cur, + int level); /* Check for btree xref discrepancies. */ void xchk_btree_xref_set_corrupt(struct xfs_scrub *sc, @@ -29,6 +31,11 @@ typedef int (*xchk_btree_rec_fn)( struct xchk_btree *bs, const union xfs_btree_rec *rec); +struct xchk_btree_key { + union xfs_btree_key key; + bool valid; +}; + struct xchk_btree { /* caller-provided scrub state */ struct xfs_scrub *sc; @@ -38,11 +45,12 @@ struct xchk_btree { void *private; /* internal scrub state */ + bool lastrec_valid; union xfs_btree_rec lastrec; struct list_head to_check; /* this element must come last! */ - union xfs_btree_key lastkey[]; + struct xchk_btree_key lastkey[]; }; /* diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 848a8e32e56f..9aa79665c608 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -75,6 +75,7 @@ __xchk_process_error( case 0: return true; case -EDEADLOCK: + case -ECHRNG: /* Used to restart an op with deadlock avoidance. */ trace_xchk_deadlock_retry( sc->ip ? sc->ip : XFS_I(file_inode(sc->file)), @@ -130,6 +131,7 @@ __xchk_fblock_process_error( case 0: return true; case -EDEADLOCK: + case -ECHRNG: /* Used to restart an op with deadlock avoidance. */ trace_xchk_deadlock_retry(sc->ip, sc->sm, *error); break; @@ -396,26 +398,19 @@ want_ag_read_header_failure( } /* - * Grab the perag structure and all the headers for an AG. 
+ * Grab the AG header buffers for the attached perag structure. * * The headers should be released by xchk_ag_free, but as a fail safe we attach * all the buffers we grab to the scrub transaction so they'll all be freed - * when we cancel it. Returns ENOENT if we can't grab the perag structure. + * when we cancel it. */ -int -xchk_ag_read_headers( +static inline int +xchk_perag_read_headers( struct xfs_scrub *sc, - xfs_agnumber_t agno, struct xchk_ag *sa) { - struct xfs_mount *mp = sc->mp; int error; - ASSERT(!sa->pag); - sa->pag = xfs_perag_get(mp, agno); - if (!sa->pag) - return -ENOENT; - error = xfs_ialloc_read_agi(sa->pag, sc->tp, &sa->agi_bp); if (error && want_ag_read_header_failure(sc, XFS_SCRUB_TYPE_AGI)) return error; @@ -427,6 +422,104 @@ xchk_ag_read_headers( return 0; } +/* + * Grab the AG headers for the attached perag structure and wait for pending + * intents to drain. + */ +static int +xchk_perag_drain_and_lock( + struct xfs_scrub *sc) +{ + struct xchk_ag *sa = &sc->sa; + int error = 0; + + ASSERT(sa->pag != NULL); + ASSERT(sa->agi_bp == NULL); + ASSERT(sa->agf_bp == NULL); + + do { + if (xchk_should_terminate(sc, &error)) + return error; + + error = xchk_perag_read_headers(sc, sa); + if (error) + return error; + + /* + * If we've grabbed an inode for scrubbing then we assume that + * holding its ILOCK will suffice to coordinate with any intent + * chains involving this inode. + */ + if (sc->ip) + return 0; + + /* + * Decide if this AG is quiet enough for all metadata to be + * consistent with each other. XFS allows the AG header buffer + * locks to cycle across transaction rolls while processing + * chains of deferred ops, which means that there could be + * other threads in the middle of processing a chain of + * deferred ops. For regular operations we are careful about + * ordering operations to prevent collisions between threads + * (which is why we don't need a per-AG lock), but scrub and + * repair have to serialize against chained operations. + * + * We just locked all the AG headers buffers; now take a look + * to see if there are any intents in progress. If there are, + * drop the AG headers and wait for the intents to drain. + * Since we hold all the AG header locks for the duration of + * the scrub, this is the only time we have to sample the + * intents counter; any threads increasing it after this point + * can't possibly be in the middle of a chain of AG metadata + * updates. + * + * Obviously, this should be slanted against scrub and in favor + * of runtime threads. + */ + if (!xfs_perag_intent_busy(sa->pag)) + return 0; + + if (sa->agf_bp) { + xfs_trans_brelse(sc->tp, sa->agf_bp); + sa->agf_bp = NULL; + } + + if (sa->agi_bp) { + xfs_trans_brelse(sc->tp, sa->agi_bp); + sa->agi_bp = NULL; + } + + if (!(sc->flags & XCHK_FSGATES_DRAIN)) + return -ECHRNG; + error = xfs_perag_intent_drain(sa->pag); + if (error == -ERESTARTSYS) + error = -EINTR; + } while (!error); + + return error; +} + +/* + * Grab the per-AG structure, grab all AG header buffers, and wait until there + * aren't any pending intents. Returns -ENOENT if we can't grab the perag + * structure. + */ +int +xchk_ag_read_headers( + struct xfs_scrub *sc, + xfs_agnumber_t agno, + struct xchk_ag *sa) +{ + struct xfs_mount *mp = sc->mp; + + ASSERT(!sa->pag); + sa->pag = xfs_perag_get(mp, agno); + if (!sa->pag) + return -ENOENT; + + return xchk_perag_drain_and_lock(sc); +} + /* Release all the AG btree cursors. 
*/ void xchk_ag_btcur_free( @@ -550,6 +643,14 @@ xchk_ag_init( /* Per-scrubber setup functions */ +void +xchk_trans_cancel( + struct xfs_scrub *sc) +{ + xfs_trans_cancel(sc->tp); + sc->tp = NULL; +} + /* * Grab an empty transaction so that we can re-grab locked buffers if * one of our btrees turns out to be cyclic. @@ -625,80 +726,273 @@ xchk_checkpoint_log( return 0; } +/* Verify that an inode is allocated ondisk, then return its cached inode. */ +int +xchk_iget( + struct xfs_scrub *sc, + xfs_ino_t inum, + struct xfs_inode **ipp) +{ + return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp); +} + /* - * Given an inode and the scrub control structure, grab either the - * inode referenced in the control structure or the inode passed in. - * The inode is not locked. + * Try to grab an inode in a manner that avoids races with physical inode + * allocation. If we can't, return the locked AGI buffer so that the caller + * can single-step the loading process to see where things went wrong. + * Callers must have a valid scrub transaction. + * + * If the iget succeeds, return 0, a NULL AGI, and the inode. + * + * If the iget fails, return the error, the locked AGI, and a NULL inode. This + * can include -EINVAL and -ENOENT for invalid inode numbers or inodes that are + * no longer allocated; or any other corruption or runtime error. + * + * If the AGI read fails, return the error, a NULL AGI, and NULL inode. + * + * If a fatal signal is pending, return -EINTR, a NULL AGI, and a NULL inode. */ int -xchk_get_inode( +xchk_iget_agi( + struct xfs_scrub *sc, + xfs_ino_t inum, + struct xfs_buf **agi_bpp, + struct xfs_inode **ipp) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_trans *tp = sc->tp; + struct xfs_perag *pag; + int error; + + ASSERT(sc->tp != NULL); + +again: + *agi_bpp = NULL; + *ipp = NULL; + error = 0; + + if (xchk_should_terminate(sc, &error)) + return error; + + /* + * Attach the AGI buffer to the scrub transaction to avoid deadlocks + * in the iget cache miss path. + */ + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum)); + error = xfs_ialloc_read_agi(pag, tp, agi_bpp); + xfs_perag_put(pag); + if (error) + return error; + + error = xfs_iget(mp, tp, inum, + XFS_IGET_NORETRY | XFS_IGET_UNTRUSTED, 0, ipp); + if (error == -EAGAIN) { + /* + * The inode may be in core but temporarily unavailable and may + * require the AGI buffer before it can be returned. Drop the + * AGI buffer and retry the lookup. + * + * Incore lookup will fail with EAGAIN on a cache hit if the + * inode is queued to the inactivation list. The inactivation + * worker may remove the inode from the unlinked list and hence + * needs the AGI. + * + * Hence xchk_iget_agi() needs to drop the AGI lock on EAGAIN + * to allow inodegc to make progress and move the inode to + * IRECLAIMABLE state where xfs_iget will be able to return it + * again if it can lock the inode. + */ + xfs_trans_brelse(tp, *agi_bpp); + delay(1); + goto again; + } + if (error) + return error; + + /* We got the inode, so we can release the AGI. */ + ASSERT(*ipp != NULL); + xfs_trans_brelse(tp, *agi_bpp); + *agi_bpp = NULL; + return 0; +} + +/* Install an inode that we opened by handle for scrubbing. 
*/ +int +xchk_install_handle_inode( + struct xfs_scrub *sc, + struct xfs_inode *ip) +{ + if (VFS_I(ip)->i_generation != sc->sm->sm_gen) { + xchk_irele(sc, ip); + return -ENOENT; + } + + sc->ip = ip; + return 0; +} + +/* + * In preparation to scrub metadata structures that hang off of an inode, + * grab either the inode referenced in the scrub control structure or the + * inode passed in. If the inumber does not reference an allocated inode + * record, the function returns ENOENT to end the scrub early. The inode + * is not locked. + */ +int +xchk_iget_for_scrubbing( struct xfs_scrub *sc) { struct xfs_imap imap; struct xfs_mount *mp = sc->mp; struct xfs_perag *pag; + struct xfs_buf *agi_bp; struct xfs_inode *ip_in = XFS_I(file_inode(sc->file)); struct xfs_inode *ip = NULL; + xfs_agnumber_t agno = XFS_INO_TO_AGNO(mp, sc->sm->sm_ino); int error; + ASSERT(sc->tp == NULL); + /* We want to scan the inode we already had opened. */ if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) { sc->ip = ip_in; return 0; } - /* Look up the inode, see if the generation number matches. */ + /* Reject internal metadata files and obviously bad inode numbers. */ if (xfs_internal_inum(mp, sc->sm->sm_ino)) return -ENOENT; - error = xfs_iget(mp, NULL, sc->sm->sm_ino, - XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE, 0, &ip); - switch (error) { - case -ENOENT: - /* Inode doesn't exist, just bail out. */ + if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino)) + return -ENOENT; + + /* Try a regular untrusted iget. */ + error = xchk_iget(sc, sc->sm->sm_ino, &ip); + if (!error) + return xchk_install_handle_inode(sc, ip); + if (error == -ENOENT) return error; - case 0: - /* Got an inode, continue. */ - break; - case -EINVAL: + if (error != -EINVAL) + goto out_error; + + /* + * EINVAL with IGET_UNTRUSTED probably means one of several things: + * userspace gave us an inode number that doesn't correspond to fs + * space; the inode btree lacks a record for this inode; or there is a + * record, and it says this inode is free. + * + * We want to look up this inode in the inobt to distinguish two + * scenarios: (1) the inobt says the inode is free, in which case + * there's nothing to do; and (2) the inobt says the inode is + * allocated, but loading it failed due to corruption. + * + * Allocate a transaction and grab the AGI to prevent inobt activity + * in this AG. Retry the iget in case someone allocated a new inode + * after the first iget failed. + */ + error = xchk_trans_alloc(sc, 0); + if (error) + goto out_error; + + error = xchk_iget_agi(sc, sc->sm->sm_ino, &agi_bp, &ip); + if (error == 0) { + /* Actually got the inode, so install it. */ + xchk_trans_cancel(sc); + return xchk_install_handle_inode(sc, ip); + } + if (error == -ENOENT) + goto out_gone; + if (error != -EINVAL) + goto out_cancel; + + /* Ensure that we have protected against inode allocation/freeing. */ + if (agi_bp == NULL) { + ASSERT(agi_bp != NULL); + error = -ECANCELED; + goto out_cancel; + } + + /* + * Untrusted iget failed a second time. Let's try an inobt lookup. + * If the inobt thinks this inode neither can exist inside the + * filesystem nor is allocated, return ENOENT to signal that the check + * can be skipped. + * + * If the lookup returns corruption, we'll mark this inode corrupt and + * exit to userspace. There's little chance of fixing anything until + * the inobt is straightened out, but there's nothing we can do here. + * + * If the lookup encounters any other error, exit to userspace.
+ * + * If the lookup succeeds, something else must be very wrong in the fs + * such that setting up the incore inode failed in some strange way. + * Treat those as corruptions. + */ + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, sc->sm->sm_ino)); + if (!pag) { + error = -EFSCORRUPTED; + goto out_cancel; + } + + error = xfs_imap(pag, sc->tp, sc->sm->sm_ino, &imap, + XFS_IGET_UNTRUSTED); + xfs_perag_put(pag); + if (error == -EINVAL || error == -ENOENT) + goto out_gone; + if (!error) + error = -EFSCORRUPTED; + +out_cancel: + xchk_trans_cancel(sc); +out_error: + trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), + error, __return_address); + return error; +out_gone: + /* The file is gone, so there's nothing to check. */ + xchk_trans_cancel(sc); + return -ENOENT; +} + +/* Release an inode, possibly dropping it in the process. */ +void +xchk_irele( + struct xfs_scrub *sc, + struct xfs_inode *ip) +{ + if (current->journal_info != NULL) { + ASSERT(current->journal_info == sc->tp); + /* - * -EINVAL with IGET_UNTRUSTED could mean one of several - * things: userspace gave us an inode number that doesn't - * correspond to fs space, or doesn't have an inobt entry; - * or it could simply mean that the inode buffer failed the - * read verifiers. + * If we are in a transaction, we /cannot/ drop the inode + * ourselves, because the VFS will trigger writeback, which + * can require a transaction. Clear DONTCACHE to force the + * inode to the LRU, where someone else can take care of + * dropping it. * - * Try just the inode mapping lookup -- if it succeeds, then - * the inode buffer verifier failed and something needs fixing. - * Otherwise, we really couldn't find it so tell userspace - * that it no longer exists. + * Note that when we grabbed our reference to the inode, it + * could have had an active ref and DONTCACHE set if a sysadmin + * is trying to coerce a change in file access mode. icache + * hits do not clear DONTCACHE, so we must do it here. */ - pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, sc->sm->sm_ino)); - if (pag) { - error = xfs_imap(pag, sc->tp, sc->sm->sm_ino, &imap, - XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE); - xfs_perag_put(pag); - if (error) - return -ENOENT; - } - error = -EFSCORRUPTED; - fallthrough; - default: - trace_xchk_op_error(sc, - XFS_INO_TO_AGNO(mp, sc->sm->sm_ino), - XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), - error, __return_address); - return error; - } - if (VFS_I(ip)->i_generation != sc->sm->sm_gen) { - xfs_irele(ip); - return -ENOENT; + spin_lock(&VFS_I(ip)->i_lock); + VFS_I(ip)->i_state &= ~I_DONTCACHE; + spin_unlock(&VFS_I(ip)->i_lock); + } else if (atomic_read(&VFS_I(ip)->i_count) == 1) { + /* + * If this is the last reference to the inode and the caller + * permits it, set DONTCACHE to avoid thrashing. + */ + d_mark_dontcache(VFS_I(ip)); } - sc->ip = ip; - return 0; + xfs_irele(ip); } -/* Set us up to scrub a file's contents. */ +/* + * Set us up to scrub metadata mapped by a file's fork. Callers must not use + * this to operate on user-accessible regular file data because the MMAPLOCK is + * not taken. + */ int xchk_setup_inode_contents( struct xfs_scrub *sc, @@ -706,13 +1000,14 @@ xchk_setup_inode_contents( { int error; - error = xchk_get_inode(sc); + error = xchk_iget_for_scrubbing(sc); if (error) return error; - /* Got the inode, lock it and we're ready to go. */ - sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; + /* Lock the inode so the VFS cannot touch this file. 
*/ + sc->ilock_flags = XFS_IOLOCK_EXCL; xfs_ilock(sc->ip, sc->ilock_flags); + error = xchk_trans_alloc(sc, resblks); if (error) goto out; @@ -869,28 +1164,6 @@ xchk_metadata_inode_forks( return 0; } -/* - * Try to lock an inode in violation of the usual locking order rules. For - * example, trying to get the IOLOCK while in transaction context, or just - * plain breaking AG-order or inode-order inode locking rules. Either way, - * the only way to avoid an ABBA deadlock is to use trylock and back off if - * we can't. - */ -int -xchk_ilock_inverted( - struct xfs_inode *ip, - uint lock_mode) -{ - int i; - - for (i = 0; i < 20; i++) { - if (xfs_ilock_nowait(ip, lock_mode)) - return 0; - delay(1); - } - return -EDEADLOCK; -} - /* Pause background reaping of resources. */ void xchk_stop_reaping( @@ -916,3 +1189,25 @@ xchk_start_reaping( } sc->flags &= ~XCHK_REAPING_DISABLED; } + +/* + * Enable filesystem hooks (i.e. runtime code patching) before starting a scrub + * operation. Callers must not hold any locks that intersect with the CPU + * hotplug lock (e.g. writeback locks) because code patching must halt the CPUs + * to change kernel code. + */ +void +xchk_fsgates_enable( + struct xfs_scrub *sc, + unsigned int scrub_fsgates) +{ + ASSERT(!(scrub_fsgates & ~XCHK_FSGATES_ALL)); + ASSERT(!(sc->flags & scrub_fsgates)); + + trace_xchk_fsgates_enable(sc, scrub_fsgates); + + if (scrub_fsgates & XCHK_FSGATES_DRAIN) + xfs_drain_wait_enable(); + + sc->flags |= scrub_fsgates; +} diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h index b73648d81d23..18b5f2b62f13 100644 --- a/fs/xfs/scrub/common.h +++ b/fs/xfs/scrub/common.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_COMMON_H__ #define __XFS_SCRUB_COMMON_H__ @@ -32,6 +32,8 @@ xchk_should_terminate( } int xchk_trans_alloc(struct xfs_scrub *sc, uint resblks); +void xchk_trans_cancel(struct xfs_scrub *sc); + bool xchk_process_error(struct xfs_scrub *sc, xfs_agnumber_t agno, xfs_agblock_t bno, int *error); bool xchk_fblock_process_error(struct xfs_scrub *sc, int whichfork, @@ -72,6 +74,7 @@ bool xchk_should_check_xref(struct xfs_scrub *sc, int *error, struct xfs_btree_cur **curpp); /* Setup functions */ +int xchk_setup_agheader(struct xfs_scrub *sc); int xchk_setup_fs(struct xfs_scrub *sc); int xchk_setup_ag_allocbt(struct xfs_scrub *sc); int xchk_setup_ag_iallocbt(struct xfs_scrub *sc); @@ -132,10 +135,16 @@ int xchk_count_rmap_ownedby_ag(struct xfs_scrub *sc, struct xfs_btree_cur *cur, const struct xfs_owner_info *oinfo, xfs_filblks_t *blocks); int xchk_setup_ag_btree(struct xfs_scrub *sc, bool force_log); -int xchk_get_inode(struct xfs_scrub *sc); +int xchk_iget_for_scrubbing(struct xfs_scrub *sc); int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks); void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp); +int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp); +int xchk_iget_agi(struct xfs_scrub *sc, xfs_ino_t inum, + struct xfs_buf **agi_bpp, struct xfs_inode **ipp); +void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip); +int xchk_install_handle_inode(struct xfs_scrub *sc, struct xfs_inode *ip); + /* * Don't bother cross-referencing if we already found corruption or cross * referencing discrepancies. @@ -147,8 +156,21 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm) } int xchk_metadata_inode_forks(struct xfs_scrub *sc); -int xchk_ilock_inverted(struct xfs_inode *ip, uint lock_mode); void xchk_stop_reaping(struct xfs_scrub *sc); void xchk_start_reaping(struct xfs_scrub *sc); +/* + * Setting up a hook to wait for intents to drain is costly -- we have to take + * the CPU hotplug lock and force an i-cache flush on all CPUs once to set it + * up, and again to tear it down. These costs add up quickly, so we only want + * to enable the drain waiter if the drain actually detected a conflict with + * running intent chains. + */ +static inline bool xchk_need_intent_drain(struct xfs_scrub *sc) +{ + return sc->flags & XCHK_NEED_DRAIN; +} + +void xchk_fsgates_enable(struct xfs_scrub *sc, unsigned int scrub_fshooks); + #endif /* __XFS_SCRUB_COMMON_H__ */ diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c index d17cee177085..82b150d3b8b7 100644 --- a/fs/xfs/scrub/dabtree.c +++ b/fs/xfs/scrub/dabtree.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -39,6 +39,7 @@ xchk_da_process_error( switch (*error) { case -EDEADLOCK: + case -ECHRNG: /* Used to restart an op with deadlock avoidance. */ trace_xchk_deadlock_retry(sc->ip, sc->sm, *error); break; diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h index 1f3515c6d5a8..4f8c2138a1ec 100644 --- a/fs/xfs/scrub/dabtree.h +++ b/fs/xfs/scrub/dabtree.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. 
Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_DABTREE_H__ #define __XFS_SCRUB_DABTREE_H__ diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index d1b0f23c2c59..0b491784b759 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -18,6 +18,7 @@ #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/dabtree.h" +#include "scrub/readdir.h" /* Set us up to scrub directories. */ int @@ -31,168 +32,114 @@ xchk_setup_directory( /* Scrub a directory entry. */ -struct xchk_dir_ctx { - /* VFS fill-directory iterator */ - struct dir_context dir_iter; - - struct xfs_scrub *sc; -}; - -/* Check that an inode's mode matches a given DT_ type. */ -STATIC int +/* Check that an inode's mode matches a given XFS_DIR3_FT_* type. */ +STATIC void xchk_dir_check_ftype( - struct xchk_dir_ctx *sdc, + struct xfs_scrub *sc, xfs_fileoff_t offset, - xfs_ino_t inum, - int dtype) + struct xfs_inode *ip, + int ftype) { - struct xfs_mount *mp = sdc->sc->mp; - struct xfs_inode *ip; - int ino_dtype; - int error = 0; + struct xfs_mount *mp = sc->mp; if (!xfs_has_ftype(mp)) { - if (dtype != DT_UNKNOWN && dtype != DT_DIR) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); - goto out; - } - - /* - * Grab the inode pointed to by the dirent. We release the - * inode before we cancel the scrub transaction. Since we're - * don't know a priori that releasing the inode won't trigger - * eofblocks cleanup (which allocates what would be a nested - * transaction), we can't use DONTCACHE here because DONTCACHE - * inodes can trigger immediate inactive cleanup of the inode. - * - * If _iget returns -EINVAL or -ENOENT then the child inode number is - * garbage and the directory is corrupt. If the _iget returns - * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a - * cross referencing error. Any other error is an operational error. - */ - error = xfs_iget(mp, sdc->sc->tp, inum, 0, 0, &ip); - if (error == -EINVAL || error == -ENOENT) { - error = -EFSCORRUPTED; - xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, 0, &error); - goto out; + if (ftype != XFS_DIR3_FT_UNKNOWN && ftype != XFS_DIR3_FT_DIR) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return; } - if (!xchk_fblock_xref_process_error(sdc->sc, XFS_DATA_FORK, offset, - &error)) - goto out; - /* Convert mode to the DT_* values that dir_emit uses. */ - ino_dtype = xfs_dir3_get_dtype(mp, - xfs_mode_to_ftype(VFS_I(ip)->i_mode)); - if (ino_dtype != dtype) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); - xfs_irele(ip); -out: - return error; + if (xfs_mode_to_ftype(VFS_I(ip)->i_mode) != ftype) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); } /* * Scrub a single directory entry. * - * We use the VFS directory iterator (i.e. readdir) to call this - * function for every directory entry in a directory. Once we're here, - * we check the inode number to make sure it's sane, then we check that - * we can look up this filename. Finally, we check the ftype. + * Check the inode number to make sure it's sane, then we check that we can + * look up this filename. 
Finally, we check the ftype. */ -STATIC bool +STATIC int xchk_dir_actor( - struct dir_context *dir_iter, - const char *name, - int namelen, - loff_t pos, - u64 ino, - unsigned type) + struct xfs_scrub *sc, + struct xfs_inode *dp, + xfs_dir2_dataptr_t dapos, + const struct xfs_name *name, + xfs_ino_t ino, + void *priv) { - struct xfs_mount *mp; + struct xfs_mount *mp = dp->i_mount; struct xfs_inode *ip; - struct xchk_dir_ctx *sdc; - struct xfs_name xname; xfs_ino_t lookup_ino; xfs_dablk_t offset; - bool checked_ftype = false; int error = 0; - sdc = container_of(dir_iter, struct xchk_dir_ctx, dir_iter); - ip = sdc->sc->ip; - mp = ip->i_mount; offset = xfs_dir2_db_to_da(mp->m_dir_geo, - xfs_dir2_dataptr_to_db(mp->m_dir_geo, pos)); + xfs_dir2_dataptr_to_db(mp->m_dir_geo, dapos)); - if (xchk_should_terminate(sdc->sc, &error)) - return !error; + if (xchk_should_terminate(sc, &error)) + return error; /* Does this inode number make sense? */ if (!xfs_verify_dir_ino(mp, ino)) { - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); - goto out; + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return -ECANCELED; } /* Does this name make sense? */ - if (!xfs_dir2_namecheck(name, namelen)) { - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); - goto out; + if (!xfs_dir2_namecheck(name->name, name->len)) { + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return -ECANCELED; } - if (!strncmp(".", name, namelen)) { + if (!strncmp(".", name->name, name->len)) { /* If this is "." then check that the inum matches the dir. */ - if (xfs_has_ftype(mp) && type != DT_DIR) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); - checked_ftype = true; - if (ino != ip->i_ino) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); - } else if (!strncmp("..", name, namelen)) { + if (ino != dp->i_ino) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + } else if (!strncmp("..", name->name, name->len)) { /* * If this is ".." in the root inode, check that the inum * matches this dir. */ - if (xfs_has_ftype(mp) && type != DT_DIR) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); - checked_ftype = true; - if (ip->i_ino == mp->m_sb.sb_rootino && ino != ip->i_ino) - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, - offset); + if (dp->i_ino == mp->m_sb.sb_rootino && ino != dp->i_ino) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); } /* Verify that we can look up this name by hash. */ - xname.name = name; - xname.len = namelen; - xname.type = XFS_DIR3_FT_UNKNOWN; - - error = xfs_dir_lookup(sdc->sc->tp, ip, &xname, &lookup_ino, NULL); + error = xchk_dir_lookup(sc, dp, name, &lookup_ino); /* ENOENT means the hash lookup failed and the dir is corrupt */ if (error == -ENOENT) error = -EFSCORRUPTED; - if (!xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, offset, - &error)) + if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, offset, &error)) goto out; if (lookup_ino != ino) { - xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset); - goto out; + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset); + return -ECANCELED; } - /* Verify the file type. This function absorbs error codes. */ - if (!checked_ftype) { - error = xchk_dir_check_ftype(sdc, offset, lookup_ino, type); - if (error) - goto out; - } -out: /* - * A negative error code returned here is supposed to cause the - * dir_emit caller (xfs_readdir) to abort the directory iteration - * and return zero to xchk_directory. + * Grab the inode pointed to by the dirent. 
We release the inode + * before we cancel the scrub transaction. + * + * If _iget returns -EINVAL or -ENOENT then the child inode number is + * garbage and the directory is corrupt. If the _iget returns + * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a + * cross referencing error. Any other error is an operational error. */ - if (error == 0 && sdc->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - return false; - return !error; + error = xchk_iget(sc, ino, &ip); + if (error == -EINVAL || error == -ENOENT) { + error = -EFSCORRUPTED; + xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); + goto out; + } + if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, offset, &error)) + goto out; + + xchk_dir_check_ftype(sc, offset, ip, name->type); + xchk_irele(sc, ip); +out: + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return -ECANCELED; + return error; } /* Scrub a directory btree record. */ @@ -201,6 +148,7 @@ xchk_dir_rec( struct xchk_da_btree *ds, int level) { + struct xfs_name dname = { }; struct xfs_da_state_blk *blk = &ds->state->path.blk[level]; struct xfs_mount *mp = ds->state->mp; struct xfs_inode *dp = ds->dargs.dp; @@ -297,7 +245,11 @@ xchk_dir_rec( xchk_fblock_set_corrupt(ds->sc, XFS_DATA_FORK, rec_bno); goto out_relse; } - calc_hash = xfs_da_hashname(dent->name, dent->namelen); + + /* Does the directory hash match? */ + dname.name = dent->name; + dname.len = dent->namelen; + calc_hash = xfs_dir2_hashname(mp, &dname); if (calc_hash != hash) xchk_fblock_set_corrupt(ds->sc, XFS_DATA_FORK, rec_bno); @@ -803,14 +755,7 @@ int xchk_directory( struct xfs_scrub *sc) { - struct xchk_dir_ctx sdc = { - .dir_iter.actor = xchk_dir_actor, - .dir_iter.pos = 0, - .sc = sc, - }; - size_t bufsize; - loff_t oldpos; - int error = 0; + int error; if (!S_ISDIR(VFS_I(sc->ip)->i_mode)) return -ENOENT; @@ -818,7 +763,7 @@ xchk_directory( /* Plausible size? */ if (sc->ip->i_disk_size < xfs_dir2_sf_hdr_size(0)) { xchk_ino_set_corrupt(sc, sc->ip->i_ino); - goto out; + return 0; } /* Check directory tree structure */ @@ -827,7 +772,7 @@ xchk_directory( return error; if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - return error; + return 0; /* Check the freespace. */ error = xchk_directory_blocks(sc); @@ -835,44 +780,11 @@ xchk_directory( return error; if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - return error; - - /* - * Check that every dirent we see can also be looked up by hash. - * Userspace usually asks for a 32k buffer, so we will too. - */ - bufsize = (size_t)min_t(loff_t, XFS_READDIR_BUFSIZE, - sc->ip->i_disk_size); - - /* - * Look up every name in this directory by hash. - * - * Use the xfs_readdir function to call xchk_dir_actor on - * every directory entry in this directory. In _actor, we check - * the name, inode number, and ftype (if applicable) of the - * entry. xfs_readdir uses the VFS filldir functions to provide - * iteration context. - * - * The VFS grabs a read or write lock via i_rwsem before it reads - * or writes to a directory. If we've gotten this far we've - * already obtained IOLOCK_EXCL, which (since 4.10) is the same as - * getting a write lock on i_rwsem. Therefore, it is safe for us - * to drop the ILOCK here in order to reuse the _readdir and - * _dir_lookup routines, which do their own ILOCK locking. 
- */ - oldpos = 0; - sc->ilock_flags &= ~XFS_ILOCK_EXCL; - xfs_iunlock(sc->ip, XFS_ILOCK_EXCL); - while (true) { - error = xfs_readdir(sc->tp, sc->ip, &sdc.dir_iter, bufsize); - if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, - &error)) - goto out; - if (oldpos == sdc.dir_iter.pos) - break; - oldpos = sdc.dir_iter.pos; - } + return 0; -out: + /* Look up every name in this directory by hash. */ + error = xchk_dir_walk(sc, sc->ip, xchk_dir_actor, NULL); + if (error == -ECANCELED) + error = 0; return error; } diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c index f0c7f41897b9..faa315be7978 100644 --- a/fs/xfs/scrub/fscounters.c +++ b/fs/xfs/scrub/fscounters.c @@ -1,7 +1,7 @@ // SPDX-License-Identifier: GPL-2.0+ /* - * Copyright (C) 2019 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2019-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -130,6 +130,13 @@ xchk_setup_fscounters( struct xchk_fscounters *fsc; int error; + /* + * If the AGF doesn't track btreeblks, we have to lock the AGF to count + * btree block usage by walking the actual btrees. + */ + if (!xfs_has_lazysbcount(sc->mp)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + sc->buf = kzalloc(sizeof(struct xchk_fscounters), XCHK_GFP_FLAGS); if (!sc->buf) return -ENOMEM; diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c index aa65ec88a0c0..d2b2a1cb6533 100644 --- a/fs/xfs/scrub/health.c +++ b/fs/xfs/scrub/health.c @@ -1,12 +1,14 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2019 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2019-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" #include "xfs_shared.h" #include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" #include "xfs_btree.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" diff --git a/fs/xfs/scrub/health.h b/fs/xfs/scrub/health.h index d0b938d3d028..66a273f8585b 100644 --- a/fs/xfs/scrub/health.h +++ b/fs/xfs/scrub/health.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2019 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2019-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_HEALTH_H__ #define __XFS_SCRUB_HEALTH_H__ diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index e312be7cd375..575f22a02ebe 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -32,6 +32,8 @@ int xchk_setup_ag_iallocbt( struct xfs_scrub *sc) { + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); return xchk_setup_ag_btree(sc, sc->flags & XCHK_TRY_HARDER); } @@ -49,83 +51,237 @@ struct xchk_iallocbt { }; /* - * If we're checking the finobt, cross-reference with the inobt. 
- * Otherwise we're checking the inobt; if there is an finobt, make sure - * we have a record or not depending on freecount. + * Does the finobt have a record for this inode with the same hole/free state? + * This is a bit complicated because of the following: + * + * - The finobt need not have a record if all inodes in the inobt record are + * allocated. + * - The finobt need not have a record if all inodes in the inobt record are + * free. + * - The finobt need not have a record if the inobt record says this is a hole. + * This likely doesn't happen in practice. */ -static inline void -xchk_iallocbt_chunk_xref_other( +STATIC int +xchk_inobt_xref_finobt( + struct xfs_scrub *sc, + struct xfs_inobt_rec_incore *irec, + xfs_agino_t agino, + bool free, + bool hole) +{ + struct xfs_inobt_rec_incore frec; + struct xfs_btree_cur *cur = sc->sa.fino_cur; + bool ffree, fhole; + unsigned int frec_idx, fhole_idx; + int has_record; + int error; + + ASSERT(cur->bc_btnum == XFS_BTNUM_FINO); + + error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &has_record); + if (error) + return error; + if (!has_record) + goto no_record; + + error = xfs_inobt_get_rec(cur, &frec, &has_record); + if (!has_record) + return -EFSCORRUPTED; + + if (frec.ir_startino + XFS_INODES_PER_CHUNK <= agino) + goto no_record; + + /* There's a finobt record; free and hole status must match. */ + frec_idx = agino - frec.ir_startino; + ffree = frec.ir_free & (1ULL << frec_idx); + fhole_idx = frec_idx / XFS_INODES_PER_HOLEMASK_BIT; + fhole = frec.ir_holemask & (1U << fhole_idx); + + if (ffree != free) + xchk_btree_xref_set_corrupt(sc, cur, 0); + if (fhole != hole) + xchk_btree_xref_set_corrupt(sc, cur, 0); + return 0; + +no_record: + /* inobt record is fully allocated */ + if (irec->ir_free == 0) + return 0; + + /* inobt record is totally unallocated */ + if (irec->ir_free == XFS_INOBT_ALL_FREE) + return 0; + + /* inobt record says this is a hole */ + if (hole) + return 0; + + /* finobt doesn't care about allocated inodes */ + if (!free) + return 0; + + xchk_btree_xref_set_corrupt(sc, cur, 0); + return 0; +} + +/* + * Make sure that each inode of this part of an inobt record has the same + * sparse and free status as the finobt. + */ +STATIC void +xchk_inobt_chunk_xref_finobt( struct xfs_scrub *sc, struct xfs_inobt_rec_incore *irec, - xfs_agino_t agino) + xfs_agino_t agino, + unsigned int nr_inodes) { - struct xfs_btree_cur **pcur; - bool has_irec; + xfs_agino_t i; + unsigned int rec_idx; int error; - if (sc->sm->sm_type == XFS_SCRUB_TYPE_FINOBT) - pcur = &sc->sa.ino_cur; - else - pcur = &sc->sa.fino_cur; - if (!(*pcur)) - return; - error = xfs_ialloc_has_inode_record(*pcur, agino, agino, &has_irec); - if (!xchk_should_check_xref(sc, &error, pcur)) + ASSERT(sc->sm->sm_type == XFS_SCRUB_TYPE_INOBT); + + if (!sc->sa.fino_cur || xchk_skip_xref(sc->sm)) return; - if (((irec->ir_freecount > 0 && !has_irec) || - (irec->ir_freecount == 0 && has_irec))) - xchk_btree_xref_set_corrupt(sc, *pcur, 0); + + for (i = agino, rec_idx = agino - irec->ir_startino; + i < agino + nr_inodes; + i++, rec_idx++) { + bool free, hole; + unsigned int hole_idx; + + free = irec->ir_free & (1ULL << rec_idx); + hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT; + hole = irec->ir_holemask & (1U << hole_idx); + + error = xchk_inobt_xref_finobt(sc, irec, i, free, hole); + if (!xchk_should_check_xref(sc, &error, &sc->sa.fino_cur)) + return; + } +} + +/* + * Does the inobt have a record for this inode with the same hole/free state? 
+ * The inobt must always have a record if there's a finobt record. + */ +STATIC int +xchk_finobt_xref_inobt( + struct xfs_scrub *sc, + struct xfs_inobt_rec_incore *frec, + xfs_agino_t agino, + bool ffree, + bool fhole) +{ + struct xfs_inobt_rec_incore irec; + struct xfs_btree_cur *cur = sc->sa.ino_cur; + bool free, hole; + unsigned int rec_idx, hole_idx; + int has_record; + int error; + + ASSERT(cur->bc_btnum == XFS_BTNUM_INO); + + error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &has_record); + if (error) + return error; + if (!has_record) + goto no_record; + + error = xfs_inobt_get_rec(cur, &irec, &has_record); + if (!has_record) + return -EFSCORRUPTED; + + if (irec.ir_startino + XFS_INODES_PER_CHUNK <= agino) + goto no_record; + + /* There's an inobt record; free and hole status must match. */ + rec_idx = agino - irec.ir_startino; + free = irec.ir_free & (1ULL << rec_idx); + hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT; + hole = irec.ir_holemask & (1U << hole_idx); + + if (ffree != free) + xchk_btree_xref_set_corrupt(sc, cur, 0); + if (fhole != hole) + xchk_btree_xref_set_corrupt(sc, cur, 0); + return 0; + +no_record: + /* finobt should never have a record for which the inobt does not */ + xchk_btree_xref_set_corrupt(sc, cur, 0); + return 0; } -/* Cross-reference with the other btrees. */ +/* + * Make sure that each inode of this part of an finobt record has the same + * sparse and free status as the inobt. + */ STATIC void -xchk_iallocbt_chunk_xref( +xchk_finobt_chunk_xref_inobt( struct xfs_scrub *sc, - struct xfs_inobt_rec_incore *irec, + struct xfs_inobt_rec_incore *frec, xfs_agino_t agino, - xfs_agblock_t agbno, - xfs_extlen_t len) + unsigned int nr_inodes) { - if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + xfs_agino_t i; + unsigned int rec_idx; + int error; + + ASSERT(sc->sm->sm_type == XFS_SCRUB_TYPE_FINOBT); + + if (!sc->sa.ino_cur || xchk_skip_xref(sc->sm)) return; - xchk_xref_is_used_space(sc, agbno, len); - xchk_iallocbt_chunk_xref_other(sc, irec, agino); - xchk_xref_is_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES); - xchk_xref_is_not_shared(sc, agbno, len); + for (i = agino, rec_idx = agino - frec->ir_startino; + i < agino + nr_inodes; + i++, rec_idx++) { + bool ffree, fhole; + unsigned int hole_idx; + + ffree = frec->ir_free & (1ULL << rec_idx); + hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT; + fhole = frec->ir_holemask & (1U << hole_idx); + + error = xchk_finobt_xref_inobt(sc, frec, i, ffree, fhole); + if (!xchk_should_check_xref(sc, &error, &sc->sa.ino_cur)) + return; + } } -/* Is this chunk worth checking? */ +/* Is this chunk worth checking and cross-referencing? 
*/ STATIC bool xchk_iallocbt_chunk( struct xchk_btree *bs, struct xfs_inobt_rec_incore *irec, xfs_agino_t agino, - xfs_extlen_t len) + unsigned int nr_inodes) { + struct xfs_scrub *sc = bs->sc; struct xfs_mount *mp = bs->cur->bc_mp; struct xfs_perag *pag = bs->cur->bc_ag.pag; - xfs_agblock_t bno; + xfs_agblock_t agbno; + xfs_extlen_t len; - bno = XFS_AGINO_TO_AGBNO(mp, agino); + agbno = XFS_AGINO_TO_AGBNO(mp, agino); + len = XFS_B_TO_FSB(mp, nr_inodes * mp->m_sb.sb_inodesize); - if (!xfs_verify_agbext(pag, bno, len)) + if (!xfs_verify_agbext(pag, agbno, len)) xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - xchk_iallocbt_chunk_xref(bs->sc, irec, agino, bno, len); + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return false; + xchk_xref_is_used_space(sc, agbno, len); + if (sc->sm->sm_type == XFS_SCRUB_TYPE_INOBT) + xchk_inobt_chunk_xref_finobt(sc, irec, agino, nr_inodes); + else + xchk_finobt_chunk_xref_inobt(sc, irec, agino, nr_inodes); + xchk_xref_is_only_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES); + xchk_xref_is_not_shared(sc, agbno, len); + xchk_xref_is_not_cow_staging(sc, agbno, len); return true; } -/* Count the number of free inodes. */ -static unsigned int -xchk_iallocbt_freecount( - xfs_inofree_t freemask) -{ - BUILD_BUG_ON(sizeof(freemask) != sizeof(__u64)); - return hweight64(freemask); -} - /* * Check that an inode's allocation status matches ir_free in the inobt * record. First we try querying the in-core inode state, and if the inode @@ -272,7 +428,7 @@ xchk_iallocbt_check_cluster( return 0; } - xchk_xref_is_owned_by(bs->sc, agbno, M_IGEO(mp)->blocks_per_cluster, + xchk_xref_is_only_owned_by(bs->sc, agbno, M_IGEO(mp)->blocks_per_cluster, &XFS_RMAP_OINFO_INODES); /* Grab the inode cluster buffer. */ @@ -420,36 +576,22 @@ xchk_iallocbt_rec( const union xfs_btree_rec *rec) { struct xfs_mount *mp = bs->cur->bc_mp; - struct xfs_perag *pag = bs->cur->bc_ag.pag; struct xchk_iallocbt *iabt = bs->private; struct xfs_inobt_rec_incore irec; uint64_t holes; xfs_agino_t agino; - xfs_extlen_t len; int holecount; int i; int error = 0; - unsigned int real_freecount; uint16_t holemask; xfs_inobt_btrec_to_irec(mp, rec, &irec); - - if (irec.ir_count > XFS_INODES_PER_CHUNK || - irec.ir_freecount > XFS_INODES_PER_CHUNK) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - - real_freecount = irec.ir_freecount + - (XFS_INODES_PER_CHUNK - irec.ir_count); - if (real_freecount != xchk_iallocbt_freecount(irec.ir_free)) + if (xfs_inobt_check_irec(bs->cur, &irec) != NULL) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return 0; + } agino = irec.ir_startino; - /* Record has to be properly aligned within the AG. */ - if (!xfs_verify_agino(pag, agino) || - !xfs_verify_agino(pag, agino + XFS_INODES_PER_CHUNK - 1)) { - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - goto out; - } xchk_iallocbt_rec_alignment(bs, &irec); if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) @@ -459,12 +601,11 @@ xchk_iallocbt_rec( /* Handle non-sparse inodes */ if (!xfs_inobt_issparse(irec.ir_holemask)) { - len = XFS_B_TO_FSB(mp, - XFS_INODES_PER_CHUNK * mp->m_sb.sb_inodesize); if (irec.ir_count != XFS_INODES_PER_CHUNK) xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - if (!xchk_iallocbt_chunk(bs, &irec, agino, len)) + if (!xchk_iallocbt_chunk(bs, &irec, agino, + XFS_INODES_PER_CHUNK)) goto out; goto check_clusters; } @@ -472,8 +613,6 @@ xchk_iallocbt_rec( /* Check each chunk of a sparse inode cluster. 
*/ holemask = irec.ir_holemask; holecount = 0; - len = XFS_B_TO_FSB(mp, - XFS_INODES_PER_HOLEMASK_BIT * mp->m_sb.sb_inodesize); holes = ~xfs_inobt_irec_to_allocmask(&irec); if ((holes & irec.ir_free) != holes || irec.ir_freecount > irec.ir_count) @@ -482,8 +621,9 @@ xchk_iallocbt_rec( for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; i++) { if (holemask & 1) holecount += XFS_INODES_PER_HOLEMASK_BIT; - else if (!xchk_iallocbt_chunk(bs, &irec, agino, len)) - break; + else if (!xchk_iallocbt_chunk(bs, &irec, agino, + XFS_INODES_PER_HOLEMASK_BIT)) + goto out; holemask >>= 1; agino += XFS_INODES_PER_HOLEMASK_BIT; } @@ -493,6 +633,9 @@ xchk_iallocbt_rec( xchk_btree_set_corrupt(bs->sc, bs->cur, 0); check_clusters: + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + goto out; + error = xchk_iallocbt_check_clusters(bs, &irec); if (error) goto out; @@ -622,18 +765,18 @@ xchk_xref_inode_check( xfs_agblock_t agbno, xfs_extlen_t len, struct xfs_btree_cur **icur, - bool should_have_inodes) + enum xbtree_recpacking expected) { - bool has_inodes; + enum xbtree_recpacking outcome; int error; if (!(*icur) || xchk_skip_xref(sc->sm)) return; - error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &has_inodes); + error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &outcome); if (!xchk_should_check_xref(sc, &error, icur)) return; - if (has_inodes != should_have_inodes) + if (outcome != expected) xchk_btree_xref_set_corrupt(sc, *icur, 0); } @@ -644,8 +787,10 @@ xchk_xref_is_not_inode_chunk( xfs_agblock_t agbno, xfs_extlen_t len) { - xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, false); - xchk_xref_inode_check(sc, agbno, len, &sc->sa.fino_cur, false); + xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, + XBTREE_RECPACKING_EMPTY); + xchk_xref_inode_check(sc, agbno, len, &sc->sa.fino_cur, + XBTREE_RECPACKING_EMPTY); } /* xref check that the extent is covered by inodes */ @@ -655,5 +800,6 @@ xchk_xref_is_inode_chunk( xfs_agblock_t agbno, xfs_extlen_t len) { - xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, true); + xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, + XBTREE_RECPACKING_FULL); } diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 7a2f38e5202c..3e1e02e340a6 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -11,8 +11,11 @@ #include "xfs_mount.h" #include "xfs_btree.h" #include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_ag.h" #include "xfs_inode.h" #include "xfs_ialloc.h" +#include "xfs_icache.h" #include "xfs_da_format.h" #include "xfs_reflink.h" #include "xfs_rmap.h" @@ -20,45 +23,176 @@ #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" +#include "scrub/trace.h" + +/* Prepare the attached inode for scrubbing. */ +static inline int +xchk_prepare_iscrub( + struct xfs_scrub *sc) +{ + int error; + + sc->ilock_flags = XFS_IOLOCK_EXCL; + xfs_ilock(sc->ip, sc->ilock_flags); + + error = xchk_trans_alloc(sc, 0); + if (error) + return error; + + sc->ilock_flags |= XFS_ILOCK_EXCL; + xfs_ilock(sc->ip, XFS_ILOCK_EXCL); + return 0; +} + +/* Install this scrub-by-handle inode and prepare it for scrubbing. 
*/ +static inline int +xchk_install_handle_iscrub( + struct xfs_scrub *sc, + struct xfs_inode *ip) +{ + int error; + + error = xchk_install_handle_inode(sc, ip); + if (error) + return error; + + return xchk_prepare_iscrub(sc); +} /* - * Grab total control of the inode metadata. It doesn't matter here if - * the file data is still changing; exclusive access to the metadata is - * the goal. + * Grab total control of the inode metadata. In the best case, we grab the + * incore inode and take all locks on it. If the incore inode cannot be + * constructed due to corruption problems, lock the AGI so that we can single + * step the loading process to fix everything that can go wrong. */ int xchk_setup_inode( struct xfs_scrub *sc) { + struct xfs_imap imap; + struct xfs_inode *ip; + struct xfs_mount *mp = sc->mp; + struct xfs_inode *ip_in = XFS_I(file_inode(sc->file)); + struct xfs_buf *agi_bp; + struct xfs_perag *pag; + xfs_agnumber_t agno = XFS_INO_TO_AGNO(mp, sc->sm->sm_ino); int error; + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + + /* We want to scan the opened inode, so lock it and exit. */ + if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) { + sc->ip = ip_in; + return xchk_prepare_iscrub(sc); + } + + /* Reject internal metadata files and obviously bad inode numbers. */ + if (xfs_internal_inum(mp, sc->sm->sm_ino)) + return -ENOENT; + if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino)) + return -ENOENT; + + /* Try a regular untrusted iget. */ + error = xchk_iget(sc, sc->sm->sm_ino, &ip); + if (!error) + return xchk_install_handle_iscrub(sc, ip); + if (error == -ENOENT) + return error; + if (error != -EFSCORRUPTED && error != -EFSBADCRC && error != -EINVAL) + goto out_error; + /* - * Try to get the inode. If the verifiers fail, we try again - * in raw mode. + * EINVAL with IGET_UNTRUSTED probably means one of several things: + * userspace gave us an inode number that doesn't correspond to fs + * space; the inode btree lacks a record for this inode; or there is + * a record, and it says this inode is free. + * + * EFSCORRUPTED/EFSBADCRC could mean that the inode was mappable, but + * some other metadata corruption (e.g. inode forks) prevented + * instantiation of the incore inode. Or it could mean the inobt is + * corrupt. + * + * We want to look up this inode in the inobt directly to distinguish + * three different scenarios: (1) the inobt says the inode is free, + * in which case there's nothing to do; (2) the inobt is corrupt so we + * should flag the corruption and exit to userspace to let it fix the + * inobt; and (3) the inobt says the inode is allocated, but loading it + * failed due to corruption. + * + * Allocate a transaction and grab the AGI to prevent inobt activity in + * this AG. Retry the iget in case someone allocated a new inode after + * the first iget failed. */ - error = xchk_get_inode(sc); - switch (error) { - case 0: - break; - case -EFSCORRUPTED: - case -EFSBADCRC: - return xchk_trans_alloc(sc, 0); - default: - return error; + error = xchk_trans_alloc(sc, 0); + if (error) + goto out_error; + + error = xchk_iget_agi(sc, sc->sm->sm_ino, &agi_bp, &ip); + if (error == 0) { + /* Actually got the incore inode, so install it and proceed. */ + xchk_trans_cancel(sc); + return xchk_install_handle_iscrub(sc, ip); + } + if (error == -ENOENT) + goto out_gone; + if (error != -EFSCORRUPTED && error != -EFSBADCRC && error != -EINVAL) + goto out_cancel; + + /* Ensure that we have protected against inode allocation/freeing. 
*/ + if (agi_bp == NULL) { + ASSERT(agi_bp != NULL); + error = -ECANCELED; + goto out_cancel; } - /* Got the inode, lock it and we're ready to go. */ - sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; - xfs_ilock(sc->ip, sc->ilock_flags); - error = xchk_trans_alloc(sc, 0); + /* + * Untrusted iget failed a second time. Let's try an inobt lookup. + * If the inobt doesn't think this is an allocated inode then we'll + * return ENOENT to signal that the check can be skipped. + * + * If the lookup signals corruption, we'll mark this inode corrupt and + * exit to userspace. There's little chance of fixing anything until + * the inobt is straightened out, but there's nothing we can do here. + * + * If the lookup encounters a runtime error, exit to userspace. + */ + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, sc->sm->sm_ino)); + if (!pag) { + error = -EFSCORRUPTED; + goto out_cancel; + } + + error = xfs_imap(pag, sc->tp, sc->sm->sm_ino, &imap, + XFS_IGET_UNTRUSTED); + xfs_perag_put(pag); + if (error == -EINVAL || error == -ENOENT) + goto out_gone; if (error) - goto out; - sc->ilock_flags |= XFS_ILOCK_EXCL; - xfs_ilock(sc->ip, XFS_ILOCK_EXCL); + goto out_cancel; -out: - /* scrub teardown will unlock and release the inode for us */ + /* + * The lookup succeeded. Chances are the ondisk inode is corrupt and + * preventing iget from reading it. Retain the scrub transaction and + * the AGI buffer to prevent anyone from allocating or freeing inodes. + * This ensures that we preserve the inconsistency between the inobt + * saying the inode is allocated and the icache being unable to load + * the inode until we can flag the corruption in xchk_inode. The + * scrub function has to note the corruption, since we're not really + * supposed to do that from the setup function. + */ + return 0; + +out_cancel: + xchk_trans_cancel(sc); +out_error: + trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), + error, __return_address); return error; +out_gone: + /* The file is gone, so there's nothing to check. */ + xchk_trans_cancel(sc); + return -ENOENT; } /* Inode core */ @@ -553,8 +687,9 @@ xchk_inode_xref( xchk_xref_is_used_space(sc, agbno, 1); xchk_inode_xref_finobt(sc, ino); - xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_INODES); + xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_INODES); xchk_xref_is_not_shared(sc, agbno, 1); + xchk_xref_is_not_cow_staging(sc, agbno, 1); xchk_inode_xref_bmap(sc, dip); out_free: diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index d8dff3fd8053..58d5dfb7ea21 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -16,6 +16,7 @@ #include "xfs_dir2_priv.h" #include "scrub/scrub.h" #include "scrub/common.h" +#include "scrub/readdir.h" /* Set us up to scrub parents. */ int @@ -30,122 +31,93 @@ xchk_setup_parent( /* Look for an entry in a parent pointing to this inode. */ struct xchk_parent_ctx { - struct dir_context dc; struct xfs_scrub *sc; - xfs_ino_t ino; xfs_nlink_t nlink; - bool cancelled; }; /* Look for a single entry in a directory pointing to an inode. 
*/ -STATIC bool +STATIC int xchk_parent_actor( - struct dir_context *dc, - const char *name, - int namelen, - loff_t pos, - u64 ino, - unsigned type) + struct xfs_scrub *sc, + struct xfs_inode *dp, + xfs_dir2_dataptr_t dapos, + const struct xfs_name *name, + xfs_ino_t ino, + void *priv) { - struct xchk_parent_ctx *spc; + struct xchk_parent_ctx *spc = priv; int error = 0; - spc = container_of(dc, struct xchk_parent_ctx, dc); - if (spc->ino == ino) + /* Does this name make sense? */ + if (!xfs_dir2_namecheck(name->name, name->len)) + error = -EFSCORRUPTED; + if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) + return error; + + if (sc->ip->i_ino == ino) spc->nlink++; - /* - * If we're facing a fatal signal, bail out. Store the cancellation - * status separately because the VFS readdir code squashes error codes - * into short directory reads. - */ if (xchk_should_terminate(spc->sc, &error)) - spc->cancelled = true; + return error; - return !error; + return 0; } -/* Count the number of dentries in the parent dir that point to this inode. */ -STATIC int -xchk_parent_count_parent_dentries( - struct xfs_scrub *sc, - struct xfs_inode *parent, - xfs_nlink_t *nlink) +/* + * Try to lock a parent directory for checking dirents. Returns the inode + * flags for the locks we now hold, or zero if we failed. + */ +STATIC unsigned int +xchk_parent_ilock_dir( + struct xfs_inode *dp) { - struct xchk_parent_ctx spc = { - .dc.actor = xchk_parent_actor, - .ino = sc->ip->i_ino, - .sc = sc, - }; - size_t bufsize; - loff_t oldpos; - uint lock_mode; - int error = 0; + if (!xfs_ilock_nowait(dp, XFS_ILOCK_SHARED)) + return 0; - /* - * If there are any blocks, read-ahead block 0 as we're almost - * certain to have the next operation be a read there. This is - * how we guarantee that the parent's extent map has been loaded, - * if there is one. - */ - lock_mode = xfs_ilock_data_map_shared(parent); - if (parent->i_df.if_nextents > 0) - error = xfs_dir3_data_readahead(parent, 0, 0); - xfs_iunlock(parent, lock_mode); - if (error) - return error; + if (!xfs_need_iread_extents(&dp->i_df)) + return XFS_ILOCK_SHARED; - /* - * Iterate the parent dir to confirm that there is - * exactly one entry pointing back to the inode being - * scanned. - */ - bufsize = (size_t)min_t(loff_t, XFS_READDIR_BUFSIZE, - parent->i_disk_size); - oldpos = 0; - while (true) { - error = xfs_readdir(sc->tp, parent, &spc.dc, bufsize); - if (error) - goto out; - if (spc.cancelled) { - error = -EAGAIN; - goto out; - } - if (oldpos == spc.dc.pos) - break; - oldpos = spc.dc.pos; - } - *nlink = spc.nlink; -out: - return error; + xfs_iunlock(dp, XFS_ILOCK_SHARED); + + if (!xfs_ilock_nowait(dp, XFS_ILOCK_EXCL)) + return 0; + + return XFS_ILOCK_EXCL; } /* - * Given the inode number of the alleged parent of the inode being - * scrubbed, try to validate that the parent has exactly one directory - * entry pointing back to the inode being scrubbed. + * Given the inode number of the alleged parent of the inode being scrubbed, + * try to validate that the parent has exactly one directory entry pointing + * back to the inode being scrubbed. Returns -EAGAIN if we need to revalidate + * the dotdot entry. 
*/ STATIC int xchk_parent_validate( struct xfs_scrub *sc, - xfs_ino_t dnum, - bool *try_again) + xfs_ino_t parent_ino) { + struct xchk_parent_ctx spc = { + .sc = sc, + .nlink = 0, + }; struct xfs_mount *mp = sc->mp; struct xfs_inode *dp = NULL; xfs_nlink_t expected_nlink; - xfs_nlink_t nlink; + unsigned int lock_mode; int error = 0; - *try_again = false; - - if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - goto out; + /* Is this the root dir? Then '..' must point to itself. */ + if (sc->ip == mp->m_rootip) { + if (sc->ip->i_ino != mp->m_sb.sb_rootino || + sc->ip->i_ino != parent_ino) + xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); + return 0; + } /* '..' must not point to ourselves. */ - if (sc->ip->i_ino == dnum) { + if (sc->ip->i_ino == parent_ino) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; + return 0; } /* @@ -155,106 +127,51 @@ xchk_parent_validate( expected_nlink = VFS_I(sc->ip)->i_nlink == 0 ? 0 : 1; /* - * Grab this parent inode. We release the inode before we - * cancel the scrub transaction. Since we're don't know a - * priori that releasing the inode won't trigger eofblocks - * cleanup (which allocates what would be a nested transaction) - * if the parent pointer erroneously points to a file, we - * can't use DONTCACHE here because DONTCACHE inodes can trigger - * immediate inactive cleanup of the inode. + * Grab the parent directory inode. This must be released before we + * cancel the scrub transaction. * * If _iget returns -EINVAL or -ENOENT then the parent inode number is * garbage and the directory is corrupt. If the _iget returns * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a * cross referencing error. Any other error is an operational error. */ - error = xfs_iget(mp, sc->tp, dnum, XFS_IGET_UNTRUSTED, 0, &dp); + error = xchk_iget(sc, parent_ino, &dp); if (error == -EINVAL || error == -ENOENT) { error = -EFSCORRUPTED; xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error); - goto out; + return error; } if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out; + return error; if (dp == sc->ip || !S_ISDIR(VFS_I(dp)->i_mode)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out_rele; } - /* - * We prefer to keep the inode locked while we lock and search - * its alleged parent for a forward reference. If we can grab - * the iolock, validate the pointers and we're done. We must - * use nowait here to avoid an ABBA deadlock on the parent and - * the child inodes. - */ - if (xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) { - error = xchk_parent_count_parent_dentries(sc, dp, &nlink); - if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, - &error)) - goto out_unlock; - if (nlink != expected_nlink) - xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out_unlock; - } - - /* - * The game changes if we get here. We failed to lock the parent, - * so we're going to try to verify both pointers while only holding - * one lock so as to avoid deadlocking with something that's actually - * trying to traverse down the directory tree. - */ - xfs_iunlock(sc->ip, sc->ilock_flags); - sc->ilock_flags = 0; - error = xchk_ilock_inverted(dp, XFS_IOLOCK_SHARED); - if (error) + lock_mode = xchk_parent_ilock_dir(dp); + if (!lock_mode) { + xfs_iunlock(sc->ip, XFS_ILOCK_EXCL); + xfs_ilock(sc->ip, XFS_ILOCK_EXCL); + error = -EAGAIN; goto out_rele; + } - /* Go looking for our dentry. */ - error = xchk_parent_count_parent_dentries(sc, dp, &nlink); + /* Look for a directory entry in the parent pointing to the child. 
*/ + error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc); if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) goto out_unlock; - /* Drop the parent lock, relock this inode. */ - xfs_iunlock(dp, XFS_IOLOCK_SHARED); - error = xchk_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL); - if (error) - goto out_rele; - sc->ilock_flags = XFS_IOLOCK_EXCL; - - /* - * If we're an unlinked directory, the parent /won't/ have a link - * to us. Otherwise, it should have one link. We have to re-set - * it here because we dropped the lock on sc->ip. - */ - expected_nlink = VFS_I(sc->ip)->i_nlink == 0 ? 0 : 1; - - /* Look up '..' to see if the inode changed. */ - error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &dnum, NULL); - if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out_rele; - - /* Drat, parent changed. Try again! */ - if (dnum != dp->i_ino) { - xfs_irele(dp); - *try_again = true; - return 0; - } - xfs_irele(dp); - /* - * '..' didn't change, so check that there was only one entry - * for us in the parent. + * Ensure that the parent has as many links to the child as the child + * thinks it has to the parent. */ - if (nlink != expected_nlink) + if (spc.nlink != expected_nlink) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - return error; out_unlock: - xfs_iunlock(dp, XFS_IOLOCK_SHARED); + xfs_iunlock(dp, lock_mode); out_rele: - xfs_irele(dp); -out: + xchk_irele(sc, dp); return error; } @@ -264,9 +181,7 @@ xchk_parent( struct xfs_scrub *sc) { struct xfs_mount *mp = sc->mp; - xfs_ino_t dnum; - bool try_again; - int tries = 0; + xfs_ino_t parent_ino; int error = 0; /* @@ -279,56 +194,29 @@ xchk_parent( /* We're not a special inode, are we? */ if (!xfs_verify_dir_ino(mp, sc->ip->i_ino)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; + return 0; } - /* - * The VFS grabs a read or write lock via i_rwsem before it reads - * or writes to a directory. If we've gotten this far we've - * already obtained IOLOCK_EXCL, which (since 4.10) is the same as - * getting a write lock on i_rwsem. Therefore, it is safe for us - * to drop the ILOCK here in order to do directory lookups. - */ - sc->ilock_flags &= ~(XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL); - xfs_iunlock(sc->ip, XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL); - - /* Look up '..' */ - error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &dnum, NULL); - if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) - goto out; - if (!xfs_verify_dir_ino(mp, dnum)) { - xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; - } + do { + if (xchk_should_terminate(sc, &error)) + break; - /* Is this the root dir? Then '..' must point to itself. */ - if (sc->ip == mp->m_rootip) { - if (sc->ip->i_ino != mp->m_sb.sb_rootino || - sc->ip->i_ino != dnum) + /* Look up '..' */ + error = xchk_dir_lookup(sc, sc->ip, &xfs_name_dotdot, + &parent_ino); + if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error)) + return error; + if (!xfs_verify_dir_ino(mp, parent_ino)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); - goto out; - } + return 0; + } - do { - error = xchk_parent_validate(sc, dnum, &try_again); - if (error) - goto out; - } while (try_again && ++tries < 20); + /* + * Check that the dotdot entry points to a parent directory + * containing a dirent pointing to this subdirectory. + */ + error = xchk_parent_validate(sc, parent_ino); + } while (error == -EAGAIN); - /* - * We gave it our best shot but failed, so mark this scrub - * incomplete. Userspace can decide if it wants to try again. 
- */ - if (try_again && tries == 20) - xchk_set_incomplete(sc); -out: - /* - * If we failed to lock the parent inode even after a retry, just mark - * this scrub incomplete and return. - */ - if ((sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) { - error = 0; - xchk_set_incomplete(sc); - } return error; } diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c index 9eeac8565394..e6caa358cbda 100644 --- a/fs/xfs/scrub/quota.c +++ b/fs/xfs/scrub/quota.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -53,6 +53,9 @@ xchk_setup_quota( if (!xfs_this_quota_on(sc->mp, dqtype)) return -ENOENT; + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + error = xchk_setup_fs(sc); if (error) return error; diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c new file mode 100644 index 000000000000..e51c1544be63 --- /dev/null +++ b/fs/xfs/scrub/readdir.c @@ -0,0 +1,375 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_inode.h" +#include "xfs_dir2.h" +#include "xfs_dir2_priv.h" +#include "xfs_trace.h" +#include "xfs_bmap.h" +#include "xfs_trans.h" +#include "xfs_error.h" +#include "scrub/scrub.h" +#include "scrub/readdir.h" + +/* Call a function for every entry in a shortform directory. */ +STATIC int +xchk_dir_walk_sf( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, + void *priv) +{ + struct xfs_name name = { + .name = ".", + .len = 1, + .type = XFS_DIR3_FT_DIR, + }; + struct xfs_mount *mp = dp->i_mount; + struct xfs_da_geometry *geo = mp->m_dir_geo; + struct xfs_dir2_sf_entry *sfep; + struct xfs_dir2_sf_hdr *sfp; + xfs_ino_t ino; + xfs_dir2_dataptr_t dapos; + unsigned int i; + int error; + + ASSERT(dp->i_df.if_bytes == dp->i_disk_size); + ASSERT(dp->i_df.if_u1.if_data != NULL); + + sfp = (struct xfs_dir2_sf_hdr *)dp->i_df.if_u1.if_data; + + /* dot entry */ + dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, + geo->data_entry_offset); + + error = dirent_fn(sc, dp, dapos, &name, dp->i_ino, priv); + if (error) + return error; + + /* dotdot entry */ + dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, + geo->data_entry_offset + + xfs_dir2_data_entsize(mp, sizeof(".") - 1)); + ino = xfs_dir2_sf_get_parent_ino(sfp); + name.name = ".."; + name.len = 2; + + error = dirent_fn(sc, dp, dapos, &name, ino, priv); + if (error) + return error; + + /* iterate everything else */ + sfep = xfs_dir2_sf_firstentry(sfp); + for (i = 0; i < sfp->count; i++) { + dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, + xfs_dir2_sf_get_offset(sfep)); + ino = xfs_dir2_sf_get_ino(mp, sfp, sfep); + name.name = sfep->name; + name.len = sfep->namelen; + name.type = xfs_dir2_sf_get_ftype(mp, sfep); + + error = dirent_fn(sc, dp, dapos, &name, ino, priv); + if (error) + return error; + + sfep = xfs_dir2_sf_nextentry(mp, sfp, sfep); + } + + return 0; +} + +/* Call a function for every entry in a block directory. 
*/ +STATIC int +xchk_dir_walk_block( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, + void *priv) +{ + struct xfs_mount *mp = dp->i_mount; + struct xfs_da_geometry *geo = mp->m_dir_geo; + struct xfs_buf *bp; + unsigned int off, next_off, end; + int error; + + error = xfs_dir3_block_read(sc->tp, dp, &bp); + if (error) + return error; + + /* Walk each directory entry. */ + end = xfs_dir3_data_end_offset(geo, bp->b_addr); + for (off = geo->data_entry_offset; off < end; off = next_off) { + struct xfs_name name = { }; + struct xfs_dir2_data_unused *dup = bp->b_addr + off; + struct xfs_dir2_data_entry *dep = bp->b_addr + off; + xfs_ino_t ino; + xfs_dir2_dataptr_t dapos; + + /* Skip an empty entry. */ + if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { + next_off = off + be16_to_cpu(dup->length); + continue; + } + + /* Otherwise, find the next entry and report it. */ + next_off = off + xfs_dir2_data_entsize(mp, dep->namelen); + if (next_off > end) + break; + + dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, off); + ino = be64_to_cpu(dep->inumber); + name.name = dep->name; + name.len = dep->namelen; + name.type = xfs_dir2_data_get_ftype(mp, dep); + + error = dirent_fn(sc, dp, dapos, &name, ino, priv); + if (error) + break; + } + + xfs_trans_brelse(sc->tp, bp); + return error; +} + +/* Read a leaf-format directory buffer. */ +STATIC int +xchk_read_leaf_dir_buf( + struct xfs_trans *tp, + struct xfs_inode *dp, + struct xfs_da_geometry *geo, + xfs_dir2_off_t *curoff, + struct xfs_buf **bpp) +{ + struct xfs_iext_cursor icur; + struct xfs_bmbt_irec map; + struct xfs_ifork *ifp = xfs_ifork_ptr(dp, XFS_DATA_FORK); + xfs_dablk_t last_da; + xfs_dablk_t map_off; + xfs_dir2_off_t new_off; + + *bpp = NULL; + + /* + * Look for mapped directory blocks at or above the current offset. + * Truncate down to the nearest directory block to start the scanning + * operation. + */ + last_da = xfs_dir2_byte_to_da(geo, XFS_DIR2_LEAF_OFFSET); + map_off = xfs_dir2_db_to_da(geo, xfs_dir2_byte_to_db(geo, *curoff)); + + if (!xfs_iext_lookup_extent(dp, ifp, map_off, &icur, &map)) + return 0; + if (map.br_startoff >= last_da) + return 0; + xfs_trim_extent(&map, map_off, last_da - map_off); + + /* Read the directory block of that first mapping. */ + new_off = xfs_dir2_da_to_byte(geo, map.br_startoff); + if (new_off > *curoff) + *curoff = new_off; + + return xfs_dir3_data_read(tp, dp, map.br_startoff, 0, bpp); +} + +/* Call a function for every entry in a leaf directory. */ +STATIC int +xchk_dir_walk_leaf( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, + void *priv) +{ + struct xfs_mount *mp = dp->i_mount; + struct xfs_da_geometry *geo = mp->m_dir_geo; + struct xfs_buf *bp = NULL; + xfs_dir2_off_t curoff = 0; + unsigned int offset = 0; + int error; + + /* Iterate every directory offset in this directory. */ + while (curoff < XFS_DIR2_LEAF_OFFSET) { + struct xfs_name name = { }; + struct xfs_dir2_data_unused *dup; + struct xfs_dir2_data_entry *dep; + xfs_ino_t ino; + unsigned int length; + xfs_dir2_dataptr_t dapos; + + /* + * If we have no buffer, or we're off the end of the + * current buffer, need to get another one. + */ + if (!bp || offset >= geo->blksize) { + if (bp) { + xfs_trans_brelse(sc->tp, bp); + bp = NULL; + } + + error = xchk_read_leaf_dir_buf(sc->tp, dp, geo, &curoff, + &bp); + if (error || !bp) + break; + + /* + * Find our position in the block. 
+ */ + offset = geo->data_entry_offset; + curoff += geo->data_entry_offset; + } + + /* Skip an empty entry. */ + dup = bp->b_addr + offset; + if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { + length = be16_to_cpu(dup->length); + offset += length; + curoff += length; + continue; + } + + /* Otherwise, find the next entry and report it. */ + dep = bp->b_addr + offset; + length = xfs_dir2_data_entsize(mp, dep->namelen); + + dapos = xfs_dir2_byte_to_dataptr(curoff) & 0x7fffffff; + ino = be64_to_cpu(dep->inumber); + name.name = dep->name; + name.len = dep->namelen; + name.type = xfs_dir2_data_get_ftype(mp, dep); + + error = dirent_fn(sc, dp, dapos, &name, ino, priv); + if (error) + break; + + /* Advance to the next entry. */ + offset += length; + curoff += length; + } + + if (bp) + xfs_trans_brelse(sc->tp, bp); + return error; +} + +/* + * Call a function for every entry in a directory. + * + * Callers must hold the ILOCK. File types are XFS_DIR3_FT_*. + */ +int +xchk_dir_walk( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, + void *priv) +{ + struct xfs_da_args args = { + .dp = dp, + .geo = dp->i_mount->m_dir_geo, + .trans = sc->tp, + }; + bool isblock; + int error; + + if (xfs_is_shutdown(dp->i_mount)) + return -EIO; + + ASSERT(S_ISDIR(VFS_I(dp)->i_mode)); + ASSERT(xfs_isilocked(dp, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)); + + if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) + return xchk_dir_walk_sf(sc, dp, dirent_fn, priv); + + /* dir2 functions require that the data fork is loaded */ + error = xfs_iread_extents(sc->tp, dp, XFS_DATA_FORK); + if (error) + return error; + + error = xfs_dir2_isblock(&args, &isblock); + if (error) + return error; + + if (isblock) + return xchk_dir_walk_block(sc, dp, dirent_fn, priv); + + return xchk_dir_walk_leaf(sc, dp, dirent_fn, priv); +} + +/* + * Look up the inode number for an exact name in a directory. + * + * Callers must hold the ILOCK. File types are XFS_DIR3_FT_*. Names are not + * checked for correctness. 
+ */ +int +xchk_dir_lookup( + struct xfs_scrub *sc, + struct xfs_inode *dp, + const struct xfs_name *name, + xfs_ino_t *ino) +{ + struct xfs_da_args args = { + .dp = dp, + .geo = dp->i_mount->m_dir_geo, + .trans = sc->tp, + .name = name->name, + .namelen = name->len, + .filetype = name->type, + .hashval = xfs_dir2_hashname(dp->i_mount, name), + .whichfork = XFS_DATA_FORK, + .op_flags = XFS_DA_OP_OKNOENT, + }; + bool isblock, isleaf; + int error; + + if (xfs_is_shutdown(dp->i_mount)) + return -EIO; + + ASSERT(S_ISDIR(VFS_I(dp)->i_mode)); + ASSERT(xfs_isilocked(dp, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)); + + if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) { + error = xfs_dir2_sf_lookup(&args); + goto out_check_rval; + } + + /* dir2 functions require that the data fork is loaded */ + error = xfs_iread_extents(sc->tp, dp, XFS_DATA_FORK); + if (error) + return error; + + error = xfs_dir2_isblock(&args, &isblock); + if (error) + return error; + + if (isblock) { + error = xfs_dir2_block_lookup(&args); + goto out_check_rval; + } + + error = xfs_dir2_isleaf(&args, &isleaf); + if (error) + return error; + + if (isleaf) { + error = xfs_dir2_leaf_lookup(&args); + goto out_check_rval; + } + + error = xfs_dir2_node_lookup(&args); + +out_check_rval: + if (error == -EEXIST) + error = 0; + if (!error) + *ino = args.inumber; + return error; +} diff --git a/fs/xfs/scrub/readdir.h b/fs/xfs/scrub/readdir.h new file mode 100644 index 000000000000..55787f4df123 --- /dev/null +++ b/fs/xfs/scrub/readdir.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_READDIR_H__ +#define __XFS_SCRUB_READDIR_H__ + +typedef int (*xchk_dirent_fn)(struct xfs_scrub *sc, struct xfs_inode *dp, + xfs_dir2_dataptr_t dapos, const struct xfs_name *name, + xfs_ino_t ino, void *priv); + +int xchk_dir_walk(struct xfs_scrub *sc, struct xfs_inode *dp, + xchk_dirent_fn dirent_fn, void *priv); + +int xchk_dir_lookup(struct xfs_scrub *sc, struct xfs_inode *dp, + const struct xfs_name *name, xfs_ino_t *ino); + +#endif /* __XFS_SCRUB_READDIR_H__ */ diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index d9c1b3cea4a5..304ea1e1bfb0 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -1,21 +1,22 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" #include "xfs_shared.h" #include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_ag.h" #include "xfs_btree.h" #include "xfs_rmap.h" #include "xfs_refcount.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" -#include "xfs_trans_resv.h" -#include "xfs_mount.h" -#include "xfs_ag.h" +#include "scrub/trace.h" /* * Set us up to scrub reference count btrees. 
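The readdir.c and readdir.h hunks above add a callback-driven directory iterator (xchk_dir_walk) plus an exact-name lookup helper (xchk_dir_lookup), and the parent.c hunks use them to count how many entries in the alleged parent point back at the child directory. The sketch below is a minimal user-space analogue of that walk-and-count pattern, included only as an illustration: every identifier in it (demo_dirent, demo_dir_walk, demo_parent_actor, and so on) is invented for this example, and none of it is kernel code or part of the patch.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* One directory entry in the toy directory. */
struct demo_dirent {
	const char	*name;
	uint64_t	ino;
};

/* Analogue of xchk_dirent_fn: invoked once per entry; nonzero stops the walk. */
typedef int (*demo_dirent_fn)(const struct demo_dirent *de, void *priv);

/* Analogue of xchk_dir_walk: visit every entry in the directory. */
static int
demo_dir_walk(const struct demo_dirent *ents, size_t nr,
	      demo_dirent_fn fn, void *priv)
{
	size_t	i;
	int	error;

	for (i = 0; i < nr; i++) {
		error = fn(&ents[i], priv);
		if (error)
			return error;
	}
	return 0;
}

/* Analogue of xchk_parent_ctx / xchk_parent_actor: count links to the child. */
struct demo_parent_ctx {
	uint64_t	child_ino;
	unsigned int	nlink;
};

static int
demo_parent_actor(const struct demo_dirent *de, void *priv)
{
	struct demo_parent_ctx	*spc = priv;

	if (de->ino == spc->child_ino)
		spc->nlink++;
	return 0;
}

int
main(void)
{
	/* The alleged parent directory, inode 128; the child is inode 131. */
	struct demo_dirent parent_dir[] = {
		{ ".",		128 },
		{ "..",		96 },
		{ "child",	131 },
		{ "other",	142 },
	};
	struct demo_parent_ctx spc = { .child_ino = 131 };

	demo_dir_walk(parent_dir, sizeof(parent_dir) / sizeof(parent_dir[0]),
			demo_parent_actor, &spc);

	/* A consistent directory tree has exactly one such link. */
	printf("links to child: %u (expect 1)\n", spc.nlink);
	return 0;
}

In the kernel patch the same shape appears in xchk_parent_validate as xchk_dir_walk(sc, dp, xchk_parent_actor, &spc) followed by comparing spc.nlink against expected_nlink.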
@@ -24,6 +25,8 @@ int xchk_setup_ag_refcountbt( struct xfs_scrub *sc) { + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); return xchk_setup_ag_btree(sc, false); } @@ -300,8 +303,10 @@ xchk_refcountbt_xref_rmap( goto out_free; xchk_refcountbt_process_rmap_fragments(&refchk); - if (irec->rc_refcount != refchk.seen) + if (irec->rc_refcount != refchk.seen) { + trace_xchk_refcount_incorrect(sc->sa.pag, irec, refchk.seen); xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + } out_free: list_for_each_entry_safe(frag, n, &refchk.fragments, list) { @@ -325,6 +330,107 @@ xchk_refcountbt_xref( xchk_refcountbt_xref_rmap(sc, irec); } +struct xchk_refcbt_records { + /* Previous refcount record. */ + struct xfs_refcount_irec prev_rec; + + /* The next AG block where we aren't expecting shared extents. */ + xfs_agblock_t next_unshared_agbno; + + /* Number of CoW blocks we expect. */ + xfs_agblock_t cow_blocks; + + /* Was the last record a shared or CoW staging extent? */ + enum xfs_refc_domain prev_domain; +}; + +STATIC int +xchk_refcountbt_rmap_check_gap( + struct xfs_btree_cur *cur, + const struct xfs_rmap_irec *rec, + void *priv) +{ + xfs_agblock_t *next_bno = priv; + + if (*next_bno != NULLAGBLOCK && rec->rm_startblock < *next_bno) + return -ECANCELED; + + *next_bno = rec->rm_startblock + rec->rm_blockcount; + return 0; +} + +/* + * Make sure that a gap in the reference count records does not correspond to + * overlapping records (i.e. shared extents) in the reverse mappings. + */ +static inline void +xchk_refcountbt_xref_gaps( + struct xfs_scrub *sc, + struct xchk_refcbt_records *rrc, + xfs_agblock_t bno) +{ + struct xfs_rmap_irec low; + struct xfs_rmap_irec high; + xfs_agblock_t next_bno = NULLAGBLOCK; + int error; + + if (bno <= rrc->next_unshared_agbno || !sc->sa.rmap_cur || + xchk_skip_xref(sc->sm)) + return; + + memset(&low, 0, sizeof(low)); + low.rm_startblock = rrc->next_unshared_agbno; + memset(&high, 0xFF, sizeof(high)); + high.rm_startblock = bno - 1; + + error = xfs_rmap_query_range(sc->sa.rmap_cur, &low, &high, + xchk_refcountbt_rmap_check_gap, &next_bno); + if (error == -ECANCELED) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + else + xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur); +} + +static inline bool +xchk_refcount_mergeable( + struct xchk_refcbt_records *rrc, + const struct xfs_refcount_irec *r2) +{ + const struct xfs_refcount_irec *r1 = &rrc->prev_rec; + + /* Ignore if prev_rec is not yet initialized. */ + if (r1->rc_blockcount == 0) + return false; + + if (r1->rc_domain != r2->rc_domain) + return false; + if (r1->rc_startblock + r1->rc_blockcount != r2->rc_startblock) + return false; + if (r1->rc_refcount != r2->rc_refcount) + return false; + if ((unsigned long long)r1->rc_blockcount + r2->rc_blockcount > + MAXREFCEXTLEN) + return false; + + return true; +} + +/* Flag failures for records that could be merged. */ +STATIC void +xchk_refcountbt_check_mergeable( + struct xchk_btree *bs, + struct xchk_refcbt_records *rrc, + const struct xfs_refcount_irec *irec) +{ + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return; + + if (xchk_refcount_mergeable(rrc, irec)) + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + + memcpy(&rrc->prev_rec, irec, sizeof(struct xfs_refcount_irec)); +} + /* Scrub a refcountbt record. 
*/ STATIC int xchk_refcountbt_rec( @@ -332,27 +438,37 @@ xchk_refcountbt_rec( const union xfs_btree_rec *rec) { struct xfs_refcount_irec irec; - xfs_agblock_t *cow_blocks = bs->private; - struct xfs_perag *pag = bs->cur->bc_ag.pag; + struct xchk_refcbt_records *rrc = bs->private; xfs_refcount_btrec_to_irec(rec, &irec); - - /* Check the domain and refcount are not incompatible. */ - if (!xfs_refcount_check_domain(&irec)) + if (xfs_refcount_check_irec(bs->cur, &irec) != NULL) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return 0; + } if (irec.rc_domain == XFS_REFC_DOMAIN_COW) - (*cow_blocks) += irec.rc_blockcount; - - /* Check the extent. */ - if (!xfs_verify_agbext(pag, irec.rc_startblock, irec.rc_blockcount)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + rrc->cow_blocks += irec.rc_blockcount; - if (irec.rc_refcount == 0) + /* Shared records always come before CoW records. */ + if (irec.rc_domain == XFS_REFC_DOMAIN_SHARED && + rrc->prev_domain == XFS_REFC_DOMAIN_COW) xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + rrc->prev_domain = irec.rc_domain; + xchk_refcountbt_check_mergeable(bs, rrc, &irec); xchk_refcountbt_xref(bs->sc, &irec); + /* + * If this is a record for a shared extent, check that all blocks + * between the previous record and this one have at most one reverse + * mapping. + */ + if (irec.rc_domain == XFS_REFC_DOMAIN_SHARED) { + xchk_refcountbt_xref_gaps(bs->sc, rrc, irec.rc_startblock); + rrc->next_unshared_agbno = irec.rc_startblock + + irec.rc_blockcount; + } + return 0; } @@ -394,15 +510,25 @@ int xchk_refcountbt( struct xfs_scrub *sc) { - xfs_agblock_t cow_blocks = 0; + struct xchk_refcbt_records rrc = { + .cow_blocks = 0, + .next_unshared_agbno = 0, + .prev_domain = XFS_REFC_DOMAIN_SHARED, + }; int error; error = xchk_btree(sc, sc->sa.refc_cur, xchk_refcountbt_rec, - &XFS_RMAP_OINFO_REFC, &cow_blocks); + &XFS_RMAP_OINFO_REFC, &rrc); if (error) return error; - xchk_refcount_xref_rmap(sc, cow_blocks); + /* + * Check that all blocks between the last refcount > 1 record and the + * end of the AG have at most one reverse mapping. + */ + xchk_refcountbt_xref_gaps(sc, &rrc, sc->mp->m_sb.sb_agblocks); + + xchk_refcount_xref_rmap(sc, rrc.cow_blocks); return 0; } @@ -458,16 +584,37 @@ xchk_xref_is_not_shared( xfs_agblock_t agbno, xfs_extlen_t len) { - bool shared; + enum xbtree_recpacking outcome; + int error; + + if (!sc->sa.refc_cur || xchk_skip_xref(sc->sm)) + return; + + error = xfs_refcount_has_records(sc->sa.refc_cur, + XFS_REFC_DOMAIN_SHARED, agbno, len, &outcome); + if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur)) + return; + if (outcome != XBTREE_RECPACKING_EMPTY) + xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0); +} + +/* xref check that the extent is not being used for CoW staging. 
*/ +void +xchk_xref_is_not_cow_staging( + struct xfs_scrub *sc, + xfs_agblock_t agbno, + xfs_extlen_t len) +{ + enum xbtree_recpacking outcome; int error; if (!sc->sa.refc_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_refcount_has_record(sc->sa.refc_cur, XFS_REFC_DOMAIN_SHARED, - agbno, len, &shared); + error = xfs_refcount_has_records(sc->sa.refc_cur, XFS_REFC_DOMAIN_COW, + agbno, len, &outcome); if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur)) return; - if (shared) + if (outcome != XBTREE_RECPACKING_EMPTY) xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0); } diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 1b71174ec0d6..ac6d8803e660 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -60,6 +60,9 @@ xrep_attempt( sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT; sc->flags |= XREP_ALREADY_FIXED; return -EAGAIN; + case -ECHRNG: + sc->flags |= XCHK_NEED_DRAIN; + return -EAGAIN; case -EDEADLOCK: /* Tell the caller to try again having grabbed all the locks. */ if (!(sc->flags & XCHK_TRY_HARDER)) { @@ -442,6 +445,30 @@ xrep_init_btblock( * buffers associated with @bitmap. */ +static int +xrep_invalidate_block( + uint64_t fsbno, + void *priv) +{ + struct xfs_scrub *sc = priv; + struct xfs_buf *bp; + int error; + + /* Skip AG headers and post-EOFS blocks */ + if (!xfs_verify_fsbno(sc->mp, fsbno)) + return 0; + + error = xfs_buf_incore(sc->mp->m_ddev_targp, + XFS_FSB_TO_DADDR(sc->mp, fsbno), + XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp); + if (error) + return 0; + + xfs_trans_bjoin(sc->tp, bp); + xfs_trans_binval(sc->tp, bp); + return 0; +} + /* * Invalidate buffers for per-AG btree blocks we're dumping. This function * is not intended for use with file data repairs; we have bunmapi for that. @@ -451,11 +478,6 @@ xrep_invalidate_blocks( struct xfs_scrub *sc, struct xbitmap *bitmap) { - struct xbitmap_range *bmr; - struct xbitmap_range *n; - struct xfs_buf *bp; - xfs_fsblock_t fsbno; - /* * For each block in each extent, see if there's an incore buffer for * exactly that block; if so, invalidate it. The buffer cache only @@ -464,23 +486,7 @@ xrep_invalidate_blocks( * because we never own those; and if we can't TRYLOCK the buffer we * assume it's owned by someone else. */ - for_each_xbitmap_block(fsbno, bmr, n, bitmap) { - int error; - - /* Skip AG headers and post-EOFS blocks */ - if (!xfs_verify_fsbno(sc->mp, fsbno)) - continue; - error = xfs_buf_incore(sc->mp->m_ddev_targp, - XFS_FSB_TO_DADDR(sc->mp, fsbno), - XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp); - if (error) - continue; - - xfs_trans_bjoin(sc->tp, bp); - xfs_trans_binval(sc->tp, bp); - } - - return 0; + return xbitmap_walk_bits(bitmap, xrep_invalidate_block, sc); } /* Ensure the freelist is the correct size. */ @@ -501,6 +507,15 @@ xrep_fix_freelist( can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK); } +/* Information about reaping extents after a repair. */ +struct xrep_reap_state { + struct xfs_scrub *sc; + + /* Reverse mapping owner and metadata reservation type. */ + const struct xfs_owner_info *oinfo; + enum xfs_ag_resv_type resv; +}; + /* * Put a block back on the AGFL. */ @@ -545,17 +560,23 @@ xrep_put_freelist( /* Dispose of a single block. 
*/ STATIC int xrep_reap_block( - struct xfs_scrub *sc, - xfs_fsblock_t fsbno, - const struct xfs_owner_info *oinfo, - enum xfs_ag_resv_type resv) + uint64_t fsbno, + void *priv) { + struct xrep_reap_state *rs = priv; + struct xfs_scrub *sc = rs->sc; struct xfs_btree_cur *cur; struct xfs_buf *agf_bp = NULL; xfs_agblock_t agbno; bool has_other_rmap; int error; + ASSERT(sc->ip != NULL || + XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno); + trace_xrep_dispose_btree_extent(sc->mp, + XFS_FSB_TO_AGNO(sc->mp, fsbno), + XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1); + agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno); ASSERT(XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno); @@ -574,7 +595,8 @@ xrep_reap_block( cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, agf_bp, sc->sa.pag); /* Can we find any other rmappings? */ - error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap); + error = xfs_rmap_has_other_keys(cur, agbno, 1, rs->oinfo, + &has_other_rmap); xfs_btree_del_cursor(cur, error); if (error) goto out_free; @@ -594,11 +616,12 @@ xrep_reap_block( */ if (has_other_rmap) error = xfs_rmap_free(sc->tp, agf_bp, sc->sa.pag, agbno, - 1, oinfo); - else if (resv == XFS_AG_RESV_AGFL) + 1, rs->oinfo); + else if (rs->resv == XFS_AG_RESV_AGFL) error = xrep_put_freelist(sc, agbno); else - error = xfs_free_extent(sc->tp, fsbno, 1, oinfo, resv); + error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, rs->oinfo, + rs->resv); if (agf_bp != sc->sa.agf_bp) xfs_trans_brelse(sc->tp, agf_bp); if (error) @@ -622,26 +645,15 @@ xrep_reap_extents( const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type) { - struct xbitmap_range *bmr; - struct xbitmap_range *n; - xfs_fsblock_t fsbno; - int error = 0; + struct xrep_reap_state rs = { + .sc = sc, + .oinfo = oinfo, + .resv = type, + }; ASSERT(xfs_has_rmapbt(sc->mp)); - for_each_xbitmap_block(fsbno, bmr, n, bitmap) { - ASSERT(sc->ip != NULL || - XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno); - trace_xrep_dispose_btree_extent(sc->mp, - XFS_FSB_TO_AGNO(sc->mp, fsbno), - XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1); - - error = xrep_reap_block(sc, fsbno, oinfo, type); - if (error) - break; - } - - return error; + return xbitmap_walk_bits(bitmap, xrep_reap_block, &rs); } /* diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 840f74ec431c..dce791c679ee 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2018 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2018-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_REPAIR_H__ #define __XFS_SCRUB_REPAIR_H__ @@ -31,6 +31,7 @@ int xrep_init_btblock(struct xfs_scrub *sc, xfs_fsblock_t fsb, const struct xfs_buf_ops *ops); struct xbitmap; +struct xagb_bitmap; int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink); int xrep_invalidate_blocks(struct xfs_scrub *sc, struct xbitmap *btlist); diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c index 229826b2e1c0..d29a26ecddd6 100644 --- a/fs/xfs/scrub/rmap.c +++ b/fs/xfs/scrub/rmap.c @@ -1,21 +1,29 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" #include "xfs_shared.h" #include "xfs_format.h" +#include "xfs_log_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" +#include "xfs_trans.h" #include "xfs_btree.h" #include "xfs_rmap.h" #include "xfs_refcount.h" +#include "xfs_ag.h" +#include "xfs_bit.h" +#include "xfs_alloc.h" +#include "xfs_alloc_btree.h" +#include "xfs_ialloc_btree.h" +#include "xfs_refcount_btree.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/btree.h" -#include "xfs_ag.h" +#include "scrub/bitmap.h" /* * Set us up to scrub reverse mapping btrees. @@ -24,11 +32,39 @@ int xchk_setup_ag_rmapbt( struct xfs_scrub *sc) { + if (xchk_need_intent_drain(sc)) + xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN); + return xchk_setup_ag_btree(sc, false); } /* Reverse-mapping scrubber. */ +struct xchk_rmap { + /* + * The furthest-reaching of the rmapbt records that we've already + * processed. This enables us to detect overlapping records for space + * allocations that cannot be shared. + */ + struct xfs_rmap_irec overlap_rec; + + /* + * The previous rmapbt record, so that we can check for two records + * that could be one. + */ + struct xfs_rmap_irec prev_rec; + + /* Bitmaps containing all blocks for each type of AG metadata. */ + struct xagb_bitmap fs_owned; + struct xagb_bitmap log_owned; + struct xagb_bitmap ag_owned; + struct xagb_bitmap inobt_owned; + struct xagb_bitmap refcbt_owned; + + /* Did we complete the AG space metadata bitmaps? */ + bool bitmaps_complete; +}; + /* Cross-reference a rmap against the refcount btree. */ STATIC void xchk_rmapbt_xref_refc( @@ -84,80 +120,415 @@ xchk_rmapbt_xref( xchk_rmapbt_xref_refc(sc, irec); } -/* Scrub an rmapbt record. */ -STATIC int -xchk_rmapbt_rec( - struct xchk_btree *bs, - const union xfs_btree_rec *rec) +/* + * Check for bogus UNWRITTEN flags in the rmapbt node block keys. + * + * In reverse mapping records, the file mapping extent state + * (XFS_RMAP_OFF_UNWRITTEN) is a record attribute, not a key field. It is not + * involved in lookups in any way. In older kernels, the functions that + * convert rmapbt records to keys forgot to filter out the extent state bit, + * even though the key comparison functions have filtered the flag correctly. + * If we spot an rmap key with the unwritten bit set in rm_offset, we should + * mark the btree as needing optimization to rebuild the btree without those + * flags. + */ +STATIC void +xchk_rmapbt_check_unwritten_in_keyflags( + struct xchk_btree *bs) { - struct xfs_mount *mp = bs->cur->bc_mp; - struct xfs_rmap_irec irec; - struct xfs_perag *pag = bs->cur->bc_ag.pag; - bool non_inode; - bool is_unwritten; - bool is_bmbt; - bool is_attr; - int error; + struct xfs_scrub *sc = bs->sc; + struct xfs_btree_cur *cur = bs->cur; + struct xfs_btree_block *keyblock; + union xfs_btree_key *lkey, *hkey; + __be64 badflag = cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN); + unsigned int level; - error = xfs_rmap_btrec_to_irec(rec, &irec); - if (!xchk_btree_process_error(bs->sc, bs->cur, 0, &error)) - goto out; + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_PREEN) + return; + + for (level = 1; level < cur->bc_nlevels; level++) { + struct xfs_buf *bp; + unsigned int ptr; + + /* Only check the first time we've seen this node block. 
*/ + if (cur->bc_levels[level].ptr > 1) + continue; + + keyblock = xfs_btree_get_block(cur, level, &bp); + for (ptr = 1; ptr <= be16_to_cpu(keyblock->bb_numrecs); ptr++) { + lkey = xfs_btree_key_addr(cur, ptr, keyblock); + + if (lkey->rmap.rm_offset & badflag) { + xchk_btree_set_preen(sc, cur, level); + break; + } + + hkey = xfs_btree_high_key_addr(cur, ptr, keyblock); + if (hkey->rmap.rm_offset & badflag) { + xchk_btree_set_preen(sc, cur, level); + break; + } + } + } +} + +static inline bool +xchk_rmapbt_is_shareable( + struct xfs_scrub *sc, + const struct xfs_rmap_irec *irec) +{ + if (!xfs_has_reflink(sc->mp)) + return false; + if (XFS_RMAP_NON_INODE_OWNER(irec->rm_owner)) + return false; + if (irec->rm_flags & (XFS_RMAP_BMBT_BLOCK | XFS_RMAP_ATTR_FORK | + XFS_RMAP_UNWRITTEN)) + return false; + return true; +} + +/* Flag failures for records that overlap but cannot. */ +STATIC void +xchk_rmapbt_check_overlapping( + struct xchk_btree *bs, + struct xchk_rmap *cr, + const struct xfs_rmap_irec *irec) +{ + xfs_agblock_t pnext, inext; + + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return; + + /* No previous record? */ + if (cr->overlap_rec.rm_blockcount == 0) + goto set_prev; + + /* Do overlap_rec and irec overlap? */ + pnext = cr->overlap_rec.rm_startblock + cr->overlap_rec.rm_blockcount; + if (pnext <= irec->rm_startblock) + goto set_prev; - /* Check extent. */ - if (irec.rm_startblock + irec.rm_blockcount <= irec.rm_startblock) + /* Overlap is only allowed if both records are data fork mappings. */ + if (!xchk_rmapbt_is_shareable(bs->sc, &cr->overlap_rec) || + !xchk_rmapbt_is_shareable(bs->sc, irec)) xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - if (irec.rm_owner == XFS_RMAP_OWN_FS) { + /* Save whichever rmap record extends furthest. */ + inext = irec->rm_startblock + irec->rm_blockcount; + if (pnext > inext) + return; + +set_prev: + memcpy(&cr->overlap_rec, irec, sizeof(struct xfs_rmap_irec)); +} + +/* Decide if two reverse-mapping records can be merged. */ +static inline bool +xchk_rmap_mergeable( + struct xchk_rmap *cr, + const struct xfs_rmap_irec *r2) +{ + const struct xfs_rmap_irec *r1 = &cr->prev_rec; + + /* Ignore if prev_rec is not yet initialized. */ + if (cr->prev_rec.rm_blockcount == 0) + return false; + + if (r1->rm_owner != r2->rm_owner) + return false; + if (r1->rm_startblock + r1->rm_blockcount != r2->rm_startblock) + return false; + if ((unsigned long long)r1->rm_blockcount + r2->rm_blockcount > + XFS_RMAP_LEN_MAX) + return false; + if (XFS_RMAP_NON_INODE_OWNER(r2->rm_owner)) + return true; + /* must be an inode owner below here */ + if (r1->rm_flags != r2->rm_flags) + return false; + if (r1->rm_flags & XFS_RMAP_BMBT_BLOCK) + return true; + return r1->rm_offset + r1->rm_blockcount == r2->rm_offset; +} + +/* Flag failures for records that could be merged. */ +STATIC void +xchk_rmapbt_check_mergeable( + struct xchk_btree *bs, + struct xchk_rmap *cr, + const struct xfs_rmap_irec *irec) +{ + if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return; + + if (xchk_rmap_mergeable(cr, irec)) + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + + memcpy(&cr->prev_rec, irec, sizeof(struct xfs_rmap_irec)); +} + +/* Compare an rmap for AG metadata against the metadata walk. */ +STATIC int +xchk_rmapbt_mark_bitmap( + struct xchk_btree *bs, + struct xchk_rmap *cr, + const struct xfs_rmap_irec *irec) +{ + struct xfs_scrub *sc = bs->sc; + struct xagb_bitmap *bmp = NULL; + xfs_extlen_t fsbcount = irec->rm_blockcount; + + /* + * Skip corrupt records. 
It is essential that we detect records in the + * btree that cannot overlap but do, flag those as CORRUPT, and skip + * the bitmap comparison to avoid generating false XCORRUPT reports. + */ + if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) + return 0; + + /* + * If the AG metadata walk didn't complete, there's no point in + * comparing against partial results. + */ + if (!cr->bitmaps_complete) + return 0; + + switch (irec->rm_owner) { + case XFS_RMAP_OWN_FS: + bmp = &cr->fs_owned; + break; + case XFS_RMAP_OWN_LOG: + bmp = &cr->log_owned; + break; + case XFS_RMAP_OWN_AG: + bmp = &cr->ag_owned; + break; + case XFS_RMAP_OWN_INOBT: + bmp = &cr->inobt_owned; + break; + case XFS_RMAP_OWN_REFC: + bmp = &cr->refcbt_owned; + break; + } + + if (!bmp) + return 0; + + if (xagb_bitmap_test(bmp, irec->rm_startblock, &fsbcount)) { /* - * xfs_verify_agbno returns false for static fs metadata. - * Since that only exists at the start of the AG, validate - * that by hand. + * The start of this reverse mapping corresponds to a set + * region in the bitmap. If the mapping covers more area than + * the set region, then it covers space that wasn't found by + * the AG metadata walk. */ - if (irec.rm_startblock != 0 || - irec.rm_blockcount != XFS_AGFL_BLOCK(mp) + 1) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + if (fsbcount < irec->rm_blockcount) + xchk_btree_xref_set_corrupt(bs->sc, + bs->sc->sa.rmap_cur, 0); } else { /* - * Otherwise we must point somewhere past the static metadata - * but before the end of the FS. Run the regular check. + * The start of this reverse mapping does not correspond to a + * completely set region in the bitmap. The region wasn't + * fully set by walking the AG metadata, so this is a + * cross-referencing corruption. */ - if (!xfs_verify_agbno(pag, irec.rm_startblock) || - !xfs_verify_agbno(pag, irec.rm_startblock + - irec.rm_blockcount - 1)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + xchk_btree_xref_set_corrupt(bs->sc, bs->sc->sa.rmap_cur, 0); } - /* Check flags. */ - non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner); - is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK; - is_attr = irec.rm_flags & XFS_RMAP_ATTR_FORK; - is_unwritten = irec.rm_flags & XFS_RMAP_UNWRITTEN; + /* Unset the region so that we can detect missing rmap records. */ + return xagb_bitmap_clear(bmp, irec->rm_startblock, irec->rm_blockcount); +} - if (is_bmbt && irec.rm_offset != 0) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); +/* Scrub an rmapbt record. */ +STATIC int +xchk_rmapbt_rec( + struct xchk_btree *bs, + const union xfs_btree_rec *rec) +{ + struct xchk_rmap *cr = bs->private; + struct xfs_rmap_irec irec; - if (non_inode && irec.rm_offset != 0) + if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL || + xfs_rmap_check_irec(bs->cur, &irec) != NULL) { xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return 0; + } - if (is_unwritten && (is_bmbt || non_inode || is_attr)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + xchk_rmapbt_check_unwritten_in_keyflags(bs); + xchk_rmapbt_check_mergeable(bs, cr, &irec); + xchk_rmapbt_check_overlapping(bs, cr, &irec); + xchk_rmapbt_xref(bs->sc, &irec); - if (non_inode && (is_bmbt || is_unwritten || is_attr)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); + return xchk_rmapbt_mark_bitmap(bs, cr, &irec); +} - if (!non_inode) { - if (!xfs_verify_ino(mp, irec.rm_owner)) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); - } else { - /* Non-inode owner within the magic values? 
*/ - if (irec.rm_owner <= XFS_RMAP_OWN_MIN || - irec.rm_owner > XFS_RMAP_OWN_FS) - xchk_btree_set_corrupt(bs->sc, bs->cur, 0); +/* Add an AGFL block to the rmap list. */ +STATIC int +xchk_rmapbt_walk_agfl( + struct xfs_mount *mp, + xfs_agblock_t agbno, + void *priv) +{ + struct xagb_bitmap *bitmap = priv; + + return xagb_bitmap_set(bitmap, agbno, 1); +} + +/* + * Set up bitmaps mapping all the AG metadata to compare with the rmapbt + * records. + * + * Grab our own btree cursors here if the scrub setup function didn't give us a + * btree cursor due to reports of poor health. We need to find out if the + * rmapbt disagrees with primary metadata btrees to tag the rmapbt as being + * XCORRUPT. + */ +STATIC int +xchk_rmapbt_walk_ag_metadata( + struct xfs_scrub *sc, + struct xchk_rmap *cr) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_buf *agfl_bp; + struct xfs_agf *agf = sc->sa.agf_bp->b_addr; + struct xfs_btree_cur *cur; + int error; + + /* OWN_FS: AG headers */ + error = xagb_bitmap_set(&cr->fs_owned, XFS_SB_BLOCK(mp), + XFS_AGFL_BLOCK(mp) - XFS_SB_BLOCK(mp) + 1); + if (error) + goto out; + + /* OWN_LOG: Internal log */ + if (xfs_ag_contains_log(mp, sc->sa.pag->pag_agno)) { + error = xagb_bitmap_set(&cr->log_owned, + XFS_FSB_TO_AGBNO(mp, mp->m_sb.sb_logstart), + mp->m_sb.sb_logblocks); + if (error) + goto out; + } + + /* OWN_AG: bnobt, cntbt, rmapbt, and AGFL */ + cur = sc->sa.bno_cur; + if (!cur) + cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, + sc->sa.pag, XFS_BTNUM_BNO); + error = xagb_bitmap_set_btblocks(&cr->ag_owned, cur); + if (cur != sc->sa.bno_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + + cur = sc->sa.cnt_cur; + if (!cur) + cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, + sc->sa.pag, XFS_BTNUM_CNT); + error = xagb_bitmap_set_btblocks(&cr->ag_owned, cur); + if (cur != sc->sa.cnt_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + + error = xagb_bitmap_set_btblocks(&cr->ag_owned, sc->sa.rmap_cur); + if (error) + goto out; + + error = xfs_alloc_read_agfl(sc->sa.pag, sc->tp, &agfl_bp); + if (error) + goto out; + + error = xfs_agfl_walk(sc->mp, agf, agfl_bp, xchk_rmapbt_walk_agfl, + &cr->ag_owned); + xfs_trans_brelse(sc->tp, agfl_bp); + if (error) + goto out; + + /* OWN_INOBT: inobt, finobt */ + cur = sc->sa.ino_cur; + if (!cur) + cur = xfs_inobt_init_cursor(sc->sa.pag, sc->tp, sc->sa.agi_bp, + XFS_BTNUM_INO); + error = xagb_bitmap_set_btblocks(&cr->inobt_owned, cur); + if (cur != sc->sa.ino_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + + if (xfs_has_finobt(sc->mp)) { + cur = sc->sa.fino_cur; + if (!cur) + cur = xfs_inobt_init_cursor(sc->sa.pag, sc->tp, + sc->sa.agi_bp, XFS_BTNUM_FINO); + error = xagb_bitmap_set_btblocks(&cr->inobt_owned, cur); + if (cur != sc->sa.fino_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; + } + + /* OWN_REFC: refcountbt */ + if (xfs_has_reflink(sc->mp)) { + cur = sc->sa.refc_cur; + if (!cur) + cur = xfs_refcountbt_init_cursor(sc->mp, sc->tp, + sc->sa.agf_bp, sc->sa.pag); + error = xagb_bitmap_set_btblocks(&cr->refcbt_owned, cur); + if (cur != sc->sa.refc_cur) + xfs_btree_del_cursor(cur, error); + if (error) + goto out; } - xchk_rmapbt_xref(bs->sc, &irec); out: - return error; + /* + * If there's an error, set XFAIL and disable the bitmap + * cross-referencing checks, but proceed with the scrub anyway. 
+ */ + if (error) + xchk_btree_xref_process_error(sc, sc->sa.rmap_cur, + sc->sa.rmap_cur->bc_nlevels - 1, &error); + else + cr->bitmaps_complete = true; + return 0; +} + +/* + * Check for set regions in the bitmaps; if there are any, the rmap records do + * not describe all the AG metadata. + */ +STATIC void +xchk_rmapbt_check_bitmaps( + struct xfs_scrub *sc, + struct xchk_rmap *cr) +{ + struct xfs_btree_cur *cur = sc->sa.rmap_cur; + unsigned int level; + + if (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | + XFS_SCRUB_OFLAG_XFAIL)) + return; + if (!cur) + return; + level = cur->bc_nlevels - 1; + + /* + * Any bitmap with bits still set indicates that the reverse mapping + * doesn't cover the entire primary structure. + */ + if (xagb_bitmap_hweight(&cr->fs_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); + + if (xagb_bitmap_hweight(&cr->log_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); + + if (xagb_bitmap_hweight(&cr->ag_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); + + if (xagb_bitmap_hweight(&cr->inobt_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); + + if (xagb_bitmap_hweight(&cr->refcbt_owned) != 0) + xchk_btree_xref_set_corrupt(sc, cur, level); } /* Scrub the rmap btree for some AG. */ @@ -165,42 +536,63 @@ int xchk_rmapbt( struct xfs_scrub *sc) { - return xchk_btree(sc, sc->sa.rmap_cur, xchk_rmapbt_rec, - &XFS_RMAP_OINFO_AG, NULL); + struct xchk_rmap *cr; + int error; + + cr = kzalloc(sizeof(struct xchk_rmap), XCHK_GFP_FLAGS); + if (!cr) + return -ENOMEM; + + xagb_bitmap_init(&cr->fs_owned); + xagb_bitmap_init(&cr->log_owned); + xagb_bitmap_init(&cr->ag_owned); + xagb_bitmap_init(&cr->inobt_owned); + xagb_bitmap_init(&cr->refcbt_owned); + + error = xchk_rmapbt_walk_ag_metadata(sc, cr); + if (error) + goto out; + + error = xchk_btree(sc, sc->sa.rmap_cur, xchk_rmapbt_rec, + &XFS_RMAP_OINFO_AG, cr); + if (error) + goto out; + + xchk_rmapbt_check_bitmaps(sc, cr); + +out: + xagb_bitmap_destroy(&cr->refcbt_owned); + xagb_bitmap_destroy(&cr->inobt_owned); + xagb_bitmap_destroy(&cr->ag_owned); + xagb_bitmap_destroy(&cr->log_owned); + xagb_bitmap_destroy(&cr->fs_owned); + kfree(cr); + return error; } -/* xref check that the extent is owned by a given owner */ -static inline void -xchk_xref_check_owner( +/* xref check that the extent is owned only by a given owner */ +void +xchk_xref_is_only_owned_by( struct xfs_scrub *sc, xfs_agblock_t bno, xfs_extlen_t len, - const struct xfs_owner_info *oinfo, - bool should_have_rmap) + const struct xfs_owner_info *oinfo) { - bool has_rmap; + struct xfs_rmap_matches res; int error; if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_rmap_record_exists(sc->sa.rmap_cur, bno, len, oinfo, - &has_rmap); + error = xfs_rmap_count_owners(sc->sa.rmap_cur, bno, len, oinfo, &res); if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur)) return; - if (has_rmap != should_have_rmap) + if (res.matches != 1) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + if (res.bad_non_owner_matches) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + if (res.non_owner_matches) xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); -} - -/* xref check that the extent is owned by a given owner */ -void -xchk_xref_is_owned_by( - struct xfs_scrub *sc, - xfs_agblock_t bno, - xfs_extlen_t len, - const struct xfs_owner_info *oinfo) -{ - xchk_xref_check_owner(sc, bno, len, oinfo, true); } /* xref check that the extent is not owned by a given owner */ @@ -211,7 +603,19 @@ xchk_xref_is_not_owned_by( xfs_extlen_t len, 
const struct xfs_owner_info *oinfo) { - xchk_xref_check_owner(sc, bno, len, oinfo, false); + struct xfs_rmap_matches res; + int error; + + if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm)) + return; + + error = xfs_rmap_count_owners(sc->sa.rmap_cur, bno, len, oinfo, &res); + if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur)) + return; + if (res.matches != 0) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); + if (res.bad_non_owner_matches) + xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); } /* xref check that the extent has no reverse mapping at all */ @@ -221,15 +625,15 @@ xchk_xref_has_no_owner( xfs_agblock_t bno, xfs_extlen_t len) { - bool has_rmap; + enum xbtree_recpacking outcome; int error; if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm)) return; - error = xfs_rmap_has_record(sc->sa.rmap_cur, bno, len, &has_rmap); + error = xfs_rmap_has_records(sc->sa.rmap_cur, bno, len, &outcome); if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur)) return; - if (has_rmap) + if (outcome != XBTREE_RECPACKING_EMPTY) xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0); } diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c index 0a3bde64c675..e7dace7b4be8 100644 --- a/fs/xfs/scrub/rtbitmap.c +++ b/fs/xfs/scrub/rtbitmap.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 07a7a75f987f..02819bedc5b1 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" @@ -145,6 +145,21 @@ xchk_probe( /* Scrub setup and teardown */ +static inline void +xchk_fsgates_disable( + struct xfs_scrub *sc) +{ + if (!(sc->flags & XCHK_FSGATES_ALL)) + return; + + trace_xchk_fsgates_disable(sc, sc->flags & XCHK_FSGATES_ALL); + + if (sc->flags & XCHK_FSGATES_DRAIN) + xfs_drain_wait_disable(); + + sc->flags &= ~XCHK_FSGATES_ALL; +} + /* Free all the resources and finish the transactions. 
*/ STATIC int xchk_teardown( @@ -166,7 +181,7 @@ xchk_teardown( xfs_iunlock(sc->ip, sc->ilock_flags); if (sc->ip != ip_in && !xfs_internal_inum(sc->mp, sc->ip->i_ino)) - xfs_irele(sc->ip); + xchk_irele(sc, sc->ip); sc->ip = NULL; } if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) @@ -174,9 +189,14 @@ xchk_teardown( if (sc->flags & XCHK_REAPING_DISABLED) xchk_start_reaping(sc); if (sc->buf) { + if (sc->buf_cleanup) + sc->buf_cleanup(sc->buf); kvfree(sc->buf); + sc->buf_cleanup = NULL; sc->buf = NULL; } + + xchk_fsgates_disable(sc); return error; } @@ -191,25 +211,25 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { }, [XFS_SCRUB_TYPE_SB] = { /* superblock */ .type = ST_PERAG, - .setup = xchk_setup_fs, + .setup = xchk_setup_agheader, .scrub = xchk_superblock, .repair = xrep_superblock, }, [XFS_SCRUB_TYPE_AGF] = { /* agf */ .type = ST_PERAG, - .setup = xchk_setup_fs, + .setup = xchk_setup_agheader, .scrub = xchk_agf, .repair = xrep_agf, }, [XFS_SCRUB_TYPE_AGFL]= { /* agfl */ .type = ST_PERAG, - .setup = xchk_setup_fs, + .setup = xchk_setup_agheader, .scrub = xchk_agfl, .repair = xrep_agfl, }, [XFS_SCRUB_TYPE_AGI] = { /* agi */ .type = ST_PERAG, - .setup = xchk_setup_fs, + .setup = xchk_setup_agheader, .scrub = xchk_agi, .repair = xrep_agi, }, @@ -491,23 +511,20 @@ retry_op: /* Set up for the operation. */ error = sc->ops->setup(sc); + if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER)) + goto try_harder; + if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN)) + goto need_drain; if (error) goto out_teardown; /* Scrub for errors. */ error = sc->ops->scrub(sc); - if (!(sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) { - /* - * Scrubbers return -EDEADLOCK to mean 'try harder'. - * Tear down everything we hold, then set up again with - * preparation for worst-case scenarios. - */ - error = xchk_teardown(sc, 0); - if (error) - goto out_sc; - sc->flags |= XCHK_TRY_HARDER; - goto retry_op; - } else if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)) + if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER)) + goto try_harder; + if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN)) + goto need_drain; + if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)) goto out_teardown; xchk_update_health(sc); @@ -565,4 +582,21 @@ out: error = 0; } return error; +need_drain: + error = xchk_teardown(sc, 0); + if (error) + goto out_sc; + sc->flags |= XCHK_NEED_DRAIN; + goto retry_op; +try_harder: + /* + * Scrubbers return -EDEADLOCK to mean 'try harder'. Tear down + * everything we hold, then set up again with preparation for + * worst-case scenarios. + */ + error = xchk_teardown(sc, 0); + if (error) + goto out_sc; + sc->flags |= XCHK_TRY_HARDER; + goto retry_op; } diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index b4d391b4c938..e71903474cd7 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_SCRUB_H__ #define __XFS_SCRUB_SCRUB_H__ @@ -77,7 +77,17 @@ struct xfs_scrub { */ struct xfs_inode *ip; + /* Kernel memory buffer used by scrubbers; freed at teardown. */ void *buf; + + /* + * Clean up resources owned by whatever is in the buffer. 
Cleanup can + * be deferred with this hook as a means for scrub functions to pass + * data to repair functions. This function must not free the buffer + * itself. + */ + void (*buf_cleanup)(void *buf); + uint ilock_flags; /* See the XCHK/XREP state flags below. */ @@ -96,9 +106,19 @@ struct xfs_scrub { /* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */ #define XCHK_TRY_HARDER (1 << 0) /* can't get resources, try again */ -#define XCHK_REAPING_DISABLED (1 << 2) /* background block reaping paused */ +#define XCHK_REAPING_DISABLED (1 << 1) /* background block reaping paused */ +#define XCHK_FSGATES_DRAIN (1 << 2) /* defer ops draining enabled */ +#define XCHK_NEED_DRAIN (1 << 3) /* scrub needs to drain defer ops */ #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */ +/* + * The XCHK_FSGATES* flags reflect functionality in the main filesystem that + * are only enabled for this particular online fsck. When not in use, the + * features are gated off via dynamic code patching, which is why the state + * must be enabled during scrub setup and can only be torn down afterwards. + */ +#define XCHK_FSGATES_ALL (XCHK_FSGATES_DRAIN) + /* Metadata scrubbers */ int xchk_tester(struct xfs_scrub *sc); int xchk_superblock(struct xfs_scrub *sc); @@ -152,7 +172,7 @@ void xchk_xref_is_not_inode_chunk(struct xfs_scrub *sc, xfs_agblock_t agbno, xfs_extlen_t len); void xchk_xref_is_inode_chunk(struct xfs_scrub *sc, xfs_agblock_t agbno, xfs_extlen_t len); -void xchk_xref_is_owned_by(struct xfs_scrub *sc, xfs_agblock_t agbno, +void xchk_xref_is_only_owned_by(struct xfs_scrub *sc, xfs_agblock_t agbno, xfs_extlen_t len, const struct xfs_owner_info *oinfo); void xchk_xref_is_not_owned_by(struct xfs_scrub *sc, xfs_agblock_t agbno, xfs_extlen_t len, const struct xfs_owner_info *oinfo); @@ -162,6 +182,8 @@ void xchk_xref_is_cow_staging(struct xfs_scrub *sc, xfs_agblock_t bno, xfs_extlen_t len); void xchk_xref_is_not_shared(struct xfs_scrub *sc, xfs_agblock_t bno, xfs_extlen_t len); +void xchk_xref_is_not_cow_staging(struct xfs_scrub *sc, xfs_agblock_t bno, + xfs_extlen_t len); #ifdef CONFIG_XFS_RT void xchk_xref_is_used_rt_space(struct xfs_scrub *sc, xfs_rtblock_t rtbno, xfs_extlen_t len); diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c index c1c99ffe7408..38708fb9a5d7 100644 --- a/fs/xfs/scrub/symlink.c +++ b/fs/xfs/scrub/symlink.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c index b5f94676c37c..0a975439d2b6 100644 --- a/fs/xfs/scrub/trace.c +++ b/fs/xfs/scrub/trace.c @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #include "xfs.h" #include "xfs_fs.h" diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 93ece6df02e3..68efd6fda61c 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. 
All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> * * NOTE: none of these tracepoints shall be considered a stable kernel ABI * as they can change at any time. See xfs_trace.h for documentation of @@ -30,6 +30,9 @@ TRACE_DEFINE_ENUM(XFS_BTNUM_FINOi); TRACE_DEFINE_ENUM(XFS_BTNUM_RMAPi); TRACE_DEFINE_ENUM(XFS_BTNUM_REFCi); +TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_SHARED); +TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_COW); + TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_PROBE); TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_SB); TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_AGF); @@ -93,6 +96,13 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS); { XFS_SCRUB_OFLAG_WARNING, "warning" }, \ { XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED, "norepair" } +#define XFS_SCRUB_STATE_STRINGS \ + { XCHK_TRY_HARDER, "try_harder" }, \ + { XCHK_REAPING_DISABLED, "reaping_disabled" }, \ + { XCHK_FSGATES_DRAIN, "fsgates_drain" }, \ + { XCHK_NEED_DRAIN, "need_drain" }, \ + { XREP_ALREADY_FIXED, "already_fixed" } + DECLARE_EVENT_CLASS(xchk_class, TP_PROTO(struct xfs_inode *ip, struct xfs_scrub_metadata *sm, int error), @@ -139,6 +149,33 @@ DEFINE_SCRUB_EVENT(xchk_deadlock_retry); DEFINE_SCRUB_EVENT(xrep_attempt); DEFINE_SCRUB_EVENT(xrep_done); +DECLARE_EVENT_CLASS(xchk_fsgate_class, + TP_PROTO(struct xfs_scrub *sc, unsigned int fsgate_flags), + TP_ARGS(sc, fsgate_flags), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(unsigned int, type) + __field(unsigned int, fsgate_flags) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + __entry->type = sc->sm->sm_type; + __entry->fsgate_flags = fsgate_flags; + ), + TP_printk("dev %d:%d type %s fsgates '%s'", + MAJOR(__entry->dev), MINOR(__entry->dev), + __print_symbolic(__entry->type, XFS_SCRUB_TYPE_STRINGS), + __print_flags(__entry->fsgate_flags, "|", XFS_SCRUB_STATE_STRINGS)) +) + +#define DEFINE_SCRUB_FSHOOK_EVENT(name) \ +DEFINE_EVENT(xchk_fsgate_class, name, \ + TP_PROTO(struct xfs_scrub *sc, unsigned int fsgates_flags), \ + TP_ARGS(sc, fsgates_flags)) + +DEFINE_SCRUB_FSHOOK_EVENT(xchk_fsgates_enable); +DEFINE_SCRUB_FSHOOK_EVENT(xchk_fsgates_disable); + TRACE_EVENT(xchk_op_error, TP_PROTO(struct xfs_scrub *sc, xfs_agnumber_t agno, xfs_agblock_t bno, int error, void *ret_ip), @@ -657,6 +694,38 @@ TRACE_EVENT(xchk_fscounters_within_range, __entry->old_value) ) +TRACE_EVENT(xchk_refcount_incorrect, + TP_PROTO(struct xfs_perag *pag, const struct xfs_refcount_irec *irec, + xfs_nlink_t seen), + TP_ARGS(pag, irec, seen), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(enum xfs_refc_domain, domain) + __field(xfs_agblock_t, startblock) + __field(xfs_extlen_t, blockcount) + __field(xfs_nlink_t, refcount) + __field(xfs_nlink_t, seen) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->domain = irec->rc_domain; + __entry->startblock = irec->rc_startblock; + __entry->blockcount = irec->rc_blockcount; + __entry->refcount = irec->rc_refcount; + __entry->seen = seen; + ), + TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u seen %u", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS), + __entry->startblock, + __entry->blockcount, + __entry->refcount, + __entry->seen) +) + /* repair tracepoints */ #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) diff --git a/fs/xfs/scrub/xfs_scrub.h b/fs/xfs/scrub/xfs_scrub.h index 
2ceae614ade8..a39befa743ce 100644 --- a/fs/xfs/scrub/xfs_scrub.h +++ b/fs/xfs/scrub/xfs_scrub.h @@ -1,7 +1,7 @@ -// SPDX-License-Identifier: GPL-2.0+ +// SPDX-License-Identifier: GPL-2.0-or-later /* - * Copyright (C) 2017 Oracle. All Rights Reserved. - * Author: Darrick J. Wong <darrick.wong@oracle.com> + * Copyright (C) 2017-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> */ #ifndef __XFS_SCRUB_H__ #define __XFS_SCRUB_H__ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 6e2f0013380a..7551c3ec4ea5 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -24,6 +24,7 @@ #include "xfs_error.h" #include "xfs_log_priv.h" #include "xfs_log_recover.h" +#include "xfs_ag.h" struct kmem_cache *xfs_bui_cache; struct kmem_cache *xfs_bud_cache; @@ -363,6 +364,34 @@ xfs_bmap_update_create_done( return &xfs_trans_get_bud(tp, BUI_ITEM(intent))->bud_item; } +/* Take a passive ref to the AG containing the space we're mapping. */ +void +xfs_bmap_update_get_group( + struct xfs_mount *mp, + struct xfs_bmap_intent *bi) +{ + xfs_agnumber_t agno; + + agno = XFS_FSB_TO_AGNO(mp, bi->bi_bmap.br_startblock); + + /* + * Bump the intent count on behalf of the deferred rmap and refcount + * intent items that that we can queue when we finish this bmap work. + * This new intent item will bump the intent count before the bmap + * intent drops the intent count, ensuring that the intent count + * remains nonzero across the transaction roll. + */ + bi->bi_pag = xfs_perag_intent_get(mp, agno); +} + +/* Release a passive AG ref after finishing mapping work. */ +static inline void +xfs_bmap_update_put_group( + struct xfs_bmap_intent *bi) +{ + xfs_perag_intent_put(bi->bi_pag); +} + /* Process a deferred rmap update. */ STATIC int xfs_bmap_update_finish_item( @@ -381,6 +410,8 @@ xfs_bmap_update_finish_item( ASSERT(bi->bi_type == XFS_BMAP_UNMAP); return -EAGAIN; } + + xfs_bmap_update_put_group(bi); kmem_cache_free(xfs_bmap_intent_cache, bi); return error; } @@ -393,7 +424,7 @@ xfs_bmap_update_abort_intent( xfs_bui_release(BUI_ITEM(intent)); } -/* Cancel a deferred rmap update. */ +/* Cancel a deferred bmap update. */ STATIC void xfs_bmap_update_cancel_item( struct list_head *item) @@ -401,6 +432,8 @@ xfs_bmap_update_cancel_item( struct xfs_bmap_intent *bi; bi = container_of(item, struct xfs_bmap_intent, bi_list); + + xfs_bmap_update_put_group(bi); kmem_cache_free(xfs_bmap_intent_cache, bi); } @@ -509,10 +542,12 @@ xfs_bui_item_recover( fake.bi_bmap.br_state = (map->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ? XFS_EXT_UNWRITTEN : XFS_EXT_NORM; + xfs_bmap_update_get_group(mp, &fake); error = xfs_trans_log_finish_bmap_update(tp, budp, &fake); if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, map, sizeof(*map)); + xfs_bmap_update_put_group(&fake); if (error) goto err_cancel; diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index a09dd2606479..f032d3a4b727 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -314,15 +314,13 @@ xfs_getbmap_report_one( if (isnullstartblock(got->br_startblock) || got->br_startblock == DELAYSTARTBLOCK) { /* - * Delalloc extents that start beyond EOF can occur due to - * speculative EOF allocation when the delalloc extent is larger - * than the largest freespace extent at conversion time. These - * extents cannot be converted by data writeback, so can exist - * here even if we are not supposed to be finding delalloc - * extents. 
+ * Take the flush completion as being a point-in-time snapshot + * where there are no delalloc extents, and if any new ones + * have been created racily, just skip them as being 'after' + * the flush and so don't get reported. */ - if (got->br_startoff < XFS_B_TO_FSB(ip->i_mount, XFS_ISIZE(ip))) - ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0); + if (!(bmv->bmv_iflags & BMV_IF_DELALLOC)) + return 0; p->bmv_oflags |= BMV_OF_DELALLOC; p->bmv_block = -2; diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c index ffa94102094d..43167f543afc 100644 --- a/fs/xfs/xfs_buf_item_recover.c +++ b/fs/xfs/xfs_buf_item_recover.c @@ -943,6 +943,16 @@ xlog_recover_buf_commit_pass2( if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) { trace_xfs_log_recover_buf_skip(log, buf_f); xlog_recover_validate_buf_type(mp, bp, buf_f, NULLCOMMITLSN); + + /* + * We're skipping replay of this buffer log item due to the log + * item LSN being behind the ondisk buffer. Verify the buffer + * contents since we aren't going to run the write verifier. + */ + if (bp->b_ops) { + bp->b_ops->verify_read(bp); + error = bp->b_error; + } goto out_release; } diff --git a/fs/xfs/xfs_dahash_test.c b/fs/xfs/xfs_dahash_test.c index 230651ab5ce4..0dab5941e080 100644 --- a/fs/xfs/xfs_dahash_test.c +++ b/fs/xfs/xfs_dahash_test.c @@ -9,6 +9,9 @@ #include "xfs_format.h" #include "xfs_da_format.h" #include "xfs_da_btree.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_dir2_priv.h" #include "xfs_dahash_test.h" /* 4096 random bytes */ @@ -533,108 +536,109 @@ static struct dahash_test { uint16_t start; /* random 12 bit offset in buf */ uint16_t length; /* random 8 bit length of test */ xfs_dahash_t dahash; /* expected dahash result */ + xfs_dahash_t ascii_ci_dahash; /* expected ascii-ci dahash result */ } test[] __initdata = { - {0x0567, 0x0097, 0x96951389}, - {0x0869, 0x0055, 0x6455ab4f}, - {0x0c51, 0x00be, 0x8663afde}, - {0x044a, 0x00fc, 0x98fbe432}, - {0x0f29, 0x0079, 0x42371997}, - {0x08ba, 0x0052, 0x942be4f7}, - {0x01f2, 0x0013, 0x5262687e}, - {0x09e3, 0x00e2, 0x8ffb0908}, - {0x007c, 0x0051, 0xb3158491}, - {0x0854, 0x001f, 0x83bb20d9}, - {0x031b, 0x0008, 0x98970bdf}, - {0x0de7, 0x0027, 0xbfbf6f6c}, - {0x0f76, 0x0005, 0x906a7105}, - {0x092e, 0x00d0, 0x86631850}, - {0x0233, 0x0082, 0xdbdd914e}, - {0x04c9, 0x0075, 0x5a400a9e}, - {0x0b66, 0x0099, 0xae128b45}, - {0x000d, 0x00ed, 0xe61c216a}, - {0x0a31, 0x003d, 0xf69663b9}, - {0x00a3, 0x0052, 0x643c39ae}, - {0x0125, 0x00d5, 0x7c310b0d}, - {0x0105, 0x004a, 0x06a77e74}, - {0x0858, 0x008e, 0x265bc739}, - {0x045e, 0x0095, 0x13d6b192}, - {0x0dab, 0x003c, 0xc4498704}, - {0x00cd, 0x00b5, 0x802a4e2d}, - {0x069b, 0x008c, 0x5df60f71}, - {0x0454, 0x006c, 0x5f03d8bb}, - {0x040e, 0x0032, 0x0ce513b5}, - {0x0874, 0x00e2, 0x6a811fb3}, - {0x0521, 0x00b4, 0x93296833}, - {0x0ddc, 0x00cf, 0xf9305338}, - {0x0a70, 0x0023, 0x239549ea}, - {0x083e, 0x0027, 0x2d88ba97}, - {0x0241, 0x00a7, 0xfe0b32e1}, - {0x0dfc, 0x0096, 0x1a11e815}, - {0x023e, 0x001e, 0xebc9a1f3}, - {0x067e, 0x0066, 0xb1067f81}, - {0x09ea, 0x000e, 0x46fd7247}, - {0x036b, 0x008c, 0x1a39acdf}, - {0x078f, 0x0030, 0x964042ab}, - {0x085c, 0x008f, 0x1829edab}, - {0x02ec, 0x009f, 0x6aefa72d}, - {0x043b, 0x00ce, 0x65642ff5}, - {0x0a32, 0x00b8, 0xbd82759e}, - {0x0d3c, 0x0087, 0xf4d66d54}, - {0x09ec, 0x008a, 0x06bfa1ff}, - {0x0902, 0x0015, 0x755025d2}, - {0x08fe, 0x000e, 0xf690ce2d}, - {0x00fb, 0x00dc, 0xe55f1528}, - {0x0eaa, 0x003a, 0x0fe0a8d7}, - {0x05fb, 0x0006, 0x86281cfb}, - {0x0dd1, 0x00a7, 0x60ab51b4}, - 
{0x0005, 0x001b, 0xf51d969b}, - {0x077c, 0x00dd, 0xc2fed268}, - {0x0575, 0x00f5, 0x432c0b1a}, - {0x05be, 0x0088, 0x78baa04b}, - {0x0c89, 0x0068, 0xeda9e428}, - {0x0f5c, 0x0068, 0xec143c76}, - {0x06a8, 0x0009, 0xd72651ce}, - {0x060f, 0x008e, 0x765426cd}, - {0x07b1, 0x0047, 0x2cfcfa0c}, - {0x04f1, 0x0041, 0x55b172f9}, - {0x0e05, 0x00ac, 0x61efde93}, - {0x0bf7, 0x0097, 0x05b83eee}, - {0x04e9, 0x00f3, 0x9928223a}, - {0x023a, 0x0005, 0xdfada9bc}, - {0x0acb, 0x000e, 0x2217cecd}, - {0x0148, 0x0060, 0xbc3f7405}, - {0x0764, 0x0059, 0xcbc201b1}, - {0x021f, 0x0059, 0x5d6b2256}, - {0x0f1e, 0x006c, 0xdefeeb45}, - {0x071c, 0x00b9, 0xb9b59309}, - {0x0564, 0x0063, 0xae064271}, - {0x0b14, 0x0044, 0xdb867d9b}, - {0x0e5a, 0x0055, 0xff06b685}, - {0x015e, 0x00ba, 0x1115ccbc}, - {0x0379, 0x00e6, 0x5f4e58dd}, - {0x013b, 0x0067, 0x4897427e}, - {0x0e64, 0x0071, 0x7af2b7a4}, - {0x0a11, 0x0050, 0x92105726}, - {0x0109, 0x0055, 0xd0d000f9}, - {0x00aa, 0x0022, 0x815d229d}, - {0x09ac, 0x004f, 0x02f9d985}, - {0x0e1b, 0x00ce, 0x5cf92ab4}, - {0x08af, 0x00d8, 0x17ca72d1}, - {0x0e33, 0x000a, 0xda2dba6b}, - {0x0ee3, 0x006a, 0xb00048e5}, - {0x0648, 0x001a, 0x2364b8cb}, - {0x0315, 0x0085, 0x0596fd0d}, - {0x0fbb, 0x003e, 0x298230ca}, - {0x0422, 0x006a, 0x78ada4ab}, - {0x04ba, 0x0073, 0xced1fbc2}, - {0x007d, 0x0061, 0x4b7ff236}, - {0x070b, 0x00d0, 0x261cf0ae}, - {0x0c1a, 0x0035, 0x8be92ee2}, - {0x0af8, 0x0063, 0x824dcf03}, - {0x08f8, 0x006d, 0xd289710c}, - {0x021b, 0x00ee, 0x6ac1c41d}, - {0x05b5, 0x00da, 0x8e52f0e2}, + {0x0567, 0x0097, 0x96951389, 0xc153aa0d}, + {0x0869, 0x0055, 0x6455ab4f, 0xd07f69bf}, + {0x0c51, 0x00be, 0x8663afde, 0xf9add90c}, + {0x044a, 0x00fc, 0x98fbe432, 0xbf2abb76}, + {0x0f29, 0x0079, 0x42371997, 0x282588b3}, + {0x08ba, 0x0052, 0x942be4f7, 0x2e023547}, + {0x01f2, 0x0013, 0x5262687e, 0x5266287e}, + {0x09e3, 0x00e2, 0x8ffb0908, 0x1da892f3}, + {0x007c, 0x0051, 0xb3158491, 0xb67f9e63}, + {0x0854, 0x001f, 0x83bb20d9, 0x22bb21db}, + {0x031b, 0x0008, 0x98970bdf, 0x9cd70adf}, + {0x0de7, 0x0027, 0xbfbf6f6c, 0xae3f296c}, + {0x0f76, 0x0005, 0x906a7105, 0x906a7105}, + {0x092e, 0x00d0, 0x86631850, 0xa3f6ac04}, + {0x0233, 0x0082, 0xdbdd914e, 0x5d8c7aac}, + {0x04c9, 0x0075, 0x5a400a9e, 0x12f60711}, + {0x0b66, 0x0099, 0xae128b45, 0x7551310d}, + {0x000d, 0x00ed, 0xe61c216a, 0xc22d3c4c}, + {0x0a31, 0x003d, 0xf69663b9, 0x51960bf8}, + {0x00a3, 0x0052, 0x643c39ae, 0xa93c73a8}, + {0x0125, 0x00d5, 0x7c310b0d, 0xf221cbb3}, + {0x0105, 0x004a, 0x06a77e74, 0xa4ef4561}, + {0x0858, 0x008e, 0x265bc739, 0xd6c36d9b}, + {0x045e, 0x0095, 0x13d6b192, 0x5f5c1d62}, + {0x0dab, 0x003c, 0xc4498704, 0x10414654}, + {0x00cd, 0x00b5, 0x802a4e2d, 0xfbd17c9d}, + {0x069b, 0x008c, 0x5df60f71, 0x91ddca5f}, + {0x0454, 0x006c, 0x5f03d8bb, 0x5c59fce0}, + {0x040e, 0x0032, 0x0ce513b5, 0xa8cd99b1}, + {0x0874, 0x00e2, 0x6a811fb3, 0xca028316}, + {0x0521, 0x00b4, 0x93296833, 0x2c4d4880}, + {0x0ddc, 0x00cf, 0xf9305338, 0x2c94210d}, + {0x0a70, 0x0023, 0x239549ea, 0x22b561aa}, + {0x083e, 0x0027, 0x2d88ba97, 0x5cd8bb9d}, + {0x0241, 0x00a7, 0xfe0b32e1, 0x17b506b8}, + {0x0dfc, 0x0096, 0x1a11e815, 0xee4141bd}, + {0x023e, 0x001e, 0xebc9a1f3, 0x5689a1f3}, + {0x067e, 0x0066, 0xb1067f81, 0xd9952571}, + {0x09ea, 0x000e, 0x46fd7247, 0x42b57245}, + {0x036b, 0x008c, 0x1a39acdf, 0x58bf1586}, + {0x078f, 0x0030, 0x964042ab, 0xb04218b9}, + {0x085c, 0x008f, 0x1829edab, 0x9ceca89c}, + {0x02ec, 0x009f, 0x6aefa72d, 0x634cc2a7}, + {0x043b, 0x00ce, 0x65642ff5, 0x6c8a584e}, + {0x0a32, 0x00b8, 0xbd82759e, 0x0f96a34f}, + {0x0d3c, 0x0087, 0xf4d66d54, 0xb71ba5f4}, + {0x09ec, 0x008a, 
0x06bfa1ff, 0x576ca80f}, + {0x0902, 0x0015, 0x755025d2, 0x517225c2}, + {0x08fe, 0x000e, 0xf690ce2d, 0xf690cf3d}, + {0x00fb, 0x00dc, 0xe55f1528, 0x707d7d92}, + {0x0eaa, 0x003a, 0x0fe0a8d7, 0x87638cc5}, + {0x05fb, 0x0006, 0x86281cfb, 0x86281cf9}, + {0x0dd1, 0x00a7, 0x60ab51b4, 0xe28ef00c}, + {0x0005, 0x001b, 0xf51d969b, 0xe71dd6d3}, + {0x077c, 0x00dd, 0xc2fed268, 0xdc30c555}, + {0x0575, 0x00f5, 0x432c0b1a, 0x81dd7d16}, + {0x05be, 0x0088, 0x78baa04b, 0xd69b433e}, + {0x0c89, 0x0068, 0xeda9e428, 0xe9b4fa0a}, + {0x0f5c, 0x0068, 0xec143c76, 0x9947067a}, + {0x06a8, 0x0009, 0xd72651ce, 0xd72651ee}, + {0x060f, 0x008e, 0x765426cd, 0x2099626f}, + {0x07b1, 0x0047, 0x2cfcfa0c, 0x1a4baa07}, + {0x04f1, 0x0041, 0x55b172f9, 0x15331a79}, + {0x0e05, 0x00ac, 0x61efde93, 0x320568cc}, + {0x0bf7, 0x0097, 0x05b83eee, 0xc72fb7a3}, + {0x04e9, 0x00f3, 0x9928223a, 0xe8c77de2}, + {0x023a, 0x0005, 0xdfada9bc, 0xdfadb9be}, + {0x0acb, 0x000e, 0x2217cecd, 0x0017d6cd}, + {0x0148, 0x0060, 0xbc3f7405, 0xf5fd6615}, + {0x0764, 0x0059, 0xcbc201b1, 0xbb089bf4}, + {0x021f, 0x0059, 0x5d6b2256, 0xa16a0a59}, + {0x0f1e, 0x006c, 0xdefeeb45, 0xfc34f9d6}, + {0x071c, 0x00b9, 0xb9b59309, 0xb645eae2}, + {0x0564, 0x0063, 0xae064271, 0x954dc6d1}, + {0x0b14, 0x0044, 0xdb867d9b, 0xdf432309}, + {0x0e5a, 0x0055, 0xff06b685, 0xa65ff257}, + {0x015e, 0x00ba, 0x1115ccbc, 0x11c365f4}, + {0x0379, 0x00e6, 0x5f4e58dd, 0x2d176d31}, + {0x013b, 0x0067, 0x4897427e, 0xc40532fe}, + {0x0e64, 0x0071, 0x7af2b7a4, 0x1fb7bf43}, + {0x0a11, 0x0050, 0x92105726, 0xb1185e51}, + {0x0109, 0x0055, 0xd0d000f9, 0x60a60bfd}, + {0x00aa, 0x0022, 0x815d229d, 0x215d379c}, + {0x09ac, 0x004f, 0x02f9d985, 0x10b90b20}, + {0x0e1b, 0x00ce, 0x5cf92ab4, 0x6a477573}, + {0x08af, 0x00d8, 0x17ca72d1, 0x385af156}, + {0x0e33, 0x000a, 0xda2dba6b, 0xda2dbb69}, + {0x0ee3, 0x006a, 0xb00048e5, 0xa9a2decc}, + {0x0648, 0x001a, 0x2364b8cb, 0x3364b1cb}, + {0x0315, 0x0085, 0x0596fd0d, 0xa651740f}, + {0x0fbb, 0x003e, 0x298230ca, 0x7fc617c7}, + {0x0422, 0x006a, 0x78ada4ab, 0xc576ae2a}, + {0x04ba, 0x0073, 0xced1fbc2, 0xaac8455b}, + {0x007d, 0x0061, 0x4b7ff236, 0x347d5739}, + {0x070b, 0x00d0, 0x261cf0ae, 0xc7fb1c10}, + {0x0c1a, 0x0035, 0x8be92ee2, 0x8be9b4e1}, + {0x0af8, 0x0063, 0x824dcf03, 0x53010388}, + {0x08f8, 0x006d, 0xd289710c, 0x30418edd}, + {0x021b, 0x00ee, 0x6ac1c41d, 0x2557e9a3}, + {0x05b5, 0x00da, 0x8e52f0e2, 0x98531012}, }; int __init @@ -644,12 +648,19 @@ xfs_dahash_test(void) unsigned int errors = 0; for (i = 0; i < ARRAY_SIZE(test); i++) { + struct xfs_name xname = { }; xfs_dahash_t hash; hash = xfs_da_hashname(test_buf + test[i].start, test[i].length); if (hash != test[i].dahash) errors++; + + xname.name = test_buf + test[i].start; + xname.len = test[i].length; + hash = xfs_ascii_ci_hashname(&xname); + if (hash != test[i].ascii_ci_dahash) + errors++; } if (errors) { diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index 8fb90da89787..7f071757f278 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -798,7 +798,6 @@ xfs_qm_dqget_cache_insert( error = radix_tree_insert(tree, id, dqp); if (unlikely(error)) { /* Duplicate found! Caller must try again. */ - WARN_ON(error != -EEXIST); mutex_unlock(&qi->qi_tree_lock); trace_xfs_dqget_dup(dqp); return error; diff --git a/fs/xfs/xfs_drain.c b/fs/xfs/xfs_drain.c new file mode 100644 index 000000000000..005a66be44a2 --- /dev/null +++ b/fs/xfs/xfs_drain.c @@ -0,0 +1,166 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_ag.h" +#include "xfs_trace.h" + +/* + * Use a static key here to reduce the overhead of xfs_drain_rele. If the + * compiler supports jump labels, the static branch will be replaced by a nop + * sled when there are no xfs_drain_wait callers. Online fsck is currently + * the only caller, so this is a reasonable tradeoff. + * + * Note: Patching the kernel code requires taking the cpu hotplug lock. Other + * parts of the kernel allocate memory with that lock held, which means that + * XFS callers cannot hold any locks that might be used by memory reclaim or + * writeback when calling the static_branch_{inc,dec} functions. + */ +static DEFINE_STATIC_KEY_FALSE(xfs_drain_waiter_gate); + +void +xfs_drain_wait_disable(void) +{ + static_branch_dec(&xfs_drain_waiter_gate); +} + +void +xfs_drain_wait_enable(void) +{ + static_branch_inc(&xfs_drain_waiter_gate); +} + +void +xfs_defer_drain_init( + struct xfs_defer_drain *dr) +{ + atomic_set(&dr->dr_count, 0); + init_waitqueue_head(&dr->dr_waiters); +} + +void +xfs_defer_drain_free(struct xfs_defer_drain *dr) +{ + ASSERT(atomic_read(&dr->dr_count) == 0); +} + +/* Increase the pending intent count. */ +static inline void xfs_defer_drain_grab(struct xfs_defer_drain *dr) +{ + atomic_inc(&dr->dr_count); +} + +static inline bool has_waiters(struct wait_queue_head *wq_head) +{ + /* + * This memory barrier is paired with the one in set_current_state on + * the waiting side. + */ + smp_mb__after_atomic(); + return waitqueue_active(wq_head); +} + +/* Decrease the pending intent count, and wake any waiters, if appropriate. */ +static inline void xfs_defer_drain_rele(struct xfs_defer_drain *dr) +{ + if (atomic_dec_and_test(&dr->dr_count) && + static_branch_unlikely(&xfs_drain_waiter_gate) && + has_waiters(&dr->dr_waiters)) + wake_up(&dr->dr_waiters); +} + +/* Are there intents pending? */ +static inline bool xfs_defer_drain_busy(struct xfs_defer_drain *dr) +{ + return atomic_read(&dr->dr_count) > 0; +} + +/* + * Wait for the pending intent count for a drain to hit zero. + * + * Callers must not hold any locks that would prevent intents from being + * finished. + */ +static inline int xfs_defer_drain_wait(struct xfs_defer_drain *dr) +{ + return wait_event_killable(dr->dr_waiters, !xfs_defer_drain_busy(dr)); +} + +/* + * Get a passive reference to an AG and declare an intent to update its + * metadata. + */ +struct xfs_perag * +xfs_perag_intent_get( + struct xfs_mount *mp, + xfs_agnumber_t agno) +{ + struct xfs_perag *pag; + + pag = xfs_perag_get(mp, agno); + if (!pag) + return NULL; + + xfs_perag_intent_hold(pag); + return pag; +} + +/* + * Release our intent to update this AG's metadata, and then release our + * passive ref to the AG. + */ +void +xfs_perag_intent_put( + struct xfs_perag *pag) +{ + xfs_perag_intent_rele(pag); + xfs_perag_put(pag); +} + +/* + * Declare an intent to update AG metadata. Other threads that need exclusive + * access can decide to back off if they see declared intentions. + */ +void +xfs_perag_intent_hold( + struct xfs_perag *pag) +{ + trace_xfs_perag_intent_hold(pag, __return_address); + xfs_defer_drain_grab(&pag->pag_intents_drain); +} + +/* Release our intent to update this AG's metadata. 
*/ +void +xfs_perag_intent_rele( + struct xfs_perag *pag) +{ + trace_xfs_perag_intent_rele(pag, __return_address); + xfs_defer_drain_rele(&pag->pag_intents_drain); +} + +/* + * Wait for the intent update count for this AG to hit zero. + * Callers must not hold any AG header buffers. + */ +int +xfs_perag_intent_drain( + struct xfs_perag *pag) +{ + trace_xfs_perag_wait_intents(pag, __return_address); + return xfs_defer_drain_wait(&pag->pag_intents_drain); +} + +/* Has anyone declared an intent to update this AG? */ +bool +xfs_perag_intent_busy( + struct xfs_perag *pag) +{ + return xfs_defer_drain_busy(&pag->pag_intents_drain); +} diff --git a/fs/xfs/xfs_drain.h b/fs/xfs/xfs_drain.h new file mode 100644 index 000000000000..50a5772a8296 --- /dev/null +++ b/fs/xfs/xfs_drain.h @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2022-2023 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef XFS_DRAIN_H_ +#define XFS_DRAIN_H_ + +struct xfs_perag; + +#ifdef CONFIG_XFS_DRAIN_INTENTS +/* + * Passive drain mechanism. This data structure tracks a count of some items + * and contains a waitqueue for callers who would like to wake up when the + * count hits zero. + */ +struct xfs_defer_drain { + /* Number of items pending in some part of the filesystem. */ + atomic_t dr_count; + + /* Queue to wait for dri_count to go to zero */ + struct wait_queue_head dr_waiters; +}; + +void xfs_defer_drain_init(struct xfs_defer_drain *dr); +void xfs_defer_drain_free(struct xfs_defer_drain *dr); + +void xfs_drain_wait_disable(void); +void xfs_drain_wait_enable(void); + +/* + * Deferred Work Intent Drains + * =========================== + * + * When a writer thread executes a chain of log intent items, the AG header + * buffer locks will cycle during a transaction roll to get from one intent + * item to the next in a chain. Although scrub takes all AG header buffer + * locks, this isn't sufficient to guard against scrub checking an AG while + * that writer thread is in the middle of finishing a chain because there's no + * higher level locking primitive guarding allocation groups. + * + * When there's a collision, cross-referencing between data structures (e.g. + * rmapbt and refcountbt) yields false corruption events; if repair is running, + * this results in incorrect repairs, which is catastrophic. + * + * The solution is to the perag structure the count of active intents and make + * scrub wait until it has both AG header buffer locks and the intent counter + * reaches zero. It is therefore critical that deferred work threads hold the + * AGI or AGF buffers when decrementing the intent counter. + * + * Given a list of deferred work items, the deferred work manager will complete + * a work item and all the sub-items that the parent item creates before moving + * on to the next work item in the list. This is also true for all levels of + * sub-items. Writer threads are permitted to queue multiple work items + * targetting the same AG, so a deferred work item (such as a BUI) that creates + * sub-items (such as RUIs) must bump the intent counter and maintain it until + * the sub-items can themselves bump the intent counter. + * + * Therefore, the intent count tracks entire lifetimes of deferred work items. + * All functions that create work items must increment the intent counter as + * soon as the item is added to the transaction and cannot drop the counter + * until the item is finished or cancelled. 
+ */ +struct xfs_perag *xfs_perag_intent_get(struct xfs_mount *mp, + xfs_agnumber_t agno); +void xfs_perag_intent_put(struct xfs_perag *pag); + +void xfs_perag_intent_hold(struct xfs_perag *pag); +void xfs_perag_intent_rele(struct xfs_perag *pag); + +int xfs_perag_intent_drain(struct xfs_perag *pag); +bool xfs_perag_intent_busy(struct xfs_perag *pag); +#else +struct xfs_defer_drain { /* empty */ }; + +#define xfs_defer_drain_free(dr) ((void)0) +#define xfs_defer_drain_init(dr) ((void)0) + +#define xfs_perag_intent_get(mp, agno) xfs_perag_get((mp), (agno)) +#define xfs_perag_intent_put(pag) xfs_perag_put(pag) + +static inline void xfs_perag_intent_hold(struct xfs_perag *pag) { } +static inline void xfs_perag_intent_rele(struct xfs_perag *pag) { } + +#endif /* CONFIG_XFS_DRAIN_INTENTS */ + +#endif /* XFS_DRAIN_H_ */ diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 011b50469301..f9e36b810663 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -351,8 +351,6 @@ xfs_trans_free_extent( struct xfs_mount *mp = tp->t_mountp; struct xfs_extent *extp; uint next_extent; - xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, - xefi->xefi_startblock); xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(mp, xefi->xefi_startblock); int error; @@ -363,12 +361,13 @@ xfs_trans_free_extent( if (xefi->xefi_flags & XFS_EFI_BMBT_BLOCK) oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK; - trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, - xefi->xefi_blockcount); + trace_xfs_bmap_free_deferred(tp->t_mountp, xefi->xefi_pag->pag_agno, 0, + agbno, xefi->xefi_blockcount); - error = __xfs_free_extent(tp, xefi->xefi_startblock, + error = __xfs_free_extent(tp, xefi->xefi_pag, agbno, xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE, xefi->xefi_flags & XFS_EFI_SKIP_DISCARD); + /* * Mark the transaction dirty, even on error. This ensures the * transaction is aborted, which: @@ -396,14 +395,13 @@ xfs_extent_free_diff_items( const struct list_head *a, const struct list_head *b) { - struct xfs_mount *mp = priv; struct xfs_extent_free_item *ra; struct xfs_extent_free_item *rb; ra = container_of(a, struct xfs_extent_free_item, xefi_list); rb = container_of(b, struct xfs_extent_free_item, xefi_list); - return XFS_FSB_TO_AGNO(mp, ra->xefi_startblock) - - XFS_FSB_TO_AGNO(mp, rb->xefi_startblock); + + return ra->xefi_pag->pag_agno - rb->xefi_pag->pag_agno; } /* Log a free extent to the intent item. */ @@ -462,6 +460,26 @@ xfs_extent_free_create_done( return &xfs_trans_get_efd(tp, EFI_ITEM(intent), count)->efd_item; } +/* Take a passive ref to the AG containing the space we're freeing. */ +void +xfs_extent_free_get_group( + struct xfs_mount *mp, + struct xfs_extent_free_item *xefi) +{ + xfs_agnumber_t agno; + + agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock); + xefi->xefi_pag = xfs_perag_intent_get(mp, agno); +} + +/* Release a passive AG ref after some freeing work. */ +static inline void +xfs_extent_free_put_group( + struct xfs_extent_free_item *xefi) +{ + xfs_perag_intent_put(xefi->xefi_pag); +} + /* Process a free extent. 
*/ STATIC int xfs_extent_free_finish_item( @@ -476,6 +494,8 @@ xfs_extent_free_finish_item( xefi = container_of(item, struct xfs_extent_free_item, xefi_list); error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); + + xfs_extent_free_put_group(xefi); kmem_cache_free(xfs_extfree_item_cache, xefi); return error; } @@ -496,6 +516,8 @@ xfs_extent_free_cancel_item( struct xfs_extent_free_item *xefi; xefi = container_of(item, struct xfs_extent_free_item, xefi_list); + + xfs_extent_free_put_group(xefi); kmem_cache_free(xfs_extfree_item_cache, xefi); } @@ -526,24 +548,21 @@ xfs_agfl_free_finish_item( struct xfs_extent *extp; struct xfs_buf *agbp; int error; - xfs_agnumber_t agno; xfs_agblock_t agbno; uint next_extent; - struct xfs_perag *pag; xefi = container_of(item, struct xfs_extent_free_item, xefi_list); ASSERT(xefi->xefi_blockcount == 1); - agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock); agbno = XFS_FSB_TO_AGBNO(mp, xefi->xefi_startblock); oinfo.oi_owner = xefi->xefi_owner; - trace_xfs_agfl_free_deferred(mp, agno, 0, agbno, xefi->xefi_blockcount); + trace_xfs_agfl_free_deferred(mp, xefi->xefi_pag->pag_agno, 0, agbno, + xefi->xefi_blockcount); - pag = xfs_perag_get(mp, agno); - error = xfs_alloc_read_agf(pag, tp, 0, &agbp); + error = xfs_alloc_read_agf(xefi->xefi_pag, tp, 0, &agbp); if (!error) - error = xfs_free_agfl_block(tp, agno, agbno, agbp, &oinfo); - xfs_perag_put(pag); + error = xfs_free_agfl_block(tp, xefi->xefi_pag->pag_agno, + agbno, agbp, &oinfo); /* * Mark the transaction dirty, even on error. This ensures the @@ -562,6 +581,7 @@ xfs_agfl_free_finish_item( extp->ext_len = xefi->xefi_blockcount; efdp->efd_next_extent++; + xfs_extent_free_put_group(xefi); kmem_cache_free(xfs_extfree_item_cache, xefi); return error; } @@ -632,7 +652,9 @@ xfs_efi_item_recover( fake.xefi_startblock = extp->ext_start; fake.xefi_blockcount = extp->ext_len; + xfs_extent_free_get_group(mp, &fake); error = xfs_trans_free_extent(tp, efdp, &fake); + xfs_extent_free_put_group(&fake); if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, extp, sizeof(*extp)); diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index c9a7e270a428..351849fc18ff 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -767,7 +767,8 @@ again: return 0; out_error_or_again: - if (!(flags & XFS_IGET_INCORE) && error == -EAGAIN) { + if (!(flags & (XFS_IGET_INCORE | XFS_IGET_NORETRY)) && + error == -EAGAIN) { delay(1); goto again; } diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 6cd180721659..87910191a9dd 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -34,10 +34,13 @@ struct xfs_icwalk { /* * Flags for xfs_iget() */ -#define XFS_IGET_CREATE 0x1 -#define XFS_IGET_UNTRUSTED 0x2 -#define XFS_IGET_DONTCACHE 0x4 -#define XFS_IGET_INCORE 0x8 /* don't read from disk or reinit */ +#define XFS_IGET_CREATE (1U << 0) +#define XFS_IGET_UNTRUSTED (1U << 1) +#define XFS_IGET_DONTCACHE (1U << 2) +/* don't read from disk or reinit */ +#define XFS_IGET_INCORE (1U << 3) +/* Return -EAGAIN immediately if the inode is unavailable. 
*/ +#define XFS_IGET_NORETRY (1U << 4) int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino, uint flags, uint lock_flags, xfs_inode_t **ipp); diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c index 43005ce8bd48..2ddccb172fa0 100644 --- a/fs/xfs/xfs_iunlink_item.c +++ b/fs/xfs/xfs_iunlink_item.c @@ -168,9 +168,7 @@ xfs_iunlink_log_inode( iup->ip = ip; iup->next_agino = next_agino; iup->old_agino = ip->i_next_unlinked; - - atomic_inc(&pag->pag_ref); - iup->pag = pag; + iup->pag = xfs_perag_hold(pag); xfs_trans_add_item(tp, &iup->item); tp->t_flags |= XFS_TRANS_DIRTY; diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c index 21be93bf006d..b3275e8d47b6 100644 --- a/fs/xfs/xfs_iwalk.c +++ b/fs/xfs/xfs_iwalk.c @@ -667,11 +667,10 @@ xfs_iwalk_threaded( iwag->mp = mp; /* - * perag is being handed off to async work, so take another + * perag is being handed off to async work, so take a passive * reference for the async work to release. */ - atomic_inc(&pag->pag_ref); - iwag->pag = pag; + iwag->pag = xfs_perag_hold(pag); iwag->iwalk_fn = iwalk_fn; iwag->data = data; iwag->startino = startino; diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index e88f18f85e4b..74dcb05069e8 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -80,6 +80,7 @@ typedef __u32 xfs_nlink_t; #include "xfs_cksum.h" #include "xfs_buf.h" #include "xfs_message.h" +#include "xfs_drain.h" #ifdef __BIG_ENDIAN #define XFS_NATIVE_HOST 1 diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c index 48d771a76add..edd8587658d5 100644 --- a/fs/xfs/xfs_refcount_item.c +++ b/fs/xfs/xfs_refcount_item.c @@ -20,6 +20,7 @@ #include "xfs_error.h" #include "xfs_log_priv.h" #include "xfs_log_recover.h" +#include "xfs_ag.h" struct kmem_cache *xfs_cui_cache; struct kmem_cache *xfs_cud_cache; @@ -279,14 +280,13 @@ xfs_refcount_update_diff_items( const struct list_head *a, const struct list_head *b) { - struct xfs_mount *mp = priv; struct xfs_refcount_intent *ra; struct xfs_refcount_intent *rb; ra = container_of(a, struct xfs_refcount_intent, ri_list); rb = container_of(b, struct xfs_refcount_intent, ri_list); - return XFS_FSB_TO_AGNO(mp, ra->ri_startblock) - - XFS_FSB_TO_AGNO(mp, rb->ri_startblock); + + return ra->ri_pag->pag_agno - rb->ri_pag->pag_agno; } /* Set the phys extent flags for this reverse mapping. */ @@ -365,6 +365,26 @@ xfs_refcount_update_create_done( return &xfs_trans_get_cud(tp, CUI_ITEM(intent))->cud_item; } +/* Take a passive ref to the AG containing the space we're refcounting. */ +void +xfs_refcount_update_get_group( + struct xfs_mount *mp, + struct xfs_refcount_intent *ri) +{ + xfs_agnumber_t agno; + + agno = XFS_FSB_TO_AGNO(mp, ri->ri_startblock); + ri->ri_pag = xfs_perag_intent_get(mp, agno); +} + +/* Release a passive AG ref after finishing refcounting work. */ +static inline void +xfs_refcount_update_put_group( + struct xfs_refcount_intent *ri) +{ + xfs_perag_intent_put(ri->ri_pag); +} + /* Process a deferred refcount update. 
*/ STATIC int xfs_refcount_update_finish_item( @@ -386,6 +406,8 @@ xfs_refcount_update_finish_item( ri->ri_type == XFS_REFCOUNT_DECREASE); return -EAGAIN; } + + xfs_refcount_update_put_group(ri); kmem_cache_free(xfs_refcount_intent_cache, ri); return error; } @@ -406,6 +428,8 @@ xfs_refcount_update_cancel_item( struct xfs_refcount_intent *ri; ri = container_of(item, struct xfs_refcount_intent, ri_list); + + xfs_refcount_update_put_group(ri); kmem_cache_free(xfs_refcount_intent_cache, ri); } @@ -520,9 +544,13 @@ xfs_cui_item_recover( fake.ri_startblock = pmap->pe_startblock; fake.ri_blockcount = pmap->pe_len; - if (!requeue_only) + + if (!requeue_only) { + xfs_refcount_update_get_group(mp, &fake); error = xfs_trans_log_finish_refcount_update(tp, cudp, &fake, &rcur); + xfs_refcount_update_put_group(&fake); + } if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, &cuip->cui_format, diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c index a1619d67015f..520c7ebdfed8 100644 --- a/fs/xfs/xfs_rmap_item.c +++ b/fs/xfs/xfs_rmap_item.c @@ -20,6 +20,7 @@ #include "xfs_error.h" #include "xfs_log_priv.h" #include "xfs_log_recover.h" +#include "xfs_ag.h" struct kmem_cache *xfs_rui_cache; struct kmem_cache *xfs_rud_cache; @@ -320,14 +321,13 @@ xfs_rmap_update_diff_items( const struct list_head *a, const struct list_head *b) { - struct xfs_mount *mp = priv; struct xfs_rmap_intent *ra; struct xfs_rmap_intent *rb; ra = container_of(a, struct xfs_rmap_intent, ri_list); rb = container_of(b, struct xfs_rmap_intent, ri_list); - return XFS_FSB_TO_AGNO(mp, ra->ri_bmap.br_startblock) - - XFS_FSB_TO_AGNO(mp, rb->ri_bmap.br_startblock); + + return ra->ri_pag->pag_agno - rb->ri_pag->pag_agno; } /* Log rmap updates in the intent item. */ @@ -390,6 +390,26 @@ xfs_rmap_update_create_done( return &xfs_trans_get_rud(tp, RUI_ITEM(intent))->rud_item; } +/* Take a passive ref to the AG containing the space we're rmapping. */ +void +xfs_rmap_update_get_group( + struct xfs_mount *mp, + struct xfs_rmap_intent *ri) +{ + xfs_agnumber_t agno; + + agno = XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock); + ri->ri_pag = xfs_perag_intent_get(mp, agno); +} + +/* Release a passive AG ref after finishing rmapping work. */ +static inline void +xfs_rmap_update_put_group( + struct xfs_rmap_intent *ri) +{ + xfs_perag_intent_put(ri->ri_pag); +} + /* Process a deferred rmap update. */ STATIC int xfs_rmap_update_finish_item( @@ -405,6 +425,8 @@ xfs_rmap_update_finish_item( error = xfs_trans_log_finish_rmap_update(tp, RUD_ITEM(done), ri, state); + + xfs_rmap_update_put_group(ri); kmem_cache_free(xfs_rmap_intent_cache, ri); return error; } @@ -425,6 +447,8 @@ xfs_rmap_update_cancel_item( struct xfs_rmap_intent *ri; ri = container_of(item, struct xfs_rmap_intent, ri_list); + + xfs_rmap_update_put_group(ri); kmem_cache_free(xfs_rmap_intent_cache, ri); } @@ -559,11 +583,13 @@ xfs_rui_item_recover( fake.ri_bmap.br_state = (map->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ? XFS_EXT_UNWRITTEN : XFS_EXT_NORM; + xfs_rmap_update_get_group(mp, &fake); error = xfs_trans_log_finish_rmap_update(tp, rudp, &fake, &rcur); if (error == -EFSCORRUPTED) XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, map, sizeof(*map)); + xfs_rmap_update_put_group(&fake); if (error) goto abort_error; diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 4f814f9e12ab..4d2e87462ac4 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1548,6 +1548,19 @@ xfs_fs_fill_super( #endif } + /* ASCII case insensitivity is undergoing deprecation. 
*/ + if (xfs_has_asciici(mp)) { +#ifdef CONFIG_XFS_SUPPORT_ASCII_CI + xfs_warn_once(mp, + "Deprecated ASCII case-insensitivity feature (ascii-ci=1) will not be supported after September 2030."); +#else + xfs_warn(mp, + "Deprecated ASCII case-insensitivity feature (ascii-ci=1) not supported by kernel."); + error = -EINVAL; + goto out_free_sb; +#endif + } + /* Filesystem claims it needs repair, so refuse the mount. */ if (xfs_has_needsrepair(mp)) { xfs_warn(mp, "Filesystem needs repair. Please run xfs_repair."); diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 9c0006c55fec..cd4ca5b1fcb0 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -190,6 +190,7 @@ DEFINE_EVENT(xfs_perag_class, name, \ TP_ARGS(pag, caller_ip)) DEFINE_PERAG_REF_EVENT(xfs_perag_get); DEFINE_PERAG_REF_EVENT(xfs_perag_get_tag); +DEFINE_PERAG_REF_EVENT(xfs_perag_hold); DEFINE_PERAG_REF_EVENT(xfs_perag_put); DEFINE_PERAG_REF_EVENT(xfs_perag_grab); DEFINE_PERAG_REF_EVENT(xfs_perag_grab_tag); @@ -2686,6 +2687,44 @@ DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_deferred); DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_agfl_free_defer); DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_agfl_free_deferred); +DECLARE_EVENT_CLASS(xfs_defer_pending_item_class, + TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp, + void *item), + TP_ARGS(mp, dfp, item), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(int, type) + __field(void *, intent) + __field(void *, item) + __field(char, committed) + __field(int, nr) + ), + TP_fast_assign( + __entry->dev = mp ? mp->m_super->s_dev : 0; + __entry->type = dfp->dfp_type; + __entry->intent = dfp->dfp_intent; + __entry->item = item; + __entry->committed = dfp->dfp_done != NULL; + __entry->nr = dfp->dfp_count; + ), + TP_printk("dev %d:%d optype %d intent %p item %p committed %d nr %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->type, + __entry->intent, + __entry->item, + __entry->committed, + __entry->nr) +) +#define DEFINE_DEFER_PENDING_ITEM_EVENT(name) \ +DEFINE_EVENT(xfs_defer_pending_item_class, name, \ + TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp, \ + void *item), \ + TP_ARGS(mp, dfp, item)) + +DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_add_item); +DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_cancel_item); +DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_finish_item); + /* rmap tracepoints */ DECLARE_EVENT_CLASS(xfs_rmap_class, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, @@ -4325,6 +4364,39 @@ TRACE_EVENT(xfs_force_shutdown, __entry->line_num) ); +#ifdef CONFIG_XFS_DRAIN_INTENTS +DECLARE_EVENT_CLASS(xfs_perag_intents_class, + TP_PROTO(struct xfs_perag *pag, void *caller_ip), + TP_ARGS(pag, caller_ip), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(long, nr_intents) + __field(void *, caller_ip) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->nr_intents = atomic_read(&pag->pag_intents_drain.dr_count); + __entry->caller_ip = caller_ip; + ), + TP_printk("dev %d:%d agno 0x%x intents %ld caller %pS", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->nr_intents, + __entry->caller_ip) +); + +#define DEFINE_PERAG_INTENTS_EVENT(name) \ +DEFINE_EVENT(xfs_perag_intents_class, name, \ + TP_PROTO(struct xfs_perag *pag, void *caller_ip), \ + TP_ARGS(pag, caller_ip)) +DEFINE_PERAG_INTENTS_EVENT(xfs_perag_intent_hold); +DEFINE_PERAG_INTENTS_EVENT(xfs_perag_intent_rele); +DEFINE_PERAG_INTENTS_EVENT(xfs_perag_wait_intents); + +#endif /* 
CONFIG_XFS_DRAIN_INTENTS */ + #endif /* _TRACE_XFS_H */ #undef TRACE_INCLUDE_PATH
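
The drain added in fs/xfs/xfs_drain.c boils down to a counter of pending intent items plus a waitqueue: writer threads bump the counter for every deferred work item they queue against an AG and drop it when the item finishes or is cancelled, while scrub sleeps until the counter reaches zero before cross-referencing that AG. The userspace sketch below models that behaviour with a mutex and condition variable; drain_hold/drain_rele/drain_wait are illustrative stand-ins for xfs_perag_intent_hold/rele/drain, not the kernel API, and the static-key gating of the wakeup is omitted.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Counter of pending intents plus a wait channel, like struct xfs_defer_drain. */
struct drain {
	pthread_mutex_t	lock;
	pthread_cond_t	idle;
	unsigned long	count;
};

/* Declare an intent against the group: cf. xfs_perag_intent_hold(). */
static void drain_hold(struct drain *dr)
{
	pthread_mutex_lock(&dr->lock);
	dr->count++;
	pthread_mutex_unlock(&dr->lock);
}

/* Retire an intent and wake any waiter: cf. xfs_perag_intent_rele(). */
static void drain_rele(struct drain *dr)
{
	pthread_mutex_lock(&dr->lock);
	if (--dr->count == 0)
		pthread_cond_broadcast(&dr->idle);
	pthread_mutex_unlock(&dr->lock);
}

/* Sleep until nothing is pending: cf. xfs_perag_intent_drain(). */
static void drain_wait(struct drain *dr)
{
	pthread_mutex_lock(&dr->lock);
	while (dr->count > 0)
		pthread_cond_wait(&dr->idle, &dr->lock);
	pthread_mutex_unlock(&dr->lock);
}

/* A "writer" thread finishing one deferred work item. */
static void *writer(void *arg)
{
	struct drain *dr = arg;

	sleep(1);		/* pretend to roll transactions */
	drain_rele(dr);		/* item finished, drop the intent count */
	return NULL;
}

int main(void)
{
	static struct drain dr = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.idle = PTHREAD_COND_INITIALIZER,
	};
	pthread_t tid;

	drain_hold(&dr);	/* writer queues a deferred work item */
	pthread_create(&tid, NULL, writer, &dr);

	drain_wait(&dr);	/* "scrub" blocks until the writer is done */
	printf("drained, safe to cross-reference\n");

	pthread_join(tid, NULL);
	return 0;
}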
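
A related pattern repeats across the bmap, extent-free, refcount and rmap intent items in this merge: the per-AG reference (and with it the drain count) is taken by the *_get_group helper when the work item is created, and released in both the finish and the cancel paths so the AG stays pinned for the item's entire lifetime. A minimal sketch of that pairing, with a bare reference count standing in for struct xfs_perag and invented names throughout:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for struct xfs_perag: just a reference count. */
struct group {
	unsigned long	refcount;
};

/* A deferred work item that pins its group for its whole life. */
struct work_item {
	struct group	*grp;
};

static void group_get(struct group *grp)
{
	grp->refcount++;
}

static void group_put(struct group *grp)
{
	assert(grp->refcount > 0);
	grp->refcount--;
}

/* Analogous to xfs_extent_free_get_group(): pin the AG at creation time. */
static struct work_item *work_item_create(struct group *grp)
{
	struct work_item *wi = malloc(sizeof(*wi));

	group_get(grp);
	wi->grp = grp;
	return wi;
}

/* Both completion paths must drop the reference, exactly once. */
static void work_item_finish(struct work_item *wi)
{
	group_put(wi->grp);	/* cf. *_put_group() in the finish_item hook */
	free(wi);
}

static void work_item_cancel(struct work_item *wi)
{
	group_put(wi->grp);	/* cf. *_put_group() in the cancel_item hook */
	free(wi);
}

int main(void)
{
	struct group ag = { .refcount = 1 };
	struct work_item *wi;

	wi = work_item_create(&ag);
	work_item_finish(wi);

	wi = work_item_create(&ag);
	work_item_cancel(wi);

	printf("final refcount %lu\n", ag.refcount);	/* prints 1 */
	return 0;
}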
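
The xfs_dahash_test.c change extends the existing golden-value table with a second expected hash per entry so that the ascii-ci dirent hash is exercised alongside the regular one. The kernel test compares against precomputed constants; the sketch below keeps the two-variant shape but, to stay self-contained, only checks the defining property: the case-insensitive variant must fold case while the plain one must not. The djb2-style hashes here are stand-ins, not xfs_da_hashname() or xfs_ascii_ci_hashname().

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/* Plain hash: every byte contributes as-is (djb2, a stand-in). */
static uint32_t hash_plain(const char *p, size_t len)
{
	uint32_t h = 5381;

	while (len--)
		h = h * 33 + (unsigned char)*p++;
	return h;
}

/* Case-insensitive variant: fold each byte before mixing it in. */
static uint32_t hash_ci(const char *p, size_t len)
{
	uint32_t h = 5381;

	while (len--)
		h = h * 33 + (unsigned char)tolower((unsigned char)*p++);
	return h;
}

int main(void)
{
	const char *upper = "FOX";
	const char *lower = "fox";
	unsigned int errors = 0;

	/* The ci hash must give the same answer for both spellings... */
	if (hash_ci(upper, 3) != hash_ci(lower, 3))
		errors++;
	/* ...while the plain hash tells these two inputs apart. */
	if (hash_plain(upper, 3) == hash_plain(lower, 3))
		errors++;

	printf("%u errors\n", errors);
	return errors ? 1 : 0;
}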