path: root/fs/btrfs
Age | Commit message | Author
2018-04-22 | Merge tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds
Pull btrfs fixes from David Sterba:

 "This contains a few fixups to the qgroup patches that were merged this dev cycle, unaligned access fix, blockgroup removal corner case fix and a small debugging output tweak"

* tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: print-tree: debugging output enhancement
  btrfs: Fix race condition between delayed refs and blockgroup removal
  btrfs: fix unaligned access in readdir
  btrfs: Fix wrong btrfs_delalloc_release_extents parameter
  btrfs: delayed-inode: Remove wrong qgroup meta reservation calls
  btrfs: qgroup: Use independent and accurate per inode qgroup rsv
  btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
2018-04-20 | btrfs: print-tree: debugging output enhancement | Qu Wenruo
This patch enhances the following things:

- tree block header
  * add generation and owner output for node and leaf
- node pointer generation output
- allow btrfs_print_tree() to not follow nodes
  * just like btrfs-progs

Please note that, although function btrfs_print_tree() is not called by anyone right now, it's still a pretty useful function to debug kernel. So that function is still kept for later use.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-20 | btrfs: Fix race condition between delayed refs and blockgroup removal | Nikolay Borisov
When the delayed refs for a head are all run, eventually cleanup_ref_head is called which (in case of deletion) obtains a reference for the relevant btrfs_space_info struct by querying the bg for the range. This is problematic because when the last extent of a bg is deleted a race window emerges between removal of that bg and the subsequent invocation of cleanup_ref_head. This can result in cache being null and either a null pointer dereference or assertion failure.

 task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
 RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
 RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
 RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
 RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
 R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
 R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
 FS: 00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
  btrfs_run_delayed_refs+0x68/0x250 [btrfs]
  btrfs_should_end_transaction+0x42/0x60 [btrfs]
  btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
  btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
  evict+0xc6/0x190
  do_unlinkat+0x19c/0x300
  do_syscall_64+0x74/0x140
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7fbf589c57a7

To fix this, introduce a new flag "is_system" to head_ref structs, which is populated at insertion time. This allows to decouple the querying for the spaceinfo from querying the possibly deleted bg.

Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
CC: stable@vger.kernel.org # 4.14+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
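The idea behind the "is_system" flag can be sketched in userspace C. This is only an illustration, not the kernel code: struct ref_head, make_head() and space_info_flags() are hypothetical stand-ins, and the BG_* constants are simplified placeholders. The point is that the kind of space is recorded when the head is created, so the later cleanup never needs to look up a block group that may already be gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the block group type flags (assumed values). */
#define BG_DATA     (1ULL << 0)
#define BG_SYSTEM   (1ULL << 1)
#define BG_METADATA (1ULL << 2)

/* Hypothetical, reduced version of a delayed ref head. */
struct ref_head {
	uint64_t bytenr;
	bool is_data;
	bool is_system;	/* filled in once, when the head is inserted */
};

/* Insertion time: the caller still knows what kind of space it touches. */
static struct ref_head make_head(uint64_t bytenr, bool is_data, bool is_system)
{
	struct ref_head head = {
		.bytenr = bytenr,
		.is_data = is_data,
		.is_system = is_system,
	};

	/* A head cannot describe data and system space at the same time. */
	assert(!(is_data && is_system));
	return head;
}

/* Cleanup time: pick the space type from the head alone, no bg lookup. */
static uint64_t space_info_flags(const struct ref_head *head)
{
	if (head->is_data)
		return BG_DATA;
	if (head->is_system)
		return BG_SYSTEM;
	return BG_METADATA;
}
```

A head created with make_head(4096, false, true) maps to BG_SYSTEM even if the block group covering bytenr 4096 has been removed in the meantime.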
2018-04-19 | btrfs: fix unaligned access in readdir | David Sterba
The last update to readdir introduced a temporary buffer to store the emitted readdir data, but as there are file names of variable length, there's a lot of unaligned access.

This was observed on a sparc64 machine:

  Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs]

Fixes: 23b5ec74943 ("btrfs: fix readdir deadlock with pagefault")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: René Rebe <rene@exactcode.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
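Why variable-length names cause unaligned access, and one portable way around it, can be shown with a small userspace sketch. The record layout (an 8-byte inode number followed by the raw name bytes) and the helpers emit_entry() and read_ino() are assumptions made up for this example; the kernel fix uses its own unaligned access helpers rather than memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append one hypothetical record at byte offset 'off' and return the offset
 * of the next record. Because names have arbitrary lengths, the next record
 * usually starts at an address that is not 8-byte aligned.
 */
static size_t emit_entry(unsigned char *buf, size_t off, uint64_t ino,
			 const char *name)
{
	size_t len;

	assert(buf != NULL && name != NULL);
	len = strlen(name);
	/* Byte-wise copy: safe on CPUs that trap on unaligned stores. */
	memcpy(buf + off, &ino, sizeof(ino));
	memcpy(buf + off + sizeof(ino), name, len);
	return off + sizeof(ino) + len;
}

/* Read the inode number back without an aligned 64-bit load. */
static uint64_t read_ino(const unsigned char *buf, size_t off)
{
	uint64_t ino;

	memcpy(&ino, buf + off, sizeof(ino));
	return ino;
}
```

Casting buf + off to a uint64_t pointer and dereferencing it would work on x86 but may trap on strict-alignment architectures, which matches the kind of warning quoted above.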
2018-04-18 | btrfs: Fix wrong btrfs_delalloc_release_extents parameter | Qu Wenruo
Commit 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") merged into mainline is not the latest version submitted to the mailing list in Dec 2017.

It passes a fatally wrong @qgroup_free parameter, which results in increasing qgroup metadata pertrans reserved space and causes a lot of early EDQUOT.

Fix it by applying the correct diff on top of the current branch.

Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18 | btrfs: delayed-inode: Remove wrong qgroup meta reservation calls | Qu Wenruo
Commit 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item") merged into mainline was not the latest version submitted to the mailing list in Dec 2017.

It lacks the following fixes:

1) Remove btrfs_qgroup_convert_reserved_meta() call in btrfs_delayed_item_release_metadata()
2) Remove btrfs_qgroup_reserve_meta_prealloc() call in btrfs_delayed_inode_reserve_metadata()

Those fixes will resolve unexpected EDQUOT problems.

Fixes: 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18 | btrfs: qgroup: Use independent and accurate per inode qgroup rsv | Qu Wenruo
Unlike the reservation calculation used in inode rsv for metadata, qgroup doesn't really need to care about things like csum size or extent usage for the whole tree COW.

Qgroups care more about the net change of the extent usage. That's to say, if we're going to insert one file extent, it will mostly find its place in a COWed tree block, leaving no change in extent usage. Or it causes a leaf split, resulting in one new net extent and increasing the qgroup number by nodesize. Or, in an even rarer case, it increases the tree level, increasing the qgroup number by 2 * nodesize.

So unlike the extent allocator, whose complicated calculation cares more about accuracy and never hitting an error, qgroup doesn't need that over-estimated reservation.

This patch will maintain 2 new members in the btrfs_block_rsv structure for qgroup, using a much smaller calculation for qgroup rsv and reducing false EDQUOT.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
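To get a feeling for the difference in scale, the two styles of estimate can be compared in a few lines of C. Both formulas below are assumptions chosen for illustration (a worst-case estimate that multiplies by tree height, and a net-change estimate of one tree block per item); they are not copied from the patch, and MAX_LEVEL and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEVEL 8	/* assumed maximum tree height */

/* Worst-case style: every level may be COWed and split for each item. */
static uint64_t allocator_meta_estimate(uint32_t nodesize, unsigned int items)
{
	assert(nodesize > 0);
	return (uint64_t)nodesize * MAX_LEVEL * 2 * items;
}

/* Net-change style: at most one new tree block of usage per item. */
static uint64_t qgroup_meta_estimate(uint32_t nodesize, unsigned int items)
{
	assert(nodesize > 0);
	return (uint64_t)nodesize * items;
}
```

With a 16 KiB nodesize and 3 items, the first estimate asks for 786432 bytes while the second asks for 49152 bytes, which is why the larger number can hit a quota limit long before real usage does.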
2018-04-18 | btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT | Qu Wenruo
Unlike the previous method that tries to commit the transaction inside qgroup_reserve(), this time we will try to commit the transaction using fs_info->transaction_kthread, to avoid a nested transaction and without having to worry about the locking context.

Since it's an asynchronous function call and we won't wait for the transaction commit, unlike the previous method, we must call it before we hit the qgroup limit.

So this patch will use the ratio and size of the qgroup meta_pertrans reservation as an indicator to check if we should trigger a transaction commit. (meta_prealloc won't be cleaned at transaction commit, it's useless anyway)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
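A "ratio and size" trigger of this kind might look like the sketch below. The constants PERTRANS_RATIO and PERTRANS_SIZE, the function name should_commit_early() and the exact formula are assumptions for illustration only; the commit message does not state the actual values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PERTRANS_RATIO 32ULL		/* assumed: 1/32 of the remaining limit */
#define PERTRANS_SIZE  (32ULL << 20)	/* assumed: absolute cap of 32 MiB */

/*
 * Decide whether the per-transaction metadata reservation has grown large
 * enough that an early (asynchronous) commit is worth requesting.
 */
static bool should_commit_early(uint64_t limit, uint64_t data_rsv,
				uint64_t prealloc_rsv, uint64_t pertrans_rsv)
{
	uint64_t used = data_rsv + prealloc_rsv;
	uint64_t remaining;
	uint64_t threshold;

	/* A limit of 0 means "no limit"; callers are expected to skip us. */
	assert(limit != 0);

	/* Only pertrans space is released by a commit, so exclude the rest. */
	remaining = limit > used ? limit - used : 0;
	threshold = remaining / PERTRANS_RATIO;
	if (threshold > PERTRANS_SIZE)
		threshold = PERTRANS_SIZE;

	return pertrans_rsv > threshold;
}
```

Because the commit is only requested and not waited for, the threshold has to sit well below the limit itself; that is the reason for using a fraction of the remaining space rather than the limit.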
2018-04-15 | Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds
Pull more btrfs updates from David Sterba:

 "We have queued a few more fixes (error handling, log replay, softlockup) and the rest is SPDX updates that touch almost all files so the diffstat is long"

* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Only check first key for committed tree blocks
  btrfs: add SPDX header to Kconfig
  btrfs: replace GPL boilerplate by SPDX -- sources
  btrfs: replace GPL boilerplate by SPDX -- headers
  Btrfs: fix loss of prealloc extents past i_size after fsync log replay
  Btrfs: clean up resources during umount after trans is aborted
  btrfs: Fix possible softlock on single core machines
  Btrfs: bail out on error during replay_dir_deletes
  Btrfs: fix NULL pointer dereference in log_dir_items
2018-04-13 | btrfs: Only check first key for committed tree blocks | Qu Wenruo
When looping btrfs/074 with many cpus (>= 8), it's possible to trigger a kernel warning due to first key verification:

[ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
[ 4239.523830] Modules linked in:
[ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
[ 4239.527101] Call Trace:
[ 4239.527251]  read_tree_block+0x42/0x70
[ 4239.527434]  read_node_slot+0xd2/0x110
[ 4239.527632]  push_leaf_right+0xad/0x1b0
[ 4239.527809]  split_leaf+0x4ea/0x700
[ 4239.527988]  ? leaf_space_used+0xbc/0xe0
[ 4239.528192]  ? btrfs_set_lock_blocking_rw+0x99/0xb0
[ 4239.528416]  btrfs_search_slot+0x8cc/0xa40
[ 4239.528605]  btrfs_insert_empty_items+0x71/0xc0
[ 4239.528798]  __btrfs_run_delayed_refs+0xa98/0x1680
[ 4239.529013]  btrfs_run_delayed_refs+0x10b/0x1b0
[ 4239.529205]  btrfs_commit_transaction+0x33/0xaf0
[ 4239.529445]  ? start_transaction+0xa8/0x4f0
[ 4239.529630]  btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
[ 4239.529833]  btrfs_check_data_free_space+0x54/0xa0
[ 4239.530045]  btrfs_delalloc_reserve_space+0x25/0x70
[ 4239.531907]  btrfs_direct_IO+0x233/0x3d0
[ 4239.532098]  generic_file_direct_write+0xcb/0x170
[ 4239.532296]  btrfs_file_write_iter+0x2bb/0x5f4
[ 4239.532491]  aio_write+0xe2/0x180
[ 4239.532669]  ? lock_acquire+0xac/0x1e0
[ 4239.532839]  ? __might_fault+0x3e/0x90
[ 4239.533032]  do_io_submit+0x594/0x860
[ 4239.533223]  ? do_io_submit+0x594/0x860
[ 4239.533398]  SyS_io_submit+0x10/0x20
[ 4239.533560]  ? SyS_io_submit+0x10/0x20
[ 4239.533729]  do_syscall_64+0x75/0x1d0
[ 4239.533979]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 4239.534182] RIP: 0033:0x7f8519741697

The problem here is that at btree_read_extent_buffer_pages() we have not acquired the read/write lock on that extent buffer, so only basic info like level/bytenr is reliable.

The race condition therefore leads to such a false alert.

However, at the current call site it's impossible to acquire the proper lock without a race window.

To fix the problem, we only verify the first key for committed tree blocks (whose generation is no larger than fs_info->last_trans_committed), so the content of such tree blocks will not change and there is no need to get the read/write lock.

Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 581c1760415c ("btrfs: Validate child tree block's level and first key")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
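The generation gate described above can be sketched as follows. The structures and the function check_first_key() are hypothetical simplifications, and the return convention (0 for "passed or skipped", -1 for mismatch) is an assumption for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced key and tree block. */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct tree_block {
	uint64_t generation;
	struct key first_key;
};

static bool keys_equal(const struct key *a, const struct key *b)
{
	return a->objectid == b->objectid && a->type == b->type &&
	       a->offset == b->offset;
}

/* Returns 0 if the check passes or is skipped, -1 on a real mismatch. */
static int check_first_key(const struct tree_block *eb,
			   const struct key *expected,
			   uint64_t last_trans_committed)
{
	assert(eb != NULL && expected != NULL);

	/*
	 * A block from a still running transaction can be modified while we
	 * look at it without a lock, so its first key proves nothing yet.
	 */
	if (eb->generation > last_trans_committed)
		return 0;

	return keys_equal(&eb->first_key, expected) ? 0 : -1;
}
```

The same mismatching block is thus ignored while its transaction is running and reported once that transaction has been committed.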
2018-04-12 | btrfs: add SPDX header to Kconfig | David Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12 | btrfs: replace GPL boilerplate by SPDX -- sources | David Sterba
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12 | btrfs: replace GPL boilerplate by SPDX -- headers | David Sterba
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Unify the include protection macros to match the file names. Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12 | Btrfs: fix loss of prealloc extents past i_size after fsync log replay | Filipe Manana
Currently if we allocate extents beyond an inode's i_size (through the fallocate system call) and then fsync the file, we log the extents but after a power failure we replay them and then immediately drop them. This behaviour has existed since about 2009, commit c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log"), because it marks the inode as an orphan instead of dropping any extents beyond i_size before replaying logged extents, so after the log replay, and while the mount operation is still ongoing, we find the inode marked as an orphan and then perform a truncation (drop extents beyond the inode's i_size). Because the processing of orphan inodes is still done right after replaying the log and before the mount operation finishes, the intention of that commit does not make any sense (at least as of today). However reverting that behaviour is not enough, because we can not simply discard all extents beyond i_size and then replay logged extents, because we risk dropping extents beyond i_size created in past transactions, for example:

  add prealloc extent beyond i_size
  fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
  transaction commit
  add another prealloc extent beyond i_size
  fsync - triggers the fast fsync path
  power failure

In that scenario, we would drop the first extent and then replay the second one. To fix this just make sure that all prealloc extents beyond i_size are logged, and if we find too many (which is far from a common case), fall back to a full transaction commit (like we do when logging regular extents in the fast fsync path).

Trivial reproducer:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
 $ sync
 $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
 $ xfs_io -c "fsync" /mnt/foo
 <power failure>

 # mount to replay log
 $ mount /dev/sdb /mnt
 # at this point the file only has one extent, at offset 0, size 256K

A test case for fstests follows soon, covering multiple scenarios that involve adding prealloc extents with previous shrinking truncates and without such truncates.

Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12 | Btrfs: clean up resources during umount after trans is aborted | Liu Bo
Currently if some fatal errors occur, like all IO getting -EIO, resources would be cleaned up when

 a) the transaction is being committed, or
 b) BTRFS_FS_STATE_ERROR is set

However, in some rare cases, resources may be left alone after the transaction gets aborted, and umount may run into some ASSERT(), e.g.

 ASSERT(list_empty(&block_group->dirty_list));

For case a), in btrfs_commit_transaction(), there are several places at the beginning where we just call btrfs_end_transaction() without cleaning up resources.

For case b), it is possible that the trans handle doesn't have any dirty stuff, so only the trans handle is marked as aborted while BTRFS_FS_STATE_ERROR is not set, and resources remain in memory.

This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that all resources won't stay in memory after umount.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-11 | page cache: use xa_lock | Matthew Wilcox
Remove the address_space ->tree_lock and use the xa_lock newly added to the radix_tree_root. Rename the address_space ->page_tree to ->i_pages, since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 | btrfs: Fix possible softlock on single core machines | Nikolay Borisov
do_chunk_alloc implements a loop checking whether there is a pending chunk allocation and if so causes the caller to loop. Generally this loop is executed only once, however testing with btrfs/072 on a single core VM uncovered an extreme case where the system could loop indefinitely. This is due to a missing cond_resched in the loop, which doesn't give the previous chunk allocator a chance to finish its job.

The fix is to simply add the missing cond_resched.

Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-05 | Btrfs: bail out on error during replay_dir_deletes | Liu Bo
If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs to bail out, otherwise @ret would be forced to be 0 after 'break;' and the caller won't be aware of it. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-05 | Btrfs: fix NULL pointer dereference in log_dir_items | Liu Bo
0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is returned, path->nodes[0] could be NULL, log_dir_items lacks such a check for <0 and we may run into a null pointer dereference panic. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
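The three-way return convention (negative for error, 0 for "moved to the next leaf", 1 for "no more leaves") is easy to get wrong in a loop. The sketch below simulates it in userspace; walk_leaves() and the array of canned results are hypothetical and only stand in for repeated calls to a function with that convention.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Walk a sequence of simulated "next leaf" results. Each array entry plays
 * the role of one return value: <0 error, 0 moved on, 1 no more leaves.
 * Returns 0 on a clean end of the walk or the negative error code.
 */
static int walk_leaves(const int *results, size_t count, size_t *visited)
{
	int ret = 0;

	assert(visited != NULL);
	*visited = 1;	/* the leaf we start on */

	for (size_t i = 0; i < count; i++) {
		ret = results[i];
		if (ret < 0)
			return ret;	/* error: path may be invalid, bail out */
		if (ret > 0) {
			ret = 0;	/* end of tree is a normal way to stop */
			break;
		}
		(*visited)++;
	}
	return ret;
}
```

Treating every non-zero value as "end of tree" would both hide the error from the caller and continue into code that assumes a valid leaf, which is the pattern the two fixes above address.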
2018-04-04 | Merge tag 'for-4.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds
Pull btrfs updates from David Sterba:

 "There are several user visible changes, the rest is mostly invisible and continues to clean up the whole code base.

  User visible changes:
   - new mount option nossd_spread (pair for ssd_spread)
   - mount option subvolid will detect junk after the number and fail the mount
   - add message after cancelled device replace
   - direct module dependency on libcrc32, removed own crc wrappers
   - removed user space transaction ioctls
   - use lighter locking when reading /proc/self/mounts, RCU instead of mutex to avoid unnecessary contention

  Enhancements:
   - skip writeback of last page when truncating file to same size
   - send: do not issue unnecessary truncate operations
   - mount option token specifiers: use %u for unsigned values, more validation
   - selftests: more tree block validations

  qgroups:
   - preparatory work for splitting reservation types for data and metadata, this should allow for more accurate tracking and fix some issues with underflows or do further enhancements
   - split metadata reservations for started and joined transaction so they do not get mixed up and are accounted correctly at commit time
   - with the above, it's possible to revert patch that potentially deadlocks when trying to make more space by explicitly committing when the quota limit is hit
   - fix root item corruption when multiple same source snapshots are created with quota enabled

  RAID56:
   - make sure target is identical to source when raid56 rebuild fails after dev-replace
   - faster rebuild during scrub, batch by stripes and not block-by-block
   - make more use of cached data when rebuilding from a missing device

  Fixes:
   - null pointer deref when device replace target is missing
   - fix fsync after hole punching when using no-holes feature
   - fix lockdep splat when allocating percpu data with wrong GFP flags

  Cleanups, refactoring, core changes:
   - drop redundant parameters from various functions
   - kill and opencode trivial helpers
   - __cold/__exit function annotations
   - dead code removal
   - continued audit and documentation of memory barriers
   - error handling: handle removal from uuid tree
   - error handling: remove handling of impossible conditions
   - more debugging or error messages
   - updated tracepoints
   - one VLA use removal (and one still left)"

* tag 'for-4.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (164 commits)
  btrfs: lift errors from add_extent_changeset to the callers
  Btrfs: print error messages when failing to read trees
  btrfs: user proper type for btrfs_mask_flags flags
  btrfs: split dev-replace locking helpers for read and write
  btrfs: remove stale comments about fs_mutex
  btrfs: use RCU in btrfs_show_devname for device list traversal
  btrfs: update barrier in should_cow_block
  btrfs: use lockdep_assert_held for mutexes
  btrfs: use lockdep_assert_held for spinlocks
  btrfs: Validate child tree block's level and first key
  btrfs: tests/qgroup: Fix wrong tree backref level
  Btrfs: fix copy_items() return value when logging an inode
  Btrfs: fix fsync after hole punching when using no-holes feature
  btrfs: use helper to set ulist aux from a qgroup
  Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"
  btrfs: qgroup: Update trace events for metadata reservation
  btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space
  btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item
  btrfs: qgroup: Use separate meta reservation type for delalloc
  btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS
  ...
2018-04-02 | Merge branch 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull wait_var_event updates from Ingo Molnar:

 "This introduces the new wait_var_event() API, which is a more flexible waiting primitive than wait_on_atomic_t().

  All wait_on_atomic_t() users are migrated over to the new API and wait_on_atomic_t() is removed. The migration fixes one bug and should result in no functional changes for the other usecases"

* 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/wait: Improve __var_waitqueue() code generation
  sched/wait: Remove the wait_on_atomic_t() API
  sched/wait, arch/mips: Fix and convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, drivers/media: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, drivers/drm: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait: Introduce wait_var_event()
2018-03-31 | btrfs: lift errors from add_extent_changeset to the callers | David Sterba
The missing error handling in add_extent_changeset was hidden, so make it at least visible in the callers. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | Btrfs: print error messages when failing to read trees | Liu Bo
When mount fails to read trees like the fs tree, checksum tree, extent tree, etc, there is not enough information about what went wrong.

With this, messages like

 "BTRFS warning (device sdf): failed to read root (objectid=7): -5"

would help us a bit.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | btrfs: user proper type for btrfs_mask_flags flags | David Sterba
All users pass a local unsigned int and not the __uXX types that are supposed to be used for userspace interfaces. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | btrfs: split dev-replace locking helpers for read and write | David Sterba
The current calls are unclear in what way btrfs_dev_replace_lock takes the locks, so drop the argument, split the helpers and use similar naming as for read and write locks. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | btrfs: remove stale comments about fs_mutex | David Sterba
The fs_mutex was killed in 2008 by a213501153fd66e2 ("Btrfs: Replace the big fs_mutex with a collection of other locks"), but is still remembered in some comments. We don't have any extra needs for locking in the ACL handlers.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | btrfs: use RCU in btrfs_show_devname for device list traversal | David Sterba
The show_devname callback is used to print device name in /proc/self/mounts, we need to traverse the device list consistently and read the name that's copied to a seq buffer so we don't need further locking. If the first device is being deleted at the same time, the RCU will allow us to read the device name, though it will become stale right after the RCU protection ends. This is unavoidable and the user can expect that the device will disappear from the filesystem's list at some point. The device_list_mutex was pretty heavy as it is used eg. for writing superblock and a few other IO related contexts. This can stall any application that reads the proc file for no reason. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | btrfs: update barrier in should_cow_block | David Sterba
Once there was a simple int force_cow that was used with the plain barriers, and then converted to a bit, so we should use the appropriate barrier helper. Other variables in the complex if condition do not depend on a barrier, so we should be fine in case the atomic barrier becomes a no-op. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | btrfs: use lockdep_assert_held for mutexes | David Sterba
Using lockdep_assert_held is preferred, replace mutex_is_locked. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | btrfs: use lockdep_assert_held for spinlocks | David Sterba
Using lockdep_assert_held is preferred, replace assert_spin_locked. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | btrfs: Validate child tree block's level and first key | Qu Wenruo
We have several reports about node pointers pointing to incorrect child tree blocks, which could even have the wrong owner and level but still a valid generation and checksum.

Although btrfs check could handle it and print an error message like:

 leaf parent key incorrect 60670574592

the kernel doesn't have enough checks for this type of corruption.

At least add such a check to read_tree_block() and btrfs_read_buffer(), where we need two new parameters @level and @first_key to verify the child tree block.

The new @level check is mandatory and all call sites are already modified to extract the expected level from their call chain.

While @first_key is optional, the following call sites are skipping such check:

1) Root node/leaf
   As ROOT_ITEM doesn't contain the first key, skip @first_key check.
2) Direct backref
   Only parent bytenr and level is known and we need to resolve the key all by ourselves, skip @first_key check.

Another note of this verification is, it needs extra info from nodeptr or ROOT_ITEM, so it can't fit into the current tree-checker framework, which is limited to the node/leaf boundary.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
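The shape of such a check, with a mandatory level and an optional first key, can be sketched like this. The structures, the function verify_child() and the choice of -EIO as the error code are assumptions for illustration; the real helper works on extent buffers and on-disk keys.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced key and tree block. */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct tree_block {
	int level;
	struct key first_key;
};

/*
 * 'level' is what the parent pointer says the child must be.
 * 'first_key' may be NULL when the caller has no key to compare against
 * (for example a tree root or a direct backref).
 */
static int verify_child(const struct tree_block *eb, int level,
			const struct key *first_key)
{
	assert(eb != NULL);

	if (eb->level != level)
		return -EIO;
	if (first_key == NULL)
		return 0;
	if (eb->first_key.objectid != first_key->objectid ||
	    eb->first_key.type != first_key->type ||
	    eb->first_key.offset != first_key->offset)
		return -EIO;
	return 0;
}
```

A block with a valid checksum and generation but the wrong level or first key is rejected here, which is exactly the class of corruption that a checksum alone cannot catch.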
2018-03-31 | btrfs: tests/qgroup: Fix wrong tree backref level | Qu Wenruo
The extent tree of the test fs is like the following:

 BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
  item 0 key (4096 168 4096) itemoff 3944 itemsize 51
          extent refs 1 gen 1 flags 2
          tree block key (68719476736 0 0) level 1
                                           ^^^^^^^
          ref#0: tree block backref root 5

And it's using an empty tree for fs tree, so there is no way that its level can be 1.

For REAL (created by mkfs) fs tree backref with no skinny metadata, the result should look like:

 item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
         refs 1 gen 4 flags TREE_BLOCK
         tree block key (256 INODE_ITEM 0) level 0
                                           ^^^^^^^
         tree block backref root 5

Fix the level to 0, so it won't break later tree level checker.

Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | Btrfs: fix copy_items() return value when logging an inode | Filipe Manana
When logging an inode, at tree-log.c:copy_items(), if we call btrfs_next_leaf() at the loop which checks for the need to log holes, we need to make sure copy_items() returns the value 1 to its caller and not 0 (on success). This is because the path the caller passed was released and is now different from what it was before, and the caller expects a return value of 0 to mean both success and that the path has not changed, while a return value of 1 means both success and signals the caller that it can not reuse the path, it has to perform another tree search.

Even though this is a case that should not be triggered under normal circumstances, or very rarely at least, its consequences can be very unpredictable (especially when replaying a log tree).

Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | Btrfs: fix fsync after hole punching when using no-holes feature | Filipe Manana
When we have the no-holes mode enabled and fsync a file after punching a hole in it, we can end up not logging the whole hole range in the log tree. This happens if the file has extent items that span more than one leaf and we punch a hole that covers a range that starts in a leaf but does not go beyond the offset of the first extent in the next leaf.

Example:

  $ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
  $ mount /dev/sdb /mnt
  $ for ((i = 0; i <= 831; i++)); do
	offset=$((i * 2 * 256 * 1024))
	xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
		/mnt/foobar >/dev/null
    done
  $ sync

  # We now have 2 leafs in our filesystem fs tree, the first leaf has an
  # item corresponding the extent at file offset 216530944 and the second
  # leaf has a first item corresponding to the extent at offset 217055232.
  # Now we punch a hole that partially covers the range of the extent at
  # offset 216530944 but does not go beyond the offset 217055232.

  $ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
  $ xfs_io -c "fsync" /mnt/foobar

  <power fail>

  # mount to replay the log
  $ mount /dev/sdb /mnt

  # Before this patch, only the subrange [216658016, 216662016[ (length of
  # 4000 bytes) was logged, leaving an incorrect file layout after log
  # replay.

Fix this by checking if there is a hole between the last extent item that we processed and the first extent item in the next leaf, and if there is one, log an explicit hole extent item.

Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
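The gap check between the last processed extent and the first extent of the next leaf boils down to simple arithmetic. The sketch below uses the offsets from the example; struct extent and both helpers are hypothetical, and the truncated length of 127072 bytes is derived here from the punch start (216658016 - 216530944), not quoted from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical file extent: starting file offset and length in bytes. */
struct extent {
	uint64_t offset;
	uint64_t len;
};

static uint64_t extent_end(const struct extent *e)
{
	/* Guard against wrap-around of the end offset. */
	assert(e->offset + e->len >= e->offset);
	return e->offset + e->len;
}

/*
 * Length of the hole between the last extent seen in the current leaf and
 * the first extent of the next leaf, or 0 if they are adjacent.
 */
static uint64_t hole_before_next_leaf(const struct extent *last,
				      uint64_t next_leaf_first_offset)
{
	uint64_t end = extent_end(last);

	if (next_leaf_first_offset > end)
		return next_leaf_first_offset - end;
	return 0;
}
```

For the example, the extent at 216530944 is cut down to end at 216658016 by the punch, so the gap up to 217055232 is 397216 bytes. With no-holes enabled there is no item describing that range, so a log that wants to convey it has to say so explicitly.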
2018-03-31 | btrfs: use helper to set ulist aux from a qgroup | David Sterba
We have a nice helper to do proper casting of a qgroup to a ulist aux value, and several places that could make use of it.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 | Revert "btrfs: qgroups: Retry after commit on getting EDQUOT" | Qu Wenruo
This reverts commit 48a89bc4f2ceab87bc858a8eb189636b09c846a7.

The idea to commit transaction and free some space after hitting qgroup limit is good, although the problem is it can easily cause deadlocks.

One deadlock example is caused by trying to flush data while still holding it:

Call Trace:
 __schedule+0x49d/0x10f0
 schedule+0xc6/0x290
 schedule_timeout+0x187/0x1c0
 wait_for_completion+0x204/0x3a0
 btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs]
 qgroup_reserve+0x913/0xa10 [btrfs]
 btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs]
 btrfs_check_data_free_space+0x96/0xd0 [btrfs]
 __btrfs_buffered_write+0x3ac/0xd40 [btrfs]
 btrfs_file_write_iter+0x62a/0xba0 [btrfs]
 __vfs_write+0x320/0x430
 vfs_write+0x107/0x270
 SyS_write+0xbf/0x150
 do_syscall_64+0x1b0/0x3d0
 entry_SYSCALL64_slow_path+0x25/0x25

Another can be caused by trying to commit one transaction while nesting with trans handle held by ourselves:

btrfs_start_transaction()
|- btrfs_qgroup_reserve_meta_pertrans()
   |- qgroup_reserve()
      |- btrfs_join_transaction()
      |- btrfs_commit_transaction()

The retry is causing more problems than expected when limit is enabled. At least a graceful EDQUOT is way better than deadlock.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Update trace events for metadata reservationQu Wenruo
Now trace_qgroup_meta_reserve() will have an extra type parameter.

And introduce two new trace events:

1) trace_qgroup_meta_free_all_pertrans()
   For btrfs_qgroup_free_meta_all_pertrans()

2) trace_qgroup_meta_convert()
   For btrfs_qgroup_convert_reserved_meta()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved spaceQu Wenruo
For the quota disabled->enabled case, it's possible that at reservation time quota was not enabled, so no bytes were really reserved, while at release time quota was enabled, so we will try to release some bytes we didn't really own. Such a situation can cause metadata reservation underflow for both types, although it is less likely for the per-trans type since enabling quota commits the transaction. To address this, record qgroup meta reserved bytes in root::qgroup_meta_rsv_pertrans and root::qgroup_meta_rsv_prealloc, so at release time we won't free any bytes we didn't reserve. For DATA, it's already handled by io_tree, so nothing needs to be done there. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
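The bookkeeping idea above can be sketched as follows. This is a simplified userspace model, not the kernel implementation: struct root_rsv and the reserve_meta()/release_meta() helpers are hypothetical, and a single counter stands in for the per-type fields on the root.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-root record described in the commit. */
struct root_rsv {
	bool quota_enabled;
	uint64_t recorded;        /* bytes this root really reserved */
	uint64_t qgroup_reserved; /* stand-in for the qgroup's own counter */
};

/* Reserve only when quota is enabled, and remember how much was taken. */
static void reserve_meta(struct root_rsv *r, uint64_t bytes)
{
	if (!r->quota_enabled)
		return; /* nothing reserved, so nothing is recorded */
	r->qgroup_reserved += bytes;
	r->recorded += bytes;
}

/*
 * Release at most what was recorded, so bytes "reserved" while quota was
 * disabled can never be subtracted. Returns the bytes actually released.
 */
static uint64_t release_meta(struct root_rsv *r, uint64_t bytes)
{
	if (!r->quota_enabled)
		return 0;
	if (bytes > r->recorded)
		bytes = r->recorded;
	r->recorded -= bytes;
	r->qgroup_reserved -= bytes;
	return bytes;
}
```

In this model, a reserve made while quota is disabled followed by a release after quota is enabled frees zero bytes instead of underflowing the counter.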
2018-03-31btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and itemQu Wenruo
Quite similar to delalloc: some modifications to delayed-inode and delayed-item reservation. The release case also needs an extra parameter to distinguish a normal release from an error release. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Use separate meta reservation type for delallocQu Wenruo
Before this patch, btrfs qgroup mixed per-transaction meta rsv with preallocated meta rsv, making it quite easy to underflow the qgroup meta reservation. Since we have the new qgroup meta rsv types, apply them to delalloc reservation. Now for delalloc, most of its reserved space will use the META_PREALLOC qgroup rsv type. Callers reducing outstanding extents, like btrfs_finish_ordered_io(), will convert the corresponding META_PREALLOC reservation to META_PERTRANS. This is mainly because current qgroup numbers are only updated in btrfs_commit_transaction(); that is to say, if we don't keep such a placeholder reservation, we can exceed the qgroup limit. Callers freeing outstanding extents in an error handler will just free META_PREALLOC bytes. This behavior requires callers of btrfs_qgroup_release_meta() or btrfs_qgroup_convert_meta() to be aware of which type they are. So in this patch, btrfs_delalloc_release_metadata() and its callers get an extra parameter to inform qgroup to do the correct meta convert/release. The good news is that even if we use the wrong type (convert or free), it won't cause an obvious bug, as the prealloc type is always in good shape, and the type only affects whether per-trans meta is increased or not. So in the worst case the metadata limit can sometimes be exceeded (no convert at all) or is reached too soon (no free at all). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANSQu Wenruo
For meta_prealloc reservation users, the caller will modify trees after btrfs_join_transaction(), so part (or even all) of the meta_prealloc reservation should be converted to meta_pertrans and kept until transaction commit time. This patch introduces a new function, btrfs_qgroup_convert_reserved_meta(), to do this for META_PREALLOC reservation users. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
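A minimal sketch of the conversion, assuming a simplified model with one counter per reservation type. struct meta_rsv, convert_reserved_meta() and free_all_pertrans() are hypothetical names for illustration, and clamping the request to the available prealloc bytes is an assumption of this sketch, not a statement about btrfs_qgroup_convert_reserved_meta().

```c
#include <stdint.h>

/* Hypothetical model: one counter per metadata reservation type. */
struct meta_rsv {
	uint64_t prealloc; /* reserved before joining the transaction */
	uint64_t pertrans; /* kept until the transaction commits */
};

/*
 * Move up to 'bytes' from the prealloc counter to the per-trans counter.
 * The total reserved amount is unchanged; only its type changes.
 * Returns the number of bytes actually converted.
 */
static uint64_t convert_reserved_meta(struct meta_rsv *rsv, uint64_t bytes)
{
	if (bytes > rsv->prealloc)
		bytes = rsv->prealloc; /* assumption: never convert more than held */
	rsv->prealloc -= bytes;
	rsv->pertrans += bytes;
	return bytes;
}

/* At commit time the per-trans bytes are dropped in one go. */
static void free_all_pertrans(struct meta_rsv *rsv)
{
	rsv->pertrans = 0;
}
```

The point of the split is visible in the model: conversion keeps the bytes counted against the limit until commit, while an error path would simply subtract from prealloc.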
2018-03-31btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroupQu Wenruo
Since qgroup has separate metadata reservation types now, we can completely get rid of the old root->qgroup_meta_rsv, which mostly acts as the current META_PERTRANS reservation type. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertransQu Wenruo
Btrfs uses 2 different methods to reserve metadata qgroup space.

1) Reserve at btrfs_start_transaction() time
   This is quite straightforward: the caller will use the allocated trans handle to modify b-trees. In this case, reserved metadata should be kept until qgroup numbers are updated.

2) Reserve by using block_rsv first, and later btrfs_join_transaction()
   This is more complicated: the caller will reserve space using block_rsv first, and then later call btrfs_join_transaction() to get a trans handle. In this case, before we modify trees, the reserved space can be modified on demand, and after btrfs_join_transaction(), such reserved space should also be kept until qgroup numbers are updated.

Since these two types behave differently, split the original "META" reservation type into 2 sub-types:

  META_PERTRANS:
    For above case 1)

  META_PREALLOC:
    For reservations that happened before btrfs_join_transaction() of case 2)

NOTE: This patch only converts existing qgroup meta reservation callers according to their situation; it does not ensure that all callers reserve at the correct time. Such fixes will be added in later patches.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Cleanup the remaining old reservation countersQu Wenruo
Now that qgroup has switched to the new separate-type reservation system, clean up the remaining old reservation counters. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Update trace events to use new separate rsv typesQu Wenruo
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Fix wrong qgroup reservation update for relationship modificationQu Wenruo
When modifying a qgroup relationship, for a qgroup which only owns exclusive extents, we go through the quick update path. In this path, we add/subtract the exclusive and reference numbers for the parent qgroup: since the source (child) qgroup only has exclusive extents, the destination (parent) qgroup will also own or lose those extents exclusively. The same should apply to reservation, since later reservation adding/releasing will also affect the parent qgroup; without the reservation carried from the child, the parent will underflow its reservation or have a dead reservation which will never be freed. However, the original code doesn't do the same thing for reservation. It handles qgroup reservation quite differently: when adding a relationship it removes qgroup reservation, as if it were allocating space from the reserved qgroup, but it does nothing for qgroup reservation when removing a relationship. It looks like, because we're adding qgroup->rfer, the original code assumes we're writing new data, so it follows the normal write routine by reducing qgroup->reserved and adding qgroup->rfer/excl. This old behavior is wrong and should follow the same excl/rfer behavior. Fix it by using the correct behavior described above. Fixes: 31193213f1f9 ("Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
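The intended quick-update behavior can be illustrated with a small model in which reservation follows the same direction as rfer/excl. struct qgroup_counts and quick_update_parent() are hypothetical names for this sketch; the real code operates on the kernel's qgroup structures.

```c
#include <stdint.h>

/* Hypothetical, reduced view of the counters a qgroup carries. */
struct qgroup_counts {
	uint64_t rfer;     /* referenced bytes */
	uint64_t excl;     /* exclusive bytes */
	uint64_t reserved; /* reserved bytes */
};

/*
 * Quick update for a child that only owns exclusive extents.
 * sign > 0: relationship added, parent gains everything the child carries.
 * sign < 0: relationship removed, parent loses the same amounts.
 * Reservation is treated exactly like rfer/excl in both directions.
 */
static void quick_update_parent(struct qgroup_counts *parent,
				const struct qgroup_counts *child, int sign)
{
	if (sign > 0) {
		parent->rfer += child->rfer;
		parent->excl += child->excl;
		parent->reserved += child->reserved;
	} else {
		parent->rfer -= child->rfer;
		parent->excl -= child->excl;
		parent->reserved -= child->reserved;
	}
}
```

Adding and then removing the same relationship leaves the parent's three counters where they started, which is the symmetry the old reservation handling lacked.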
2018-03-31btrfs: qgroup: Make qgroup_reserve and its callers to use separate reservation typeQu Wenruo
Since most callers of qgroup_reserve() are already defined by type, converting qgroup_reserve() is quite easy. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: qgroup: Introduce helpers to update and access new qgroup rsvQu Wenruo
Introduce helpers to:

1) Get total reserved space
   For limit calculation

2) Add/release reserved space for given type
   With underflow detection and warning

3) Add/release reserved space according to child qgroup

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
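The three kinds of helpers listed above might look roughly like this in a userspace model. All names here (enum rsv_type, struct qgroup_rsv, rsv_total(), rsv_add(), rsv_release(), rsv_add_by_child()) are illustrative assumptions, as is the choice to clamp to zero after warning on underflow.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative reservation types; the names are not the kernel's. */
enum rsv_type { RSV_DATA, RSV_META, RSV_LAST };

/* One counter per reservation type instead of a single 'reserved' value. */
struct qgroup_rsv {
	uint64_t values[RSV_LAST];
};

/* 1) Total reserved space across all types, e.g. for limit checks. */
static uint64_t rsv_total(const struct qgroup_rsv *rsv)
{
	uint64_t total = 0;

	for (int i = 0; i < RSV_LAST; i++)
		total += rsv->values[i];
	return total;
}

/* 2a) Add reserved space for one type. */
static void rsv_add(struct qgroup_rsv *rsv, enum rsv_type type, uint64_t bytes)
{
	rsv->values[type] += bytes;
}

/*
 * 2b) Release reserved space for one type. On underflow, warn and clamp the
 * counter to zero (an assumption of this sketch). Returns false on underflow.
 */
static bool rsv_release(struct qgroup_rsv *rsv, enum rsv_type type,
			uint64_t bytes)
{
	if (bytes > rsv->values[type]) {
		fprintf(stderr, "rsv underflow: type %d have %llu want %llu\n",
			(int)type, (unsigned long long)rsv->values[type],
			(unsigned long long)bytes);
		rsv->values[type] = 0;
		return false;
	}
	rsv->values[type] -= bytes;
	return true;
}

/* 3) Add everything a child holds, type by type, to its parent. */
static void rsv_add_by_child(struct qgroup_rsv *parent,
			     const struct qgroup_rsv *child)
{
	for (int i = 0; i < RSV_LAST; i++)
		rsv_add(parent, (enum rsv_type)i, child->values[i]);
}
```

Keeping the counters in an array indexed by type lets the total and the per-child propagation be written as simple loops, and adding a new type only extends the enum.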
2018-03-31btrfs: qgroup: Skeleton to support separate qgroup reservation typeQu Wenruo
Instead of single qgroup->reserved, use a new structure btrfs_qgroup_rsv to store different types of reservation. This patch only updates the header needed to compile. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31Btrfs: delete dead code in btrfs_orphan_add()Omar Sandoval
btrfs_orphan_add() has had this case commented out since it was first introduced in commit d68fc57b7e32 ("Btrfs: Metadata reservation for orphan inodes"). Most of the orphan cleanup code has been rewritten since then, so it's safe to say that this code isn't needed. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ switch to bool ] Signed-off-by: David Sterba <dsterba@suse.com>