Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode / dentry updates from Christian Brauner:
"This contains smaller performance improvements to inodes and dentries:
inode:
- Add rcu based inode lookup variants.
They avoid one inode hash lock acquire in the common case thereby
significantly reducing contention. We already support RCU-based
operations but didn't take advantage of them during inode
insertion.
Callers of iget_locked() get the improvement without any code
changes. Callers that need a custom callback can switch to
iget5_locked_rcu() as e.g., did btrfs.
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB ram vm:
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
Long-term we should pick up the effort to introduce more
fine-grained locking and possibly improve on the currently used
hash implementation.
- Start zeroing i_state in inode_init_always() instead of doing it in
individual filesystems.
This allows us to remove an unneeded lock acquire in new_inode()
and not burden individual filesystems with this.
dcache:
- Move d_lockref out of the area used by RCU lookup to avoid
cacheline ping poing because the embedded name is sharing a
cacheline with d_lockref.
- Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
up with 128 bytes in total"
* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: fix dentry size
vfs: move d_lockref out of the area used by RCU lookup
bcachefs: remove now spurious i_state initialization
xfs: remove now spurious i_state initialization in xfs_inode_alloc
vfs: partially sanitize i_state zeroing on inode creation
xfs: preserve i_state around inode_init_always in xfs_reinit_inode
btrfs: use iget5_locked_rcu
vfs: add rcu-based find_inode variants for iget ops
|
|
btree_root_lock is for the root keys in btree_root, not the pointers to
the nodes themselves; this fixes a lock ordering issue between
btree_root_lock and btree node locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
proper lock ordering is: fs_reclaim -> btree node locks
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
not using unlock_long() blocks key cache reclaim, and the allocator may
take awhile
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This reverts commit 86d81ec5f5f05846c7c6e48ffb964b24cba2e669.
This wasn't tested with memcg enabled, it immediately hits a null ptr
deref in list_lru_add().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: syzbot+e74fea078710bbca6f4b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
this fixes a 'transaction should be locked' error in backpointers fsck
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Closes: https://syzkaller.appspot.com/bug?extid=8996d8f176cf946ef641
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Instead of popping an assert in bch2_write(), WARN and print out some
debugging info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
After commit 230e9fc28604 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9a0 ("kmemcg:
account for certain kmem allocations to memcg")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
silly race
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We don't have a way to flush a timer that's executing the callback, and
this is simple and limited enough in scope that we can just use the lock
instead.
Needed for the next patch that adds direct wakeups from the allocator to
copygc, where we're now more frequently calling io_timer_del() on an
expiring timer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
fragmentation_lru derives from dirty_sectors, and wasn't being checked.
Co-developed-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We need to make sure we're not missing any fragmenation entries in the
LRU BTREE after repairing ALLOC BTREE
Also, use the new bch2_btree_write_buffer_maybe_flush() helper; this was
only working without it before since bucket invalidation (usually)
wasn't happening while fsck was running.
Co-developed-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a new helper for checking references to write buffer btrees, where
we need a flush before we definitively know we have an inconsistency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fixes warnings from bch2_print_allocator_stuck()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Accidental infinite loop; also fix btree_deadlock_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
BCH_READ_NODECODE mode - used by the move paths - really wants to use
only the original rbio, but the retry path really wants to clone - oof.
Make sure to copy the crc of the pointer we read from back to the
original rbio, or we'll see spurious checksum errors later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
On a new filesystem or device we have to allocate the journal with a
bump allocator, because allocation info isn't ready yet - but when
hot-adding a device that doesn't have a journal, we don't want to use
that path.
Reported-by: syzbot+24a867cb90d8315cccff@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: syzbot+e5292b50f1957164a4b6@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
the unlock is now in read_extent, this fixes an assertion pop in
read_from_stale_dirty_pointer()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When allocating too huge a snapshot table, we should fail gracefully
in __snapshot_t_mut() instead of fail in kmalloc().
Reported-by: syzbot+770e99b65e26fa023ab1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=770e99b65e26fa023ab1
Tested-by: syzbot+770e99b65e26fa023ab1@syzkaller.appspotmail.com
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
There's no reason for discards to be single threaded across all devices;
this will improve performance on multi device setups.
Additionally, making them per-device simplifies the refcounting on
bch_dev->io_ref; we now hold it for the duration that the discard path
is running, which fixes a race between the discard path and device
removal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This series fix the shift-out-of-bounds issue in
bch2_blacklist_entries_gc().
Instead of passing 0 to eytzinger0_first() when iterating the entries,
we explicitly check 0 and initialize i to be 0.
syzbot has tested the proposed patch and the reproducer did not trigger
any issue:
Reported-and-tested-by: syzbot+835d255ad6bc7f29ee12@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=835d255ad6bc7f29ee12
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Acquire fsck_error_counts_lock before accessing the critical section
protected by this lock.
syzbot has tested the proposed patch and the reproducer did not trigger
any issue.
Reported-by: syzbot+a2bc0e838efd7663f4d9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a2bc0e838efd7663f4d9
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes a rare deadlock when we're doing an emergency shutdown due to
failure to do a journal write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes filesystem size not changing on device removal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The debug code relies on btree_trans_list being ordered so that it can
resume on subsequent calls or lock restarts.
However, it was using trans->locknig_wait.task.pid, which is incorrect
since btree_trans objects are cached and reused - typically by different
tasks.
Fix this by switching to pointer order, and also sort them lazily when
required - speeding up the btree_trans_get() fastpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
debug.c was using closure_get() on a different thread's closure where
the we don't know if the object being refcounted is alive.
We keep btree_trans objects on a list so they can be printed by debug
code, and because it is cost prohibitive to touch the btree_trans list
every time we allocate and free btree_trans objects, cached objects are
also on this list.
However, we do not want the debug code to see cached but not in use
btree_trans objects - critically because the btree_paths array will have
been freed (if it was reallocated).
closure_get() is also incorrect to use when that get may race with it
hitting zero, i.e. we must already have a ref on the object or know the
ref can't currently hit 0 for other reasons (as used in the cycle
detector).
to fix this, use the previously introduced closure_get_not_zero(),
closure_return_sync(), and closure_init_stack_release(); the debug code
now can only take a ref on a trans object if it's alive and in use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree_deadlock_to_text() searches the list of btree transactions to find
a deadlock - when it finds one it's done; it's not like other *_read()
functions that's printing each object.
Factor out btree_deadlock_to_text() to make this clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We were grabbing the sequence number before unlock incremented it - fix
this by moving the increment to seqmutex_lock() (so the seqmutex_relock()
failure path skips the mutex_trylock()), and returning the sequence
number from unlock(), to make the API simpler and safer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes incorrect/missign checking of strndup_user() returns.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
`inode->ei_flags` setting and cleaning should be done after initialization,
otherwise the operation is invalid.
Fixes: 9ca4853b98af ("bcachefs: Fix quota support for snapshots")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
write_super() may reallocate the superblock buffer - but
bch_sb_field_ext was referencing it; don't use it after the write_super
call.
Reported-by: syzbot+8992fc10a192067b8d8a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
printk strings get truncated to 1024 bytes; if we have a long error
message (journal debug info) we need to use a helper.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
discard_new_inode() is the correct interface for tearing down an indoe
that was fully created but not made visible to other threads, but it
expects I_NEW to be set, which we don't use.
Reported-by: https://github.com/koverstreet/bcachefs/issues/690
Fixes: bcachefs: Fix race path in bch2_inode_insert()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Incorrect bucket state transition in the discard path; when incrementing
a bucket's generation number that had already been discarded, we were
forgetting to check if it should be need_gc_gens, not free.
This was caught by the .invalid checks in the transaction commit path,
causing us to go emergency read only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs
for read-only mmapped files, such as shared libraries. However, the
kernel makes no attempt to actually align those mappings on 2MB
boundaries, which makes it impossible to use those THPs most of the
time. This issue applies to general file mapping THP as well as
existing setups using CONFIG_READ_ONLY_THP_FOR_FS. This is easily
fixed by using thp_get_unmapped_area for the unmapped_area function
in bcachefs, which is what ext2, ext4, fuse, xfs and btrfs all use.
Similar to commit b0c582233a85 ("btrfs: fix alignment of VMA for
memory mapped files on THP").
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
i.e. the start of automatic self healing:
If errors=continue or fix_safe, we now automatically fix simple errors
without user intervention.
New error action option: fix_safe
This replaces the existing errors=ro option, which gets a new slot, i.e.
existing errors=ro users now get errors=fix_safe.
This is currently only enabled for a limited set of errors - initially
just disk accounting; errors we would never not want to fix, and we
don't want to require user intervention (i.e. to make sure a bug report
gets filed).
Errors will still be counted in the superblock, so we (developers) will
still know they've been occuring if a bug report gets filed (as bug
reports typically include the errors superblock section).
Eventually we'll be enabling this for a much wider set of errors, after
we've done thorough error injection testing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
reference: https://github.com/koverstreet/bcachefs/issues/692
trans->ref is the reference used by the cycle detector, which walks
btree_trans objects of other threads to walk the graph of held locks and
issue wakeups when an abort is required.
We have to wait for the ref to go to 1 before freeing trans->paths or
clearing trans->locking_wait.task.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
this is long running - help users see what's going on
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|