Age | Commit message (Collapse) | Author |
|
[BUG]
There is an internal report that KASAN is reporting use-after-free, with
the following backtrace:
BUG: KASAN: slab-use-after-free in btrfs_check_read_bio+0xa68/0xb70 [btrfs]
Read of size 4 at addr ffff8881117cec28 by task kworker/u16:2/45
CPU: 1 UID: 0 PID: 45 Comm: kworker/u16:2 Not tainted 6.11.0-rc2-next-20240805-default+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
Call Trace:
dump_stack_lvl+0x61/0x80
print_address_description.constprop.0+0x5e/0x2f0
print_report+0x118/0x216
kasan_report+0x11d/0x1f0
btrfs_check_read_bio+0xa68/0xb70 [btrfs]
process_one_work+0xce0/0x12a0
worker_thread+0x717/0x1250
kthread+0x2e3/0x3c0
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x11/0x20
Allocated by task 20917:
kasan_save_stack+0x37/0x60
kasan_save_track+0x10/0x30
__kasan_slab_alloc+0x7d/0x80
kmem_cache_alloc_noprof+0x16e/0x3e0
mempool_alloc_noprof+0x12e/0x310
bio_alloc_bioset+0x3f0/0x7a0
btrfs_bio_alloc+0x2e/0x50 [btrfs]
submit_extent_page+0x4d1/0xdb0 [btrfs]
btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
btrfs_readahead+0x29a/0x430 [btrfs]
read_pages+0x1a7/0xc60
page_cache_ra_unbounded+0x2ad/0x560
filemap_get_pages+0x629/0xa20
filemap_read+0x335/0xbf0
vfs_read+0x790/0xcb0
ksys_read+0xfd/0x1d0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Freed by task 20917:
kasan_save_stack+0x37/0x60
kasan_save_track+0x10/0x30
kasan_save_free_info+0x37/0x50
__kasan_slab_free+0x4b/0x60
kmem_cache_free+0x214/0x5d0
bio_free+0xed/0x180
end_bbio_data_read+0x1cc/0x580 [btrfs]
btrfs_submit_chunk+0x98d/0x1880 [btrfs]
btrfs_submit_bio+0x33/0x70 [btrfs]
submit_one_bio+0xd4/0x130 [btrfs]
submit_extent_page+0x3ea/0xdb0 [btrfs]
btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
btrfs_readahead+0x29a/0x430 [btrfs]
read_pages+0x1a7/0xc60
page_cache_ra_unbounded+0x2ad/0x560
filemap_get_pages+0x629/0xa20
filemap_read+0x335/0xbf0
vfs_read+0x790/0xcb0
ksys_read+0xfd/0x1d0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
[CAUSE]
Although I cannot reproduce the error, the report itself is good enough
to pin down the cause.
The call trace is the regular endio workqueue context, but the
free-by-task trace is showing that during btrfs_submit_chunk() we
already hit a critical error, and is calling btrfs_bio_end_io() to error
out. And the original endio function called bio_put() to free the whole
bio.
This means a double freeing thus causing use-after-free, e.g.:
1. Enter btrfs_submit_bio() with a read bio
The read bio length is 128K, crossing two 64K stripes.
2. The first run of btrfs_submit_chunk()
2.1 Call btrfs_map_block(), which returns 64K
2.2 Call btrfs_split_bio()
Now there are two bios, one referring to the first 64K, the other
referring to the second 64K.
2.3 The first half is submitted.
3. The second run of btrfs_submit_chunk()
3.1 Call btrfs_map_block(), which by somehow failed
Now we call btrfs_bio_end_io() to handle the error
3.2 btrfs_bio_end_io() calls the original endio function
Which is end_bbio_data_read(), and it calls bio_put() for the
original bio.
Now the original bio is freed.
4. The submitted first 64K bio finished
Now we call into btrfs_check_read_bio() and tries to advance the bio
iter.
But since the original bio (thus its iter) is already freed, we
trigger the above use-after free.
And even if the memory is not poisoned/corrupted, we will later call
the original endio function, causing a double freeing.
[FIX]
Instead of calling btrfs_bio_end_io(), call btrfs_orig_bbio_end_io(),
which has the extra check on split bios and do the proper refcounting
for cloned bios.
Furthermore there is already one extra btrfs_cleanup_bio() call, but
that is duplicated to btrfs_orig_bbio_end_io() call, so remove that
label completely.
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 852eee62d31a ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently, we copy the mtime and ctime to the in-core inode and then
mark the inode dirty. This is fine for certain types of filesystems, but
not all. Some require a real setattr to properly change these values
(e.g. ceph or reexported NFS).
Fix this code to call notify_change() instead, which is the proper way
to effect a setattr. There is one problem though:
In this case, the client is holding a write delegation and has sent us
attributes to update our cache. We don't want to break the delegation
for this since that would defeat the purpose. Add a new ATTR_DELEG flag
that makes notify_change bypass the try_break_deleg call.
Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
create_elf_fdpic_tables() does not correctly account the space for the
AUX vector when an architecture has ELF_HWCAP2 defined. Prior to the
commit 10e29251be0e ("binfmt_elf_fdpic: fix /proc/<pid>/auxv") it
resulted in the last entry of the AUX vector being set to zero, but with
that change it results in a kernel BUG.
Fix that by adding one to the number of AUXV entries (nitems) when
ELF_HWCAP2 is defined.
Fixes: 10e29251be0e ("binfmt_elf_fdpic: fix /proc/<pid>/auxv")
Cc: stable@vger.kernel.org
Reported-by: Greg Ungerer <gerg@kernel.org>
Closes: https://lore.kernel.org/lkml/5b51975f-6d0b-413c-8b38-39a6a45e8821@westnet.com.au/
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Tested-by: Greg Ungerer <gerg@kernel.org>
Link: https://lore.kernel.org/r/20240826032745.3423812-1-jcmvbkbc@gmail.com
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Once we drop the delegation reference, the fields embedded in it are no
longer safe to access. Do that last.
Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Once we've dropped the flc_lock, there is nothing that ensures that the
delegation that was found will still be around later. Take a reference
to it while holding the lock and then drop it when we've finished with
the delegation.
Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
extent_fiemap()
There's a warning (probably on some older compiler version):
fs/btrfs/fiemap.c: warning: 'last_extent_end' may be used uninitialized in this function [-Wmaybe-uninitialized]: => 822:19
Initialize the variable to 0 although it's not necessary as it's either
properly set or not used after an error. The called function is in the
same file so this is a false alert but we want to fix all
-Wmaybe-uninitialized reports.
Link: https://lore.kernel.org/all/20240819070639.2558629-1-geert@linux-m68k.org/
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
I notice a rmap query bug in xfs_io fsmap:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt
EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
0: 253:16 [0..7]: static fs metadata 0 (0..7) 8
1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16
2: 253:16 [24..39]: inode btree 0 (24..39) 16
3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8
4: 253:16 [48..55]: refcount btree 0 (48..55) 8
5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48
6: 253:16 [104..127]: free space 0 (104..127) 24
......
Bug:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt
[root@fedora ~]#
Normally, we should be able to get one record, but we got nothing.
The root cause of this problem lies in the incorrect setting of rm_owner in
the rmap query. In the case of the initial query where the owner is not
set, __xfs_getfsmap_datadev() first sets info->high.rm_owner to ULLONG_MAX.
This is done to prevent any omissions when comparing rmap items. However,
if the current ag is detected to be the last one, the function sets info's
high_irec based on the provided key. If high->rm_owner is not specified, it
should continue to be set to ULLONG_MAX; otherwise, there will be issues
with interval omissions. For example, consider "start" and "end" within the
same block. If high->rm_owner == 0, it will be smaller than the founded
record in rmapbt, resulting in a query with no records. The main call stack
is as follows:
xfs_ioc_getfsmap
xfs_getfsmap
xfs_getfsmap_datadev_rmapbt
__xfs_getfsmap_datadev
info->high.rm_owner = ULLONG_MAX
if (pag->pag_agno == end_ag)
xfs_fsmap_owner_to_rmap
// set info->high.rm_owner = 0 because fmr_owner == -1ULL
dest->rm_owner = 0
// get nothing
xfs_getfsmap_datadev_rmapbt_query
The problem can be resolved by simply modify the xfs_fsmap_owner_to_rmap
function internal logic to achieve.
After applying this patch, the above problem have been solved:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt
EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
0: 253:16 [0..7]: static fs metadata 0 (0..7) 8
Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
Don't bother reporting the number of bytes that we "trimmed" because the
underlying storage isn't required to do anything(!) and failed discard
IOs aren't reported to the caller anyway. It's not like userspace can
use the reported value for anything useful like adjusting the offset
parameter of the next call, and it's not like anyone ever wrote a
manpage about FITRIM's out parameters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
As a result of the factoring in commit 14dd46cf31f4 ("xfs: split
xfs_inobt_init_cursor"), mount started taking a long time on a
user's filesystem. For Anders, this made mount times regress from
under a second to over 15 minutes for a filesystem with only 30
million inodes in it.
Anders bisected it down to the above commit, but even then the bug
was not obvious. In this commit, over 20 calls to
xfs_inobt_init_cursor() were modified, and some we modified to call
a new function named xfs_finobt_init_cursor().
If that takes you a moment to reread those function names to see
what the rename was, then you have realised why this bug wasn't
spotted during review. And it wasn't spotted on inspection even
after the bisect pointed at this commit - a single missing "f" isn't
the easiest thing for a human eye to notice....
The result is that xfs_finobt_count_blocks() now incorrectly calls
xfs_inobt_init_cursor() so it is now walking the inobt instead of
the finobt. Hence when there are lots of allocated inodes in a
filesystem, mount takes a -long- time run because it now walks a
massive allocated inode btrees instead of the small, nearly empty
free inode btrees. It also means all the finobt space reservations
are wrong, so mount could potentially given ENOSPC on kernel
upgrade.
In hindsight, commit 14dd46cf31f4 should have been two commits - the
first to convert the finobt callers to the new API, the second to
modify the xfs_inobt_init_cursor() API for the inobt callers. That
would have made the bug very obvious during review.
Fixes: 14dd46cf31f4 ("xfs: split xfs_inobt_init_cursor")
Reported-by: Anders Blomdell <anders.blomdell@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
willy pointed out that folio_mark_dirty is the correct function to use
to mark an xfile folio dirty because it calls out to the mapping's aops
to mark it dirty. For tmpfs this likely doesn't matter much since it
currently uses nop_dirty_folio, but let's use the abstractions properly.
Reported-by: willy@infradead.org
Fixes: 6907e3c00a40 ("xfs: add file_{get,put}_folio")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
"KjellR" complained on IRC that an old V4 filesystem suddenly stopped
mounting after upgrading from 6.9.11 to 6.10.3, with the following splat
when trying to read the rt bitmap inode:
00000000: 49 4e 80 00 01 02 00 01 00 00 00 00 00 00 00 00 IN..............
00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 43 d2 a9 da 21 0f d6 30 ........C...!..0
00000030: 43 d2 a9 da 21 0f d6 30 00 00 00 00 00 00 00 00 C...!..0........
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 00 ................
00000060: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
As Dave Chinner points out, this is a V1 inode with both di_onlink and
di_nlink set to 1 and di_flushiter == 0. In other words, this inode was
formatted this way by mkfs and hasn't been touched since then.
Back in the old days of xfsprogs 3.2.3, I observed that libxfs_ialloc
would set di_nlink, but if the filesystem didn't have NLINK, it would
then set di_version = 1. libxfs_iflush_int later sees the V1 inode and
copies the value of di_nlink to di_onlink without zeroing di_onlink.
Eventually this filesystem must have been upgraded to support NLINK
because 6.10 doesn't support !NLINK filesystems, which is how we tripped
over this old behavior. The filesystem doesn't have a realtime section,
so that's why the rtbitmap inode has never been touched.
Fix this by removing the di_onlink/di_nlink checking for all V1/V2
inodes because this is a muddy mess. The V3 inode handling code has
always supported NLINK and written di_onlink==0 so keep that check.
The removal of the V1 inode handling code when we dropped support for
!NLINK obscured this old behavior.
Reported-by: kjell.m.randa@gmail.com
Fixes: 40cb8613d612 ("xfs: check unused nlink fields in the ondisk inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
We have transient failures with btrfs/301, specifically in the part
where we do
for i in $(seq 0 10); do
write 50m to file
rm -f file
done
Sometimes this will result in a transient quota error, and it's because
sometimes we start writeback on the file which results in a delayed
iput, and thus the rm doesn't actually clean the file up. When we're
flushing the quota space we need to run the delayed iputs to make sure
all the unlinks that we think have completed have actually completed.
This removes the small window where we could fail to find enough space
in our quota.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The cifs filesystem doesn't quite emulate FALLOC_FL_PUNCH_HOLE correctly
(note that due to lack of protocol support, it can't actually implement it
directly). Whilst it will (partially) invalidate dirty folios in the
pagecache, it doesn't write them back first, and so the EOF marker on the
server may be lower than inode->i_size.
This presents a problem, however, as if the punched hole invalidates the
tail of the locally cached dirty data, writeback won't know it needs to
move the EOF over to account for the hole punch (which isn't supposed to
move the EOF). We could just write zeroes over the punched out region of
the pagecache and write that back - but this is supposed to be a
deallocatory operation.
Fix this by manually moving the EOF over on the server after the operation
if the hole punched would corrupt it.
Note that the FSCTL_SET_ZERO_DATA RPC and the setting of the EOF should
probably be compounded to stop a third party interfering (or, at least,
massively reduce the chance).
This was reproducible occasionally by using fsx with the following script:
truncate 0x0 0x375e2 0x0
punch_hole 0x2f6d3 0x6ab5 0x375e2
truncate 0x0 0x3a71f 0x375e2
mapread 0xee05 0xcf12 0x3a71f
write 0x2078e 0x5604 0x3a71f
write 0x3ebdf 0x1421 0x3a71f *
punch_hole 0x379d0 0x8630 0x40000 *
mapread 0x2aaa2 0x85b 0x40000
fallocate 0x1b401 0x9ada 0x40000
read 0x15f2 0x7d32 0x40000
read 0x32f37 0x7a3b 0x40000 *
The second "write" should extend the EOF to 0x40000, and the "punch_hole"
should operate inside of that - but that depends on whether the VM gets in
and writes back the data first. If it doesn't, the file ends up 0x3a71f in
size, not 0x40000.
Fixes: 31742c5a3317 ("enable fallocate punch hole ("fallocate -p") for SMB3")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
rqst.rq_iter needs to be truncated otherwise we'll
also send the bytes into the stream socket...
This is the logic behind rqst.rq_npages = 0, which was removed in
"cifs: Change the I/O paths to use an iterator rather than a page list"
(d08089f649a0cfb2099c8551ac47eef0cc23fdf2).
Cc: stable@vger.kernel.org
Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Reviewed-by: David Howells <dhowells@redhat.com>
Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This happens when called from SMB2_read() while using rdma
and reaching the rdma_readwrite_threshold.
Cc: stable@vger.kernel.org
Fixes: a6559cc1d35d ("cifs: split out smb3_use_rdma_offload() helper")
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull bcachefs fixes from Kent Overstreet:
- assorted syzbot fixes
- some upgrade fixes for old (pre 1.0) filesystems
- fix for moving data off a device that was switched to durability=0
after data had been written to it.
- nocow deadlock fix
- fix for new rebalance_work accounting
* tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs: (28 commits)
bcachefs: Fix rebalance_work accounting
bcachefs: Fix failure to flush moves before sleeping in copygc
bcachefs: don't use rht_bucket() in btree_key_cache_scan()
bcachefs: add missing inode_walker_exit()
bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop()
bcachefs: Fix double assignment in check_dirent_to_subvol()
bcachefs: Fix refcounting in discard path
bcachefs: Fix compat issue with old alloc_v4 keys
bcachefs: Fix warning in bch2_fs_journal_stop()
fs/super.c: improve get_tree() error message
bcachefs: Fix missing validation in bch2_sb_journal_v2_validate()
bcachefs: Fix replay_now_at() assert
bcachefs: Fix locking in bch2_ioc_setlabel()
bcachefs: fix failure to relock in btree_node_fill()
bcachefs: fix failure to relock in bch2_btree_node_mem_alloc()
bcachefs: unlock_long() before resort in journal replay
bcachefs: fix missing bch2_err_str()
bcachefs: fix time_stats_to_text()
bcachefs: Fix bch2_bucket_gens_init()
bcachefs: Fix bch2_trigger_alloc assert
...
|
|
Pull smb server fixes from Steve French:
- query directory flex array fix
- fix potential null ptr reference in open
- fix error message in some open cases
- two minor cleanups
* tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd:
smb/server: update misguided comment of smb2_allocate_rsp_buf()
smb/server: remove useless assignment of 'file_present' in smb2_open()
smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open()
smb/server: fix return value of smb2_open()
ksmbd: the buffer of smb2 query dir response has at least 1 byte
|
|
rebalance_work was keying off of the presence of rebelance_opts in the
extent - but that was incorrect, we keep those around after rebalance
for indirect extents since the inode's options are not directly
available
Fixes: 20ac515a9cc7 ("bcachefs: bch_acct_rebalance_work")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes an apparent deadlock - rebalance would get stuck trying to
take nocow locks because they weren't being released by copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When a folio that is marked for streaming write (dirty, but not uptodate,
with partial content specified in the private data) is written back, the
folio is effectively switched to the blank state upon completion of the
write. This means that if we want to read it in future, we need to reread
the whole folio.
However, if the folio is above the zero_point position, when it is read
back, it will just be cleared and the read skipped, leading to apparent
local corruption.
Fix this by increasing the zero_point to the end of the dirty data in the
folio when clearing the folio state after writeback. This is analogous to
the folio having ->release_folio() called upon it.
This was causing the config.log generated by configuring a cpython tree on
a cifs share to get corrupted because the scripts involved were appending
text to the file in small pieces.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/563286.1724500613@warthog.procyon.org.uk
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_rreq_perform_resubmissions() to reset before retrying a short
read, otherwise the wrong part of the output buffer will be used.
Fixes: 92b6cc5d1e7c ("netfs: Add iov_iters to (sub)requests to describe various buffers")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240823200819.532106-6-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When netfslib writes to a folio that it doesn't have data for, but that
data exists on the server, it will make a 'streaming write' whereby it
stores data in a folio that is marked dirty, but not uptodate. When it
does this, it attaches a record to folio->private to track the dirty
region.
When truncate() or fallocate() wants to invalidate part of such a folio, it
will call into ->invalidate_folio(), specifying the part of the folio that
is to be invalidated. netfs_invalidate_folio(), on behalf of the
filesystem, must then determine how to trim the streaming write record. In
a couple of cases, however, it does this incorrectly (the reduce-length and
move-start cases are switched over and don't, in any case, calculate the
value correctly).
Fix this by making the logic tree more obvious and fixing the cases.
Fixes: 9ebff83e6481 ("netfs: Prep to use folio->private for write grouping and streaming write")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240823200819.532106-5-dhowells@redhat.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Pankaj Raghav <p.raghav@samsung.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_release_folio() to say no (ie. return false) if the folio is
dirty (analogous with iomap's behaviour). Without this, it will say yes to
the release of a dirty page by split_huge_page_to_list_to_order(), which
will result in the loss of untruncated data in the folio.
Without this, the generic/075 and generic/112 xfstests (both fsx-based
tests) fail with minimum folio size patches applied[1].
Fixes: c1ec4d7c2e13 ("netfs: Provide invalidate_folio and release_folio calls")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240815090849.972355-1-kernel@pankajraghav.com/ [1]
Link: https://lore.kernel.org/r/20240823200819.532106-4-dhowells@redhat.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Pankaj Raghav <p.raghav@samsung.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
At the end of an kAFS RPC operation, there is an "edit" phase (originally
intended for post-directory modification ops to edit the local image) that
the setattr VFS op uses to fix up the pagecache if the RPC that requested
truncation of a file was successful.
afs_setattr_edit_file() calls truncate_setsize() which sets i_size, expands
the pagecache if needed and truncates the pagecache. The first two of
those, however, are redundant as they've already been done by
afs_setattr_success() under the io_lock and the first is also done under
the callback lock (cb_lock).
Fix afs_setattr_edit_file() to call truncate_pagecache() instead (which is
called by truncate_setsize(), thereby skipping the redundant parts.
Fixes: 100ccd18bb41 ("netfs: Optimise away reads above the point at which there can be no data")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240823200819.532106-3-dhowells@redhat.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Pankaj Raghav <p.raghav@samsung.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Simplify and fix overlayfs layer parsing so the maximum of 500 layers
can be used.
* patches from https://lore.kernel.org/r/20240705011510.794025-1-chengzhihao1@huawei.com:
ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err
ovl: fix wrong lowerdir number check for parameter Opt_lowerdir
ovl: pass string to ovl_parse_layer()
Link: https://lore.kernel.org/r/20240705011510.794025-1-chengzhihao1@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull NFS client fixes from Anna Schumaker:
- Fix rpcrdma refcounting in xa_alloc
- Fix rpcrdma usage of XA_FLAGS_ALLOC
- Fix requesting FATTR4_WORD2_OPEN_ARGUMENTS
- Fix attribute bitmap decoder to handle a 3rd word
- Add reschedule points when returning delegations to avoid soft lockups
- Fix clearing layout segments in layoutreturn
- Avoid unnecessary rescanning of the per-server delegation list
* tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Avoid unnecessary rescanning of the per-server delegation list
NFSv4: Fix clearing of layout segments in layoutreturn
NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations
nfs: fix bitmap decoder to handle a 3rd word
nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS
rpcrdma: Trace connection registration and unregistration
rpcrdma: Use XA_FLAGS_ALLOC instead of XA_FLAGS_ALLOC1
rpcrdma: Device kref is over-incremented on error from xa_alloc
|
|
Pull smb client fixes from Steve French:
- fix refcount leak (can cause rmmod fail)
- fix byte range locking problem with cached reads
- fix for mount failure if reparse point unrecognized
- minor typo
* tag 'v6.11-rc4-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb/client: fix typo: GlobalMid_Sem -> GlobalMid_Lock
smb: client: ignore unhandled reparse tags
smb3: fix problem unloading module due to leaked refcount on shutdown
smb3: fix broken cached reads when posix locks
|
|
Add '\n' for pr_err in function ovl_parse_param_lowerdir(), which
ensures that error message is displayed at once.
Fixes: b36a5780cb44 ("ovl: modify layer parameter parsing")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240705011510.794025-4-chengzhihao1@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The max count of lowerdir is OVL_MAX_STACK[500], which is broken by
commit 37f32f526438("ovl: fix memory leak in ovl_parse_param()") for
parameter Opt_lowerdir. Since commit 819829f0319a("ovl: refactor layer
parsing helpers") and commit 24e16e385f22("ovl: add support for
appending lowerdirs one by one") added check ovl_mount_dir_check() in
function ovl_parse_param_lowerdir(), the 'ctx->nr' should be smaller
than OVL_MAX_STACK, after commit 37f32f526438("ovl: fix memory leak in
ovl_parse_param()") is applied, the 'ctx->nr' is updated before the
check ovl_mount_dir_check(), which leads the max count of lowerdir
to become 499 for parameter Opt_lowerdir.
Fix it by replacing lower layers parsing code with the existing helper
function ovl_parse_layer().
Fixes: 37f32f526438 ("ovl: fix memory leak in ovl_parse_param()")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240705011510.794025-3-chengzhihao1@huawei.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
So it can be used for parsing the Opt_lowerdir.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240705011510.794025-2-chengzhihao1@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Prior to commit 3f29cc82a84c ("nfsd: split sc_status out of
sc_type") states_show() relied on sc_type field to be of valid
type before calling into a subfunction to show content of a
particular stateid. From that commit, we split the validity of
the stateid into sc_status and no longer changed sc_type to 0
while unhashing the stateid. This resulted in kernel oopsing
for nfsv4.0 opens that stay around and in nfs4_show_open()
would derefence sc_file which was NULL.
Instead, for closed open stateids forgo displaying information
that relies of having a valid sc_file.
To reproduce: mount the server with 4.0, read and close
a file and then on the server cat /proc/fs/nfsd/clients/2/states
[ 513.590804] Call trace:
[ 513.590925] _raw_spin_lock+0xcc/0x160
[ 513.591119] nfs4_show_open+0x78/0x2c0 [nfsd]
[ 513.591412] states_show+0x44c/0x488 [nfsd]
[ 513.591681] seq_read_iter+0x5d8/0x760
[ 513.591896] seq_read+0x188/0x208
[ 513.592075] vfs_read+0x148/0x470
[ 513.592241] ksys_read+0xcc/0x178
Fixes: 3f29cc82a84c ("nfsd: split sc_status out of sc_type")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Filesystems may define their own splice write. Therefore, use the file
fops instead of invoking iter_file_splice_write() directly.
Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
Link: https://lore.kernel.org/r/20240708072208.25244-1-ed.tsai@mediatek.com
Fixes: 5ca73468612d ("fuse: implement splice read/write passthrough")
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If the call to nfs_delegation_grab_inode() fails, we will not have
dropped any locks that require us to rescan the list.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Make sure that we clear the layout segments in cases where we see a
fatal error, and also in the case where the layout is invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
We're seeing reports of soft lockups when iterating through the loops,
so let's add rescheduling points.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
It only decodes the first two words at this point. Have it decode the
third word as well. Without this, the client doesn't send delegated
timestamps in the CB_GETATTR response.
With this change we also need to expand the on-stack bitmap in
decode_recallany_args to 3 elements, in case the server sends a larger
bitmap than expected.
Fixes: 43df7110f4a9 ("NFSv4: Add CB_GETATTR support for delegated attributes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
The client doesn't properly request FATTR4_OPEN_ARGUMENTS in the initial
SERVER_CAPS getattr. Add FATTR4_WORD2_OPEN_ARGUMENTS to the initial
request.
Fixes: 707f13b3d081 (NFSv4: Add support for the FATTR4_OPEN_ARGUMENTS attribute)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
The comments have typos, fix that to not confuse readers.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If nfsd4_encode_fattr4 ends up doing a "goto out" before we get to
checking for the security label, then args.context will be set to
uninitialized junk on the stack, which we'll then try to free.
Initialize it early.
Fixes: f59388a579c6 ("NFSD: Add nfsd4_encode_fattr4_sec_label()")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Just ignore reparse points that the client can't parse rather than
bailing out and not opening the file or directory.
Reported-by: Marc <1marc1@gmail.com>
Closes: https://lore.kernel.org/r/CAMHwNVv-B+Q6wa0FEXrAuzdchzcJRsPKDDRrNaYZJd6X-+iJzw@mail.gmail.com
Fixes: 539aad7f14da ("smb: client: introduce ->parse_reparse_point()")
Tested-by: Anthony Nandaa (Microsoft) <profnandaa@gmail.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The shutdown ioctl can leak a refcount on the tlink which can
prevent rmmod (unloading the cifs.ko) module from working.
Found while debugging xfstest generic/043
Fixes: 69ca1f57555f ("smb3: add dynamic tracepoints for shutdown ioctl")
Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
smb2_allocate_rsp_buf() will return other error code except -ENOMEM.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The variable is already true here.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
null-ptr-deref will occur when (req_op_level == SMB2_OPLOCK_LEVEL_LEASE)
and parse_lease_state() return NULL.
Fix this by check if 'lease_ctx_info' is NULL.
Additionally, remove the redundant parentheses in
parse_durable_handle_context().
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
In most error cases, error code is not returned in smb2_open(),
__process_request() will not print error message.
Fix this by returning the correct value at the end of smb2_open().
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When STATUS_NO_MORE_FILES status is set to smb2 query dir response,
->StructureSize is set to 9, which mean buffer has 1 byte.
This issue occurs because ->Buffer[1] in smb2_query_directory_rsp to
flex-array.
Fixes: eb3e28c1e89b ("smb3: Replace smb2pdu 1-element arrays with flex-arrays")
Cc: stable@vger.kernel.org # v6.1+
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
rht_bucket() does strange complicated things when a rehash is in
progress.
Instead, just skip scanning when a rehash is in progress: scanning is
going to be more expensive (many more empty slots to cover), and some
sort of infinite loop is being observed
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
fix a small leak
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_btree_key_cache_drop() evicts the key cache entry - it's used when
we're doing an update that bypasses the key cache, because for cache
coherency reasons a key can't be in the key cache unless it also exists
in the btree - i.e. creates have to bypass the cache.
After evicting, the path no longer points to a key cache key, and
relock() will always fail if should_be_locked is true.
Prep for improving path->should_be_locked assertions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|