Age | Commit message (Collapse) | Author |
|
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
correctly. The problem is that we try to set them up in the request
initialisation, but we the cache may be in the process of setting up still,
and so the state may not be correct. Further, we secondarily sample the
cache state and make contradictory decisions later.
The issue arises because we set up the cache resources, which allows the
cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
triggers cache writing even if we didn't set the flags when allocating.
Fix this in the following way:
(1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
->init_request() rather than trying to juggle that in
netfs_alloc_request().
(2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
to be done, then PG_private_2 is to be used rather than only setting
it if we decide to cache and then having netfs_rreq_unlock_folios()
set the non-PG_private_2 writeback-to-cache if it wasn't set.
(3) Split netfs_rreq_unlock_folios() into two functions, one of which
contains the deprecated code for using PG_private_2 to avoid
accidentally doing the writeback path - and always use it if
USE_PGPRIV2 is set.
(4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
wait for PG_private_2. This function is deprecated and only used by
ceph anyway, and so label it so.
(5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
fscache_operation_valid() on the cache_resources instead. This has
the advantage of picking up the result of netfs_begin_cache_read() and
fscache_begin_write_operation() - which are called after the object is
initialised and will wait for the cache to come to a usable state.
Just reverting ae678317b95e[1] isn't a sufficient fix, so this need to be
applied on top of that. Without this as well, things like:
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {
and:
WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386
may happen, along with some UAFs due to PG_private_2 not getting used to
wait on writeback completion.
Fixes: 2ff1e97587f4 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Hristo Venev <hristo@venev.name>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When netfslib is performing writeback (ie. ->writepages), it maintains two
parallel streams of writes, one to the server and one to the cache, but it
doesn't mark either stream of writes as active until it gets some data that
needs to be written to that stream.
This is done because some folios will only be written to the cache
(e.g. copying to the cache on read is done by marking the folios and
letting writeback do the actual work) and sometimes we'll only be writing
to the server (e.g. if there's no cache).
Now, since we don't actually dispatch uploads and cache writes in parallel,
but rather flip between the streams, depending on which has the lowest
so-far-issued offset, and don't wait for the subreqs to finish before
flipping, we can end up in a situation where, say, we issue a write to the
server and this completes before we start the write to the cache.
But because we only activate a stream when we first add a subreq to it, the
result collection code may run before we manage to activate the stream -
resulting in the folio being cleaned and having the writeback-in-progress
mark removed. At this point, the folio no longer belongs to us.
This is only really a problem for folios that need to be written to both
streams - and in that case, the upload to the server is started first,
followed by the write to the cache - and the cache write may see a bad
folio.
Fix this by activating the cache stream up front if there's a cache
available. If there's a cache, then all data is going to be written to it.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1599053.1721398818@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Revert commit 163eae0fb0d4c610c59a8de38040f8e12f89fd43 to get back the
original operation of the debugging macros.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240608151352.22860-2-ukleinek@kernel.org
Link: https://lore.kernel.org/r/1410685.1721333252@warthog.procyon.org.uk
cc: Uwe Kleine-König <ukleinek@kernel.org>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"cachefiles:
- Export an existing and add a new cachefile helper to be used in
filesystems to fix reference count bugs
- Use the newly added fscache_ty_get_volume() helper to get a
reference count on an fscache_volume to handle volumes that are
about to be removed cleanly
- After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN
wait for all ongoing cookie lookups to complete and for the object
count to reach zero
- Propagate errors from vfs_getxattr() to avoid an infinite loop in
cachefiles_check_volume_xattr() because it keeps seeing ESTALE
- Don't send new requests when an object is dropped by raising
CACHEFILES_ONDEMAND_OJBSTATE_DROPPING
- Cancel all requests for an object that is about to be dropped
- Wait for the ondemand_boject_worker to finish before dropping a
cachefiles object to prevent use-after-free
- Use cyclic allocation for message ids to better handle id recycling
- Add missing lock protection when iterating through the xarray when
polling
netfs:
- Use standard logging helpers for debug logging
VFS:
- Fix potential use-after-free in file locks during
trace_posix_lock_inode(). The tracepoint could fire while another
task raced it and freed the lock that was requested to be traced
- Only increment the nr_dentry_negative counter for dentries that are
present on the superblock LRU. Currently, DCACHE_LRU_LIST list is
used to detect this case. However, the flag is also raised in
combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru
is used. So checking only DCACHE_LRU_LIST will lead to wrong
nr_dentry_negative count. Fix the check to not count dentries that
are on a shrink related list
Misc:
- hfsplus: fix an uninitialized value issue in copy_name
- minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even
though we switched it to kmap_local_page() a while ago"
* tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
minixfs: Fix minixfs_rename with HIGHMEM
hfsplus: fix uninit-value in copy_name
vfs: don't mod negative dentry count when on shrinker list
filelock: fix potential use-after-free in posix_lock_inode
cachefiles: add missing lock protection when polling
cachefiles: cyclic allocation of msg_id to avoid reuse
cachefiles: wait for ondemand_object_worker to finish when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: stop sending new request when dropping object
cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()
netfs: Switch debug logging to pr_debug()
|
|
libaokun@huaweicloud.com <libaokun@huaweicloud.com> says:
This is the third version of this patch series, in which another patch set
is subsumed into this one to avoid confusing the two patch sets.
(https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)
We've been testing ondemand mode for cachefiles since January, and we're
almost done. We hit a lot of issues during the testing period, and this
patch series fixes some of the issues. The patches have passed internal
testing without regression.
The following is a brief overview of the patches, see the patches for
more details.
Patch 1-2: Add fscache_try_get_volume() helper function to avoid
fscache_volume use-after-free on cache withdrawal.
Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
concurrency causing cachefiles_volume use-after-free.
Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
endless loops.
Patch 5-7: A read request waiting for reopen could be closed maliciously
before the reopen worker is executing or waiting to be scheduled. So
ondemand_object_worker() may be called after the info and object and even
the cache have been freed and trigger use-after-free. So use
cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
reopen worker or wait for it to finish. Since it makes no sense to wait
for the daemon to complete the reopen request, to avoid this pointless
operation blocking cancel_work_sync(), Patch 1 avoids request generation
by the DROPPING state when the request has not been sent, and Patch 2
flushes the requests of the current object before cancel_work_sync().
Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
the daemon to cause hung.
Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing
use-after-free. This issue was triggered frequently in our tests, and we
found that anolis 5.10 had fixed it. So to avoid failing the test, this
patch is pushed upstream as well.
Baokun Li (7):
netfs, fscache: export fscache_put_volume() and add
fscache_try_get_volume()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: propagate errors from vfs_getxattr() to avoid infinite
loop
cachefiles: stop sending new request when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: cyclic allocation of msg_id to avoid reuse
Hou Tao (1):
cachefiles: wait for ondemand_object_worker to finish when dropping
object
Jingbo Xu (1):
cachefiles: add missing lock protection when polling
fs/cachefiles/cache.c | 45 ++++++++++++++++++++++++++++-
fs/cachefiles/daemon.c | 4 +--
fs/cachefiles/internal.h | 3 ++
fs/cachefiles/ondemand.c | 52 ++++++++++++++++++++++++++++++----
fs/cachefiles/volume.c | 1 -
fs/cachefiles/xattr.c | 5 +++-
fs/netfs/fscache_volume.c | 14 +++++++++
fs/netfs/internal.h | 2 --
include/linux/fscache-cache.h | 6 ++++
include/trace/events/fscache.h | 4 +++
10 files changed, 123 insertions(+), 13 deletions(-)
Link: https://lore.kernel.org/r/20240628062930.2467993-1-libaokun@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
During the writeback procedure, at the end of netfs_write_folio(), pending
write operations are flushed if the amount of write-streaming data stored
in a page is less than the size of the folio because if we haven't modified
a folio to the end, it cannot be contiguous with the following folio...
except if the dirty region of the folio is right at the end of the folio
space.
Fix the test to take the offset into the folio into account as well, such
that if the dirty region runs right up to the end of the folio, we leave
the flushing for later.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com> (DFS, global name space)
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-4-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Instead of inventing a custom way to conditionally enable debugging,
just make use of pr_debug(), which also has dynamic debugging facilities
and is more likely known to someone who hunts a problem in the netfs
code. Also drop the module parameter netfs_debug which didn't have any
effect without further source changes. (The variable netfs_debug was
only used in #ifdef blocks for cpp vars that don't exist; Note that
CONFIG_NETFS_DEBUG isn't settable via kconfig, a variable with that name
never existed in the mainline and is probably just taken over (and
renamed) from similar custom debug logging implementations.)
Signed-off-by: Uwe Kleine-König <ukleinek@kernel.org>
Link: https://lore.kernel.org/r/20240608151352.22860-2-ukleinek@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If an error occurs whilst we're doing an AIO write in write-through mode,
we may end up calling ->ki_complete() *and* returning an error from
->write_iter(). This can result in either a UAF (the ->ki_complete() func
pointer may get overwritten, for example) or a refcount underflow in
io_submit() as ->ki_complete is called twice.
Fix this by making netfs_end_writethrough() - and thus
netfs_perform_write() - unconditionally return -EIOCBQUEUED if we're doing
an AIO write and wait for completion if we're not.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/295052.1716298587@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This can be triggered by mounting a cifs filesystem with a cache=strict
mount option and then, using the fsx program from xfstests, doing:
ltp/fsx -A -d -N 1000 -S 11463 -P /tmp /cifs-mount/foo \
--replay-ops=gen112-fsxops
Where gen112-fsxops holds:
fallocate 0x6be7 0x8fc5 0x377d3
copy_range 0x9c71 0x77e8 0x2edaf 0x377d3
write 0x2776d 0x8f65 0x377d3
The problem is that netfs_io_request::len is being used for two purposes
and ends up getting set to the amount of data we transferred, not the
amount of data the caller asked to be transferred (for various reasons,
such as mmap'd writes, we might end up rounding out the data written to the
server to include the entire folio at each end).
Fix this by keeping the amount we were asked to write in ->len and using
->submitted to track what we issued ops for. Then, when we come to calling
->ki_complete(), ->len is the right size.
This also required netfs_cleanup_dio_write() to change since we're no
longer advancing wreq->len. Use wreq->transferred instead as we might have
done a short read.
With this, the generic/112 xfstest passes if cifs is forced to put all
non-DIO opens into write-through mode.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/295086.1716298663@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <stfrench@microsoft.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Cut over to using the new writeback code. The old code is #ifdef'd out or
otherwise removed from compilation to avoid conflicts and will be removed
in a future patch.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
Add some write-side stats to count buffered writes, buffered writethrough,
and writepages calls.
Whilst we're at it, clean up the naming on some of the existing stats
counters and organise the output into two sets.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
The current netfslib writeback implementation creates writeback requests of
contiguous folio data and then separately tiles subrequests over the space
twice, once for the server and once for the cache. This creates a few
issues:
(1) Every time there's a discontiguity or a change between writing to only
one destination or writing to both, it must create a new request.
This makes it harder to do vectored writes.
(2) The folios don't have the writeback mark removed until the end of the
request - and a request could be hundreds of megabytes.
(3) In future, I want to support a larger cache granularity, which will
require aggregation of some folios that contain unmodified data (which
only need to go to the cache) and some which contain modifications
(which need to be uploaded and stored to the cache) - but, currently,
these are treated as discontiguous.
There's also a move to get everyone to use writeback_iter() to extract
writable folios from the pagecache. That said, currently writeback_iter()
has some issues that make it less than ideal:
(1) there's no way to cancel the iteration, even if you find a "temporary"
error that means the current folio and all subsequent folios are going
to fail;
(2) there's no way to filter the folios being written back - something
that will impact Ceph with it's ordered snap system;
(3) and if you get a folio you can't immediately deal with (say you need
to flush the preceding writes), you are left with a folio hanging in
the locked state for the duration, when really we should unlock it and
relock it later.
In this new implementation, I use writeback_iter() to pump folios,
progressively creating two parallel, but separate streams and cleaning up
the finished folios as the subrequests complete. Either or both streams
can contain gaps, and the subrequests in each stream can be of variable
size, don't need to align with each other and don't need to align with the
folios.
Indeed, subrequests can cross folio boundaries, may cover several folios or
a folio may be spanned by multiple folios, e.g.:
+---+---+-----+-----+---+----------+
Folios: | | | | | | |
+---+---+-----+-----+---+----------+
+------+------+ +----+----+
Upload: | | |.....| | |
+------+------+ +----+----+
+------+------+------+------+------+
Cache: | | | | | |
+------+------+------+------+------+
The progressive subrequest construction permits the algorithm to be
preparing both the next upload to the server and the next write to the
cache whilst the previous ones are already in progress. Throttling can be
applied to control the rate of production of subrequests - and, in any
case, we probably want to write them to the server in ascending order,
particularly if the file will be extended.
Content crypto can also be prepared at the same time as the subrequests and
run asynchronously, with the prepped requests being stalled until the
crypto catches up with them. This might also be useful for transport
crypto, but that happens at a lower layer, so probably would be harder to
pull off.
The algorithm is split into three parts:
(1) The issuer. This walks through the data, packaging it up, encrypting
it and creating subrequests. The part of this that generates
subrequests only deals with file positions and spans and so is usable
for DIO/unbuffered writes as well as buffered writes.
(2) The collector. This asynchronously collects completed subrequests,
unlocks folios, frees crypto buffers and performs any retries. This
runs in a work queue so that the issuer can return to the caller for
writeback (so that the VM can have its kswapd thread back) or async
writes.
(3) The retryer. This pauses the issuer, waits for all outstanding
subrequests to complete and then goes through the failed subrequests
to reissue them. This may involve reprepping them (with cifs, the
credits must be renegotiated, and a subrequest may need splitting),
and doing RMW for content crypto if there's a conflicting change on
the server.
[!] Note that some of the functions are prefixed with "new_" to avoid
clashes with existing functions. These will be renamed in a later patch
that cuts over to the new algorithm.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|