Age | Commit message (Collapse) | Author |
|
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().
Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes for various problems that have user visible effects
or seem to be urgent:
- fix corruption when combining DIO and non-blocking io_uring over
multiple extents (seen on MariaDB)
- fix relocation crash due to premature return from commit
- fix quota deadlock between rescan and qgroup removal
- fix item data bounds checks in tree-checker (found on a fuzzed
image)
- fix fsync of prealloc extents after EOF
- add missing run of delayed items after unlink during log replay
- don't start relocation until snapshot drop is finished
- fix reversed condition for subpage writers locking
- fix warning on page error"
* tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fallback to blocking mode when doing async dio over multiple extents
btrfs: add missing run of delayed items after unlink during log replay
btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
btrfs: do not start relocation until in progress drops are done
btrfs: tree-checker: use u64 for item data end to avoid overflow
btrfs: do not WARN_ON() if we have PageError set
btrfs: fix lost prealloc extents beyond eof after full fsync
btrfs: subpage: fix a wrong check on subpage->writers
|
|
Some users recently reported that MariaDB was getting a read corruption
when using io_uring on top of btrfs. This started to happen in 5.16,
after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
during direct IO reads and writes"). That changed btrfs to use the new
iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
corresponds to a memory mapped file region. That type of scenario is
exercised by test case generic/647 from fstests.
For this MariaDB scenario, we attempt to read 16K from file offset X
using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
with a size of 4K, and what happens is the following:
1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
2) iomap creates a struct iomap_dio object, its reference count is
initialized to 1 and its ->size field is initialized to 0;
3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
the first 4K extent, and setups an iomap for this extent consisting
of a single page;
4) At iomap_dio_bio_iter(), we are able to access the first page of the
buffer (struct iov_iter) with bio_iov_iter_get_pages() without
triggering a page fault;
5) iomap submits a bio for this 4K extent
(iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
the refcount on the struct iomap_dio object to 2; The ->size field
of the struct iomap_dio object is incremented to 4K;
6) iomap calls btrfs_iomap_begin() again, this time with a file
offset of X + 4K. There we setup an iomap for the next extent
that also has a size of 4K;
7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
which tries to access the next page (2nd page) of the buffer.
This triggers a page fault and returns -EFAULT;
8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
the struct iomap_dio object has a ->size value of 4K (we submitted
a bio for an extent already). The 'wait_for_completion' variable
is not set to true, because our iocb has IOCB_NOWAIT set;
9) At the bottom of __iomap_dio_rw(), we decrement the reference count
of the struct iomap_dio object from 2 to 1. Because we were not
the only ones holding a reference on it and 'wait_for_completion' is
set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
just returns it up the callchain, up to io_uring;
10) The bio submitted for the first extent (step 5) completes and its
bio endio function, iomap_dio_bio_end_io(), decrements the last
reference on the struct iomap_dio object, resulting in calling
iomap_dio_complete_work() -> iomap_dio_complete().
11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
and return 4K (the amount of io done) to iomap_dio_complete_work();
12) iomap_dio_complete_work() calls the iocb completion callback,
iocb->ki_complete() with a second argument value of 4K (total io
done) and the iocb with the adjust ki_pos of X + 4K. This results
in completing the read request for io_uring, leaving it with a
result of 4K bytes read, and only the first page of the buffer
filled in, while the remaining 3 pages, corresponding to the other
3 extents, were not filled;
13) For the application, the result is unexpected because if we ask
to read N bytes, it expects to get N bytes read as long as those
N bytes don't cross the EOF (i_size).
MariaDB reports this as an error, as it's not expecting a short read,
since it knows it's asking for read operations fully within the i_size
boundary. This is typical in many applications, but it may also be
questionable if they should react to such short reads by issuing more
read calls to get the remaining data. Nevertheless, the short read
happened due to a change in btrfs regarding how it deals with page
faults while in the middle of a read operation, and there's no reason
why btrfs can't have the previous behaviour of returning the whole data
that was requested by the application.
The problem can also be triggered with the following simple program:
/* Get O_DIRECT */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <liburing.h>
int main(int argc, char *argv[])
{
char *foo_path;
struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
struct iovec iovec;
int fd;
long pagesize;
void *write_buf;
void *read_buf;
ssize_t ret;
int i;
if (argc != 2) {
fprintf(stderr, "Use: %s <directory>\n", argv[0]);
return 1;
}
foo_path = malloc(strlen(argv[1]) + 5);
if (!foo_path) {
fprintf(stderr, "Failed to allocate memory for file path\n");
return 1;
}
strcpy(foo_path, argv[1]);
strcat(foo_path, "/foo");
/*
* Create file foo with 2 extents, each with a size matching
* the page size. Then allocate a buffer to read both extents
* with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
* the read with io_uring, access the first page of the buffer
* to fault it in, so that during the read we only trigger a
* page fault when accessing the second page of the buffer.
*/
fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
O_DIRECT, 0666);
if (fd == -1) {
fprintf(stderr,
"Failed to create file 'foo': %s (errno %d)",
strerror(errno), errno);
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate write buffer\n");
return 1;
}
memset(write_buf, 0xab, pagesize);
memset(write_buf + pagesize, 0xcd, pagesize);
/* Create 2 extents, each with a size matching page size. */
for (i = 0; i < 2; i++) {
ret = pwrite(fd, write_buf + i * pagesize, pagesize,
i * pagesize);
if (ret != pagesize) {
fprintf(stderr,
"Failed to write to file, ret = %ld errno %d (%s)\n",
ret, errno, strerror(errno));
return 1;
}
ret = fsync(fd);
if (ret != 0) {
fprintf(stderr, "Failed to fsync file\n");
return 1;
}
}
close(fd);
fd = open(foo_path, O_RDONLY | O_DIRECT);
if (fd == -1) {
fprintf(stderr,
"Failed to open file 'foo': %s (errno %d)",
strerror(errno), errno);
return 1;
}
ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate read buffer\n");
return 1;
}
/*
* Fault in only the first page of the read buffer.
* We want to trigger a page fault for the 2nd page of the
* read buffer during the read operation with io_uring
* (O_DIRECT and IOCB_NOWAIT).
*/
memset(read_buf, 0, 1);
ret = io_uring_queue_init(1, &ring, 0);
if (ret != 0) {
fprintf(stderr, "Failed to create io_uring queue\n");
return 1;
}
sqe = io_uring_get_sqe(&ring);
if (!sqe) {
fprintf(stderr, "Failed to get io_uring sqe\n");
return 1;
}
iovec.iov_base = read_buf;
iovec.iov_len = 2 * pagesize;
io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
ret = io_uring_submit_and_wait(&ring, 1);
if (ret != 1) {
fprintf(stderr,
"Failed at io_uring_submit_and_wait()\n");
return 1;
}
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
return 1;
}
printf("io_uring read result for file foo:\n\n");
printf(" cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize);
printf(" memcmp(read_buf, write_buf) == %d (expected 0)\n",
memcmp(read_buf, write_buf, 2 * pagesize));
io_uring_cqe_seen(&ring, cqe);
io_uring_queue_exit(&ring);
return 0;
}
When running it on an unpatched kernel:
$ gcc io_uring_test.c -luring
$ mkfs.btrfs -f /dev/sda
$ mount /dev/sda /mnt/sda
$ ./a.out /mnt/sda
io_uring read result for file foo:
cqe->res == 4096 (expected 8192)
memcmp(read_buf, write_buf) == -205 (expected 0)
After this patch, the read always returns 8192 bytes, with the buffer
filled with the correct data. Although that reproducer always triggers
the bug in my test vms, it's possible that it will not be so reliable
on other environments, as that can happen if the bio for the first
extent completes and decrements the reference on the struct iomap_dio
object before we do the atomic_dec_and_test() on the reference at
__iomap_dio_rw().
Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
set) over a range that spans multiple extents (or a mix of extents and
holes). This avoids returning success to the caller when we only did
partial IO, which is not optimal for writes and for reads it's actually
incorrect, as the caller doesn't expect to get less bytes read than it has
requested (unless EOF is crossed), as previously mentioned. This is also
the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
even though it doesn't use IOMAP_DIO_PARTIAL.
A test case for fstests will follow soon.
Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During log replay, whenever we need to check if a name (dentry) exists in
a directory we do searches on the subvolume tree for inode references or
or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY
keys as well, before kernel 5.17). However when during log replay we
unlink a name, through btrfs_unlink_inode(), we may not delete inode
references and dir index keys from a subvolume tree and instead just add
the deletions to the delayed inode's delayed items, which will only be
run when we commit the transaction used for log replay. This means that
after an unlink operation during log replay, if we attempt to search for
the same name during log replay, we will not see that the name was already
deleted, since the deletion is recorded only on the delayed items.
We run delayed items after every unlink operation during log replay,
except at unlink_old_inode_refs() and at add_inode_ref(). This was due
to an overlook, as delayed items should be run after evert unlink, for
the reasons stated above.
So fix those two cases.
Fixes: 0d836392cadd5 ("Btrfs: fix mount failure after fsync due to hard link recreation")
Fixes: 1f250e929a9c9 ("Btrfs: fix log replay failure after unlink and link combination")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The commit e804861bd4e6 ("btrfs: fix deadlock between quota disable and
qgroup rescan worker") by Kawasaki resolves deadlock between quota
disable and qgroup rescan worker. But also there is a deadlock case like
it. It's about enabling or disabling quota and creating or removing
qgroup. It can be reproduced in simple script below.
for i in {1..100}
do
btrfs quota enable /mnt &
btrfs qgroup create 1/0 /mnt &
btrfs qgroup destroy 1/0 /mnt &
btrfs quota disable /mnt &
done
Here's why the deadlock happens:
1) The quota rescan task is running.
2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock
mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for
the quota rescan task to complete.
3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock
the qgroup_ioctl_lock mutex, because it's being held by task A. At that
point task B is holding a transaction handle for the current transaction.
4) The quota rescan task calls btrfs_commit_transaction(). This results
in it waiting for all other tasks to release their handles on the
transaction, but task B is blocked on the qgroup_ioctl_lock mutex
while holding a handle on the transaction, and that mutex is being held
by task A, which is waiting for the quota rescan task to complete,
resulting in a deadlock between these 3 tasks.
To resolve this issue, the thread disabling quota should unlock
qgroup_ioctl_lock before waiting rescan completion. Move
btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock.
Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_commit_transaction()
We are seeing crashes similar to the following trace:
[38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
[38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54
[38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs]
[38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206
[38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14
[38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360
[38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000
[38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800
[38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360
[38.987146] FS: 00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000
[38.988662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0
[38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[38.992528] Call Trace:
[38.992854] <TASK>
[38.993148] btrfs_relocate_chunk+0x27/0xe0 [btrfs]
[38.993941] btrfs_balance+0x78e/0xea0 [btrfs]
[38.994801] ? vsnprintf+0x33c/0x520
[38.995368] ? __kmalloc_track_caller+0x351/0x440
[38.996198] btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs]
[38.997084] btrfs_ioctl+0x11b0/0x2da0 [btrfs]
[38.997867] ? mod_objcg_state+0xee/0x340
[38.998552] ? seq_release+0x24/0x30
[38.999184] ? proc_nr_files+0x30/0x30
[38.999654] ? call_rcu+0xc8/0x2f0
[39.000228] ? __x64_sys_ioctl+0x84/0xc0
[39.000872] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[39.001973] __x64_sys_ioctl+0x84/0xc0
[39.002566] do_syscall_64+0x3a/0x80
[39.003011] entry_SYSCALL_64_after_hwframe+0x44/0xae
[39.003735] RIP: 0033:0x7f11c166959b
[39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b
[39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003
[39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0
[39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3
[39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001
[39.015040] </TASK>
[39.015418] ---[ end trace 0000000000000000 ]---
[43.131559] ------------[ cut here ]------------
[43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717!
[43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G W 5.17.0-rc4 #54
[43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
[43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
[43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
[43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
[43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
[43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
[43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
[43.145686] FS: 00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000
[43.146808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0
[43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[43.150559] Call Trace:
[43.150904] <TASK>
[43.151253] btrfs_finish_extent_commit+0x88/0x290 [btrfs]
[43.152127] btrfs_commit_transaction+0x74f/0xaa0 [btrfs]
[43.152932] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
[43.153786] btrfs_ioctl+0x1edc/0x2da0 [btrfs]
[43.154475] ? __check_object_size+0x150/0x170
[43.155170] ? preempt_count_add+0x49/0xa0
[43.155753] ? __x64_sys_ioctl+0x84/0xc0
[43.156437] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[43.157456] __x64_sys_ioctl+0x84/0xc0
[43.157980] do_syscall_64+0x3a/0x80
[43.158543] entry_SYSCALL_64_after_hwframe+0x44/0xae
[43.159231] RIP: 0033:0x7f7657f1e59b
[43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b
[43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003
[43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800
[43.166863] </TASK>
[43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common
[43.169552] ---[ end trace 0000000000000000 ]---
[43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs]
[43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246
[43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001
[43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff
[43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50
[43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000
[43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000
[43.180071] FS: 00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000
[43.181073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0
[43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
We first hit the WARN_ON(rc->block_group->pinned > 0) in
btrfs_relocate_block_group() and then the BUG_ON(!cache) in
unpin_extent_range(). This tells us that we are exiting relocation and
removing the block group with bytes still pinned for that block group.
This is supposed to be impossible: the last thing relocate_block_group()
does is commit the transaction to get rid of pinned extents.
Commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") introduced an optimization so that
commits from fsync don't have to wait for the previous commit to unpin
extents. This was only intended to affect fsync, but it inadvertently
made it possible for any commit to skip waiting for the previous commit
to unpin. This is because if a call to btrfs_commit_transaction() finds
that another thread is already committing the transaction, it waits for
the other thread to complete the commit and then returns. If that other
thread was in fsync, then it completes the commit without completing the
previous commit. This makes the following sequence of events possible:
Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________
btrfs_commit_transaction(N) | |
btrfs_run_delayed_refs | |
pin extents | |
... | |
state = UNBLOCKED |btrfs_sync_file |
| btrfs_start_transaction(N + 1) |relocate_block_group
| | btrfs_join_transaction(N + 1)
| btrfs_commit_transaction(N + 1) |
... | trans->state = COMMIT_START |
| | btrfs_commit_transaction(N + 1)
| | wait_for_commit(N + 1, COMPLETED)
| wait_for_commit(N, SUPER_COMMITTED)|
state = SUPER_COMMITTED | ... |
btrfs_finish_extent_commit| |
unpin_extent_range() | trans->state = COMPLETED |
| | return
| |
... | |Thread 1 isn't done, so pinned > 0
| |and we WARN
| |
| |btrfs_remove_block_group
unpin_extent_range() | |
Thread 3 removed the | |
block group, so we BUG| |
There are other sequences involving SUPER_COMMITTED transactions that
can cause a similar outcome.
We could fix this by making relocation explicitly wait for unpinning,
but there may be other cases that need it. Josef mentioned ENOSPC
flushing and the free space cache inode as other potential victims.
Rather than playing whack-a-mole, this fix is conservative and makes all
commits not in fsync wait for all previous transactions, which is what
the optimization intended.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We hit a bug with a recovering relocation on mount for one of our file
systems in production. I reproduced this locally by injecting errors
into snapshot delete with balance running at the same time. This
presented as an error while looking up an extent item
WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
RIP: 0010:lookup_inline_extent_backref+0x647/0x680
RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
Call Trace:
<TASK>
insert_inline_extent_backref+0x46/0xd0
__btrfs_inc_extent_ref.isra.0+0x5f/0x200
? btrfs_merge_delayed_refs+0x164/0x190
__btrfs_run_delayed_refs+0x561/0xfa0
? btrfs_search_slot+0x7b4/0xb30
? btrfs_update_root+0x1a9/0x2c0
btrfs_run_delayed_refs+0x73/0x1f0
? btrfs_update_root+0x1a9/0x2c0
btrfs_commit_transaction+0x50/0xa50
? btrfs_update_reloc_root+0x122/0x220
prepare_to_merge+0x29f/0x320
relocate_block_group+0x2b8/0x550
btrfs_relocate_block_group+0x1a6/0x350
btrfs_relocate_chunk+0x27/0xe0
btrfs_balance+0x777/0xe60
balance_kthread+0x35/0x50
? btrfs_balance+0xe60/0xe60
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Normally snapshot deletion and relocation are excluded from running at
the same time by the fs_info->cleaner_mutex. However if we had a
pending balance waiting to get the ->cleaner_mutex, and a snapshot
deletion was running, and then the box crashed, we would come up in a
state where we have a half deleted snapshot.
Again, in the normal case the snapshot deletion needs to complete before
relocation can start, but in this case relocation could very well start
before the snapshot deletion completes, as we simply add the root to the
dead roots list and wait for the next time the cleaner runs to clean up
the snapshot.
Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
had a pending drop_progress key. If they do then we know we were in the
middle of the drop operation and set a flag on the fs_info. Then
balance can wait until this flag is cleared to start up again.
If there are DEAD_ROOT's that don't have a drop_progress set then we're
safe to start balance right away as we'll be properly protected by the
cleaner_mutex.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
User reported there is an array-index-out-of-bounds access while
mounting the crafted image:
[350.411942 ] loop0: detected capacity change from 0 to 262144
[350.427058 ] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (1044)
[350.428564 ] BTRFS info (device loop0): disk space caching is enabled
[350.428568 ] BTRFS info (device loop0): has skinny extents
[350.429589 ]
[350.429619 ] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1
[350.429636 ] index 1048096 is out of range for type 'page *[16]'
[350.429650 ] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4
[350.429652 ] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[350.429653 ] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
[350.429772 ] Call Trace:
[350.429774 ] <TASK>
[350.429776 ] dump_stack_lvl+0x47/0x5c
[350.429780 ] ubsan_epilogue+0x5/0x50
[350.429786 ] __ubsan_handle_out_of_bounds+0x66/0x70
[350.429791 ] btrfs_get_16+0xfd/0x120 [btrfs]
[350.429832 ] check_leaf+0x754/0x1a40 [btrfs]
[350.429874 ] ? filemap_read+0x34a/0x390
[350.429878 ] ? load_balance+0x175/0xfc0
[350.429881 ] validate_extent_buffer+0x244/0x310 [btrfs]
[350.429911 ] btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs]
[350.429935 ] end_bio_extent_readpage+0x3af/0x850 [btrfs]
[350.429969 ] ? newidle_balance+0x259/0x480
[350.429972 ] end_workqueue_fn+0x29/0x40 [btrfs]
[350.429995 ] btrfs_work_helper+0x71/0x330 [btrfs]
[350.430030 ] ? __schedule+0x2fb/0xa40
[350.430033 ] process_one_work+0x1f6/0x400
[350.430035 ] ? process_one_work+0x400/0x400
[350.430036 ] worker_thread+0x2d/0x3d0
[350.430037 ] ? process_one_work+0x400/0x400
[350.430038 ] kthread+0x165/0x190
[350.430041 ] ? set_kthread_struct+0x40/0x40
[350.430043 ] ret_from_fork+0x1f/0x30
[350.430047 ] </TASK>
[350.430047 ]
[350.430077 ] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2
btrfs check reports:
corrupt leaf: root=3 block=20975616 physical=20975616 slot=1, unexpected
item end, have 4294971193 expect 3897
The first slot item offset is 4293005033 and the size is 1966160.
In check_leaf, we use btrfs_item_end() to check item boundary versus
extent_buffer data size. However, return type of btrfs_item_end() is u32.
(u32)(4293005033 + 1966160) == 3897, overflow happens and the result 3897
equals to leaf data size reasonably.
Fix it by use u64 variable to store item data end in check_leaf() to
avoid u32 overflow.
This commit does solve the invalid memory access showed by the stack
trace. However, its metadata profile is DUP and another copy of the
leaf is fine. So the image can be mounted successfully. But when umount
is called, the ASSERT btrfs_mark_buffer_dirty() will be triggered
because the only node in extent tree has 0 item and invalid owner. It's
solved by another commit
"btrfs: check extent buffer owner against the owner rootid".
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Whenever we do any extent buffer operations we call
assert_eb_page_uptodate() to complain loudly if we're operating on an
non-uptodate page. Our overnight tests caught this warning earlier this
week
WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
Call Trace:
extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1f6/0x470
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x26d/0x580
? process_one_work+0x580/0x580
worker_thread+0x55/0x3b0
? process_one_work+0x580/0x580
kthread+0xf0/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer
uptodate when we fail to write it"), however all that fix did was keep
us from finding extent buffers after a failed writeout. It didn't keep
us from continuing to use a buffer that we already had found.
In this case we're searching the commit root to cache the block group,
so we can start committing the transaction and switch the commit root
and then start writing. After the switch we can look up an extent
buffer that hasn't been written yet and start processing that block
group. Then we fail to write that block out and clear Uptodate on the
page, and then we start spewing these errors.
Normally we're protected by the tree lock to a certain degree here. If
we read a block we have that block read locked, and we block the writer
from locking the block before we submit it for the write. However this
isn't necessarily fool proof because the read could happen before we do
the submit_bio and after we locked and unlocked the extent buffer.
Also in this particular case we have path->skip_locking set, so that
won't save us here. We'll simply get a block that was valid when we
read it, but became invalid while we were using it.
What we really want is to catch the case where we've "read" a block but
it's not marked Uptodate. On read we ClearPageError(), so if we're
!Uptodate and !Error we know we didn't do the right thing for reading
the page.
Fix this by checking !Uptodate && !Error, this way we will not complain
if our buffer gets invalidated while we're using it, and we'll maintain
the spirit of the check which is to make sure we have a fully in-cache
block while we're messing with it.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When doing a full fsync, if we have prealloc extents beyond (or at) eof,
and the leaves that contain them were not modified in the current
transaction, we end up not logging them. This results in losing those
extents when we replay the log after a power failure, since the inode is
truncated to the current value of the logged i_size.
Just like for the fast fsync path, we need to always log all prealloc
extents starting at or beyond i_size. The fast fsync case was fixed in
commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay") but it missed the full fsync path. The problem
exists since the very early days, when the log tree was added by
commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize
synchronous operations").
Example reproducer:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with many file extent items, so that they span
# several leaves of metadata, even if the node/page size is 64K. Use
# direct IO and not fsync/O_SYNC because it's both faster and it avoids
# clearing the full sync flag from the inode - we want the fsync below
# to trigger the slow full sync code path.
$ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
# Now add two preallocated extents to our file without extending the
# file's size. One right at i_size, and another further beyond, leaving
# a gap between the two prealloc extents.
$ xfs_io -c "falloc -k 16M 1M" /mnt/foo
$ xfs_io -c "falloc -k 20M 1M" /mnt/foo
# Make sure everything is durably persisted and the transaction is
# committed. This makes all created extents to have a generation lower
# than the generation of the transaction used by the next write and
# fsync.
sync
# Now overwrite only the first extent, which will result in modifying
# only the first leaf of metadata for our inode. Then fsync it. This
# fsync will use the slow code path (inode full sync bit is set) because
# it's the first fsync since the inode was created/loaded.
$ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
# Extent list before power failure.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x0
3: [32768..34815]: 2172928..2174975 2048 0x800
4: [34816..40959]: hole 6144
5: [40960..43007]: 2174976..2177023 2048 0x801
<power fail>
# Mount fs again, trigger log replay.
$ mount /dev/sdc /mnt
# Extent list after power failure and log replay.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x1
# The prealloc extents at file offsets 16M and 20M are missing.
So fix this by calling btrfs_log_prealloc_extents() when we are doing a
full fsync, so that we always log all prealloc extents beyond eof.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When looping btrfs/074 with 64K page size and 4K sectorsize, there is a
low chance (1/50~1/100) to crash with the following ASSERT() triggered
in btrfs_subpage_start_writer():
ret = atomic_add_return(nbits, &subpage->writers);
ASSERT(ret == nbits); <<< This one <<<
[CAUSE]
With more debugging output on the parameters of
btrfs_subpage_start_writer(), it shows a very concerning error:
ret=29 nbits=13 start=393216 len=53248
For @nbits it's correct, but @ret which is the returned value from
atomic_add_return(), it's not only larger than nbits, but also larger
than max sectors per page value (for 64K page size and 4K sector size,
it's 16).
This indicates that some call sites are not properly decreasing the value.
And that's exactly the case, in btrfs_page_unlock_writer(), due to the
fact that we can have page locked either by lock_page() or
process_one_page(), we have to check if the subpage has any writer.
If no writers, it's locked by lock_page() and we only need to unlock it.
But unfortunately the check for the writers are completely opposite:
if (atomic_read(&subpage->writers))
/* No writers, locked by plain lock_page() */
return unlock_page(page);
We directly unlock the page if it has writers, which is the completely
opposite what we want.
Thankfully the affected call site is only limited to
extent_write_locked_range(), so it's mostly affecting compressed write.
[FIX]
Just fix the wrong check condition to fix the bug.
Fixes: e55a0de18572 ("btrfs: rework page locking in __extent_writepage()")
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"This is a hopefully last batch of fixes for defrag that got broken in
5.16, all stable material.
The remaining reported problem is excessive IO with autodefrag due to
various conditions in the defrag code not met or missing"
* tag 'for-5.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reduce extent threshold for autodefrag
btrfs: autodefrag: only scan one inode once
btrfs: defrag: don't use merged extent map for their generation check
btrfs: defrag: bring back the old file extent search behavior
btrfs: defrag: remove an ambiguous condition for rejection
btrfs: defrag: don't defrag extents which are already at max capacity
btrfs: defrag: don't try to merge regular extents with preallocated extents
btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target
btrfs: prevent copying too big compressed lzo segment
|
|
There is a big gap between inode_should_defrag() and autodefrag extent
size threshold. For inode_should_defrag() it has a flexible
@small_write value. For compressed extent is 16K, and for non-compressed
extent it's 64K.
However for autodefrag extent size threshold, it's always fixed to the
default value (256K).
This means, the following write sequence will trigger autodefrag to
defrag ranges which didn't trigger autodefrag:
pwrite 0 8k
sync
pwrite 8k 128K
sync
The latter 128K write will also be considered as a defrag target (if
other conditions are met). While only that 8K write is really
triggering autodefrag.
Such behavior can cause extra IO for autodefrag.
Close the gap, by copying the @small_write value into inode_defrag, so
that later autodefrag can use the same @small_write value which
triggered autodefrag.
With the existing transid value, this allows autodefrag really to scan
the ranges which triggered autodefrag.
Although this behavior change is mostly reducing the extent_thresh value
for autodefrag, I believe in the future we should allow users to specify
the autodefrag extent threshold through mount options, but that's an
other problem to consider in the future.
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Although we have btrfs_requeue_inode_defrag(), for autodefrag we are
still just exhausting all inode_defrag items in the tree.
This means, it doesn't make much difference to requeue an inode_defrag,
other than scan the inode from the beginning till its end.
Change the behaviour to always scan from offset 0 of an inode, and till
the end.
By this we get the following benefit:
- Straight-forward code
- No more re-queue related check
- Fewer members in inode_defrag
We still keep the same btrfs_get_fs_root() and btrfs_iget() check for
each loop, and added extra should_auto_defrag() check per-loop.
Note: the patch needs to be backported and is intentionally written
to minimize the diff size, code will be cleaned up later.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For extent maps, if they are not compressed extents and are adjacent by
logical addresses and file offsets, they can be merged into one larger
extent map.
Such merged extent map will have the higher generation of all the
original ones.
But this brings a problem for autodefrag, as it relies on accurate
extent_map::generation to determine if one extent should be defragged.
For merged extent maps, their higher generation can mark some older
extents to be defragged while the original extent map doesn't meet the
minimal generation threshold.
Thus this will cause extra IO.
So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED,
to indicate if the extent map is merged from one or more ems.
And for autodefrag, if we find a merged extent map, and its generation
meets the generation requirement, we just don't use this one, and go
back to defrag_get_extent() to read extent maps from subvolume trees.
This could cause more read IO, but should result less defrag data write,
so in the long run it should be a win for autodefrag.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For defrag, we don't really want to use btrfs_get_extent() to iterate
all extent maps of an inode.
The reasons are:
- btrfs_get_extent() can merge extent maps
And the result em has the higher generation of the two, causing defrag
to mark unnecessary part of such merged large extent map.
This in fact can result extra IO for autodefrag in v5.16+ kernels.
However this patch is not going to completely solve the problem, as
one can still using read() to trigger extent map reading, and got
them merged.
The completely solution for the extent map merging generation problem
will come as an standalone fix.
- btrfs_get_extent() caches the extent map result
Normally it's fine, but for defrag the target range may not get
another read/write for a long long time.
Such cache would only increase the memory usage.
- btrfs_get_extent() doesn't skip older extent map
Unlike the old find_new_extent() which uses btrfs_search_forward() to
skip the older subtree, thus it will pick up unnecessary extent maps.
This patch will fix the regression by introducing defrag_get_extent() to
replace the btrfs_get_extent() call.
This helper will:
- Not cache the file extent we found
It will search the file extent and manually convert it to em.
- Use btrfs_search_forward() to skip entire ranges which is modified in
the past
This should reduce the IO for autodefrag.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
From the very beginning of btrfs defrag, there is a check to reject
extents which meet both conditions:
- Physically adjacent
We may want to defrag physically adjacent extents to reduce the number
of extents or the size of subvolume tree.
- Larger than 128K
This may be there for compressed extents, but unfortunately 128K is
exactly the max capacity for compressed extents.
And the check is > 128K, thus it never rejects compressed extents.
Furthermore, the compressed extent capacity bug is fixed by previous
patch, there is no reason for that check anymore.
The original check has a very small ranges to reject (the target extent
size is > 128K, and default extent threshold is 256K), and for
compressed extent it doesn't work at all.
So it's better just to remove the rejection, and allow us to defrag
physically adjacent extents.
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
For compressed extents, defrag ioctl will always try to defrag any
compressed extents, wasting not only IO but also CPU time to
compress/decompress:
mkfs.btrfs -f $DEV
mount -o compress $DEV $MNT
xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
sync
xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $MNT/foobar
btrfs filesystem defrag $MNT/foobar
sync
echo "=== after ==="
xfs_io -c "fiemap -v" $MNT/foobar
Then it shows the 2 128K extents just get COW for no extra benefit, with
extra IO/CPU spent:
=== before ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x8
1: [256..511]: 26632..26887 256 0x9
=== after ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26640..26895 256 0x8
1: [256..511]: 26648..26903 256 0x9
This affects not only v5.16 (after the defrag rework), but also v5.15
(before the defrag rework).
[CAUSE]
From the very beginning, btrfs defrag never checks if one extent is
already at its max capacity (128K for compressed extents, 128M
otherwise).
And the default extent size threshold is 256K, which is already beyond
the compressed extent max size.
This means, by default btrfs defrag ioctl will mark all compressed
extent which is not adjacent to a hole/preallocated range for defrag.
[FIX]
Introduce a helper to grab the maximum extent size, and then in
defrag_collect_targets() and defrag_check_next_extent(), reject extents
which are already at their max capacity.
Reported-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
With older kernels (before v5.16), btrfs will defrag preallocated extents.
While with newer kernels (v5.16 and newer) btrfs will not defrag
preallocated extents, but it will defrag the extent just before the
preallocated extent, even it's just a single sector.
This can be exposed by the following small script:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
xfs_io -c "fiemap -v" $mnt/file
btrfs fi defrag $mnt/file
sync
xfs_io -c "fiemap -v" $mnt/file
The output looks like this on older kernels:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..39]: 26664..26703 40 0x1
Which defrags the single sector along with the preallocated extent, and
replace them with an regular extent into a new location (caused by data
COW).
This wastes most of the data IO just for the preallocated range.
On the other hand, v5.16 is slightly better:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26664..26671 8 0x0
1: [8..39]: 26632..26663 32 0x801
The preallocated range is not defragged, but the sector before it still
gets defragged, which has no need for it.
[CAUSE]
One of the function reused by the old and new behavior is
defrag_check_next_extent(), it will determine if we should defrag
current extent by checking the next one.
It only checks if the next extent is a hole or inlined, but it doesn't
check if it's preallocated.
On the other hand, out of the function, both old and new kernel will
reject preallocated extents.
Such inconsistent behavior causes above behavior.
[FIX]
- Also check if next extent is preallocated
If so, don't defrag current extent.
- Add comments for each branch why we reject the extent
This will reduce the IO caused by defrag ioctl and autodefrag.
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
a target
In the rework of btrfs_defrag_file(), we always call
defrag_one_cluster() and increase the offset by cluster size, which is
only 256K.
But there are cases where we have a large extent (e.g. 128M) which
doesn't need to be defragged at all.
Before the refactor, we can directly skip the range, but now we have to
scan that extent map again and again until the cluster moves after the
non-target extent.
Fix the problem by allow defrag_one_cluster() to increase
btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if
the last extent of the cluster is not a target.
The test script looks like this:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
# As btrfs ioctl uses 32M as extent_threshold
xfs_io -f -c "pwrite 0 64M" $mnt/file1
sync
# Some fragemented range to defrag
xfs_io -s -c "pwrite 65548k 4k" \
-c "pwrite 65544k 4k" \
-c "pwrite 65540k 4k" \
-c "pwrite 65536k 4k" \
$mnt/file1
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $mnt/file1
echo "=== after ==="
btrfs fi defrag $mnt/file1
sync
xfs_io -c "fiemap -v" $mnt/file1
umount $mnt
With extra ftrace put into defrag_one_cluster(), before the patch it
would result tons of loops:
(As defrag_one_cluster() is inlined, the function name is its caller)
btrfs-126062 [005] ..... 4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144
btrfs-126062 [005] ..... 4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144
btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144
btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144
btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=1048576 len=262144
...
btrfs-126062 [005] ..... 4682.816043: btrfs_defrag_file: r/i=5/257 start=67108864 len=262144
But with this patch there will be just one loop, then directly to the
end of the extent:
btrfs-130471 [014] ..... 5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144
btrfs-130471 [014] ..... 5434.029559: defrag_one_cluster: r/i=5/257 start=67108864 len=16384
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Compressed length can be corrupted to be a lot larger than memory
we have allocated for buffer.
This will cause memcpy in copy_compressed_segment to write outside
of allocated memory.
This mostly results in stuck read syscall but sometimes when using
btrfs send can get #GP
kernel: general protection fault, probably for non-canonical address 0x841551d5c1000: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 17 PID: 264 Comm: kworker/u256:7 Tainted: P OE 5.17.0-rc2-1 #12
kernel: Workqueue: btrfs-endio btrfs_work_helper [btrfs]
kernel: RIP: 0010:lzo_decompress_bio (./include/linux/fortify-string.h:225 fs/btrfs/lzo.c:322 fs/btrfs/lzo.c:394) btrfs
Code starting with the faulting instruction
===========================================
0:* 48 8b 06 mov (%rsi),%rax <-- trapping instruction
3: 48 8d 79 08 lea 0x8(%rcx),%rdi
7: 48 83 e7 f8 and $0xfffffffffffffff8,%rdi
b: 48 89 01 mov %rax,(%rcx)
e: 44 89 f0 mov %r14d,%eax
11: 48 8b 54 06 f8 mov -0x8(%rsi,%rax,1),%rdx
kernel: RSP: 0018:ffffb110812efd50 EFLAGS: 00010212
kernel: RAX: 0000000000001000 RBX: 000000009ca264c8 RCX: ffff98996e6d8ff8
kernel: RDX: 0000000000000064 RSI: 000841551d5c1000 RDI: ffffffff9500435d
kernel: RBP: ffff989a3be856c0 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000001000 R12: ffff98996e6d8000
kernel: R13: 0000000000000008 R14: 0000000000001000 R15: 000841551d5c1000
kernel: FS: 0000000000000000(0000) GS:ffff98a09d640000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00001e9f984d9ea8 CR3: 000000014971a000 CR4: 00000000003506e0
kernel: Call Trace:
kernel: <TASK>
kernel: end_compressed_bio_read (fs/btrfs/compression.c:104 fs/btrfs/compression.c:1363 fs/btrfs/compression.c:323) btrfs
kernel: end_workqueue_fn (fs/btrfs/disk-io.c:1923) btrfs
kernel: btrfs_work_helper (fs/btrfs/async-thread.c:326) btrfs
kernel: process_one_work (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:212 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2312)
kernel: worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2455)
kernel: ? process_one_work (kernel/workqueue.c:2397)
kernel: kthread (kernel/kthread.c:377)
kernel: ? kthread_complete_and_exit (kernel/kthread.c:332)
kernel: ret_from_fork (arch/x86/entry/entry_64.S:301)
kernel: </TASK>
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- yield CPU more often when defragmenting a large file
- skip defragmenting extents already under writeback
- improve error message when send fails to write file data
- get rid of warning when mounted with 'flushoncommit'
* tag 'for-5.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: in case of IO error log it
btrfs: get rid of warning on transaction commit when using flushoncommit
btrfs: defrag: don't try to defrag extents which are under writeback
btrfs: don't hold CPU for too long when defragging a file
|
|
Currently if we get IO error while doing send then we abort without
logging information about which file caused issue. So log it to help
with debugging.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When using the flushoncommit mount option, during almost every transaction
commit we trigger a warning from __writeback_inodes_sb_nr():
$ cat fs/fs-writeback.c:
(...)
static void __writeback_inodes_sb_nr(struct super_block *sb, ...
{
(...)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
(...)
}
(...)
The trace produced in dmesg looks like the following:
[947.473890] WARNING: CPU: 5 PID: 930 at fs/fs-writeback.c:2610 __writeback_inodes_sb_nr+0x7e/0xb3
[947.481623] Modules linked in: nfsd nls_cp437 cifs asn1_decoder cifs_arc4 fscache cifs_md4 ipmi_ssif
[947.489571] CPU: 5 PID: 930 Comm: btrfs-transacti Not tainted 95.16.3-srb-asrock-00001-g36437ad63879 #186
[947.497969] RIP: 0010:__writeback_inodes_sb_nr+0x7e/0xb3
[947.502097] Code: 24 10 4c 89 44 24 18 c6 (...)
[947.519760] RSP: 0018:ffffc90000777e10 EFLAGS: 00010246
[947.523818] RAX: 0000000000000000 RBX: 0000000000963300 RCX: 0000000000000000
[947.529765] RDX: 0000000000000000 RSI: 000000000000fa51 RDI: ffffc90000777e50
[947.535740] RBP: ffff888101628a90 R08: ffff888100955800 R09: ffff888100956000
[947.541701] R10: 0000000000000002 R11: 0000000000000001 R12: ffff888100963488
[947.547645] R13: ffff888100963000 R14: ffff888112fb7200 R15: ffff888100963460
[947.553621] FS: 0000000000000000(0000) GS:ffff88841fd40000(0000) knlGS:0000000000000000
[947.560537] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[947.565122] CR2: 0000000008be50c4 CR3: 000000000220c000 CR4: 00000000001006e0
[947.571072] Call Trace:
[947.572354] <TASK>
[947.573266] btrfs_commit_transaction+0x1f1/0x998
[947.576785] ? start_transaction+0x3ab/0x44e
[947.579867] ? schedule_timeout+0x8a/0xdd
[947.582716] transaction_kthread+0xe9/0x156
[947.585721] ? btrfs_cleanup_transaction.isra.0+0x407/0x407
[947.590104] kthread+0x131/0x139
[947.592168] ? set_kthread_struct+0x32/0x32
[947.595174] ret_from_fork+0x22/0x30
[947.597561] </TASK>
[947.598553] ---[ end trace 644721052755541c ]---
This is because we started using writeback_inodes_sb() to flush delalloc
when committing a transaction (when using -o flushoncommit), in order to
avoid deadlocks with filesystem freeze operations. This change was made
by commit ce8ea7cc6eb313 ("btrfs: don't call btrfs_start_delalloc_roots
in flushoncommit"). After that change we started producing that warning,
and every now and then a user reports this since the warning happens too
often, it spams dmesg/syslog, and a user is unsure if this reflects any
problem that might compromise the filesystem's reliability.
We can not just lock the sb->s_umount semaphore before calling
writeback_inodes_sb(), because that would at least deadlock with
filesystem freezing, since at fs/super.c:freeze_super() sync_filesystem()
is called while we are holding that semaphore in write mode, and that can
trigger a transaction commit, resulting in a deadlock. It would also
trigger the same type of deadlock in the unmount path. Possibly, it could
also introduce some other locking dependencies that lockdep would report.
To fix this call try_to_writeback_inodes_sb() instead of
writeback_inodes_sb(), because that will try to read lock sb->s_umount
and then will only call writeback_inodes_sb() if it was able to lock it.
This is fine because the cases where it can't read lock sb->s_umount
are during a filesystem unmount or during a filesystem freeze - in those
cases sb->s_umount is write locked and sync_filesystem() is called, which
calls writeback_inodes_sb(). In other words, in all cases where we can't
take a read lock on sb->s_umount, writeback is already being triggered
elsewhere.
An alternative would be to call btrfs_start_delalloc_roots() with a
number of pages different from LONG_MAX, for example matching the number
of delalloc bytes we currently have, in which case we would end up
starting all delalloc with filemap_fdatawrite_wbc() and not with an
async flush via filemap_flush() - that is only possible after the rather
recent commit e076ab2a2ca70a ("btrfs: shrink delalloc pages instead of
full inodes"). However that creates a whole new can of worms due to new
lock dependencies, which lockdep complains, like for example:
[ 8948.247280] ======================================================
[ 8948.247823] WARNING: possible circular locking dependency detected
[ 8948.248353] 5.17.0-rc1-btrfs-next-111 #1 Not tainted
[ 8948.248786] ------------------------------------------------------
[ 8948.249320] kworker/u16:18/933570 is trying to acquire lock:
[ 8948.249812] ffff9b3de1591690 (sb_internal#2){.+.+}-{0:0}, at: find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.250638]
but task is already holding lock:
[ 8948.251140] ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.252018]
which lock already depends on the new lock.
[ 8948.252710]
the existing dependency chain (in reverse order) is:
[ 8948.253343]
-> #2 (&root->delalloc_mutex){+.+.}-{3:3}:
[ 8948.253950] __mutex_lock+0x90/0x900
[ 8948.254354] start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.254859] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.255408] btrfs_commit_transaction+0x32f/0xc00 [btrfs]
[ 8948.255942] btrfs_mksubvol+0x380/0x570 [btrfs]
[ 8948.256406] btrfs_mksnapshot+0x81/0xb0 [btrfs]
[ 8948.256870] __btrfs_ioctl_snap_create+0x17f/0x190 [btrfs]
[ 8948.257413] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[ 8948.257961] btrfs_ioctl+0x1196/0x3630 [btrfs]
[ 8948.258418] __x64_sys_ioctl+0x83/0xb0
[ 8948.258793] do_syscall_64+0x3b/0xc0
[ 8948.259146] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 8948.259709]
-> #1 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}:
[ 8948.260330] __mutex_lock+0x90/0x900
[ 8948.260692] btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
[ 8948.261234] btrfs_commit_transaction+0x32f/0xc00 [btrfs]
[ 8948.261766] btrfs_set_free_space_cache_v1_active+0x38/0x60 [btrfs]
[ 8948.262379] btrfs_start_pre_rw_mount+0x119/0x180 [btrfs]
[ 8948.262909] open_ctree+0x1511/0x171e [btrfs]
[ 8948.263359] btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 8948.263863] legacy_get_tree+0x30/0x50
[ 8948.264242] vfs_get_tree+0x28/0xc0
[ 8948.264594] vfs_kern_mount.part.0+0x71/0xb0
[ 8948.265017] btrfs_mount+0x11d/0x3a0 [btrfs]
[ 8948.265462] legacy_get_tree+0x30/0x50
[ 8948.265851] vfs_get_tree+0x28/0xc0
[ 8948.266203] path_mount+0x2d4/0xbe0
[ 8948.266554] __x64_sys_mount+0x103/0x140
[ 8948.266940] do_syscall_64+0x3b/0xc0
[ 8948.267300] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 8948.267790]
-> #0 (sb_internal#2){.+.+}-{0:0}:
[ 8948.268322] __lock_acquire+0x12e8/0x2260
[ 8948.268733] lock_acquire+0xd7/0x310
[ 8948.269092] start_transaction+0x44c/0x6e0 [btrfs]
[ 8948.269591] find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.270087] btrfs_reserve_extent+0x14b/0x280 [btrfs]
[ 8948.270588] cow_file_range+0x17e/0x490 [btrfs]
[ 8948.271051] btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
[ 8948.271586] writepage_delalloc+0xb5/0x170 [btrfs]
[ 8948.272071] __extent_writepage+0x156/0x3c0 [btrfs]
[ 8948.272579] extent_write_cache_pages+0x263/0x460 [btrfs]
[ 8948.273113] extent_writepages+0x76/0x130 [btrfs]
[ 8948.273573] do_writepages+0xd2/0x1c0
[ 8948.273942] filemap_fdatawrite_wbc+0x68/0x90
[ 8948.274371] start_delalloc_inodes+0x17f/0x400 [btrfs]
[ 8948.274876] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.275417] flush_space+0x1f2/0x630 [btrfs]
[ 8948.275863] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[ 8948.276438] process_one_work+0x252/0x5a0
[ 8948.276829] worker_thread+0x55/0x3b0
[ 8948.277189] kthread+0xf2/0x120
[ 8948.277506] ret_from_fork+0x22/0x30
[ 8948.277868]
other info that might help us debug this:
[ 8948.278548] Chain exists of:
sb_internal#2 --> &fs_info->delalloc_root_mutex --> &root->delalloc_mutex
[ 8948.279601] Possible unsafe locking scenario:
[ 8948.280102] CPU0 CPU1
[ 8948.280508] ---- ----
[ 8948.280915] lock(&root->delalloc_mutex);
[ 8948.281271] lock(&fs_info->delalloc_root_mutex);
[ 8948.281915] lock(&root->delalloc_mutex);
[ 8948.282487] lock(sb_internal#2);
[ 8948.282800]
*** DEADLOCK ***
[ 8948.283333] 4 locks held by kworker/u16:18/933570:
[ 8948.283750] #0: ffff9b3dc00a9d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
[ 8948.284609] #1: ffffa90349dafe70 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
[ 8948.285637] #2: ffff9b3e14db5040 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}, at: btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
[ 8948.286674] #3: ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.287596]
stack backtrace:
[ 8948.287975] CPU: 3 PID: 933570 Comm: kworker/u16:18 Not tainted 5.17.0-rc1-btrfs-next-111 #1
[ 8948.288677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 8948.289649] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
[ 8948.290298] Call Trace:
[ 8948.290517] <TASK>
[ 8948.290700] dump_stack_lvl+0x59/0x73
[ 8948.291026] check_noncircular+0xf3/0x110
[ 8948.291375] ? start_transaction+0x228/0x6e0 [btrfs]
[ 8948.291826] __lock_acquire+0x12e8/0x2260
[ 8948.292241] lock_acquire+0xd7/0x310
[ 8948.292714] ? find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.293241] ? lock_is_held_type+0xea/0x140
[ 8948.293601] start_transaction+0x44c/0x6e0 [btrfs]
[ 8948.294055] ? find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.294518] find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.294957] ? _raw_spin_unlock+0x29/0x40
[ 8948.295312] ? btrfs_get_alloc_profile+0x124/0x290 [btrfs]
[ 8948.295813] btrfs_reserve_extent+0x14b/0x280 [btrfs]
[ 8948.296270] cow_file_range+0x17e/0x490 [btrfs]
[ 8948.296691] btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
[ 8948.297175] ? find_lock_delalloc_range+0x247/0x270 [btrfs]
[ 8948.297678] writepage_delalloc+0xb5/0x170 [btrfs]
[ 8948.298123] __extent_writepage+0x156/0x3c0 [btrfs]
[ 8948.298570] extent_write_cache_pages+0x263/0x460 [btrfs]
[ 8948.299061] extent_writepages+0x76/0x130 [btrfs]
[ 8948.299495] do_writepages+0xd2/0x1c0
[ 8948.299817] ? sched_clock_cpu+0xd/0x110
[ 8948.300160] ? lock_release+0x155/0x4a0
[ 8948.300494] filemap_fdatawrite_wbc+0x68/0x90
[ 8948.300874] ? do_raw_spin_unlock+0x4b/0xa0
[ 8948.301243] start_delalloc_inodes+0x17f/0x400 [btrfs]
[ 8948.301706] ? lock_release+0x155/0x4a0
[ 8948.302055] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.302564] flush_space+0x1f2/0x630 [btrfs]
[ 8948.302970] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[ 8948.303510] process_one_work+0x252/0x5a0
[ 8948.303860] ? process_one_work+0x5a0/0x5a0
[ 8948.304221] worker_thread+0x55/0x3b0
[ 8948.304543] ? process_one_work+0x5a0/0x5a0
[ 8948.304904] kthread+0xf2/0x120
[ 8948.305184] ? kthread_complete_and_exit+0x20/0x20
[ 8948.305598] ret_from_fork+0x22/0x30
[ 8948.305921] </TASK>
It all comes from the fact that btrfs_start_delalloc_roots() takes the
delalloc_root_mutex, in the transaction commit path we are holding a
read lock on one of the superblock's freeze semaphores (via
sb_start_intwrite()), the async reclaim task can also do a call to
btrfs_start_delalloc_roots(), which ends up triggering writeback with
calls to filemap_fdatawrite_wbc(), resulting in extent allocation which
in turn can call btrfs_start_transaction(), which will result in taking
the freeze semaphore via sb_start_intwrite(), forming a nasty dependency
on all those locks which can be taken in different orders by different
code paths.
So just adopt the simple approach of calling try_to_writeback_inodes_sb()
at btrfs_start_delalloc_flush().
Link: https://lore.kernel.org/linux-btrfs/20220130005258.GA7465@cuci.nl/
Link: https://lore.kernel.org/linux-btrfs/43acc426-d683-d1b6-729d-c6bc4a2fff4d@gmail.com/
Link: https://lore.kernel.org/linux-btrfs/6833930a-08d7-6fbc-0141-eb9cdfd6bb4d@gmail.com/
Link: https://lore.kernel.org/linux-btrfs/20190322041731.GF16651@hungrycats.org/
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ add more link reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Once we start writeback (have called btrfs_run_delalloc_range()), we
allocate an extent, create an extent map point to that extent, with a
generation of (u64)-1, created the ordered extent and then clear the
DELALLOC bit from the range in the inode's io tree.
Such extent map can pass the first call of defrag_collect_targets(), as
its generation is (u64)-1, meets any possible minimal generation check.
And the range will not have DELALLOC bit, also passing the DELALLOC bit
check.
It will only be re-checked in the second call of
defrag_collect_targets(), which will wait for writeback.
But at that stage we have already spent our time waiting for some IO we
may or may not want to defrag.
Let's reject such extents early so we won't waste our time.
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is a user report about "btrfs filesystem defrag" causing 120s
timeout problem.
For btrfs_defrag_file() it will iterate all file extents if called from
defrag ioctl, thus it can take a long time.
There is no reason not to release the CPU during such a long operation.
Add cond_resched() after defragged one cluster.
CC: stable@vger.kernel.org # 5.16
Link: https://lore.kernel.org/linux-btrfs/10e51417-2203-f0a4-2021-86c8511cc367@gmx.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes and error handling improvements:
- fix deadlock between quota disable and qgroup rescan worker
- fix use-after-free after failure to create a snapshot
- skip warning on unmount after log cleanup failure
- don't start transaction for scrub if the fs is mounted read-only
- tree checker verifies item sizes"
* tag 'for-5.17-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: skip reserved bytes warning on unmount after log cleanup failure
btrfs: fix use of uninitialized variable at rm device ioctl
btrfs: fix use-after-free after failure to create a snapshot
btrfs: tree-checker: check item_size for dev_item
btrfs: tree-checker: check item_size for inode_item
btrfs: fix deadlock between quota disable and qgroup rescan worker
btrfs: don't start transaction for scrub if the fs is mounted read-only
|
|
After the recent changes made by commit c2e39305299f01 ("btrfs: clear
extent buffer uptodate when we fail to write it") and its followup fix,
commit 651740a5024117 ("btrfs: check WRITE_ERR when trying to read an
extent buffer"), we can now end up not cleaning up space reservations of
log tree extent buffers after a transaction abort happens, as well as not
cleaning up still dirty extent buffers.
This happens because if writeback for a log tree extent buffer failed,
then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
when trying to free the log tree with free_log_tree(), which iterates
over the tree, we can end up getting an -EIO error when trying to read
a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
immediately as we can not iterate over the entire tree.
In that case we never update the reserved space for an extent buffer in
the respective block group and space_info object.
When this happens we get the following traces when unmounting the fs:
[174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
[174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
[174957.399379] ------------[ cut here ]------------
[174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.407523] Modules linked in: btrfs overlay dm_zero (...)
[174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.429717] Code: 21 48 8b bd (...)
[174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
[174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
[174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
[174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
[174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
[174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
[174957.439317] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.440402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.443948] Call Trace:
[174957.444264] <TASK>
[174957.444538] btrfs_free_block_groups+0x255/0x3c0 [btrfs]
[174957.445238] close_ctree+0x301/0x357 [btrfs]
[174957.445803] ? call_rcu+0x16c/0x290
[174957.446250] generic_shutdown_super+0x74/0x120
[174957.446832] kill_anon_super+0x14/0x30
[174957.447305] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.447890] deactivate_locked_super+0x31/0xa0
[174957.448440] cleanup_mnt+0x147/0x1c0
[174957.448888] task_work_run+0x5c/0xa0
[174957.449336] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.449934] syscall_exit_to_user_mode+0x16/0x40
[174957.450512] do_syscall_64+0x48/0xc0
[174957.450980] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.451605] RIP: 0033:0x7f328fdc4a97
[174957.452059] Code: 03 0c 00 f7 (...)
[174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.460193] </TASK>
[174957.460534] irq event stamp: 0
[174957.461003] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.463147] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.466323] ---[ end trace bc7ee0c490bce3af ]---
[174957.467282] ------------[ cut here ]------------
[174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.470066] Modules linked in: btrfs overlay dm_zero (...)
[174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.488050] Code: 00 00 00 ad de (...)
[174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
[174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
[174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
[174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
[174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
[174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
[174957.499438] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.500990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.506167] Call Trace:
[174957.506654] <TASK>
[174957.507047] close_ctree+0x301/0x357 [btrfs]
[174957.507867] ? call_rcu+0x16c/0x290
[174957.508567] generic_shutdown_super+0x74/0x120
[174957.509447] kill_anon_super+0x14/0x30
[174957.510194] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.511123] deactivate_locked_super+0x31/0xa0
[174957.511976] cleanup_mnt+0x147/0x1c0
[174957.512610] task_work_run+0x5c/0xa0
[174957.513309] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.514231] syscall_exit_to_user_mode+0x16/0x40
[174957.515069] do_syscall_64+0x48/0xc0
[174957.515718] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.516688] RIP: 0033:0x7f328fdc4a97
[174957.517413] Code: 03 0c 00 f7 d8 (...)
[174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.529404] </TASK>
[174957.529843] irq event stamp: 0
[174957.530256] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.532075] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
[174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
[174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
[174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
[174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
[174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
[174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
[174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
This also means that in case we have log tree extent buffers that are
still dirty, we can end up not cleaning them up in case we find an
extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
we have no way for iterating over the rest of the tree.
This issue is very often triggered with test cases generic/475 and
generic/648 from fstests.
The issue could almost be fixed by iterating over the io tree attached to
each log root which keeps tracks of the range of allocated extent buffers,
log_root->dirty_log_pages, however that does not work and has some
inconveniences:
1) After we sync the log, we clear the range of the extent buffers from
the io tree, so we can't find them after writeback. We could keep the
ranges in the io tree, with a separate bit to signal they represent
extent buffers already written, but that means we need to hold into
more memory until the transaction commits.
How much more memory is used depends a lot on whether we are able to
allocate contiguous extent buffers on disk (and how often) for a log
tree - if we are able to, then a single extent state record can
represent multiple extent buffers, otherwise we need multiple extent
state record structures to track each extent buffer.
In fact, my earlier approach did that:
https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
However that can cause a very significant negative impact on
performance, not only due to the extra memory usage but also because
we get a larger and deeper dirty_log_pages io tree.
We got a report that, on beefy machines at least, we can get such
performance drop with fsmark for example:
https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
2) We would be doing it only to deal with an unexpected and exceptional
case, which is basically failure to read an extent buffer from disk
due to IO failures. On a healthy system we don't expect transaction
aborts to happen after all;
3) Instead of relying on iterating the log tree or tracking the ranges
of extent buffers in the dirty_log_pages io tree, using the radix
tree that tracks extent buffers (fs_info->buffer_radix) to find all
log tree extent buffers is not reliable either, because after writeback
of an extent buffer it can be evicted from memory by the release page
callback of the btree inode (btree_releasepage()).
Since there's no way to be able to properly cleanup a log tree without
being able to read its extent buffers from disk and without using more
memory to track the logical ranges of the allocated extent buffers do
the following:
1) When we fail to cleanup a log tree, setup a flag that indicates that
failure;
2) Trigger writeback of all log tree extent buffers that are still dirty,
and wait for the writeback to complete. This is just to cleanup their
state, page states, page leaks, etc;
3) When unmounting the fs, ignore if the number of bytes reserved in a
block group and in a space_info is not 0 if, and only if, we failed to
cleanup a log tree. Also ignore only for metadata block groups and the
metadata space_info object.
This is far from a perfect solution, but it serves to silence test
failures such as those from generic/475 and generic/648. However having
a non-zero value for the reserved bytes counters on unmount after a
transaction abort, is not such a terrible thing and it's completely
harmless, it does not affect the filesystem integrity in any way.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Clang static analysis reports this problem
ioctl.c:3333:8: warning: 3rd function call argument is an
uninitialized value
ret = exclop_start_or_cancel_reloc(fs_info,
cancel is only set in one branch of an if-check and is always used. So
initialize to false.
Fixes: 1a15eb724aae ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and
then attach it to the transaction's list of pending snapshots. After that
we call btrfs_commit_transaction(), and if that returns an error we jump
to 'fail' label, where we kfree() the pending snapshot structure. This can
result in a later use-after-free of the pending snapshot:
1) We allocated the pending snapshot and added it to the transaction's
list of pending snapshots;
2) We call btrfs_commit_transaction(), and it fails either at the first
call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups().
In both cases, we don't abort the transaction and we release our
transaction handle. We jump to the 'fail' label and free the pending
snapshot structure. We return with the pending snapshot still in the
transaction's list;
3) Another task commits the transaction. This time there's no error at
all, and then during the transaction commit it accesses a pointer
to the pending snapshot structure that the snapshot creation task
has already freed, resulting in a user-after-free.
This issue could actually be detected by smatch, which produced the
following warning:
fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list
So fix this by not having the snapshot creation ioctl directly add the
pending snapshot to the transaction's list. Instead add the pending
snapshot to the transaction handle, and then at btrfs_commit_transaction()
we add the snapshot to the list only when we can guarantee that any error
returned after that point will result in a transaction abort, in which
case the ioctl code can safely free the pending snapshot and no one can
access it anymore.
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Check item size before accessing the device item to avoid out of bound
access, similar to inode_item check.
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
while mounting the crafted image, out-of-bounds access happens:
[350.429619] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1
[350.429636] index 1048096 is out of range for type 'page *[16]'
[350.429650] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4 #1
[350.429652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[350.429653] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
[350.429772] Call Trace:
[350.429774] <TASK>
[350.429776] dump_stack_lvl+0x47/0x5c
[350.429780] ubsan_epilogue+0x5/0x50
[350.429786] __ubsan_handle_out_of_bounds+0x66/0x70
[350.429791] btrfs_get_16+0xfd/0x120 [btrfs]
[350.429832] check_leaf+0x754/0x1a40 [btrfs]
[350.429874] ? filemap_read+0x34a/0x390
[350.429878] ? load_balance+0x175/0xfc0
[350.429881] validate_extent_buffer+0x244/0x310 [btrfs]
[350.429911] btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs]
[350.429935] end_bio_extent_readpage+0x3af/0x850 [btrfs]
[350.429969] ? newidle_balance+0x259/0x480
[350.429972] end_workqueue_fn+0x29/0x40 [btrfs]
[350.429995] btrfs_work_helper+0x71/0x330 [btrfs]
[350.430030] ? __schedule+0x2fb/0xa40
[350.430033] process_one_work+0x1f6/0x400
[350.430035] ? process_one_work+0x400/0x400
[350.430036] worker_thread+0x2d/0x3d0
[350.430037] ? process_one_work+0x400/0x400
[350.430038] kthread+0x165/0x190
[350.430041] ? set_kthread_struct+0x40/0x40
[350.430043] ret_from_fork+0x1f/0x30
[350.430047] </TASK>
[350.430077] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2
check_leaf() is checking the leaf:
corrupt leaf: root=4 block=29396992 slot=1, bad key order, prev (16140901064495857664 1 0) current (1 204 12582912)
leaf 29396992 items 6 free space 3565 generation 6 owner DEV_TREE
leaf 29396992 flags 0x1(WRITTEN) backref revision 1
fs uuid a62e00e8-e94e-4200-8217-12444de93c2e
chunk uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8
item 0 key (16140901064495857664 INODE_ITEM 0) itemoff 3955 itemsize 40
generation 0 transid 0 size 0 nbytes 17592186044416
block group 0 mode 52667 links 33 uid 0 gid 2104132511 rdev 94223634821136
sequence 100305 flags 0x2409000(none)
atime 0.0 (1970-01-01 08:00:00)
ctime 2973280098083405823.4294967295 (-269783007-01-01 21:37:03)
mtime 18446744071572723616.4026825121 (1902-04-16 12:40:00)
otime 9249929404488876031.4294967295 (622322949-04-16 04:25:58)
item 1 key (1 DEV_EXTENT 12582912) itemoff 3907 itemsize 48
dev extent chunk_tree 3
chunk_objectid 256 chunk_offset 12582912 length 8388608
chunk_tree_uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8
The corrupted leaf of device tree has an inode item. The leaf passed
checksum and others checks in validate_extent_buffer until check_leaf_item().
Because of the key type BTRFS_INODE_ITEM, check_inode_item() is called even we
are in the device tree. Since the
item offset + sizeof(struct btrfs_inode_item) > eb->len, out-of-bounds access
is triggered.
The item end vs leaf boundary check has been done before
check_leaf_item(), so fix it by checking item size in check_inode_item()
before access of the inode item in extent buffer.
Other check functions except check_dev_item() in check_leaf_item()
have their item size checks.
The commit for check_dev_item() is followed.
No regression observed during running fstests.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
CC: stable@vger.kernel.org # 5.10+
CC: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Quota disable ioctl starts a transaction before waiting for the qgroup
rescan worker completes. However, this wait can be infinite and results
in deadlock because of circular dependency among the quota disable
ioctl, the qgroup rescan worker and the other task with transaction such
as block group relocation task.
The deadlock happens with the steps following:
1) Task A calls ioctl to disable quota. It starts a transaction and
waits for qgroup rescan worker completes.
2) Task B such as block group relocation task starts a transaction and
joins to the transaction that task A started. Then task B commits to
the transaction. In this commit, task B waits for a commit by task A.
3) Task C as the qgroup rescan worker starts its job and starts a
transaction. In this transaction start, task C waits for completion
of the transaction that task A started and task B committed.
This deadlock was found with fstests test case btrfs/115 and a zoned
null_blk device. The test case enables and disables quota, and the
block group reclaim was triggered during the quota disable by chance.
The deadlock was also observed by running quota enable and disable in
parallel with 'btrfs balance' command on regular null_blk devices.
An example report of the deadlock:
[372.469894] INFO: task kworker/u16:6:103 blocked for more than 122 seconds.
[372.479944] Not tainted 5.16.0-rc8 #7
[372.485067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[372.493898] task:kworker/u16:6 state:D stack: 0 pid: 103 ppid: 2 flags:0x00004000
[372.503285] Workqueue: btrfs-qgroup-rescan btrfs_work_helper [btrfs]
[372.510782] Call Trace:
[372.514092] <TASK>
[372.521684] __schedule+0xb56/0x4850
[372.530104] ? io_schedule_timeout+0x190/0x190
[372.538842] ? lockdep_hardirqs_on+0x7e/0x100
[372.547092] ? _raw_spin_unlock_irqrestore+0x3e/0x60
[372.555591] schedule+0xe0/0x270
[372.561894] btrfs_commit_transaction+0x18bb/0x2610 [btrfs]
[372.570506] ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
[372.578875] ? free_unref_page+0x3f2/0x650
[372.585484] ? finish_wait+0x270/0x270
[372.591594] ? release_extent_buffer+0x224/0x420 [btrfs]
[372.599264] btrfs_qgroup_rescan_worker+0xc13/0x10c0 [btrfs]
[372.607157] ? lock_release+0x3a9/0x6d0
[372.613054] ? btrfs_qgroup_account_extent+0xda0/0xda0 [btrfs]
[372.620960] ? do_raw_spin_lock+0x11e/0x250
[372.627137] ? rwlock_bug.part.0+0x90/0x90
[372.633215] ? lock_is_held_type+0xe4/0x140
[372.639404] btrfs_work_helper+0x1ae/0xa90 [btrfs]
[372.646268] process_one_work+0x7e9/0x1320
[372.652321] ? lock_release+0x6d0/0x6d0
[372.658081] ? pwq_dec_nr_in_flight+0x230/0x230
[372.664513] ? rwlock_bug.part.0+0x90/0x90
[372.670529] worker_thread+0x59e/0xf90
[372.676172] ? process_one_work+0x1320/0x1320
[372.682440] kthread+0x3b9/0x490
[372.687550] ? _raw_spin_unlock_irq+0x24/0x50
[372.693811] ? set_kthread_struct+0x100/0x100
[372.700052] ret_from_fork+0x22/0x30
[372.705517] </TASK>
[372.709747] INFO: task btrfs-transacti:2347 blocked for more than 123 seconds.
[372.729827] Not tainted 5.16.0-rc8 #7
[372.745907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[372.767106] task:btrfs-transacti state:D stack: 0 pid: 2347 ppid: 2 flags:0x00004000
[372.787776] Call Trace:
[372.801652] <TASK>
[372.812961] __schedule+0xb56/0x4850
[372.830011] ? io_schedule_timeout+0x190/0x190
[372.852547] ? lockdep_hardirqs_on+0x7e/0x100
[372.871761] ? _raw_spin_unlock_irqrestore+0x3e/0x60
[372.886792] schedule+0xe0/0x270
[372.901685] wait_current_trans+0x22c/0x310 [btrfs]
[372.919743] ? btrfs_put_transaction+0x3d0/0x3d0 [btrfs]
[372.938923] ? finish_wait+0x270/0x270
[372.959085] ? join_transaction+0xc75/0xe30 [btrfs]
[372.977706] start_transaction+0x938/0x10a0 [btrfs]
[372.997168] transaction_kthread+0x19d/0x3c0 [btrfs]
[373.013021] ? btrfs_cleanup_transaction.isra.0+0xfc0/0xfc0 [btrfs]
[373.031678] kthread+0x3b9/0x490
[373.047420] ? _raw_spin_unlock_irq+0x24/0x50
[373.064645] ? set_kthread_struct+0x100/0x100
[373.078571] ret_from_fork+0x22/0x30
[373.091197] </TASK>
[373.105611] INFO: task btrfs:3145 blocked for more than 123 seconds.
[373.114147] Not tainted 5.16.0-rc8 #7
[373.120401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[373.130393] task:btrfs state:D stack: 0 pid: 3145 ppid: 3141 flags:0x00004000
[373.140998] Call Trace:
[373.145501] <TASK>
[373.149654] __schedule+0xb56/0x4850
[373.155306] ? io_schedule_timeout+0x190/0x190
[373.161965] ? lockdep_hardirqs_on+0x7e/0x100
[373.168469] ? _raw_spin_unlock_irqrestore+0x3e/0x60
[373.175468] schedule+0xe0/0x270
[373.180814] wait_for_commit+0x104/0x150 [btrfs]
[373.187643] ? test_and_set_bit+0x20/0x20 [btrfs]
[373.194772] ? kmem_cache_free+0x124/0x550
[373.201191] ? btrfs_put_transaction+0x69/0x3d0 [btrfs]
[373.208738] ? finish_wait+0x270/0x270
[373.214704] ? __btrfs_end_transaction+0x347/0x7b0 [btrfs]
[373.222342] btrfs_commit_transaction+0x44d/0x2610 [btrfs]
[373.230233] ? join_transaction+0x255/0xe30 [btrfs]
[373.237334] ? btrfs_record_root_in_trans+0x4d/0x170 [btrfs]
[373.245251] ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
[373.253296] relocate_block_group+0x105/0xc20 [btrfs]
[373.260533] ? mutex_lock_io_nested+0x1270/0x1270
[373.267516] ? btrfs_wait_nocow_writers+0x85/0x180 [btrfs]
[373.275155] ? merge_reloc_roots+0x710/0x710 [btrfs]
[373.283602] ? btrfs_wait_ordered_extents+0xd30/0xd30 [btrfs]
[373.291934] ? kmem_cache_free+0x124/0x550
[373.298180] btrfs_relocate_block_group+0x35c/0x930 [btrfs]
[373.306047] btrfs_relocate_chunk+0x85/0x210 [btrfs]
[373.313229] btrfs_balance+0x12f4/0x2d20 [btrfs]
[373.320227] ? lock_release+0x3a9/0x6d0
[373.326206] ? btrfs_relocate_chunk+0x210/0x210 [btrfs]
[373.333591] ? lock_is_held_type+0xe4/0x140
[373.340031] ? rcu_read_lock_sched_held+0x3f/0x70
[373.346910] btrfs_ioctl_balance+0x548/0x700 [btrfs]
[373.354207] btrfs_ioctl+0x7f2/0x71b0 [btrfs]
[373.360774] ? lockdep_hardirqs_on_prepare+0x410/0x410
[373.367957] ? lockdep_hardirqs_on_prepare+0x410/0x410
[373.375327] ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
[373.383841] ? find_held_lock+0x2c/0x110
[373.389993] ? lock_release+0x3a9/0x6d0
[373.395828] ? mntput_no_expire+0xf7/0xad0
[373.402083] ? lock_is_held_type+0xe4/0x140
[373.408249] ? vfs_fileattr_set+0x9f0/0x9f0
[373.414486] ? selinux_file_ioctl+0x349/0x4e0
[373.420938] ? trace_raw_output_lock+0xb4/0xe0
[373.427442] ? selinux_inode_getsecctx+0x80/0x80
[373.434224] ? lockdep_hardirqs_on+0x7e/0x100
[373.440660] ? force_qs_rnp+0x2a0/0x6b0
[373.446534] ? lock_is_held_type+0x9b/0x140
[373.452763] ? __blkcg_punt_bio_submit+0x1b0/0x1b0
[373.459732] ? security_file_ioctl+0x50/0x90
[373.466089] __x64_sys_ioctl+0x127/0x190
[373.472022] do_syscall_64+0x3b/0x90
[373.477513] entry_SYSCALL_64_after_hwframe+0x44/0xae
[373.484823] RIP: 0033:0x7f8f4af7e2bb
[373.490493] RSP: 002b:00007ffcbf936178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[373.500197] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8f4af7e2bb
[373.509451] RDX: 00007ffcbf936220 RSI: 00000000c4009420 RDI: 0000000000000003
[373.518659] RBP: 00007ffcbf93774a R08: 0000000000000013 R09: 00007f8f4b02d4e0
[373.527872] R10: 00007f8f4ae87740 R11: 0000000000000246 R12: 0000000000000001
[373.537222] R13: 00007ffcbf936220 R14: 0000000000000000 R15: 0000000000000002
[373.546506] </TASK>
[373.550878] INFO: task btrfs:3146 blocked for more than 123 seconds.
[373.559383] Not tainted 5.16.0-rc8 #7
[373.565748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[373.575748] task:btrfs state:D stack: 0 pid: 3146 ppid: 2168 flags:0x00000000
[373.586314] Call Trace:
[373.590846] <TASK>
[373.595121] __schedule+0xb56/0x4850
[373.600901] ? __lock_acquire+0x23db/0x5030
[373.607176] ? io_schedule_timeout+0x190/0x190
[373.613954] schedule+0xe0/0x270
[373.619157] schedule_timeout+0x168/0x220
[373.625170] ? usleep_range_state+0x150/0x150
[373.631653] ? mark_held_locks+0x9e/0xe0
[373.637767] ? do_raw_spin_lock+0x11e/0x250
[373.643993] ? lockdep_hardirqs_on_prepare+0x17b/0x410
[373.651267] ? _raw_spin_unlock_irq+0x24/0x50
[373.657677] ? lockdep_hardirqs_on+0x7e/0x100
[373.664103] wait_for_completion+0x163/0x250
[373.670437] ? bit_wait_timeout+0x160/0x160
[373.676585] btrfs_quota_disable+0x176/0x9a0 [btrfs]
[373.683979] ? btrfs_quota_enable+0x12f0/0x12f0 [btrfs]
[373.691340] ? down_write+0xd0/0x130
[373.696880] ? down_write_killable+0x150/0x150
[373.703352] btrfs_ioctl+0x3945/0x71b0 [btrfs]
[373.710061] ? find_held_lock+0x2c/0x110
[373.716192] ? lock_release+0x3a9/0x6d0
[373.722047] ? __handle_mm_fault+0x23cd/0x3050
[373.728486] ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
[373.737032] ? set_pte+0x6a/0x90
[373.742271] ? do_raw_spin_unlock+0x55/0x1f0
[373.748506] ? lock_is_held_type+0xe4/0x140
[373.754792] ? vfs_fileattr_set+0x9f0/0x9f0
[373.761083] ? selinux_file_ioctl+0x349/0x4e0
[373.767521] ? selinux_inode_getsecctx+0x80/0x80
[373.774247] ? __up_read+0x182/0x6e0
[373.780026] ? count_memcg_events.constprop.0+0x46/0x60
[373.787281] ? up_write+0x460/0x460
[373.792932] ? security_file_ioctl+0x50/0x90
[373.799232] __x64_sys_ioctl+0x127/0x190
[373.805237] do_syscall_64+0x3b/0x90
[373.810947] entry_SYSCALL_64_after_hwframe+0x44/0xae
[373.818102] RIP: 0033:0x7f1383ea02bb
[373.823847] RSP: 002b:00007fffeb4d71f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[373.833641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1383ea02bb
[373.842961] RDX: 00007fffeb4d7210 RSI: 00000000c0109428 RDI: 0000000000000003
[373.852179] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
[373.861408] R10: 00007f1383daec78 R11: 0000000000000202 R12: 00007fffeb4d874a
[373.870647] R13: 0000000000493099 R14: 0000000000000001 R15: 0000000000000000
[373.879838] </TASK>
[373.884018]
Showing all locks held in the system:
[373.894250] 3 locks held by kworker/4:1/58:
[373.900356] 1 lock held by khungtaskd/63:
[373.906333] #0: ffffffff8945ff60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260
[373.917307] 3 locks held by kworker/u16:6/103:
[373.923938] #0: ffff888127b4f138 ((wq_completion)btrfs-qgroup-rescan){+.+.}-{0:0}, at: process_one_work+0x712/0x1320
[373.936555] #1: ffff88810b817dd8 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x73f/0x1320
[373.951109] #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_qgroup_rescan_worker+0x1f6/0x10c0 [btrfs]
[373.964027] 2 locks held by less/1803:
[373.969982] #0: ffff88813ed56098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x80
[373.981295] #1: ffffc90000b3b2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x9e2/0x1060
[373.992969] 1 lock held by btrfs-transacti/2347:
[373.999893] #0: ffff88813d4887a8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0xe3/0x3c0 [btrfs]
[374.015872] 3 locks held by btrfs/3145:
[374.022298] #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl_balance+0xc3/0x700 [btrfs]
[374.034456] #1: ffff88813d48a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0xfe5/0x2d20 [btrfs]
[374.047646] #2: ffff88813d488838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x354/0x930 [btrfs]
[374.063295] 4 locks held by btrfs/3146:
[374.069647] #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl+0x38b1/0x71b0 [btrfs]
[374.081601] #1: ffff88813d488bb8 (&fs_info->subvol_sem){+.+.}-{3:3}, at: btrfs_ioctl+0x38fd/0x71b0 [btrfs]
[374.094283] #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_disable+0xc8/0x9a0 [btrfs]
[374.106885] #3: ffff88813d489800 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_disable+0xd5/0x9a0 [btrfs]
[374.126780] =============================================
To avoid the deadlock, wait for the qgroup rescan worker to complete
before starting the transaction for the quota disable ioctl. Clear
BTRFS_FS_QUOTA_ENABLE flag before the wait and the transaction to
request the worker to complete. On transaction start failure, set the
BTRFS_FS_QUOTA_ENABLE flag again. These BTRFS_FS_QUOTA_ENABLE flag
changes can be done safely since the function btrfs_quota_disable is not
called concurrently because of fs_info->subvol_sem.
Also check the BTRFS_FS_QUOTA_ENABLE flag in qgroup_rescan_init to avoid
another qgroup rescan worker to start after the previous qgroup worker
completed.
CC: stable@vger.kernel.org # 5.4+
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
The following super simple script would crash btrfs at unmount time, if
CONFIG_BTRFS_ASSERT() is set.
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite 0 4k" $mnt/file
umount $mnt
mount -r ro $dev $mnt
btrfs scrub start -Br $mnt
umount $mnt
This will trigger the following ASSERT() introduced by commit
0a31daa4b602 ("btrfs: add assertion for empty list of transactions at
late stage of umount").
That patch is definitely not the cause, it just makes enough noise for
developers.
[CAUSE]
We will start transaction for the following call chain during scrub:
scrub_enumerate_chunks()
|- btrfs_inc_block_group_ro()
|- btrfs_join_transaction()
However for RO mount, there is no running transaction at all, thus
btrfs_join_transaction() will start a new transaction.
Furthermore, since it's read-only mount, btrfs_sync_fs() will not call
btrfs_commit_super() to commit the new but empty transaction.
And leads to the ASSERT().
The bug has been there for a long time. Only the new ASSERT() makes it
noisy enough to be noticed.
[FIX]
For read-only scrub on read-only mount, there is no need to start a
transaction nor to allocate new chunks in btrfs_inc_block_group_ro().
Just do extra read-only mount check in btrfs_inc_block_group_ro(), and
if it's read-only, skip all chunk allocation and go inc_block_group_ro()
directly.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"Fixes for userspace breakage caused by fsnotify changes ~3 years ago
and one fanotify cleanup"
* tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix fsnotify hooks in pseudo filesystems
fsnotify: invalidate dcache before IN_DELETE event
fanotify: remove variable set but not used
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Several fixes for defragmentation that got broken in 5.16 after
refactoring and added subpage support. The observed bugs are excessive
IO or uninterruptible ioctl.
All stable material"
* tag 'for-5.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: update writeback index when starting defrag
btrfs: add back missing dirty page rate limiting to defrag
btrfs: fix deadlock when reserving space during defrag
btrfs: defrag: properly update range->start for autodefrag
btrfs: defrag: fix wrong number of defragged sectors
btrfs: allow defrag to be interruptible
btrfs: fix too long loop when defragging a 1 byte file
|
|
When starting a defrag, we should update the writeback index of the
inode's mapping in case it currently has a value beyond the start of the
range we are defragging. This can help performance and often result in
getting less extents after writeback - for e.g., if the current value
of the writeback index sits somewhere in the middle of a range that
gets dirty by the defrag, then after writeback we can get two smaller
extents instead of a single, larger extent.
We used to have this before the refactoring in 5.16, but it was removed
without any reason to do so. Originally it was added in kernel 3.1, by
commit 2a0f7f5769992b ("Btrfs: fix recursive auto-defrag"), in order to
fix a loop with autodefrag resulting in dirtying and writing pages over
and over, but some testing on current code did not show that happening,
at least with the test described in that commit.
So add back the behaviour, as at the very least it is a nice to have
optimization.
Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A defrag operation can dirty a lot of pages, specially if operating on
the entire file or a large file range. Any task dirtying pages should
periodically call balance_dirty_pages_ratelimited(), as stated in that
function's comments, otherwise they can leave too many dirty pages in
the system. This is what we did before the refactoring in 5.16, and
it should have remained, just like in the buffered write path and
relocation. So restore that behaviour.
Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When defragging we can end up collecting a range for defrag that has
already pages under delalloc (dirty), as long as the respective extent
map for their range is not mapped to a hole, a prealloc extent or
the extent map is from an old generation.
Most of the time that is harmless from a functional perspective at
least, however it can result in a deadlock:
1) At defrag_collect_targets() we find an extent map that meets all
requirements but there's delalloc for the range it covers, and we add
its range to list of ranges to defrag;
2) The defrag_collect_targets() function is called at defrag_one_range(),
after it locked a range that overlaps the range of the extent map;
3) At defrag_one_range(), while the range is still locked, we call
defrag_one_locked_target() for the range associated to the extent
map we collected at step 1);
4) Then finally at defrag_one_locked_target() we do a call to
btrfs_delalloc_reserve_space(), which will reserve data and metadata
space. If the space reservations can not be satisfied right away, the
flusher might be kicked in and start flushing delalloc and wait for
the respective ordered extents to complete. If this happens we will
deadlock, because both flushing delalloc and finishing an ordered
extent, requires locking the range in the inode's io tree, which was
already locked at defrag_collect_targets().
So fix this by skipping extent maps for which there's already delalloc.
Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Apparently, there are some applications that use IN_DELETE event as an
invalidation mechanism and expect that if they try to open a file with
the name reported with the delete event, that it should not contain the
content of the deleted file.
Commit 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of
d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
will have access to a positive dentry.
This allowed a race where opening the deleted file via cached dentry
is now possible after receiving the IN_DELETE event.
To fix the regression, create a new hook fsnotify_delete() that takes
the unlinked inode as an argument and use a helper d_delete_notify() to
pin the inode, so we can pass it to fsnotify_delete() after d_delete().
Backporting hint: this regression is from v5.3. Although patch will
apply with only trivial conflicts to v5.4 and v5.10, it won't build,
because fsnotify_delete() implementation is different in each of those
versions (see fsnotify_link()).
A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo
filesystem that do not need to call d_delete().
Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com
Reported-by: Ivan Delalande <colona@arista.com>
Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
Fixes: 49246466a989 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Patch series "remove Xen tmem leftovers".
Since the removal of the Xen tmem driver in 2019, the cleancache hooks
are entirely unused, as are large parts of frontswap. This series
against linux-next (with the folio changes included) removes
cleancaches, and cuts down frontswap to the bits actually used by zswap.
This patch (of 13):
The cleancache subsystem is unused since the removal of Xen tmem driver
in commit 814bbf49dcd0 ("xen: remove tmem driver").
[akpm@linux-foundation.org: remove now-unreachable code]
Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge more updates from Andrew Morton:
"55 patches.
Subsystems affected by this patch series: percpu, procfs, sysctl,
misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2,
hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits)
lib: remove redundant assignment to variable ret
ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR
lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB
btrfs: use generic Kconfig option for 256kB page size limit
arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB
configs: introduce debug.config for CI-like setup
delayacct: track delays from memory compact
Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact
delayacct: cleanup flags in struct task_delay_info and functions use it
delayacct: fix incomplete disable operation when switch enable to disable
delayacct: support swapin delay accounting for swapping without blkio
panic: remove oops_id
panic: use error_report_end tracepoint on warnings
fs/adfs: remove unneeded variable make code cleaner
FAT: use io_schedule_timeout() instead of congestion_wait()
hfsplus: use struct_group_attr() for memcpy() region
nilfs2: remove redundant pointer sbufs
fs/binfmt_elf: use PT_LOAD p_align values for static PIE
const_structs.checkpatch: add frequently used ops structs
...
|
|
Use the newly introduced CONFIG_PAGE_SIZE_LESS_THAN_256KB to describe
the dependency introduced by commit b05fbcc36be1 ("btrfs: disable build
on platforms having page size 256K").
Link: https://lkml.kernel.org/r/20211129230141.228085-3-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
[BUG]
After commit 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to
implement btrfs_defrag_file()") autodefrag no longer properly re-defrag
the file from previously finished location.
[CAUSE]
The recent refactoring of defrag only focuses on defrag ioctl subpage
support, doesn't take autodefrag into consideration.
There are two problems involved which prevents autodefrag to restart its
scan:
- No range.start update
Previously when one defrag target is found, range->start will be
updated to indicate where next search should start from.
But now btrfs_defrag_file() doesn't update it anymore, making all
autodefrag to rescan from file offset 0.
This would also make autodefrag to mark the same range dirty again and
again, causing extra IO.
- No proper quick exit for defrag_one_cluster()
Currently if we reached or exceed @max_sectors limit, we just exit
defrag_one_cluster(), and let next defrag_one_cluster() call to do a
quick exit.
This makes @cur increase, thus no way to properly know which range is
defragged and which range is skipped.
[FIX]
The fix involves two modifications:
- Update range->start to next cluster start
This is a little different from the old behavior.
Previously range->start is updated to the next defrag target.
But in the end, the behavior should still be pretty much the same,
as now we skip to next defrag target inside btrfs_defrag_file().
Thus if auto-defrag determines to re-scan, then we still do the skip,
just at a different timing.
- Make defrag_one_cluster() to return >0 to indicate a quick exit
So that btrfs_defrag_file() can also do a quick exit, without
increasing @cur to the range end, and re-use @cur to update
@range->start.
- Add comment for btrfs_defrag_file() to mention the range->start update
Currently only autodefrag utilize this behavior, as defrag ioctl won't
set @max_to_defrag parameter, thus unless interrupted it will always
try to defrag the whole range.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There are users using autodefrag mount option reporting obvious increase
in IO:
> If I compare the write average (in total, I don't have it per process)
> when taking idle periods on the same machine:
> Linux 5.16:
> without autodefrag: ~ 10KiB/s
> with autodefrag: between 1 and 2MiB/s.
>
> Linux 5.15:
> with autodefrag:~ 10KiB/s (around the same as without
> autodefrag on 5.16)
[CAUSE]
When autodefrag mount option is enabled, btrfs_defrag_file() will be
called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many
sectors we can defrag in one try.
And then use the number of sectors defragged to determine if we need to
re-defrag.
But commit b18c3ab2343d ("btrfs: defrag: introduce helper to defrag one
cluster") uses wrong unit to increase @sectors_defragged, which should
be in unit of sector, not byte.
This means, if we have defragged any sector, then @sectors_defragged
will be >= sectorsize (normally 4096), which is larger than
BTRFS_DEFRAG_BATCH.
This makes the @max_sectors check in defrag_one_cluster() to underflow,
rendering the whole @max_sectors check useless.
Thus causing way more IO for autodefrag mount options, as now there is
no limit on how many sectors can really be defragged.
[FIX]
Fix the problems by:
- Use sector as unit when increasing @sectors_defragged
- Include @sectors_defragged > @max_sectors case to break the loop
- Add extra comment on the return value of btrfs_defrag_file()
Reported-by: Anthony Ruhier <aruhier@mailbox.org>
Fixes: b18c3ab2343d ("btrfs: defrag: introduce helper to defrag one cluster")
Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During defrag, at btrfs_defrag_file(), we have this loop that iterates
over a file range in steps no larger than 256K subranges. If the range
is too long, there's no way to interrupt it. So make the loop check in
each iteration if there's signal pending, and if there is, break and
return -AGAIN to userspace.
Before kernel 5.16, we used to allow defrag to be cancelled through a
signal, but that was lost with commit 7b508037d4cac3 ("btrfs: defrag:
use defrag_one_cluster() to implement btrfs_defrag_file()").
This change adds back the possibility to cancel a defrag with a signal
and keeps the same semantics, returning -EAGAIN to user space (and not
the usually more expected -EINTR).
This is also motivated by a recent bug on 5.16 where defragging a 1 byte
file resulted in iterating from file range 0 to (u64)-1, as hitting the
bug triggered a too long loop, basically requiring one to reboot the
machine, as it was not possible to cancel defrag.
Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When attempting to defrag a file with a single byte, we can end up in a
too long loop, which is nearly infinite because at btrfs_defrag_file()
we end up with the variable last_byte assigned with a value of
18446744073709551615 (which is (u64)-1). The problem comes from the fact
we end up doing:
last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
So if last_byte was assigned 0, which is i_size - 1, we underflow and
end up with the value 18446744073709551615.
This is trivial to reproduce and the following script triggers it:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo -n "X" > $MNT/foobar
btrfs filesystem defragment $MNT/foobar
umount $MNT
So fix this by not decrementing last_byte by 1 before doing the sector
size round up. Also, to make it easier to follow, make the round up right
after computing last_byte.
Reported-by: Anthony Ruhier <aruhier@mailbox.org>
Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Print extra information about how many dirty bytes an uncommitted
has at the end of mount.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we extended the size of a swapfile after its header was created (by the
mkswap utility) and then try to activate it, we will map the entire file
when activating the swap file, instead of limiting to the max size defined
in the swap file's header.
Currently test case generic/643 from fstests fails because we do not
respect that size limit defined in the swap file's header.
So fix this by not mapping file ranges beyond the max size defined in the
swap header.
This is the same type of bug that iomap used to have, and was fixed in
commit 36ca7943ac18ae ("mm/swap: consider max pages in
iomap_swapfile_add_extent").
Fixes: ed46ff3d423780 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The warnings were found by running scripts/kernel-doc, which is
caused by using 'make W=1'.
fs/btrfs/extent_io.c:3210: warning: Function parameter or member
'bio_ctrl' not described in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
description in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter
'prev_bio_flags' description in 'btrfs_bio_add_page'
fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
description in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1602: warning: Function parameter or member
'fs_info' not described in 'btrfs_reserve_metadata_bytes'
Note: this is fixing only the warnings regarding parameter list, the
first line is not strictly conforming to the kdoc format as the btrfs
codebase does not stick to that and keeps the first line more free form
(because it's only for internal use).
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|