summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-09-24Merge branch 'for-5.10/block' into for-5.10/driversJens Axboe
* for-5.10/block: (140 commits) bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag bdi: invert BDI_CAP_NO_ACCT_WB bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag mm: use SWP_SYNCHRONOUS_IO more intelligently bdi: remove BDI_CAP_SYNCHRONOUS_IO bdi: remove BDI_CAP_CGROUP_WRITEBACK block: lift setting the readahead size into the block layer md: update the optimal I/O size on reshape bdi: initialize ->ra_pages and ->io_pages in bdi_init aoe: set an optimal I/O size bcache: inherit the optimal I/O size drbd: remove dead code in device_to_statistics fs: remove the unused SB_I_MULTIROOT flag block: mark blkdev_get static PM: mm: cleanup swsusp_swap_check mm: split swap_type_of PM: rewrite is_hibernate_resume_dev to not require an inode mm: cleanup claim_swapfile ocfs2: cleanup o2hb_region_dev_store dasd: cleanup dasd_scan_partitions ...
2020-09-24bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flagChristoph Hellwig
Replace the two negative flags that are always used together with a single positive flag that indicates the writeback capability instead of two related non-capabilities. Also remove the pointless wrappers to just check the flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: invert BDI_CAP_NO_ACCT_WBChristoph Hellwig
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to make the checks more obvious. Also remove the pointless bdi_cap_account_writeback wrapper that just obsfucates the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flagChristoph Hellwig
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangling the dependency replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code. One downside is that we an't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which also is writable for easier testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: remove BDI_CAP_CGROUP_WRITEBACKChristoph Hellwig
Just checking SB_I_CGROUPWB for cgroup writeback support is enough. Either the file system allocates its own bdi (e.g. btrfs), in which case it is known to support cgroup writeback, or the bdi comes from the block layer, which always supports cgroup writeback. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: initialize ->ra_pages and ->io_pages in bdi_initChristoph Hellwig
Set up a readahead size by default, as very few users have a good reason to change it. This means code, ecryptfs, and orangefs now set up the values while they were previously missing it, while ubifs, mtd and vboxsf manually set it to 0 to avoid readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24fs: remove the unused SB_I_MULTIROOT flagChristoph Hellwig
The last user of SB_I_MULTIROOT is disappeared with commit f2aedb713c28 ("NFS: Add fs_context support.") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: mark blkdev_get staticChristoph Hellwig
There are no users outside the core block code left now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23PM: rewrite is_hibernate_resume_dev to not require an inodeChristoph Hellwig
Just check the dev_t to help simplifying the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23ocfs2: cleanup o2hb_region_dev_storeChristoph Hellwig
Use blkdev_get_by_dev instead of igrab (aka open coded bdgrab) + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: move the NEED_PART_SCAN flag to struct gendiskChristoph Hellwig
We can only scan for partitions on the whole disk, so move the flag from struct block_device to struct gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-22Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs fixes from Al Viro: "No common topic, just assorted fixes" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse: fix the ->direct_IO() treatment of iov_iter fs: fix cast in fsparam_u32hex() macro vboxsf: Fix the check for the old binary mount-arguments struct
2020-09-22Merge tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "A few fixes - most of them regression fixes from this cycle, but also a few stable heading fixes, and a build fix for the included demo tool since some systems now actually have gettid() available" * tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block: io_uring: fix openat/openat2 unified prep handling io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL tools/io_uring: fix compile breakage io_uring: don't use retry based buffered reads for non-async bdev io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there io_uring: don't run task work on an exiting task io_uring: drop 'ctx' ref on task work cancelation io_uring: grab any needed state during defer prep
2020-09-21io_uring: fix openat/openat2 unified prep handlingJens Axboe
A previous commit unified how we handle prep for these two functions, but this means that we check the allowed context (SQPOLL, specifically) later than we should. Move the ring type checking into the two parent functions, instead of doing it after we've done some setup work. Fixes: ec65fea5a8d7 ("io_uring: deduplicate io_openat{,2}_prep()") Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-21io_uring: mark statx/files_update/epoll_ctl as non-SQPOLLJens Axboe
These will naturally fail when attempted through SQPOLL, but either with -EFAULT or -EBADF. Make it explicit that these are not workable through SQPOLL and return -EINVAL, just like other ops that need to use ->files. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-21io_uring: don't use retry based buffered reads for non-async bdevJens Axboe
Some block devices, like dm, bubble back -EAGAIN through the completion handler. We check for this in io_read(), but don't honor it for when we have copied the iov. Return -EAGAIN for this case before retrying, to force punt to io-wq. Fixes: bcf5a06304d6 ("io_uring: support true async buffered reads, if file provides it") Reported-by: Zorro Lang <zlang@redhat.com> Tested-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-21io_uring: don't re-setup vecs/iter in io_resumit_prep() is already thereJens Axboe
If we already have mapped the necessary data for retry, then don't set it up again. It's a pointless operation, and we leak the iovec if it's a large (non-stack) vec. Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-19fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match ↵Tobias Klauser
prototype Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the definition of dirtytime_interval_handler to match its prototype in linux/writeback.h which fixes the following sparse error/warning: fs/fs-writeback.c:2189:50: warning: incorrect type in argument 3 (different address spaces) fs/fs-writeback.c:2189:50: expected void * fs/fs-writeback.c:2189:50: got void [noderef] __user *buffer fs/fs-writeback.c:2184:5: error: symbol 'dirtytime_interval_handler' redeclared with different type (incompatible argument 3 (different address spaces)): fs/fs-writeback.c:2184:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... ) fs/fs-writeback.c: note: in included file: ./include/linux/writeback.h:374:5: note: previously declared as: ./include/linux/writeback.h:374:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... ) Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20200907093140.13434-1-tklauser@distanz.ch Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-17fuse: fix the ->direct_IO() treatment of iov_iterAl Viro
the callers rely upon having any iov_iter_truncate() done inside ->direct_IO() countered by iov_iter_reexpand(). Reported-by: Qian Cai <cai@redhat.com> Tested-by: Qian Cai <cai@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-14Merge tag 'for-5.9-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One of the recent lockdep fixes introduced a bug that breaks the search ioctl, which is used by some applications (bees, compsize). The patch made it to stable trees so we need this fixup to make it work again" * tag 'for-5.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix wrong address when faulting in pages in the search ioctl
2020-09-14io_uring: don't run task work on an exiting taskJens Axboe
This isn't safe, and isn't needed either. We are guaranteed that any work we queue is on a live task (and will be run), or it goes to our backup io-wq threads if the task is exiting. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14io_uring: drop 'ctx' ref on task work cancelationJens Axboe
If task_work ends up being marked for cancelation, we go through a cancelation helper instead of the queue path. In converting task_work to always hold a ctx reference, this path was missed. Make sure that io_req_task_cancel() puts the reference that is being held against the ctx. Fixes: 6d816e088c35 ("io_uring: hold 'ctx' reference around task_work queue + execute") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14btrfs: fix wrong address when faulting in pages in the search ioctlFilipe Manana
When faulting in the pages for the user supplied buffer for the search ioctl, we are passing only the base address of the buffer to the function fault_in_pages_writeable(). This means that after the first iteration of the while loop that searches for leaves, when we have a non-zero offset, stored in 'sk_offset', we try to fault in a wrong page range. So fix this by adding the offset in 'sk_offset' to the base address of the user supplied buffer when calling fault_in_pages_writeable(). Several users have reported that the applications compsize and bees have started to operate incorrectly since commit a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl") was added to stable trees, and these applications make heavy use of the search ioctls. This fixes their issues. Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/ Link: https://github.com/kilobyte/compsize/issues/34 Fixes: a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl") CC: stable@vger.kernel.org # 4.4+ Tested-by: A L <mail@lechevalier.se> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-13io_uring: grab any needed state during defer prepJens Axboe
Always grab work environment for deferred links. The assumption that we will be running it always from the task in question is false, as exiting tasks may mean that we're deferring this one to a thread helper. And at that point it's too late to grab the work environment. Fixes: debb85f496c9 ("io_uring: factor out grab_env() from defer_prep()") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-13Merge tag 'driver-core-5.9-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are some small driver core and debugfs fixes for 5.9-rc5 Included in here are: - firmware loader memory leak fix - firmware loader testing fixes for non-EFI systems - device link locking fixes found by lockdep - kobject_del() bugfix that has been affecting some callers - debugfs minor fix All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: test_firmware: Test platform fw loading on non-EFI systems PM: <linux/device.h>: fix @em_pd kernel-doc warning kobject: Drop unneeded conditional in __kobject_del() driver core: Fix device_pm_lock() locking for device links MAINTAINERS: Add the security document to SECURITY CONTACT driver code: print symbolic error code debugfs: Fix module state check condition kobject: Restore old behaviour of kobject_del(NULL) firmware_loader: fix memory leak for paged buffer
2020-09-12Merge tag 'for-5.9-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes: - regression fix for a crash after failed snapshot creation - one more lockep fix: use nofs allocation when allocating missing device - fix reloc tree leak on degraded mount - make some extent buffer alignment checks less strict to mount filesystems created by btrfs-convert" * tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix NULL pointer dereference after failure to create snapshot btrfs: free data reloc tree on failed mount btrfs: require only sector size alignment for parent eb bytenr btrfs: fix lockdep splat in add_missing_dev
2020-09-12Merge tag '5.9-rc4-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fix from Steve French: "A fix for lookup on DFS link when cifsacl or modefromsid is used" * tag '5.9-rc4-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix DFS mount with cifsacl/modefromsid
2020-09-10Merge tag 'f2fs-for-5.9-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: "Small bug fixes for: - SMR drive fix - infinite loop when building free node ids - EOF at DIO read" * tag 'f2fs-for-5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: Return EOF on unaligned end of file DIO read f2fs: fix indefinite loop scanning for free nid f2fs: Fix type of section block count variables
2020-09-10block: remove check_disk_changeChristoph Hellwig
Remove the now unused check_disk_change helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10block: add a bdev_check_media_change helperChristoph Hellwig
Like check_disk_changed, except that it does not call ->revalidate_disk but leaves that to the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-09Merge tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: - Fix an NFS/RDMA resource leak - Fix the error handling during delegation recall - NFSv4.0 needs to return the delegation on a zero-stateid SETATTR - Stop printk reading past end of string * tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: stop printk reading past end of string NFS: Zero-stateid SETATTR should first return delegation NFSv4.1 handle ERR_DELAY error reclaiming locking state on delegation recall xprtrdma: Release in-flight MRs on disconnect
2020-09-08f2fs: Return EOF on unaligned end of file DIO readGabriel Krisman Bertazi
Reading past end of file returns EOF for aligned reads but -EINVAL for unaligned reads on f2fs. While documentation is not strict about this corner case, most filesystem returns EOF on this case, like iomap filesystems. This patch consolidates the behavior for f2fs, by making it return EOF(0). it can be verified by a read loop on a file that does a partial read before EOF (A file that doesn't end at an aligned address). The following code fails on an unaligned file on f2fs, but not on btrfs, ext4, and xfs. while (done < total) { ssize_t delta = pread(fd, buf + done, total - done, off + done); if (!delta) break; ... } It is arguable whether filesystems should actually return EOF or -EINVAL, but since iomap filesystems support it, and so does the original DIO code, it seems reasonable to consolidate on that. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-08f2fs: fix indefinite loop scanning for free nidSahitya Tummala
If the sbi->ckpt->next_free_nid is not NAT block aligned and if there are free nids in that NAT block between the start of the block and next_free_nid, then those free nids will not be scanned in scan_nat_page(). This results into mismatch between nm_i->available_nids and the sum of nm_i->free_nid_count of all NAT blocks scanned. And nm_i->available_nids will always be greater than the sum of free nids in all the blocks. Under this condition, if we use all the currently scanned free nids, then it will loop forever in f2fs_alloc_nid() as nm_i->available_nids is still not zero but nm_i->free_nid_count of that partially scanned NAT block is zero. Fix this to align the nm_i->next_scan_nid to the first nid of the corresponding NAT block. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-08f2fs: Fix type of section block count variablesShin'ichiro Kawasaki
Commit da52f8ade40b ("f2fs: get the right gc victim section when section has several segments") added code to count blocks of each section using variables with type 'unsigned short', which has 2 bytes size in many systems. However, the counts can be larger than the 2 bytes range and type conversion results in wrong values. Especially when the f2fs sections have blocks as many as USHRT_MAX + 1, the count is handled as 0. This triggers eternal loop in init_dirty_segmap() at mount system call. Fix this by changing the type of the variables to block_t. Fixes: da52f8ade40b ("f2fs: get the right gc victim section when section has several segments") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-07block: Do not discard buffers under a mounted filesystemJan Kara
Discarding blocks and buffers under a mounted filesystem is hardly anything admin wants to do. Usually it will confuse the filesystem and sometimes the loss of buffer_head state (including b_private field) can even cause crashes like: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1 Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015 RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2] ... Call Trace: __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2] jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2] kjournald2+0xbd/0x270 [jbd2] So if we don't have block device open with O_EXCL already, claim the block device while we truncate buffer cache. This makes sure any exclusive block device user (such as filesystem) cannot operate on the device while we are discarding buffer cache. Reported-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> [axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-07btrfs: fix NULL pointer dereference after failure to create snapshotFilipe Manana
When trying to get a new fs root for a snapshot during the transaction at transaction.c:create_pending_snapshot(), if btrfs_get_new_fs_root() fails we leave "pending->snap" pointing to an error pointer, and then later at ioctl.c:create_snapshot() we dereference that pointer, resulting in a crash: [12264.614689] BUG: kernel NULL pointer dereference, address: 00000000000007c4 [12264.615650] #PF: supervisor write access in kernel mode [12264.616487] #PF: error_code(0x0002) - not-present page [12264.617436] PGD 0 P4D 0 [12264.618328] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [12264.619150] CPU: 0 PID: 2310635 Comm: fsstress Tainted: G W 5.9.0-rc3-btrfs-next-67 #1 [12264.619960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [12264.621769] RIP: 0010:btrfs_mksubvol+0x438/0x4a0 [btrfs] [12264.622528] Code: bc ef ff ff (...) [12264.624092] RSP: 0018:ffffaa6fc7277cd8 EFLAGS: 00010282 [12264.624669] RAX: 00000000fffffff4 RBX: ffff9d3e8f151a60 RCX: 0000000000000000 [12264.625249] RDX: 0000000000000001 RSI: ffffffff9d56c9be RDI: fffffffffffffff4 [12264.625830] RBP: ffff9d3e8f151b48 R08: 0000000000000000 R09: 0000000000000000 [12264.626413] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffff4 [12264.626994] R13: ffff9d3ede380538 R14: ffff9d3ede380500 R15: ffff9d3f61b2eeb8 [12264.627582] FS: 00007f140d5d8200(0000) GS:ffff9d3fb5e00000(0000) knlGS:0000000000000000 [12264.628176] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12264.628773] CR2: 00000000000007c4 CR3: 000000020f8e8004 CR4: 00000000003706f0 [12264.629379] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [12264.629994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [12264.630594] Call Trace: [12264.631227] btrfs_mksnapshot+0x7b/0xb0 [btrfs] [12264.631840] __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs] [12264.632458] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs] [12264.633078] btrfs_ioctl+0x1864/0x3130 [btrfs] [12264.633689] ? do_sys_openat2+0x1a7/0x2d0 [12264.634295] ? kmem_cache_free+0x147/0x3a0 [12264.634899] ? __x64_sys_ioctl+0x83/0xb0 [12264.635488] __x64_sys_ioctl+0x83/0xb0 [12264.636058] do_syscall_64+0x33/0x80 [12264.636616] entry_SYSCALL_64_after_hwframe+0x44/0xa9 (gdb) list *(btrfs_mksubvol+0x438) 0x7c7b8 is in btrfs_mksubvol (fs/btrfs/ioctl.c:858). 853 ret = 0; 854 pending_snapshot->anon_dev = 0; 855 fail: 856 /* Prevent double freeing of anon_dev */ 857 if (ret && pending_snapshot->snap) 858 pending_snapshot->snap->anon_dev = 0; 859 btrfs_put_root(pending_snapshot->snap); 860 btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv); 861 free_pending: 862 if (pending_snapshot->anon_dev) So fix this by setting "pending->snap" to NULL if we get an error from the call to btrfs_get_new_fs_root() at transaction.c:create_pending_snapshot(). Fixes: 2dfb1e43f57dd3 ("btrfs: preallocate anon block device at first phase of snapshot creation") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-07fs: Don't invalidate page buffers in block_write_full_page()Jan Kara
If block_write_full_page() is called for a page that is beyond current inode size, it will truncate page buffers for the page and return 0. This logic has been added in 2.5.62 in commit 81eb69062588 ("fix ext3 BUG due to race with truncate") in history.git tree to fix a problem with ext3 in data=ordered mode. This particular problem doesn't exist anymore because ext3 is long gone and ext4 handles ordered data differently. Also normally buffers are invalidated by truncate code and there's no need to specially handle this in ->writepage() code. This invalidation of page buffers in block_write_full_page() is causing issues to filesystems (e.g. ext4 or ocfs2) when block device is shrunk under filesystem's hands and metadata buffers get discarded while being tracked by the journalling layer. Although it is obviously "not supported" it can cause kernel crashes like: [ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at +0000000000000008 [ 7986.697197] PGD 0 P4D 0 [ 7986.699724] Oops: 0002 [#1] SMP PTI [ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G +O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1 [ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015 [ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2] ... [ 7986.810150] Call Trace: [ 7986.812595] __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2] [ 7986.818408] jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2] [ 7986.836467] kjournald2+0xbd/0x270 [jbd2] which is not great. The crash happens because bh->b_private is suddently NULL although BH_JBD flag is still set (this is because block_invalidatepage() cleared BH_Mapped flag and subsequent bh lookup found buffer without BH_Mapped set, called init_page_buffers() which has rewritten bh->b_private). So just remove the invalidation in block_write_full_page(). Note that the buffer cache invalidation when block device changes size is already careful to avoid similar problems by using invalidate_mapping_pages() which skips busy buffers so it was only this odd block_write_full_page() behavior that could tear down bdev buffers under filesystem's hands. Reported-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-07btrfs: free data reloc tree on failed mountJosef Bacik
While testing a weird problem with -o degraded, I noticed I was getting leaked root errors BTRFS warning (device loop0): writable mount is not allowed due to too many missing devices BTRFS error (device loop0): open_ctree failed BTRFS error (device loop0): leaked root -9-0 refcount 1 This is the DATA_RELOC root, which gets read before the other fs roots, but is included in the fs roots radix tree. Handle this by adding a btrfs_drop_and_free_fs_root() on the data reloc root if it exists. This is ok to do here if we fail further up because we will only drop the ref if we delete the root from the radix tree, and all other cleanup won't be duplicated. CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-07btrfs: require only sector size alignment for parent eb bytenrQu Wenruo
[BUG] A completely sane converted fs will cause kernel warning at balance time: [ 1557.188633] BTRFS info (device sda7): relocating block group 8162107392 flags data [ 1563.358078] BTRFS info (device sda7): found 11722 extents [ 1563.358277] BTRFS info (device sda7): leaf 7989321728 gen 95 total ptrs 213 free space 3458 owner 2 [ 1563.358280] item 0 key (7984947200 169 0) itemoff 16250 itemsize 33 [ 1563.358281] extent refs 1 gen 90 flags 2 [ 1563.358282] ref#0: tree block backref root 4 [ 1563.358285] item 1 key (7985602560 169 0) itemoff 16217 itemsize 33 [ 1563.358286] extent refs 1 gen 93 flags 258 [ 1563.358287] ref#0: shared block backref parent 7985602560 [ 1563.358288] (parent 7985602560 is NOT ALIGNED to nodesize 16384) [ 1563.358290] item 2 key (7985635328 169 0) itemoff 16184 itemsize 33 ... [ 1563.358995] BTRFS error (device sda7): eb 7989321728 invalid extent inline ref type 182 [ 1563.358996] ------------[ cut here ]------------ [ 1563.359005] WARNING: CPU: 14 PID: 2930 at 0xffffffff9f231766 Then with transaction abort, and obviously failed to balance the fs. [CAUSE] That mentioned inline ref type 182 is completely sane, it's BTRFS_SHARED_BLOCK_REF_KEY, it's some extra check making kernel to believe it's invalid. Commit 64ecdb647ddb ("Btrfs: add one more sanity check for shared ref type") introduced extra checks for backref type. One of the requirement is, parent bytenr must be aligned to node size, which is not correct. One example is like this: 0 1G 1G+4K 2G 2G+4K | |///////////////////|//| <- A chunk starts at 1G+4K | | <- A tree block get reserved at bytenr 1G+4K Then we have a valid tree block at bytenr 1G+4K, but not aligned to nodesize (16K). Such chunk is not ideal, but current kernel can handle it pretty well. We may warn about such tree block in the future, but should not reject them. [FIX] Change the alignment requirement from node size alignment to sector size alignment. Also, to make our lives a little easier, also output @iref when btrfs_get_extent_inline_ref_type() failed, so we can locate the item easier. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205475 Fixes: 64ecdb647ddb ("Btrfs: add one more sanity check for shared ref type") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [ update comments and messages ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-07btrfs: fix lockdep splat in add_missing_devJosef Bacik
Nikolay reported a lockdep splat in generic/476 that I could reproduce with btrfs/187. ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc2+ #1 Tainted: G W ------------------------------------------------------ kswapd0/100 is trying to acquire lock: ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330 but task is already holding lock: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x65/0x80 slab_pre_alloc_hook.constprop.0+0x20/0x200 kmem_cache_alloc_trace+0x3a/0x1a0 btrfs_alloc_device+0x43/0x210 add_missing_dev+0x20/0x90 read_one_chunk+0x301/0x430 btrfs_read_sys_array+0x17b/0x1b0 open_ctree+0xa62/0x1896 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x379 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x434/0xc00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 btrfs_chunk_alloc+0x125/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_lookup_inode+0x2a/0x8f __btrfs_update_delayed_inode+0x80/0x240 btrfs_commit_inode_delayed_inode+0x119/0x120 btrfs_evict_inode+0x357/0x500 evict+0xcf/0x1f0 vfs_rmdir.part.0+0x149/0x160 do_rmdir+0x136/0x1a0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: __lock_acquire+0x1184/0x1fa0 lock_acquire+0xa4/0x3d0 __mutex_lock+0x7e/0x7e0 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 kthread+0x138/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&fs_info->chunk_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/100: #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290 #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0 stack backtrace: CPU: 1 PID: 100 Comm: kswapd0 Tainted: G W 5.9.0-rc2+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x92/0xc8 check_noncircular+0x12d/0x150 __lock_acquire+0x1184/0x1fa0 lock_acquire+0xa4/0x3d0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 __mutex_lock+0x7e/0x7e0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? lock_acquire+0xa4/0x3d0 ? btrfs_evict_inode+0x11e/0x500 ? find_held_lock+0x2b/0x80 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 ? _raw_spin_unlock_irqrestore+0x46/0x60 ? add_wait_queue_exclusive+0x70/0x70 ? balance_pgdat+0x670/0x670 kthread+0x138/0x160 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 This is because we are holding the chunk_mutex when we call btrfs_alloc_device, which does a GFP_KERNEL allocation. We don't want to switch that to a GFP_NOFS lock because this is the only place where it matters. So instead use memalloc_nofs_save() around the allocation in order to avoid the lockdep splat. Reported-by: Nikolay Borisov <nborisov@suse.com> CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-06cifs: fix DFS mount with cifsacl/modefromsidRonnie Sahlberg
RHBZ: 1871246 If during cifs_lookup()/get_inode_info() we encounter a DFS link and we use the cifsacl or modefromsid mount options we must suppress any -EREMOTE errors that triggers or else we will not be able to follow the DFS link and automount the target. This fixes an issue with modefromsid/cifsacl where these mountoptions would break DFS and we would no longer be able to access the share. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-09-06Merge tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more io_uring fixes from Jens Axboe: "Two followup fixes. One is fixing a regression from this merge window, the other is two commits fixing cancelation of deferred requests. Both have gone through full testing, and both spawned a few new regression test additions to liburing. - Don't play games with const, properly store the output iovec and assign it as needed. - Deferred request cancelation fix (Pavel)" * tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-block: io_uring: fix linked deferred ->files cancellation io_uring: fix cancel of deferred reqs with ->files io_uring: fix explicit async read/write mapping for large segments
2020-09-05io_uring: fix linked deferred ->files cancellationPavel Begunkov
While looking for ->files in ->defer_list, consider that requests there may actually be links. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-05io_uring: fix cancel of deferred reqs with ->filesPavel Begunkov
While trying to cancel requests with ->files, it also should look for requests in ->defer_list, otherwise it might end up hanging a thread. Cancel all requests in ->defer_list up to the last request there with matching ->files, that's needed to follow drain ordering semantics. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-05Merge tag 'xfs-5.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Darrick Wong: "Fix a broken metadata verifier that would incorrectly validate attr fork extents of a realtime file against the realtime volume" * tag 'xfs-5.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix xfs_bmap_validate_extent_raw when checking attr fork of rt files
2020-09-05xfs: don't update mtime on COW faultsMikulas Patocka
When running in a dax mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime when the user hits a COW fault. This breaks building of the Linux kernel. How to reproduce: 1. extract the Linux kernel tree on dax-mounted xfs filesystem 2. run make clean 3. run make -j12 4. run make -j12 at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3). The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes COW page fault on its .data section, and these faults will incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05ext2: don't update mtime on COW faultsMikulas Patocka
When running in a dax mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the ext2 filesystem would incorrectly update ctime and mtime when the user hits a COW fault. This breaks building of the Linux kernel. How to reproduce: 1. extract the Linux kernel tree on dax-mounted ext2 filesystem 2. run make clean 3. run make -j12 4. run make -j12 at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3). The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes COW page fault on its .data section, and these faults will incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05io_uring: fix explicit async read/write mapping for large segmentsJens Axboe
If we exceed UIO_FASTIOV, we don't handle the transition correctly between an allocated vec for requests that are queued with IOSQE_ASYNC. Store the iovec appropriately and re-set it in the iter iov in case it changed. Fixes: ff6165b2d7f6 ("io_uring: retain iov_iter state over io_read/io_write calls") Reported-by: Nick Hill <nick@nickhill.org> Tested-by: Norman Maurer <norman.maurer@googlemail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-05NFS: Zero-stateid SETATTR should first return delegationChuck Lever
If a write delegation isn't available, the Linux NFS client uses a zero-stateid when performing a SETATTR. NFSv4.0 provides no mechanism for an NFS server to match such a request to a particular client. It recalls all delegations for that file, even delegations held by the client issuing the request. If that client happens to hold a read delegation, the server will recall it immediately, resulting in an NFS4ERR_DELAY/CB_RECALL/ DELEGRETURN sequence. Optimize out this pipeline bubble by having the client return any delegations it may hold on a file before it issues a SETATTR(zero-stateid) on that file. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-09-04Merge tag 'io_uring-5.9-2020-09-04' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - EAGAIN with O_NONBLOCK retry fix - Two small fixes for registered files (Jiufei) * tag 'io_uring-5.9-2020-09-04' of git://git.kernel.dk/linux-block: io_uring: no read/write-retry on -EAGAIN error and O_NONBLOCK marked file io_uring: set table->files[i] to NULL when io_sqe_file_register failed io_uring: fix removing the wrong file in __io_sqe_files_update()