summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2024-05-21Merge tag 'pull-bd_flags-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev flags update from Al Viro: "Compactifying bdev flags. We can easily have up to 24 flags with sane atomicity, _without_ pushing anything out of the first cacheline of struct block_device" * tag 'pull-bd_flags-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: bdev: move ->bd_make_it_fail to ->__bd_flags bdev: move ->bd_ro_warned to ->__bd_flags bdev: move ->bd_has_subit_bio to ->__bd_flags bdev: move ->bd_write_holder into ->__bd_flags bdev: move ->bd_read_only to ->__bd_flags bdev: infrastructure for flags wrapper for access to ->bd_partno Use bdev_is_paritition() instead of open-coding it
2024-05-21Merge tag 'pull-bd_inode-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number
2024-05-21Merge tag 'pull-set_blocksize' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file *' and verifies that the caller has the device opened exclusively" * tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()
2024-05-19Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: "Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro"" * tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits) fs/proc: fix softlockup in __read_vmcore nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON() scripts: checkpatch: check unused parameters for function-like macro Documentation: coding-style: ask function-like macros to evaluate parameters nilfs2: use __field_struct() for a bitwise field selftests/kcmp: remove unused open mode nilfs2: remove calls to folio_set_error() and folio_clear_error() kernel/watchdog_perf.c: tidy up kerneldoc watchdog: allow nmi watchdog to use raw perf event watchdog: handle comma separated nmi_watchdog command line nilfs2: make superblock data array index computation sparse friendly squashfs: remove calls to set the folio error flag squashfs: convert squashfs_symlink_read_folio to use folio APIs scripts/gdb: fix detection of current CPU in KGDB scripts/gdb: make get_thread_info accept pointers scripts/gdb: fix parameter handling in $lx_per_cpu scripts/gdb: fix failing KGDB detection during probe kfifo: don't use "proxy" headers media: stih-cec: add missing io.h media: rc: add missing io.h ...
2024-05-14Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr, libsas). The major update (which causes a conflict with block, see below) is Christoph removing the queue limits and their associated block helpers. The remaining patches are assorted minor fixes and deprecated function updates plus a bit of constification" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (141 commits) scsi: mpi3mr: Sanitise num_phys scsi: lpfc: Copyright updates for 14.4.0.2 patches scsi: lpfc: Update lpfc version to 14.4.0.2 scsi: lpfc: Add support for 32 byte CDBs scsi: lpfc: Change lpfc_hba hba_flag member into a bitmask scsi: lpfc: Introduce rrq_list_lock to protect active_rrq_list scsi: lpfc: Clear deferred RSCN processing flag when driver is unloading scsi: lpfc: Update logging of protection type for T10 DIF I/O scsi: lpfc: Change default logging level for unsolicited CT MIB commands scsi: target: Remove unused list 'device_list' scsi: iscsi: Remove unused list 'connlist_err' scsi: ufs: exynos: Add support for Tensor gs101 SoC scsi: ufs: exynos: Add some pa_dbg_ register offsets into drvdata scsi: ufs: exynos: Allow max frequencies up to 267Mhz scsi: ufs: exynos: Add EXYNOS_UFS_OPT_TIMER_TICK_SELECT option scsi: ufs: exynos: Add EXYNOS_UFS_OPT_UFSPR_SECURE option scsi: ufs: dt-bindings: exynos: Add gs101 compatible scsi: qla2xxx: Fix debugfs output for fw_resource_count scsi: qedf: Ensure the copied buf is NUL terminated scsi: bfa: Ensure the copied buf is NUL terminated ...
2024-05-14Merge tag 'for-6.10-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This update brings a few minor performance improvements, otherwise there's a lot of refactoring, cleanups and other sort of not user visible changes. Performance improvements: - inline b-tree locking functions, improvement in metadata-heavy changes - relax locking on a range that's being reflinked, allows read operations to run in parallel - speed up NOCOW write checks (throughput +9% on a sample test) - extent locking ranges have been reduced in several places, namely around delayed ref processing Core: - more page to folio conversions: - relocation - send - compression - inline extent handling - super block write and wait - extent_map structure optimizations: - reduced structure size - code simplifications - add shrinker for allocated objects, the numbers can go high and could exhaust memory on smaller systems (reported) as they may not get an opportunity to be freed fast enough - extent locking optimizations: - reduce locking ranges where it does not seem to be necessary and are safe due to other means of synchronization - potential improvements due to lower contention, allocation/freeing and state management operations of extent state tracking structures - delayed ref cleanups and simplifications - updated trace points - improved error handling, warnings and assertions - cleanups and refactoring, unification of error handling paths" * tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits) btrfs: qgroup: fix initialization of auto inherit array btrfs: count super block write errors in device instead of tracking folio error state btrfs: use the folio iterator in btrfs_end_super_write() btrfs: convert super block writes to folio in write_dev_supers() btrfs: convert super block writes to folio in wait_dev_supers() bio: Export bio_add_folio_nofail to modules btrfs: remove duplicate included header from fs.h btrfs: add a cached state to extent_clear_unlock_delalloc btrfs: push extent lock down in submit_one_async_extent btrfs: push lock_extent down in cow_file_range() btrfs: move can_cow_file_range_inline() outside of the extent lock btrfs: push lock_extent into cow_file_range_inline btrfs: push extent lock into cow_file_range btrfs: push extent lock into run_delalloc_cow btrfs: remove unlock_extent from run_delalloc_compressed btrfs: push extent lock down in run_delalloc_nocow btrfs: adjust while loop condition in run_delalloc_nocow btrfs: push extent lock into run_delalloc_nocow btrfs: push the extent lock into btrfs_run_delalloc_range btrfs: lock extent when doing inline extent in compression ...
2024-05-13Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...
2024-05-13Merge tag 'vfs-6.10.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup*() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_* bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...
2024-05-10Merge tag 'block-6.9-20240510' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - nvme target fixes (Sagi, Dan, Maurizo) - new vendor quirk for broken MSI (Sean) - Virtual boundary fix for a regression in this merge window (Ming) * tag 'block-6.9-20240510' of git://git.kernel.dk/linux: nvmet-rdma: fix possible bad dereference when freeing rsps nvmet: prevent sprintf() overflow in nvmet_subsys_nsid_exists() nvmet: make nvmet_wq unbound nvmet-auth: return the error code to the nvmet_auth_ctrl_hash() callers nvme-pci: Add quirk for broken MSIs block: set default max segment size in case of virt_boundary
2024-05-09blk-throttle: delay initialization until configurationYu Kuai
Other cgroup policy like bfq, iocost are lazy-initialized when they are configured for the first time for the device, but blk-throttle is initialized unconditionally from blkcg_init_disk(). Delay initialization of blk-throttle as well, to save some cpu and memory overhead if it's not configured. Noted that once it's initialized, it can't be destroyed until disk removal, even if it's disabled. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509121107.3195568-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOWYu Kuai
One the one hand, it's marked EXPERIMENTAL since 2017, and looks like there are no users since then, and no testers and no developers, it's just not active at all. On the other hand, even if the config is disabled, there are still many fields in throtl_grp and throtl_data and many functions that are only used for throtl low. At last, currently blk-throtl is initialized during disk initialization, and destroyed during disk removal, and it exposes many functions to be called directly from block layer. Remove throtl low to make code much more cleaner and follow up work much easier. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240509121107.3195568-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09block: fix that util can be greater than 100%Yu Kuai
util means the percentage that disk has IO, and theoretically it should not be greater than 100%. However, there is a gap for rq-based disk: io_ticks will be updated when rq is allocated, however, before such rq dispatch to driver, it will not be account as inflight from blk_mq_start_request() hence diskstats_show()/part_stat_show() will not update io_ticks. For example: 1) at t0, issue a new IO, rq is allocated, and blk_account_io_start() update io_ticks; 2) something is wrong with drivers, and the rq can't be dispatched; 3) at t0 + 10s, drivers recovers and rq is dispatched and done, io_ticks is updated; Then if user is using "iostat 1" to monitor "util", between t0 - t0+9s, util will be zero, and between t0+9s - t0+10s, util will be 1000%. Fix this problem by updating io_ticks from diskstats_show() and part_stat_show() if there are rq allocated. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123717.3223892-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09block: support to account io_ticks preciselyYu Kuai
Currently, io_ticks is accounted based on sampling, specifically update_io_ticks() will always account io_ticks by 1 jiffies from bdev_start_io_acct()/blk_account_io_start(), and the result can be inaccurate, for example(HZ is 250): Test script: fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms Test result: util is about 90%, while the disk is really idle. This behaviour is introduced by commit 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting"), however, there was a key point that is missed that this patch also improve performance a lot: Before the commit: part_round_stats: if (part->stamp != now) stats |= 1; part_in_flight() -> there can be lots of task here in 1 jiffies. part_round_stats_single() __part_stat_add() part->stamp = now; After the commit: update_io_ticks: stamp = part->bd_stamp; if (time_after(now, stamp)) if (try_cmpxchg()) __part_stat_add() -> only one task can reach here in 1 jiffies. Hence in order to account io_ticks precisely, we only need to know if there are IO inflight at most once in one jiffies. Noted that for rq-based device, iterating tags should not be used here because 'tags->lock' is grabbed in blk_mq_find_and_get_req(), hence part_stat_lock_inc/dec() and part_in_flight() is used to trace inflight. The additional overhead is quite little: - per cpu add/dec for each IO for rq-based device; - per cpu sum for each jiffies; And it's verified by null-blk that there are no performance degration under heavy IO pressure. Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09block: add plug while submitting IOYu Kuai
So that if caller didn't use plug, for example, __blkdev_direct_IO_simple() and __blkdev_direct_IO_async(), block layer can still benefit from caching nsec time in the plug. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07bio: Export bio_add_folio_nofail to modulesMatthew Wilcox (Oracle)
Several modules use __bio_add_page() today and may need to be converted to bio_add_folio_nofail(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07blk-lib: check for kill signal in ioctl BLKDISCARDChristoph Hellwig
Discards can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptable, they have to wait for it to finish, which could be many hours. Open code blkdev_issue_discard in the BLKDISCARD ioctl handler and check for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Heavily based on an earlier patch from Keith Busch. Reported-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: add a bio_await_chain helperKeith Busch
Add a helper to wait for an entire chain of bios to complete. [hch: split from a larger patch, moved and changed the name now that it is non-static] Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: add a blk_alloc_discard_bio helperChristoph Hellwig
Factor out a helper from __blkdev_issue_discard that chews off as much as possible from a discard range and allocates a bio for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: add a bio_chain_and_submit helperChristoph Hellwig
This is basically blk_next_bio just with the bio allocation moved to the caller to allow for more flexible bio handling in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: move discard checks into the ioctl handlerChristoph Hellwig
Most bio operations get basic sanity checking in submit_bio and anything more complicated than that is done in the callers. Discards are a bit different from that in that a lot of checking is done in __blkdev_issue_discard, and the specific errnos for that are returned to userspace. Move the checks that require specific errnos to the ioctl handler instead, and just leave the basic sanity checking in submit_bio for the other handlers. This introduces two changes in behavior: 1) the logical block size alignment check of the start and len is lost for non-ioctl callers. This matches what is done for other operations including reads and writes. We should probably verify this for all bios, but for now make discards match the normal flow. 2) for non-ioctl callers all errors are reported on I/O completion now instead of synchronously. Callers in general mostly ignore or log errors so this will actually simplify the code once cleaned up Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: remove the discard_granularity check in __blkdev_issue_discardChristoph Hellwig
We now set a default granularity in the queue limits API, so don't bother with this extra check. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block/ioctl: prefer different overflow checkJustin Stitt
Running syzkaller with the newly reintroduced signed integer overflow sanitizer shows this report: [ 62.982337] ------------[ cut here ]------------ [ 62.985692] cgroup: Invalid name [ 62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46 [ 62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1 [ 62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long' [ 62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1 [ 62.999369] random: crng reseeded on system resumption [ 63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000) [ 63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1 [ 63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 63.000682] Call Trace: [ 63.000686] <TASK> [ 63.000731] dump_stack_lvl+0x93/0xd0 [ 63.000919] __get_user_pages+0x903/0xd30 [ 63.001030] __gup_longterm_locked+0x153e/0x1ba0 [ 63.001041] ? _raw_read_unlock_irqrestore+0x17/0x50 [ 63.001072] ? try_get_folio+0x29c/0x2d0 [ 63.001083] internal_get_user_pages_fast+0x1119/0x1530 [ 63.001109] iov_iter_extract_pages+0x23b/0x580 [ 63.001206] bio_iov_iter_get_pages+0x4de/0x1220 [ 63.001235] iomap_dio_bio_iter+0x9b6/0x1410 [ 63.001297] __iomap_dio_rw+0xab4/0x1810 [ 63.001316] iomap_dio_rw+0x45/0xa0 [ 63.001328] ext4_file_write_iter+0xdde/0x1390 [ 63.001372] vfs_write+0x599/0xbd0 [ 63.001394] ksys_write+0xc8/0x190 [ 63.001403] do_syscall_64+0xd4/0x1b0 [ 63.001421] ? arch_exit_to_user_mode_prepare+0x3a/0x60 [ 63.001479] entry_SYSCALL_64_after_hwframe+0x6f/0x77 [ 63.001535] RIP: 0033:0x7f7fd3ebf539 [ 63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539 [ 63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004 [ 63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000 [ 63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8 ... [ 63.018142] ---[ end trace ]--- Historically, the signed integer overflow sanitizer did not work in the kernel due to its interaction with `-fwrapv` but this has since been changed [1] in the newest version of Clang; It was re-enabled in the kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow sanitizer"). Let's rework this overflow checking logic to not actually perform an overflow during the check itself, thus avoiding the UBSAN splat. [1]: https://github.com/llvm/llvm-project/pull/82432 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-06block: set default max segment size in case of virt_boundaryMing Lei
For devices with virt_boundary limit, the driver may provide zero max segment size, we have to set it as UINT_MAX at default. Otherwise, it may cause warning in driver when handling sglist. Fix it by setting default max segment size as UINT_MAX. Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@kernel.org> Fixes: b561ea56a264 ("block: allow device to have both virt_boundary_mask and max segment size") Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Closes: https://lore.kernel.org/linux-block/7e38b67c-9372-a42d-41eb-abdce33d3372@linux-m68k.org/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240424134722.2584284-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03block: fix and simplify blkdevparts= cmdline parsingINAGAKI Hiroshi
Fix the cmdline parsing of the "blkdevparts=" parameter using strsep(), which makes the code simpler. Before commit 146afeb235cc ("block: use strscpy() to instead of strncpy()"), we used a strncpy() to copy a block device name and partition names. The commit simply replaced a strncpy() and NULL termination with a strscpy(). It did not update calculations of length passed to strscpy(). While the length passed to strncpy() is just a length of valid characters without NULL termination ('\0'), strscpy() takes it as a length of the destination buffer, including a NULL termination. Since the source buffer is not necessarily NULL terminated, the current code copies "length - 1" characters and puts a NULL character in the destination buffer. It replaces the last character with NULL and breaks the parsing. As an example, that buffer will be passed to parse_parts() and breaks parsing sub-partitions due to the missing ')' at the end, like the following. example (Check Point V-80 & OpenWrt): - Linux Kernel 6.6 [ 0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4 ... [ 0.884016] mmc1: new HS200 MMC card at address 0001 [ 0.889951] mmcblk1: mmc1:0001 004GA0 3.69 GiB [ 0.895043] cmdline partition format is invalid. [ 0.895704] mmcblk1: p1 [ 0.903447] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB [ 0.908667] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB [ 0.913765] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0) 1. "48M@10M(kernel-1),..." is passed to strscpy() with length=17 from parse_parts() 2. strscpy() returns -E2BIG and the destination buffer has "48M@10M(kernel-1\0" 3. "48M@10M(kernel-1\0" is passed to parse_subpart() 4. parse_subpart() fails to find ')' when parsing a partition name, and returns error - Linux Kernel 6.1 [ 0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4 ... [ 0.953142] mmc1: new HS200 MMC card at address 0001 [ 0.959114] mmcblk1: mmc1:0001 004GA0 3.69 GiB [ 0.964259] mmcblk1: p1(kernel-1) p2(dtb-1) p3(rootfs-1) p4(kernel-2) p5(dtb-2) 6(rootfs-2) p7(default_sw) p8(logs) p9(preset_cfg) p10(adsl) p11(storage) [ 0.979174] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB [ 0.984674] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB [ 0.989926] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0 By the way, strscpy() takes a length of destination buffer and it is often confusing when copying characters with a specified length. Using strsep() helps to separate the string by the specified character. Then, we can use strscpy() naturally with the size of the destination buffer. Separating the string on the fly is also useful to omit the redundant string copy, reducing memory usage and improve the code readability. Fixes: 146afeb235cc ("block: use strscpy() to instead of strncpy()") Suggested-by: Naohiro Aota <naota@elisp.net> Signed-off-by: INAGAKI Hiroshi <musashino.open@gmail.com> Reviewed-by: Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/20240421074005.565-1-musashino.open@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03block: refine the EOF check in blkdev_iomap_beginChristoph Hellwig
blkdev_iomap_begin rounds down the offset to the logical block size before stashing it in iomap->offset and checking that it still is inside the inode size. Check the i_size check to the raw pos value so that we don't try a zero size write if iter->pos is unaligned. Fixes: 487c607df790 ("block: use iomap for writes to block devices") Reported-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20240503081042.2078062-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03block: add a partscan sysfs attribute for disksChristoph Hellwig
Userspace had been unknowingly relying on a non-stable interface of kernel internals to determine if partition scanning is enabled for a given disk. Provide a stable interface for this purpose instead. Cc: stable@vger.kernel.org # 6.3+ Depends-on: 140ce28dd3be ("block: add a disk_has_partscan helper") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-block/ZhQJf8mzq_wipkBH@gardel-login/ Link: https://lore.kernel.org/r/20240502130033.1958492-3-hch@lst.de [axboe: add links and commit message from Keith] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03block: add a disk_has_partscan helperChristoph Hellwig
Add a helper to check if partition scanning is enabled instead of open coding the check in a few places. This now always checks for the hidden flag even if all but one of the callers are never reachable for hidden gendisks. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240502130033.1958492-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03RIP ->bd_inodeAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03block/bdev.c: use the knowledge of inode/bdev coallocationAl Viro
Here we know that bdevfs inodes are coallocated with struct block_device and we can get to ->bd_inode value without any dereferencing. Introduce an inlined helper (static, *not* exported, purely internal for bdev.c) that gets an associated inode by block_device - BD_INODE(bdev). NOTE: leave it static; nobody outside of block/bdev.c has any business playing with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-6-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03use ->bd_mapping instead of ->bd_inode->i_mappingAl Viro
Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03block_device: add a pointer to struct address_space (page cache of bdev)Al Viro
points to ->i_data of coallocated inode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-1-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03missing helpers: bdev_unhash(), bdev_drop()Al Viro
bdev_unhash(): make block device invisible to lookups by device number bdev_drop(): drop reference to associated inode. Both are internal, for use by genhd and partition-related code - similar to bdev_add(). The logics in there (especially the lifetime-related parts of it) ought to be cleaned up, but that's a separate story; here we just encapsulate getting to associated inode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03block: move two helpers into bdev.cYu Kuai
disk_live() and block_size() access bd_inode directly, prepare to remove the field bd_inode from block_device, and only access bd_inode in block layer. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-8-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03blkdev_write_iter(): saner way to get inode and bdevAl Viro
... same as in other methods - bdev_file_inode() and I_BDEV() of that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-5-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-02bdev: move ->bd_make_it_fail to ->__bd_flagsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02bdev: move ->bd_ro_warned to ->__bd_flagsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02bdev: move ->bd_has_subit_bio to ->__bd_flagsAl Viro
In bdev_alloc() we have all flags initialized to false, so assignment to ->bh_has_submit_bio n there is a no-op unless we have partno != 0 and flag already set on entire device. In device_add_disk() we have just allocated the block_device in question and it had been a full-device one, so the flag is guaranteed to be still clear when we get to assignment. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02bdev: move ->bd_write_holder into ->__bd_flagsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02bdev: move ->bd_read_only to ->__bd_flagsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02bdev: infrastructure for flagsAl Viro
Replace bd_partno with a 32bit field (__bd_flags). The lower 8 bits contain the partition number, the upper 24 are for flags. Helpers: bdev_{test,set,clear}_flag(bdev, flag), with atomic_or() and atomic_andnot() used to set/clear. NOTE: this commit does not actually move any flags over there - they are still bool fields. As the result, it shifts the fields wrt cacheline boundaries; that's going to be restored once the first 3 flags are dealt with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02wrapper for access to ->bd_partnoAl Viro
On the next step it's going to get folded into a field where flags will go. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02Use bdev_is_paritition() instead of open-coding itAl Viro
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02make set_blocksize() fail unless block device is opened exclusiveAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02set_blocksize(): switch to passing struct file *Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-01block: Cleanup blk_revalidate_zone_cb()Damien Le Moal
Define the code for checking conventional and sequential write required zones suing the functions blk_revalidate_conv_zone() and blk_revalidate_seq_zone() respectively. This simplifies the zone type switch-case in blk_revalidate_zone_cb(). No functional changes. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240501110907.96950-15-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Simplify zone write plug BIO abortDamien Le Moal
When BIOs plugged in a zone write plug are aborted, blk_zone_wplug_bio_io_error() clears the BIO BIO_ZONE_WRITE_PLUGGING flag so that bio_io_error(bio) does not end up calling blk_zone_write_plug_bio_endio() and we thus need to manually drop the reference on the zone write plug held by the aborted BIO. Move the call to disk_put_zone_wplug() that is alwasy following the call to blk_zone_wplug_bio_io_error() inside that function to simplify the code. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-14-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Simplify blk_zone_write_plug_bio_endio()Damien Le Moal
We already have the disk variable obtained from the bio when calling disk_get_zone_wplug(). So use that variable instead of dereferencing the bio bdev again for the disk argument of disk_get_zone_wplug(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-13-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Improve zone write request completion handlingDamien Le Moal
blk_zone_complete_request() must be called to handle the completion of a zone write request handled with zone write plugging. This function is called from blk_complete_request(), blk_update_request() and also in blk_mq_submit_bio() error path. Improve this by moving this function call into blk_mq_finish_request() as all requests are processed with this function when they complete as well as when they are freed without being executed. This also improves blk_update_request() used by scsi devices as these may repeatedly call this function to handle partial completions. To be consistent with this change, blk_zone_complete_request() is renamed to blk_zone_finish_request() and blk_zone_write_plug_complete_request() is renamed to blk_zone_write_plug_finish_request(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-12-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01block: Improve blk_zone_write_plug_bio_merged()Damien Le Moal
Improve blk_zone_write_plug_bio_merged() to check that we succefully get a reference on the zone write plug of the merged BIO, as expected since for a merge we already have at least one request and one BIO referencing the zone write plug. Comments in this function are also improved to better explain the references to the BIO zone write plug. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-11-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>