summaryrefslogtreecommitdiff
path: root/drivers/block/ublk_drv.c
AgeCommit message (Collapse)Author
2024-09-27[tree-wide] finally take no_llseek outAl Viro
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-16Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - MD changes via Song: - md-bitmap refactoring (Yu Kuai) - raid5 performance optimization (Artur Paszkiewicz) - Other small fixes (Yu Kuai, Chen Ni) - Add a sysfs entry 'new_level' (Xiao Ni) - Improve information reported in /proc/mdstat (Mateusz Kusiak) - NVMe changes via Keith: - Asynchronous namespace scanning (Stuart) - TCP TLS updates (Hannes) - RDMA queue controller validation (Niklas) - Align field names to the spec (Anuj) - Metadata support validation (Puranjay) - A syntax cleanup (Shen) - Fix a Kconfig linking error (Arnd) - New queue-depth quirk (Keith) - Add missing unplug trace event (Keith) - blk-iocost fixes (Colin, Konstantin) - t10-pi modular removal and fixes (Alexey) - Fix for potential BLKSECDISCARD overflow (Alexey) - bio splitting cleanups and fixes (Christoph) - Deal with folios rather than rather than pages, speeding up how the block layer handles bigger IOs (Kundan) - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike) - Reduce zoned device overhead in ublk (Ming) - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir) - Fix regression in partition error pointer checking (Riyan) - Add support for write zeroes and rotational status in nbd (Wouter) - Add Yu Kuai as new BFQ maintainer. The scheduler has been unmaintained for quite a while. - Various sets of fixes for BFQ (Yu Kuai) - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail, Yang) * tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits) nvme-pci: qdepth 1 quirk block: fix potential invalid pointer dereference in blk_add_partition blk_iocost: make read-only static array vrate_adj_pct const block: unpin user pages belonging to a folio at once mm: release number of pages of a folio block: introduce folio awareness and add a bigger size from folio block: Added folio-ized version of bio_add_hw_page() block, bfq: factor out a helper to split bfqq in bfq_init_rq() block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq() block, bfq: remove local variable 'split' in bfq_init_rq() block, bfq: remove bfq_log_bfqg() block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator() block, bfq: fix procress reference leakage for bfqq in merge chain block, bfq: fix uaf for accessing waker_bfqq after splitting blk-throttle: support prioritized processing of metadata blk-throttle: remove last_low_overflow_time drbd: Add NULL check for net_conf to prevent dereference in state validation nvme-tcp: fix link failure for TCP auth blk-mq: add missing unplug trace event mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init() ...
2024-09-04ublk_drv: fix NULL pointer dereference in ublk_ctrl_start_recovery()Li Nan
When two UBLK_CMD_START_USER_RECOVERY commands are submitted, the first one sets 'ubq->ubq_daemon' to NULL, and the second one triggers WARN in ublk_queue_reinit() and subsequently a NULL pointer dereference issue. Fix it by adding the check in ublk_ctrl_start_recovery() and return immediately in case of zero 'ub->nr_queues_ready'. BUG: kernel NULL pointer dereference, address: 0000000000000028 RIP: 0010:ublk_ctrl_start_recovery.constprop.0+0x82/0x180 Call Trace: <TASK> ? __die+0x20/0x70 ? page_fault_oops+0x75/0x170 ? exc_page_fault+0x64/0x140 ? asm_exc_page_fault+0x22/0x30 ? ublk_ctrl_start_recovery.constprop.0+0x82/0x180 ublk_ctrl_uring_cmd+0x4f7/0x6c0 ? pick_next_task_idle+0x26/0x40 io_uring_cmd+0x9a/0x1b0 io_issue_sqe+0x193/0x3f0 io_wq_submit_work+0x9b/0x390 io_worker_handle_work+0x165/0x360 io_wq_worker+0xcb/0x2f0 ? finish_task_switch.isra.0+0x203/0x290 ? finish_task_switch.isra.0+0x203/0x290 ? __pfx_io_wq_worker+0x10/0x10 ret_from_fork+0x2d/0x50 ? __pfx_io_wq_worker+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Fixes: c732a852b419 ("ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support") Reported-and-tested-by: Changhui Zhong <czhong@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+UvLiS+bhNXV-h2icwX1dyybbYHeQUuH7RYqUvMQf6N3w@mail.gmail.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20240904031348.4139545-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-12ublk: move zone report data out of request pduMing Lei
ublk zoned takes 16 bytes in each request pdu just for handling REPORT_ZONE operation, this way does waste memory since request pdu is allocated statically. Store the transient zone report data into one global xarray, and remove it after the report zone request is completed. This way is reasonable since report zone is run in slow code path. Fixes: 29802d7ca33b ("ublk: enable zoned storage support") Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Andreas Hindborg <a.hindborg@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240812013624.587587-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24ublk: fix UBLK_CMD_DEL_DEV_ASYNC handlingMing Lei
In ublk_ctrl_uring_cmd(), ioctl command NR should be used for matching _IOC_NR(cmd_op). Fix it by adding one private macro, and this way is clean. Fixes: 13fe8e6825e4 ("ublk: add UBLK_CMD_DEL_DEV_ASYNC") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240724143311.2646330-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10ublk_drv: add missing MODULE_DESCRIPTION() macroJeff Johnson
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/ublk_drv.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240602-md-block-ublk_drv-v1-1-995474cafff0@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05block: Remove REQ_OP_ZONE_RESET_ALL emulationDamien Le Moal
Now that device mapper can handle resetting all zones of a mapped zoned device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers support this operation. With this, the request queue feature BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in blk-zone.c can be removed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the zone_resetall flag to queue_limitsChristoph Hellwig
Move the zone_resetall flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-24-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the zoned flag into the features fieldChristoph Hellwig
Move the zoned flags into the features field to reclaim a little bit of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-23-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the nonrot flag to queue_limitsChristoph Hellwig
Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface. For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd). The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move cache control settings out of queue->flagsChristoph Hellwig
Move the cache control settings into the queue_limits so that the flags can be set atomically with the device queue frozen. Add new features and flags field for the driver set flags, and internal (usually sysfs-controlled) flags in the block layer. Note that we'll eventually remove enough field from queue_limits to bring it back to the previous size. The disable flag is inverted compared to the previous meaning, which means it now survives a rescan, similar to the max_sectors and max_discard_sectors user limits. The FLUSH and FUA flags are now inherited by blk_stack_limits, which simplified the code in dm a lot, but also causes a slight behavior change in that dm-switch and dm-unstripe now advertise a write cache despite setting num_flush_bios to 0. The I/O path will handle this gracefully, but as far as I can tell the lack of num_flush_bios and thus flush support is a pre-existing data integrity bug in those targets that really needs fixing, after which a non-zero num_flush_bios should be required in dm for targets that map to underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13ublk_drv: set DMA alignment mask to 3Jens Axboe
By default, this will be 511, as that's the block layer default. But drivers these days can support memory alignments that aren't tied to the sector sizes, instead just being limited by what the DMA engine supports. An example is NVMe, where it's generally set to a 32-bit or 64-bit boundary. As ublk itself doesn't really care, just set it low enough that we don't run into issues with NVMe where the required O_DIRECT memory alignment is now more restrictive on ublk than it is on the underlying device. This was triggered by spurious -EINVAL returns on O_DIRECT IO on a setup with ublk managing NVMe devices, which previously worked just fine on the NVMe device itself. With the alignment relaxed, the test works fine. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...
2024-04-30ublk: remove segment count and size limitsUday Shankar
ublk_drv currently creates block devices with the default max_segments and max_segment_size limits of BLK_MAX_SEGMENTS (128) and BLK_MAX_SEGMENT_SIZE (65536) respectively. These defaults can artificially constrain the I/O size seen by the ublk server - for example, suppose that the ublk server has configured itself to accept I/Os up to 1M and the application is also issuing 1M sized I/Os. If the I/O buffer used by the application is backed by 4K pages, the buffer could consist of up to 1M / 4K = 256 physically discontiguous segments (even if the buffer is virtually contiguous). As such, the I/O could exceed the default max_segments limit and get split. This can cause unnecessary performance issues if the ublk server is optimized to handle 1M I/Os. The block layer's segment count/size limits exist to model hardware constraints which don't exist in ublk_drv's case, so just remove those limits for the block devices created by ublk_drv. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Riley Thomasson <riley@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240430211623.2802036-1-ushankar@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17block: Simplify blk_revalidate_disk_zones() interfaceDamien Le Moal
The only user of blk_revalidate_disk_zones() second argument was the SCSI disk driver (sd). Now that this driver does not require this update_driver_data argument, remove it to simplify the interface of blk_revalidate_disk_zones(). Also update the function kdoc comment to be more accurate (i.e. there is no gendisk ->revalidate method). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator featureDamien Le Moal
With zone write plugging enabled at the block layer level, any zone device can only ever see at most a single write operation per zone. There is thus no need to request a block scheduler with strick per-zone sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE feature. Removing this allows using a zoned ublk device with any scheduler, including "none". Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-15-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-28ublk: add UBLK_CMD_DEL_DEV_ASYNCMing Lei
The current command UBLK_CMD_DEL_DEV won't return until the device is released, this way looks more reliable, but makes userspace more difficult to implement, especially about orders: unmap command buffer(which holds one ublkc reference), ublkc close, io_uring_file_unregister, ublkb close. Add UBLK_CMD_DEL_DEV_ASYNC so that device deletion won't wait release, then userspace needn't worry about the above order. Actually both loop and nbd is deleted in this async way. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240223075539.89945-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-28ublk: improve getting & putting ublk deviceMing Lei
Firstly convert get_device() and put_device() into ublk_get_device() and ublk_put_device(). Secondly annotate ublk_get_device() & ublk_put_device() as noinline for trace, especially it is often to trigger device deletion hang when incorrect order is used on ublkc mmap, ublkc close, io_uring_sqe_unregister_file, ublkb close. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240223075539.89945-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19ublk: pass queue_limits to blk_mq_alloc_diskChristoph Hellwig
Pass the limits ublk imposes directly to blk_mq_alloc_disk instead of setting them one at a time. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240215070300.2200308-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13block: pass a queue_limits argument to blk_mq_alloc_diskChristoph Hellwig
Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-11Merge tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "Mostly just come fixes and cleanups, but one feature as well. In detail: - Harden the check for handling IOPOLL based on return (Pavel) - Various minor optimizations (Pavel) - Drop remnants of SCM_RIGHTS fd passing support, now that it's no longer supported since 6.7 (me) - Fix for a case where bytes_done wasn't initialized properly on a failure condition for read/write requests (me) - Move the register related code to a separate file (me) - Add support for returning the provided ring buffer head (me) - Add support for adding a direct descriptor to the normal file table (me, Christian Brauner) - Fix for ensuring pending task_work for a ring with DEFER_TASKRUN is run even if we timeout waiting (me)" * tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux: io_uring: ensure local task_work is run on wait timeout io_uring/kbuf: add method for returning provided buffer ring head io_uring/rw: ensure io->bytes_done is always initialized io_uring: drop any code related to SCM_RIGHTS io_uring/unix: drop usage of io_uring socket io_uring/register: move io_uring_register(2) related code to register.c io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL io_uring/cmd: inline io_uring_cmd_get_task io_uring/cmd: inline io_uring_cmd_do_in_task_lazy io_uring: split out cmd api into a separate header io_uring: optimise ltimeout for inline execution io_uring: don't check iopoll if request completes
2024-01-11Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: "Pretty quiet round this time around. This contains: - NVMe updates via Keith: - nvme fabrics spec updates (Guixin, Max) - nvme target udpates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith) - MD updates via Song: - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai) - Fix raid5 hang issue (Junxiao Bi) - Add Yu Kuai as Reviewer of the md subsystem - Remove deprecated flavors (Song Liu) - raid1 read error check support (Li Nan) - Better handle events off-by-1 case (Alex Lyakas) - Efficiency improvements for passthrough (Kundan) - Support for mapping integrity data directly (Keith) - Zoned write fix (Damien) - rnbd fixes (Kees, Santosh, Supriti) - Default to a sane discard size granularity (Christoph) - Make the default max transfer size naming less confusing (Christoph) - Remove support for deprecated host aware zoned model (Christoph) - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel, Bart, Christoph)" * tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits) block: Treat sequential write preferred zone type as invalid block: remove disk_clear_zoned sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics drivers/block/xen-blkback/common.h: Fix spelling typo in comment blk-cgroup: fix rcu lockdep warning in blkg_lookup() blk-cgroup: don't use removal safe list iterators block: floor the discard granularity to the physical block size mtd_blkdevs: use the default discard granularity bcache: use the default discard granularity zram: use the default discard granularity null_blk: use the default discard granularity nbd: use the default discard granularity ubd: use the default discard granularity block: default the discard granularity to sector size bcache: discard_granularity should not be smaller than a sector block: remove two comments in bio_split_discard block: rename and document BLK_DEF_MAX_SECTORS loop: don't abuse BLK_DEF_MAX_SECTORS aoe: don't abuse BLK_DEF_MAX_SECTORS null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS ...
2023-12-19block: simplify disk_set_zonedChristoph Hellwig
Only use disk_set_zoned to actually enable zoned device support. For clearing it, call disk_clear_zoned, which is renamed from disk_clear_zone_settings and now directly clears the zoned flag as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19block: remove support for the host aware zone modelChristoph Hellwig
When zones were first added the SCSI and ATA specs, two different models were supported (in addition to the drive managed one that is invisible to the host): - host managed where non-conventional zones there is strict requirement to write at the write pointer, or else an error is returned - host aware where a write point is maintained if writes always happen at it, otherwise it is left in an under-defined state and the sequential write preferred zones behave like conventional zones (probably very badly performing ones, though) Not surprisingly this lukewarm model didn't prove to be very useful and was finally removed from the ZBC and SBC specs (NVMe never implemented it). Due to to the easily disappearing write pointer host software could never rely on the write pointer to actually be useful for say recovery. Fortunately only a few HDD prototypes shipped using this model which never made it to mass production. Drop the support before it is too late. Note that any such host aware prototype HDD can still be used with Linux as we'll now treat it as a conventional HDD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12io_uring: split out cmd api into a separate headerPavel Begunkov
linux/io_uring.h is slowly becoming a rubbish bin where we put anything exposed to other subsystems. For instance, the task exit hooks and io_uring cmd infra are completely orthogonal and don't need each other's definitions. Start cleaning it up by splitting out all command bits into a new header file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7ec50bae6e21f371d3850796e716917fc141225a.1701391955.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-05iov_iter: replace import_single_range() with import_ubuf()Jens Axboe
With the removal of the 'iov' argument to import_single_range(), the two functions are now fully identical. Convert the import_single_range() callers to import_ubuf(), and remove the former fully. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-05iov_iter: remove unused 'iov' argument from import_single_range()Jens Axboe
It is entirely unused, just get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-2-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19block: ublk_drv: Remove unused functionJiapeng Chong
The function are defined in the ublk_drv.c file, but not called elsewhere, so delete the unused function. drivers/block/ublk_drv.c:1211:20: warning: unused function 'ublk_abort_io_cmds'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6938 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Fixes: b4e1353f4651 ("ublk: simplify aborting request") Reviewed-by: Ming Lei <ming.lei@rehdat.com> Link: https://lore.kernel.org/r/20231019030444.53680-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: simplify aborting requestMing Lei
Now ublk_abort_queue() is run exclusively with ublk_queue_rq() and the ubq_daemon task, so simplify aborting request: - set UBLK_IO_FLAG_ABORTED in ublk_abort_queue() just for aborting this request - abort request in ublk_queue_rq() if ubq->canceling is set Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231009093324.957829-8-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: replace monitor with cancelable uring_cmdMing Lei
Monitor work actually introduces one extra context for handling abort, this way is easy to cause race, and also introduce extra delay when handling aborting. Now we start to support cancelable uring_cmd, so use it instead: 1) this cancel callback is either run from the uring cmd submission task context or called after the io_uring context is exit, so the callback is run exclusively with ublk_ch_uring_cmd() and __ublk_rq_task_work(). 2) the previous patch freezes request queue when calling ublk_abort_queue(), which is now completely exclusive with ublk_queue_rq() and ublk_ch_uring_cmd()/__ublk_rq_task_work(). 3) in timeout handler, if all IOs are in-flight, then all uring commands are completed, uring command canceling can't help us to provide forward progress any more, so call ublk_abort_requests() in timeout handler. This way simplifies aborting queue, and is helpful for adding new feature, such as, relax the limit of using single task for handling one queue. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231009093324.957829-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: quiesce request queue when aborting queueMing Lei
So far aborting queue ends request when the ubq daemon is exiting, and it can be run concurrently with ublk_queue_rq(), this way is fragile and we depend on the tricky usage of UBLK_IO_FLAG_ABORTED for avoiding such race. Quiesce queue when aborting queue, and the two code paths can be run completely exclusively, then it becomes easier to add new ublk feature, such as relaxing single same task limit for each queue. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231009093324.957829-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: rename mm_lock as lockMing Lei
Rename mm_lock field of ublk_device as lock, so that this lock can be reused for protecting access of ub->ub_disk, which will be used for simplifying ublk_abort_queue() by quiesce queue in next patch. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231009093324.957829-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: move ublk_cancel_dev() out of ub->mutexMing Lei
ublk_cancel_dev() just calls ublk_cancel_queue() to cancel all pending io commands after ublk request queue is idle. The only protection is just the read & write of ubq->nr_io_ready and avoid duplicated command cancel, so add one per-queue lock with cancel flag for providing this protection, meantime move ublk_cancel_dev() out of ub->mutex. Then we needn't to call io_uring_cmd_complete_in_task() to cancel pending command. And the same cancel logic will be re-used for cancelable uring command. This patch basically reverts commit ac5902f84bb5 ("ublk: fix AB-BA lockdep warning"). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231009093324.957829-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: make sure io cmd handled in submitter task contextMing Lei
In well-done ublk server implementation, ublk io command won't be linked into any link chain. Meantime they are always handled in no-wait style, so basically io cmd is always handled in submitter task context. However, the server may set IOSQE_ASYNC, or io command is linked to one chain mistakenly, then we may still run into io-wq context and ctx->uring_lock isn't held. So in case of IO_URING_F_UNLOCKED, schedule this command by io_uring_cmd_complete_in_task to force running it in submitter task. Then ublk_ch_uring_cmd_local() is guaranteed to run with context uring_lock held, and we needn't to worry about sync among submission code path any more. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231009093324.957829-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: don't get ublk device reference in ublk_abort_queue()Ming Lei
ublk_abort_queue() is called in ublk_daemon_monitor_work(), in which it is guaranteed that the device is live because monitor work is canceled when removing device, so no need to get the device reference. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231009093324.957829-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: Make ublks_max configurableMike Christie
We are converting tcmu applications to ublk, but have systems with up to 1k devices. This patch allows us to configure the ublks_max from userspace with the ublks_max modparam. Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231012150600.6198-3-michael.christie@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17ublk: Limit dev_id/ub_number valuesMike Christie
The dev_id/ub_number is used for the ublk dev's char device's minor number so it has to fit into MINORMASK. This patch adds checks to prevent userspace from passing a number that's too large and limits what can be allocated by the ublk_index_idr for the case where userspace has the kernel allocate the dev_id/ub_number. Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231012150600.6198-2-michael.christie@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-29Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
2023-08-20ublk: zoned: support REQ_OP_ZONE_RESET_ALLMing Lei
There isn't any reason to not support REQ_OP_ZONE_RESET_ALL given everything is actually handled in userspace, not mention it is pretty easy to support RESET_ALL. So enable REQ_OP_ZONE_RESET_ALL and let userspace handle it. Verified by 'tools/zbc_reset_zone -all /dev/ublkb0' in libzbc[1] with libublk-rs based ublk-zoned target prototype[2], follows command line for creating ublk-zoned: cargo run --example zoned -- add -1 1024 # add $dev_id $DEV_SIZE [1] https://github.com/westerndigitalcorporation/libzbc [2] https://github.com/ming1/libublk-rs/tree/zoned.v2 Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Andreas Hindborg <a.hindborg@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230810124326.321472-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-15ublk: Switch to memdup_user_nul() helperRuan Jinjie
Use memdup_user_nul() helper instead of open-coding to simplify the code. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230815114815.1551171-1-ruanjinjie@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11ublk: fix 'warn: variable dereferenced before check 'req'' from SmatchMing Lei
The added check of 'req_op(req) == REQ_OP_ZONE_APPEND' should have been done after the request is confirmed as valid. Actually here, the request should always been true, so add one WARN_ON_ONCE(!req), meantime move the zone_append check after checking the request. Cc: Andreas Hindborg <a.hindborg@samsung.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 29802d7ca33b ("ublk: enable zoned storage support") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230811135216.420404-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10ublk: Fix signedness bug returning warningLi Zetao
There are two warnings reported by smatch: drivers/block/ublk_drv.c:445 ublk_setup_iod_zoned() warn: signedness bug returning '(-95)' drivers/block/ublk_drv.c:963 ublk_setup_iod() warn: signedness bug returning '(-5)' The type of "blk_status_t" is either be a u32 or u8, but this two functions return a negative value when not supported or failed. Use the error code of the blk module to fix these warnings. Fixes: 29802d7ca33b ("ublk: enable zoned storage support") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202308100201.TCRhgdvN-lkp@intel.com/ Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230810084836.3535322-1-lizetao1@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08ublk: enable zoned storage supportAndreas Hindborg
Add zoned storage support to ublk: report_zones and operations: - REQ_OP_ZONE_OPEN - REQ_OP_ZONE_CLOSE - REQ_OP_ZONE_FINISH - REQ_OP_ZONE_RESET - REQ_OP_ZONE_APPEND The zone append feature uses the `addr` field of `struct ublksrv_io_cmd` to communicate ALBA back to the kernel. Therefore ublk must be used with the user copy feature (UBLK_F_USER_COPY) for zoned storage support to be available. Without this feature, ublk will not allow zoned storage support. Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230804114610.179530-4-nmi@metaspace.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08ublk: move check for empty address field on command submissionAndreas Hindborg
In preparation for zoned storage support, move the check for empty `addr` field into the command handler case statement. Note that the check makes no sense for `UBLK_IO_NEED_GET_DATA` because the `addr` field must always be set for this command. Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230804114610.179530-3-nmi@metaspace.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08ublk: add helper to check if device supports user copyAndreas Hindborg
This will be used by ublk zoned storage support. Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230804114610.179530-2-nmi@metaspace.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27ublk: return -EINTR if breaking from waiting for existed users in DEL_DEVMing Lei
If user interrupts wait_event_interruptible() in ublk_ctrl_del_dev(), return -EINTR and let user know what happens. Fixes: 0abe39dec065 ("block: ublk: improve handling device deletion") Reported-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20230726144502.566785-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27ublk: fail to recover device if queue setup is interruptedMing Lei
In ublk_ctrl_end_recovery(), if wait_for_completion_interruptible() is interrupted by signal, queues aren't setup successfully yet, so we have to fail UBLK_CMD_END_USER_RECOVERY, otherwise kernel oops can be triggered. Fixes: c732a852b419 ("ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support") Reported-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20230726144502.566785-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27ublk: fail to start device if queue setup is interruptedMing Lei
In ublk_ctrl_start_dev(), if wait_for_completion_interruptible() is interrupted by signal, queues aren't setup successfully yet, so we have to fail UBLK_CMD_START_DEV, otherwise kernel oops can be triggered. Reported by German when working on qemu-storage-deamon which requires single thread ublk daemon. Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Reported-by: German Maglione <gmaglione@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230726144502.566785-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-26Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ...
2023-06-21ublk: make ublk_chr_class a static const structureIvan Orlov
Now that the driver core allows for struct class to be in read-only memory, move the ublk_chr_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Ming Lei <ming.lei@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20230620180129.645646-7-gregkh@linuxfoundation.org Signed-off-by: Jens Axboe <axboe@kernel.dk>