summaryrefslogtreecommitdiff
path: root/block/blk-mq.h
AgeCommit message (Collapse)Author
2023-06-12blk-mq: fix potential io hang by wrong 'wake_batch'Yu Kuai
In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating 'wake_batch' is not atomic: t1: t2: _blk_mq_tag_busy blk_mq_tag_busy inc active_queues // assume 1->2 inc active_queues // 2 -> 3 blk_mq_update_wake_batch // calculate based on 3 blk_mq_update_wake_batch /* calculate based on 2, while active_queues is actually 3. */ Fix this problem by protecting them wih 'tags->lock', this is not a hot path, so performance should not be concerned. And now that all writers are inside the lock, switch 'actives_queues' from atomic to unsigned int. Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19blk-mq: don't use the requeue list to queue flush commandsChristoph Hellwig
Currently both requeues of commands that were already sent to the driver and flush commands submitted from the flush state machine share the same requeue_list struct request_queue, despite requeues doing head insertions and flushes not. Switch to using two separate lists instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19blk-mq: defer to the normal submission path for non-flush flush commandsChristoph Hellwig
If blk_insert_flush decides that a command does not need to use the flush state machine, return false and let blk_mq_submit_bio handle it the normal way (including using an I/O scheduler) instead of doing a bypass insert. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18blk-mq: make sure elevator callbacks aren't called for passthrough requestChristoph Hellwig
In case of q->elevator, passthrough request can still be marked as RQF_ELV, so some elevator callbacks will be called for them. Fix this by splitting RQF_SCHED_TAGS, which is set for all requests that are issued on a queue that uses an I/O scheduler, and RQF_USE_SCHED for non-flush, non-passthrough requests on such a queue. Roughly based on two different patches from Ming Lei <ming.lei@redhat.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230518053101.760632-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-26Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ...
2023-04-13blk-mq: pass a flags argument to blk_mq_add_to_requeue_listChristoph Hellwig
Replace the boolean at_head argument with the same flags that are already passed to blk_mq_insert_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: pass a flags argument to blk_mq_request_bypass_insertChristoph Hellwig
Replace the boolean at_head argument with the same flags that are already passed to blk_mq_insert_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: pass a flags argument to blk_mq_insert_requestChristoph Hellwig
Replace the at_head bool with a flags argument that so far only contains a single BLK_MQ_INSERT_AT_HEAD value. This makes it much easier to grep for head insertions into the blk-mq dispatch queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: don't kick the requeue_list in blk_mq_add_to_requeue_listChristoph Hellwig
blk_mq_add_to_requeue_list takes a bool parameter to control how to kick the requeue list at the end of the function. Move the call to blk_mq_kick_requeue_list to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: don't run the hw_queue from blk_mq_request_bypass_insertChristoph Hellwig
blk_mq_request_bypass_insert takes a bool parameter to control how to run the queue at the end of the function. Move the blk_mq_run_hw_queue call to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: fold __blk_mq_insert_request into blk_mq_insert_requestChristoph Hellwig
There is no good point in keeping the __blk_mq_insert_request around for two function calls and a singler caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: fold blk_mq_sched_insert_requests into blk_mq_dispatch_plug_listChristoph Hellwig
blk_mq_dispatch_plug_list is the only caller of blk_mq_sched_insert_requests, and it makes sense to just fold it there as blk_mq_sched_insert_requests isn't specific to I/O schedulers despite the name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: move more logic into blk_mq_insert_requestsChristoph Hellwig
Move all logic related to the direct insert (including the call to blk_mq_run_hw_queue) into blk_mq_insert_requests to streamline the code flow up a bit, and to allow marking blk_mq_try_issue_list_directly static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: include <linux/blk-mq.h> in block/blk-mq.hChristoph Hellwig
block/blk-mq.h needs various definitions from <linux/blk-mq.h>, include it there instead of relying on the source files to include both. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: remove blk-mq-tag.hChristoph Hellwig
blk-mq-tag.h is always included by blk-mq.h, and causes recursive inclusion hell with further changes. Just merge it into blk-mq.h instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-14blk-mq: fix "bad unlock balance detected" on q->srcu in ↵Chris Leech
__blk_mq_run_dispatch_ops The 'q' parameter of the macro __blk_mq_run_dispatch_ops may not be one local variable, such as, it is rq->q, then request queue pointed by this variable could be changed to another queue in case of BLK_MQ_F_TAG_QUEUE_SHARED after 'dispatch_ops' returns, then 'bad unlock balance' is triggered. Fixes the issue by adding one local variable for doing srcu lock/unlock. Fixes: 2a904d00855f ("blk-mq: remove hctx_lock and hctx_unlock") Cc: Marco Patalano <mpatalan@redhat.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230310010913.1014789-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02blk-mq: move the srcu_struct used for quiescing to the tagsetChristoph Hellwig
All I/O submissions have fairly similar latencies, and a tagset-wide quiesce is a fairly common operation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-12-hch@lst.de [axboe: fix whitespace] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29block: adapt blk_mq_plug() to not plug for writes that require a zone lockPankaj Raghav
The current implementation of blk_mq_plug() disables plugging for all operations that involves a transfer to the device as we just check if the last bit in op_is_write() function. Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and REQ_OP_WRITE_ZEROS as they might require a zone lock. Suggested-by: Christoph Hellwig <hch@lst.de> Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220929074745.103073-2-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block: Use the new blk_opf_t typeBart Van Assche
Use the new blk_opf_t type for arguments and variables that represent request flags or a bitwise combination of a request operation and request flags. Rename the function arguments and also a structure member that hold a request operation and flags from 'rw' into 'opf'. This patch does not change any functionality. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06block: simplify blk_mq_plugChristoph Hellwig
Drop the unused q argument, and invert the check to move the exception into a branch and the regular path as the normal return. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06block: use bdev_is_zoned instead of open coding itChristoph Hellwig
Use bdev_is_zoned in all places where a block_device is available instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28blk-mq: cleanup disk sysfs registrationChristoph Hellwig
Pass a gendisk to the sysfs register/unregister functions and give them descriptive names. Also move the unregistration helper next to the one doing the registration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28blk-mq: rename blk_mq_sysfs_{,un}registerChristoph Hellwig
Add a _hctx postfix to better describe what the functions do, match the debugfs equivalents and release the old names for functions that should be using them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27block: Rename a blk_mq_map_queue() argumentBart Van Assche
Before the introduction of blk_mq_get_hctx_type(), blk_mq_map_queue() only used the flags from its second argument. Since the introduction of blk_mq_get_hctx_type(), blk_mq_map_queue() uses both the operation and the flags encoded in that argument. Rename the second argument of blk_mq_map_queue() to make this clear. Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220615225549.1054905-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: manage hctx map via xarrayMing Lei
First code becomes more clean by switching to xarray from plain array. Second use-after-free on q->queue_hw_ctx can be fixed because queue_for_each_hw_ctx() may be run when updating nr_hw_queues is in-progress. With this patch, q->hctx_table is defined as xarray, and this structure will share same lifetime with request queue, so queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably. Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com [axboe: fix blk_mq_hw_ctx forward declaration] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-06blk-mq: don't run might_sleep() if the operation needn't blockingMing Lei
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue won't sleep, so don't run might_sleep() for it. Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03blk-mq: pass request queue to blk_mq_run_dispatch_opsMing Lei
We have switched to allocate srcu into request queue, so it is fine to pass request queue to blk_mq_run_dispatch_ops(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03blk-mq: move srcu from blk_mq_hw_ctx to request_queueMing Lei
In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect dispatch critical area. However, this srcu instance stays at the end of hctx, and it often takes standalone cacheline, often cold. Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on the indirect percpu variable which is allocated from heap instead of being embedded, srcu->srcu_idx is read only in srcu_read_lock(). It doesn't matter if srcu structure stays in hctx or request queue. So switch to per-request-queue srcu for protecting dispatch, and this way simplifies quiesce a lot, not mention quiesce is always done on the request queue wide. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03blk-mq: remove hctx_lock and hctx_unlockMing Lei
Remove hctx_lock and hctx_unlock, and add one helper of blk_mq_run_dispatch_ops() to run code block defined in dispatch_ops with rcu/srcu read held. Compared with hctx_lock()/hctx_unlock(): 1) remove 2 branch to 1, so we just need to check (hctx->flags & BLK_MQ_F_BLOCKING) once when running one dispatch_ops 2) srcu_idx needn't to be touched in case of non-blocking 3) might_sleep_if() can be moved to the blocking branch Also put the added blk_mq_run_dispatch_ops() in private header, so that the following patch can use it out of blk-mq.c. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29block: move request based cloning helpers to blk-mq.cChristoph Hellwig
Keep all the request based code together. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20211117061404.331732-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-15blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()Ming Lei
For avoiding to slow down queue destroy, we don't call blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to cancel dispatch work in blk_release_queue(). However, this way has caused kernel oops[1], reported by Changhui. The log shows that scsi_device can be freed before running blk_release_queue(), which is expected too since scsi_device is released after the scsi disk is closed and the scsi_device is removed. Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue() and disk_release(): 1) when disk_release() is run, the disk has been closed, and any sync dispatch activities have been done, so canceling dispatch work is enough to quiesce filesystem I/O dispatch activity. 2) in blk_cleanup_queue(), we only focus on passthrough request, and passthrough request is always explicitly allocated & freed by its caller, so once queue is frozen, all sync dispatch activity for passthrough request has been done, then it is enough to just cancel dispatch work for avoiding any dispatch activity. [1] kernel panic log [12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300 [12622.777186] #PF: supervisor read access in kernel mode [12622.782918] #PF: error_code(0x0000) - not-present page [12622.788649] PGD 0 P4D 0 [12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI [12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1 [12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015 [12622.813321] Workqueue: kblockd blk_mq_run_work_fn [12622.818572] RIP: 0010:sbitmap_get+0x75/0x190 [12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58 [12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202 [12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004 [12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030 [12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334 [12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000 [12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030 [12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000 [12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0 [12622.913328] Call Trace: [12622.916055] <TASK> [12622.918394] scsi_mq_get_budget+0x1a/0x110 [12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320 [12622.928404] ? pick_next_task_fair+0x39/0x390 [12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140 [12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60 [12622.944829] __blk_mq_run_hw_queue+0x30/0xa0 [12622.949593] process_one_work+0x1e8/0x3c0 [12622.954059] worker_thread+0x50/0x3b0 [12622.958144] ? rescuer_thread+0x370/0x370 [12622.962616] kthread+0x158/0x180 [12622.966218] ? set_kthread_struct+0x40/0x40 [12622.970884] ret_from_fork+0x22/0x30 [12622.974875] </TASK> [12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug] Reported-by: ChanghuiZhong <czhong@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: linux-scsi@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-12blk-mq: fix filesystem I/O request allocationMing Lei
submit_bio_checks() may update bio->bi_opf, so we have to initialize blk_mq_alloc_data.cmd_flags with bio->bi_opf after submit_bio_checks() returns when allocating new request. In case of using cached request, fallback to allocate new request if cached rq isn't compatible with the incoming bio, otherwise change rq->cmd_flags with incoming bio->bi_opf. Fixes: 900e080752025f00 ("block: move queue enter logic into blk_mq_submit_bio()") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09block: use enum type for blk_mq_alloc_data->rq_flagsJens Axboe
kernel test robot reports that we now trigger some sparse warnings: block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer which is due to ->rq_flags being an unsigned int, rather than the stronger type req_flags_t enum. Change the type to req_flags_t to silence this warning. Fixes: 56f8da642bd8 ("block: add rq_flags to struct blk_mq_alloc_data") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-03blk-mq: update hctx->nr_active in blk_mq_end_request_batch()Ming Lei
In case of shared tags and none io sched, batched completion still may be run into, and hctx->nr_active is accounted when getting driver tag, so it has to be updated in blk_mq_end_request_batch(). Otherwise, hctx->nr_active may become same with queue depth, then hctx_may_queue() always return false, then io hang is caused. Fixes the issue by updating the counter in batched way. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: f794f3351f26 ("block: add support for blk_mq_end_request_batch()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211102153619.3627505-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27block: add rq_flags to struct blk_mq_alloc_dataJens Axboe
There's a hole here we can use, and it's faster to set this earlier rather than need to check q->elevator multiple times. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20211019153300.623322-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20blk-mq: move blk_mq_flush_plug_list to block/blk-mq.hChristoph Hellwig
This helper is internal to the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211020144119.142582-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19block: inline fast path of driver tag allocationJens Axboe
If we don't use an IO scheduler or have shared tags, then we don't need to call into this external function at all. This saves ~2% for such a setup. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: add a struct io_comp_batch argument to fops->iopoll()Jens Axboe
struct io_comp_batch contains a list head and a completion handler, which will allow completions to more effciently completed batches of IO. For now, no functional changes in this patch, we just define the io_comp_batch structure and add the argument to the file_operations iopoll handler. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: remove debugfs blk_mq_ctx dispatched/merged/completed attributesJens Axboe
These were added as part of early days debugging for blk-mq, and they are not really useful anymore. Rather than spend cycles updating them, just get rid of them. As a bonus, this shrinks the per-cpu software queue size from 256b to 192b. That's a whole cacheline less. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: switch polling to be bio basedChristoph Hellwig
Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio. Polling for the bio itself leads to a few advantages: - the cookie construction can made entirely private in blk-mq.c - the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues - keeping the device and the cookie inside the bio allows to trivially support polling BIOs remapping by stacking drivers - a lot of code to propagate the cookie back up the submission path can be removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: rename REQ_HIPRI to REQ_POLLEDChristoph Hellwig
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague, the bio flag is specific to the polling implementation, so rename and document it properly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: pre-allocate requests if plug is started and is a batchJens Axboe
The caller typically has a good (or even exact) idea of how many requests it needs to submit. We can make the request/tag allocation a lot more efficient if we just allocate N requests/tags upfront when we queue the first bio from the batch. Provide a new plug start helper that allows the caller to specify how many IOs are expected. This sets plug->nr_ios, and we can use that for smarter request allocation. The plug provides a holding spot for requests, and request allocation will check it before calling into the normal request allocation path. The blk_finish_plug() is called, check if there are unused requests and free them. This should not happen in normal operations. The exception is if we get merging, then we may be left with requests that need freeing when done. This raises the per-core performance on my setup from ~5.8M to ~6.1M IOPS. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: Change shared sbitmap naming to shared tagsJohn Garry
Now that shared sbitmap support really means shared tags, rename symbols to match that. Signed-off-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: Use shared tags for shared sbitmap supportJohn Garry
Currently we use separate sbitmap pairs and active_queues atomic_t for shared sbitmap support. However a full sets of static requests are used per HW queue, which is quite wasteful, considering that the total number of requests usable at any given time across all HW queues is limited by the shared sbitmap depth. As such, it is considerably more memory efficient in the case of shared sbitmap to allocate a set of static rqs per tag set or request queue, and not per HW queue. So replace the sbitmap pairs and active_queues atomic_t with a shared tags per tagset and request queue, which will hold a set of shared static rqs. Since there is now no valid HW queue index to be passed to the blk_mq_ops .init and .exit_request callbacks, pass an invalid index token. This changes the semantics of the APIs, such that the callback would need to validate the HW queue index before using it. Currently no user of shared sbitmap actually uses the HW queue index (as would be expected). Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()John Garry
Refactor blk_mq_free_map_and_requests() such that it can be used at many sites at which the tag map and rqs are freed. Also rename to blk_mq_free_map_and_rqs(), which is shorter and matches the alloc equivalent. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/1633429419-228500-12-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: Add blk_mq_alloc_map_and_rqs()John Garry
Add a function to combine allocating tags and the associated requests, and factor out common patterns to use this new function. Some function only call blk_mq_alloc_map_and_rqs() now, but more functionality will be added later. Also make blk_mq_alloc_rq_map() and blk_mq_alloc_rqs() static since they are only used in blk-mq.c, and finally rename some functions for conciseness and consistency with other function names: - __blk_mq_alloc_map_and_{request -> rqs}() - blk_mq_alloc_{map_and_requests -> set_map_and_rqs}() Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1633429419-228500-11-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24blk: Fix lock inversion between ioc lock and bfqd lockJan Kara
Lockdep complains about lock inversion between ioc->lock and bfqd->lock: bfqd -> ioc: put_io_context+0x33/0x90 -> ioc->lock grabbed blk_mq_free_request+0x51/0x140 blk_put_request+0xe/0x10 blk_attempt_req_merge+0x1d/0x30 elv_attempt_insert_merge+0x56/0xa0 blk_mq_sched_try_insert_merge+0x4b/0x60 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed blk_mq_sched_insert_requests+0xd6/0x2b0 blk_mq_flush_plug_list+0x154/0x280 blk_finish_plug+0x40/0x60 ext4_writepages+0x696/0x1320 do_writepages+0x1c/0x80 __filemap_fdatawrite_range+0xd7/0x120 sync_file_range+0xac/0xf0 ioc->bfqd: bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed put_io_context_active+0x78/0xb0 -> ioc->lock grabbed exit_io_context+0x48/0x50 do_exit+0x7e9/0xdd0 do_group_exit+0x54/0xc0 To avoid this inversion we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that upto the caller similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock. Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-03block: Do not pull requests from the scheduler when we cannot dispatch themJan Kara
Provided the device driver does not implement dispatch budget accounting (which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls requests from the IO scheduler as long as it is willing to give out any. That defeats scheduling heuristics inside the scheduler by creating false impression that the device can take more IO when it in fact cannot. For example with BFQ IO scheduler on top of virtio-blk device setting blkio cgroup weight has barely any impact on observed throughput of async IO because __blk_mq_do_dispatch_sched() always sucks out all the IO queued in BFQ. BFQ first submits IO from higher weight cgroups but when that is all dispatched, it will give out IO of lower weight cgroups as well. And then we have to wait for all this IO to be dispatched to the disk (which means lot of it actually has to complete) before the IO scheduler is queried again for dispatching more requests. This completely destroys any service differentiation. So grab request tag for a request pulled out of the IO scheduler already in __blk_mq_do_dispatch_sched() and do not pull any more requests if we cannot get it because we are unlikely to be able to dispatch it. That way only single request is going to wait in the dispatch list for some tag to free. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iterMing Lei
Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter(), and this way will prevent the request from being re-used when ->fn is running. The approach is same as what we do during handling timeout. Fix request use-after-free(UAF) related with completion race or queue releasing: - If one rq is referred before rq->q is frozen, then queue won't be frozen before the request is released during iteration. - If one rq is referred after rq->q is frozen, refcount_inc_not_zero() will return false, and we won't iterate over this request. However, still one request UAF not covered: refcount_inc_not_zero() may read one freed request, and it will be handled in next patch. Tested-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04scsi: blk-mq: Return budget token from .get_budget callbackMing Lei
SCSI uses a global atomic variable to track queue depth for each LUN/request queue. This doesn't scale well when there are lots of CPU cores and the disk is very fast. It has been observed that IOPS is affected a lot by tracking queue depth via sdev->device_busy in the I/O path. Return budget token from .get_budget callback. The budget token can be passed to driver so that we can replace the atomic variable with sbitmap_queue and alleviate the scaling problems that way. Link: https://lore.kernel.org/r/20210122023317.687987-9-ming.lei@redhat.com Cc: Omar Sandoval <osandov@fb.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Cc: Ewan D. Milne <emilne@redhat.com> Tested-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>