summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-01-18block, bfq: limit tags for writes and async I/OPaolo Valente
Asynchronous I/O can easily starve synchronous I/O (both sync reads and sync writes), by consuming all request tags. Similarly, storms of synchronous writes, such as those that sync(2) may trigger, can starve synchronous reads. In their turn, these two problems may also cause BFQ to loose control on latency for interactive and soft real-time applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice Writer takes 0.6 seconds to start if the device is idle, but it takes more than 45 seconds (!) if there are sequential writes in the background. This commit addresses this issue by limiting the maximum percentage of tags that asynchronous I/O requests and synchronous write requests can consume. In particular, this commit grants a higher threshold to synchronous writes, to prevent the latter from being starved by asynchronous I/O. According to the above test, LibreOffice Writer now starts in about 1.2 seconds on average, regardless of the background workload, and apart from some rare outlier. To check this improvement, run, e.g., sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init" for the comm_startup_lat benchmark in the S suite [1]. [1] https://github.com/Algodev-github/S Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busyMing Lei
If we run into blk_mq_request_direct_issue(), when queue is busy, we don't want to dispatch this request into hctx->dispatch_list, and what we need to do is to return the queue busy info to caller, so that caller can deal with it well. Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Reported-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17block: Fix __bio_integrity_endio() documentationBart Van Assche
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_requestMike Snitzer
After commit: 923218f6166a ("blk-mq: don't allocate driver tag upfront for flush rq") we no longer use the 'can_block' argument in blk_mq_sched_insert_request(). Kill it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Added actual commit message as to why it's being removed. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedbackMing Lei
blk_insert_cloned_request() is called in the fast path of a dm-rq driver (e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses blk_mq_request_bypass_insert() to directly append the request to the blk-mq hctx->dispatch_list of the underlying queue. 1) This way isn't efficient enough because the hctx spinlock is always used. 2) With blk_insert_cloned_request(), we completely bypass underlying queue's elevator and depend on the upper-level dm-rq driver's elevator to schedule IO. But dm-rq currently can't get the underlying queue's dispatch feedback at all. Without knowing whether a request was issued or not (e.g. due to underlying queue being busy) the dm-rq elevator will not be able to provide effective IO merging (as a side-effect of dm-rq currently blindly destaging a request from its elevator only to requeue it after a delay, which kills any opportunity for merging). This obviously causes very bad sequential IO performance. Fix this by updating blk_insert_cloned_request() to use blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a request to be issued directly to the underlying queue and returns the dispatch feedback (blk_status_t). If blk_mq_request_direct_issue() returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE to _not_ destage the request. Whereby preserving the opportunity to merge IO. With this, request-based DM's blk-mq sequential IO performance is vastly improved (as much as 3X in mpath/virtio-scsi testing). Signed-off-by: Ming Lei <ming.lei@redhat.com> [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but they were refactored to make them less fragile and easier to read/review] Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17blk-mq: factor out a few helpers from __blk_mq_try_issue_directlyMike Snitzer
No functional change. Just makes code flow more logically. In following commit, __blk_mq_try_issue_directly() will be used to return the dispatch result (blk_status_t) to DM. DM needs this information to improve IO merging. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printkMing Lei
We know this WARN_ON is harmless and in reality it may be trigged, so convert it to printk() and dump_stack() to avoid to confusing people. Also add comment about two releated races here. Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17blk-mq: make sure hctx->next_cpu is set correctlyMing Lei
When hctx->next_cpu is set from possible online CPUs, there is one race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally break workqueue. The race can be triggered in the following two sitations: 1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called to dispatch requests from the DEAD cpu context, but at that time, this DEAD CPU has been cleared from 'cpu_online_mask', so all CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set a bad value. 2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue should be run on the other CPU A, then CPU A may become offline at the same time and all CPUs in hctx->cpumask become offline. This patch deals with this issue by re-selecting next CPU, and making sure it is set correctly. Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com> Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com> Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17aoe: use ktime_t instead of timevalTina Ruchandani
'struct frame' uses two variables to store the sent timestamp - 'struct timeval' and jiffies. jiffies is used to avoid discrepancies caused by updates to system time. 'struct timeval' is deprecated because it uses 32-bit representation for seconds which will overflow in year 2038. This patch does the following: - Replace the use of 'struct timeval' and jiffies with ktime_t, which is the recommended type for timestamping - ktime_t provides both long range (like jiffies) and high resolution (like timeval). Using ktime_get (monotonic time) instead of wall-clock time prevents any discprepancies caused by updates to system time. [updates by Arnd below] The original patch from Tina never went anywhere as we discussed how to keep the impact on performance minimal. I've started over now but arrived at basically the same patch that she had originally, except for an slightly improved tsince_hr() function. I'm making it more robust against overflows, and also optimize explicitly for the common case in which a frame is less than 4.2 seconds old, using only a 32-bit division in that case. This should make the new version more efficient than the old code, since we replace the existing two 32-bit division in do_gettimeofday() plus one multiplication with a single single 32-bit division in tsince_hr() and drop the double bookkeeping. It's also more efficient than the ktime_get_us() API we discussed before, since that would also rely on multiple divisions. Link: https://lists.linaro.org/pipermail/y2038/2015-May/000276.html Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com> Cc: Ed Cashin <ed.cashin@acm.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-16blkcg: simplify statistic accumulation codeArnd Bergmann
Some older compilers (gcc-4.4 through 4.6 in particular) struggle with the way that blkg_rwstat_read() returns a structure, leading to excessive stack usage and rather inefficient code: block/blk-cgroup.c: In function 'blkg_destroy': block/blk-cgroup.c:354:1: error: the frame size of 1296 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] block/cfq-iosched.c: In function 'cfqg_stats_add_aux': block/cfq-iosched.c:753:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] block/bfq-cgroup.c: In function 'bfqg_stats_add_aux': block/bfq-cgroup.c:299:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] I also notice that there is no point in using atomic accesses for the local variables, so storing the temporaries in simple 'u64' variables not only avoids the stack usage on older compilers but also improves the object code on modern versions. Fixes: e6269c445467 ("blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it") Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-15dm: fix incomplete request_queue initializationMike Snitzer
DM is no longer prone to having its request_queue be improperly initialized. Summary of changes: - defer DM's blk_register_queue() from add_disk()-time until dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev(). - dm_setup_md_queue() is updated to fully initialize DM's request_queue (_after_ all table loads have occurred and the request_queue's type, features and limits are known). A very welcome side-effect of these changes is DM no longer needs to: 1) backfill the "mq" sysfs entry (because historically DM didn't initialize the request_queue to use blk-mq until _after_ blk_register_queue() was called via add_disk()). 2) call elv_register_queue() to get .request_fn request-based DM device's "iosched" exposed in syfs. In addition, blk-mq debugfs support is now made available because request-based DM's blk-mq request_queue is now properly initialized before dm_setup_md_queue() calls blk_register_queue(). These changes also stave off the need to introduce new DM-specific workarounds in block core, e.g. this proposal: https://patchwork.kernel.org/patch/10067961/ In the end DM devices should be less unicorn in nature (relative to initialization and availability of block core infrastructure provided by the request_queue). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-15blk_rq_map_user_iov: fix error overrideDouglas Gilbert
During stress tests by syzkaller on the sg driver the block layer infrequently returns EINVAL. Closer inspection shows the block layer was trying to return ENOMEM (which is much more understandable) but for some reason overroad that useful error. Patch below does not show this (unchanged) line: ret =__blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy); That 'ret' was being overridden when that function failed. Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-15block: allow gendisk's request_queue registration to be deferredMike Snitzer
Since I can remember DM has forced the block layer to allow the allocation and initialization of the request_queue to be distinct operations. Reason for this is block/genhd.c:add_disk() has requires that the request_queue (and associated bdi) be tied to the gendisk before add_disk() is called -- because add_disk() also deals with exposing the request_queue via blk_register_queue(). DM's dynamic creation of arbitrary device types (and associated request_queue types) requires the DM device's gendisk be available so that DM table loads can establish a master/slave relationship with subordinate devices that are referenced by loaded DM tables -- using bd_link_disk_holder(). But until these DM tables, and their associated subordinate devices, are known DM cannot know what type of request_queue it needs -- nor what its queue_limits should be. This chicken and egg scenario has created all manner of problems for DM and, at times, the block layer. Summary of changes: - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant that drivers may use to add a disk without also calling blk_register_queue(). Driver must call blk_register_queue() once its request_queue is fully initialized. - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED is not set. It won't be set if driver used add_disk_no_queue_reg() but driver encounters an error and must del_gendisk() before calling blk_register_queue(). - Export blk_register_queue(). These changes allow DM to use add_disk_no_queue_reg() to anchor its gendisk as the "master" for master/slave relationships DM must establish with subordinate devices referenced in DM tables that get loaded. Once all "slave" devices for a DM device are known its request_queue can be properly initialized and then advertised via sysfs -- important improvement being that no request_queue resource initialization performed by blk_register_queue() is missed for DM devices anymore. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-15block: properly protect the 'queue' kobj in blk_unregister_queueMike Snitzer
The original commit e9a823fb34a8b (block: fix warning when I/O elevator is changed as request_queue is being removed) is pretty conflated. "conflated" because the resource being protected by q->sysfs_lock isn't the queue_flags (it is the 'queue' kobj). q->sysfs_lock serializes __elevator_change() (via elv_iosched_store) from racing with blk_unregister_queue(): 1) By holding q->sysfs_lock first, __elevator_change() can complete before a racing blk_unregister_queue(). 2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED in case elv_iosched_store() loses the race with blk_unregister_queue(), it needs a way to know the 'queue' kobj isn't there. Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is held until after the 'queue' kobj is removed. To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So rename __blk_mq_unregister_dev() to blk_mq_unregister_dev(). Also, blk_unregister_queue() should use q->queue_lock to protect against any concurrent writes to q->queue_flags -- even though chances are the queue is being cleaned up so no concurrent writes are likely. Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-15block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDENMike Snitzer
device_add_disk() will only call bdi_register_owner() if !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call bdi_unregister() if !GENHD_FL_HIDDEN. Found with code inspection. bdi_unregister() won't do any harm if bdi_register_owner() wasn't used but best to avoid the unnecessary call to bdi_unregister(). Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-14blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init()Jens Axboe
A previous commit moved the clearing of rq->rq_flags later, but we may have already set RQF_MQ_INFLIGHT when that happens. Ensure that we correctly initialize rq->rq_flags to the right value. This is based on an original fix by Ming, just rewritten to not require a conditional. Fixes: 7c3fb70f0341 ("block: rearrange a few request fields for better cache layout") Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-12blk-mq: add missing RQF_STARTED to debugfsJens Axboe
Looking at debug output, we see: ./000000009ddfa913/requeue_list:000000009646711c {.op=READ, .state=idle, gen=0x1 18, abort_gen=0x0, .cmd_flags=, .rq_flags=SORTED|1|SOFTBARRIER|IO_STAT, complete =0, .tag=-1, .internal_tag=217} Note the '1' between SORTED and SOFTBARRIER - that's because no name as defined for RQF_STARTED. Fixed that. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-12blk-mq: simplify queue mapping & schedule with each possisble CPUChristoph Hellwig
The previous patch assigns interrupt vectors to all possible CPUs, so now hctx can be mapped to possible CPUs, this patch applies this fact to simplify queue mapping & schedule so that we don't need to handle CPU hotplug for dealing with physical CPU plug & unplug. With this simplication, we can work well on physical CPU plug & unplug, which is a normal use case for VM at least. Make sure we allocate blk_mq_ctx structures for all possible CPUs, and set hctx->numa_node for possible CPUs which are mapped to this hctx. And only choose the online CPUs for schedule. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU") (merged the three into one because any single one may not work, and fix selecting online CPUs for scheduler) Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-12genirq/affinity: assign vectors to all possible CPUsChristoph Hellwig
Currently we assign managed interrupt vectors to all present CPUs. This works fine for systems were we only online/offline CPUs. But in case of systems that support physical CPU hotplug (or the virtualized version of it) this means the additional CPUs covered for in the ACPI tables or on the command line are not catered for. To fix this we'd either need to introduce new hotplug CPU states just for this case, or we can start assining vectors to possible but not present CPUs. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com> Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU") Cc: linux-kernel@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-11blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()Bart Van Assche
This patch does not change any functionality but makes the blk_mq_mark_tag_wait() code slightly easier to read. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-11null_blk: remove explicit 'select FAULT_INJECTION'Arnd Bergmann
Selecting FAULT_INJECTION causes a Kconfig warning when CONFIG_DEBUG_KERNEL is not set: warning: (BLK_DEV_NULL_BLK && DRM_I915_SELFTEST) selects FAULT_INJECTION which has unmet direct dependencies (DEBUG_KERNEL) The other drivers that use FAULT_INJECTION tend to have a separate Kconfig symbol for turning on that feature, so let's do the same thing here. This may add a bit more complexity than we like, but it avoids the warning and is more consistent with the rest of the kernel. Fixes: 93b570464cce ("null_blk: add option for managing IO timeouts") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10blk-mq: Add locking annotations to hctx_lock() and hctx_unlock()Bart Van Assche
This patch avoids that sparse reports the following: block/blk-mq.c:637:33: warning: context imbalance in 'hctx_unlock' - unexpected unlock block/blk-mq.c:642:9: warning: context imbalance in 'hctx_lock' - wrong count at exit Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: silently forbid sending any ioctl to a partitionPaolo Bonzini
After the first few months, the message has not led to many bug reports. It's been almost five years now, and in practice the main source of it seems to be MTIOCGET that someone is using to detect tape devices. While we could whitelist it just like CDROM_GET_CAPABILITY, this patch just removes the message altogether. The patch also removes the "safe but not very useful" ioctl whitelist, as suggested by Christoph. I doubt anything is using most of those ioctls _in general_, let alone on a partition. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: rearrange a few request fields for better cache layoutJens Axboe
Move completion related items (like the call single data) near the end of the struct, instead of mixing them in with the initial queueing related fields. Move queuelist below the bio structures. Then we have all queueing related bits in the first cache line. This yields a 1.5-2% increase in IOPS for a null_blk test, both for sync and for high thread count access. Sync test goes form 975K to 992K, 32-thread case from 20.8M to 21.2M IOPS. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bitJens Axboe
We only have one atomic flag left. Instead of using an entire unsigned long for that, steal the bottom bit of the deadline field that we already reserved. Remove ->atomic_flags, since it's now unused. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: add accessors for setting/querying request deadlineJens Axboe
We reduce the resolution of request expiry, but since we're already using jiffies for this where resolution depends on the kernel configuration and since the timeout resolution is coarse anyway, that should be fine. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: remove REQ_ATOM_POLL_SLEPTJens Axboe
We don't need this to be an atomic flag, it can be a regular flag. We either end up on the same CPU for the polling, in which case the state is sane, or we did the sleep which would imply the needed barrier to ensure we see the right state. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10blk-mq: add a few missing debugfs RQF_ flagsJens Axboe
We are missing ZONE_WRITE_LOCKED and MQ_TIMEOUT_EXPIRED, add them so the debugfs bits can decode them. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10dm mpath: Use blk_path_errorKeith Busch
Uses common code for determining if an error should be retried on alternate path. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10nvme/multipath: Use blk_path_errorKeith Busch
Uses common code for determining if an error should be retried on alternate path. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: Provide blk_status_t decoding for path errorsKeith Busch
This patch provides a common decoder for block status path related errors that may be retried so various entities wishing to consult this do not have to duplicate this decision. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10nvme/multipath: Consult blk_status_t for failoverKeith Busch
This removes nvme multipath's specific status decoding to see if failover is needed, using the generic blk_status_t that was decoded earlier. This abstraction from the raw NVMe status means all status decoding exists in one place. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10nvme: Add more command status translationKeith Busch
This adds more NVMe status code translations to blk_status_t values, and captures all the current status codes NVMe multipath uses. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10blk-mq: Explain when 'active_queues' is decrementedBart Van Assche
It is nontrivial to derive from the blk-mq source code when blk_mq_tags.active_queues is decremented. Hence add a comment that explains this. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10partitions/msdos: Unable to mount UFS 44bsd partitionsRichard Narron
UFS partitions from newer versions of FreeBSD 10 and 11 use relative addressing for their subpartitions. But older versions of FreeBSD still use absolute addressing just like OpenBSD and NetBSD. Instead of simply testing for a FreeBSD partition, the code needs to also test if the starting offset of the C subpartition is zero. https://bugzilla.kernel.org/show_bug.cgi?id=197733 Signed-off-by: Richard Narron <comet.berkeley@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10null_blk: add option for managing IO timeoutsJens Axboe
Use the fault injection framework to provide a way for null_blk to configure timeouts. This only works for queue_mode 1 and 2, since the bio mode doesn't have code for tracking timeouts. Let's say you want to have a 10% chance of timing out every 100,000 requests, and for 5 total timeouts, you could do: modprobe null_blk timeout="100000,10,0,5" This is useful for adding blktests to test that IO timeouts are handled appropriately. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block, bfq: fix occurrences of request finish method's old nameChiara Bruschi
Commit '7b9e93616399' ("blk-mq-sched: unify request finished methods") changed the old name of current bfq_finish_request method, but left it unchanged elsewhere in the code (related comments, part of function name bfq_put_rq_priv_body). This commit fixes all occurrences of the old name of this method by changing them into the current name. Fixes: 7b9e93616399 ("blk-mq-sched: unify request finished methods") Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Federico Motta <federico@willer.it> Signed-off-by: Chiara Bruschi <bruschi.chiara@outlook.it> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09Revert "block: blk-merge: try to make front segments in full size"Ming Lei
This reverts commit a2d37968d784363842f87820a21e106741d28004. If max segment size isn't 512-aligned, this patch won't work well. Also once multipage bvec is enabled, adjacent bvecs won't be physically contiguous if page is added via bio_add_page(), so we don't need this kind of complicated logic. Reported-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09null_blk: wire up timeoutsJens Axboe
This is needed to ensure that we actually handle timeouts. Without it, the queue_mode=1 path will never call blk_add_timer(), and the queue_mode=2 path will continually just return EH_RESET_TIMER and we never actually complete the offending request. This was used to test the new timeout code, and the changes around killing off REQ_ATOM_COMPLETE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHEDJens Axboe
It's not available if we don't have group io scheduling set, and there's no need to call it. Fixes: 0d52af590552 ("block, bfq: release oom-queue ref to root group on exit") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09bcache: closures: move control bits one bit rightMichael Lyle
Otherwise, architectures that do negated adds of atomics (e.g. s390) to do atomic_sub fail in closure_set_stopped. Signed-off-by: Michael Lyle <mlyle@lyle.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09block: Fix kernel-doc warnings reported when building with W=1Bart Van Assche
Commit 3a025e1d1c2e ("Add optional check for bad kernel-doc comments") causes W=1 the kernel-doc script to be run and thereby causes several new warnings to appear when building the kernel with W=1. Fix the block layer kernel-doc headers such that the block layer again builds cleanly with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: Fix spelling in a source code commentBart Van Assche
Change "nedeing" into "needing" and "caes" into "cases". Fixes: commit f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: silence false positive warnings in hctx_unlock()Jens Axboe
In some stupider versions of gcc, it complains: block/blk-mq.c: In function ‘blk_mq_complete_request’: ./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in this function [-Wmaybe-uninitialized] __srcu_read_unlock(sp, idx); ^ block/blk-mq.c:620:6: note: ‘srcu_idx’ was declared here int srcu_idx; ^ which is completely bogus, since we only use srcu_idx when hctx->flags & BLK_MQ_F_BLOCKING is set, and that's the case where hctx_lock() has initialized it. Just set it to '0' in the normal path in hctx_lock() to silence this annoying warning. Fixes: 04ced159cec8 ("blk-mq: move hctx lock/unlock into a helper") Fixes: 5197c05e16b4 ("blk-mq: protect completion path with RCU") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcuTejun Heo
The RCU protection has been expanded to cover both queueing and completion paths making ->queue_rq_srcu a misnomer. Rename it to ->srcu as suggested by Bart. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: remove REQ_ATOM_STARTEDTejun Heo
After the recent updates to use generation number and state based synchronization, we can easily replace REQ_ATOM_STARTED usages by adding an extra state to distinguish completed but not yet freed state. Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users left and is removed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mqTejun Heo
After the recent updates to use generation number and state based synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except to avoid firing the same timeout multiple times. Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple times. This removes atomic bitops from hot paths too. v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out(). v3: Added RQF_MQ_TIMEOUT_EXPIRED flag. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: make blk_abort_request() trigger timeout pathTejun Heo
With issue/complete and timeout paths now using the generation number and state based synchronization, blk_abort_request() is the only one which depends on REQ_ATOM_COMPLETE for arbitrating completion. There's no reason for blk_abort_request() to be a completely separate path. This patch makes blk_abort_request() piggyback on the timeout path instead of trying to terminate the request directly. This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq. Note that this makes blk_abort_request() asynchronous - it initiates abortion but the actual termination will happen after a short while, even when the caller owns the request. AFAICS, SCSI and ATA should be fine with that and I think mtip32xx and dasd should be safe but not completely sure. It'd be great if people who know the drivers take a look. v2: - Add comment explaining the lack of synchronization around ->deadline update as requested by Bart. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Asai Thambi SP <asamymuthupa@micron.com> Cc: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETETejun Heo
blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test REQ_ATOM_COMPLETE to determine the request state. Both uses are speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for equivalent results. Replace the tests. This will allow removing REQ_ATOM_COMPLETE usages from blk-mq. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: replace timeout synchronization with a RCU and generation based schemeTejun Heo
Currently, blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunately, it contains quite a few holes. There's a complex dancing around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE between issue/completion and timeout paths; however, they don't have a synchronization point across request recycle instances and it isn't clear what the barriers add. blk_mq_check_expired() can easily read STARTED from N-2'th iteration, deadline from N-1'th, blk_mark_rq_complete() against Nth instance. In fact, it's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce 5 sec delay before time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out. It just made the call on a recycle instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either. This patch replaces the broken synchronization mechanism with a RCU and generation number based one. 1. Each request has a u64 generation + state value, which can be updated only by the request owner. Whenever a request becomes in-flight, the generation number gets bumped up too. This provides the basis for the timeout path to distinguish different recycle instances of the request. Also, marking a request in-flight and setting its deadline are protected with a seqcount so that the timeout path can fetch both values coherently. 2. The timeout path fetches the generation, state and deadline. If the verdict is timeout, it records the generation into a dedicated request abortion field and does RCU wait. 3. The completion path is also protected by RCU (from the previous patch) and checks whether the current generation number and state match the abortion field. If so, it skips completion. 4. The timeout path, after RCU wait, scans requests again and terminates the ones whose generation and state still match the ones requested for abortion. By now, the timeout path knows that either the generation number and state changed if it lost the race or the completion will yield to it and can safely timeout the request. While it's more lines of code, it's conceptually simpler, doesn't depend on direct use of subtle memory ordering or coherence, and hopefully doesn't terminate the wrong instance. While this change makes REQ_ATOM_COMPLETE synchronization unnecessary between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't removed yet as it's still used in other places. Future patches will move all state tracking to the new mechanism and remove all bitops in the hot paths. Note that this patch adds a comment explaining a race condition in BLK_EH_RESET_TIMER path. The race has always been there and this patch doesn't change it. It's just documenting the existing race. v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao. - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter. - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter. v3: - Fixed possible extended seqcount / u64_stats_sync read looping spotted by Peter. - MQ_RQ_IDLE was incorrectly being set in complete_request instead of free_request. Fixed. v4: - Rebased on top of hctx_lock() refactoring patch. - Added comment explaining the use of hctx_lock() in completion path. v5: - Added comments requested by Bart. - Note the addition of BLK_EH_RESET_TIMER race condition in the commit message. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>