summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2022-06-03Merge tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block driver updates from Jens Axboe: "A collection of stragglers that were late on sending in their changes and just followup fixes. - NVMe fixes pull request via Christoph: - set controller enable bit in a separate write (Niklas Cassel) - disable namespace identifiers for the MAXIO MAP1001 (Christoph) - fix a comment typo (Julia Lawall)" - MD fixes pull request via Song: - Remove uses of bdevname (Christoph Hellwig) - Bug fixes (Guoqing Jiang, and Xiao Ni) - bcache fixes series (Coly) - null_blk zoned write fix (Damien) - nbd fixes (Yu, Zhang) - Fix for loop partition scanning (Christoph)" * tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block: (23 commits) block: null_blk: Fix null_zone_write() nvmet: fix typo in comment nvme: set controller enable bit in a separate write nvme-pci: disable namespace identifiers for the MAXIO MAP1001 bcache: avoid unnecessary soft lockup in kworker update_writeback_rate() nbd: use pr_err to output error message nbd: fix possible overflow on 'first_minor' in nbd_dev_add() nbd: fix io hung while disconnecting device nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed nbd: fix race between nbd_alloc_config() and module removal nbd: call genl_unregister_family() first in nbd_cleanup() md: bcache: check the return value of kzalloc() in detached_dev_do_request() bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init() block, loop: support partitions without scanning bcache: avoid journal no-space deadlock by reserving 1 journal bucket bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() bcache: improve multithreaded bch_sectors_dirty_init() bcache: improve multithreaded bch_btree_check() md: fix double free of io_acct_set bioset md: Don't set mddev private to NULL in raid0 pers->free ...
2022-06-01Merge tag 'for-5.19/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM core's dm_table_supports_poll to return false if target has no data devices. - Fix DM verity target so that it cannot be switched to a different DM target type (e.g. dm-linear) via DM table reload. * tag 'for-5.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity: set DM_TARGET_IMMUTABLE feature flag dm table: fix dm_table_supports_poll to return false if no data devices
2022-05-31dm verity: set DM_TARGET_IMMUTABLE feature flagSarthak Kukreti
The device-mapper framework provides a mechanism to mark targets as immutable (and hence fail table reloads that try to change the target type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's feature flags to prevent switching the verity target with a different target type. Fixes: a4ffc152198e ("dm: add verity target") Cc: stable@vger.kernel.org Signed-off-by: Sarthak Kukreti <sarthakkukreti@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-31dm table: fix dm_table_supports_poll to return false if no data devicesMike Snitzer
It was reported that the "generic/250" test in xfstests (which uses the dm-error target) demonstrates a regression where the kernel crashes in bioset_exit(). Since commit cfc97abcbe0b ("dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io bioset") the bioset_init() for the dm_io bioset will setup the bioset's per-cpu alloc cache if all devices have QUEUE_FLAG_POLL set. But there was an bug where a target that doesn't have any data devices (and that doesn't even set the .iterate_devices dm target callback) will incorrectly return true from dm_table_supports_poll(). Fix this by updating dm_table_supports_poll() to follow dm-table.c's well-worn pattern for testing that _all_ targets in a DM table do in fact have underlying devices that set QUEUE_FLAG_POLL. NOTE: An additional block fix is still needed so that bio_alloc_cache_destroy() clears the bioset's ->cache member. Otherwise, a DM device's table reload that transitions the DM device's bioset from using a per-cpu alloc cache to _not_ using one will result in bioset_exit() crashing in bio_alloc_cache_destroy() because dm's dm_io bioset ("io_bs") was left with a stale ->cache member. Fixes: cfc97abcbe0b ("dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io bioset") Reported-by: Matthew Wilcox <willy@infradead.org> Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-28bcache: avoid unnecessary soft lockup in kworker update_writeback_rate()Coly Li
The kworker routine update_writeback_rate() is schedued to update the writeback rate in every 5 seconds by default. Before calling __update_writeback_rate() to do real job, semaphore dc->writeback_lock should be held by the kworker routine. At the same time, bcache writeback thread routine bch_writeback_thread() also needs to hold dc->writeback_lock before flushing dirty data back into the backing device. If the dirty data set is large, it might be very long time for bch_writeback_thread() to scan all dirty buckets and releases dc->writeback_lock. In such case update_writeback_rate() can be starved for long enough time so that kernel reports a soft lockup warn- ing started like: watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713] Such soft lockup condition is unnecessary, because after the writeback thread finishes its job and releases dc->writeback_lock, the kworker update_writeback_rate() may continue to work and everything is fine indeed. This patch avoids the unnecessary soft lockup by the following method, - Add new member to struct cached_dev - dc->rate_update_retry (0 by default) - In update_writeback_rate() call down_read_trylock(&dc->writeback_lock) firstly, if it fails then lock contention happens. - If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), doesn't acquire the lock and reschedules the kworker for next try. - If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, no retry anymore and call down_read(&dc->writeback_lock) to wait for the lock. By the above method, at worst case update_writeback_rate() may retry for 1+ minutes before blocking on dc->writeback_lock by calling down_read(). For a 4TB cache device with 1TB dirty data, 90%+ of the unnecessary soft lockup warning message can be avoided. When retrying to acquire dc->writeback_lock in update_writeback_rate(), of course the writeback rate cannot be updated. It is fair, because when the kworker is blocked on the lock contention of dc->writeback_lock, the writeback rate cannot be updated neither. This change follows Jens Axboe's suggestion to a more clear and simple version. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27Merge tag 'libnvdimm-for-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and DAX updates from Dan Williams: "New support for clearing memory errors when a file is in DAX mode, alongside with some other fixes and cleanups. Previously it was only possible to clear these errors using a truncate or hole-punch operation to trigger the filesystem to reallocate the block, now, any page aligned write can opportunistically clear errors as well. This change spans x86/mm, nvdimm, and fs/dax, and has received the appropriate sign-offs. Thanks to Jane for her work on this. Summary: - Add support for clearing memory error via pwrite(2) on DAX - Fix 'security overwrite' support in the presence of media errors - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)" * tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: pmem: implement pmem_recovery_write() pmem: refactor pmem_clear_poison() dax: add .recovery_write dax_operation dax: introduce DAX_RECOVERY_WRITE dax access mode mce: fix set_mce_nospec to always unmap the whole page x86/mce: relocate set{clear}_mce_nospec() functions acpi/nfit: rely on mce->misc to determine poison granularity testing: nvdimm: asm/mce.h is not needed in nfit.c testing: nvdimm: iomap: make __nfit_test_ioremap a macro nvdimm: Allow overwrite in the presence of disabled dimms tools/testing/nvdimm: remove unneeded flush_workqueue
2022-05-27md: bcache: check the return value of kzalloc() in detached_dev_do_request()Jia-Ju Bai
The function kzalloc() in detached_dev_do_request() can fail, so its return value should be checked. Fixes: bc082a55d25c ("bcache: fix inaccurate io state for detached bcache devices") Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27bcache: memset on stack variables in bch_btree_check() and ↵Coly Li
bch_sectors_dirty_init() The local variables check_state (in bch_btree_check()) and state (in bch_sectors_dirty_init()) should be fully filled by 0, because before allocating them on stack, they were dynamically allocated by kzalloc(). Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-26Merge tag 'for-5.19/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Enable DM core bioset's per-cpu bio cache if QUEUE_FLAG_POLL set. This change improves DM's hipri bio polling (REQ_POLLED) performance by 7 - 20% depending on the system. - Update DM core to use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. - Various DM core changes to reduce bio-based DM overhead and simplify IO accounting. - Fundamental DM core improvements to dm_io reference counting and the elimination of using bio_split()+bio_chain() -- instead DM's bio-based IO accounting is updated to account that a split occurred. - Improve DM core's abnormal bio processing to do less work. - Improve DM core's hipri polling support to use a single list rather than an hlist. - Update DM core to pass NULL bdev to bio_alloc_clone() so that initialization that isn't useful for DM can be elided. - Add cond_resched to DM stats' various loops that loop over all entries. - Fix incorrect error code return from DM integrity's constructor. - Make DM crypt's printing of the key constant-time. - Update bio-based DM multipath to provide high-resolution timer to the Historical Service Time (HST) path selector. * tag 'for-5.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm: pass NULL bdev to bio_alloc_clone dm cache metadata: remove unnecessary variable in __dump_mapping dm mpath: provide high-resolution timer to HST for bio-based dm crypt: make printing of the key constant-time dm integrity: fix error code in dm_integrity_ctr() dm stats: add cond_resched when looping over entries dm: improve abnormal bio processing dm: simplify bio-based IO accounting further dm: put all polled dm_io instances into a single list dm: improve dm_io reference counting dm: don't grab target io reference in dm_zone_map_bio dm: improve bio splitting and associated IO accounting dm: switch to bdev based IO accounting interfaces dm: pass dm_io instance to dm_io_acct directly dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct dm: use bio_sectors in dm_aceept_partial_bio dm: simplify basic targets dm: conditionally enable branching for less used features dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio dm: move hot dm_io members to same cacheline as dm_target_io ...
2022-05-24bcache: avoid journal no-space deadlock by reserving 1 journal bucketColy Li
The journal no-space deadlock was reported time to time. Such deadlock can happen in the following situation. When all journal buckets are fully filled by active jset with heavy write I/O load, the cache set registration (after a reboot) will load all active jsets and inserting them into the btree again (which is called journal replay). If a journaled bkey is inserted into a btree node and results btree node split, new journal request might be triggered. For example, the btree grows one more level after the node split, then the root node record in cache device super block will be upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no space in journal buckets, the journal replay has to wait for new journal bucket to be reclaimed after at least one journal bucket replayed. This is one example that how the journal no-space deadlock happens. The solution to avoid the deadlock is to reserve 1 journal bucket in run time, and only permit the reserved journal bucket to be used during cache set registration procedure for things like journal replay. Then the journal space will never be fully filled, there is no chance for journal no-space deadlock to happen anymore. This patch adds a new member "bool do_reserve" in struct journal, it is inititalized to 0 (false) when struct journal is allocated, and set to 1 (true) by bch_journal_space_reserve() when all initialization done in run_cache_set(). In the run time when journal_reclaim() tries to allocate a new journal bucket, free_journal_buckets() is called to check whether there are enough free journal buckets to use. If there is only 1 free journal bucket and journal->do_reserve is 1 (true), the last bucket is reserved and free_journal_buckets() will return 0 to indicate no free journal bucket. Then journal_reclaim() will give up, and try next time to see whetheer there is free journal bucket to allocate. By this method, there is always 1 jouranl bucket reserved in run time. During the cache set registration, journal->do_reserve is 0 (false), so the reserved journal bucket can be used to avoid the no-space deadlock. Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-24bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()Coly Li
After making bch_sectors_dirty_init() being multithreaded, the existing incremental dirty sector counting in bch_root_node_dirty_init() doesn't release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME) bkeys. Because a read lock is added on btree root node to prevent the btree to be split during the dirty sectors counting, other I/O requester has no chance to gain the write lock even restart bcache_btree(). That is to say, the incremental dirty sectors counting is incompatible to the multhreaded bch_sectors_dirty_init(). We have to choose one and drop another one. In my testing, with 512 bytes random writes, I generate 1.2T dirty data and a btree with 400K nodes. With single thread and incremental dirty sectors counting, it takes 30+ minites to register the backing device. And with multithreaded dirty sectors counting, the backing device registration can be accomplished within 2 minutes. The 30+ minutes V.S. 2- minutes difference makes me decide to keep multithreaded bch_sectors_dirty_init() and drop the incremental dirty sectors counting. This is what this patch does. But INIT_KEYS_EACH_TIME is kept, in sectors_dirty_init_fn() the CPU will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys iterated. This is to avoid the watchdog reports a bogus soft lockup warning. Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-24bcache: improve multithreaded bch_sectors_dirty_init()Coly Li
Commit b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") makes bch_sectors_dirty_init() to be much faster when counting dirty sectors by iterating all dirty keys in the btree. But it isn't in ideal shape yet, still can be improved. This patch does the following changes to improve current parallel dirty keys iteration on the btree, - Add read lock to root node when multiple threads iterating the btree, to prevent the root node gets split by I/Os from other registered bcache devices. - Remove local variable "char name[32]" and generate kernel thread name string directly when calling kthread_run(). - Allocate "struct bch_dirty_init_state state" directly on stack and avoid the unnecessary dynamic memory allocation for it. - Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12 which is enough indeed. - Increase &state->started to count created kernel thread after it succeeds to create. - When wait for all dirty key counting threads to finish, use wait_event() to replace wait_event_interruptible(). With the above changes, the code is more clear, and some potential error conditions are avoided. Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-24bcache: improve multithreaded bch_btree_check()Coly Li
Commit 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") makes bch_btree_check() to be much faster when checking all btree nodes during cache device registration. But it isn't in ideal shap yet, still can be improved. This patch does the following thing to improve current parallel btree nodes check by multiple threads in bch_btree_check(), - Add read lock to root node while checking all the btree nodes with multiple threads. Although currently it is not mandatory but it is good to have a read lock in code logic. - Remove local variable 'char name[32]', and generate kernel thread name string directly when calling kthread_run(). - Allocate local variable "struct btree_check_state check_state" on the stack and avoid unnecessary dynamic memory allocation for it. - Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12 which is enough indeed. - Increase check_state->started to count created kernel thread after it succeeds to create. - When wait for all checking kernel threads to finish, use wait_event() to replace wait_event_interruptible(). With this change, the code is more clear, and some potential error conditions are avoided. Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-23Merge tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "Here are the driver updates queued up for 5.19. This contains: - NVMe pull requests via Christoph: - tighten the PCI presence check (Stefan Roese) - fix a potential NULL pointer dereference in an error path (Kyle Miller Smith) - fix interpretation of the DMRSL field (Tom Yan) - relax the data transfer alignment (Keith Busch) - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni, Christoph) - set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni) - add support for TP4084 - Time-to-Ready Enhancements (Christoph) - MD pull request via Song: - Improve annotation in raid5 code, by Logan Gunthorpe - Support MD_BROKEN flag in raid-1/5/10, by Mariusz Tkaczyk - Other small fixes/cleanups - null_blk series making the configfs side much saner (Damien) - Various minor drbd cleanups and fixes (Haowen, Uladzislau, Jiapeng, Arnd, Cai) - Avoid using the system workqueue (and hence flushing it) in rnbd (Jack) - Avoid using the system workqueue (and hence flushing it) in aoe (Tetsuo) - Series fixing discard_alignment issues in drivers (Christoph) - Small series fixing drivers poking at disk->part0 for openers information (Christoph) - Series fixing deadlocks in loop (Christoph, Tetsuo) - Remove loop.h and add SPDX headers (Christoph) - Various fixes and cleanups (Julia, Xie, Yu)" * tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block: (72 commits) mtip32xx: fix typo in comment nvme: set non-mdts limits in nvme_scan_work nvme: add support for TP4084 - Time-to-Ready Enhancements nvme: split the enum used for various register constants nbd: Fix hung on disconnect request if socket is closed before nvme-fabrics: add a request timeout helper nvme-pci: harden drive presence detect in nvme_dev_disable() nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags nvme: mark internal passthru request RQF_QUIET nvme: remove unneeded include from constants file nvme: add missing status values to verbose logging nvme: set dma alignment to dword nvme: fix interpretation of DMRSL loop: remove most the top-of-file boilerplate comment from the UAPI header loop: remove most the top-of-file boilerplate comment loop: add a SPDX header loop: remove loop.h block: null_blk: Improve device creation with configfs block: null_blk: Cleanup messages block: null_blk: Cleanup device creation and deletion ...
2022-05-23Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains: - blk-throttle accounting fix (Laibin) - Series removing redundant assignments (Michal) - Expose bio cache via the bio_set, so that DM can use it (Mike) - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph) - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph) - Clean up the blk_execute_rq* API (Christoph) - Clean up various lose end in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph) - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan) - BFQ fixes (Jan) - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)" * tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits) blk-mq: fix typo in comment bfq: Remove bfq_requeue_request_body() bfq: Remove superfluous conversion from RQ_BIC() bfq: Allow current waker to defend against a tentative one bfq: Relax waker detection for shared queues blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() blk-throttle: Set BIO_THROTTLED when bio has been throttled blk-cgroup: Remove unnecessary rcu_read_lock/unlock() blk-cgroup: always terminate io.stat lines block, bfq: make bfq_has_work() more accurate block, bfq: protect 'bfqd->queued' by 'bfqd->lock' block: cleanup the VM accounting in submit_bio block: Fix the bio.bi_opf comment block: reorder the REQ_ flags blk-iocost: combine local_stat and desc_stat to stat block: improve the error message from bio_check_eod block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone block: remove superfluous calls to blkcg_bio_issue_init kthread: unexport kthread_blkcg blk-cgroup: cleanup blkcg_maybe_throttle_current ...
2022-05-22md: fix double free of io_acct_set biosetXiao Ni
Now io_acct_set is alloc and free in personality. Remove the codes that free io_acct_set in md_free and md_stop. Fixes: 0c031fd37f69 (md: Move alloc/free acct bioset in to personality) Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2022-05-22md: Don't set mddev private to NULL in raid0 pers->freeXiao Ni
In normal stop process, it does like this: do_md_stop | __md_stop (pers->free(); mddev->private=NULL) | md_free (free mddev) __md_stop sets mddev->private to NULL after pers->free. The raid device will be stopped and mddev memory is free. But in reshape, it doesn't free the mddev and mddev will still be used in new raid. In reshape, it first sets mddev->private to new_pers and then runs old_pers->free(). Now raid0 sets mddev->private to NULL in raid0_free. The new raid can't work anymore. It will panic when dereference mddev->private because of NULL pointer dereference. It can panic like this: [63010.814972] kernel BUG at drivers/md/raid10.c:928! [63010.819778] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [63010.825011] CPU: 3 PID: 44437 Comm: md0_resync Kdump: loaded Not tainted 5.14.0-86.el9.x86_64 #1 [63010.833789] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.15.0 09/11/2020 [63010.841440] RIP: 0010:raise_barrier+0x161/0x170 [raid10] [63010.865508] RSP: 0018:ffffc312408bbc10 EFLAGS: 00010246 [63010.870734] RAX: 0000000000000000 RBX: ffffa00bf7d39800 RCX: 0000000000000000 [63010.877866] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa00bf7d39800 [63010.884999] RBP: 0000000000000000 R08: fffffa4945e74400 R09: 0000000000000000 [63010.892132] R10: ffffa00eed02f798 R11: 0000000000000000 R12: ffffa00bbc435200 [63010.899266] R13: ffffa00bf7d39800 R14: 0000000000000400 R15: 0000000000000003 [63010.906399] FS: 0000000000000000(0000) GS:ffffa00eed000000(0000) knlGS:0000000000000000 [63010.914485] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63010.920229] CR2: 00007f5cfbe99828 CR3: 0000000105efe000 CR4: 00000000003506e0 [63010.927363] Call Trace: [63010.929822] ? bio_reset+0xe/0x40 [63010.933144] ? raid10_alloc_init_r10buf+0x60/0xa0 [raid10] [63010.938629] raid10_sync_request+0x756/0x1610 [raid10] [63010.943770] md_do_sync.cold+0x3e4/0x94c [63010.947698] md_thread+0xab/0x160 [63010.951024] ? md_write_inc+0x50/0x50 [63010.954688] kthread+0x149/0x170 [63010.957923] ? set_kthread_struct+0x40/0x40 [63010.962107] ret_from_fork+0x22/0x30 Removing the code that sets mddev->private to NULL in raid0 can fix problem. Fixes: 0c031fd37f69 (md: Move alloc/free acct bioset in to personality) Reported-by: Fine Fan <ffan@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2022-05-22md: remove most calls to bdevnameChristoph Hellwig
Use the %pg format specifier to save on stack consumption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-05-22md: protect md_unregister_thread from reentrancyGuoqing Jiang
Generally, the md_unregister_thread is called with reconfig_mutex, but raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread, so md_unregister_thread can be called simulitaneously from two call sites in theory. Then after previous commit which remove the protection of reconfig_mutex for md_unregister_thread completely, the potential issue could be worse than before. Let's take pers_lock at the beginning of function to ensure reentrancy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-05-22md: don't unregister sync_thread with reconfig_mutex heldGuoqing Jiang
Unregister sync_thread doesn't need to hold reconfig_mutex since it doesn't reconfigure array. And it could cause deadlock problem for raid5 as follows: 1. process A tried to reap sync thread with reconfig_mutex held after echo idle to sync_action. 2. raid5 sync thread was blocked if there were too many active stripes. 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) which causes the number of active stripes can't be decreased. 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able to hold reconfig_mutex. More details in the link: https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t And add one parameter to md_reap_sync_thread since it could be called by dm-raid which doesn't hold reconfig_mutex. Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <song@kernel.org>
2022-05-16dax: add .recovery_write dax_operationJane Chu
Introduce dax_recovery_write() operation. The function is used to recover a dax range that contains poison. Typical use case is when a user process receives a SIGBUS with si_code BUS_MCEERR_AR indicating poison(s) in a dax range, in response, the user process issues a pwrite() to the page-aligned dax range, thus clears the poison and puts valid data in the range. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16dax: introduce DAX_RECOVERY_WRITE dax access modeJane Chu
Up till now, dax_direct_access() is used implicitly for normal access, but for the purpose of recovery write, dax range with poison is requested. To make the interface clear, introduce enum dax_access_mode { DAX_ACCESS, DAX_RECOVERY_WRITE, } where DAX_ACCESS is used for normal dax access, and DAX_RECOVERY_WRITE is used for dax recovery write. Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-11dm: pass NULL bdev to bio_alloc_cloneMike Snitzer
Most DM targets will remap the clone bio passed to their ->map function using bio_set_bdev(). So this change to pass NULL bdev to bio_alloc_clone avoids clone-time work that sets up resources for a bdev association that will not be used in practice (e.g. clone issued to underlying device will not use DM device's blk-cgroups resources). But clone->bi_bdev is still initialized following bio_alloc_clone to preserve DM target expectations that clone->bi_bdev will be set. Follow-up work is needed to audit DM targets to remove accesses to a clone->bi_bdev that the target didn't initialize with bio_set_dev(). Depends-on: 7ecc56c62b27 ("block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone") Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm cache metadata: remove unnecessary variable in __dump_mappingGuo Zhengkui
Fix the following coccicheck warning: drivers/md/dm-cache-metadata.c:1512:5-6: Unneeded variable: "r". Return "0" on line 1520. Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm mpath: provide high-resolution timer to HST for bio-basedGabriel Krisman Bertazi
The precision loss of reading IO start_time with jiffies_to_nsecs instead of using a high resolution timer degrades HST path prediction for BIO-based mpath on high load workloads. Below, I show the utilization percentage of a 10 disk multipath with asymmetrical disk access cost, while being exercised by a randwrite FIO benchmark with high submission queue depth (depth=64). It is possible to see that the HST path selection degrades heavily for high-iops in BIO-mpath, underutilizing the slower paths way beyond expected. This seems to be caused by the start_time truncation, which makes some IO to seem much slower than it actually is. In this scenario ST outperforms HST for bio-mpath, but not for mq-mpath, which already uses ktime_get_ns(). The third column shows utilization with this patch applied. It is easy to see that now HST prediction is much closer to the ideal distribution (calculated considering the real cost of each path). | | ST | HST (orig) | HST(ktime) | Best | | sdd | 0.17 | 0.20 | 0.17 | 0.18 | | sde | 0.17 | 0.20 | 0.17 | 0.18 | | sdf | 0.17 | 0.20 | 0.17 | 0.18 | | sdg | 0.06 | 0.00 | 0.06 | 0.04 | | sdh | 0.03 | 0.00 | 0.03 | 0.02 | | sdi | 0.03 | 0.00 | 0.03 | 0.02 | | sdj | 0.02 | 0.00 | 0.01 | 0.01 | | sdk | 0.02 | 0.00 | 0.01 | 0.01 | | sdl | 0.17 | 0.20 | 0.17 | 0.18 | | sdm | 0.17 | 0.20 | 0.17 | 0.18 | This issue was originally discussed [1] when we first merged HST, and this patch was left as a low hanging fruit to be solved later. Regarding the implementation, as suggested by Mike in that mail thread, in order to avoid the overhead of ktime_get_ns for other selectors, this patch adds a flag for the selector code to request the high-resolution timer. I tested this using the same benchmark used in the original HST submission. Full test and benchmark scripts are available here: https://people.collabora.com/~krisman/HST-BIO-MPATH/ [1] https://lore.kernel.org/lkml/85tv0am9de.fsf@collabora.com/T/ Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> [snitzer: cleaned up various implementation details] Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm crypt: make printing of the key constant-timeMikulas Patocka
The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to report the current key to userspace. However, this is not a constant-time operation and it may leak information about the key via timing, via cache access patterns or via the branch predictor. Change dm-crypt's key printing to use "%c" instead of "%02x". Also introduce hex2asc() that carefully avoids any branching or memory accesses when converting a number in the range 0 ... 15 to an ascii character. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm integrity: fix error code in dm_integrity_ctr()Dan Carpenter
The "r" variable shadows an earlier "r" that has function scope. It means that we accidentally return success instead of an error code. Smatch has a warning for this: drivers/md/dm-integrity.c:4503 dm_integrity_ctr() warn: missing error code 'r' Fixes: 7eada909bfd7 ("dm: add integrity target") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm stats: add cond_resched when looping over entriesMikulas Patocka
dm-stats can be used with a very large number of entries (it is only limited by 1/4 of total system memory), so add rescheduling points to the loops that iterate over the entries. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: improve abnormal bio processingMike Snitzer
Read/write/flush are the most common operations, optimize switch in is_abnormal_io() for those cases. Follows same pattern established in block perf-wip commit ("block: optimise blk_may_split for normal rw") Also, push is_abnormal_io() check and blk_queue_split() down from dm_submit_bio() to dm_split_and_process_bio() and set new 'is_abnormal_io' flag in clone_info. Optimize __split_and_process_bio and __process_abnormal_io by leveraging ci.is_abnormal_io flag. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: simplify bio-based IO accounting furtherMike Snitzer
Now that io splitting is recorded prior to, or during, ->map IO accounting can happen immediately rather than defer until after bio splitting in dm_split_and_process_bio(). Remove the DM_IO_START_ACCT flag and also remove dm_io's map_task member because there is no longer any need to wait for splitting to occur before accounting. Also move dm_io struct's 'flags' member to consolidate struct holes. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: put all polled dm_io instances into a single listMing Lei
Now that bio_split() isn't used by DM's bio splitting, it is a bit overkill to link dm_io into an hlist given there is only single dm_io in the list. Convert to using a single list for holding all dm_io instances associated with this bio. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: improve dm_io reference countingMing Lei
Currently each dm_io's reference counter is grabbed before calling __map_bio(), this way isn't efficient since we can move this grabbing to initialization time inside alloc_io(). Meantime it becomes typical async io reference counter model: one is for submission side, the other is for completion side, and the io won't be completed until both sides are done. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: don't grab target io reference in dm_zone_map_bioMing Lei
dm_zone_map_bio() is only called from __map_bio in which the io's reference is grabbed already, and the reference won't be released until the bio is submitted, so not necessary to do it dm_zone_map_bio any more. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: improve bio splitting and associated IO accountingMing Lei
The current DM code (ab)uses late assignment of dm_io->orig_bio (after __map_bio() returns and any bio splitting is complete) to indicate the FS bio has been processed and can be accounted. This results in awkward waiting until ->orig_bio is set in dm_submit_bio_remap(). Also the bio splitting was implemented using bio_split()+bio_chain() -- a well-worn pattern but it requires bio cloning purely for the benefit of more natural IO accounting. The bio_split() result was stored in ->orig_bio to represent the mapped part of the original FS bio. DM has switched to the bdev based IO accounting interface. DM's IO accounting can be implemented in terms of the original FS bio (now stored early in ->orig_bio) via access to its sectors/bio_op. And if/when splitting is needed, set a new DM_IO_WAS_SPLIT flag and use new dm_io fields of .sector_offset & .sectors to allow IO accounting for split bios _without_ needing to clone a new bio to store in ->orig_bio. Signed-off-by: Ming Lei <ming.lei@redhat.com> Co-developed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: switch to bdev based IO accounting interfacesMing Lei
DM splits flush with data into empty flush followed by bio with data payload, switch dm_io_acct() to use bdev_{start,end}_io_acct() to do this accoiunting more naturally (rather than temporarily changing the bio's bi_size). This will allow DM to more easily account bios that are split (in following commit). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: pass dm_io instance to dm_io_acct directlyMing Lei
All the other 4 parameters are retrieved from the 'dm_io' instance, so it's not necessary to pass all four to dm_io_acct(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: don't pass bio to __dm_start_io_acct and dm_end_io_acctMing Lei
dm->orig_bio is always passed to __dm_start_io_acct and dm_end_io_acct, so it isn't necessary to take one bio parameter for the two helpers. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: use bio_sectors in dm_aceept_partial_bioMike Snitzer
Rename 'bi_size' to 'bio_sectors' given bi_size is being stored in sectors. Also, use bio_sectors() rather than open-coding it. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: simplify basic targetsMike Snitzer
Remove needless factoring and remap bi_sector regardless of bio_sectors() being non-zero. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: conditionally enable branching for less used featuresMike Snitzer
Use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bioMike Snitzer
If a bio is marked REQ_NOWAIT optimize dm_submit_bio()'s dm_table RCU usage to dm_{get,put}_live_table_fast. DM core offers protection against blocking (via suspend) if REQ_NOWAIT. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: move hot dm_io members to same cacheline as dm_target_ioMike Snitzer
Just saves some cacheline bouncing for members accessed during cloned bio submission and completion. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: add local variables to clone_endio and __map_bioMike Snitzer
Avoid redundant dereferences in both functions. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: mark various branches unlikelyMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: simplify dm_start_io_acctMike Snitzer
Pull common DM_IO_ACCOUNTED check out to beginning of dm_start_io_acct. Also, use dm_tio_is_normal (and move it to dm-core.h). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: simplify dm_io access in dm_split_and_process_bioMike Snitzer
Use local variable instead of redudant access using ci.io Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: factor out dm_io_set_error and __dm_io_dec_pendingMike Snitzer
Also eliminate need to use errno_to_blk_status(). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io biosetMike Snitzer
A bioset's per-cpu alloc cache may have broader utility in the future but for now constrain it to being tightly coupled to QUEUE_FLAG_POLL. Also change dm_io_complete() to use bio_clear_polled() so that it properly clears all associated bio state on requeue. This commit improves DM's hipri bio polling (REQ_POLLED) perf by 7 - 20% depending on the system. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-03raid5: don't set the discard_alignment queue limitChristoph Hellwig
The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by raid5 is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-03dm-zoned: don't set the discard_alignment queue limitChristoph Hellwig
The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by dm-zoned is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>