Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.
- fix various targets to dm_register_target after module __init
resources created; otherwise racing lvm2 commands could result in a
NULL pointer during initialization of associated DM kernel module.
- fix regression in bio-based DM multipath queue_if_no_path handling.
- fix DM bufio's shrinker to reclaim more than one buffer per scan.
* tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
dm mpath: fix bio-based multipath queue_if_no_path handling
dm: fix various targets to dm_register_target after module __init resources created
dm table: fix regression from improper dm_dev_internal.count refcount_t conversion
|
|
When system is under memory pressure it is observed that dm bufio
shrinker often reclaims only one buffer per scan. This change fixes
the following two issues in dm bufio shrinker that cause this behavior:
1. ((nr_to_scan - freed) <= retain_target) condition is used to
terminate slab scan process. This assumes that nr_to_scan is equal
to the LRU size, which might not be correct because do_shrink_slab()
in vmscan.c calculates nr_to_scan using multiple inputs.
As a result when nr_to_scan is less than retain_target (64) the scan
will terminate after the first iteration, effectively reclaiming one
buffer per scan and making scans very inefficient. This hurts vmscan
performance especially because mutex is acquired/released every time
dm_bufio_shrink_scan() is called.
New implementation uses ((LRU size - freed) <= retain_target)
condition for scan termination. LRU size can be safely determined
inside __scan() because this function is called after dm_bufio_lock().
2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to
determine number of freeable objects in the slab. However dm_bufio
always retains retain_target buffers in its LRU and will terminate
a scan when this mark is reached. Therefore returning the entire LRU size
from dm_bufio_shrink_count() is misleading because that does not
represent the number of freeable objects that slab will reclaim during
a scan. Returning (LRU size - retain_target) better represents the
number of freeable objects in the slab. This way do_shrink_slab()
returns 0 when (LRU size < retain_target) and vmscan will not try to
scan this shrinker avoiding scans that will not reclaim any memory.
Test: tested using Android device running
<AOSP>/system/extras/alloc-stress that generates memory pressure
and causes intensive shrinker scans
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Commit ca5beb76 ("dm mpath: micro-optimize the hot path relative to
MPATHF_QUEUE_IF_NO_PATH") caused bio-based DM-multipath to fail mptest's
"test_02_sdev_delete".
Restoring the logic that existed prior to commit ca5beb76 fixes this
bio-based DM-multipath regression. Also verified all mptest tests pass
with request-based DM-multipath.
This commit effectively reverts commit ca5beb76 -- but it does so
without reintroducing the need to take the m->lock spinlock in
must_push_back_{rq,bio}.
Fixes: ca5beb76 ("dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
created
A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
processes race to load the dm-thin-pool module:
PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
#0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992
0000002 [ffff883cd743d730] oops_end at ffffffff81515c90
0000003 [ffff883cd743d760] no_context at ffffffff81049f1b
0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce
0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae
0000008 [ffff883cd743d980] page_fault at ffffffff81514f95
[exception RIP: kmem_cache_alloc+108]
RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083
0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4
0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]
The race results in a NULL pointer because:
Process A (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target not registered;
c. modprobe dm_thin_pool and wait until end.
Process B (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target registered;
c. table_load->dm_table_add_target->pool_ctr;
d. _new_mapping_cache is NULL and panic.
Note:
1. process A and process B are two concurrent processes.
2. pool_target can be detected by process B but
_new_mapping_cache initialization has not ended.
To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
with the same problem, simply dm_register_target() after all resources
created during module init (as labelled with __init) are finished.
Cc: stable@vger.kernel.org
Signed-off-by: monty <monty_pavel@sina.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
conversion
Multiple refcounts are needed if the device was already added. The
micro-optimization of setting the refcount to 1 on first added (rather
than fall thru to a common refcount_inc) lost sight of the fact that the
refcount_inc is also needed for the case when the device already exists
and the mode need not be upgraded.
Fixes: 2a0b4682e0 ("dm: convert dm_dev_internal.count from atomic_t to refcount_t")
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
flush_pending_writes isn't always called with block plug, so add it, and plug
works in nested way.
Signed-off-by: Shaohua Li <shli@fb.com>
|
|
There is a small window near the end of md_do_sync where mddev->curr_resync
can be equal to MaxSector.
If status_resync is called during this window, the resulting /proc/mdstat
output contains a HUGE number of = signs due to the very large curr_resync:
Personalities : [raid1]
md123 : active raid1 sdd3[2] sdb3[0]
204736 blocks super 1.0 [2/1] [U_]
[=====================================================================
... (82 MB more) ...
================>] recovery =429496729.3% (9223372036854775807/204736)
finish=0.2min speed=12796K/sec
bitmap: 0/1 pages [0KB], 65536KB chunk
Modify status_resync to ensure the resync variable doesn't exceed
the array's max_sectors.
Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|
r5c_journal_mode_set() is called by r5c_journal_mode_store() and
raid_ctr() in dm-raid. We don't need mddev_lock() when calling from
raid_ctr(). This patch fixes this by moves the mddev_lock() to
r5c_journal_mode_store().
Cc: stable@vger.kernel.org (v4.13+)
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|
When disk failure occurs on new disks for reshape, mddev->degraded
is not calculated correctly. Faulty bit of the failure device is not
set before raid5_calc_degraded(conf).
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/loop[012]
mdadm /dev/md0 -a /dev/loop3
mdadm /dev/md0 --grow -n4
mdadm /dev/md0 -f /dev/loop3 # simulating disk failure
cat /sys/block/md0/md/degraded # it outputs 0, but it should be 1.
However, mdadm -D /dev/md0 will show that it is degraded. It's a bug.
It can be fixed by moving the resources raid5_calc_degraded() depends
on before it.
Reported-by: Roy Chung <roychung@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|
|
register_shrinker is now __must_check, so check it to kill a warning.
Caller of bch_btree_cache_alloc in super.c appropriately checks return
value so this is fully plumbed through.
This V2 fixes checkpatch warnings and improves the commit description,
as I was too hasty getting the previous version out.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Vojtech Pavlik <vojtech@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache(see the commit in the tail
of cache_look_up(), the following explaination just copy from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)
It should be noted that cache read race happened under normal
circumstances, not the circumstance when SSD failed, it was counted
and shown in /sys/fs/bcache/XXX/internal/cache_read_races.
Without this patch, when we use writeback mode, we will never reread from
the backing device when cache read race happened, until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
will be passed up, at last, user will receive -EINTR when it's bio end,
this is not suitable, and wield to up-application.
In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in cache device, it is safe to reread data from
the backing device when the read request hit clean data. This can not
only handle cache read race, but also recover data when failed read
request from cache device.
[edited by mlyle to fix up whitespace, commit log title, comment
spelling]
Fixes: d59b23795933 ("bcache: only permit to recovery read error when cache device is clean")
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the PTR macro, which conflicts with the PTR macro
in include/uapi/linux/bcache.h.
[fixed by mlyle: corrected a line-length issue]
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Journal bucket is a circular buffer, the bucket
can be like YYYNNNYY, which means the first valid journal in
the 7th bucket, and the latest valid journal in third bucket, in
this case, if we do not try we the zero index first, We
may get a valid journal in the 7th bucket, then we call
find_next_bit(bitmap,ca->sb.njournal_buckets, l + 1) to get the
first invalid bucket after the 7th bucket, because all these
buckets is valid, so no bit 1 in bitmap, thus find_next_bit()
function would return with ca->sb.njournal_buckets (8). So, after
that, bcache only read journal in 7th and 8the bucket,
the first to the third buckets are lost.
So, it is important to let developer know that, we need to try
the zero index at first in the hash-search, and avoid any breaks
in future's code modification.
[ML: Fixed whitespace & formatting & file permissions]
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull more block layer updates from Jens Axboe:
"A followup pull request, with some parts that either needed a bit more
testing before going in, merge sync, or just later arriving fixes.
This contains:
- Timer related updates from Kees. These were purposefully delayed
since I didn't want to pull in a later v4.14-rc tag to my block
tree.
- ide-cd prep sense buffer fix from Bart. Also delayed, as not to
clash with the late fix we put into 4.14-rc.
- Small BFQ updates series from Luca and Paolo.
- Single nvmet fix from James, fixing a non-functional case there.
- Bio fast clone fix from Michael, which made bcache return the wrong
data for some cases.
- Legacy IO path regression hang fix from Ming"
* 'for-linus' of git://git.kernel.dk/linux-block:
bio: ensure __bio_clone_fast copies bi_partno
nvmet_fc: fix better length checking
block: wake up all tasks blocked in get_request()
block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP
block, bfq: update blkio stats outside the scheduler lock
block, bfq: add missing invocations of bfqg_stats_update_io_add/remove
doc, block, bfq: update max IOPS sustainable with BFQ
ide: Make ide_cdrom_prep_fs() initialize the sense buffer pointer
md: Convert timers to use timer_setup()
block: swim3: Convert timers to use timer_setup()
block/aoe: Convert timers to use timer_setup()
amifloppy: Convert timers to use timer_setup()
block/floppy: Convert callback to pass timer_list
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper updates from Mike Snitzer:
"Given your expected travel I figured I'd get these fixes to you sooner
rather than later.
- a DM multipath stable@ fix to silence an annoying error message
that isn't _really_ an error
- a DM core @stable fix for discard support that was enabled for an
entire DM device despite only having partial support for discards
due to a mix of discard capabilities across the underlying devices.
- a couple other DM core discard fixes.
- a DM bufio @stable fix that resolves a 32-bit overflow"
* tag 'for-4.15/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix integer overflow when limiting maximum cache size
dm: clear all discard attributes in queue_limits when discards are disabled
dm: do not set 'discards_supported' in targets that do not need it
dm: discard support requires all targets in a table support discards
dm mpath: remove annoying message of 'blk_get_request() returned -11'
|
|
The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression
(VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100
overflows, causing the wrong result to be computed. For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem). This causes
severe performance problems for dm-verity users on affected systems.
Fix this by using mult_frac() to correctly multiply by a percentage. Do
this for all places in dm-bufio that multiply by a percentage. Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.
Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f057f2 ("dm: add bufio")
Cc: <stable@vger.kernel.org> # v3.2+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Otherwise, it can happen that the QUEUE_FLAG_DISCARD isn't set but the
various discard attributes (which get exposed via sysfs) may be set.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The DM target's 'discards_supported' flag is intended to act as an
override. Meaning, even if the underlying storage doesn't support
discards the DM target will.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
A DM device with a mix of discard capabilities (due to some underlying
devices not having discard support) _should_ just return -EOPNOTSUPP for
the region of the device that doesn't support discards (even if only by
way of the underlying driver formally not supporting discards). BUT,
that does ask the underlying driver to handle something that it never
advertised support for. In doing so we're exposing users to the
potential for a underlying disk driver hanging if/when a discard is
issued a the device that is incapable and never claimed to support
discards.
Fix this by requiring that each DM target in a DM table provide discard
support as a prereq for a DM device to advertise support for discards.
This may cause some configurations that were happily supporting discards
(even in the face of a mix of discard support) to stop supporting
discards -- but the risk of users hitting driver hangs, and forced
reboots, outweighs supporting those fringe mixed discard
configurations.
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
It is very normal to see allocation failure, especially with blk-mq
request_queues, so it's unnecessary to report this error and annoy
people.
In practice this 'blk_get_request() returned -11' error gets logged
quite frequently when a blk-mq DM multipath device sees heavy IO.
This change is marked for stable@ because the annoying message in
question was included in stable@ commit 7083abbbf.
Fixes: 7083abbbf ("dm mpath: avoid that path removal can trigger an infinite loop")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
"Summary of modules changes for the 4.15 merge window:
- treewide module_param_call() cleanup, fix up set/get function
prototype mismatches, from Kees Cook
- minor code cleanups"
* tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: Do not paper over type mismatches in module_param_call()
treewide: Fix function prototypes for module_param_call()
module: Prepare to convert all module_param_call() prototypes
kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
|
|
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: linux-bcache@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO updates from Linus Walleij:
"This is the bulk of GPIO changes for the v4.15 kernel cycle:
Core:
- Fix the semantics of raw GPIO to actually be raw. No inversion
semantics as before, but also no open draining, and allow the raw
operations to affect lines used for interrupts as the caller
supposedly knows what they are doing if they are getting the big
hammer.
- Rewrote the __inner_function() notation calls to names that make
more sense. I just find this kind of code disturbing.
- Drop the .irq_base() field from the gpiochip since now all IRQs are
mapped dynamically. This is nice.
- Support for .get_multiple() in the core driver API. This allows us
to read several GPIO lines with a single register read. This has
high value for some usecases: it can be used to create
oscilloscopes and signal analyzers and other things that rely on
reading several lines at exactly the same instant. Also a generally
nice optimization. This uses the new assign_bit() macro from the
bitops lib that was ACKed by Andrew Morton and is implemented for
two drivers, one of them being the generic MMIO driver so everyone
using that will be able to benefit from this.
- Do not allow requests of Open Drain and Open Source setting of a
GPIO line simultaneously. If the hardware actually supports
enabling both at the same time the electrical result would be
disastrous.
- A new interrupt chip core helper. This will be helpful to deal with
"banked" GPIOs, which means GPIO controllers with several logical
blocks of GPIO inside them. This is several gpiochips per device in
the device model, in contrast to the case when there is a 1-to-1
relationship between a device and a gpiochip.
New drivers:
- Maxim MAX3191x industrial serializer, a very interesting piece of
professional I/O hardware.
- Uniphier GPIO driver. This is the GPIO block from the recent
Socionext (ex Fujitsu and Panasonic) platform.
- Tegra 186 driver. This is based on the new banked GPIO
infrastructure.
Other improvements:
- Some documentation improvements.
- Wakeup support for the DesignWare DWAPB GPIO controller.
- Reset line support on the DesignWare DWAPB GPIO controller.
- Several non-critical bug fixes and improvements for the Broadcom
BRCMSTB driver.
- Misc non-critical bug fixes like exotic errorpaths, removal of dead
code etc.
- Explicit comments on fall-through switch() statements"
* tag 'gpio-v4.15-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (65 commits)
gpio: tegra186: Remove tegra186_gpio_lock_class
gpio: rcar: Add r8a77995 (R-Car D3) support
pinctrl: bcm2835: Fix some merge fallout
gpio: Fix undefined lock_dep_class
gpio: Automatically add lockdep keys
gpio: Introduce struct gpio_irq_chip.first
gpio: Disambiguate struct gpio_irq_chip.nested
gpio: Add Tegra186 support
gpio: Export gpiochip_irq_{map,unmap}()
gpio: Implement tighter IRQ chip integration
gpio: Move lock_key into struct gpio_irq_chip
gpio: Move irq_valid_mask into struct gpio_irq_chip
gpio: Move irq_nested into struct gpio_irq_chip
gpio: Move irq_chained_parent to struct gpio_irq_chip
gpio: Move irq_default_type to struct gpio_irq_chip
gpio: Move irq_handler to struct gpio_irq_chip
gpio: Move irqdomain into struct gpio_irq_chip
gpio: Move irqchip into struct gpio_irq_chip
gpio: Introduce struct gpio_irq_chip
pinctrl: armada-37xx: remove unused variable
...
|
|
Pull MD update from Shaohua Li:
"This update mostly includes bug fixes:
- md-cluster now supports raid10 from Guoqing
- raid5 PPL fixes from Artur
- badblock regression fix from Bo
- suspend hang related fixes from Neil
- raid5 reshape fixes from Neil
- raid1 freeze deadlock fix from Nate
- memleak fixes from Zdenek
- bitmap related fixes from Me and Tao
- other fixes and cleanups"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (33 commits)
md: free unused memory after bitmap resize
md: release allocated bitset sync_set
md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb
md: be cautious about using ->curr_resync_completed for ->recovery_offset
badblocks: fix wrong return value in badblocks_set if badblocks are disabled
md: don't check MD_SB_CHANGE_CLEAN in md_allow_write
md-cluster: update document for raid10
md: remove redundant variable q
raid1: remove obsolete code in raid1_write_request
md-cluster: Use a small window for raid10 resync
md-cluster: Suspend writes in RAID10 if within range
md-cluster/raid10: set "do_balance = 0" if area is resyncing
md: use lockdep_assert_held
raid1: prevent freeze_array/wait_all_barriers deadlock
md: use TASK_IDLE instead of blocking signals
md: remove special meaning of ->quiesce(.., 2)
md: allow metadata update while suspending.
md: use mddev_suspend/resume instead of ->quiesce()
md: move suspend_hi/lo handling into core md code
md: don't call bitmap_create() while array is quiesced.
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- a few conversions from atomic_t to ref_count_t
- a DM core fix for a race during device destruction that could result
in a BUG_ON
- a stable@ fix for a DM cache race condition that could lead to data
corruption when operating in writeback mode (writethrough is default)
- various DM cache cleanups and improvements
- add DAX support to the DM log-writes target
- a fix for the DM zoned target's ability to deal with the last zone of
the drive being smaller than all others
- a stable@ DM crypt and DM integrity fix for a negative check that was
to restrictive (prevented slab debug with XFS ontop of DM crypt from
working)
- a DM raid target fix for a panic that can occur when forcing a raid
to sync
* tag 'for-4.15/dm' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (25 commits)
dm cache: lift common migration preparation code to alloc_migration()
dm cache: remove usused deferred_cells member from struct cache
dm cache policy smq: allocate cache blocks in order
dm cache policy smq: change max background work from 10240 to 4096 blocks
dm cache background tracker: limit amount of background work that may be issued at once
dm cache policy smq: take origin idle status into account when queuing writebacks
dm cache policy smq: handle races with queuing background_work
dm raid: fix panic when attempting to force a raid to sync
dm integrity: allow unaligned bv_offset
dm crypt: allow unaligned bv_offset
dm: small cleanup in dm_get_md()
dm: fix race between dm_get_from_kobject() and __dm_destroy()
dm: allocate struct mapped_device with kvzalloc
dm zoned: ignore last smaller runt zone
dm space map metadata: use ARRAY_SIZE
dm log writes: add support for DAX
dm log writes: add support for inline data buffers
dm cache: simplify get_per_bio_data() by removing data_size argument
dm cache: remove all obsolete writethrough-specific code
dm cache: submit writethrough writes in parallel to origin and cache
...
|
|
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.
Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:
- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.
- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.
- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)
- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)
- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).
- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).
- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).
- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).
- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).
- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.
- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.
- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"
* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"Here is the crypto update for 4.15:
API:
- Disambiguate EBUSY when queueing crypto request by adding ENOSPC.
This change touches code outside the crypto API.
- Reset settings when empty string is written to rng_current.
Algorithms:
- Add OSCCA SM3 secure hash.
Drivers:
- Remove old mv_cesa driver (replaced by marvell/cesa).
- Enable rfc3686/ecb/cfb/ofb AES in crypto4xx.
- Add ccm/gcm AES in crypto4xx.
- Add support for BCM7278 in iproc-rng200.
- Add hash support on Exynos in s5p-sss.
- Fix fallback-induced error in vmx.
- Fix output IV in atmel-aes.
- Fix empty GCM hash in mediatek.
Others:
- Fix DoS potential in lib/mpi.
- Fix potential out-of-order issues with padata"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits)
lib/mpi: call cond_resched() from mpi_powm() loop
crypto: stm32/hash - Fix return issue on update
crypto: dh - Remove pointless checks for NULL 'p' and 'g'
crypto: qat - Clean up error handling in qat_dh_set_secret()
crypto: dh - Don't permit 'key' or 'g' size longer than 'p'
crypto: dh - Don't permit 'p' to be 0
crypto: dh - Fix double free of ctx->p
hwrng: iproc-rng200 - Add support for BCM7278
dt-bindings: rng: Document BCM7278 RNG200 compatible
crypto: chcr - Replace _manual_ swap with swap macro
crypto: marvell - Add a NULL entry at the end of mv_cesa_plat_id_table[]
hwrng: virtio - Virtio RNG devices need to be re-registered after suspend/resume
crypto: atmel - remove empty functions
crypto: ecdh - remove empty exit()
MAINTAINERS: update maintainer for qat
crypto: caam - remove unused param of ctx_map_to_sec4_sg()
crypto: caam - remove unneeded edesc zeroization
crypto: atmel-aes - Reset the controller before each use
crypto: atmel-aes - properly set IV after {en,de}crypt
hwrng: core - Reset user selected rng by writing "" to rng_current
...
|
|
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Previously, cache blocks were being allocated in reverse order. Fix
this by pulling the block off the head of the free list.
Shouldn't have any impact on performance or latency but it is more
correct to have the cache blocks allocated/mapped in ascending order.
This fix will slightly increase the chances of two adjacent oblocks
being in adjacent cblocks.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
10240 blocks was too much, lowering this reduces the latency of copying
and consumes less memory.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
issued at once
On large systems the cache policy can be over enthusiastic and queue far
too much dirty data to be written back. This consumes memory.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
writebacks
If the origin device is idle try and writeback more data.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The background_tracker holds a set of promotions/demotions that the
cache policy wishes the core target to implement.
When adding a new operation to the tracker it's possible that an
operation on the same block is already present (but in practise this
doesn't appear to be happening). Catch these situations and do the
appropriate cleanup.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Requesting a sync on an active raid device via a table reload
(see 'sync' parameter in Documentation/device-mapper/dm-raid.txt)
skips the super_load() call that defines the superblock size
(rdev->sb_size) -- resulting in an oops if/when super_sync()->memset()
is called.
Fix by moving the initialization of the superblock start and size
out of super_load() to the caller (analyse_superblocks).
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
this unaligned memory for its buffers (if an unaligned buffer crosses a
page, XFS frees it and allocates a full page instead - see the function
xfs_buf_allocate_memory).
dm-integrity checks if bv_offset is aligned on page size and this check
fail with slub_debug and XFS.
Fix this bug by removing the bv_offset check, leaving only the check for
bv_len.
Fixes: 7eada909bfd7 ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Bruno Prémont <bonbons@sysophe.eu>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
this unaligned memory for its buffers (if an unaligned buffer crosses a
page, XFS frees it and allocates a full page instead - see the function
xfs_buf_allocate_memory).
dm-crypt checks if bv_offset is aligned on page size and these checks
fail with slub_debug and XFS.
Fix this bug by removing the bv_offset checks. Switch to checking if
bv_len is aligned instead of bv_offset (this check should be sufficient
to prevent overruns if a bio with too small bv_len is received).
Fixes: 8f0009a22517 ("dm crypt: optionally support larger encryption sector size")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Bruno Prémont <bonbons@sysophe.eu>
Tested-by: Bruno Prémont <bonbons@sysophe.eu>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Makes dm_get_md() and dm_get_from_kobject() have similar code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The following BUG_ON was hit when testing repeat creation and removal of
DM devices:
kernel BUG at drivers/md/dm.c:2919!
CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
Call Trace:
[<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
[<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
[<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
[<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
[<ffffffff811de257>] kernfs_seq_show+0x23/0x25
[<ffffffff81199118>] seq_read+0x16f/0x325
[<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
[<ffffffff8117b625>] __vfs_read+0x26/0x9d
[<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
[<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
[<ffffffff8117be9d>] vfs_read+0x8f/0xcf
[<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
[<ffffffff8117c686>] SyS_read+0x4b/0x76
[<ffffffff817b606e>] system_call_fastpath+0x12/0x71
The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
between the test of DMF_FREEING & DMF_DELETING and dm_get() in
dm_get_from_kobject().
To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
dm_get() are done in an atomic way, so _minor_lock is used.
The other callers of dm_get() have also been checked to be OK: some
callers invoke dm_get() under _minor_lock, some callers invoke it under
_hash_lock, and dm_start_request() invoke it after increasing
md->open_count.
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The structure srcu_struct can be very big, its size is proportional to the
value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field
io_barrier in the struct mapped_device has 84kB in the debugging kernel
and 50kB in the non-debugging kernel. The large size may result in failure
of the function kzalloc_node.
In order to avoid the allocation failure, we use the function
kvzalloc_node, this function falls back to vmalloc if a large contiguous
chunk of memory is not available. This patch also moves the field
io_barrier to the last position of struct mapped_device - the reason is
that on many processor architectures, short memory offsets result in
smaller code than long memory offsets - on x86-64 it reduces code size by
320 bytes.
Note to stable kernel maintainers - the kernels 4.11 and older don't have
the function kvzalloc_node, you can use the function vzalloc_node instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The SCSI layer allows ZBC drives to have a smaller last runt zone. For
such a device, specifying the entire capacity for a dm-zoned target
table entry fails because the specified capacity is not aligned on a
device zone size indicated in the request queue structure of the
device.
Fix this problem by ignoring the last runt zone in the entry length
when seting up the dm-zoned target (ctr method) and when iterating table
entries of the target (iterate_devices method). This allows dm-zoned
users to still easily setup a target using the entire device capacity
(as mandated by dm-zoned) or the aligned capacity excluding the last
runt zone.
While at it, replace direct references to the device queue chunk_sectors
limit with calls to the accessor blk_queue_zone_sectors().
Reported-by: Peter Desnoyers <pjd@ccs.neu.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Using the ARRAY_SIZE macro improves the readability of the code.
Found with Coccinelle with the following semantic patch:
@r depends on (org || report)@
type T;
T[] E;
position p;
@@
(
(sizeof(E)@p /sizeof(*E))
|
(sizeof(E)@p /sizeof(E[...]))
|
(sizeof(E)@p /sizeof(T))
)
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Now that we have the ability log filesystem writes using a flat buffer, add
support for DAX.
The motivation for this support is the need for an xfstest that can test
the new MAP_SYNC DAX flag. By logging the filesystem activity with
dm-log-writes we can show that the MAP_SYNC page faults are writing out
their metadata as they happen, instead of requiring an explicit
msync/fsync.
Unfortunately we can't easily track data that has been written via
mmap() now that the dax_flush() abstraction was removed by commit
c3ca015fab6d ("dax: remove the pmem_dax_ops->flush abstraction").
Otherwise we could just treat each flush as a big write, and store the
data that is being synced to media. It may be worthwhile to add the
dax_flush() entry point back, just as a notifier so we can do this
logging.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Currently dm-log-writes supports writing filesystem data via BIOs, and
writing internal metadata from a flat buffer via write_metadata().
For DAX writes, though, we won't have a BIO, but will instead have an
iterator that we'll want to use to fill a flat data buffer.
So, create write_inline_data() which allows us to write filesystem data
using a flat buffer as a source, and wire it up in log_one_block().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
There is only one per_bio_data size now that writethrough-specific data
was removed from the per_bio_data structure.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Now that the writethrough code is much simpler there is no need to track
so much state or cascade bio submission (as was done, via
writethrough_endio(), to issue origin then cache IO in series).
As such the obsolete writethrough list and workqueue is also removed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Discontinue issuing writethrough write IO in series to the origin and
then cache.
Use bio_clone_fast() to create a new origin clone bio that will be
mapped to the origin device and then bio_chain() it to the bio that gets
remapped to the cache device. The origin clone bio does _not_ have a
copy of the per_bio_data -- as such check_if_tick_bio_needed() will not
be called.
The cache bio (parent bio) will not complete until the origin bio has
completed -- this fulfills bio_clone_fast()'s requirements as well as
the requirement to not complete the original IO until the write IO has
completed to both the origin and cache device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
No functional changes, just a bit cleaner than passing cache_features
structure.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
When a DM cache in writeback mode moves data between the slow and fast
device it can often avoid a copy if the triggering bio either:
i) covers the whole block (no point copying if we're about to overwrite it)
ii) the migration is a promotion and the origin block is currently discarded
Prior to this fix there was a race with case (ii). The discard status
was checked with a shared lock held (rather than exclusive). This meant
another bio could run in parallel and write data to the origin, removing
the discard state. After the promotion the parallel write would have
been lost.
With this fix the discard status is re-checked once the exclusive lock
has been aquired. If the block is no longer discarded it falls back to
the slower full copy path.
Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
When bitmap is resized, the old kalloced chunks just are not released
once the resized bitmap starts to use new space.
This fixes in particular kmemleak reports like this one:
unreferenced object 0xffff8f4311e9c000 (size 4096):
comm "lvm", pid 19333, jiffies 4295263268 (age 528.265s)
hex dump (first 32 bytes):
02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................
02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................
backtrace:
[<ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0
[<ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0
[<ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110
[<ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0
[<ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0
[<ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530
[<ffffffffa676f1ae>] bitmap_load+0xbe/0x160
[<ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid]
[<ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0
[<ffffffffa6774b52>] dm_resume+0x122/0x140
[<ffffffffa6779b9f>] dev_suspend+0x18f/0x290
[<ffffffffa677a3a7>] ctl_ioctl+0x287/0x560
[<ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20
[<ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750
[<ffffffffa62d7269>] SyS_ioctl+0x79/0x90
[<ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
|