Age | Commit message (Collapse) | Author |
|
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.
Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.
Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().
For another user of blk_recalc_rq_segments():
- run in partial completion branch of blk_update_request, which is an unusual case
- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
since dm-rq is the only user.
Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the
current setup of the I/O path, as it isn't going to save you a significant amount
of cycles.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bch_bio_alloc_pages() is always called on one new bio, so it is safe
to access the bvec table directly. Given it is the only kind of this
case, open code the bvec table access since bio_for_each_segment_all()
will be changed to support for iterating over multipage bvec.
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
iov_iter is implemented on bvec itererator helpers, so it is safe to pass
multi-page bvec to it, and this way is much more efficient than passing one
page in each bvec.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch fixes a race condition where a write is mapped to the last
sectors of a line. The write is synced to the device but the L2P is not
updated yet. When the line is garbage collected before the L2P update
is performed, the sectors are ignored by the GC logic and the line is
freed before all sectors are moved. When the L2P is finally updated, it
contains a mapping to a freed line, subsequent reads of the
corresponding LBAs fail.
This patch introduces a per line counter specifying the number of
sectors that are synced to the device but have not been updated in the
L2P. Lines with a counter of greater than zero will not be selected
for GC.
Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In order to respect mw_cuinits, pblk's write buffer maintains a
backpointer to protect data not yet persisted; when writing to the write
buffer, this backpointer defines a threshold that pblk's rate-limiter
enforces.
On small PU configurations, the following scenarios might take place: (i)
the threshold is larger than the write buffer and (ii) the threshold is
smaller than the write buffer, but larger than the maximun allowed
split bio - 256KB at this moment (Note that writes are not always
split - we only do this when we the size of the buffer is smaller
than the buffer). In both cases, pblk's rate-limiter prevents the I/O to
be written to the buffer, thus stalling.
This patch fixes the original backpointer implementation by considering
the threshold both on buffer creation and on the rate-limiters path,
when bio_split is triggered (case (ii) above).
Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on writer stall")
Signed-off-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
pblk stripes writes of minimal write size across all non-offline chunks
in a line, which means that the maximum write pointer delta should not
exceed the minimal write size.
Extend the line write pointer balance check to cover this case, and
ignore the offline chunk wps.
This will render us a warning during recovery if something unexpected
has happened to the chunk write pointers (i.e. powerloss, a spurious
chunk reset, ..).
Reported-by: Zhoujie Wu <zjwu@marvell.com>
Tested-by: Zhoujie Wu <zjwu@marvell.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As the comment block in include/trace/define_trace.h says,
TRACE_INCLUDE_PATH should be a relative path to the define_trace.h
../../drivers/lightnvm is the correct relative path.
../../../drivers/lightnvm is working by coincidence because the top
Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header
search path, but we should not rely on it.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are new types and helpers that are supposed to be used in new code.
As a preparation to get rid of legacy types and API functions do
the conversion here.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Sparse complains about using strict data types:
drivers/lightnvm/pblk-read.c:254:43: warning: incorrect type in assignment (different base types)
drivers/lightnvm/pblk-read.c:254:43: expected restricted __le64 <noident>
drivers/lightnvm/pblk-read.c:254:43: got unsigned long long [unsigned] [usertype] <noident>
drivers/lightnvm/pblk-read.c:255:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:268:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:328:41: warning: incorrect type in assignment (different base types)
drivers/lightnvm/pblk-read.c:328:41: expected restricted __le64 <noident>
drivers/lightnvm/pblk-read.c:328:41: got unsigned long long [unsigned] [usertype] <noident>
In the code it seems explicit that lba_list_mem and lba_list_media members
of struct pblk_pr_ctx are used on CPU side, which means they should not be
of strict types.
Change types of lba_list_mem and lba_list_media members to be u64.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As chunk metadata is allocated using vmalloc, we need to free it
using vfree.
Fixes: 090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
pblk_line_meta_free might sleep (it can end up calling vfree, depending
on how we allocate lba lists), and this can lead to a BUG()
if we wake up on a different cpu and release the lock.
As there is no point of grabbing the free lock when pblk has shut down,
remove the lock.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We have various helpers for setting/clearing this flag, and also
a helper to check if the queue supports queueable flushes or not.
But nobody uses them anymore, kill it with fire.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In 'commit 752f66a75aba ("bcache: use REQ_PRIO to indicate bio for
metadata")' REQ_META is replaced by REQ_PRIO to indicate metadata bio.
This assumption is not always correct, e.g. XFS uses REQ_META to mark
metadata bio other than REQ_PRIO. This is why Nix noticed that bcache
does not cache metadata for XFS after the above commit.
Thanks to Dave Chinner, he explains the difference between REQ_META and
REQ_PRIO from view of file system developer. Here I quote part of his
explanation from mailing list,
REQ_META is used for metadata. REQ_PRIO is used to communicate to
the lower layers that the submitter considers this IO to be more
important that non REQ_PRIO IO and so dispatch should be expedited.
IOWs, if the filesystem considers metadata IO to be more important
that user data IO, then it will use REQ_PRIO | REQ_META rather than
just REQ_META.
Then it seems bios with REQ_META or REQ_PRIO should both be cached for
performance optimation, because they are all probably low I/O latency
demand by upper layer (e.g. file system).
So in this patch, when we want to decide whether to bypass the cache,
REQ_META and REQ_PRIO are both checked. Then both metadata and
high priority I/O requests will be handled properly.
Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Andre Noll <maan@tuebingen.mpg.de>
Tested-by: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Cache set sysfs entry io_error_halflife is used to set c->error_decay.
c->error_decay is in type unsigned int, and it is converted by
strtoul_or_return(), therefore overflow to c->error_decay is possible
for a large input value.
This patch fixes the overflow by using strtoul_safe_clamp() to convert
input string to an unsigned long value in range [0, UINT_MAX], then
divides by 88 and set it to c->error_decay.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
c->error_limit is in type unsigned int, it is set via cache set sysfs
file io_error_limit. Inside the bcache code, input string is converted
by strtoul_or_return() and set the converted value to c->error_limit.
Because the converted value is unsigned long, and c->error_limit is
unsigned int, if the input is large enought, overflow will happen to
c->error_limit.
This patch uses sysfs_strtoul_clamp() to convert input string, and set
the range in [0, UINT_MAX] to avoid the potential overflow.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
c->journal_delay_ms is in type unsigned short, it is set via sysfs
interface and converted by sysfs_strtoul() from input string to
unsigned short value. Therefore overflow to unsigned short might be
happen when the converted value exceed USHRT_MAX. e.g. writing
65536 into sysfs file journal_delay_ms, c->journal_delay_ms is set to
0.
This patch uses sysfs_strtoul_clamp() to convert the input string and
limit value range in [0, USHRT_MAX], to avoid the input overflow.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
dc->writeback_rate_minimum is type unsigned integer variable, it is set
via sysfs interface, and converte from input string to unsigned integer
by d_strtoul_nonzero(). When the converted input value is larger than
UINT_MAX, overflow to unsigned integer happens.
This patch fixes the overflow by using sysfs_strotoul_clamp() to
convert input string and limit the value in range [1, UINT_MAX], then
the overflow can be avoided.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Current code already uses d_strtoul_nonzero() to convert input string
to an unsigned integer, to make sure writeback_rate_p_term_inverse
won't be zero value. But overflow may happen when converting input
string to an unsigned integer value by d_strtoul_nonzero(), then
dc->writeback_rate_p_term_inverse can still be set to 0 even if the
sysfs file input value is not zero, e.g. 4294967296 (a.k.a UINT_MAX+1).
If dc->writeback_rate_p_term_inverse is set to 0, it might cause a
dev-zero error in following code from __update_writeback_rate(),
int64_t proportional_scaled =
div_s64(error, dc->writeback_rate_p_term_inverse);
This patch replaces d_strtoul_nonzero() by sysfs_strtoul_clamp() and
limit the value range in [1, UINT_MAX]. Then the unsigned integer
overflow and dev-zero error can be avoided.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
dc->writeback_rate_i_term_inverse can be set via sysfs interface. It is
in type unsigned int, and convert from input string by d_strtoul(). The
problem is d_strtoul() does not check valid range of the input, if
4294967296 is written into sysfs file writeback_rate_i_term_inverse,
an overflow of unsigned integer will happen and value 0 is set to
dc->writeback_rate_i_term_inverse.
In writeback.c:__update_writeback_rate(), there are following lines of
code,
integral_scaled = div_s64(dc->writeback_rate_integral,
dc->writeback_rate_i_term_inverse);
If dc->writeback_rate_i_term_inverse is set to 0 via sysfs interface,
a div-zero error might be triggered in the above code.
Therefore we need to add a range limitation in the sysfs interface,
this is what this patch does, use sysfs_stroul_clamp() to replace
d_strtoul() and restrict the input range in [1, UINT_MAX].
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Sysfs file writeback_delay is used to configure dc->writeback_delay
which is type unsigned int. But bcache code uses sysfs_strtoul() to
convert the input string, therefore it might be overflowed if the input
value is too large. E.g. input value is 4294967296 but indeed 0 is
set to dc->writeback_delay.
This patch uses sysfs_strtoul_clamp() to convert the input string and
set the result value range in [0, UINT_MAX] to avoid such unsigned
integer overflow.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When setting bcache parameters via sysfs, there are some variables are
defined as bit-field value. Current bcache code in sysfs.c uses either
d_strtoul() or sysfs_strtoul() to convert the input string to unsigned
integer value and set it to the corresponded bit-field value.
The problem is, the bit-field value only takes the lowest bit of the
converted value. If input is 2, the expected value (like bool value)
of the bit-field value should be 1, but indeed it is 0.
The following sysfs files for bit-field variables have such problem,
bypass_torture_test, for dc->bypass_torture_test
writeback_metadata, for dc->writeback_metadata
writeback_running, for dc->writeback_running
verify, for c->verify
key_merging_disabled, for c->key_merging_disabled
gc_always_rewrite, for c->gc_always_rewrite
btree_shrinker_disabled,for c->shrinker_disabled
copy_gc_enabled, for c->copy_gc_enabled
This patch uses sysfs_strtoul_bool() to set such bit-field variables,
then if the converted value is non-zero, the bit-field variables will
be set to 1, like setting a bool value like expensive_debug_checks.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When setting bool values via sysfs interface, e.g. writeback_metadata,
if writing 1 into writeback_metadata file, dc->writeback_metadata is
set to 1, but if writing 2 into the file, dc->writeback_metadata is
0. This is misleading, a better result should be 1 for all non-zero
input value.
It is because dc->writeback_metadata is a bit-field variable, and
current code simply use d_strtoul() to convert a string into integer
and takes the lowest bit value. To fix such error, we need a routine
to convert the input string into unsigned integer, and set target
variable to 1 if the converted integer is non-zero.
This patch introduces a new macro called sysfs_strtoul_bool(), it can
be used to convert input string into bool value, we can use it to set
bool value for bit-field vairables.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
People may set sequential_cutoff of a cached device via sysfs file,
but current code does not check input value overflow. E.g. if value
4294967295 (UINT_MAX) is written to file sequential_cutoff, its value
is 4GB, but if 4294967296 (UINT_MAX + 1) is written into, its value
will be 0. This is an unexpected behavior.
This patch replaces d_strtoi_h() by sysfs_strtoul_clamp() to convert
input string to unsigned integer value, and limit its range in
[0, UINT_MAX]. Then the input overflow can be fixed.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Cache set congested threshold values congested_read_threshold_us and
congested_write_threshold_us can be set via sysfs interface. These
two values are 'unsigned int' type, but sysfs interface uses strtoul
to convert input string. So if people input a large number like
9999999999, the value indeed set is 1410065407, which is not expected
behavior.
This patch replaces sysfs_strtoul() by sysfs_strtoul_clamp() when
convert input string to unsigned int value, and set value range in
[0, UINT_MAX], to avoid the above integer overflow errors.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently sysfs_strtoul_clamp() is defined as,
82 #define sysfs_strtoul_clamp(file, var, min, max) \
83 do { \
84 if (attr == &sysfs_ ## file) \
85 return strtoul_safe_clamp(buf, var, min, max) \
86 ?: (ssize_t) size; \
87 } while (0)
The problem is, if bit width of var is less then unsigned long, min and
max may not protect var from integer overflow, because overflow happens
in strtoul_safe_clamp() before checking min and max.
To fix such overflow in sysfs_strtoul_clamp(), to make min and max take
effect, this patch adds an unsigned long variable, and uses it to macro
strtoul_safe_clamp() to convert an unsigned long value in range defined
by [min, max]. Then assign this value to var. By this method, if bit
width of var is less than unsigned long, integer overflow won't happen
before min and max are checking.
Now sysfs_strtoul_clamp() can properly handle smaller data type like
unsigned int, of cause min and max should be defined in range of
unsigned int too.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Stale && dirty keys can be produced in the follow way:
After writeback in write_dirty_finish(), dirty keys k1 will
replace by clean keys k2
==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
==>static int bch_btree_insert_node(struct btree *b,
struct btree_op *op,
struct keylist *insert_keys,
atomic_t *journal_ref,
Then two steps:
A) update k1 to k2 in btree node memory;
bch_btree_insert_keys(b, op, insert_keys, replace_key)
B) Write the bset(contains k2) to cache disk by a 30s delay work
bch_btree_leaf_dirty(b, journal_ref).
But before the 30s delay work write the bset to cache device,
these things happened:
A) GC works, and reclaim the bucket k2 point to;
B) Allocator works, and invalidate the bucket k2 point to,
and increase the gen of the bucket, and place it into free_inc
fifo;
C) Until now, the 30s delay work still does not finish work,
so in the disk, the key still is k1, it is dirty and stale
(its gen is smaller than the gen of the bucket). and then the
machine power off suddenly happens;
D) When the machine power on again, after the btree reconstruction,
the stale dirty key appear.
In bch_extent_bad(), when expensive_debug_checks is off, it would
treat the dirty key as good even it is stale keys, and it would
cause bellow probelms:
A) In read_dirty() it would cause machine crash:
BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
B) It could be worse when reads hits stale dirty keys, it would
read old incorrect data.
This patch tolerate the existence of these stale && dirty keys,
and treat them as bad key in bch_extent_bad().
(Coly Li: fix indent which was modified by sender's email client)
Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a hunk of code that is indented one level too deep, fix this
by removing the extra tabs.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When there are multiple bcache devices, after a reboot the name of
bcache devices may change (e.g. current /dev/bcache1 was /dev/bcache0
before reboot). Therefore we need the backing device UUID (sb.uuid) to
identify each bcache device.
Backing device uuid can be found by program bcache-super-show, but
directly exporting backing_dev_uuid by sysfs file
/sys/block/bcache<?>/bcache/backing_dev_uuid is a much simpler method.
With backing_dev_uuid, and partition uuids from /dev/disk/by-partuuid/,
now we can identify each bcache device and its partitions conveniently.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch export dc->backing_dev_name to sysfs file
/sys/block/bcache<?>/bcache/backing_dev_name, then people or user space
tools may know the backing device name of this bcache device.
Of cause it can be done by parsing sysfs links, but this method can be
much simpler to find the link between bcache device and backing device.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is
used in memset(). It is because in struct cache_stats, there are 7
atomic_t type members. This is not good when new members added into
struct stats, the hard coded number will only clear part of memory.
This patch replaces 'sizeof(unsigned long) * 7' by more generic
'sizeof(struct cache_stats))', to avoid potential error if new
member added into struct cache_stats.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Some users see panics like the following when performing fstrim on a
bcached volume:
[ 529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 530.183928] #PF error: [normal kernel read fault]
[ 530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
[ 530.750887] Oops: 0000 [#1] SMP PTI
[ 530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
[ 531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
[ 531.693137] RIP: 0010:blk_queue_split+0x148/0x620
[ 531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
[ 532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
[ 533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
[ 533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
[ 533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
[ 534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
[ 534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
[ 534.833319] FS: 00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
[ 535.224098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
[ 535.851759] Call Trace:
[ 535.970308] ? mempool_alloc_slab+0x15/0x20
[ 536.174152] ? bch_data_insert+0x42/0xd0 [bcache]
[ 536.403399] blk_mq_make_request+0x97/0x4f0
[ 536.607036] generic_make_request+0x1e2/0x410
[ 536.819164] submit_bio+0x73/0x150
[ 536.980168] ? submit_bio+0x73/0x150
[ 537.149731] ? bio_associate_blkg_from_css+0x3b/0x60
[ 537.391595] ? _cond_resched+0x1a/0x50
[ 537.573774] submit_bio_wait+0x59/0x90
[ 537.756105] blkdev_issue_discard+0x80/0xd0
[ 537.959590] ext4_trim_fs+0x4a9/0x9e0
[ 538.137636] ? ext4_trim_fs+0x4a9/0x9e0
[ 538.324087] ext4_ioctl+0xea4/0x1530
[ 538.497712] ? _copy_to_user+0x2a/0x40
[ 538.679632] do_vfs_ioctl+0xa6/0x600
[ 538.853127] ? __do_sys_newfstat+0x44/0x70
[ 539.051951] ksys_ioctl+0x6d/0x80
[ 539.212785] __x64_sys_ioctl+0x1a/0x20
[ 539.394918] do_syscall_64+0x5a/0x110
[ 539.568674] entry_SYSCALL_64_after_hwframe+0x44/0xa9
We have observed it where both:
1) LVM/devmapper is involved (bcache backing device is LVM volume) and
2) writeback cache is involved (bcache cache_mode is writeback)
On one machine, we can reliably reproduce it with:
# echo writeback > /sys/block/bcache0/bcache/cache_mode
(not sure whether above line is required)
# mount /dev/bcache0 /test
# for i in {0..10}; do
file="$(mktemp /test/zero.XXX)"
dd if=/dev/zero of="$file" bs=1M count=256
sync
rm $file
done
# fstrim -v /test
Observing this with tracepoints on, we see the following writes:
fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4260112 + 196352 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4456464 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4718608 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5324816 + 180224 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5505040 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5767184 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 6373392 + 180224 hit 1 bypass 0
<crash>
Note the final one has different hit/bypass flags.
This is because in should_writeback(), we were hitting a case where
the partial stripe condition was returning true and so
should_writeback() was returning true early.
If that hadn't been the case, it would have hit the would_skip test, and
as would_skip == s->iop.bypass == true, should_writeback() would have
returned false.
Looking at the git history from 'commit 72c270612bd3 ("bcache: Write out
full stripes")', it looks like the idea was to optimise for raid5/6:
* If a stripe is already dirty, force writes to that stripe to
writeback mode - to help build up full stripes of dirty data
To fix this issue, make sure that should_writeback() on a discard op
never returns true.
More details of debugging:
https://www.spinics.net/lists/linux-bcache/msg06996.html
Previous reports:
- https://bugzilla.kernel.org/show_bug.cgi?id=201051
- https://bugzilla.kernel.org/show_bug.cgi?id=196103
- https://www.spinics.net/lists/linux-bcache/msg06885.html
(Coly Li: minor modification to follow maximum 75 chars per line rule)
Cc: Kent Overstreet <koverstreet@google.com>
Cc: stable@vger.kernel.org
Fixes: 72c270612bd3 ("bcache: Write out full stripes")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The following traceback is sometimes seen when booting an image in qemu:
[ 54.608293] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 54.611085] Fusion MPT base driver 3.04.20
[ 54.611877] Copyright (c) 1999-2008 LSI Corporation
[ 54.616234] Fusion MPT SAS Host driver 3.04.20
[ 54.635139] sysctl duplicate entry: /dev/cdrom//info
[ 54.639578] CPU: 0 PID: 266 Comm: kworker/u4:5 Not tainted 5.0.0-rc5 #1
[ 54.639578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 54.641273] Workqueue: events_unbound async_run_entry_fn
[ 54.641273] Call Trace:
[ 54.641273] dump_stack+0x67/0x90
[ 54.641273] __register_sysctl_table+0x50b/0x570
[ 54.641273] ? rcu_read_lock_sched_held+0x6f/0x80
[ 54.641273] ? kmem_cache_alloc_trace+0x1c7/0x1f0
[ 54.646814] __register_sysctl_paths+0x1c8/0x1f0
[ 54.646814] cdrom_sysctl_register.part.7+0xc/0x5f
[ 54.646814] register_cdrom.cold.24+0x2a/0x33
[ 54.646814] sr_probe+0x4bd/0x580
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] really_probe+0xd6/0x260
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] driver_probe_device+0x4a/0xb0
[ 54.646814] ? __driver_attach+0xd0/0xd0
[ 54.646814] bus_for_each_drv+0x73/0xc0
[ 54.646814] __device_attach+0xd6/0x130
[ 54.646814] bus_probe_device+0x9a/0xb0
[ 54.646814] device_add+0x40c/0x670
[ 54.646814] ? __pm_runtime_resume+0x4f/0x80
[ 54.646814] scsi_sysfs_add_sdev+0x81/0x290
[ 54.646814] scsi_probe_and_add_lun+0x888/0xc00
[ 54.646814] ? scsi_autopm_get_host+0x21/0x40
[ 54.646814] __scsi_add_device+0x116/0x130
[ 54.646814] ata_scsi_scan_host+0x93/0x1c0
[ 54.646814] async_run_entry_fn+0x34/0x100
[ 54.646814] process_one_work+0x237/0x5e0
[ 54.646814] worker_thread+0x37/0x380
[ 54.646814] ? rescuer_thread+0x360/0x360
[ 54.646814] kthread+0x118/0x130
[ 54.646814] ? kthread_create_on_node+0x60/0x60
[ 54.646814] ret_from_fork+0x3a/0x50
The only sensible explanation is that cdrom_sysctl_register() is called
twice, once from the module init function and once from register_cdrom().
cdrom_sysctl_register() is not mutex protected and may happily execute
twice if the second call is made before the first call is complete.
Use a static atomic to ensure that the function is executed exactly once.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull MD changes for 5.1 from Song.
* 'md-next' of https://github.com/liu-song-6/linux:
raid1: simplify raid1_error function
md-linear: use struct_size() in kzalloc()
|
|
Remove redundance set_bit and let code simplify.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
One of the more common cases of allocation size calculations is finding the
size of a structure that has a zero-sized array at the end, along with memory
for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
It is used now just to flush error recovery and reconnect work items in
the RDMA and TCP transports, which can simply be moved to the
corresponding teardown routines.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Allow write zeroes operations (REQ_OP_WRITE_ZEROES) on the block
device, if the device supports an optional command bit set for write
zeroes. Add support to setup write zeroes command. Set maximum possible
write zeroes sectors in one write zeroes command according to
nvme write zeroes command definition.
This patch was posted as a part of block-write-zeroes support
implementation (https://patchwork.kernel.org/patch/9454859/),
but did not make into mainline kernel as it got reverted due to
failure on the Linus's machine.
In this patch in order to be more cautious, we use NVMe controller's
maximum hardware sector size which is calculated based on the
controller's MDTS (Maximum Data Transfer Size) field to calculate
the maximum sectors for the write zeroes request.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[folded a fix from Keith Busch to properly respect
NVME_QUIRK_DEALLOCATE_ZEROES]
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The mtip32xx driver uses managed resources for DMA coherent memory
and irqs, but then always pairs them with free calls anyway, making
the resource tracking rather pointless. Given some DMA allocations
are transient anyway, the irq freeing seems to require ordering vs
other hardware access the best solution seems to be to stop using
the managed resource API entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A small set of fixes for the interrupt subsystem:
- Fix a double increment in the irq descriptor allocator which
resulted in a sanity check only being done for every second
affinity mask
- Add a missing device tree translation in the stm32-exti driver.
Without that the interrupt association is completely wrong.
- Initialize the mutex in the GIC-V3 MBI driver
- Fix the alignment for aliasing devices in the GIC-V3-ITS driver so
multi MSI allocations work correctly
- Ensure that the initial affinity of a interrupt is not empty at
startup time.
- Drop bogus include in the madera irq chip driver
- Fix KernelDoc regression"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size
genirq/irqdesc: Fix double increment in alloc_descs()
genirq: Fix the kerneldoc comment for struct irq_affinity_desc
irqchip/madera: Drop GPIO includes
irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
irqchip/stm32-exti: Add domain translate function
genirq: Make sure the initial affinity is not empty
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Pull EDAC fix from Borislav Petkov:
"Fix persistent register offsets of altera_edac, from Thor Thayer"
* tag 'edac_fix_for_5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
EDAC, altera: Fix S10 persistent register offset
|
|
Pull dma-mapping fix from Christoph Hellwig:
"Fix a xen-swiotlb regression on arm64"
* tag 'dma-mapping-5.0-2' of git://git.infradead.org/users/hch/dma-mapping:
arm64/xen: fix xen-swiotlb cache flushing
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A fix for namespace label support for non-Intel NVDIMMs that implement
the ACPI standard label method.
This has apparently never worked and could wait for v5.1. However it
has enough visibility with hardware vendors [1] and distro bug
trackers [2], and low enough risk that I decided it should go in for
-rc4. The other fixups target the new, for v5.0, nvdimm security
functionality. The larger init path fixup closes a memory leak and a
potential userspace lockup due to missed notifications.
[1] https://github.com/pmem/ndctl/issues/78
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1811785
These have all soaked in -next for a week with no reported issues.
Summary:
- Fix support for NVDIMMs that implement the ACPI standard label
methods.
- Fix error handling for security overwrite (memory leak / userspace
hang condition), and another one-line security cleanup"
* tag 'libnvdimm-fixes-5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
acpi/nfit: Fix command-supported detection
acpi/nfit: Block function zero DSMs
libnvdimm/security: Require nvdimm_security_setup_events() to succeed
nfit_test: fix security state pull for nvdimm security nfit_test
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
"A fixup for the input_event fix for y2038 Sparc64, and couple other
minor fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: input_event - fix the CONFIG_SPARC64 mixup
Input: olpc_apsp - assign priv->dev earlier
Input: uinput - fix undefined behavior in uinput_validate_absinfo()
Input: raspberrypi-ts - fix link error
Input: xpad - add support for SteelSeries Stratus Duo
Input: input_event - provide override for sparc64
|
|
Pull networking fixes from David Miller:
1) Count ttl-dropped frames properly in mac80211, from Bob Copeland.
2) Integer overflow in ktime handling of bcm can code, from Oliver
Hartkopp.
3) Fix RX desc handling wrt. hw checksumming in ravb, from Simon
Horman.
4) Various hash key fixes in hv_netvsc, from Haiyang Zhang.
5) Use after free in ax25, from Eric Dumazet.
6) Several fixes to the SSN support in SCTP, from Xin Long.
7) Do not process frames after a NAPI reschedule in ibmveth, from
Thomas Falcon.
8) Fix NLA_POLICY_NESTED arguments, from Johannes Berg.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (42 commits)
qed: Revert error handling changes.
cfg80211: extend range deviation for DMG
cfg80211: reg: remove warn_on for a normal case
mac80211: Add attribute aligned(2) to struct 'action'
mac80211: don't initiate TDLS connection if station is not associated to AP
nl80211: fix NLA_POLICY_NESTED() arguments
ibmveth: Do not process frames after calling napi_reschedule
net: dev_is_mac_header_xmit() true for ARPHRD_RAWIP
net: usb: asix: ax88772_bind return error when hw_reset fail
MAINTAINERS: Update cavium networking drivers
net/mlx4_core: Fix error handling when initializing CQ bufs in the driver
net/mlx4_core: Add masking for a few queries on HCA caps
sctp: set flow sport from saddr only when it's 0
sctp: set chunk transport correctly when it's a new asoc
sctp: improve the events for sctp stream adding
sctp: improve the events for sctp stream reset
ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel
ax25: fix possible use-after-free
sfc: suppress duplicate nvmem partition types in efx_ef10_mtd_probe
hv_netvsc: fix typos in code comments
...
|
|
Pull VFIO fixes from Alex Williamson:
- cleanup licenses in new files (Thomas Gleixner)
- cleanup new compiler warnings (Alexey Kardashevskiy)
* tag 'vfio-v5.0-rc4' of git://github.com/awilliam/linux-vfio:
vfio-pci/nvlink2: Fix ancient gcc warnings
vfio/pci: Cleanup license mess
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Six fixes, all of which appear to have user visible consequences.
The DMA one is a regression fix from the merge window and of the
others, four are driver specific and one specific to the target code"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: Use explicit access size in ufshcd_dump_regs
scsi: tcmu: fix use after free
scsi: csiostor: fix NULL pointer dereference in csio_vport_set_state()
scsi: lpfc: nvmet: avoid hang / use-after-free when destroying targetport
scsi: lpfc: nvme: avoid hang / use-after-free when destroying localport
scsi: communicate max segment size to the DMA mapping code
|
|
Pull block fixes from Jens Axboe:
"A collection of fixes for this release. This contains:
- Silence sparse rightfully complaining about non-static wbt
functions (Bart)
- Fixes for the zoned comments/ioctl documentation (Damien)
- direct-io fix that's been lingering for a while (Ernesto)
- cgroup writeback fix (Tejun)
- Set of NVMe patches for nvme-rdma/tcp (Sagi, Hannes, Raju)
- Block recursion tracking fix (Ming)
- Fix debugfs command flag naming for a few flags (Jianchao)"
* tag 'for-linus-20190125' of git://git.kernel.dk/linux-block:
block: Fix comment typo
uapi: fix ioctl documentation
blk-wbt: Declare local functions static
blk-mq: fix the cmd_flag_name array
nvme-multipath: drop optimization for static ANA group IDs
nvmet-rdma: fix null dereference under heavy load
nvme-rdma: rework queue maps handling
nvme-tcp: fix timeout handler
nvme-rdma: fix timeout handler
writeback: synchronize sync(2) against cgroup writeback membership switches
block: cover another queue enter recursion via BIO_QUEUE_ENTERED
direct-io: allow direct writes to empty inodes
|
|
This is new code and not bug fixes.
This reverts all changes added by merge commit
8fb18be93efd7292d6ee403b9f61af1008239639
Signed-off-by: David S. Miller <davem@davemloft.net>
|