Age | Commit message (Collapse) | Author |
|
Pull NVMe updates and fixes from Keith:
"nvme updates for Linux 6.10
- Fabrics connection retries (Daniel, Hannes)
- Fabrics logging enhancements (Tokunori)
- RDMA delete optimization (Sagi)"
* tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme:
nvme-rdma, nvme-tcp: include max reconnects for reconnect logging
nvmet-rdma: Avoid o(n^2) loop in delete_ctrl
nvme: do not retry authentication failures
nvme-fabrics: short-circuit reconnect retries
nvme: return kernel error codes for admin queue connect
nvmet: return DHCHAP status codes from nvmet_setup_auth()
nvmet: lock config semaphore when accessing DH-HMAC-CHAP key
|
|
Pull block updates from Jens Axboe:
- Add a partscan attribute in sysfs, fixing an issue with systemd
relying on an internal interface that went away.
- Attempt #2 at making long running discards interruptible. The
previous attempt went into 6.9, but we ended up mostly reverting it
as it had issues.
- Remove old ida_simple API in bcache
- Support for zoned write plugging, greatly improving the performance
on zoned devices.
- Remove the old throttle low interface, which has been experimental
since 2017 and never made it beyond that and isn't being used.
- Remove page->index debugging checks in brd, as it hasn't caught
anything and prepares us for removing in struct page.
- MD pull request from Song
- Don't schedule block workers on isolated CPUs
* tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits)
blk-throttle: delay initialization until configuration
blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW
block: fix that util can be greater than 100%
block: support to account io_ticks precisely
block: add plug while submitting IO
bcache: fix variable length array abuse in btree_iter
bcache: Remove usage of the deprecated ida_simple_xx() API
md: Revert "md: Fix overflow in is_mddev_idle"
blk-lib: check for kill signal in ioctl BLKDISCARD
block: add a bio_await_chain helper
block: add a blk_alloc_discard_bio helper
block: add a bio_chain_and_submit helper
block: move discard checks into the ioctl handler
block: remove the discard_granularity check in __blkdev_issue_discard
block/ioctl: prefer different overflow check
null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION()
block: fix and simplify blkdevparts= cmdline parsing
block: refine the EOF check in blkdev_iomap_begin
block: add a partscan sysfs attribute for disks
block: add a disk_has_partscan helper
...
|
|
Pull io_uring updates from Jens Axboe:
- Greatly improve send zerocopy performance, by enabling coalescing of
sent buffers.
MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the
io_uring side did not. In local testing, the crossover point for send
zerocopy being faster is now around 3000 byte packets, and it
performs better than the sync syscall variants as well.
This feature relies on a shared branch with net-next, which was
pulled into both branches.
- Unification of how async preparation is done across opcodes.
Previously, opcodes that required extra memory for async retry would
allocate that as needed, using on-stack state until that was the
case. If async retry was needed, the on-stack state was adjusted
appropriately for a retry and then copied to the allocated memory.
This led to some fragile and ugly code, particularly for read/write
handling, and made storage retries more difficult than they needed to
be. Allocate the memory upfront, as it's cheap from our pools, and
use that state consistently both initially and also from the retry
side.
- Move away from using remap_pfn_range() for mapping the rings.
This is really not the right interface to use and can cause lifetime
issues or leaks. Additionally, it means the ring sq/cq arrays need to
be physically contigious, which can cause problems in production with
larger rings when services are restarted, as memory can be very
fragmented at that point.
Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply
the same treatment to mapped ring provided buffers. This also helps
unify the code we have dealing with allocating and mapping memory.
Hard to see in the diffstat as we're adding a few features as well,
but this kills about ~400 lines of code from the codebase as well.
- Add support for bundles for send/recv.
When used with provided buffers, bundles support sending or receiving
more than one buffer at the time, improving the efficiency by only
needing to call into the networking stack once for multiple sends or
receives.
- Tweaks for our accept operations, supporting both a DONTWAIT flag for
skipping poll arm and retry if we can, and a POLLFIRST flag that the
application can use to skip the initial accept attempt and rely
purely on poll for triggering the operation. Both of these have
identical flags on the receive side already.
- Make the task_work ctx locking unconditional.
We had various code paths here that would do a mix of lock/trylock
and set the task_work state to whether or not it was locked. All of
that goes away, we lock it unconditionally and get rid of the state
flag indicating whether it's locked or not.
The state struct still exists as an empty type, can go away in the
future.
- Add support for specifying NOP completion values, allowing it to be
used for error handling testing.
- Use set/test bit for io-wq worker flags. Not strictly needed, but
also doesn't hurt and helps silence a KCSAN warning.
- Cleanups for io-wq locking and work assignments, closing a tiny race
where cancelations would not be able to find the work item reliably.
- Misc fixes, cleanups, and improvements
* tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits)
io_uring: support to inject result for NOP
io_uring: fail NOP if non-zero op flags is passed in
io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
io_uring/net: add IORING_ACCEPT_DONTWAIT flag
io_uring/filetable: don't unnecessarily clear/reset bitmap
io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring
io_uring: Require zeroed sqe->len on provided-buffers send
io_uring/notif: disable LAZY_WAKE for linked notifs
io_uring/net: fix sendzc lazy wake polling
io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it
io_uring/rw: reinstate thread check for retries
io_uring/notif: implement notification stacking
io_uring/notif: simplify io_notif_flush()
net: add callback for setting a ubuf_info to skb
net: extend ubuf_info callback to ops structure
io_uring/net: support bundles for recv
io_uring/net: support bundles for send
io_uring/kbuf: add helpers for getting/peeking multiple buffers
io_uring/net: add provided buffer support for IORING_OP_SEND
...
|
|
Makes clear max reconnects translated by ctrl loss tmo and reconnect delay.
Signed-off-by: Tokunori Ikegami <ikegami.t@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Sandisk SN530 NVMe drives have broken MSIs. On systems without MSI-X
support, all commands time out resulting in the following message:
nvme nvme0: I/O tag 12 (100c) QID 0 timeout, completion polled
These timeouts cause the boot to take an excessively-long time (over 20
minutes) while the initial command queue is flushed.
Address this by adding a quirk for drives with buggy MSIs. The lspci
output for this device (recorded on a system with MSI-X support) is:
02:00.0 Non-Volatile memory controller: Sandisk Corp Device 5008 (rev 01) (prog-if 02 [NVM Express])
Subsystem: Sandisk Corp Device 5008
Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
Memory at f7e00000 (64-bit, non-prefetchable) [size=16K]
Memory at f7e04000 (64-bit, non-prefetchable) [size=256]
Capabilities: [80] Power Management version 3
Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
Capabilities: [c0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [1b8] Latency Tolerance Reporting
Capabilities: [300] Secondary PCI Express
Capabilities: [900] L1 PM Substates
Kernel driver in use: nvme
Kernel modules: nvme
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
When the key is invalid there is no point in retrying. Because the auth
code returns kernel error codes only, we can't test on the DNR bit.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Returning a nvme status from nvme_tcp_setup_ctrl() indicates that the
association was established and we have received a status from the
controller; consequently we should honour the DNR bit. If not any future
reconnect attempts will just return the same error, so we can
short-circuit the reconnect attempts and fail the connection directly.
Signed-off-by: Hannes Reinecke <hare@suse.de>
[dwagner: - extended nvme_should_reconnect]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvmf_connect_admin_queue returns NVMe error status codes and kernel
error codes. This mixes the different domains which makes maintainability
difficult.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
TLS requires a strict pdu pacing via MSG_EOR to signal the end
of a record and subsequent encryption. If we do not set MSG_EOR
at the end of a sequence the record won't be closed, encryption
doesn't start, and we end up with a send stall as the message
will never be passed on to the TCP layer.
So do not check for the queue status when TLS is enabled but
rather make the MSG_MORE setting dependent on the current
request only.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
While I/O is running, if the pci bus error occurs then
in-flight I/O can not complete. Worst, if at this time,
user (logically) hot-unplug the nvme disk then the
nvme_remove() code path can't forward progress until
in-flight I/O is cancelled. So these sequence of events
may potentially hang hot-unplug code path indefinitely.
This patch helps cancel the pending/in-flight I/O from the
nvme request timeout handler in case the nvme controller
is in the terminal (DEAD/DELETING/DELETING_NOIO) state and
that helps nvme_remove() code path forward progress and
finish successfully.
Link: https://lore.kernel.org/all/199be893-5dfa-41e5-b6f2-40ac90ebccc4@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
On system where native nvme multipath is configured and iopolicy
is set to numa but the nvme controller numa node id is undefined
or -1 (NUMA_NO_NODE) then avoid calculating node distance for
finding optimal io path. In such case we may access numa distance
table with invalid index and that may potentially refer to incorrect
memory. So this patch ensures that if the nvme controller numa node
id is -1 then instead of calculating node distance for finding optimal
io path, we set the numa node distance of such controller to default 10
(LOCAL_DISTANCE).
Link: https://lore.kernel.org/all/20240413090614.678353-1-nilay@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The only user of blk_revalidate_disk_zones() second argument was the
SCSI disk driver (sd). Now that this driver does not require this
update_driver_data argument, remove it to simplify the interface of
blk_revalidate_disk_zones(). Also update the function kdoc comment to
be more accurate (i.e. there is no gendisk ->revalidate method).
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
NVMe is making up issue_flags, which is a no-no in general, and to make
matters worse, they are completely the wrong ones. For a pure polled
request, which it does check for, we're already inside the
ctx->uring_lock when the completions are run off io_do_iopoll(). Hence
the correct flag would be '0' rather than IO_URING_F_UNLOCKED.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the stray '.' that is currently at the end of the line after
newline '\n' to before newline character which is the right position.
Fixes: ce8d78616a6b ("nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH")
Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Rename nvme_fc_nvme_ctrl_freed to nvme_fc_free_ctrl to match the name
pattern for the callback.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Apparently there are nvme controllers around that report namespaces
in the namespace list which have zero capacity. Return -ENXIO instead
of -ENODEV from nvme_update_ns_info_block so we don't create a hidden
multipath node for these namespaces but entirely ignore them.
Fixes: 46e7422cda84 ("nvme: move common logic into nvme_update_ns_info")
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvme_update_zone_info does (admin queue) I/O to the device and can fail.
We fail to abort the queue limits update if that happen, but really
should avoid with the frozen I/O queue as much as possible anyway.
Split the logic into a helper to query the information that can be
called on an unfrozen queue and one to apply it to the queue limits.
Fixes: 9b130d681443 ("nvme: use the atomic queue limits update API")
Reported-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Linux 6.9 made the nvme multipath nodes not properly pick up changes when
the LBA size goes smaller after an nvme format. This is because we now
try to inherit the queue settings for the multipath node entirely from
the individual paths. That is the right thing to do for I/O size
limitations, which make up most of the queue limits, but it is wrong for
changes to the namespace configuration, where we do want to pick up the
new format, which will eventually show up on all paths once they are
re-queried.
Fix this by not inheriting the block size and related fields and always
for updating them.
Fixes: 8f03cfa117e0 ("nvme: don't use nvme_update_disk_info for the multipath disk")
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Pull NVMe fixes from Keith:
"nvme updates for Linux 6.9
- Make an informative message less ominous (Keith)
- Enhanced trace decoding (Guixin)
- TCP updates (Hannes, Li)
- Fabrics connect deadlock fix (Chunguang)
- Platform API migration update (Uwe)
- A new device quirk (Jiawei)"
* tag 'nvme-6.9-2024-03-21' of git://git.infradead.org/nvme:
nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag
nvme: remove redundant BUILD_BUG_ON check
nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq
nvme-tcp: Export the nvme_tcp_wq to sysfs
drivers/nvme: Add quirks for device 126f:2262
nvme: parse format command's lbafu when tracing
nvme: add tracing of reservation commands
nvme: parse zns command's zsa and zrasf to string
nvme: use nvme_disk_is_ns_head helper
nvme: fix reconnection fail due to reserved tag allocation
nvmet: add tracing of zns commands
nvmet: add tracing of authentication commands
nvme-apple: Convert to platform remove callback returning void
nvmet-tcp: do not continue for invalid icreq
nvme: change shutdown timeout setting message
|
|
Remove redundant BUILD_BUG_ON check of struct nvme_dsm_range, it's
already checked in nvme_init_ctrl().
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The default nvme_tcp_wq will use all CPUs to process tasks. Sometimes it is
necessary to set CPU affinity to improve performance.
A new module parameter wq_unbound is added here. If set to true, users can
configure cpu affinity through
/sys/devices/virtual/workqueue/nvme_tcp_wq/cpumask.
Signed-off-by: Li Feng <fengli@smartx.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Make the workqueue userspace visible for easy viewing and configuration.
Signed-off-by: Li Feng <fengli@smartx.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for
device [126f:2262], which appears to be a generic VID:PID pair used for
many SSDs based on the Silicon Motion SM2262/SM2262EN controller.
Two of my SSDs with this VID:PID pair exhibit the same behavior:
* They frequently have trouble exiting the deepest power state (5),
resulting in the entire disk unresponsive.
Verified by setting nvme_core.default_ps_max_latency_us=10000 and
observing them behaving normally.
* They produce all-zero nguid and eui64 with `nvme id-ns` command.
The offending products are:
* HP SSD EX950 1TB
* HIKVISION C2000Pro 2TB
Signed-off-by: Jiawei Fu <i@ibugone.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add the parse of format command's lbafu to calculate lbaf.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add detailed parsing of reservation commands to make the trace log
more consistent and human-readable.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Parse zone mgmt send commands's zsa and receive command's
zrasf to string to make the trace log more human-readable.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Use nvme_disk_is_ns_head helper instead of check fops directly,
and also drop CONFIG_NVME_MULTIPATH check.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
We found a issue on production environment while using NVMe over RDMA,
admin_q reconnect failed forever while remote target and network is ok.
After dig into it, we found it may caused by a ABBA deadlock due to tag
allocation. In my case, the tag was hold by a keep alive request
waiting inside admin_q, as we quiesced admin_q while reset ctrl, so the
request maked as idle and will not process before reset success. As
fabric_q shares tagset with admin_q, while reconnect remote target, we
need a tag for connect command, but the only one reserved tag was held
by keep alive command which waiting inside admin_q. As a result, we
failed to reconnect admin_q forever. In order to fix this issue, I
think we should keep two reserved tags for admin queue.
Fixes: ed01fee283a0 ("nvme-fabrics: only reserve a single tag")
Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Large effort by Eric to lower rtnl_lock pressure and remove locks:
- Make commonly used parts of rtnetlink (address, route dumps
etc) lockless, protected by RCU instead of rtnl_lock.
- Add a netns exit callback which already holds rtnl_lock,
allowing netns exit to take rtnl_lock once in the core instead
of once for each driver / callback.
- Remove locks / serialization in the socket diag interface.
- Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
- Remove the dev_base_lock, depend on RCU where necessary.
- Support busy polling on a per-epoll context basis. Poll length and
budget parameters can be set independently of system defaults.
- Introduce struct net_hotdata, to make sure read-mostly global
config variables fit in as few cache lines as possible.
- Add optional per-nexthop statistics to ease monitoring / debug of
ECMP imbalance problems.
- Support TCP_NOTSENT_LOWAT in MPTCP.
- Ensure that IPv6 temporary addresses' preferred lifetimes are long
enough, compared to other configured lifetimes, and at least 2 sec.
- Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
- Add support for the independent control state machine for bonding
per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
control state machine.
- Add "network ID" to MCTP socket APIs to support hosts with multiple
disjoint MCTP networks.
- Re-use the mono_delivery_time skbuff bit for packets which user
space wants to be sent at a specified time. Maintain the timing
information while traversing veth links, bridge etc.
- Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
- Simplify many places iterating over netdevs by using an xarray
instead of a hash table walk (hash table remains in place, for use
on fastpaths).
- Speed up scanning for expired routes by keeping a dedicated list.
- Speed up "generic" XDP by trying harder to avoid large allocations.
- Support attaching arbitrary metadata to netconsole messages.
Things we sprinkled into general kernel code:
- Enforce VM_IOREMAP flag and range in ioremap_page_range and
introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
bpf_arena).
- Rework selftest harness to enable the use of the full range of ksft
exit code (pass, fail, skip, xfail, xpass).
Netfilter:
- Allow userspace to define a table that is exclusively owned by a
daemon (via netlink socket aliveness) without auto-removing this
table when the userspace program exits. Such table gets marked as
orphaned and a restarting management daemon can re-attach/regain
ownership.
- Speed up element insertions to nftables' concatenated-ranges set
type. Compact a few related data structures.
BPF:
- Add BPF token support for delegating a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application.
- Introduce bpf_arena which is sparse shared memory region between
BPF program and user space where structures inside the arena can
have pointers to other areas of the arena, and pointers work
seamlessly for both user-space programs and BPF programs.
- Introduce may_goto instruction that is a contract between the
verifier and the program. The verifier allows the program to loop
assuming it's behaving well, but reserves the right to terminate
it.
- Extend the BPF verifier to enable static subprog calls in spin lock
critical sections.
- Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops
type.
- Add support for retrieval of cookies for perf/kprobe multi links.
- Support arbitrary TCP SYN cookie generation / validation in the TC
layer with BPF to allow creating SYN flood handling in BPF
firewalls.
- Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF
objects.
Wireless:
- Add SPP (signaling and payload protected) AMSDU support.
- Support wider bandwidth OFDMA, as required for EHT operation.
Driver API:
- Major overhaul of the Energy Efficient Ethernet internals to
support new link modes (2.5GE, 5GE), share more code between
drivers (especially those using phylib), and encourage more
uniform behavior. Convert and clean up drivers.
- Define an API for querying per netdev queue statistics from
drivers.
- IPSec: account in global stats for fully offloaded sessions.
- Create a concept of Ethernet PHY Packages at the Device Tree level,
to allow parameterizing the existing PHY package code.
- Enable Rx hashing (RSS) on GTP protocol fields.
Misc:
- Improvements and refactoring all over networking selftests.
- Create uniform module aliases for TC classifiers, actions, and
packet schedulers to simplify creating modprobe policies.
- Address all missing MODULE_DESCRIPTION() warnings in networking.
- Extend the Netlink descriptions in YAML to cover message
encapsulation or "Netlink polymorphism", where interpretation of
nested attributes depends on link type, classifier type or some
other "class type".
Drivers:
- Ethernet high-speed NICs:
- Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
- Intel (100G, ice, idpf):
- support E825-C devices
- nVidia/Mellanox:
- support devices with one port and multiple PCIe links
- Broadcom (bnxt):
- support n-tuple filters
- support configuring the RSS key
- Wangxun (ngbe/txgbe):
- implement irq_domain for TXGBE's sub-interrupts
- Pensando/AMD:
- support XDP
- optimize queue submission and wakeup handling (+17% bps)
- optimize struct layout, saving 28% of memory on queues
- Ethernet NICs embedded and virtual:
- Google cloud vNIC:
- refactor driver to perform memory allocations for new queue
config before stopping and freeing the old queue memory
- Synopsys (stmmac):
- obey queueMaxSDU and implement counters required by 802.1Qbv
- Renesas (ravb):
- support packet checksum offload
- suspend to RAM and runtime PM support
- Ethernet switches:
- nVidia/Mellanox:
- support for nexthop group statistics
- Microchip:
- ksz8: implement PHY loopback
- add support for KSZ8567, a 7-port 10/100Mbps switch
- PTP:
- New driver for RENESAS FemtoClock3 Wireless clock generator.
- Support OCP PTP cards designed and built by Adva.
- CAN:
- Support recvmsg() flags for own, local and remote traffic on CAN
BCM sockets.
- Support for esd GmbH PCIe/402 CAN device family.
- m_can:
- Rx/Tx submission coalescing
- wake on frame Rx
- WiFi:
- Intel (iwlwifi):
- enable signaling and payload protected A-MSDUs
- support wider-bandwidth OFDMA
- support for new devices
- bump FW API to 89 for AX devices; 90 for BZ/SC devices
- MediaTek (mt76):
- mt7915: newer ADIE version support
- mt7925: radio temperature sensor support
- Qualcomm (ath11k):
- support 6 GHz station power modes: Low Power Indoor (LPI),
Standard Power) SP and Very Low Power (VLP)
- QCA6390 & WCN6855: support 2 concurrent station interfaces
- QCA2066 support
- Qualcomm (ath12k):
- refactoring in preparation for Multi-Link Operation (MLO)
support
- 1024 Block Ack window size support
- firmware-2.bin support
- support having multiple identical PCI devices (firmware needs
to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
- QCN9274: support split-PHY devices
- WCN7850: enable Power Save Mode in station mode
- WCN7850: P2P support
- RealTek:
- rtw88: support for more rtw8811cu and rtw8821cu devices
- rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
- rtlwifi: speed up USB firmware initialization
- rtwl8xxxu:
- RTL8188F: concurrent interface support
- Channel Switch Announcement (CSA) support in AP mode
- Broadcom (brcmfmac):
- per-vendor feature support
- per-vendor SAE password setup
- DMI nvram filename quirk for ACEPC W5 Pro"
* tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
nexthop: Fix out-of-bounds access during attribute validation
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
selftests/bpf: Add kprobe multi triggering benchmarks
ptp: Move from simple ida to xarray
vxlan: Remove generic .ndo_get_stats64
vxlan: Do not alloc tstats manually
devlink: Add comments to use netlink gen tool
nfp: flower: handle acti_netdevs allocation failure
net/packet: Add getsockopt support for PACKET_COPY_THRESH
net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
selftests/bpf: Add bpf_arena_htab test.
selftests/bpf: Add bpf_arena_list test.
selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
bpf: Add helper macro bpf_addr_space_cast()
libbpf: Recognize __arena global variables.
bpftool: Recognize arena map type
...
|
|
Pull block updates from Jens Axboe:
- MD pull requests via Song:
- Cleanup redundant checks (Yu Kuai)
- Remove deprecated headers (Marc Zyngier, Song Liu)
- Concurrency fixes (Li Lingfeng)
- Memory leak fix (Li Nan)
- Refactor raid1 read_balance (Yu Kuai, Paul Luse)
- Clean up and fix for md_ioctl (Li Nan)
- Other small fixes (Gui-Dong Han, Heming Zhao)
- MD atomic limits (Christoph)
- NVMe pull request via Keith:
- RDMA target enhancements (Max)
- Fabrics fixes (Max, Guixin, Hannes)
- Atomic queue_limits usage (Christoph)
- Const use for class_register (Ricardo)
- Identification error handling fixes (Shin'ichiro, Keith)
- Improvement and cleanup for cached request handling (Christoph)
- Moving towards atomic queue limits. Core changes and driver bits so
far (Christoph)
- Fix UAF issues in aoeblk (Chun-Yi)
- Zoned fix and cleanups (Damien)
- s390 dasd cleanups and fixes (Jan, Miroslav)
- Block issue timestamp caching (me)
- noio scope guarding for zoned IO (Johannes)
- block/nvme PI improvements (Kanchan)
- Ability to terminate long running discard loop (Keith)
- bdev revalidation fix (Li)
- Get rid of old nr_queues hack for kdump kernels (Ming)
- Support for async deletion of ublk (Ming)
- Improve IRQ bio recycling (Pavel)
- Factor in CPU capacity for remote vs local completion (Qais)
- Add shared_tags configfs entry for null_blk (Shin'ichiro
- Fix for a regression in page refcounts introduced by the folio
unification (Tony)
- Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid,
Ricardo, Roman, Tang, Uwe)
* tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits)
block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
block/swim: Convert to platform remove callback returning void
cdrom: gdrom: Convert to platform remove callback returning void
block: remove disk_stack_limits
md: remove mddev->queue
md: don't initialize queue limits
md/raid10: use the atomic queue limit update APIs
md/raid5: use the atomic queue limit update APIs
md/raid1: use the atomic queue limit update APIs
md/raid0: use the atomic queue limit update APIs
md: add queue limit helpers
md: add a mddev_is_dm helper
md: add a mddev_add_trace_msg helper
md: add a mddev_trace_remap helper
bcache: move calculation of stripe_size and io_opt into bcache_device_init
virtio_blk: Do not use disk_set_max_open/active_zones()
aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
block: move capacity validation to blkpg_do_ioctl()
block: prevent division by zero in blk_rq_stat_sum()
drbd: atomically update queue limits in drbd_reconsider_queue_parameters
...
|
|
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
User visible messages containing the word "timeout" can be alarming.
This one from nvme is just reporting a potentially informative device
configuration, and everything is working as designed. Change the text to
report the less concerning "D3 entry latency", which is where this value
comes from anyway.
Reported-by: Len Brown <lenb@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The memory allocated for the identification is freed on failure. Set
it to NULL so the caller doesn't have a pointer to that freed address.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When nvme_identify_ns() fails, it frees the pointer to the struct
nvme_id_ns before it returns. However, ns_update_nuse() calls kfree()
for the pointer even when nvme_identify_ns() fails. This results in
KASAN double-free, which was observed with blktests nvme/045 with
proposed patches [1] on the kernel v6.8-rc7. Fix the double-free by
skipping kfree() when nvme_identify_ns() fails.
Link: https://lore.kernel.org/linux-block/20240304161303.19681-1-dwagner@suse.de/ [1]
Fixes: a1a825ab6a60 ("nvme: add csi, ms and nuse to sysfs")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the nvmf_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the structures nvme_class, nvme_subsys_class and
nvme_ns_chr_class to be declared at build time placing them into read-only
memory, instead of having to be dynamically allocated at boot time.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When draining a page_frag_cache, most user are doing
the similar steps, so introduce an API to avoid code
duplication.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Of course we should use the key if there is no error ...
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Switch to the queue_limits_* helpers to stack the bdev limits, which also
includes updating the readahead settings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The multipath disk starts out with the stacking default limits.
The one interesting part here is that blk_set_stacking_limits
sets the max_zone_append_sectorts to UINT_MAX, which fails the
validation for non-zoned devices. With the old one call per
limit scheme this was fine because no one verified this weird
mismatch and it was fixed by blk_stack_limits a little later
before I/O could be issued.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Changes the callchains that update queue_limits to build an on-stack
queue_limits and update it atomically. Note that for now only the
admin queue actually passes it to the queue allocation function.
Doing the same for the gendisks used for the namespaces will require
a little more work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Fold nvme_init_ms into nvme_configure_metadata after splitting up
a little helper to deal with the extended LBA formats.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Move reading the Identify Namespace Data Structure, NVM Command Set out
of configure_metadata into the caller. This allows doing the identify
call outside the frozen I/O queues, and prepares for using data from
the Identify data structure for other purposes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Split the logic to query the Identify Namespace Data Structure, NVM
Command Set into a separate helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvme_update_ns_info_generic and nvme_update_ns_info_block share a
fair amount of logic related to not fully supported namespace
formats and updating the multipath information. Move this logic
into the common caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvme_set_queue_limits is used on the admin queue and all gendisks
including hidden ones that don't support block I/O. The write cache
setting on the other hand only makes sense for block I/O. Move the
blk_queue_write_cache call to nvme_update_ns_info_block instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Move setting up the integrity profile and setting the disk capacity out
of nvme_update_disk_info to get nvme_update_disk_info into a shape where
it just sets queue_limits eventually.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Currently nvme_update_ns_info_block calls nvme_update_disk_info both for
the namespace attached disk, and the multipath one (if it exists). This
is very different from how other stacking drivers work, and leads to
a lot of complexity.
Switch to setting the disk capacity and initializing the integrity
profile, and let blk_stack_limits which already is called just below
deal with updating the other limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Move uneregistering the existing integrity profile into the helper
dealing with all the other integrity / metadata setup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Handle the no metadata support case in nvme_init_integrity as well to
simplify the calling convention and prepare for future changes in the
area.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|