author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-05-13 12:48:06 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-05-13 12:48:06 -0700 |
commit | 9961a785944601e32f185ea696347b22ffda634c (patch) | |
tree | 24362a6360b2a1fa96c7eb562c6a95f96b37e459 /drivers | |
parent | f4e8d80292859809ea135e9f4c43bae47e4f58bc (diff) | |
parent | deb1e496a83557896fe0cca0b8af01c2a97c0dc6 (diff) |
Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Greatly improve send zerocopy performance, by enabling coalescing of
sent buffers.
MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the
io_uring side did not. In local testing, the crossover point for send
zerocopy being faster is now around 3000 byte packets, and it
performs better than the sync syscall variants as well.
This feature relies on a shared branch with net-next, which was
pulled into both branches.
- Unification of how async preparation is done across opcodes.
Previously, opcodes that required extra memory for async retry would
allocate that as needed, using on-stack state until that was the
case. If async retry was needed, the on-stack state was adjusted
appropriately for a retry and then copied to the allocated memory.
This led to some fragile and ugly code, particularly for read/write
handling, and made storage retries more difficult than they needed to
be. Allocate the memory upfront, as it's cheap from our pools, and
use that state consistently both initially and also from the retry
side.
- Move away from using remap_pfn_range() for mapping the rings.
This is really not the right interface to use and can cause lifetime
issues or leaks. Additionally, it means the ring sq/cq arrays need to
be physically contiguous, which can cause problems in production with
larger rings when services are restarted, as memory can be very
fragmented at that point.
Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply
the same treatment to mapped ring provided buffers. This also helps
unify the code we have dealing with allocating and mapping memory.
Hard to see in the diffstat as we're adding a few features as well,
but this kills about 400 lines of code from the codebase.
- Add support for bundles for send/recv.
When used with provided buffers, bundles support sending or receiving
more than one buffer at a time, improving the efficiency by only
needing to call into the networking stack once for multiple sends or
receives.
- Tweaks for our accept operations, supporting both a DONTWAIT flag for
skipping poll arm and retry if we can, and a POLLFIRST flag that the
application can use to skip the initial accept attempt and rely
purely on poll for triggering the operation. Both of these have
identical flags on the receive side already.
- Make the task_work ctx locking unconditional.
We had various code paths here that would do a mix of lock/trylock
and set the task_work state to whether or not it was locked. All of
that goes away, we lock it unconditionally and get rid of the state
flag indicating whether it's locked or not.
The state struct still exists as an empty type, but can go away in the
future.
- Add support for specifying NOP completion values, allowing it to be
used for error handling testing.
- Use set/test bit for io-wq worker flags. Not strictly needed, but
also doesn't hurt and helps silence a KCSAN warning.
- Cleanups for io-wq locking and work assignments, closing a tiny race
where cancelations would not be able to find the work item reliably.
- Misc fixes, cleanups, and improvements
* tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits)
io_uring: support to inject result for NOP
io_uring: fail NOP if non-zero op flags is passed in
io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
io_uring/net: add IORING_ACCEPT_DONTWAIT flag
io_uring/filetable: don't unnecessarily clear/reset bitmap
io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring
io_uring: Require zeroed sqe->len on provided-buffers send
io_uring/notif: disable LAZY_WAKE for linked notifs
io_uring/net: fix sendzc lazy wake polling
io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it
io_uring/rw: reinstate thread check for retries
io_uring/notif: implement notification stacking
io_uring/notif: simplify io_notif_flush()
net: add callback for setting a ubuf_info to skb
net: extend ubuf_info callback to ops structure
io_uring/net: support bundles for recv
io_uring/net: support bundles for send
io_uring/kbuf: add helpers for getting/peeking multiple buffers
io_uring/net: add provided buffer support for IORING_OP_SEND
...
Diffstat (limited to 'drivers')
-rw-r--r-- | drivers/net/tap.c | 2 |
-rw-r--r-- | drivers/net/tun.c | 2 |
-rw-r--r-- | drivers/net/xen-netback/common.h | 5 |
-rw-r--r-- | drivers/net/xen-netback/interface.c | 2 |
-rw-r--r-- | drivers/net/xen-netback/netback.c | 11 |
-rw-r--r-- | drivers/nvme/host/ioctl.c | 15 |
-rw-r--r-- | drivers/vhost/net.c | 8 |
7 files changed, 30 insertions, 15 deletions
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9f0495e8df4d..bfdd3875fe86 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -754,7 +754,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
 		skb_zcopy_init(skb, msg_control);
 	} else if (msg_control) {
 		struct ubuf_info *uarg = msg_control;
-		uarg->callback(NULL, uarg, false);
+		uarg->ops->complete(NULL, uarg, false);
 	}
 
 	dev_queue_xmit(skb);
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 92da8c03d960..9254bca2813d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1906,7 +1906,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 		skb_zcopy_init(skb, msg_control);
 	} else if (msg_control) {
 		struct ubuf_info *uarg = msg_control;
-		uarg->callback(NULL, uarg, false);
+		uarg->ops->complete(NULL, uarg, false);
 	}
 
 	skb_reset_network_header(skb);
diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 1fcbd83f7ff2..17421da139f2 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -390,9 +390,8 @@ bool xenvif_rx_queue_tail(struct xenvif_queue *queue, struct sk_buff *skb);
 
 void xenvif_carrier_on(struct xenvif *vif);
 
-/* Callback from stack when TX packet can be released */
-void xenvif_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *ubuf,
-			      bool zerocopy_success);
+/* Callbacks from stack when TX packet can be released */
+extern const struct ubuf_info_ops xenvif_ubuf_ops;
 
 static inline pending_ring_idx_t nr_pending_reqs(struct xenvif_queue *queue)
 {
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 7cff90aa8d24..65db5f14465f 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -593,7 +593,7 @@ int xenvif_init_queue(struct xenvif_queue *queue)
 
 	for (i = 0; i < MAX_PENDING_REQS; i++) {
 		queue->pending_tx_info[i].callback_struct = (struct ubuf_info_msgzc)
-			{ { .callback = xenvif_zerocopy_callback },
+			{ { .ops = &xenvif_ubuf_ops },
 			  { { .ctx = NULL,
 			      .desc = i } } };
 		queue->grant_tx_handle[i] = NETBACK_INVALID_HANDLE;
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index ef76850d9bcd..bab7e43ea05b 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1156,7 +1156,7 @@ static int xenvif_handle_frag_list(struct xenvif_queue *queue, struct sk_buff *s
 	uarg = skb_shinfo(skb)->destructor_arg;
 	/* increase inflight counter to offset decrement in callback */
 	atomic_inc(&queue->inflight_packets);
-	uarg->callback(NULL, uarg, true);
+	uarg->ops->complete(NULL, uarg, true);
 	skb_shinfo(skb)->destructor_arg = NULL;
 
 	/* Fill the skb with the new (local) frags. */
@@ -1278,8 +1278,9 @@ static int xenvif_tx_submit(struct xenvif_queue *queue)
 	return work_done;
 }
 
-void xenvif_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *ubuf_base,
-			      bool zerocopy_success)
+static void xenvif_zerocopy_callback(struct sk_buff *skb,
+				     struct ubuf_info *ubuf_base,
+				     bool zerocopy_success)
 {
 	unsigned long flags;
 	pending_ring_idx_t index;
@@ -1312,6 +1313,10 @@ void xenvif_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *ubuf_base,
 	xenvif_skb_zerocopy_complete(queue);
 }
 
+const struct ubuf_info_ops xenvif_ubuf_ops = {
+	.complete = xenvif_zerocopy_callback,
+};
+
 static inline void xenvif_tx_dealloc_action(struct xenvif_queue *queue)
 {
 	struct gnttab_unmap_grant_ref *gop;
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 3dfd5ae99ae0..499a8bb7cac7 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -423,13 +423,20 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
 	pdu->result = le64_to_cpu(nvme_req(req)->result.u64);
 
 	/*
-	 * For iopoll, complete it directly.
+	 * For iopoll, complete it directly. Note that using the uring_cmd
+	 * helper for this is safe only because we check blk_rq_is_poll().
+	 * As that returns false if we're NOT on a polled queue, then it's
+	 * safe to use the polled completion helper.
+	 *
 	 * Otherwise, move the completion to task work.
 	 */
-	if (blk_rq_is_poll(req))
-		nvme_uring_task_cb(ioucmd, IO_URING_F_UNLOCKED);
-	else
+	if (blk_rq_is_poll(req)) {
+		if (pdu->bio)
+			blk_rq_unmap_user(pdu->bio);
+		io_uring_cmd_iopoll_done(ioucmd, pdu->result, pdu->status);
+	} else {
 		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
+	}
 
 	return RQ_END_IO_FREE;
 }
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c64ded183f8d..f16279351db5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -380,7 +380,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 	}
 }
 
-static void vhost_zerocopy_callback(struct sk_buff *skb,
+static void vhost_zerocopy_complete(struct sk_buff *skb,
 				    struct ubuf_info *ubuf_base, bool success)
 {
 	struct ubuf_info_msgzc *ubuf = uarg_to_msgzc(ubuf_base);
@@ -408,6 +408,10 @@ static void vhost_zerocopy_callback(struct sk_buff *skb,
 	rcu_read_unlock_bh();
 }
 
+static const struct ubuf_info_ops vhost_ubuf_ops = {
+	.complete = vhost_zerocopy_complete,
+};
+
 static inline unsigned long busy_clock(void)
 {
 	return local_clock() >> 10;
@@ -879,7 +883,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
 			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
-			ubuf->ubuf.callback = vhost_zerocopy_callback;
+			ubuf->ubuf.ops = &vhost_ubuf_ops;
 			ubuf->ubuf.flags = SKBFL_ZEROCOPY_FRAG;
 			refcount_set(&ubuf->ubuf.refcnt, 1);
 			msg.msg_control = &ctl;