summaryrefslogtreecommitdiff
path: root/io_uring/net.c
AgeCommit message (Collapse)Author
2023-11-03io_uring/net: ensure socket is marked connected on connect retryJens Axboe
io_uring does non-blocking connection attempts, which can yield some unexpected results if a connect request is re-attempted by an an application. This is equivalent to the following sync syscall sequence: sock = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP); connect(sock, &addr, sizeof(addr); ret == -1 and errno == EINPROGRESS expected here. Now poll for POLLOUT on sock, and when that returns, we expect the socket to be connected. But if we follow that procedure with: connect(sock, &addr, sizeof(addr)); you'd expect ret == -1 and errno == EISCONN here, but you actually get ret == 0. If we attempt the connection one more time, then we get EISCON as expected. io_uring used to do this, but turns out that bluetooth fails with EBADFD if you attempt to re-connect. Also looks like EISCONN _could_ occur with this sequence. Retain the ->in_progress logic, but work-around a potential EISCONN or EBADFD error and only in those cases look at the sock_error(). This should work in general and avoid the odd sequence of a repeated connect request returning success when the socket is already connected. This is all a side effect of the socket state being in a CONNECTING state when we get EINPROGRESS, and only a re-connect or other related operation will turn that into CONNECTED. Cc: stable@vger.kernel.org Fixes: 3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT") Link: https://github.com/axboe/liburing/issues/980 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-14io_uring/net: fix iter retargeting for selected bufPavel Begunkov
When using selected buffer feature, io_uring delays data iter setup until later. If io_setup_async_msg() is called before that it might see not correctly setup iterator. Pre-init nr_segs and judge from its state whether we repointing. Cc: stable@vger.kernel.org Reported-by: syzbot+a4c6e5ef999b68b26ed1@syzkaller.appspotmail.com Fixes: 0455d4ccec548 ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0000000000002770be06053c7757@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring: never overflow io_aux_cqePavel Begunkov
Now all callers of io_aux_cqe() set allow_overflow to false, remove the parameter and not allow overflowing auxilary multishot cqes. When CQ is full the function callers and all multishot requests in general are expected to complete the request. That prevents indefinite in-background grows of the overflow list and let's the userspace to handle the backlog at its own pace. Resubmitting a request should also be faster than accounting a bunch of overflows, so it should be better for perf when it happens, but a well behaving userspace should be trying to avoid overflows in any case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb20d14d708ea174721e58bb53786b0521e4dd6d.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/net: don't overflow multishot recvPavel Begunkov
Don't allow overflowing multishot recv CQEs, it might get out of hand, hurt performance, and in the worst case scenario OOM the task. Cc: stable@vger.kernel.org Fixes: b3fdea6ecb55c ("io_uring: multishot recv") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0b295634e8f1b71aa764c984608c22d85f88f75c.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11io_uring/net: don't overflow multishot acceptPavel Begunkov
Don't allow overflowing multishot accept CQEs, we want to limit the grows of the overflow list. Cc: stable@vger.kernel.org Fixes: 4e86a2c980137 ("io_uring: implement multishot mode for accept") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7d0d749649244873772623dd7747966f516fe6e2.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-03Merge tag 'io_uring-6.5-2023-07-03' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: "The fix for the msghdr->msg_inq assigned value being wrong, using -1 instead of -1U for the signed type. Also a fix for ensuring when we're trying to run task_work on an exiting task, that we wait for it. This is not really a correctness thing as the work is being canceled, but it does help with ensuring file descriptors are closed when the task has exited." * tag 'io_uring-6.5-2023-07-03' of git://git.kernel.dk/linux: io_uring: flush offloaded and delayed task_work on exit io_uring: remove io_fallback_tw() forward declaration io_uring/net: use proper value for msg_inq
2023-06-28Merge tag 'net-next-6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
2023-06-27io_uring/net: use proper value for msg_inqJens Axboe
struct msghdr->msg_inq is a signed type, yet we attempt to store what is essentially an unsigned bitmask in there. We only really need to know if the field was stored or not, but let's use the proper type to avoid any misunderstandings on what is being attempted here. Link: https://lore.kernel.org/io-uring/CAHk-=wjKb24aSe6fE4zDH-eh8hr-FB9BbukObUVSMGOrsBHCRQ@mail.gmail.com/ Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-26Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "Nothing major in this release, just a bunch of cleanups and some optimizations around networking mostly. - clean up file request flags handling (Christoph) - clean up request freeing and CQ locking (Pavel) - support for using pre-registering the io_uring fd at setup time (Josh) - Add support for user allocated ring memory, rather than having the kernel allocate it. Mostly for packing rings into a huge page (me) - avoid an unnecessary double retry on receive (me) - maintain ordering for task_work, which also improves performance (me) - misc cleanups/fixes (Pavel, me)" * tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits) io_uring: merge conditional unlock flush helpers io_uring: make io_cq_unlock_post static io_uring: inline __io_cq_unlock io_uring: fix acquire/release annotations io_uring: kill io_cq_unlock() io_uring: remove IOU_F_TWQ_FORCE_NORMAL io_uring: don't batch task put on reqs free io_uring: move io_clean_op() io_uring: inline io_dismantle_req() io_uring: remove io_free_req_tw io_uring: open code io_put_req_find_next io_uring: add helpers to decode the fixed file file_ptr io_uring: use io_file_from_index in io_msg_grab_file io_uring: use io_file_from_index in __io_sync_cancel io_uring: return REQ_F_ flags from io_file_get_flags io_uring: remove io_req_ffs_set io_uring: remove a confusing comment above io_file_get_flags io_uring: remove the mode variable in io_file_get_flags io_uring: remove __io_file_supports_nowait io_uring: wait interruptibly for request completions on exit ...
2023-06-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: tools/testing/selftests/net/fcnal-test.sh d7a2fc1437f7 ("selftests: net: fcnal-test: check if FIPS mode is enabled") dd017c72dde6 ("selftests: fcnal: Test SO_DONTROUTE on TCP sockets.") https://lore.kernel.org/all/5007b52c-dd16-dbf6-8d64-b9701bfa498b@tessares.net/ https://lore.kernel.org/all/20230619105427.4a0df9b3@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-21io_uring/net: use the correct msghdr union member in io_sendmsg_copy_hdrJens Axboe
Rather than assign the user pointer to msghdr->msg_control, assign it to msghdr->msg_control_user to make sparse happy. They are in a union so the end result is the same, but let's avoid new sparse warnings and squash this one. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306210654.mDMcyMuB-lkp@intel.com/ Fixes: cac9e4418f4c ("io_uring/net: save msghdr->msg_control for retries") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21io_uring/net: disable partial retries for recvmsg with cmsgJens Axboe
We cannot sanely handle partial retries for recvmsg if we have cmsg attached. If we don't, then we'd just be overwriting the initial cmsg header on retries. Alternatively we could increment and handle this appropriately, but it doesn't seem worth the complication. Move the MSG_WAITALL check into the non-multishot case while at it, since MSG_WAITALL is explicitly disabled for multishot anyway. Link: https://lore.kernel.org/io-uring/0b0d4411-c8fd-4272-770b-e030af6919a0@kernel.dk/ Cc: stable@vger.kernel.org # 5.10+ Reported-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21io_uring/net: clear msg_controllen on partial sendmsg retryJens Axboe
If we have cmsg attached AND we transferred partial data at least, clear msg_controllen on retry so we don't attempt to send that again. Cc: stable@vger.kernel.org # 5.10+ Fixes: cac9e4418f4c ("io_uring/net: save msghdr->msg_control for retries") Reported-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-13io_uring/net: save msghdr->msg_control for retriesJens Axboe
If the application sets ->msg_control and we have to later retry this command, or if it got queued with IOSQE_ASYNC to begin with, then we need to retain the original msg_control value. This is due to the net stack overwriting this field with an in-kernel pointer, to copy it in. Hitting that path for the second time will now fail the copy from user, as it's attempting to copy from a non-user address. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/issues/880 Reported-and-tested-by: Marek Majkowski <marek@cloudflare.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07io_uring: cleanup io_aux_cqe() APIJens Axboe
Everybody is passing in the request, so get rid of the io_ring_ctx and explicit user_data pass-in. Both the ctx and user_data can be deduced from the request at hand. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-23net: Declare MSG_SPLICE_PAGES internal sendmsg() flagDavid Howells
Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a network protocol that it should splice pages from the source iterator rather than copying the data if it can. This flag is added to a list that is cleared by sendmsg syscalls on entry. This is intended as a replacement for the ->sendpage() op, allowing a way to splice in several multipage folios in one go. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17io_uring/net: don't retry recvmsg() unnecessarilyJens Axboe
If we're doing multishot receives, then we always end up doing two trips through sock_recvmsg(). For protocols that sanely set msghdr->msg_inq, then we don't need to waste time picking a new buffer and attempting a new receive if there's nothing there. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-17io_uring/net: push IORING_CQE_F_SOCK_NONEMPTY into io_recv_finish()Jens Axboe
Rather than have this logic in both io_recv() and io_recvmsg_multishot(), push it into the handler they both call when finishing a receive operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-17io_uring/net: initalize msghdr->msg_inq to known valueJens Axboe
We can't currently tell if ->msg_inq was set when we ask for msg_get_inq, initialize it to -1U so we can tell apart if it was set and there's no data left, or if it just wasn't set at all by the protocol. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-17io_uring/net: initialize struct msghdr more sanely for io_recv()Jens Axboe
We only need to clear the input fields on the first invocation, not when potentially doing a retry. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30iov_iter: add iter_iovec() helperJens Axboe
This returns a pointer to the current iovec entry in the iterator. Only useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF and ITER_IOVEC identically for the first segment. Rename struct iov_iter->iov to iov_iter->__iov to find any potentially troublesome spots, and also to prevent anyone from adding new code that accesses iter->iov directly. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-20io_uring/net: avoid sending -ECONNABORTED on repeated connection requestsJens Axboe
Since io_uring does nonblocking connect requests, if we do two repeated ones without having a listener, the second will get -ECONNABORTED rather than the expected -ECONNREFUSED. Treat -ECONNABORTED like a normal retry condition if we're nonblocking, if we haven't already seen it. Cc: stable@vger.kernel.org Fixes: 3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT") Link: https://github.com/axboe/liburing/issues/828 Reported-by: Hui, Chunyang <sanqian.hcy@antgroup.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-24io_uring: remove MSG_NOSIGNAL from recvmsgDavid Lamparter
MSG_NOSIGNAL is not applicable for the receiving side, SIGPIPE is generated when trying to write to a "broken pipe". AF_PACKET's packet_recvmsg() does enforce this, giving back EINVAL when MSG_NOSIGNAL is set - making it unuseable in io_uring's recvmsg. Remove MSG_NOSIGNAL from io_recvmsg_prep(). Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: David Lamparter <equinox@diac24.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230224150123.128346-1-equinox@diac24.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-20Merge tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring ITER_UBUF conversion from Jens Axboe: "Since we now have ITER_UBUF available, switch to using it for single ranges as it's more efficient than ITER_IOVEC for that" * tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux: block: use iter_ubuf for single range iov_iter: move iter_ubuf check inside restore WARN io_uring: use iter_ubuf for single range imports io_uring: switch network send/recv to ITER_UBUF iov: add import_ubuf()
2023-01-29io_uring: for requests that require async, force itDylan Yudaken
Some requests require being run async as they do not support non-blocking. Instead of trying to issue these requests, getting -EAGAIN and then queueing them for async issue, rather just force async upfront. Add WARN_ON_ONCE to make sure surprising code paths do not come up, however in those cases the bug would end up being a blocking io_uring_enter(2) which should not be critical. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20230127135227.3646353-3-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-23io_uring/net: cache provided buffer group value for multishot receivesJens Axboe
If we're using ring provided buffers with multishot receive, and we end up doing an io-wq based issue at some points that also needs to select a buffer, we'll lose the initially assigned buffer group as io_ring_buffer_select() correctly clears the buffer group list as the issue isn't serialized by the ctx uring_lock. This is fine for normal receives as the request puts the buffer and finishes, but for multishot, we will re-arm and do further receives. On the next trigger for this multishot receive, the receive will try and pick from a buffer group whose value is the same as the buffer ID of the las receive. That is obviously incorrect, and will result in a premature -ENOUFS error for the receive even if we had available buffers in the correct group. Cache the buffer group value at prep time, so we can restore it for future receives. This only needs doing for the above mentioned case, but just do it by default to keep it easier to read. Cc: stable@vger.kernel.org Fixes: b3fdea6ecb55 ("io_uring: multishot recv") Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Cc: Dylan Yudaken <dylany@meta.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-08io_uring: switch network send/recv to ITER_UBUFJens Axboe
This is more efficient than iter_iov. Signed-off-by: Jens Axboe <axboe@kernel.dk> [merged to 6.2] Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-12-19io_uring/net: fix cleanup after recyclePavel Begunkov
Don't access io_async_msghdr io_netmsg_recycle(), it may be reallocated. Cc: stable@vger.kernel.org Fixes: 9bb66906f23e5 ("io_uring: support multishot in recvmsg") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9e326f4ad4046ddadf15bf34bf3fa58c6372f6b5.1671461985.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-19io_uring/net: ensure compat import handlers clear free_iovJens Axboe
If we're not allocating the vectors because the count is below UIO_FASTIOV, we still do need to properly clear ->free_iov to prevent an erronous free of on-stack data. Reported-by: Jiri Slaby <jirislaby@gmail.com> Fixes: 4c17a496a7a0 ("io_uring/net: fix cleanup double free free_iov init") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-13Merge tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates part two from Jens Axboe: - Misc fixes (me, Lin) - Series from Pavel extending the single task exclusive ring mode, yielding nice improvements for the common case of having a single ring per thread (Pavel) - Cleanup for MSG_RING, removing our IOPOLL hack (Pavel) - Further poll cleanups and fixes (Pavel) - Misc cleanups and fixes (Pavel) * tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux: (22 commits) io_uring/msg_ring: flag target ring as having task_work, if needed io_uring: skip spinlocking for ->task_complete io_uring: do msg_ring in target task via tw io_uring: extract a io_msg_install_complete helper io_uring: get rid of double locking io_uring: never run tw and fallback in parallel io_uring: use tw for putting rsrc io_uring: force multishot CQEs into task context io_uring: complete all requests in task context io_uring: don't check overflow flush failures io_uring: skip overflow CQE posting for dying ring io_uring: improve io_double_lock_ctx fail handling io_uring: dont remove file from msg_ring reqs io_uring: reshuffle issue_flags io_uring: don't reinstall quiesce node for each tw io_uring: improve rsrc quiesce refs checks io_uring: don't raw spin unlock to match cq_lock io_uring: combine poll tw handlers io_uring: improve poll warning handling io_uring: remove ctx variable in io_poll_check_events ...
2022-12-13Merge tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - Always ensure proper ordering in case of CQ ring overflow, which then means we can remove some work-arounds for that (Dylan) - Support completion batching for multishot, greatly increasing the efficiency for those (Dylan) - Flag epoll/eventfd wakeups done from io_uring, so that we can easily tell if we're recursing into io_uring again. Previously, this would have resulted in repeated multishot notifications if we had a dependency there. That could happen if an eventfd was registered as the ring eventfd, and we multishot polled for events on it. Or if an io_uring fd was added to epoll, and io_uring had a multishot request for the epoll fd. Test cases here: https://git.kernel.dk/cgit/liburing/commit/?id=919755a7d0096fda08fb6d65ac54ad8d0fe027cd Previously these got terminated when the CQ ring eventually overflowed, now it's handled gracefully (me). - Tightening of the IOPOLL based completions (Pavel) - Optimizations of the networking zero-copy paths (Pavel) - Various tweaks and fixes (Dylan, Pavel) * tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux: (41 commits) io_uring: keep unlock_post inlined in hot path io_uring: don't use complete_post in kbuf io_uring: spelling fix io_uring: remove io_req_complete_post_tw io_uring: allow multishot polled reqs to defer completion io_uring: remove overflow param from io_post_aux_cqe io_uring: add lockdep assertion in io_fill_cqe_aux io_uring: make io_fill_cqe_aux static io_uring: add io_aux_cqe which allows deferred completion io_uring: allow defer completion for aux posted cqes io_uring: defer all io_req_complete_failed io_uring: always lock in io_apoll_task_func io_uring: remove iopoll spinlock io_uring: iopoll protect complete_post io_uring: inline __io_req_complete_put() io_uring: remove io_req_tw_post_queue io_uring: use io_req_task_complete() in timeout io_uring: hold locks for io_req_complete_failed io_uring: add completion locking for iopoll io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post() ...
2022-12-07io_uring: force multishot CQEs into task contextPavel Begunkov
Multishot are posting CQEs outside of the normal request completion path, which is usually done from within a task work handler. However, it might be not the case when it's yet to be polled but has been punted to io-wq. Make it abide ->task_complete and push it to the polling path when executed by io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d7714aaff583096769a0f26e8e747759e556feb1.1670384893.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25use less confusing names for iov_iter direction initializersAl Viro
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25io_uring: add io_aux_cqe which allows deferred completionDylan Yudaken
Use the just introduced deferred post cqe completion state when possible in io_aux_cqe. If not possible fallback to io_post_aux_cqe. This introduces a complication because of allow_overflow. For deferred completions we cannot know without locking the completion_lock if it will overflow (and even if we locked it, another post could sneak in and cause this cqe to be in overflow). However since overflow protection is mostly a best effort defence in depth to prevent infinite loops of CQEs for poll, just checking the overflow bit is going to be good enough and will result in at most 16 (array size of deferred cqes) overflows. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-6-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21io_uring: allow multishot recv CQEs to overflowDylan Yudaken
With commit aa1df3a360a0 ("io_uring: fix CQE reordering"), there are stronger guarantees for overflow ordering. Specifically ensuring that userspace will not receive out of order receive CQEs. Therefore this is not needed any more for recv/recvmsg. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221107125236.260132-4-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21io_uring: revert "io_uring fix multishot accept ordering"Dylan Yudaken
This is no longer needed after commit aa1df3a360a0 ("io_uring: fix CQE reordering"), since all reordering is now taken care of. This reverts commit cbd25748545c ("io_uring: fix multishot accept ordering"). Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221107125236.260132-2-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21io_uring: fix two assignments in if conditionsXinghui Li
Fixes two errors: "ERROR: do not use assignment in if condition 130: FILE: io_uring/net.c:130: + if (!(issue_flags & IO_URING_F_UNLOCKED) && ERROR: do not use assignment in if condition 599: FILE: io_uring/poll.c:599: + } else if (!(issue_flags & IO_URING_F_UNLOCKED) &&" reported by checkpatch.pl in net.c and poll.c . Signed-off-by: Xinghui Li <korantli@tencent.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/20221102082503.32236-1-korantwork@gmail.com [axboe: style tweaks] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21io_uring/net: move mm accounting to a slower pathPavel Begunkov
We can also move mm accounting to the extended callbacks. It removes a few cycles from the hot path including skipping one function call and setting io_req_task_complete as a callback directly. For user backed I/O it shouldn't make any difference taking into considering atomic mm accounting and page pinning. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1062f270273ad11c1b7b45ec59a6a317533d5e64.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21io_uring: move zc reporting from the hot pathPavel Begunkov
Add custom tw and notif callbacks on top of usual bits also handling zc reporting. That moves it from the hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/40de4a6409042478e1f35adc4912e23226cb1b5c.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21io_uring/net: introduce IORING_SEND_ZC_REPORT_USAGE flagStefan Metzmacher
It might be useful for applications to detect if a zero copy transfer with SEND[MSG]_ZC was actually possible or not. The application can fallback to plain SEND[MSG] in order to avoid the overhead of two cqes per request. Or it can generate a log message that could indicate to an administrator that no zero copy was possible and could explain degraded performance. Cc: stable@vger.kernel.org # 6.1 Link: https://lore.kernel.org/io-uring/fb6a7599-8a9b-15e5-9b64-6cd9d01c6ff4@gmail.com/T/#m2b0d9df94ce43b0e69e6c089bdff0ce6babbdfaa Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8945b01756d902f5d5b0667f20b957ad3f742e5e.1666895626.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17io_uring: fix multishot recv request leaksPavel Begunkov
Having REQ_F_POLLED set doesn't guarantee that the request is executed as a multishot from the polling path. Fortunately for us, if the code thinks it's multishot issue when it's not, it can only ask to skip completion so leaking the request. Use issue_flags to mark multipoll issues. Cc: stable@vger.kernel.org Fixes: 1300ebb20286b ("io_uring: multishot recv") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/37762040ba9c52b81b92a2f5ebfd4ee484088951.1668710222.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17io_uring: fix multishot accept request leaksPavel Begunkov
Having REQ_F_POLLED set doesn't guarantee that the request is executed as a multishot from the polling path. Fortunately for us, if the code thinks it's multishot issue when it's not, it can only ask to skip completion so leaking the request. Use issue_flags to mark multipoll issues. Cc: stable@vger.kernel.org Fixes: 390ed29b5e425 ("io_uring: add IORING_ACCEPT_MULTISHOT for accept") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7700ac57653f2823e30b34dc74da68678c0c5f13.1668710222.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-22io_uring/net: fail zc sendmsg when unsupported by socketPavel Begunkov
The previous patch fails zerocopy send requests for protocols that don't support it, do the same for zerocopy sendmsg. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0854e7bb4c3d810a48ec8b5853e2f61af36a0467.1666346426.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-22io_uring/net: fail zc send when unsupported by socketPavel Begunkov
If a protocol doesn't support zerocopy it will silently fall back to copying. This type of behaviour has always been a source of troubles so it's better to fail such requests instead. Cc: <stable@vger.kernel.org> # 6.0 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2db3c7f16bb6efab4b04569cd16e6242b40c5cb3.1666346426.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECTJens Axboe
We treat EINPROGRESS like EAGAIN, but if we're retrying post getting EINPROGRESS, then we just need to check the socket for errors and terminate the request. This was exposed on a bluetooth connection request which ends up taking a while and hitting EINPROGRESS, and yields a CQE result of -EBADFD because we're retrying a connect on a socket that is now connected. Cc: stable@vger.kernel.org Fixes: 87f80d623c6c ("io_uring: handle connect -EINPROGRESS like -EAGAIN") Link: https://github.com/axboe/liburing/issues/671 Reported-by: Aidan Sun <aidansun05@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29io_uring/net: fix notif cqe reorderingPavel Begunkov
send zc is not restricted to !IO_URING_F_UNLOCKED anymore and so we can't use task-tw ordering trick to order notification cqes with requests completions. In this case leave it alone and let io_send_zc_cleanup() flush it. Cc: stable@vger.kernel.org Fixes: 53bdc88aac9a2 ("io_uring/notif: order notif vs send CQEs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29io_uring/net: don't update msg_name if not providedPavel Begunkov
io_sendmsg_copy_hdr() may clear msg->msg_name if the userspace didn't provide it, we should retain NULL in this case. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29io_uring/net: fix fast_iov assignment in io_setup_async_msg()Stefan Metzmacher
I hit a very bad problem during my tests of SENDMSG_ZC. BUG(); in first_iovec_segment() triggered very easily. The problem was io_setup_async_msg() in the partial retry case, which seems to happen more often with _ZC. iov_iter_iovec_advance() may change i->iov in order to have i->iov_offset being only relative to the first element. Which means kmsg->msg.msg_iter.iov is no longer the same as kmsg->fast_iov. But this would rewind the copy to be the start of async_msg->fast_iov, which means the internal state of sync_msg->msg.msg_iter is inconsitent. I tested with 5 vectors with length like this 4, 0, 64, 20, 8388608 and got a short writes with: - ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244 - ret=-EAGAIN => io_uring_poll_arm - ret=4911225 min_ret=5713448 => remaining 802223 sr->done_io=7586469 - ret=-EAGAIN => io_uring_poll_arm - ret=802223 min_ret=802223 => res=8388692 While this was easily triggered with SENDMSG_ZC (queued for 6.1), it was a potential problem starting with 7ba89d2af17aa879dda30f5d5d3f152e587fc551 in 5.18 for IORING_OP_RECVMSG. And also with 4c3c09439c08b03d9503df0ca4c7619c5842892e in 5.19 for IORING_OP_SENDMSG. However 257e84a5377fbbc336ff563833a8712619acce56 introduced the critical code into io_setup_async_msg() in 5.11. Fixes: 7ba89d2af17aa ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly") Fixes: 257e84a5377fb ("io_uring: refactor sendmsg/recvmsg iov managing") Cc: stable@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-28io_uring/net: fix non-zc send with addressPavel Begunkov
We're currently ignoring the dest address with non-zerocopy send because even though we copy it from the userspace shortly after ->msg_name gets zeroed. Move msghdr init earlier. Fixes: 516e82f0e043a ("io_uring/net: support non-zerocopy sendto") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/176ced5e8568aa5d300ca899b7f05b303ebc49fd.1664409532.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-28io_uring/net: don't skip notifs for failed requestsPavel Begunkov
We currently only add a notification CQE when the send succeded, i.e. cqe.res >= 0. However, it'd be more robust to do buffer notifications for failed requests as well in case drivers decide do something fanky. Always return a buffer notification after initial prep, don't hide it. This behaviour is better aligned with documentation and the patch also helps the userspace to respect it. Cc: stable@vger.kernel.org # 6.0 Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9c8bead87b2b980fcec441b8faef52188b4a6588.1664292100.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>