diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-07-16 19:28:34 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-07-16 19:28:34 -0700 |
commit | 51835949dda3783d4639cfa74ce13a3c9829de00 (patch) | |
tree | 2b593de5eba6ecc73f7c58fc65fdaffae45c7323 /drivers/net/virtio_net.c | |
parent | 0434dbe32053d07d658165be681505120c6b1abc (diff) | |
parent | 77ae5e5b00720372af2860efdc4bc652ac682696 (diff) |
Merge tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextHEADmaster
Pull networking updates from Jakub Kicinski:
"Not much excitement - a handful of large patchsets (devmem among them)
did not make it in time.
Core & protocols:
- Use local_lock in addition to local_bh_disable() to protect per-CPU
resources in networking, a step closer for local_bh_disable() not
to act as a big lock on PREEMPT_RT
- Use flex array for netdevice priv area, ensure its cache alignment
- Add a sysctl knob to allow user to specify a default rto_min at
socket init time. Bit of a big hammer but multiple companies were
independently carrying such patch downstream so clearly it's useful
- Support scheduling transmission of packets based on CLOCK_TAI
- Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned
off using cpusets
- Support multiple L2TPv3 UDP tunnels using the same 5-tuple address
- Allow configuration of multipath hash seed, to both allow
synchronizing hashing of two routers, and preventing partial
accidental sync
- Improve TCP compliance with RFC 9293 for simultaneous connect()
- Support sending NAT keepalives in IPsec ESP in UDP states.
Userspace IKE daemon had to do this before, but the kernel can
better keep track of it
- Support sending supervision HSR frames with MAC addresses stored in
ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled
- Introduce IPPROTO_SMC for selecting SMC when socket is created
- Allow UDP GSO transmit from devices with no checksum offload
- openvswitch: add packet sampling via psample, separating the
sampled traffic from "upcall" packets sent to user space for
forwarding
- nf_tables: shrink memory consumption for transaction objects
Things we sprinkled into general kernel code:
- Power Sequencing subsystem (used by Qualcomm Bluetooth driver for
QCA6390) [ Already merged separately - Linus ]
- Add IRQ information in sysfs for auxiliary bus
- Introduce guard definition for local_lock
- Add aligned flavor of __cacheline_group_{begin, end}() markings for
grouping fields in structures
BPF:
- Notify user space (via epoll) when a struct_ops object is getting
detached/unregistered
- Add new kfuncs for a generic, open-coded bits iterator
- Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
bpf_list_head
- Support resilient split BTF which cuts down on duplication and
makes BTF as compact as possible WRT BTF from modules
- Add support for dumping kfunc prototypes from BTF which enables
both detecting as well as dumping compilable prototypes for kfuncs
- riscv64 BPF JIT improvements in particular to add 12-argument
support for BPF trampolines and to utilize bpf_prog_pack for the
latter
- Add the capability to offload the netfilter flowtable in XDP layer
through kfuncs
Driver API:
- Allow users to configure IRQ tresholds between which automatic IRQ
moderation can choose
- Expand Power Sourcing (PoE) status with power, class and failure
reason. Support setting power limits
- Track additional RSS contexts in the core, make sure configuration
changes don't break them
- Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated
ESP data paths
- Support updating firmware on SFP modules
Tests and tooling:
- mptcp: use net/lib.sh to manage netns
- TCP-AO and TCP-MD5: replace debug prints used by tests with
tracepoints
- openvswitch: make test self-contained (don't depend on OvS CLI
tools)
Drivers:
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- increase the max total outstanding PTP TX packets to 4
- add timestamping statistics support
- implement netdev_queue_mgmt_ops
- support new RSS context API
- Intel (100G, ice, idpf):
- implement FEC statistics and dumping signal quality indicators
- support E825C products (with 56Gbps PHYs)
- nVidia/Mellanox:
- support HW-GRO
- mlx4/mlx5: support per-queue statistics via netlink
- obey the max number of EQs setting in sub-functions
- AMD/Solarflare:
- support new RSS context API
- AMD/Pensando:
- ionic: rework fix for doorbell miss to lower overhead and
skip it on new HW
- Wangxun:
- txgbe: support Flow Director perfect filters
- Ethernet NICs consumer, embedded and virtual:
- Add driver for Tehuti Networks TN40xx chips
- Add driver for Meta's internal NIC chips
- Add driver for Ethernet MAC on Airoha EN7581 SoCs
- Add driver for Renesas Ethernet-TSN devices
- Google cloud vNIC:
- flow steering support
- Microsoft vNIC:
- support page sizes other than 4KB on ARM64
- vmware vNIC:
- support latency measurement (update to version 9)
- VirtIO net:
- support for Byte Queue Limits
- support configuring thresholds for automatic IRQ moderation
- support for AF_XDP Rx zero-copy
- Synopsys (stmmac):
- support for STM32MP13 SoC
- let platforms select the right PCS implementation
- TI:
- icssg-prueth: add multicast filtering support
- icssg-prueth: enable PTP timestamping and PPS
- Renesas:
- ravb: improve Rx performance 30-400% by using page pool,
theaded NAPI and timer-based IRQ coalescing
- ravb: add MII support for R-Car V4M
- Cadence (macb):
- macb: add ARP support to Wake-On-LAN
- Cortina:
- use phylib for RX and TX pause configuration
- Ethernet switches:
- nVidia/Mellanox:
- support configuration of multipath hash seed
- report more accurate max MTU
- use page_pool to improve Rx performance
- MediaTek:
- mt7530: add support for bridge port isolation
- Qualcomm:
- qca8k: add support for bridge port isolation
- Microchip:
- lan9371/2: add 100BaseTX PHY support
- NXP:
- vsc73xx: implement VLAN operations
- Ethernet PHYs:
- aquantia: enable support for aqr115c
- aquantia: add support for PHY LEDs
- realtek: add support for rtl8224 2.5Gbps PHY
- xpcs: add memory-mapped device support
- add BroadR-Reach link mode and support in Broadcom's PHY driver
- CAN:
- add document for ISO 15765-2 protocol support
- mcp251xfd: workaround for erratum DS80000789E, use timestamps to
catch when device returns incorrect FIFO status
- WiFi:
- mac80211/cfg80211:
- parse Transmit Power Envelope (TPE) data in mac80211 instead
of in drivers
- improvements for 6 GHz regulatory flexibility
- multi-link improvements
- support multiple radios per wiphy
- remove DEAUTH_NEED_MGD_TX_PREP flag
- Intel (iwlwifi):
- bump FW API to 91 for BZ/SC devices
- report 64-bit radiotap timestamp
- enable P2P low latency by default
- handle Transmit Power Envelope (TPE) advertised by AP
- remove support for older FW for new devices
- fast resume (keeping the device configured)
- mvm: re-enable Multi-Link Operation (MLO)
- aggregation (A-MSDU) optimizations
- MediaTek (mt76):
- mt7925 Multi-Link Operation (MLO) support
- Qualcomm (ath10k):
- LED support for various chipsets
- Qualcomm (ath12k):
- remove unsupported Tx monitor handling
- support channel 2 in 6 GHz band
- support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
- supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
Advertisements (EMA)
- support dynamic VLAN
- add panic handler for resetting the firmware state
- DebugFS support for datapath statistics
- WCN7850: support for Wake on WLAN
- Microchip (wilc1000):
- read MAC address during probe to make it visible to user space
- suspend/resume improvements
- TI (wl18xx):
- support newer firmware versions
- RealTek (rtw89):
- preparation for RTL8852BE-VT support
- Wake on WLAN support for WiFi 6 chips
- 36-bit PCI DMA support
- RealTek (rtlwifi):
- RTL8192DU support
- Broadcom (brcmfmac):
- Management Frame Protection support (to enable WPA3)
- Bluetooth:
- qualcomm: use the power sequencer for QCA6390
- btusb: mediatek: add ISO data transmission functions
- hci_bcm4377: add BCM4388 support
- btintel: add support for BlazarU core
- btintel: add support for Whale Peak2
- btnxpuart: add support for AW693 A1 chipset
- btnxpuart: add support for IW615 chipset
- btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591"
* tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits)
eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering"
tcp: Replace strncpy() with strscpy()
wifi: ath12k: fix build vs old compiler
tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child().
eth: fbnic: Write the TCAM tables used for RSS control and Rx to host
eth: fbnic: Add L2 address programming
eth: fbnic: Add basic Rx handling
eth: fbnic: Add basic Tx handling
eth: fbnic: Add link detection
eth: fbnic: Add initial messaging to notify FW of our presence
eth: fbnic: Implement Rx queue alloc/start/stop/free
eth: fbnic: Implement Tx queue alloc/start/stop/free
eth: fbnic: Allocate a netdevice and napi vectors with queues
eth: fbnic: Add FW communication mechanism
eth: fbnic: Add message parsing for FW messages
eth: fbnic: Add register init to set PCIe/Ethernet device config
eth: fbnic: Allocate core device specific structures and devlink interface
eth: fbnic: Add scaffolding for Meta's NIC driver
PCI: Add Meta Platforms vendor ID
net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK
...
Diffstat (limited to 'drivers/net/virtio_net.c')
-rw-r--r-- | drivers/net/virtio_net.c | 914 |
1 files changed, 785 insertions, 129 deletions
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ea10db9a09fa..af474cc191d0 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -25,6 +25,7 @@ #include <net/net_failover.h> #include <net/netdev_rx_queue.h> #include <net/netdev_queues.h> +#include <net/xdp_sock_drv.h> static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); @@ -40,14 +41,12 @@ module_param(napi_tx, bool, 0644); #define VIRTNET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD) -/* Amount of XDP headroom to prepend to packets for use by xdp_adjust_head */ -#define VIRTIO_XDP_HEADROOM 256 - /* Separating two types of XDP xmit */ #define VIRTIO_XDP_TX BIT(0) #define VIRTIO_XDP_REDIR BIT(1) -#define VIRTIO_XDP_FLAG BIT(0) +#define VIRTIO_XDP_FLAG BIT(0) +#define VIRTIO_ORPHAN_FLAG BIT(1) /* RX packet size EWMA. The average packet size is used to determine the packet * buffer size when refilling RX rings. As the entire RX ring may be refilled @@ -85,6 +84,8 @@ struct virtnet_stat_desc { struct virtnet_sq_free_stats { u64 packets; u64 bytes; + u64 napi_packets; + u64 napi_bytes; }; struct virtnet_sq_stats { @@ -348,6 +349,13 @@ struct receive_queue { /* Record the last dma info to free after new pages is allocated. */ struct virtnet_rq_dma *last_dma; + + struct xsk_buff_pool *xsk_pool; + + /* xdp rxq used by xsk */ + struct xdp_rxq_info xsk_rxq_info; + + struct xdp_buff **xsk_buffs; }; /* This structure can contain rss message with maximum settings for indirection table and keysize @@ -490,6 +498,16 @@ struct virtio_net_common_hdr { }; static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf); +static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp, + struct net_device *dev, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats); +static void virtnet_receive_done(struct virtnet_info *vi, struct receive_queue *rq, + struct sk_buff *skb, u8 flags); +static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb, + struct sk_buff *curr_skb, + struct page *page, void *buf, + int len, int truesize); static bool is_xdp_frame(void *ptr) { @@ -506,29 +524,50 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); } -static void __free_old_xmit(struct send_queue *sq, bool in_napi, - struct virtnet_sq_free_stats *stats) +static bool is_orphan_skb(void *ptr) +{ + return (unsigned long)ptr & VIRTIO_ORPHAN_FLAG; +} + +static void *skb_to_ptr(struct sk_buff *skb, bool orphan) +{ + return (void *)((unsigned long)skb | (orphan ? VIRTIO_ORPHAN_FLAG : 0)); +} + +static struct sk_buff *ptr_to_skb(void *ptr) +{ + return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG); +} + +static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq, + bool in_napi, struct virtnet_sq_free_stats *stats) { unsigned int len; void *ptr; while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - ++stats->packets; - if (!is_xdp_frame(ptr)) { - struct sk_buff *skb = ptr; + struct sk_buff *skb = ptr_to_skb(ptr); pr_debug("Sent skb %p\n", skb); - stats->bytes += skb->len; + if (is_orphan_skb(ptr)) { + stats->packets++; + stats->bytes += skb->len; + } else { + stats->napi_packets++; + stats->napi_bytes += skb->len; + } napi_consume_skb(skb, in_napi); } else { struct xdp_frame *frame = ptr_to_xdp(ptr); + stats->packets++; stats->bytes += xdp_get_frame_len(frame); xdp_return_frame(frame); } } + netdev_tx_completed_queue(txq, stats->napi_packets, stats->napi_bytes); } /* Converting between virtqueue no. and kernel tx/rx queue no. @@ -949,27 +988,33 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, void *buf) rq = &vi->rq[i]; + if (rq->xsk_pool) { + xsk_buff_free((struct xdp_buff *)buf); + return; + } + if (!vi->big_packets || vi->mergeable_rx_bufs) virtnet_rq_unmap(rq, buf, 0); virtnet_rq_free_buf(vi, rq, buf); } -static void free_old_xmit(struct send_queue *sq, bool in_napi) +static void free_old_xmit(struct send_queue *sq, struct netdev_queue *txq, + bool in_napi) { struct virtnet_sq_free_stats stats = {0}; - __free_old_xmit(sq, in_napi, &stats); + __free_old_xmit(sq, txq, in_napi, &stats); /* Avoid overhead when no packets have been processed * happens when called speculatively from start_xmit. */ - if (!stats.packets) + if (!stats.packets && !stats.napi_packets) return; u64_stats_update_begin(&sq->stats.syncp); - u64_stats_add(&sq->stats.bytes, stats.bytes); - u64_stats_add(&sq->stats.packets, stats.packets); + u64_stats_add(&sq->stats.bytes, stats.bytes + stats.napi_bytes); + u64_stats_add(&sq->stats.packets, stats.packets + stats.napi_packets); u64_stats_update_end(&sq->stats.syncp); } @@ -1003,7 +1048,9 @@ static void check_sq_full_and_disable(struct virtnet_info *vi, * early means 16 slots are typically wasted. */ if (sq->vq->num_free < 2+MAX_SKB_FRAGS) { - netif_stop_subqueue(dev, qnum); + struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum); + + netif_tx_stop_queue(txq); u64_stats_update_begin(&sq->stats.syncp); u64_stats_inc(&sq->stats.stop); u64_stats_update_end(&sq->stats.syncp); @@ -1012,7 +1059,7 @@ static void check_sq_full_and_disable(struct virtnet_info *vi, virtqueue_napi_schedule(&sq->napi, sq->vq); } else if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) { /* More just got used, free them then recheck. */ - free_old_xmit(sq, false); + free_old_xmit(sq, txq, false); if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) { netif_start_subqueue(dev, qnum); u64_stats_update_begin(&sq->stats.syncp); @@ -1024,6 +1071,329 @@ static void check_sq_full_and_disable(struct virtnet_info *vi, } } +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) +{ + sg->dma_address = addr; + sg->length = len; +} + +static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi, + struct receive_queue *rq, void *buf, u32 len) +{ + struct xdp_buff *xdp; + u32 bufsize; + + xdp = (struct xdp_buff *)buf; + + bufsize = xsk_pool_get_rx_frame_size(rq->xsk_pool) + vi->hdr_len; + + if (unlikely(len > bufsize)) { + pr_debug("%s: rx error: len %u exceeds truesize %u\n", + vi->dev->name, len, bufsize); + DEV_STATS_INC(vi->dev, rx_length_errors); + xsk_buff_free(xdp); + return NULL; + } + + xsk_buff_set_size(xdp, len); + xsk_buff_dma_sync_for_cpu(xdp); + + return xdp; +} + +static struct sk_buff *xsk_construct_skb(struct receive_queue *rq, + struct xdp_buff *xdp) +{ + unsigned int metasize = xdp->data - xdp->data_meta; + struct sk_buff *skb; + unsigned int size; + + size = xdp->data_end - xdp->data_hard_start; + skb = napi_alloc_skb(&rq->napi, size); + if (unlikely(!skb)) { + xsk_buff_free(xdp); + return NULL; + } + + skb_reserve(skb, xdp->data_meta - xdp->data_hard_start); + + size = xdp->data_end - xdp->data_meta; + memcpy(__skb_put(skb, size), xdp->data_meta, size); + + if (metasize) { + __skb_pull(skb, metasize); + skb_metadata_set(skb, metasize); + } + + xsk_buff_free(xdp); + + return skb; +} + +static struct sk_buff *virtnet_receive_xsk_small(struct net_device *dev, struct virtnet_info *vi, + struct receive_queue *rq, struct xdp_buff *xdp, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + struct bpf_prog *prog; + u32 ret; + + ret = XDP_PASS; + rcu_read_lock(); + prog = rcu_dereference(rq->xdp_prog); + if (prog) + ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats); + rcu_read_unlock(); + + switch (ret) { + case XDP_PASS: + return xsk_construct_skb(rq, xdp); + + case XDP_TX: + case XDP_REDIRECT: + return NULL; + + default: + /* drop packet */ + xsk_buff_free(xdp); + u64_stats_inc(&stats->drops); + return NULL; + } +} + +static void xsk_drop_follow_bufs(struct net_device *dev, + struct receive_queue *rq, + u32 num_buf, + struct virtnet_rq_stats *stats) +{ + struct xdp_buff *xdp; + u32 len; + + while (num_buf-- > 1) { + xdp = virtqueue_get_buf(rq->vq, &len); + if (unlikely(!xdp)) { + pr_debug("%s: rx error: %d buffers missing\n", + dev->name, num_buf); + DEV_STATS_INC(dev, rx_length_errors); + break; + } + u64_stats_add(&stats->bytes, len); + xsk_buff_free(xdp); + } +} + +static int xsk_append_merge_buffer(struct virtnet_info *vi, + struct receive_queue *rq, + struct sk_buff *head_skb, + u32 num_buf, + struct virtio_net_hdr_mrg_rxbuf *hdr, + struct virtnet_rq_stats *stats) +{ + struct sk_buff *curr_skb; + struct xdp_buff *xdp; + u32 len, truesize; + struct page *page; + void *buf; + + curr_skb = head_skb; + + while (--num_buf) { + buf = virtqueue_get_buf(rq->vq, &len); + if (unlikely(!buf)) { + pr_debug("%s: rx error: %d buffers out of %d missing\n", + vi->dev->name, num_buf, + virtio16_to_cpu(vi->vdev, + hdr->num_buffers)); + DEV_STATS_INC(vi->dev, rx_length_errors); + return -EINVAL; + } + + u64_stats_add(&stats->bytes, len); + + xdp = buf_to_xdp(vi, rq, buf, len); + if (!xdp) + goto err; + + buf = napi_alloc_frag(len); + if (!buf) { + xsk_buff_free(xdp); + goto err; + } + + memcpy(buf, xdp->data - vi->hdr_len, len); + + xsk_buff_free(xdp); + + page = virt_to_page(buf); + + truesize = len; + + curr_skb = virtnet_skb_append_frag(head_skb, curr_skb, page, + buf, len, truesize); + if (!curr_skb) { + put_page(page); + goto err; + } + } + + return 0; + +err: + xsk_drop_follow_bufs(vi->dev, rq, num_buf, stats); + return -EINVAL; +} + +static struct sk_buff *virtnet_receive_xsk_merge(struct net_device *dev, struct virtnet_info *vi, + struct receive_queue *rq, struct xdp_buff *xdp, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + struct virtio_net_hdr_mrg_rxbuf *hdr; + struct bpf_prog *prog; + struct sk_buff *skb; + u32 ret, num_buf; + + hdr = xdp->data - vi->hdr_len; + num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers); + + ret = XDP_PASS; + rcu_read_lock(); + prog = rcu_dereference(rq->xdp_prog); + /* TODO: support multi buffer. */ + if (prog && num_buf == 1) + ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats); + rcu_read_unlock(); + + switch (ret) { + case XDP_PASS: + skb = xsk_construct_skb(rq, xdp); + if (!skb) + goto drop_bufs; + + if (xsk_append_merge_buffer(vi, rq, skb, num_buf, hdr, stats)) { + dev_kfree_skb(skb); + goto drop; + } + + return skb; + + case XDP_TX: + case XDP_REDIRECT: + return NULL; + + default: + /* drop packet */ + xsk_buff_free(xdp); + } + +drop_bufs: + xsk_drop_follow_bufs(dev, rq, num_buf, stats); + +drop: + u64_stats_inc(&stats->drops); + return NULL; +} + +static void virtnet_receive_xsk_buf(struct virtnet_info *vi, struct receive_queue *rq, + void *buf, u32 len, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + struct net_device *dev = vi->dev; + struct sk_buff *skb = NULL; + struct xdp_buff *xdp; + u8 flags; + + len -= vi->hdr_len; + + u64_stats_add(&stats->bytes, len); + + xdp = buf_to_xdp(vi, rq, buf, len); + if (!xdp) + return; + + if (unlikely(len < ETH_HLEN)) { + pr_debug("%s: short packet %i\n", dev->name, len); + DEV_STATS_INC(dev, rx_length_errors); + xsk_buff_free(xdp); + return; + } + + flags = ((struct virtio_net_common_hdr *)(xdp->data - vi->hdr_len))->hdr.flags; + + if (!vi->mergeable_rx_bufs) + skb = virtnet_receive_xsk_small(dev, vi, rq, xdp, xdp_xmit, stats); + else + skb = virtnet_receive_xsk_merge(dev, vi, rq, xdp, xdp_xmit, stats); + + if (skb) + virtnet_receive_done(vi, rq, skb, flags); +} + +static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue *rq, + struct xsk_buff_pool *pool, gfp_t gfp) +{ + struct xdp_buff **xsk_buffs; + dma_addr_t addr; + int err = 0; + u32 len, i; + int num; + + xsk_buffs = rq->xsk_buffs; + + num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free); + if (!num) + return -ENOMEM; + + len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len; + + for (i = 0; i < num; ++i) { + /* Use the part of XDP_PACKET_HEADROOM as the virtnet hdr space. + * We assume XDP_PACKET_HEADROOM is larger than hdr->len. + * (see function virtnet_xsk_pool_enable) + */ + addr = xsk_buff_xdp_get_dma(xsk_buffs[i]) - vi->hdr_len; + + sg_init_table(rq->sg, 1); + sg_fill_dma(rq->sg, addr, len); + + err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp); + if (err) + goto err; + } + + return num; + +err: + for (; i < num; ++i) + xsk_buff_free(xsk_buffs[i]); + + return err; +} + +static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag) +{ + struct virtnet_info *vi = netdev_priv(dev); + struct send_queue *sq; + + if (!netif_running(dev)) + return -ENETDOWN; + + if (qid >= vi->curr_queue_pairs) + return -EINVAL; + + sq = &vi->sq[qid]; + + if (napi_if_scheduled_mark_missed(&sq->napi)) + return 0; + + local_bh_disable(); + virtqueue_napi_schedule(&sq->napi, sq->vq); + local_bh_enable(); + + return 0; +} + static int __virtnet_xdp_xmit_one(struct virtnet_info *vi, struct send_queue *sq, struct xdp_frame *xdpf) @@ -1138,7 +1508,8 @@ static int virtnet_xdp_xmit(struct net_device *dev, } /* Free up any pending old buffers before queueing new ones. */ - __free_old_xmit(sq, false, &stats); + __free_old_xmit(sq, netdev_get_tx_queue(dev, sq - vi->sq), + false, &stats); for (i = 0; i < n; i++) { struct xdp_frame *xdpf = frames[i]; @@ -1240,7 +1611,7 @@ static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp, static unsigned int virtnet_get_headroom(struct virtnet_info *vi) { - return vi->xdp_enabled ? VIRTIO_XDP_HEADROOM : 0; + return vi->xdp_enabled ? XDP_PACKET_HEADROOM : 0; } /* We copy the packet for XDP in the following cases: @@ -1304,7 +1675,7 @@ static struct page *xdp_linearize_page(struct receive_queue *rq, } /* Headroom does not contribute to packet length */ - *len = page_off - VIRTIO_XDP_HEADROOM; + *len = page_off - XDP_PACKET_HEADROOM; return page; err_buf: __free_pages(page, 0); @@ -1591,8 +1962,8 @@ static int virtnet_build_xdp_buff_mrg(struct net_device *dev, void *ctx; xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq); - xdp_prepare_buff(xdp, buf - VIRTIO_XDP_HEADROOM, - VIRTIO_XDP_HEADROOM + vi->hdr_len, len - vi->hdr_len, true); + xdp_prepare_buff(xdp, buf - XDP_PACKET_HEADROOM, + XDP_PACKET_HEADROOM + vi->hdr_len, len - vi->hdr_len, true); if (!*num_buf) return 0; @@ -1709,12 +2080,12 @@ static void *mergeable_xdp_get_buf(struct virtnet_info *vi, /* linearize data for XDP */ xdp_page = xdp_linearize_page(rq, num_buf, *page, offset, - VIRTIO_XDP_HEADROOM, + XDP_PACKET_HEADROOM, len); if (!xdp_page) return NULL; } else { - xdp_room = SKB_DATA_ALIGN(VIRTIO_XDP_HEADROOM + + xdp_room = SKB_DATA_ALIGN(XDP_PACKET_HEADROOM + sizeof(struct skb_shared_info)); if (*len + xdp_room > PAGE_SIZE) return NULL; @@ -1723,7 +2094,7 @@ static void *mergeable_xdp_get_buf(struct virtnet_info *vi, if (!xdp_page) return NULL; - memcpy(page_address(xdp_page) + VIRTIO_XDP_HEADROOM, + memcpy(page_address(xdp_page) + XDP_PACKET_HEADROOM, page_address(*page) + offset, *len); } @@ -1733,7 +2104,7 @@ static void *mergeable_xdp_get_buf(struct virtnet_info *vi, *page = xdp_page; - return page_address(*page) + VIRTIO_XDP_HEADROOM; + return page_address(*page) + XDP_PACKET_HEADROOM; } static struct sk_buff *receive_mergeable_xdp(struct net_device *dev, @@ -1796,6 +2167,49 @@ err_xdp: return NULL; } +static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb, + struct sk_buff *curr_skb, + struct page *page, void *buf, + int len, int truesize) +{ + int num_skb_frags; + int offset; + + num_skb_frags = skb_shinfo(curr_skb)->nr_frags; + if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) { + struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC); + + if (unlikely(!nskb)) + return NULL; + + if (curr_skb == head_skb) + skb_shinfo(curr_skb)->frag_list = nskb; + else + curr_skb->next = nskb; + curr_skb = nskb; + head_skb->truesize += nskb->truesize; + num_skb_frags = 0; + } + + if (curr_skb != head_skb) { + head_skb->data_len += len; + head_skb->len += len; + head_skb->truesize += truesize; + } + + offset = buf - page_address(page); + if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) { + put_page(page); + skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1, + len, truesize); + } else { + skb_add_rx_frag(curr_skb, num_skb_frags, page, + offset, len, truesize); + } + + return curr_skb; +} + static struct sk_buff *receive_mergeable(struct net_device *dev, struct virtnet_info *vi, struct receive_queue *rq, @@ -1845,8 +2259,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, if (unlikely(!curr_skb)) goto err_skb; while (--num_buf) { - int num_skb_frags; - buf = virtnet_rq_get_buf(rq, &len, &ctx); if (unlikely(!buf)) { pr_debug("%s: rx error: %d buffers out of %d missing\n", @@ -1871,34 +2283,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, goto err_skb; } - num_skb_frags = skb_shinfo(curr_skb)->nr_frags; - if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) { - struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC); - - if (unlikely(!nskb)) - goto err_skb; - if (curr_skb == head_skb) - skb_shinfo(curr_skb)->frag_list = nskb; - else - curr_skb->next = nskb; - curr_skb = nskb; - head_skb->truesize += nskb->truesize; - num_skb_frags = 0; - } - if (curr_skb != head_skb) { - head_skb->data_len += len; - head_skb->len += len; - head_skb->truesize += truesize; - } - offset = buf - page_address(page); - if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) { - put_page(page); - skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1, - len, truesize); - } else { - skb_add_rx_frag(curr_skb, num_skb_frags, page, - offset, len, truesize); - } + curr_skb = virtnet_skb_append_frag(head_skb, curr_skb, page, + buf, len, truesize); + if (!curr_skb) + goto err_skb; } ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len); @@ -1943,6 +2331,40 @@ static void virtio_skb_set_hash(const struct virtio_net_hdr_v1_hash *hdr_hash, skb_set_hash(skb, __le32_to_cpu(hdr_hash->hash_value), rss_hash_type); } +static void virtnet_receive_done(struct virtnet_info *vi, struct receive_queue *rq, + struct sk_buff *skb, u8 flags) +{ + struct virtio_net_common_hdr *hdr; + struct net_device *dev = vi->dev; + + hdr = skb_vnet_common_hdr(skb); + if (dev->features & NETIF_F_RXHASH && vi->has_rss_hash_report) + virtio_skb_set_hash(&hdr->hash_v1_hdr, skb); + + if (flags & VIRTIO_NET_HDR_F_DATA_VALID) + skb->ip_summed = CHECKSUM_UNNECESSARY; + + if (virtio_net_hdr_to_skb(skb, &hdr->hdr, + virtio_is_little_endian(vi->vdev))) { + net_warn_ratelimited("%s: bad gso: type: %u, size: %u\n", + dev->name, hdr->hdr.gso_type, + hdr->hdr.gso_size); + goto frame_err; + } + + skb_record_rx_queue(skb, vq2rxq(rq->vq)); + skb->protocol = eth_type_trans(skb, dev); + pr_debug("Receiving skb proto 0x%04x len %i type %i\n", + ntohs(skb->protocol), skb->len, skb->pkt_type); + + napi_gro_receive(&rq->napi, skb); + return; + +frame_err: + DEV_STATS_INC(dev, rx_frame_errors); + dev_kfree_skb(skb); +} + static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq, void *buf, unsigned int len, void **ctx, unsigned int *xdp_xmit, @@ -1950,7 +2372,6 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq, { struct net_device *dev = vi->dev; struct sk_buff *skb; - struct virtio_net_common_hdr *hdr; u8 flags; if (unlikely(len < vi->hdr_len + ETH_HLEN)) { @@ -1980,32 +2401,7 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq, if (unlikely(!skb)) return; - hdr = skb_vnet_common_hdr(skb); - if (dev->features & NETIF_F_RXHASH && vi->has_rss_hash_report) - virtio_skb_set_hash(&hdr->hash_v1_hdr, skb); - - if (flags & VIRTIO_NET_HDR_F_DATA_VALID) - skb->ip_summed = CHECKSUM_UNNECESSARY; - - if (virtio_net_hdr_to_skb(skb, &hdr->hdr, - virtio_is_little_endian(vi->vdev))) { - net_warn_ratelimited("%s: bad gso: type: %u, size: %u\n", - dev->name, hdr->hdr.gso_type, - hdr->hdr.gso_size); - goto frame_err; - } - - skb_record_rx_queue(skb, vq2rxq(rq->vq)); - skb->protocol = eth_type_trans(skb, dev); - pr_debug("Receiving skb proto 0x%04x len %i type %i\n", - ntohs(skb->protocol), skb->len, skb->pkt_type); - - napi_gro_receive(&rq->napi, skb); - return; - -frame_err: - DEV_STATS_INC(dev, rx_frame_errors); - dev_kfree_skb(skb); + virtnet_receive_done(vi, rq, skb, flags); } /* Unlike mergeable buffers, all buffers are allocated to the @@ -2166,7 +2562,11 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq, gfp_t gfp) { int err; - bool oom; + + if (rq->xsk_pool) { + err = virtnet_add_recvbuf_xsk(vi, rq, rq->xsk_pool, gfp); + goto kick; + } do { if (vi->mergeable_rx_bufs) @@ -2176,10 +2576,11 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq, else err = add_recvbuf_small(vi, rq, gfp); - oom = err == -ENOMEM; if (err) break; } while (rq->vq->num_free); + +kick: if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) { unsigned long flags; @@ -2188,7 +2589,7 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq, u64_stats_update_end_irqrestore(&rq->stats.syncp, flags); } - return !oom; + return err != -ENOMEM; } static void skb_recv_done(struct virtqueue *rvq) @@ -2259,32 +2660,68 @@ static void refill_work(struct work_struct *work) } } -static int virtnet_receive(struct receive_queue *rq, int budget, - unsigned int *xdp_xmit) +static int virtnet_receive_xsk_bufs(struct virtnet_info *vi, + struct receive_queue *rq, + int budget, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + unsigned int len; + int packets = 0; + void *buf; + + while (packets < budget) { + buf = virtqueue_get_buf(rq->vq, &len); + if (!buf) + break; + + virtnet_receive_xsk_buf(vi, rq, buf, len, xdp_xmit, stats); + packets++; + } + + return packets; +} + +static int virtnet_receive_packets(struct virtnet_info *vi, + struct receive_queue *rq, + int budget, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) { - struct virtnet_info *vi = rq->vq->vdev->priv; - struct virtnet_rq_stats stats = {}; unsigned int len; int packets = 0; void *buf; - int i; if (!vi->big_packets || vi->mergeable_rx_bufs) { void *ctx; - while (packets < budget && (buf = virtnet_rq_get_buf(rq, &len, &ctx))) { - receive_buf(vi, rq, buf, len, ctx, xdp_xmit, &stats); + receive_buf(vi, rq, buf, len, ctx, xdp_xmit, stats); packets++; } } else { while (packets < budget && (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) { - receive_buf(vi, rq, buf, len, NULL, xdp_xmit, &stats); + receive_buf(vi, rq, buf, len, NULL, xdp_xmit, stats); packets++; } } + return packets; +} + +static int virtnet_receive(struct receive_queue *rq, int budget, + unsigned int *xdp_xmit) +{ + struct virtnet_info *vi = rq->vq->vdev->priv; + struct virtnet_rq_stats stats = {}; + int i, packets; + + if (rq->xsk_pool) + packets = virtnet_receive_xsk_bufs(vi, rq, budget, xdp_xmit, &stats); + else + packets = virtnet_receive_packets(vi, rq, budget, xdp_xmit, &stats); + if (rq->vq->num_free > min((unsigned int)budget, virtqueue_get_vring_size(rq->vq)) / 2) { if (!try_fill_recv(vi, rq, GFP_ATOMIC)) { spin_lock(&vi->refill_lock); @@ -2313,7 +2750,7 @@ static int virtnet_receive(struct receive_queue *rq, int budget, return packets; } -static void virtnet_poll_cleantx(struct receive_queue *rq) +static void virtnet_poll_cleantx(struct receive_queue *rq, int budget) { struct virtnet_info *vi = rq->vq->vdev->priv; unsigned int index = vq2rxq(rq->vq); @@ -2331,7 +2768,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq) do { virtqueue_disable_cb(sq->vq); - free_old_xmit(sq, true); + free_old_xmit(sq, txq, !!budget); } while (unlikely(!virtqueue_enable_cb_delayed(sq->vq))); if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) { @@ -2354,12 +2791,13 @@ static void virtnet_rx_dim_update(struct virtnet_info *vi, struct receive_queue if (!rq->packets_in_napi) return; - u64_stats_update_begin(&rq->stats.syncp); + /* Don't need protection when fetching stats, since fetcher and + * updater of the stats are in same context + */ dim_update_sample(rq->calls, u64_stats_read(&rq->stats.packets), u64_stats_read(&rq->stats.bytes), &cur_sample); - u64_stats_update_end(&rq->stats.syncp); net_dim(&rq->dim, cur_sample); rq->packets_in_napi = 0; @@ -2375,7 +2813,7 @@ static int virtnet_poll(struct napi_struct *napi, int budget) unsigned int xdp_xmit = 0; bool napi_complete; - virtnet_poll_cleantx(rq); + virtnet_poll_cleantx(rq, budget); received = virtnet_receive(rq, budget, &xdp_xmit); rq->packets_in_napi += received; @@ -2430,6 +2868,7 @@ static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index) goto err_xdp_reg_mem_model; virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi); + netdev_tx_reset_queue(netdev_get_tx_queue(vi->dev, qp_index)); virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, &vi->sq[qp_index].napi); return 0; @@ -2439,6 +2878,13 @@ err_xdp_reg_mem_model: return err; } +static void virtnet_cancel_dim(struct virtnet_info *vi, struct dim *dim) +{ + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL)) + return; + net_dim_work_cancel(dim); +} + static int virtnet_open(struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); @@ -2465,7 +2911,7 @@ err_enable_qp: for (i--; i >= 0; i--) { virtnet_disable_queue_pair(vi, i); - cancel_work_sync(&vi->rq[i].dim.work); + virtnet_cancel_dim(vi, &vi->rq[i].dim); } return err; @@ -2489,7 +2935,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) txq = netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); virtqueue_disable_cb(sq->vq); - free_old_xmit(sq, true); + free_old_xmit(sq, txq, !!budget); if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) { if (netif_tx_queue_stopped(txq)) { @@ -2523,7 +2969,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) return 0; } -static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) +static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan) { struct virtio_net_hdr_mrg_rxbuf *hdr; const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest; @@ -2567,7 +3013,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) return num_sg; num_sg++; } - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC); + return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, + skb_to_ptr(skb, orphan), GFP_ATOMIC); } static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) @@ -2577,24 +3024,25 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) struct send_queue *sq = &vi->sq[qnum]; int err; struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum); - bool kick = !netdev_xmit_more(); + bool xmit_more = netdev_xmit_more(); bool use_napi = sq->napi.weight; + bool kick; /* Free up any pending old buffers before queueing new ones. */ do { if (use_napi) virtqueue_disable_cb(sq->vq); - free_old_xmit(sq, false); + free_old_xmit(sq, txq, false); - } while (use_napi && kick && + } while (use_napi && !xmit_more && unlikely(!virtqueue_enable_cb_delayed(sq->vq))); /* timestamp packet in software */ skb_tx_timestamp(skb); /* Try to transmit */ - err = xmit_skb(sq, skb); + err = xmit_skb(sq, skb, !use_napi); /* This should not happen! */ if (unlikely(err)) { @@ -2616,7 +3064,9 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) check_sq_full_and_disable(vi, dev, sq); - if (kick || netif_xmit_stopped(txq)) { + kick = use_napi ? __netdev_tx_sent_queue(txq, skb->len, xmit_more) : + !xmit_more || netif_xmit_stopped(txq); + if (kick) { if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) { u64_stats_update_begin(&sq->stats.syncp); u64_stats_inc(&sq->stats.kicks); @@ -2627,37 +3077,49 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) return NETDEV_TX_OK; } -static int virtnet_rx_resize(struct virtnet_info *vi, - struct receive_queue *rq, u32 ring_num) +static void virtnet_rx_pause(struct virtnet_info *vi, struct receive_queue *rq) { bool running = netif_running(vi->dev); - int err, qindex; - - qindex = rq - vi->rq; if (running) { napi_disable(&rq->napi); - cancel_work_sync(&rq->dim.work); + virtnet_cancel_dim(vi, &rq->dim); } +} - err = virtqueue_resize(rq->vq, ring_num, virtnet_rq_unmap_free_buf); - if (err) - netdev_err(vi->dev, "resize rx fail: rx queue index: %d err: %d\n", qindex, err); +static void virtnet_rx_resume(struct virtnet_info *vi, struct receive_queue *rq) +{ + bool running = netif_running(vi->dev); if (!try_fill_recv(vi, rq, GFP_KERNEL)) schedule_delayed_work(&vi->refill, 0); if (running) virtnet_napi_enable(rq->vq, &rq->napi); +} + +static int virtnet_rx_resize(struct virtnet_info *vi, + struct receive_queue *rq, u32 ring_num) +{ + int err, qindex; + + qindex = rq - vi->rq; + + virtnet_rx_pause(vi, rq); + + err = virtqueue_resize(rq->vq, ring_num, virtnet_rq_unmap_free_buf); + if (err) + netdev_err(vi->dev, "resize rx fail: rx queue index: %d err: %d\n", qindex, err); + + virtnet_rx_resume(vi, rq); return err; } -static int virtnet_tx_resize(struct virtnet_info *vi, - struct send_queue *sq, u32 ring_num) +static void virtnet_tx_pause(struct virtnet_info *vi, struct send_queue *sq) { bool running = netif_running(vi->dev); struct netdev_queue *txq; - int err, qindex; + int qindex; qindex = sq - vi->sq; @@ -2678,10 +3140,17 @@ static int virtnet_tx_resize(struct virtnet_info *vi, netif_stop_subqueue(vi->dev, qindex); __netif_tx_unlock_bh(txq); +} - err = virtqueue_resize(sq->vq, ring_num, virtnet_sq_free_unused_buf); - if (err) - netdev_err(vi->dev, "resize tx fail: tx queue index: %d err: %d\n", qindex, err); +static void virtnet_tx_resume(struct virtnet_info *vi, struct send_queue *sq) +{ + bool running = netif_running(vi->dev); + struct netdev_queue *txq; + int qindex; + + qindex = sq - vi->sq; + + txq = netdev_get_tx_queue(vi->dev, qindex); __netif_tx_lock_bh(txq); sq->reset = false; @@ -2690,6 +3159,23 @@ static int virtnet_tx_resize(struct virtnet_info *vi, if (running) virtnet_napi_tx_enable(vi, sq->vq, &sq->napi); +} + +static int virtnet_tx_resize(struct virtnet_info *vi, struct send_queue *sq, + u32 ring_num) +{ + int qindex, err; + + qindex = sq - vi->sq; + + virtnet_tx_pause(vi, sq); + + err = virtqueue_resize(sq->vq, ring_num, virtnet_sq_free_unused_buf); + if (err) + netdev_err(vi->dev, "resize tx fail: tx queue index: %d err: %d\n", qindex, err); + + virtnet_tx_resume(vi, sq); + return err; } @@ -2898,7 +3384,7 @@ static int virtnet_close(struct net_device *dev) for (i = 0; i < vi->max_queue_pairs; i++) { virtnet_disable_queue_pair(vi, i); - cancel_work_sync(&vi->rq[i].dim.work); + virtnet_cancel_dim(vi, &vi->rq[i].dim); } return 0; @@ -4424,7 +4910,7 @@ static void virtnet_rx_dim_work(struct work_struct *work) if (!rq->dim_enabled) goto out; - update_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix); + update_moder = net_dim_get_rx_irq_moder(dev, dim); if (update_moder.usec != rq->intr_coal.max_usecs || update_moder.pkts != rq->intr_coal.max_packets) { err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum, @@ -4927,10 +5413,144 @@ static int virtnet_restore_guest_offloads(struct virtnet_info *vi) return virtnet_set_guest_offloads(vi, offloads); } +static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct receive_queue *rq, + struct xsk_buff_pool *pool) +{ + int err, qindex; + + qindex = rq - vi->rq; + + if (pool) { + err = xdp_rxq_info_reg(&rq->xsk_rxq_info, vi->dev, qindex, rq->napi.napi_id); + if (err < 0) + return err; + + err = xdp_rxq_info_reg_mem_model(&rq->xsk_rxq_info, + MEM_TYPE_XSK_BUFF_POOL, NULL); + if (err < 0) + goto unreg; + + xsk_pool_set_rxq_info(pool, &rq->xsk_rxq_info); + } + + virtnet_rx_pause(vi, rq); + + err = virtqueue_reset(rq->vq, virtnet_rq_unmap_free_buf); + if (err) { + netdev_err(vi->dev, "reset rx fail: rx queue index: %d err: %d\n", qindex, err); + + pool = NULL; + } + + rq->xsk_pool = pool; + + virtnet_rx_resume(vi, rq); + + if (pool) + return 0; + +unreg: + xdp_rxq_info_unreg(&rq->xsk_rxq_info); + return err; +} + +static int virtnet_xsk_pool_enable(struct net_device *dev, + struct xsk_buff_pool *pool, + u16 qid) +{ + struct virtnet_info *vi = netdev_priv(dev); + struct receive_queue *rq; + struct device *dma_dev; + struct send_queue *sq; + int err, size; + + if (vi->hdr_len > xsk_pool_get_headroom(pool)) + return -EINVAL; + + /* In big_packets mode, xdp cannot work, so there is no need to + * initialize xsk of rq. + */ + if (vi->big_packets && !vi->mergeable_rx_bufs) + return -ENOENT; + + if (qid >= vi->curr_queue_pairs) + return -EINVAL; + + sq = &vi->sq[qid]; + rq = &vi->rq[qid]; + + /* xsk assumes that tx and rx must have the same dma device. The af-xdp + * may use one buffer to receive from the rx and reuse this buffer to + * send by the tx. So the dma dev of sq and rq must be the same one. + * + * But vq->dma_dev allows every vq has the respective dma dev. So I + * check the dma dev of vq and sq is the same dev. + */ + if (virtqueue_dma_dev(rq->vq) != virtqueue_dma_dev(sq->vq)) + return -EINVAL; + + dma_dev = virtqueue_dma_dev(rq->vq); + if (!dma_dev) + return -EINVAL; + + size = virtqueue_get_vring_size(rq->vq); + + rq->xsk_buffs = kvcalloc(size, sizeof(*rq->xsk_buffs), GFP_KERNEL); + if (!rq->xsk_buffs) + return -ENOMEM; + + err = xsk_pool_dma_map(pool, dma_dev, 0); + if (err) + goto err_xsk_map; + + err = virtnet_rq_bind_xsk_pool(vi, rq, pool); + if (err) + goto err_rq; + + return 0; + +err_rq: + xsk_pool_dma_unmap(pool, 0); +err_xsk_map: + return err; +} + +static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid) +{ + struct virtnet_info *vi = netdev_priv(dev); + struct xsk_buff_pool *pool; + struct receive_queue *rq; + int err; + + if (qid >= vi->curr_queue_pairs) + return -EINVAL; + + rq = &vi->rq[qid]; + + pool = rq->xsk_pool; + + err = virtnet_rq_bind_xsk_pool(vi, rq, NULL); + + xsk_pool_dma_unmap(pool, 0); + + kvfree(rq->xsk_buffs); + + return err; +} + +static int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp) +{ + if (xdp->xsk.pool) + return virtnet_xsk_pool_enable(dev, xdp->xsk.pool, + xdp->xsk.queue_id); + else + return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id); +} + static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog, struct netlink_ext_ack *extack) { - unsigned int room = SKB_DATA_ALIGN(VIRTIO_XDP_HEADROOM + + unsigned int room = SKB_DATA_ALIGN(XDP_PACKET_HEADROOM + sizeof(struct skb_shared_info)); unsigned int max_sz = PAGE_SIZE - room - ETH_HLEN; struct virtnet_info *vi = netdev_priv(dev); @@ -5052,6 +5672,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return virtnet_xdp_set(dev, xdp->prog, xdp->extack); + case XDP_SETUP_XSK_POOL: + return virtnet_xsk_pool_setup(dev, xdp); default: return -EINVAL; } @@ -5124,6 +5746,36 @@ static void virtnet_tx_timeout(struct net_device *dev, unsigned int txqueue) jiffies_to_usecs(jiffies - READ_ONCE(txq->trans_start))); } +static int virtnet_init_irq_moder(struct virtnet_info *vi) +{ + u8 profile_flags = 0, coal_flags = 0; + int ret, i; + + profile_flags |= DIM_PROFILE_RX; + coal_flags |= DIM_COALESCE_USEC | DIM_COALESCE_PKTS; + ret = net_dim_init_irq_moder(vi->dev, profile_flags, coal_flags, + DIM_CQ_PERIOD_MODE_START_FROM_EQE, + 0, virtnet_rx_dim_work, NULL); + + if (ret) + return ret; + + for (i = 0; i < vi->max_queue_pairs; i++) + net_dim_setting(vi->dev, &vi->rq[i].dim, false); + + return 0; +} + +static void virtnet_free_irq_moder(struct virtnet_info *vi) +{ + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL)) + return; + + rtnl_lock(); + net_dim_free_irq_moder(vi->dev); + rtnl_unlock(); +} + static const struct net_device_ops virtnet_netdev = { .ndo_open = virtnet_open, .ndo_stop = virtnet_close, @@ -5136,6 +5788,7 @@ static const struct net_device_ops virtnet_netdev = { .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid, .ndo_bpf = virtnet_xdp, .ndo_xdp_xmit = virtnet_xdp_xmit, + .ndo_xsk_wakeup = virtnet_xsk_wakeup, .ndo_features_check = passthru_features_check, .ndo_get_phys_port_name = virtnet_get_phys_port_name, .ndo_set_features = virtnet_set_features, @@ -5403,9 +6056,6 @@ static int virtnet_alloc_queues(struct virtnet_info *vi) virtnet_poll_tx, napi_tx ? napi_weight : 0); - INIT_WORK(&vi->rq[i].dim.work, virtnet_rx_dim_work); - vi->rq[i].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE; - sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg)); ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len); sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg)); @@ -5834,6 +6484,10 @@ static int virtnet_probe(struct virtio_device *vdev) for (i = 0; i < vi->max_queue_pairs; i++) if (vi->sq[i].napi.weight) vi->sq[i].intr_coal.max_packets = 1; + + err = virtnet_init_irq_moder(vi); + if (err) + goto free; } #ifdef CONFIG_SYSFS @@ -5985,6 +6639,8 @@ static void virtnet_remove(struct virtio_device *vdev) disable_rx_mode_work(vi); flush_work(&vi->rx_mode_work); + virtnet_free_irq_moder(vi); + unregister_netdev(vi->dev); net_failover_destroy(vi->failover); |