summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox/mlx5/core
AgeCommit message (Collapse)Author
2024-07-15Merge tag 'aux-sysfs-irqs' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== aux-sysfs-irqs Shay Says: ========== Introduce auxiliary bus IRQs sysfs Today, PCI PFs and VFs, which are anchored on the PCI bus, display their IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files. PCI subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored on the auxiliary bus. However, these PCI SFs lack such IRQ information on the auxiliary bus, leaving users without visibility into which IRQs are used by the SFs. This absence makes it impossible to debug situations and to understand the source of interrupts/SFs for performance tuning and debug. Additionally, the SFs are multifunctional devices supporting RDMA, network devices, clocks, and more, similar to their peer PCI PFs and VFs. Therefore, it is desirable to have SFs' IRQ information available at the bus/device level. To overcome the above limitations, this short series extends the auxiliary bus to display IRQ information in sysfs, similar to that of PFs and VFs. It adds an 'irqs' directory under the auxiliary device and includes an <irq_num> sysfs file within it. For example: $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/ 50 51 52 53 54 55 56 57 58 Patch summary: patch-1 adds auxiliary bus to support irqs used by auxiliary device patch-2 mlx5 driver using exposing irqs for PCI SF devices via auxiliary bus ========== * tag 'aux-sysfs-irqs' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Expose SFs IRQs driver core: auxiliary bus: show auxiliary device IRQs RDMA/mlx5: Add Qcounters req_transport_retries_exceeded/req_rnr_retries_exceeded net/mlx5: Reimplement write combining test ==================== Link: https://patch.msgid.link/20240711213140.256997-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15Merge branch 'net-make-timestamping-selectable'Jakub Kicinski
First part of "net: Make timestamping selectable" from Kory Maincent. Change the driver-facing type already to lower rebasing pain. Link: https://lore.kernel.org/20240709-feature_ptp_netnext-v17-0-b5317f50df2a@bootlin.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15net: Add struct kernel_ethtool_ts_infoKory Maincent
In prevision to add new UAPI for hwtstamp we will be limited to the struct ethtool_ts_info that is currently passed in fixed binary format through the ETHTOOL_GET_TS_INFO ethtool ioctl. It would be good if new kernel code already started operating on an extensible kernel variant of that structure, similar in concept to struct kernel_hwtstamp_config vs struct hwtstamp_config. Since struct ethtool_ts_info is in include/uapi/linux/ethtool.h, here we introduce the kernel-only structure in include/linux/ethtool.h. The manual copy is then made in the function called by ETHTOOL_GET_TS_INFO. Acked-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20240709-feature_ptp_netnext-v17-6-b5317f50df2a@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-13eth: mlx5: expose NETIF_F_NTUPLE when ARFS is compiled outJakub Kicinski
ARFS depends on NTUPLE filters, but the inverse is not true. Drivers which don't support ARFS commonly still support NTUPLE filtering. mlx5 has a Kconfig option to disable ARFS (MLX5_EN_ARFS) and does not advertise NTUPLE filters as a feature at all when ARFS is compiled out. That's not correct, ntuple filters indeed still work just fine (as long as MLX5_EN_RXNFC is enabled). This is needed to make the RSS test not skip all RSS context related testing. Acked-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://patch.msgid.link/20240711223722.297676-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-13net/mlx5: Use set number of max EQsDaniel Jurgens
If a maximum number of EQs has been set for an SF, use that amount. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://patch.msgid.link/20240712003310.355106-5-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-13net/mlx5: Set default max eqs for SFsDaniel Jurgens
If the user hasn't configured max_io_eqs set a low default. The SF driver shouldn't try to create more than this, but FW will enforce this limit. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://patch.msgid.link/20240712003310.355106-4-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-13net/mlx5: Set sf_eq_usage for SF max EQsDaniel Jurgens
When setting max_io_eqs for an SF function also set the sf_eq_usage_cap. This is to indicate to the SF driver from the PF that the user has set the max io eqs via devlink. So the SF driver can later query the proper max eq value from the new cap. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://patch.msgid.link/20240712003310.355106-3-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11net/mlx5: Expose SFs IRQsShay Drory
Expose the sysfs files for the IRQs that the mlx5 PCI SFs are using. These entries are similar to PCI PFs and VFs in 'msi_irqs' directory. Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> --- v8-v9: - add Przemek RB v6->v7: - remove not needed changes to mlx5 sfnum SF sysfs v5->v6: - fail IRQ creation in case auxiliary_device_sysfs_irq_add() failed (Parav and Przemek) v2->v3: - fix mlx5 sfnum SF sysfs
2024-07-09net/mlx5e: CT: Initialize err to 0 to avoid warningCosmin Ratiu
It is theoretically possible to return bogus uninitialized values from mlx5_tc_ct_entry_replace_rules, even though in practice this will never be the case as the flow rule will be part of at least the regular ct table or the ct nat table, if not both. But to reduce noise, initialize err to 0. Fixes: 49d37d05f216 ("net/mlx5: CT: Separate CT and CT-NAT tuple entries") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240708080025.1593555-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-09net/mlx5e: SHAMPO, Add missing aggregate counterDragos Tatulea
When the rx_hds_nodata_packets/bytes counters were added, the aggregate counters were omitted. This patch adds them. Fixes: e95c5b9e8912 ("net/mlx5e: SHAMPO, Add header-only ethtool counters for header data split") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240708080025.1593555-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-09net/mlx5: DR, Remove definer functions from SW Steering APIYevgeny Kliteynik
No need to expose definer get/put functions as part of SW Steering API - they are internal functions. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240708080025.1593555-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h 219343755eae ("net: phy: aquantia: add missing include guards") 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx") b501d261a5b3 ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/ include/linux/mlx5/mlx5_ifc.h 048a403648fc ("net/mlx5: IFC updates for changing max EQs") 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c 4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") 3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h 816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") 5a009b42e041 ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-28net/mlx5e: Approximate IPsec per-SA payload data bytes countLeon Romanovsky
ConnectX devices lack ability to count payload data byte size which is needed for SA to return to libreswan for rekeying. As a solution let's approximate that by decreasing headers size from total size counted by flow steering. The calculation doesn't take into account any other headers which can be in the packet (e.g. IP extensions). Fixes: 5a6cddb89b51 ("net/mlx5e: Update IPsec per SA packets/bytes count") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-28net/mlx5e: Present succeeded IPsec SA bytes and packetLeon Romanovsky
IPsec SA statistics presents successfully decrypted and encrypted packet and bytes, and not total handled by this SA. So update the calculation logic to take into account failures. Fixes: 6fb7f9408779 ("net/mlx5e: Connect mlx5 IPsec statistics with XFRM core") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-28net/mlx5e: Add mqprio_rl cleanup and free in mlx5e_priv_cleanup()Jianbo Liu
In the cited commit, mqprio_rl cleanup and free are mistakenly removed in mlx5e_priv_cleanup(), and it causes the leakage of host memory and firmware SCHEDULING_ELEMENT objects while changing eswitch mode. So, add them back. Fixes: 0bb7228f7096 ("net/mlx5e: Fix mqprio_rl handling on devlink reload") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-28net/mlx5: E-switch, Create ingress ACL when neededChris Mi
Currently, ingress acl is used for three features. It is created only when vport metadata match and prio tag are enabled. But active-backup lag mode also uses it. It is independent of vport metadata match and prio tag. And vport metadata match can be disabled using the following devlink command: # devlink dev param set pci/0000:08:00.0 name esw_port_metadata \ value false cmode runtime If ingress acl is not created, will hit panic when creating drop rule for active-backup lag mode. If always create it, there will be about 5% performance degradation. Fix it by creating ingress acl when needed. If esw_port_metadata is true, ingress acl exists, then create drop rule using existing ingress acl. If esw_port_metadata is false, create ingress acl and then create drop rule. Fixes: 1749c4c51c16 ("net/mlx5: E-switch, add drop rule support to ingress ACL") Signed-off-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-28net/mlx5: Use max_num_eqs_24b when setting max_io_eqsDaniel Jurgens
Due a bug in the device max_num_eqs doesn't always reflect a written value. As a result, setting max_io_eqs may not work but appear successful. Instead write max_num_eqs_24b, which reflects correct value. Fixes: 93197c7c509d ("mlx5/core: Support max_io_eqs for a function") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-28net/mlx5: Use max_num_eqs_24b capability if setDaniel Jurgens
A new capability with more bits is added. If it's set use that value as the maximum number of EQs available. This cap is also writable by the vhca_resource_manager to allow limiting the number of EQs available to SFs and VFs. Fixes: 93197c7c509d ("mlx5/core: Support max_io_eqs for a function") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17net/mlx5e: Add per queue netdev-genl statsJoe Damato
./cli.py --spec netlink/specs/netdev.yaml \ --dump qstats-get --json '{"scope": "queue"}' ...snip {'ifindex': 7, 'queue-id': 62, 'queue-type': 'rx', 'rx-alloc-fail': 0, 'rx-bytes': 105965251, 'rx-packets': 179790}, {'ifindex': 7, 'queue-id': 0, 'queue-type': 'tx', 'tx-bytes': 9402665, 'tx-packets': 17551}, ...snip Also tested with the script tools/testing/selftests/drivers/net/stats.py in several scenarios to ensure stats tallying was correct: - on boot (default queue counts) - adjusting queue count up or down (ethtool -L eth0 combined ...) The tools/testing/selftests/drivers/net/stats.py brings the device up, so to test with the device down, I did the following: $ ip link show eth4 7: eth4: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN [..snip..] [..snip..] $ cat /proc/net/dev | grep eth4 eth4: 235710489 434811 [..snip rx..] 2878744 21227 [..snip tx..] $ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \ --dump qstats-get --json '{"ifindex": 7}' [{'ifindex': 7, 'rx-alloc-fail': 0, 'rx-bytes': 235710489, 'rx-packets': 434811, 'tx-bytes': 2878744, 'tx-packets': 21227}] Compare the values in /proc/net/dev match the output of cli for the same device, even while the device is down. Note that while the device is down, per queue stats output nothing (because the device is down there are no queues): $ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \ --dump qstats-get --json '{"scope": "queue", "ifindex": 7}' [] Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17net/mlx5e: Add txq to sq stats mappingJoe Damato
mlx5 currently maps txqs to an sq via priv->txq2sq. It is useful to map txqs to sq_stats, as well, for direct access to stats. Add priv->txq2sq_stats and insert mappings. The mappings will be used next to tabulate stats information. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-16net/mlx5: Reimplement write combining testJianbo Liu
The test of write combining was added before in mlx5_ib driver. It opens UD QP and posts NOP WQEs, and uses BlueFlame doorbell. When BlueFlame is used, WQEs get written directly to a PCI BAR of the device (in addition to memory) so that the device handles them without having to access memory. In this test, the WQEs written in memory are different from the ones written to the BlueFlame which request CQE update. By checking the completion reports posted on CQ, we can know if BlueFlame succeeds or not. The write combining must be supported if BlueFlame succeeds as its register is written using write combining. This patch reimplements the test in the same way, but using a pair of SQ and CQ only. It is moved to mlx5_core as a general feature used by both mlx5_core and mlx5_ib. Besides, save write combine test result of the PCI function, so that its thousands of child functions such as SF can query without paying the time and resource penalty by itself. The test function is called only after failing to get the cached result. With this enhancement, all thousands of SFs of the PF attached to same driver no longer need to perform WC check explicitly, which is already done in the system. This saves several commands per SF, thereby speeds up SF creation and also saves completion EQ creation. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/4ff5a8cc4c5b5b0d98397baa45a5019bcdbf096e.1717409369.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-14net/mlx5e: Support SWP-mode offload L4 csum calculationRahul Rameshbabu
Calculate the pseudo-header checksum for both IPSec transport mode and IPSec tunnel mode for mlx5 devices that do not implement a pure hardware checksum offload for L4 checksum calculation. Introduce a capability bit that identifies such mlx5 devices. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240613210036.1125203-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14net/mlx5e: Use tcp_v[46]_check checksum helpersGal Pressman
Use the tcp specific helpers to calculate the tcp pseudo header checksum instead of the csum_*_magic ones. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240613210036.1125203-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14net/mlx5e: Fix outdated comment in features checkGal Pressman
The code no longer treats only UDP tunnels, adjust the outdated comment. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240613210036.1125203-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14net/mlx5: Replace strcpy with strscpyMoshe Shemesh
Replace deprecated strcpy with strscpy. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240613210036.1125203-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14net/mlx5: CT: Separate CT and CT-NAT tuple entriesChris Mi
Currently a ct entry is stored in both ct and ct-nat tables. ct action is directed to the ct table, while ct nat action is directed to the nat table. ct-nat entries perform the nat header rewrites, if required. The current design assures that a ct action will match in hardware even if the tuple has nat configured, it will just not execute it. However, storing each connection in two tables increases the system's memory consumption while reducing its insertion rate. Offload a connection to either ct or the ct-nat table. Add a miss fall-through rule from ct-nat table to the ct table allowing ct(nat) action on non-natted connections. ct action on natted connections, by default, will be handled by the software miss path. Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240613210036.1125203-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14net/mlx5: Correct TASR typo into TSARCosmin Ratiu
TSAR is the correct spelling (Transmit Scheduling ARbiter). Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240613210036.1125203-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts, no adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12net/mlx5e: flower: validate encapsulation control flagsAsbjørn Sloth Tønnesen
Encapsulation control flags are currently not used anywhere, so all flags are currently unsupported by all drivers. This patch adds validation of this assumption, so that encapsulation flags may be used in the future. In case any encapsulation control flags are masked, flow_rule_match_has_enc_control_flags() sets a NL extended error message, and we return -EOPNOTSUPP. Only compile tested. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/20240609173358.193178-4-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10net/mlx5e: Fix features validation check for tunneled UDP (non-VXLAN) packetsGal Pressman
Move the vxlan_features_check() call to after we verified the packet is a tunneled VXLAN packet. Without this, tunneled UDP non-VXLAN packets (for ex. GENENVE) might wrongly not get offloaded. In some cases, it worked by chance as GENEVE header is the same size as VXLAN, but it is obviously incorrect. Fixes: e3cfc7e6b7bd ("net/mlx5e: TX, Add geneve tunnel stateless offload support") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/pensando/ionic/ionic_txrx.c d9c04209990b ("ionic: Mark error paths in the data path as unlikely") 491aee894a08 ("ionic: fix kernel panic in XDP_TX action") net/ipv6/ip6_fib.c b4cb4a1391dc ("net: use unrcu_pointer() helper") b01e1c030770 ("ipv6: fix possible race in __fib6_drop_pcpu_from()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Coalesce skb fragments to page sizeDragos Tatulea
When doing hardware GRO (SHAMPO), the driver puts each data payload of a packet from the wire into one skb fragment. TCP Zero-Copy expects page sized skb fragments to be able to do it's page-flipping magic. With the current way of arranging fragments by the driver, only specific MTUs (page sized multiple + header size) will yield such page sized fragments in a high percentage. This change improves payload arrangement in the skb for hardware GRO by coalescing payloads into a single skb fragment when possible. To demonstrate the fix, running tcp_mmap with a MTU of 1500 yields: - Before: 0 % bytes mmap'ed - After : 81 % bytes mmap'ed More importantly, coalescing considerably improves the HW GRO performance. Here are the results for a iperf3 bandwidth benchmark: +---------+--------+--------+------------------------+-----------+ | streams | SW GRO | HW GRO | HW GRO with coalescing | Unit | |---------+--------+--------+------------------------+-----------| | 1 | 36 | 42 | 57 | Gbits/sec | | 4 | 34 | 39 | 50 | Gbits/sec | | 8 | 31 | 35 | 43 | Gbits/sec | +---------+--------+--------+------------------------+-----------+ Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-15-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Re-enable HW-GROYoray Zack
Add back HW-GRO to the reported features. As the current implementation of HW-GRO uses KSMs with a specific fixed buffer size (256B) to map its headers buffer, we reported the feature only if the NIC is supporting KSM and the minimum value for buffer size is below the requested one. iperf3 bandwidth comparison: +---------+--------+--------+-----------+ | streams | SW GRO | HW GRO | Unit | |---------+--------+--------+-----------| | 1 | 36 | 42 | Gbits/sec | | 4 | 34 | 39 | Gbits/sec | | 8 | 31 | 35 | Gbits/sec | +---------+--------+--------+-----------+ A downstream patch will add skb fragment coalescing which will improve performance considerably. Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-14-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Use KSMs instead of KLMsYoray Zack
KSM Mkey is KLM Mkey with a fixed buffer size. Due to this fact, it is a faster mechanism than KLM. SHAMPO feature used KLMs Mkeys for memory mappings of its headers buffer. As it used KLMs with the same buffer size for each entry, we can use KSMs instead. This commit changes the Mkeys that map the SHAMPO headers buffer from KLMs to KSMs. Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-13-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Add header-only ethtool counters for header data splitTariq Toukan
Count the number of header-only packets and bytes from SHAMPO. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-12-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Drop rx_gro_match_packets counterDragos Tatulea
After modifying rx_gro_packets to be more accurate, the rx_gro_match_packets counter is redundant. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Make GRO counters more preciseDragos Tatulea
Don't count non GRO packets. A non GRO packet is a packet with a GRO cb count of 1. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Skipping on duplicate flush of the same SHAMPO SKBYoray Zack
SHAMPO SKB can be flushed in mlx5e_shampo_complete_rx_cqe(). If the SKB was flushed, rq->hw_gro_data->skb was also set to NULL. We can skip on flushing the SKB in mlx5e_shampo_flush_skb if rq->hw_gro_data->skb == NULL. Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Specialize mlx5e_fill_skb_data()Dragos Tatulea
mlx5e_fill_skb_data() used to have multiple callers. But after the XDP multibuf refactoring from commit 2cb0e27d43b4 ("net/mlx5e: RX, Prepare non-linear striding RQ for XDP multi-buffer support") the SHAMPO code path is the only caller. Take advantage of this and specialize the function: - Drop the redundant check. - Assume that data_bcnt is > 0. This is needed in a downstream patch. Rename the function as well to make things clear. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Suggested-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Simplify header page release in teardownDragos Tatulea
The function that releases SHAMPO header pages (mlx5e_shampo_dealloc_hd) has some complicated logic that comes from the fact that it is called twice during teardown: 1) To release the posted header pages that didn't get any completions. 2) To release all remaining header pages. This flow is not necessary: all header pages can be released from the driver side in one go. Furthermore, the above flow is buggy. Taking the 8 headers per page example: 1) Release fragments 5-7. Page will be released. 2) Release remaining fragments 0-4. The bits in the header will indicate that the page needs releasing. But this is incorrect: page was released in step 1. This patch releases all header pages in one go. This simplifies the header page cleanup function. For consistency, the datapath header page release API (mlx5e_free_rx_shampo_hd_entry()) is used. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Disable gso_size for non GRO packetsDragos Tatulea
When HW GRO is enabled, forwarding of packets is broken due to gso_size being set incorrectly on non GRO packets. Non GRO packets have a skb GRO count of 1. mlx5 always sets gso_size on the skb, even for non GRO packets. It leans on the fact that gso_size is normally reset in napi_gro_complete(). But this happens only for packets from GRO'able protocols (TCP/UDP) that have a gro_receive() handler. The problematic scenarios are: 1) Non GRO protocol packets are received, validate_xmit_skb() will drop them (see EPROTONOSUPPORT in skb_mac_gso_segment()). The fix for this case would be to not set gso_size at all for SHAMPO packets with header size 0. 2) Packets from a GRO'ed protocol (TCP) are received but immediately flushed because they are not GRO'able (TCP SYN for example). mlx5e_shampo_update_hdr(), which updates the remaining GRO state on the skb, is not called because skb GRO count is 1. The fix here would be to always call mlx5e_shampo_update_hdr(), regardless of skb GRO count. But this call is expensive The unified fix for both cases is to reset gso_size before calling napi_gro_receive(). It is a change that is more effective (no call to mlx5e_shampo_update_hdr() necessary) and simple (smallest code footprint). Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Fix FCS config when HW GRO onDragos Tatulea
For the following scenario: ethtool --features eth3 rx-gro-hw on ethtool --features eth3 rx-fcs on ethtool --features eth3 rx-fcs off ... there is a firmware error because the driver enables HW GRO first while FCS is still enabled. This patch fixes this by swapping the order of HW GRO and FCS for this specific case. Take LRO into consideration as well for consistency. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Fix invalid WQ linked list unlinkDragos Tatulea
When all the strides in a WQE have been consumed, the WQE is unlinked from the WQ linked list (mlx5_wq_ll_pop()). For SHAMPO, it is possible to receive CQEs with 0 consumed strides for the same WQE even after the WQE is fully consumed and unlinked. This triggers an additional unlink for the same wqe which corrupts the linked list. Fix this scenario by accepting 0 sized consumed strides without unlinking the WQE again. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Fix incorrect page releaseDragos Tatulea
Under the following conditions: 1) No skb created yet 2) header_size == 0 (no SHAMPO header) 3) header_index + 1 % MLX5E_SHAMPO_WQ_HEADER_PER_PAGE == 0 (this is the last page fragment of a SHAMPO header page) a new skb is formed with a page that is NOT a SHAMPO header page (it is a regular data page). Further down in the same function (mlx5e_handle_rx_cqe_mpwrq_shampo()), a SHAMPO header page from header_index is released. This is wrong and it leads to SHAMPO header pages being released more than once. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Use net_prefetch APITariq Toukan
Let the SHAMPO functions use the net-specific prefetch API, similar to all other usages. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5: Fix tainted pointer delete is case of flow rules creation failAleksandr Mishin
In case of flow rule creation fail in mlx5_lag_create_port_sel_table(), instead of previously created rules, the tainted pointer is deleted deveral times. Fix this bug by using correct flow rules pointers. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 352899f384d4 ("net/mlx5: Lag, use buckets in hash mode") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240604100552.25201-1-amishin@t-argos.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5: Always stop health timer during driver removalShay Drory
Currently, if teardown_hca fails to execute during driver removal, mlx5 does not stop the health timer. Afterwards, mlx5 continue with driver teardown. This may lead to a UAF bug, which results in page fault Oops[1], since the health timer invokes after resources were freed. Hence, stop the health monitor even if teardown_hca fails. [1] mlx5_core 0000:18:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: cleanup mlx5_core 0000:18:00.0: wait_func:1155:(pid 1967079): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource mlx5_core 0000:18:00.0: mlx5_function_close:1288:(pid 1967079): tear_down_hca failed, skip cleanup BUG: unable to handle page fault for address: ffffa26487064230 PGD 100c00067 P4D 100c00067 PUD 100e5a067 PMD 105ed7067 PTE 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE ------- --- 6.7.0-68.fc38.x86_64 #1 Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0013.121520200651 12/15/2020 RIP: 0010:ioread32be+0x34/0x60 RSP: 0018:ffffa26480003e58 EFLAGS: 00010292 RAX: ffffa26487064200 RBX: ffff9042d08161a0 RCX: ffff904c108222c0 RDX: 000000010bbf1b80 RSI: ffffffffc055ddb0 RDI: ffffa26487064230 RBP: ffff9042d08161a0 R08: 0000000000000022 R09: ffff904c108222e8 R10: 0000000000000004 R11: 0000000000000441 R12: ffffffffc055ddb0 R13: ffffa26487064200 R14: ffffa26480003f00 R15: ffff904c108222c0 FS: 0000000000000000(0000) GS:ffff904c10800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffa26487064230 CR3: 00000002c4420006 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? __die+0x23/0x70 ? page_fault_oops+0x171/0x4e0 ? exc_page_fault+0x175/0x180 ? asm_exc_page_fault+0x26/0x30 ? __pfx_poll_health+0x10/0x10 [mlx5_core] ? __pfx_poll_health+0x10/0x10 [mlx5_core] ? ioread32be+0x34/0x60 mlx5_health_check_fatal_sensors+0x20/0x100 [mlx5_core] ? __pfx_poll_health+0x10/0x10 [mlx5_core] poll_health+0x42/0x230 [mlx5_core] ? __next_timer_interrupt+0xbc/0x110 ? __pfx_poll_health+0x10/0x10 [mlx5_core] call_timer_fn+0x21/0x130 ? __pfx_poll_health+0x10/0x10 [mlx5_core] __run_timers+0x222/0x2c0 run_timer_softirq+0x1d/0x40 __do_softirq+0xc9/0x2c8 __irq_exit_rcu+0xa6/0xc0 sysvec_apic_timer_interrupt+0x72/0x90 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:cpuidle_enter_state+0xcc/0x440 ? cpuidle_enter_state+0xbd/0x440 cpuidle_enter+0x2d/0x40 do_idle+0x20d/0x270 cpu_startup_entry+0x2a/0x30 rest_init+0xd0/0xd0 arch_call_rest_init+0xe/0x30 start_kernel+0x709/0xa90 x86_64_start_reservations+0x18/0x30 x86_64_start_kernel+0x96/0xa0 secondary_startup_64_no_verify+0x18f/0x19b ---[ end trace 0000000000000000 ]--- Fixes: 9b98d395b85d ("net/mlx5: Start health poll at earlier stage of driver load") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net/mlx5: Stop waiting for PCI if pci channel is offlineMoshe Shemesh
In case pci channel becomes offline the driver should not wait for PCI reads during health dump and recovery flow. The driver has timeout for each of these loops trying to read PCI, so it would fail anyway. However, in case of recovery waiting till timeout may cause the pci error_detected() callback fail to meet pci_dpc_recovered() wait timeout. Fixes: b3bd076f7501 ("net/mlx5: Report devlink health on FW fatal issues") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5e: Fix UDP GSO for encapsulated packetsGal Pressman
When the skb is encapsulated, adjust the inner UDP header instead of the outer one, and account for UDP header (instead of TCP) in the inline header size calculation. Fixes: 689adf0d4892 ("net/mlx5e: Add UDP GSO support") Reported-by: Jason Baron <jbaron@akamai.com> Closes: https://lore.kernel.org/netdev/c42961cb-50b9-4a9a-bd43-87fe48d88d29@akamai.com/ Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5e: Use rx_missed_errors instead of rx_dropped for reporting buffer ↵Carolina Jubran
exhaustion Previously, the driver incorrectly used rx_dropped to report device buffer exhaustion. According to the documentation, rx_dropped should not be used to count packets dropped due to buffer exhaustion, which is the purpose of rx_missed_errors. Use rx_missed_errors as intended for counting packets dropped due to buffer exhaustion. Fixes: 269e6b3af3bf ("net/mlx5e: Report additional error statistics in get stats ndo") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>