Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
aux-sysfs-irqs
Shay Says:
==========
Introduce auxiliary bus IRQs sysfs
Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files. PCI
subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
on the auxiliary bus. However, these PCI SFs lack such IRQ information
on the auxiliary bus, leaving users without visibility into which IRQs
are used by the SFs. This absence makes it impossible to debug
situations and to understand the source of interrupts/SFs for
performance tuning and debug.
Additionally, the SFs are multifunctional devices supporting RDMA,
network devices, clocks, and more, similar to their peer PCI PFs and
VFs. Therefore, it is desirable to have SFs' IRQ information available
at the bus/device level.
To overcome the above limitations, this short series extends the
auxiliary bus to display IRQ information in sysfs, similar to that of
PFs and VFs.
It adds an 'irqs' directory under the auxiliary device and includes an
<irq_num> sysfs file within it.
For example:
$ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
50 51 52 53 54 55 56 57 58
Patch summary:
patch-1 adds auxiliary bus to support irqs used by auxiliary device
patch-2 mlx5 driver using exposing irqs for PCI SF devices via auxiliary
bus
==========
* tag 'aux-sysfs-irqs' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Expose SFs IRQs
driver core: auxiliary bus: show auxiliary device IRQs
RDMA/mlx5: Add Qcounters req_transport_retries_exceeded/req_rnr_retries_exceeded
net/mlx5: Reimplement write combining test
====================
Link: https://patch.msgid.link/20240711213140.256997-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
First part of "net: Make timestamping selectable" from Kory Maincent.
Change the driver-facing type already to lower rebasing pain.
Link: https://lore.kernel.org/20240709-feature_ptp_netnext-v17-0-b5317f50df2a@bootlin.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In prevision to add new UAPI for hwtstamp we will be limited to the struct
ethtool_ts_info that is currently passed in fixed binary format through the
ETHTOOL_GET_TS_INFO ethtool ioctl. It would be good if new kernel code
already started operating on an extensible kernel variant of that
structure, similar in concept to struct kernel_hwtstamp_config vs struct
hwtstamp_config.
Since struct ethtool_ts_info is in include/uapi/linux/ethtool.h, here
we introduce the kernel-only structure in include/linux/ethtool.h.
The manual copy is then made in the function called by ETHTOOL_GET_TS_INFO.
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20240709-feature_ptp_netnext-v17-6-b5317f50df2a@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ARFS depends on NTUPLE filters, but the inverse is not true.
Drivers which don't support ARFS commonly still support NTUPLE
filtering. mlx5 has a Kconfig option to disable ARFS (MLX5_EN_ARFS)
and does not advertise NTUPLE filters as a feature at all when ARFS
is compiled out. That's not correct, ntuple filters indeed still work
just fine (as long as MLX5_EN_RXNFC is enabled).
This is needed to make the RSS test not skip all RSS context
related testing.
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://patch.msgid.link/20240711223722.297676-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If a maximum number of EQs has been set for an SF, use that amount.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://patch.msgid.link/20240712003310.355106-5-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If the user hasn't configured max_io_eqs set a low default. The SF
driver shouldn't try to create more than this, but FW will enforce this
limit.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://patch.msgid.link/20240712003310.355106-4-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When setting max_io_eqs for an SF function also set the sf_eq_usage_cap.
This is to indicate to the SF driver from the PF that the user has set
the max io eqs via devlink. So the SF driver can later query the proper
max eq value from the new cap.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://patch.msgid.link/20240712003310.355106-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Expose the sysfs files for the IRQs that the mlx5 PCI SFs are using.
These entries are similar to PCI PFs and VFs in 'msi_irqs' directory.
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
v8-v9:
- add Przemek RB
v6->v7:
- remove not needed changes to mlx5 sfnum SF sysfs
v5->v6:
- fail IRQ creation in case auxiliary_device_sysfs_irq_add() failed
(Parav and Przemek)
v2->v3:
- fix mlx5 sfnum SF sysfs
|
|
It is theoretically possible to return bogus uninitialized values from
mlx5_tc_ct_entry_replace_rules, even though in practice this will never
be the case as the flow rule will be part of at least the regular ct
table or the ct nat table, if not both.
But to reduce noise, initialize err to 0.
Fixes: 49d37d05f216 ("net/mlx5: CT: Separate CT and CT-NAT tuple entries")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240708080025.1593555-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the rx_hds_nodata_packets/bytes counters were added, the aggregate
counters were omitted. This patch adds them.
Fixes: e95c5b9e8912 ("net/mlx5e: SHAMPO, Add header-only ethtool counters for header data split")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240708080025.1593555-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
No need to expose definer get/put functions as part of
SW Steering API - they are internal functions.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240708080025.1593555-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/phy/aquantia/aquantia.h
219343755eae ("net: phy: aquantia: add missing include guards")
61578f679378 ("net: phy: aquantia: add support for PHY LEDs")
drivers/net/ethernet/wangxun/libwx/wx_hw.c
bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx")
b501d261a5b3 ("net: txgbe: add FDIR ATR support")
https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/
include/linux/mlx5/mlx5_ifc.h
048a403648fc ("net/mlx5: IFC updates for changing max EQs")
99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/
Adjacent changes:
drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c
4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference")
3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard")
include/net/mac80211.h
816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP")
5a009b42e041 ("wifi: mac80211: track changes in AP's TPE")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ConnectX devices lack ability to count payload data byte size which is
needed for SA to return to libreswan for rekeying.
As a solution let's approximate that by decreasing headers size from
total size counted by flow steering. The calculation doesn't take into
account any other headers which can be in the packet (e.g. IP extensions).
Fixes: 5a6cddb89b51 ("net/mlx5e: Update IPsec per SA packets/bytes count")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
IPsec SA statistics presents successfully decrypted and encrypted
packet and bytes, and not total handled by this SA. So update the
calculation logic to take into account failures.
Fixes: 6fb7f9408779 ("net/mlx5e: Connect mlx5 IPsec statistics with XFRM core")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the cited commit, mqprio_rl cleanup and free are mistakenly removed
in mlx5e_priv_cleanup(), and it causes the leakage of host memory and
firmware SCHEDULING_ELEMENT objects while changing eswitch mode. So,
add them back.
Fixes: 0bb7228f7096 ("net/mlx5e: Fix mqprio_rl handling on devlink reload")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, ingress acl is used for three features. It is created only
when vport metadata match and prio tag are enabled. But active-backup
lag mode also uses it. It is independent of vport metadata match and
prio tag. And vport metadata match can be disabled using the
following devlink command:
# devlink dev param set pci/0000:08:00.0 name esw_port_metadata \
value false cmode runtime
If ingress acl is not created, will hit panic when creating drop rule
for active-backup lag mode. If always create it, there will be about
5% performance degradation.
Fix it by creating ingress acl when needed. If esw_port_metadata is
true, ingress acl exists, then create drop rule using existing
ingress acl. If esw_port_metadata is false, create ingress acl and
then create drop rule.
Fixes: 1749c4c51c16 ("net/mlx5: E-switch, add drop rule support to ingress ACL")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Due a bug in the device max_num_eqs doesn't always reflect a written
value. As a result, setting max_io_eqs may not work but appear
successful. Instead write max_num_eqs_24b, which reflects correct
value.
Fixes: 93197c7c509d ("mlx5/core: Support max_io_eqs for a function")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A new capability with more bits is added. If it's set use that value as
the maximum number of EQs available.
This cap is also writable by the vhca_resource_manager to allow limiting
the number of EQs available to SFs and VFs.
Fixes: 93197c7c509d ("mlx5/core: Support max_io_eqs for a function")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
./cli.py --spec netlink/specs/netdev.yaml \
--dump qstats-get --json '{"scope": "queue"}'
...snip
{'ifindex': 7,
'queue-id': 62,
'queue-type': 'rx',
'rx-alloc-fail': 0,
'rx-bytes': 105965251,
'rx-packets': 179790},
{'ifindex': 7,
'queue-id': 0,
'queue-type': 'tx',
'tx-bytes': 9402665,
'tx-packets': 17551},
...snip
Also tested with the script tools/testing/selftests/drivers/net/stats.py
in several scenarios to ensure stats tallying was correct:
- on boot (default queue counts)
- adjusting queue count up or down (ethtool -L eth0 combined ...)
The tools/testing/selftests/drivers/net/stats.py brings the device up,
so to test with the device down, I did the following:
$ ip link show eth4
7: eth4: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN [..snip..]
[..snip..]
$ cat /proc/net/dev | grep eth4
eth4: 235710489 434811 [..snip rx..] 2878744 21227 [..snip tx..]
$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \
--dump qstats-get --json '{"ifindex": 7}'
[{'ifindex': 7,
'rx-alloc-fail': 0,
'rx-bytes': 235710489,
'rx-packets': 434811,
'tx-bytes': 2878744,
'tx-packets': 21227}]
Compare the values in /proc/net/dev match the output of cli for the same
device, even while the device is down.
Note that while the device is down, per queue stats output nothing
(because the device is down there are no queues):
$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \
--dump qstats-get --json '{"scope": "queue", "ifindex": 7}'
[]
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
mlx5 currently maps txqs to an sq via priv->txq2sq. It is useful to map
txqs to sq_stats, as well, for direct access to stats.
Add priv->txq2sq_stats and insert mappings. The mappings will be used
next to tabulate stats information.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The test of write combining was added before in mlx5_ib driver. It
opens UD QP and posts NOP WQEs, and uses BlueFlame doorbell. When
BlueFlame is used, WQEs get written directly to a PCI BAR of the
device (in addition to memory) so that the device handles them without
having to access memory.
In this test, the WQEs written in memory are different from the ones
written to the BlueFlame which request CQE update. By checking the
completion reports posted on CQ, we can know if BlueFlame succeeds or
not. The write combining must be supported if BlueFlame succeeds as
its register is written using write combining.
This patch reimplements the test in the same way, but using a pair of
SQ and CQ only. It is moved to mlx5_core as a general feature used by
both mlx5_core and mlx5_ib.
Besides, save write combine test result of the PCI function, so that
its thousands of child functions such as SF can query without paying
the time and resource penalty by itself. The test function is called
only after failing to get the cached result. With this enhancement,
all thousands of SFs of the PF attached to same driver no longer need
to perform WC check explicitly, which is already done in the system.
This saves several commands per SF, thereby speeds up SF creation and
also saves completion EQ creation.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/4ff5a8cc4c5b5b0d98397baa45a5019bcdbf096e.1717409369.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Calculate the pseudo-header checksum for both IPSec transport mode and
IPSec tunnel mode for mlx5 devices that do not implement a pure hardware
checksum offload for L4 checksum calculation. Introduce a capability bit
that identifies such mlx5 devices.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240613210036.1125203-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use the tcp specific helpers to calculate the tcp pseudo header checksum
instead of the csum_*_magic ones.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240613210036.1125203-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The code no longer treats only UDP tunnels, adjust the outdated comment.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240613210036.1125203-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace deprecated strcpy with strscpy.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240613210036.1125203-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently a ct entry is stored in both ct and ct-nat tables. ct
action is directed to the ct table, while ct nat action is directed
to the nat table. ct-nat entries perform the nat header rewrites,
if required. The current design assures that a ct action will match
in hardware even if the tuple has nat configured, it will just not
execute it. However, storing each connection in two tables increases
the system's memory consumption while reducing its insertion rate.
Offload a connection to either ct or the ct-nat table. Add a miss
fall-through rule from ct-nat table to the ct table allowing ct(nat)
action on non-natted connections.
ct action on natted connections, by default, will be handled by the
software miss path.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240613210036.1125203-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
TSAR is the correct spelling (Transmit Scheduling ARbiter).
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240613210036.1125203-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts, no adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Encapsulation control flags are currently not used anywhere,
so all flags are currently unsupported by all drivers.
This patch adds validation of this assumption, so that
encapsulation flags may be used in the future.
In case any encapsulation control flags are masked,
flow_rule_match_has_enc_control_flags() sets a NL extended
error message, and we return -EOPNOTSUPP.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/20240609173358.193178-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move the vxlan_features_check() call to after we verified the packet is
a tunneled VXLAN packet.
Without this, tunneled UDP non-VXLAN packets (for ex. GENENVE) might
wrongly not get offloaded.
In some cases, it worked by chance as GENEVE header is the same size as
VXLAN, but it is obviously incorrect.
Fixes: e3cfc7e6b7bd ("net/mlx5e: TX, Add geneve tunnel stateless offload support")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/pensando/ionic/ionic_txrx.c
d9c04209990b ("ionic: Mark error paths in the data path as unlikely")
491aee894a08 ("ionic: fix kernel panic in XDP_TX action")
net/ipv6/ip6_fib.c
b4cb4a1391dc ("net: use unrcu_pointer() helper")
b01e1c030770 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When doing hardware GRO (SHAMPO), the driver puts each data payload of a
packet from the wire into one skb fragment. TCP Zero-Copy expects page
sized skb fragments to be able to do it's page-flipping magic. With the
current way of arranging fragments by the driver, only specific MTUs
(page sized multiple + header size) will yield such page sized fragments
in a high percentage.
This change improves payload arrangement in the skb for hardware GRO by
coalescing payloads into a single skb fragment when possible.
To demonstrate the fix, running tcp_mmap with a MTU of 1500 yields:
- Before: 0 % bytes mmap'ed
- After : 81 % bytes mmap'ed
More importantly, coalescing considerably improves the HW GRO performance.
Here are the results for a iperf3 bandwidth benchmark:
+---------+--------+--------+------------------------+-----------+
| streams | SW GRO | HW GRO | HW GRO with coalescing | Unit |
|---------+--------+--------+------------------------+-----------|
| 1 | 36 | 42 | 57 | Gbits/sec |
| 4 | 34 | 39 | 50 | Gbits/sec |
| 8 | 31 | 35 | 43 | Gbits/sec |
+---------+--------+--------+------------------------+-----------+
Benchmark details:
VM based setup
CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores
NIC: ConnectX-7 100GbE
iperf3 and irq running on same CPU over a single receive queue
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-15-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add back HW-GRO to the reported features.
As the current implementation of HW-GRO uses KSMs with a
specific fixed buffer size (256B) to map its headers buffer,
we reported the feature only if the NIC is supporting KSM and
the minimum value for buffer size is below the requested one.
iperf3 bandwidth comparison:
+---------+--------+--------+-----------+
| streams | SW GRO | HW GRO | Unit |
|---------+--------+--------+-----------|
| 1 | 36 | 42 | Gbits/sec |
| 4 | 34 | 39 | Gbits/sec |
| 8 | 31 | 35 | Gbits/sec |
+---------+--------+--------+-----------+
A downstream patch will add skb fragment coalescing which will improve
performance considerably.
Benchmark details:
VM based setup
CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores
NIC: ConnectX-7 100GbE
iperf3 and irq running on same CPU over a single receive queue
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
KSM Mkey is KLM Mkey with a fixed buffer size. Due to this fact,
it is a faster mechanism than KLM.
SHAMPO feature used KLMs Mkeys for memory mappings of its headers buffer.
As it used KLMs with the same buffer size for each entry,
we can use KSMs instead.
This commit changes the Mkeys that map the SHAMPO headers buffer
from KLMs to KSMs.
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Count the number of header-only packets and bytes from SHAMPO.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After modifying rx_gro_packets to be more accurate, the
rx_gro_match_packets counter is redundant.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Don't count non GRO packets. A non GRO packet is a packet with
a GRO cb count of 1.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
SHAMPO SKB can be flushed in mlx5e_shampo_complete_rx_cqe().
If the SKB was flushed, rq->hw_gro_data->skb was also set to NULL.
We can skip on flushing the SKB in mlx5e_shampo_flush_skb
if rq->hw_gro_data->skb == NULL.
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mlx5e_fill_skb_data() used to have multiple callers. But after the XDP
multibuf refactoring from commit 2cb0e27d43b4 ("net/mlx5e: RX, Prepare
non-linear striding RQ for XDP multi-buffer support") the SHAMPO code
path is the only caller.
Take advantage of this and specialize the function:
- Drop the redundant check.
- Assume that data_bcnt is > 0. This is needed in a downstream patch.
Rename the function as well to make things clear.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Suggested-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The function that releases SHAMPO header pages (mlx5e_shampo_dealloc_hd)
has some complicated logic that comes from the fact that it is called
twice during teardown:
1) To release the posted header pages that didn't get any completions.
2) To release all remaining header pages.
This flow is not necessary: all header pages can be released from the
driver side in one go. Furthermore, the above flow is buggy. Taking the
8 headers per page example:
1) Release fragments 5-7. Page will be released.
2) Release remaining fragments 0-4. The bits in the header will indicate
that the page needs releasing. But this is incorrect: page was
released in step 1.
This patch releases all header pages in one go. This simplifies the
header page cleanup function. For consistency, the datapath header
page release API (mlx5e_free_rx_shampo_hd_entry()) is used.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When HW GRO is enabled, forwarding of packets is broken due to gso_size
being set incorrectly on non GRO packets.
Non GRO packets have a skb GRO count of 1. mlx5 always sets gso_size on
the skb, even for non GRO packets. It leans on the fact that gso_size is
normally reset in napi_gro_complete(). But this happens only for packets
from GRO'able protocols (TCP/UDP) that have a gro_receive() handler.
The problematic scenarios are:
1) Non GRO protocol packets are received, validate_xmit_skb() will drop
them (see EPROTONOSUPPORT in skb_mac_gso_segment()). The fix for
this case would be to not set gso_size at all for SHAMPO packets with
header size 0.
2) Packets from a GRO'ed protocol (TCP) are received but immediately
flushed because they are not GRO'able (TCP SYN for example).
mlx5e_shampo_update_hdr(), which updates the remaining GRO state on
the skb, is not called because skb GRO count is 1. The fix here would
be to always call mlx5e_shampo_update_hdr(), regardless of skb GRO
count. But this call is expensive
The unified fix for both cases is to reset gso_size before calling
napi_gro_receive(). It is a change that is more effective (no call to
mlx5e_shampo_update_hdr() necessary) and simple (smallest code
footprint).
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For the following scenario:
ethtool --features eth3 rx-gro-hw on
ethtool --features eth3 rx-fcs on
ethtool --features eth3 rx-fcs off
... there is a firmware error because the driver enables HW GRO first
while FCS is still enabled.
This patch fixes this by swapping the order of HW GRO and FCS for this
specific case. Take LRO into consideration as well for consistency.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When all the strides in a WQE have been consumed, the WQE is unlinked
from the WQ linked list (mlx5_wq_ll_pop()). For SHAMPO, it is possible
to receive CQEs with 0 consumed strides for the same WQE even after the
WQE is fully consumed and unlinked. This triggers an additional unlink
for the same wqe which corrupts the linked list.
Fix this scenario by accepting 0 sized consumed strides without
unlinking the WQE again.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Under the following conditions:
1) No skb created yet
2) header_size == 0 (no SHAMPO header)
3) header_index + 1 % MLX5E_SHAMPO_WQ_HEADER_PER_PAGE == 0 (this is the
last page fragment of a SHAMPO header page)
a new skb is formed with a page that is NOT a SHAMPO header page (it
is a regular data page). Further down in the same function
(mlx5e_handle_rx_cqe_mpwrq_shampo()), a SHAMPO header page from
header_index is released. This is wrong and it leads to SHAMPO header
pages being released more than once.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Let the SHAMPO functions use the net-specific prefetch API,
similar to all other usages.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In case of flow rule creation fail in mlx5_lag_create_port_sel_table(),
instead of previously created rules, the tainted pointer is deleted
deveral times.
Fix this bug by using correct flow rules pointers.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 352899f384d4 ("net/mlx5: Lag, use buckets in hash mode")
Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240604100552.25201-1-amishin@t-argos.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, if teardown_hca fails to execute during driver removal, mlx5
does not stop the health timer. Afterwards, mlx5 continue with driver
teardown. This may lead to a UAF bug, which results in page fault
Oops[1], since the health timer invokes after resources were freed.
Hence, stop the health monitor even if teardown_hca fails.
[1]
mlx5_core 0000:18:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
mlx5_core 0000:18:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
mlx5_core 0000:18:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
mlx5_core 0000:18:00.0: E-Switch: cleanup
mlx5_core 0000:18:00.0: wait_func:1155:(pid 1967079): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource
mlx5_core 0000:18:00.0: mlx5_function_close:1288:(pid 1967079): tear_down_hca failed, skip cleanup
BUG: unable to handle page fault for address: ffffa26487064230
PGD 100c00067 P4D 100c00067 PUD 100e5a067 PMD 105ed7067 PTE 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE ------- --- 6.7.0-68.fc38.x86_64 #1
Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0013.121520200651 12/15/2020
RIP: 0010:ioread32be+0x34/0x60
RSP: 0018:ffffa26480003e58 EFLAGS: 00010292
RAX: ffffa26487064200 RBX: ffff9042d08161a0 RCX: ffff904c108222c0
RDX: 000000010bbf1b80 RSI: ffffffffc055ddb0 RDI: ffffa26487064230
RBP: ffff9042d08161a0 R08: 0000000000000022 R09: ffff904c108222e8
R10: 0000000000000004 R11: 0000000000000441 R12: ffffffffc055ddb0
R13: ffffa26487064200 R14: ffffa26480003f00 R15: ffff904c108222c0
FS: 0000000000000000(0000) GS:ffff904c10800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffa26487064230 CR3: 00000002c4420006 CR4: 00000000007706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
? __die+0x23/0x70
? page_fault_oops+0x171/0x4e0
? exc_page_fault+0x175/0x180
? asm_exc_page_fault+0x26/0x30
? __pfx_poll_health+0x10/0x10 [mlx5_core]
? __pfx_poll_health+0x10/0x10 [mlx5_core]
? ioread32be+0x34/0x60
mlx5_health_check_fatal_sensors+0x20/0x100 [mlx5_core]
? __pfx_poll_health+0x10/0x10 [mlx5_core]
poll_health+0x42/0x230 [mlx5_core]
? __next_timer_interrupt+0xbc/0x110
? __pfx_poll_health+0x10/0x10 [mlx5_core]
call_timer_fn+0x21/0x130
? __pfx_poll_health+0x10/0x10 [mlx5_core]
__run_timers+0x222/0x2c0
run_timer_softirq+0x1d/0x40
__do_softirq+0xc9/0x2c8
__irq_exit_rcu+0xa6/0xc0
sysvec_apic_timer_interrupt+0x72/0x90
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20
RIP: 0010:cpuidle_enter_state+0xcc/0x440
? cpuidle_enter_state+0xbd/0x440
cpuidle_enter+0x2d/0x40
do_idle+0x20d/0x270
cpu_startup_entry+0x2a/0x30
rest_init+0xd0/0xd0
arch_call_rest_init+0xe/0x30
start_kernel+0x709/0xa90
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0x96/0xa0
secondary_startup_64_no_verify+0x18f/0x19b
---[ end trace 0000000000000000 ]---
Fixes: 9b98d395b85d ("net/mlx5: Start health poll at earlier stage of driver load")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In case pci channel becomes offline the driver should not wait for PCI
reads during health dump and recovery flow. The driver has timeout for
each of these loops trying to read PCI, so it would fail anyway.
However, in case of recovery waiting till timeout may cause the pci
error_detected() callback fail to meet pci_dpc_recovered() wait timeout.
Fixes: b3bd076f7501 ("net/mlx5: Report devlink health on FW fatal issues")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When the skb is encapsulated, adjust the inner UDP header instead of the
outer one, and account for UDP header (instead of TCP) in the inline
header size calculation.
Fixes: 689adf0d4892 ("net/mlx5e: Add UDP GSO support")
Reported-by: Jason Baron <jbaron@akamai.com>
Closes: https://lore.kernel.org/netdev/c42961cb-50b9-4a9a-bd43-87fe48d88d29@akamai.com/
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
exhaustion
Previously, the driver incorrectly used rx_dropped to report device
buffer exhaustion.
According to the documentation, rx_dropped should not be used to count
packets dropped due to buffer exhaustion, which is the purpose of
rx_missed_errors.
Use rx_missed_errors as intended for counting packets dropped due to
buffer exhaustion.
Fixes: 269e6b3af3bf ("net/mlx5e: Report additional error statistics in get stats ndo")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|