summaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2024-07-16 19:28:34 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2024-07-16 19:28:34 -0700
commit51835949dda3783d4639cfa74ce13a3c9829de00 (patch)
tree2b593de5eba6ecc73f7c58fc65fdaffae45c7323 /Documentation/networking
parent0434dbe32053d07d658165be681505120c6b1abc (diff)
parent77ae5e5b00720372af2860efdc4bc652ac682696 (diff)
Merge tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextHEADmaster
Pull networking updates from Jakub Kicinski: "Not much excitement - a handful of large patchsets (devmem among them) did not make it in time. Core & protocols: - Use local_lock in addition to local_bh_disable() to protect per-CPU resources in networking, a step closer for local_bh_disable() not to act as a big lock on PREEMPT_RT - Use flex array for netdevice priv area, ensure its cache alignment - Add a sysctl knob to allow user to specify a default rto_min at socket init time. Bit of a big hammer but multiple companies were independently carrying such patch downstream so clearly it's useful - Support scheduling transmission of packets based on CLOCK_TAI - Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned off using cpusets - Support multiple L2TPv3 UDP tunnels using the same 5-tuple address - Allow configuration of multipath hash seed, to both allow synchronizing hashing of two routers, and preventing partial accidental sync - Improve TCP compliance with RFC 9293 for simultaneous connect() - Support sending NAT keepalives in IPsec ESP in UDP states. Userspace IKE daemon had to do this before, but the kernel can better keep track of it - Support sending supervision HSR frames with MAC addresses stored in ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled - Introduce IPPROTO_SMC for selecting SMC when socket is created - Allow UDP GSO transmit from devices with no checksum offload - openvswitch: add packet sampling via psample, separating the sampled traffic from "upcall" packets sent to user space for forwarding - nf_tables: shrink memory consumption for transaction objects Things we sprinkled into general kernel code: - Power Sequencing subsystem (used by Qualcomm Bluetooth driver for QCA6390) [ Already merged separately - Linus ] - Add IRQ information in sysfs for auxiliary bus - Introduce guard definition for local_lock - Add aligned flavor of __cacheline_group_{begin, end}() markings for grouping fields in structures BPF: - Notify user space (via epoll) when a struct_ops object is getting detached/unregistered - Add new kfuncs for a generic, open-coded bits iterator - Enable BPF programs to declare arrays of kptr, bpf_rb_root, and bpf_list_head - Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible WRT BTF from modules - Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs - riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter - Add the capability to offload the netfilter flowtable in XDP layer through kfuncs Driver API: - Allow users to configure IRQ tresholds between which automatic IRQ moderation can choose - Expand Power Sourcing (PoE) status with power, class and failure reason. Support setting power limits - Track additional RSS contexts in the core, make sure configuration changes don't break them - Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated ESP data paths - Support updating firmware on SFP modules Tests and tooling: - mptcp: use net/lib.sh to manage netns - TCP-AO and TCP-MD5: replace debug prints used by tests with tracepoints - openvswitch: make test self-contained (don't depend on OvS CLI tools) Drivers: - Ethernet high-speed NICs: - Broadcom (bnxt): - increase the max total outstanding PTP TX packets to 4 - add timestamping statistics support - implement netdev_queue_mgmt_ops - support new RSS context API - Intel (100G, ice, idpf): - implement FEC statistics and dumping signal quality indicators - support E825C products (with 56Gbps PHYs) - nVidia/Mellanox: - support HW-GRO - mlx4/mlx5: support per-queue statistics via netlink - obey the max number of EQs setting in sub-functions - AMD/Solarflare: - support new RSS context API - AMD/Pensando: - ionic: rework fix for doorbell miss to lower overhead and skip it on new HW - Wangxun: - txgbe: support Flow Director perfect filters - Ethernet NICs consumer, embedded and virtual: - Add driver for Tehuti Networks TN40xx chips - Add driver for Meta's internal NIC chips - Add driver for Ethernet MAC on Airoha EN7581 SoCs - Add driver for Renesas Ethernet-TSN devices - Google cloud vNIC: - flow steering support - Microsoft vNIC: - support page sizes other than 4KB on ARM64 - vmware vNIC: - support latency measurement (update to version 9) - VirtIO net: - support for Byte Queue Limits - support configuring thresholds for automatic IRQ moderation - support for AF_XDP Rx zero-copy - Synopsys (stmmac): - support for STM32MP13 SoC - let platforms select the right PCS implementation - TI: - icssg-prueth: add multicast filtering support - icssg-prueth: enable PTP timestamping and PPS - Renesas: - ravb: improve Rx performance 30-400% by using page pool, theaded NAPI and timer-based IRQ coalescing - ravb: add MII support for R-Car V4M - Cadence (macb): - macb: add ARP support to Wake-On-LAN - Cortina: - use phylib for RX and TX pause configuration - Ethernet switches: - nVidia/Mellanox: - support configuration of multipath hash seed - report more accurate max MTU - use page_pool to improve Rx performance - MediaTek: - mt7530: add support for bridge port isolation - Qualcomm: - qca8k: add support for bridge port isolation - Microchip: - lan9371/2: add 100BaseTX PHY support - NXP: - vsc73xx: implement VLAN operations - Ethernet PHYs: - aquantia: enable support for aqr115c - aquantia: add support for PHY LEDs - realtek: add support for rtl8224 2.5Gbps PHY - xpcs: add memory-mapped device support - add BroadR-Reach link mode and support in Broadcom's PHY driver - CAN: - add document for ISO 15765-2 protocol support - mcp251xfd: workaround for erratum DS80000789E, use timestamps to catch when device returns incorrect FIFO status - WiFi: - mac80211/cfg80211: - parse Transmit Power Envelope (TPE) data in mac80211 instead of in drivers - improvements for 6 GHz regulatory flexibility - multi-link improvements - support multiple radios per wiphy - remove DEAUTH_NEED_MGD_TX_PREP flag - Intel (iwlwifi): - bump FW API to 91 for BZ/SC devices - report 64-bit radiotap timestamp - enable P2P low latency by default - handle Transmit Power Envelope (TPE) advertised by AP - remove support for older FW for new devices - fast resume (keeping the device configured) - mvm: re-enable Multi-Link Operation (MLO) - aggregation (A-MSDU) optimizations - MediaTek (mt76): - mt7925 Multi-Link Operation (MLO) support - Qualcomm (ath10k): - LED support for various chipsets - Qualcomm (ath12k): - remove unsupported Tx monitor handling - support channel 2 in 6 GHz band - support Spatial Multiplexing Power Save (SMPS) in 6 GHz band - supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID Advertisements (EMA) - support dynamic VLAN - add panic handler for resetting the firmware state - DebugFS support for datapath statistics - WCN7850: support for Wake on WLAN - Microchip (wilc1000): - read MAC address during probe to make it visible to user space - suspend/resume improvements - TI (wl18xx): - support newer firmware versions - RealTek (rtw89): - preparation for RTL8852BE-VT support - Wake on WLAN support for WiFi 6 chips - 36-bit PCI DMA support - RealTek (rtlwifi): - RTL8192DU support - Broadcom (brcmfmac): - Management Frame Protection support (to enable WPA3) - Bluetooth: - qualcomm: use the power sequencer for QCA6390 - btusb: mediatek: add ISO data transmission functions - hci_bcm4377: add BCM4388 support - btintel: add support for BlazarU core - btintel: add support for Whale Peak2 - btnxpuart: add support for AW693 A1 chipset - btnxpuart: add support for IW615 chipset - btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591" * tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits) eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering" tcp: Replace strncpy() with strscpy() wifi: ath12k: fix build vs old compiler tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child(). eth: fbnic: Write the TCAM tables used for RSS control and Rx to host eth: fbnic: Add L2 address programming eth: fbnic: Add basic Rx handling eth: fbnic: Add basic Tx handling eth: fbnic: Add link detection eth: fbnic: Add initial messaging to notify FW of our presence eth: fbnic: Implement Rx queue alloc/start/stop/free eth: fbnic: Implement Tx queue alloc/start/stop/free eth: fbnic: Allocate a netdevice and napi vectors with queues eth: fbnic: Add FW communication mechanism eth: fbnic: Add message parsing for FW messages eth: fbnic: Add register init to set PCIe/Ethernet device config eth: fbnic: Allocate core device specific structures and devlink interface eth: fbnic: Add scaffolding for Meta's NIC driver PCI: Add Meta Platforms vendor ID net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK ...
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst24
-rw-r--r--Documentation/networking/devlink/ice.rst25
-rw-r--r--Documentation/networking/devlink/octeontx2.rst16
-rw-r--r--Documentation/networking/ethtool-netlink.rst165
-rw-r--r--Documentation/networking/index.rst3
-rw-r--r--Documentation/networking/ip-sysctl.rst27
-rw-r--r--Documentation/networking/iso15765-2.rst386
-rw-r--r--Documentation/networking/mptcp-sysctl.rst70
-rw-r--r--Documentation/networking/mptcp.rst156
-rw-r--r--Documentation/networking/net_dim.rst42
-rw-r--r--Documentation/networking/phy.rst6
-rw-r--r--Documentation/networking/sriov.rst25
-rw-r--r--Documentation/networking/tcp_ao.rst9
13 files changed, 901 insertions, 53 deletions
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
index fed821ef9b09..3bd72577af9a 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -189,22 +189,19 @@ the software port.
* - `rx[i]_gro_packets`
- Number of received packets processed using hardware-accelerated GRO. The
- number of hardware GRO offloaded packets received on ring i.
+ number of hardware GRO offloaded packets received on ring i. Only true GRO
+ packets are counted: only packets that are in an SKB with a GRO count > 1.
- Acceleration
* - `rx[i]_gro_bytes`
- Number of received bytes processed using hardware-accelerated GRO. The
- number of hardware GRO offloaded bytes received on ring i.
+ number of hardware GRO offloaded bytes received on ring i. Only true GRO
+ packets are counted: only packets that are in an SKB with a GRO count > 1.
- Acceleration
* - `rx[i]_gro_skbs`
- - The number of receive SKBs constructed while performing
- hardware-accelerated GRO.
- - Informative
-
- * - `rx[i]_gro_match_packets`
- - Number of received packets processed using hardware-accelerated GRO that
- met the flow table match criteria.
+ - The number of GRO SKBs constructed from hardware-accelerated GRO. Only SKBs
+ with a GRO count > 1 are counted.
- Informative
* - `rx[i]_gro_large_hds`
@@ -212,6 +209,15 @@ the software port.
headers that require additional memory to be allocated.
- Informative
+ * - `rx[i]_hds_nodata_packets`
+ - Number of header only packets in header/data split mode [#accel]_.
+ - Informative
+
+ * - `rx[i]_hds_nodata_bytes`
+ - Number of bytes for header only packets in header/data split mode
+ [#accel]_.
+ - Informative
+
* - `rx[i]_lro_packets`
- The number of LRO packets received on ring i [#accel]_.
- Acceleration
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
index 830c04354222..e3972d03cea0 100644
--- a/Documentation/networking/devlink/ice.rst
+++ b/Documentation/networking/devlink/ice.rst
@@ -11,6 +11,7 @@ Parameters
==========
.. list-table:: Generic parameters implemented
+ :widths: 5 5 90
* - Name
- Mode
@@ -68,6 +69,30 @@ Parameters
To verify that value has been set:
$ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
+.. list-table:: Driver specific parameters implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Mode
+ - Description
+ * - ``local_forwarding``
+ - runtime
+ - Controls loopback behavior by tuning scheduler bandwidth.
+ It impacts all kinds of functions: physical, virtual and
+ subfunctions.
+ Supported values are:
+
+ ``enabled`` - loopback traffic is allowed on port
+
+ ``disabled`` - loopback traffic is not allowed on this port
+
+ ``prioritized`` - loopback traffic is prioritized on this port
+
+ Default value of ``local_forwarding`` parameter is ``enabled``.
+ ``prioritized`` provides ability to adjust loopback traffic rate to increase
+ one port capacity at cost of the another. User needs to disable
+ local forwarding on one of the ports in order have increased capacity
+ on the ``prioritized`` port.
Info versions
=============
diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst
index 610de99b728a..d33a90dd44bf 100644
--- a/Documentation/networking/devlink/octeontx2.rst
+++ b/Documentation/networking/devlink/octeontx2.rst
@@ -40,3 +40,19 @@ The ``octeontx2 AF`` driver implements the following driver-specific parameters.
- runtime
- Use to set the quantum which hardware uses for scheduling among transmit queues.
Hardware uses weighted DWRR algorithm to schedule among all transmit queues.
+
+The ``octeontx2 PF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``unicast_filter_count``
+ - u8
+ - runtime
+ - Set the maximum number of unicast filters that can be programmed for
+ the device. This can be used to achieve better device resource
+ utilization, avoiding over consumption of unused MCAM table entries.
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 160bfb0ae8ba..3ab423b80e91 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -228,6 +228,7 @@ Userspace to kernel:
``ETHTOOL_MSG_PLCA_GET_STATUS`` get PLCA RS status
``ETHTOOL_MSG_MM_GET`` get MAC merge layer state
``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters
+ ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT`` flash transceiver module firmware
===================================== =================================
Kernel to userspace:
@@ -274,6 +275,7 @@ Kernel to userspace:
``ETHTOOL_MSG_PLCA_GET_STATUS_REPLY`` PLCA RS status
``ETHTOOL_MSG_PLCA_NTF`` PLCA RS parameters
``ETHTOOL_MSG_MM_GET_REPLY`` MAC merge layer status
+ ``ETHTOOL_MSG_MODULE_FW_FLASH_NTF`` transceiver module flash updates
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@@ -1033,6 +1035,8 @@ Kernel response contents:
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
+ ``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
+ ``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
=========================================== ====== =======================
Attributes are only included in reply if their value is not zero or the
@@ -1062,6 +1066,10 @@ block should be sent.
This feature is mainly of interest for specific USB devices which does not cope
well with frequent small-sized URBs transmissions.
+``ETHTOOL_A_COALESCE_RX_PROFILE`` and ``ETHTOOL_A_COALESCE_TX_PROFILE`` refer
+to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM)
+<https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_.
+
COALESCE_SET
============
@@ -1098,6 +1106,8 @@ Request contents:
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
+ ``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
+ ``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
=========================================== ====== =======================
Request is rejected if it attributes declared as unsupported by driver (i.e.
@@ -1720,17 +1730,28 @@ Request contents:
Kernel response contents:
- ====================================== ====== =============================
- ``ETHTOOL_A_PSE_HEADER`` nested reply header
- ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
- PSE functions
- ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
- PoDL PSE.
- ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` u32 Operational state of the PoE
- PSE functions.
- ``ETHTOOL_A_C33_PSE_PW_D_STATUS`` u32 power detection status of the
- PoE PSE.
- ====================================== ====== =============================
+ ========================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested reply header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
+ PSE functions
+ ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoDL PSE.
+ ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` u32 Operational state of the PoE
+ PSE functions.
+ ``ETHTOOL_A_C33_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_PW_CLASS`` u32 power class of the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` u32 actual power drawn on the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_EXT_STATE`` u32 power extended state of the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` u32 power extended substatus of
+ the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` u32 currently configured power
+ limit of the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested Supported power limit
+ configuration ranges.
+ ========================================== ====== =============================
When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies
the operational state of the PoDL PSE functions. The operational state of the
@@ -1762,6 +1783,46 @@ The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_PW_D_STATUS`` implementing
.. kernel-doc:: include/uapi/linux/ethtool.h
:identifiers: ethtool_c33_pse_pw_d_status
+When set, the optional ``ETHTOOL_A_C33_PSE_PW_CLASS`` attribute identifies
+the power class of the C33 PSE. It depends on the class negotiated between
+the PSE and the PD. This option is corresponding to ``IEEE 802.3-2022``
+30.9.1.1.8 aPSEPowerClassification.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` attribute identifies
+This option is corresponding to ``IEEE 802.3-2022`` 30.9.1.1.23 aPSEActualPower.
+Actual power is reported in mW.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_EXT_STATE`` attribute identifies
+the extended error state of the C33 PSE. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_c33_pse_ext_state
+
+When set, the optional ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` attribute identifies
+the extended error state of the C33 PSE. Possible values are:
+Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_c33_pse_ext_substate_class_num_events
+ ethtool_c33_pse_ext_substate_error_condition
+ ethtool_c33_pse_ext_substate_mr_pse_enable
+ ethtool_c33_pse_ext_substate_option_detect_ted
+ ethtool_c33_pse_ext_substate_option_vport_lim
+ ethtool_c33_pse_ext_substate_ovld_detected
+ ethtool_c33_pse_ext_substate_pd_dll_power_type
+ ethtool_c33_pse_ext_substate_power_not_available
+ ethtool_c33_pse_ext_substate_short_detected
+
+When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` attribute
+identifies the C33 PSE power limit in mW.
+
+When set the optional ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested attribute
+identifies the C33 PSE power limit ranges through
+``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MIN`` and
+``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MAX``.
+If the controller works with fixed classes, the min and max values will be
+equal.
+
PSE_SET
=======
@@ -1773,6 +1834,8 @@ Request contents:
``ETHTOOL_A_PSE_HEADER`` nested request header
``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` u32 Control PoDL PSE Admin state
``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` u32 Control PSE Admin state
+ ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` u32 Control PoE PSE available
+ power limit
====================================== ====== =============================
When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used
@@ -1783,6 +1846,18 @@ to control PoDL PSE Admin functions. This option is implementing
The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` implementing
``IEEE 802.3-2022`` 30.9.1.2.1 acPSEAdminControl.
+When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` attribute is
+used to control the available power value limit for C33 PSE in milliwatts.
+This attribute corresponds to the `pse_available_power` variable described in
+``IEEE 802.3-2022`` 33.2.4.4 Variables and `pse_avail_pwr` in 145.2.5.4
+Variables, which are described in power classes.
+
+It was decided to use milliwatts for this interface to unify it with other
+power monitoring interfaces, which also use milliwatts, and to align with
+various existing products that document power consumption in watts rather than
+classes. If power limit configuration based on classes is needed, the
+conversion can be done in user space, for example by ethtool.
+
RSS_GET
=======
@@ -2033,6 +2108,73 @@ The attributes are propagated to the driver through the following structure:
.. kernel-doc:: include/linux/ethtool.h
:identifiers: ethtool_mm_cfg
+MODULE_FW_FLASH_ACT
+===================
+
+Flashes transceiver module firmware.
+
+Request contents:
+
+ ======================================= ====== ===========================
+ ``ETHTOOL_A_MODULE_FW_FLASH_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` string firmware image file name
+ ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` u32 transceiver module password
+ ======================================= ====== ===========================
+
+The firmware update process consists of three logical steps:
+
+1. Downloading a firmware image to the transceiver module and validating it.
+2. Running the firmware image.
+3. Committing the firmware image so that it is run upon reset.
+
+When flash command is given, those three steps are taken in that order.
+
+This message merely schedules the update process and returns immediately
+without blocking. The process then runs asynchronously.
+Since it can take several minutes to complete, during the update process
+notifications are emitted from the kernel to user space updating it about
+the status and progress.
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` attribute encodes the firmware
+image file name. The firmware image is downloaded to the transceiver module,
+validated, run and committed.
+
+The optional ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` attribute encodes a password
+that might be required as part of the transceiver module firmware update
+process.
+
+The firmware update process can take several minutes to complete. Therefore,
+during the update process notifications are emitted from the kernel to user
+space updating it about the status and progress.
+
+
+
+Notification contents:
+
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_HEADER`` | nested | reply header |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` | u32 | status |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` | string | status message |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` | uint | progress |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL`` | uint | total |
+ +---------------------------------------------------+--------+----------------+
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` attribute encodes the current status
+of the firmware update process. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_fw_flash_status
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` attribute encodes a status message
+string.
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` and ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL``
+attributes encode the completed and total amount of work, respectively.
+
Request translation
===================
@@ -2139,4 +2281,5 @@ are netlink only.
n/a ``ETHTOOL_MSG_PLCA_GET_STATUS``
n/a ``ETHTOOL_MSG_MM_GET``
n/a ``ETHTOOL_MSG_MM_SET``
+ n/a ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT``
=================================== =====================================
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 7664c0bfe461..d1af04b952f8 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -19,6 +19,7 @@ Contents:
caif/index
ethtool-netlink
ieee802154
+ iso15765-2
j1939
kapi
msg_zerocopy
@@ -72,6 +73,7 @@ Contents:
mac80211-injection
mctp
mpls-sysctl
+ mptcp
mptcp-sysctl
multiqueue
multi-pf-netdev
@@ -104,6 +106,7 @@ Contents:
seg6-sysctl
skbuff
smc-sysctl
+ sriov
statistics
strparser
switchdev
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index bd50df6a5a42..3616389c8c2d 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -131,6 +131,20 @@ fib_multipath_hash_fields - UNSIGNED INTEGER
Default: 0x0007 (source IP, destination IP and IP protocol)
+fib_multipath_hash_seed - UNSIGNED INTEGER
+ The seed value used when calculating hash for multipath routes. Applies
+ to both IPv4 and IPv6 datapath. Only present for kernels built with
+ CONFIG_IP_ROUTE_MULTIPATH enabled.
+
+ When set to 0, the seed value used for multipath routing defaults to an
+ internal random-generated one.
+
+ The actual hashing algorithm is not specified -- there is no guarantee
+ that a next hop distribution effected by a given seed will keep stable
+ across kernel versions.
+
+ Default: 0 (random)
+
fib_sync_mem - UNSIGNED INTEGER
Amount of dirty memory from fib entries that can be backlogged before
synchronize_rcu is forced.
@@ -1196,6 +1210,19 @@ tcp_pingpong_thresh - INTEGER
Default: 1
+tcp_rto_min_us - INTEGER
+ Minimal TCP retransmission timeout (in microseconds). Note that the
+ rto_min route option has the highest precedence for configuring this
+ setting, followed by the TCP_BPF_RTO_MIN socket option, followed by
+ this tcp_rto_min_us sysctl.
+
+ The recommended practice is to use a value less or equal to 200000
+ microseconds.
+
+ Possible Values: 1 - INT_MAX
+
+ Default: 200000
+
UDP variables
=============
diff --git a/Documentation/networking/iso15765-2.rst b/Documentation/networking/iso15765-2.rst
new file mode 100644
index 000000000000..0e9d96074178
--- /dev/null
+++ b/Documentation/networking/iso15765-2.rst
@@ -0,0 +1,386 @@
+.. SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+====================
+ISO 15765-2 (ISO-TP)
+====================
+
+Overview
+========
+
+ISO 15765-2, also known as ISO-TP, is a transport protocol specifically defined
+for diagnostic communication on CAN. It is widely used in the automotive
+industry, for example as the transport protocol for UDSonCAN (ISO 14229-3) or
+emission-related diagnostic services (ISO 15031-5).
+
+ISO-TP can be used both on CAN CC (aka Classical CAN) and CAN FD (CAN with
+Flexible Datarate) based networks. It is also designed to be compatible with a
+CAN network using SAE J1939 as data link layer (however, this is not a
+requirement).
+
+Specifications used
+-------------------
+
+* ISO 15765-2:2024 : Road vehicles - Diagnostic communication over Controller
+ Area Network (DoCAN). Part 2: Transport protocol and network layer services.
+
+Addressing
+----------
+
+In its simplest form, ISO-TP is based on two kinds of addressing modes for the
+nodes connected to the same network:
+
+* physical addressing is implemented by two node-specific addresses and is used
+ in 1-to-1 communication.
+
+* functional addressing is implemented by one node-specific address and is used
+ in 1-to-N communication.
+
+Three different addressing formats can be employed:
+
+* "normal" : each address is represented simply by a CAN ID.
+
+* "extended": each address is represented by a CAN ID plus the first byte of
+ the CAN payload; both the CAN ID and the byte inside the payload shall be
+ different between two addresses.
+
+* "mixed": each address is represented by a CAN ID plus the first byte of
+ the CAN payload; the CAN ID is different between two addresses, but the
+ additional byte is the same.
+
+Transport protocol and associated frame types
+---------------------------------------------
+
+When transmitting data using the ISO-TP protocol, the payload can either fit
+inside one single CAN message or not, also considering the overhead the protocol
+is generating and the optional extended addressing. In the first case, the data
+is transmitted at once using a so-called Single Frame (SF). In the second case,
+ISO-TP defines a multi-frame protocol, in which the sender provides (through a
+First Frame - FF) the PDU length which is to be transmitted and also asks for a
+Flow Control (FC) frame, which provides the maximum supported size of a macro
+data block (``blocksize``) and the minimum time between the single CAN messages
+composing such block (``stmin``). Once this information has been received, the
+sender starts to send frames containing fragments of the data payload (called
+Consecutive Frames - CF), stopping after every ``blocksize``-sized block to wait
+confirmation from the receiver which should then send another Flow Control
+frame to inform the sender about its availability to receive more data.
+
+How to Use ISO-TP
+=================
+
+As with others CAN protocols, the ISO-TP stack support is built into the
+Linux network subsystem for the CAN bus, aka. Linux-CAN or SocketCAN, and
+thus follows the same socket API.
+
+Creation and basic usage of an ISO-TP socket
+--------------------------------------------
+
+To use the ISO-TP stack, ``#include <linux/can/isotp.h>`` shall be used. A
+socket can then be created using the ``PF_CAN`` protocol family, the
+``SOCK_DGRAM`` type (as the underlying protocol is datagram-based by design)
+and the ``CAN_ISOTP`` protocol:
+
+.. code-block:: C
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
+
+After the socket has been successfully created, ``bind(2)`` shall be called to
+bind the socket to the desired CAN interface; to do so:
+
+* a TX CAN ID shall be specified as part of the sockaddr supplied to the call
+ itself.
+
+* a RX CAN ID shall also be specified, unless broadcast flags have been set
+ through socket option (explained below).
+
+Once bound to an interface, the socket can be read from and written to using
+the usual ``read(2)`` and ``write(2)`` system calls, as well as ``send(2)``,
+``sendmsg(2)``, ``recv(2)`` and ``recvmsg(2)``.
+Unlike the CAN_RAW socket API, only the ISO-TP data field (the actual payload)
+is sent and received by the userspace application using these calls. The address
+information and the protocol information are automatically filled by the ISO-TP
+stack using the configuration supplied during socket creation. In the same way,
+the stack will use the transport mechanism when required (i.e., when the size
+of the data payload exceeds the MTU of the underlying CAN bus).
+
+The sockaddr structure used for SocketCAN has extensions for use with ISO-TP,
+as specified below:
+
+.. code-block:: C
+
+ struct sockaddr_can {
+ sa_family_t can_family;
+ int can_ifindex;
+ union {
+ struct { canid_t rx_id, tx_id; } tp;
+ ...
+ } can_addr;
+ }
+
+* ``can_family`` and ``can_ifindex`` serve the same purpose as for other
+ SocketCAN sockets.
+
+* ``can_addr.tp.rx_id`` specifies the receive (RX) CAN ID and will be used as
+ a RX filter.
+
+* ``can_addr.tp.tx_id`` specifies the transmit (TX) CAN ID
+
+ISO-TP socket options
+---------------------
+
+When creating an ISO-TP socket, reasonable defaults are set. Some options can
+be modified with ``setsockopt(2)`` and/or read back with ``getsockopt(2)``.
+
+General options
+~~~~~~~~~~~~~~~
+
+General socket options can be passed using the ``CAN_ISOTP_OPTS`` optname:
+
+.. code-block:: C
+
+ struct can_isotp_options opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_OPTS, &opts, sizeof(opts))
+
+where the ``can_isotp_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_options {
+ u32 flags;
+ u32 frame_txtime;
+ u8 ext_address;
+ u8 txpad_content;
+ u8 rxpad_content;
+ u8 rx_ext_address;
+ };
+
+* ``flags``: modifiers to be applied to the default behaviour of the ISO-TP
+ stack. Following flags are available:
+
+ * ``CAN_ISOTP_LISTEN_MODE``: listen only (do not send FC frames); normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_EXTEND_ADDR``: use the byte specified in ``ext_address`` as an
+ additional address component. This enables the "mixed" addressing format if
+ used alone, or the "extended" addressing format if used in conjunction with
+ ``CAN_ISOTP_RX_EXT_ADDR``.
+
+ * ``CAN_ISOTP_TX_PADDING``: enable padding for transmitted frames, using
+ ``txpad_content`` as value for the padding bytes.
+
+ * ``CAN_ISOTP_RX_PADDING``: enable padding for the received frames, using
+ ``rxpad_content`` as value for the padding bytes.
+
+ * ``CAN_ISOTP_CHK_PAD_LEN``: check for correct padding length on the received
+ frames.
+
+ * ``CAN_ISOTP_CHK_PAD_DATA``: check padding bytes on the received frames
+ against ``rxpad_content``; if ``CAN_ISOTP_RX_PADDING`` is not specified,
+ this flag is ignored.
+
+ * ``CAN_ISOTP_HALF_DUPLEX``: force ISO-TP socket in half duplex mode
+ (that is, transport mechanism can only be incoming or outgoing at the same
+ time, not both).
+
+ * ``CAN_ISOTP_FORCE_TXSTMIN``: ignore stmin from received FC; normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_FORCE_RXSTMIN``: ignore CFs depending on rx stmin; normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_RX_EXT_ADDR``: use ``rx_ext_address`` instead of ``ext_address``
+ as extended addressing byte on the reception path. If used in conjunction
+ with ``CAN_ISOTP_EXTEND_ADDR``, this flag effectively enables the "extended"
+ addressing format.
+
+ * ``CAN_ISOTP_WAIT_TX_DONE``: wait until the frame is sent before returning
+ from ``write(2)`` and ``send(2)`` calls (i.e., blocking write operations).
+
+ * ``CAN_ISOTP_SF_BROADCAST``: use 1-to-N functional addressing (cannot be
+ specified alongside ``CAN_ISOTP_CF_BROADCAST``).
+
+ * ``CAN_ISOTP_CF_BROADCAST``: use 1-to-N transmission without flow control
+ (cannot be specified alongside ``CAN_ISOTP_SF_BROADCAST``).
+ NOTE: this is not covered by the ISO 15765-2 standard.
+
+ * ``CAN_ISOTP_DYN_FC_PARMS``: enable dynamic update of flow control
+ parameters.
+
+* ``frame_txtime``: frame transmission time (defined as N_As/N_Ar inside the
+ ISO standard); if ``0``, the default (or the last set value) is used.
+ To set the transmission time to ``0``, the ``CAN_ISOTP_FRAME_TXTIME_ZERO``
+ macro (equal to 0xFFFFFFFF) shall be used.
+
+* ``ext_address``: extended addressing byte, used if the
+ ``CAN_ISOTP_EXTEND_ADDR`` flag is specified.
+
+* ``txpad_content``: byte used as padding value for transmitted frames.
+
+* ``rxpad_content``: byte used as padding value for received frames.
+
+* ``rx_ext_address``: extended addressing byte for the reception path, used if
+ the ``CAN_ISOTP_RX_EXT_ADDR`` flag is specified.
+
+Flow Control options
+~~~~~~~~~~~~~~~~~~~~
+
+Flow Control (FC) options can be passed using the ``CAN_ISOTP_RECV_FC`` optname
+to provide the communication parameters for receiving ISO-TP PDUs.
+
+.. code-block:: C
+
+ struct can_isotp_fc_options fc_opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RECV_FC, &fc_opts, sizeof(fc_opts));
+
+where the ``can_isotp_fc_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_options {
+ u8 bs;
+ u8 stmin;
+ u8 wftmax;
+ };
+
+* ``bs``: blocksize provided in flow control frames.
+
+* ``stmin``: minimum separation time provided in flow control frames; can
+ have the following values (others are reserved):
+
+ * 0x00 - 0x7F : 0 - 127 ms
+
+ * 0xF1 - 0xF9 : 100 us - 900 us
+
+* ``wftmax``: maximum number of wait frames provided in flow control frames.
+
+Link Layer options
+~~~~~~~~~~~~~~~~~~
+
+Link Layer (LL) options can be passed using the ``CAN_ISOTP_LL_OPTS`` optname:
+
+.. code-block:: C
+
+ struct can_isotp_ll_options ll_opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_LL_OPTS, &ll_opts, sizeof(ll_opts));
+
+where the ``can_isotp_ll_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_ll_options {
+ u8 mtu;
+ u8 tx_dl;
+ u8 tx_flags;
+ };
+
+* ``mtu``: generated and accepted CAN frame type, can be equal to ``CAN_MTU``
+ for classical CAN frames or ``CANFD_MTU`` for CAN FD frames.
+
+* ``tx_dl``: maximum payload length for transmitted frames, can have one value
+ among: 8, 12, 16, 20, 24, 32, 48, 64. Values above 8 only apply to CAN FD
+ traffic (i.e.: ``mtu = CANFD_MTU``).
+
+* ``tx_flags``: flags set into ``struct canfd_frame.flags`` at frame creation.
+ Only applies to CAN FD traffic (i.e.: ``mtu = CANFD_MTU``).
+
+Transmission stmin
+~~~~~~~~~~~~~~~~~~
+
+The transmission minimum separation time (stmin) can be forced using the
+``CAN_ISOTP_TX_STMIN`` optname and providing an stmin value in microseconds as
+a 32bit unsigned integer; this will overwrite the value sent by the receiver in
+flow control frames:
+
+.. code-block:: C
+
+ uint32_t stmin;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_TX_STMIN, &stmin, sizeof(stmin));
+
+Reception stmin
+~~~~~~~~~~~~~~~
+
+The reception minimum separation time (stmin) can be forced using the
+``CAN_ISOTP_RX_STMIN`` optname and providing an stmin value in microseconds as
+a 32bit unsigned integer; received Consecutive Frames (CF) which timestamps
+differ less than this value will be ignored:
+
+.. code-block:: C
+
+ uint32_t stmin;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RX_STMIN, &stmin, sizeof(stmin));
+
+Multi-frame transport support
+-----------------------------
+
+The ISO-TP stack contained inside the Linux kernel supports the multi-frame
+transport mechanism defined by the standard, with the following constraints:
+
+* the maximum size of a PDU is defined by a module parameter, with an hard
+ limit imposed at build time.
+
+* when a transmission is in progress, subsequent calls to ``write(2)`` will
+ block, while calls to ``send(2)`` will either block or fail depending on the
+ presence of the ``MSG_DONTWAIT`` flag.
+
+* no support is present for sending "wait frames": whether a PDU can be fully
+ received or not is decided when the First Frame is received.
+
+Errors
+------
+
+Following errors are reported to userspace:
+
+RX path errors
+~~~~~~~~~~~~~~
+
+============ ===============================================================
+-ETIMEDOUT timeout of data reception
+-EILSEQ sequence number mismatch during a multi-frame reception
+-EBADMSG data reception with wrong padding
+============ ===============================================================
+
+TX path errors
+~~~~~~~~~~~~~~
+
+========== =================================================================
+-ECOMM flow control reception timeout
+-EMSGSIZE flow control reception overflow
+-EBADMSG flow control reception with wrong layout/padding
+========== =================================================================
+
+Examples
+========
+
+Basic node example
+------------------
+
+Following example implements a node using "normal" physical addressing, with
+RX ID equal to 0x18DAF142 and a TX ID equal to 0x18DA42F1. All options are left
+to their default.
+
+.. code-block:: C
+
+ int s;
+ struct sockaddr_can addr;
+ int ret;
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
+ if (s < 0)
+ exit(1);
+
+ addr.can_family = AF_CAN;
+ addr.can_ifindex = if_nametoindex("can0");
+ addr.tp.tx_id = 0x18DA42F1 | CAN_EFF_FLAG;
+ addr.tp.rx_id = 0x18DAF142 | CAN_EFF_FLAG;
+
+ ret = bind(s, (struct sockaddr *)&addr, sizeof(addr));
+ if (ret < 0)
+ exit(1);
+
+ /* Data can now be received using read(s, ...) and sent using write(s, ...) */
+
+Additional examples
+-------------------
+
+More complete (and complex) examples can be found inside the ``isotp*`` userland
+tools, distributed as part of the ``can-utils`` utilities at:
+https://github.com/linux-can/can-utils
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
index 69975ce25a02..fd514bba8c43 100644
--- a/Documentation/networking/mptcp-sysctl.rst
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -7,14 +7,6 @@ MPTCP Sysfs variables
/proc/sys/net/mptcp/* Variables
===============================
-enabled - BOOLEAN
- Control whether MPTCP sockets can be created.
-
- MPTCP sockets can be created if the value is 1. This is a
- per-namespace sysctl.
-
- Default: 1 (enabled)
-
add_addr_timeout - INTEGER (seconds)
Set the timeout after which an ADD_ADDR control message will be
resent to an MPTCP peer that has not acknowledged a previous
@@ -25,16 +17,22 @@ add_addr_timeout - INTEGER (seconds)
Default: 120
-close_timeout - INTEGER (seconds)
- Set the make-after-break timeout: in absence of any close or
- shutdown syscall, MPTCP sockets will maintain the status
- unchanged for such time, after the last subflow removal, before
- moving to TCP_CLOSE.
+allow_join_initial_addr_port - BOOLEAN
+ Allow peers to send join requests to the IP address and port number used
+ by the initial subflow if the value is 1. This controls a flag that is
+ sent to the peer at connection time, and whether such join requests are
+ accepted or denied.
- The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
- sysctl.
+ Joins to addresses advertised with ADD_ADDR are not affected by this
+ value.
- Default: 60
+ This is a per-namespace sysctl.
+
+ Default: 1
+
+available_schedulers - STRING
+ Shows the available schedulers choices that are registered. More packet
+ schedulers may be available, but not loaded.
checksum_enabled - BOOLEAN
Control whether DSS checksum can be enabled.
@@ -44,18 +42,24 @@ checksum_enabled - BOOLEAN
Default: 0
-allow_join_initial_addr_port - BOOLEAN
- Allow peers to send join requests to the IP address and port number used
- by the initial subflow if the value is 1. This controls a flag that is
- sent to the peer at connection time, and whether such join requests are
- accepted or denied.
+close_timeout - INTEGER (seconds)
+ Set the make-after-break timeout: in absence of any close or
+ shutdown syscall, MPTCP sockets will maintain the status
+ unchanged for such time, after the last subflow removal, before
+ moving to TCP_CLOSE.
- Joins to addresses advertised with ADD_ADDR are not affected by this
- value.
+ The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
+ sysctl.
- This is a per-namespace sysctl.
+ Default: 60
- Default: 1
+enabled - BOOLEAN
+ Control whether MPTCP sockets can be created.
+
+ MPTCP sockets can be created if the value is 1. This is a
+ per-namespace sysctl.
+
+ Default: 1 (enabled)
pm_type - INTEGER
Set the default path manager type to use for each new MPTCP
@@ -74,6 +78,14 @@ pm_type - INTEGER
Default: 0
+scheduler - STRING
+ Select the scheduler of your choice.
+
+ Support for selection of different schedulers. This is a per-namespace
+ sysctl.
+
+ Default: "default"
+
stale_loss_cnt - INTEGER
The number of MPTCP-level retransmission intervals with no traffic and
pending outstanding data on a given subflow required to declare it stale.
@@ -85,11 +97,3 @@ stale_loss_cnt - INTEGER
This is a per-namespace sysctl.
Default: 4
-
-scheduler - STRING
- Select the scheduler of your choice.
-
- Support for selection of different schedulers. This is a per-namespace
- sysctl.
-
- Default: "default"
diff --git a/Documentation/networking/mptcp.rst b/Documentation/networking/mptcp.rst
new file mode 100644
index 000000000000..17f2bab61164
--- /dev/null
+++ b/Documentation/networking/mptcp.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multipath TCP (MPTCP)
+=====================
+
+Introduction
+============
+
+Multipath TCP or MPTCP is an extension to the standard TCP and is described in
+`RFC 8684 (MPTCPv1) <https://www.rfc-editor.org/rfc/rfc8684.html>`_. It allows a
+device to make use of multiple interfaces at once to send and receive TCP
+packets over a single MPTCP connection. MPTCP can aggregate the bandwidth of
+multiple interfaces or prefer the one with the lowest latency. It also allows a
+fail-over if one path is down, and the traffic is seamlessly reinjected on other
+paths.
+
+For more details about Multipath TCP in the Linux kernel, please see the
+official website: `mptcp.dev <https://www.mptcp.dev>`_.
+
+
+Use cases
+=========
+
+Thanks to MPTCP, being able to use multiple paths in parallel or simultaneously
+brings new use-cases, compared to TCP:
+
+- Seamless handovers: switching from one path to another while preserving
+ established connections, e.g. to be used in mobility use-cases, like on
+ smartphones.
+- Best network selection: using the "best" available path depending on some
+ conditions, e.g. latency, losses, cost, bandwidth, etc.
+- Network aggregation: using multiple paths at the same time to have a higher
+ throughput, e.g. to combine fixed and mobile networks to send files faster.
+
+
+Concepts
+========
+
+Technically, when a new socket is created with the ``IPPROTO_MPTCP`` protocol
+(Linux-specific), a *subflow* (or *path*) is created. This *subflow* consists of
+a regular TCP connection that is used to transmit data through one interface.
+Additional *subflows* can be negotiated later between the hosts. For the remote
+host to be able to detect the use of MPTCP, a new field is added to the TCP
+*option* field of the underlying TCP *subflow*. This field contains, amongst
+other things, a ``MP_CAPABLE`` option that tells the other host to use MPTCP if
+it is supported. If the remote host or any middlebox in between does not support
+it, the returned ``SYN+ACK`` packet will not contain MPTCP options in the TCP
+*option* field. In that case, the connection will be "downgraded" to plain TCP,
+and it will continue with a single path.
+
+This behavior is made possible by two internal components: the path manager, and
+the packet scheduler.
+
+Path Manager
+------------
+
+The Path Manager is in charge of *subflows*, from creation to deletion, and also
+address announcements. Typically, it is the client side that initiates subflows,
+and the server side that announces additional addresses via the ``ADD_ADDR`` and
+``REMOVE_ADDR`` options.
+
+Path managers are controlled by the ``net.mptcp.pm_type`` sysctl knob -- see
+mptcp-sysctl.rst. There are two types: the in-kernel one (type ``0``) where the
+same rules are applied for all the connections (see: ``ip mptcp``) ; and the
+userspace one (type ``1``), controlled by a userspace daemon (i.e. `mptcpd
+<https://mptcpd.mptcp.dev/>`_) where different rules can be applied for each
+connection. The path managers can be controlled via a Netlink API; see
+netlink_spec/mptcp_pm.rst.
+
+To be able to use multiple IP addresses on a host to create multiple *subflows*
+(paths), the default in-kernel MPTCP path-manager needs to know which IP
+addresses can be used. This can be configured with ``ip mptcp endpoint`` for
+example.
+
+Packet Scheduler
+----------------
+
+The Packet Scheduler is in charge of selecting which available *subflow(s)* to
+use to send the next data packet. It can decide to maximize the use of the
+available bandwidth, only to pick the path with the lower latency, or any other
+policy depending on the configuration.
+
+Packet schedulers are controlled by the ``net.mptcp.scheduler`` sysctl knob --
+see mptcp-sysctl.rst.
+
+
+Sockets API
+===========
+
+Creating MPTCP sockets
+----------------------
+
+On Linux, MPTCP can be used by selecting MPTCP instead of TCP when creating the
+``socket``:
+
+.. code-block:: C
+
+ int sd = socket(AF_INET(6), SOCK_STREAM, IPPROTO_MPTCP);
+
+Note that ``IPPROTO_MPTCP`` is defined as ``262``.
+
+If MPTCP is not supported, ``errno`` will be set to:
+
+- ``EINVAL``: (*Invalid argument*): MPTCP is not available, on kernels < 5.6.
+- ``EPROTONOSUPPORT`` (*Protocol not supported*): MPTCP has not been compiled,
+ on kernels >= v5.6.
+- ``ENOPROTOOPT`` (*Protocol not available*): MPTCP has been disabled using
+ ``net.mptcp.enabled`` sysctl knob; see mptcp-sysctl.rst.
+
+MPTCP is then opt-in: applications need to explicitly request it. Note that
+applications can be forced to use MPTCP with different techniques, e.g.
+``LD_PRELOAD`` (see ``mptcpize``), eBPF (see ``mptcpify``), SystemTAP,
+``GODEBUG`` (``GODEBUG=multipathtcp=1``), etc.
+
+Switching to ``IPPROTO_MPTCP`` instead of ``IPPROTO_TCP`` should be as
+transparent as possible for the userspace applications.
+
+Socket options
+--------------
+
+MPTCP supports most socket options handled by TCP. It is possible some less
+common options are not supported, but contributions are welcome.
+
+Generally, the same value is propagated to all subflows, including the ones
+created after the calls to ``setsockopt()``. eBPF can be used to set different
+values per subflow.
+
+There are some MPTCP specific socket options at the ``SOL_MPTCP`` (284) level to
+retrieve info. They fill the ``optval`` buffer of the ``getsockopt()`` system
+call:
+
+- ``MPTCP_INFO``: Uses ``struct mptcp_info``.
+- ``MPTCP_TCPINFO``: Uses ``struct mptcp_subflow_data``, followed by an array of
+ ``struct tcp_info``.
+- ``MPTCP_SUBFLOW_ADDRS``: Uses ``struct mptcp_subflow_data``, followed by an
+ array of ``mptcp_subflow_addrs``.
+- ``MPTCP_FULL_INFO``: Uses ``struct mptcp_full_info``, with one pointer to an
+ array of ``struct mptcp_subflow_info`` (including the
+ ``struct mptcp_subflow_addrs``), and one pointer to an array of
+ ``struct tcp_info``, followed by the content of ``struct mptcp_info``.
+
+Note that at the TCP level, ``TCP_IS_MPTCP`` socket option can be used to know
+if MPTCP is currently being used: the value will be set to 1 if it is.
+
+
+Design choices
+==============
+
+A new socket type has been added for MPTCP for the userspace-facing socket. The
+kernel is in charge of creating subflow sockets: they are TCP sockets where the
+behavior is modified using TCP-ULP.
+
+MPTCP listen sockets will create "plain" *accepted* TCP sockets if the
+connection request from the client didn't ask for MPTCP, making the performance
+impact minimal when MPTCP is enabled by default.
diff --git a/Documentation/networking/net_dim.rst b/Documentation/networking/net_dim.rst
index 3bed9fd95336..8908fd7b0a8d 100644
--- a/Documentation/networking/net_dim.rst
+++ b/Documentation/networking/net_dim.rst
@@ -169,6 +169,48 @@ usage is not complete but it should make the outline of the usage clear.
...
}
+
+Tuning DIM
+==========
+
+Net DIM serves a range of network devices and delivers excellent acceleration
+benefits. Yet, it has been observed that some preset configurations of DIM may
+not align seamlessly with the varying specifications of network devices, and
+this discrepancy has been identified as a factor to the suboptimal performance
+outcomes of DIM-enabled network devices, related to a mismatch in profiles.
+
+To address this issue, Net DIM introduces a per-device control to modify and
+access a device's ``rx-profile`` and ``tx-profile`` parameters:
+Assume that the target network device is named ethx, and ethx only declares
+support for RX profile setting and supports modification of ``usec`` field
+and ``pkts`` field (See the data structure:
+:c:type:`struct dim_cq_moder <dim_cq_moder>`).
+
+You can use ethtool to modify the current RX DIM profile where all
+values are 64::
+
+ $ ethtool -C ethx rx-profile 1,1,n_2,2,n_3,n,n_n,4,n_n,n,n
+
+``n`` means do not modify this field, and ``_`` separates structure
+elements of the profile array.
+
+Querying the current profiles using::
+
+ $ ethtool -c ethx
+ ...
+ rx-profile:
+ {.usec = 1, .pkts = 1, .comps = n/a,},
+ {.usec = 2, .pkts = 2, .comps = n/a,},
+ {.usec = 3, .pkts = 64, .comps = n/a,},
+ {.usec = 64, .pkts = 4, .comps = n/a,},
+ {.usec = 64, .pkts = 64, .comps = n/a,}
+ tx-profile: n/a
+
+If the network device does not support specific fields of DIM profiles,
+the corresponding ``n/a`` will display. If the ``n/a`` field is being
+modified, error messages will be reported.
+
+
Dynamic Interrupt Moderation (DIM) library API
==============================================
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index 1283240d7620..f64641417c54 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -327,6 +327,12 @@ Some of the interface modes are described below:
This is the Penta SGMII mode, it is similar to QSGMII but it combines 5
SGMII lines into a single link compared to 4 on QSGMII.
+``PHY_INTERFACE_MODE_10G_QXGMII``
+ Represents the 10G-QXGMII PHY-MAC interface as defined by the Cisco USXGMII
+ Multiport Copper Interface document. It supports 4 ports over a 10.3125 GHz
+ SerDes lane, each port having speeds of 2.5G / 1G / 100M / 10M achieved
+ through symbol replication. The PCS expects the standard USXGMII code word.
+
Pause frames / flow control
===========================
diff --git a/Documentation/networking/sriov.rst b/Documentation/networking/sriov.rst
new file mode 100644
index 000000000000..5deb4ff3154f
--- /dev/null
+++ b/Documentation/networking/sriov.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+NIC SR-IOV APIs
+===============
+
+Modern NICs are strongly encouraged to focus on implementing the ``switchdev``
+model (see :ref:`switchdev`) to configure forwarding and security of SR-IOV
+functionality.
+
+Legacy API
+==========
+
+The old SR-IOV API is implemented in ``rtnetlink`` Netlink family as part of
+the ``RTM_GETLINK`` and ``RTM_SETLINK`` commands. On the driver side
+it consists of a number of ``ndo_set_vf_*`` and ``ndo_get_vf_*`` callbacks.
+
+Since the legacy APIs do not integrate well with the rest of the stack
+the API is considered frozen; no new functionality or extensions
+will be accepted. New drivers should not implement the uncommon callbacks;
+namely the following callbacks are off limits:
+
+ - ``ndo_get_vf_port``
+ - ``ndo_set_vf_port``
+ - ``ndo_set_vf_rss_query_en``
diff --git a/Documentation/networking/tcp_ao.rst b/Documentation/networking/tcp_ao.rst
index 8a58321acce7..e96e62d1dab3 100644
--- a/Documentation/networking/tcp_ao.rst
+++ b/Documentation/networking/tcp_ao.rst
@@ -337,6 +337,15 @@ TCP-AO per-socket counters are also duplicated with per-netns counters,
exposed with SNMP. Those are ``TCPAOGood``, ``TCPAOBad``, ``TCPAOKeyNotFound``,
``TCPAORequired`` and ``TCPAODroppedIcmps``.
+For monitoring purposes, there are following TCP-AO trace events:
+``tcp_hash_bad_header``, ``tcp_hash_ao_required``, ``tcp_ao_handshake_failure``,
+``tcp_ao_wrong_maclen``, ``tcp_ao_wrong_maclen``, ``tcp_ao_key_not_found``,
+``tcp_ao_rnext_request``, ``tcp_ao_synack_no_key``, ``tcp_ao_snd_sne_update``,
+``tcp_ao_rcv_sne_update``. It's possible to separately enable any of them and
+one can filter them by net-namespace, 4-tuple, family, L3 index, and TCP header
+flags. If a segment has a TCP-AO header, the filters may also include
+keyid, rnext, and maclen. SNE updates include the rolled-over numbers.
+
RFC 5925 very permissively specifies how TCP port matching can be done for
MKTs::