diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-07-16 19:28:34 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-07-16 19:28:34 -0700 |
commit | 51835949dda3783d4639cfa74ce13a3c9829de00 (patch) | |
tree | 2b593de5eba6ecc73f7c58fc65fdaffae45c7323 /Documentation/networking | |
parent | 0434dbe32053d07d658165be681505120c6b1abc (diff) | |
parent | 77ae5e5b00720372af2860efdc4bc652ac682696 (diff) |
Merge tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextHEADmaster
Pull networking updates from Jakub Kicinski:
"Not much excitement - a handful of large patchsets (devmem among them)
did not make it in time.
Core & protocols:
- Use local_lock in addition to local_bh_disable() to protect per-CPU
resources in networking, a step closer for local_bh_disable() not
to act as a big lock on PREEMPT_RT
- Use flex array for netdevice priv area, ensure its cache alignment
- Add a sysctl knob to allow user to specify a default rto_min at
socket init time. Bit of a big hammer but multiple companies were
independently carrying such patch downstream so clearly it's useful
- Support scheduling transmission of packets based on CLOCK_TAI
- Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned
off using cpusets
- Support multiple L2TPv3 UDP tunnels using the same 5-tuple address
- Allow configuration of multipath hash seed, to both allow
synchronizing hashing of two routers, and preventing partial
accidental sync
- Improve TCP compliance with RFC 9293 for simultaneous connect()
- Support sending NAT keepalives in IPsec ESP in UDP states.
Userspace IKE daemon had to do this before, but the kernel can
better keep track of it
- Support sending supervision HSR frames with MAC addresses stored in
ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled
- Introduce IPPROTO_SMC for selecting SMC when socket is created
- Allow UDP GSO transmit from devices with no checksum offload
- openvswitch: add packet sampling via psample, separating the
sampled traffic from "upcall" packets sent to user space for
forwarding
- nf_tables: shrink memory consumption for transaction objects
Things we sprinkled into general kernel code:
- Power Sequencing subsystem (used by Qualcomm Bluetooth driver for
QCA6390) [ Already merged separately - Linus ]
- Add IRQ information in sysfs for auxiliary bus
- Introduce guard definition for local_lock
- Add aligned flavor of __cacheline_group_{begin, end}() markings for
grouping fields in structures
BPF:
- Notify user space (via epoll) when a struct_ops object is getting
detached/unregistered
- Add new kfuncs for a generic, open-coded bits iterator
- Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
bpf_list_head
- Support resilient split BTF which cuts down on duplication and
makes BTF as compact as possible WRT BTF from modules
- Add support for dumping kfunc prototypes from BTF which enables
both detecting as well as dumping compilable prototypes for kfuncs
- riscv64 BPF JIT improvements in particular to add 12-argument
support for BPF trampolines and to utilize bpf_prog_pack for the
latter
- Add the capability to offload the netfilter flowtable in XDP layer
through kfuncs
Driver API:
- Allow users to configure IRQ tresholds between which automatic IRQ
moderation can choose
- Expand Power Sourcing (PoE) status with power, class and failure
reason. Support setting power limits
- Track additional RSS contexts in the core, make sure configuration
changes don't break them
- Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated
ESP data paths
- Support updating firmware on SFP modules
Tests and tooling:
- mptcp: use net/lib.sh to manage netns
- TCP-AO and TCP-MD5: replace debug prints used by tests with
tracepoints
- openvswitch: make test self-contained (don't depend on OvS CLI
tools)
Drivers:
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- increase the max total outstanding PTP TX packets to 4
- add timestamping statistics support
- implement netdev_queue_mgmt_ops
- support new RSS context API
- Intel (100G, ice, idpf):
- implement FEC statistics and dumping signal quality indicators
- support E825C products (with 56Gbps PHYs)
- nVidia/Mellanox:
- support HW-GRO
- mlx4/mlx5: support per-queue statistics via netlink
- obey the max number of EQs setting in sub-functions
- AMD/Solarflare:
- support new RSS context API
- AMD/Pensando:
- ionic: rework fix for doorbell miss to lower overhead and
skip it on new HW
- Wangxun:
- txgbe: support Flow Director perfect filters
- Ethernet NICs consumer, embedded and virtual:
- Add driver for Tehuti Networks TN40xx chips
- Add driver for Meta's internal NIC chips
- Add driver for Ethernet MAC on Airoha EN7581 SoCs
- Add driver for Renesas Ethernet-TSN devices
- Google cloud vNIC:
- flow steering support
- Microsoft vNIC:
- support page sizes other than 4KB on ARM64
- vmware vNIC:
- support latency measurement (update to version 9)
- VirtIO net:
- support for Byte Queue Limits
- support configuring thresholds for automatic IRQ moderation
- support for AF_XDP Rx zero-copy
- Synopsys (stmmac):
- support for STM32MP13 SoC
- let platforms select the right PCS implementation
- TI:
- icssg-prueth: add multicast filtering support
- icssg-prueth: enable PTP timestamping and PPS
- Renesas:
- ravb: improve Rx performance 30-400% by using page pool,
theaded NAPI and timer-based IRQ coalescing
- ravb: add MII support for R-Car V4M
- Cadence (macb):
- macb: add ARP support to Wake-On-LAN
- Cortina:
- use phylib for RX and TX pause configuration
- Ethernet switches:
- nVidia/Mellanox:
- support configuration of multipath hash seed
- report more accurate max MTU
- use page_pool to improve Rx performance
- MediaTek:
- mt7530: add support for bridge port isolation
- Qualcomm:
- qca8k: add support for bridge port isolation
- Microchip:
- lan9371/2: add 100BaseTX PHY support
- NXP:
- vsc73xx: implement VLAN operations
- Ethernet PHYs:
- aquantia: enable support for aqr115c
- aquantia: add support for PHY LEDs
- realtek: add support for rtl8224 2.5Gbps PHY
- xpcs: add memory-mapped device support
- add BroadR-Reach link mode and support in Broadcom's PHY driver
- CAN:
- add document for ISO 15765-2 protocol support
- mcp251xfd: workaround for erratum DS80000789E, use timestamps to
catch when device returns incorrect FIFO status
- WiFi:
- mac80211/cfg80211:
- parse Transmit Power Envelope (TPE) data in mac80211 instead
of in drivers
- improvements for 6 GHz regulatory flexibility
- multi-link improvements
- support multiple radios per wiphy
- remove DEAUTH_NEED_MGD_TX_PREP flag
- Intel (iwlwifi):
- bump FW API to 91 for BZ/SC devices
- report 64-bit radiotap timestamp
- enable P2P low latency by default
- handle Transmit Power Envelope (TPE) advertised by AP
- remove support for older FW for new devices
- fast resume (keeping the device configured)
- mvm: re-enable Multi-Link Operation (MLO)
- aggregation (A-MSDU) optimizations
- MediaTek (mt76):
- mt7925 Multi-Link Operation (MLO) support
- Qualcomm (ath10k):
- LED support for various chipsets
- Qualcomm (ath12k):
- remove unsupported Tx monitor handling
- support channel 2 in 6 GHz band
- support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
- supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
Advertisements (EMA)
- support dynamic VLAN
- add panic handler for resetting the firmware state
- DebugFS support for datapath statistics
- WCN7850: support for Wake on WLAN
- Microchip (wilc1000):
- read MAC address during probe to make it visible to user space
- suspend/resume improvements
- TI (wl18xx):
- support newer firmware versions
- RealTek (rtw89):
- preparation for RTL8852BE-VT support
- Wake on WLAN support for WiFi 6 chips
- 36-bit PCI DMA support
- RealTek (rtlwifi):
- RTL8192DU support
- Broadcom (brcmfmac):
- Management Frame Protection support (to enable WPA3)
- Bluetooth:
- qualcomm: use the power sequencer for QCA6390
- btusb: mediatek: add ISO data transmission functions
- hci_bcm4377: add BCM4388 support
- btintel: add support for BlazarU core
- btintel: add support for Whale Peak2
- btnxpuart: add support for AW693 A1 chipset
- btnxpuart: add support for IW615 chipset
- btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591"
* tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits)
eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering"
tcp: Replace strncpy() with strscpy()
wifi: ath12k: fix build vs old compiler
tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child().
eth: fbnic: Write the TCAM tables used for RSS control and Rx to host
eth: fbnic: Add L2 address programming
eth: fbnic: Add basic Rx handling
eth: fbnic: Add basic Tx handling
eth: fbnic: Add link detection
eth: fbnic: Add initial messaging to notify FW of our presence
eth: fbnic: Implement Rx queue alloc/start/stop/free
eth: fbnic: Implement Tx queue alloc/start/stop/free
eth: fbnic: Allocate a netdevice and napi vectors with queues
eth: fbnic: Add FW communication mechanism
eth: fbnic: Add message parsing for FW messages
eth: fbnic: Add register init to set PCIe/Ethernet device config
eth: fbnic: Allocate core device specific structures and devlink interface
eth: fbnic: Add scaffolding for Meta's NIC driver
PCI: Add Meta Platforms vendor ID
net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK
...
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst | 24 | ||||
-rw-r--r-- | Documentation/networking/devlink/ice.rst | 25 | ||||
-rw-r--r-- | Documentation/networking/devlink/octeontx2.rst | 16 | ||||
-rw-r--r-- | Documentation/networking/ethtool-netlink.rst | 165 | ||||
-rw-r--r-- | Documentation/networking/index.rst | 3 | ||||
-rw-r--r-- | Documentation/networking/ip-sysctl.rst | 27 | ||||
-rw-r--r-- | Documentation/networking/iso15765-2.rst | 386 | ||||
-rw-r--r-- | Documentation/networking/mptcp-sysctl.rst | 70 | ||||
-rw-r--r-- | Documentation/networking/mptcp.rst | 156 | ||||
-rw-r--r-- | Documentation/networking/net_dim.rst | 42 | ||||
-rw-r--r-- | Documentation/networking/phy.rst | 6 | ||||
-rw-r--r-- | Documentation/networking/sriov.rst | 25 | ||||
-rw-r--r-- | Documentation/networking/tcp_ao.rst | 9 |
13 files changed, 901 insertions, 53 deletions
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst index fed821ef9b09..3bd72577af9a 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -189,22 +189,19 @@ the software port. * - `rx[i]_gro_packets` - Number of received packets processed using hardware-accelerated GRO. The - number of hardware GRO offloaded packets received on ring i. + number of hardware GRO offloaded packets received on ring i. Only true GRO + packets are counted: only packets that are in an SKB with a GRO count > 1. - Acceleration * - `rx[i]_gro_bytes` - Number of received bytes processed using hardware-accelerated GRO. The - number of hardware GRO offloaded bytes received on ring i. + number of hardware GRO offloaded bytes received on ring i. Only true GRO + packets are counted: only packets that are in an SKB with a GRO count > 1. - Acceleration * - `rx[i]_gro_skbs` - - The number of receive SKBs constructed while performing - hardware-accelerated GRO. - - Informative - - * - `rx[i]_gro_match_packets` - - Number of received packets processed using hardware-accelerated GRO that - met the flow table match criteria. + - The number of GRO SKBs constructed from hardware-accelerated GRO. Only SKBs + with a GRO count > 1 are counted. - Informative * - `rx[i]_gro_large_hds` @@ -212,6 +209,15 @@ the software port. headers that require additional memory to be allocated. - Informative + * - `rx[i]_hds_nodata_packets` + - Number of header only packets in header/data split mode [#accel]_. + - Informative + + * - `rx[i]_hds_nodata_bytes` + - Number of bytes for header only packets in header/data split mode + [#accel]_. + - Informative + * - `rx[i]_lro_packets` - The number of LRO packets received on ring i [#accel]_. - Acceleration diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst index 830c04354222..e3972d03cea0 100644 --- a/Documentation/networking/devlink/ice.rst +++ b/Documentation/networking/devlink/ice.rst @@ -11,6 +11,7 @@ Parameters ========== .. list-table:: Generic parameters implemented + :widths: 5 5 90 * - Name - Mode @@ -68,6 +69,30 @@ Parameters To verify that value has been set: $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers +.. list-table:: Driver specific parameters implemented + :widths: 5 5 90 + + * - Name + - Mode + - Description + * - ``local_forwarding`` + - runtime + - Controls loopback behavior by tuning scheduler bandwidth. + It impacts all kinds of functions: physical, virtual and + subfunctions. + Supported values are: + + ``enabled`` - loopback traffic is allowed on port + + ``disabled`` - loopback traffic is not allowed on this port + + ``prioritized`` - loopback traffic is prioritized on this port + + Default value of ``local_forwarding`` parameter is ``enabled``. + ``prioritized`` provides ability to adjust loopback traffic rate to increase + one port capacity at cost of the another. User needs to disable + local forwarding on one of the ports in order have increased capacity + on the ``prioritized`` port. Info versions ============= diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst index 610de99b728a..d33a90dd44bf 100644 --- a/Documentation/networking/devlink/octeontx2.rst +++ b/Documentation/networking/devlink/octeontx2.rst @@ -40,3 +40,19 @@ The ``octeontx2 AF`` driver implements the following driver-specific parameters. - runtime - Use to set the quantum which hardware uses for scheduling among transmit queues. Hardware uses weighted DWRR algorithm to schedule among all transmit queues. + +The ``octeontx2 PF`` driver implements the following driver-specific parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``unicast_filter_count`` + - u8 + - runtime + - Set the maximum number of unicast filters that can be programmed for + the device. This can be used to achieve better device resource + utilization, avoiding over consumption of unused MCAM table entries. diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index 160bfb0ae8ba..3ab423b80e91 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -228,6 +228,7 @@ Userspace to kernel: ``ETHTOOL_MSG_PLCA_GET_STATUS`` get PLCA RS status ``ETHTOOL_MSG_MM_GET`` get MAC merge layer state ``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters + ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT`` flash transceiver module firmware ===================================== ================================= Kernel to userspace: @@ -274,6 +275,7 @@ Kernel to userspace: ``ETHTOOL_MSG_PLCA_GET_STATUS_REPLY`` PLCA RS status ``ETHTOOL_MSG_PLCA_NTF`` PLCA RS parameters ``ETHTOOL_MSG_MM_GET_REPLY`` MAC merge layer status + ``ETHTOOL_MSG_MODULE_FW_FLASH_NTF`` transceiver module flash updates ======================================== ================================= ``GET`` requests are sent by userspace applications to retrieve device @@ -1033,6 +1035,8 @@ Kernel response contents: ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx + ``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx + ``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx =========================================== ====== ======================= Attributes are only included in reply if their value is not zero or the @@ -1062,6 +1066,10 @@ block should be sent. This feature is mainly of interest for specific USB devices which does not cope well with frequent small-sized URBs transmissions. +``ETHTOOL_A_COALESCE_RX_PROFILE`` and ``ETHTOOL_A_COALESCE_TX_PROFILE`` refer +to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM) +<https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_. + COALESCE_SET ============ @@ -1098,6 +1106,8 @@ Request contents: ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx + ``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx + ``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx =========================================== ====== ======================= Request is rejected if it attributes declared as unsupported by driver (i.e. @@ -1720,17 +1730,28 @@ Request contents: Kernel response contents: - ====================================== ====== ============================= - ``ETHTOOL_A_PSE_HEADER`` nested reply header - ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL - PSE functions - ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the - PoDL PSE. - ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` u32 Operational state of the PoE - PSE functions. - ``ETHTOOL_A_C33_PSE_PW_D_STATUS`` u32 power detection status of the - PoE PSE. - ====================================== ====== ============================= + ========================================== ====== ============================= + ``ETHTOOL_A_PSE_HEADER`` nested reply header + ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL + PSE functions + ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the + PoDL PSE. + ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` u32 Operational state of the PoE + PSE functions. + ``ETHTOOL_A_C33_PSE_PW_D_STATUS`` u32 power detection status of the + PoE PSE. + ``ETHTOOL_A_C33_PSE_PW_CLASS`` u32 power class of the PoE PSE. + ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` u32 actual power drawn on the + PoE PSE. + ``ETHTOOL_A_C33_PSE_EXT_STATE`` u32 power extended state of the + PoE PSE. + ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` u32 power extended substatus of + the PoE PSE. + ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` u32 currently configured power + limit of the PoE PSE. + ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested Supported power limit + configuration ranges. + ========================================== ====== ============================= When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies the operational state of the PoDL PSE functions. The operational state of the @@ -1762,6 +1783,46 @@ The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_PW_D_STATUS`` implementing .. kernel-doc:: include/uapi/linux/ethtool.h :identifiers: ethtool_c33_pse_pw_d_status +When set, the optional ``ETHTOOL_A_C33_PSE_PW_CLASS`` attribute identifies +the power class of the C33 PSE. It depends on the class negotiated between +the PSE and the PD. This option is corresponding to ``IEEE 802.3-2022`` +30.9.1.1.8 aPSEPowerClassification. + +When set, the optional ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` attribute identifies +This option is corresponding to ``IEEE 802.3-2022`` 30.9.1.1.23 aPSEActualPower. +Actual power is reported in mW. + +When set, the optional ``ETHTOOL_A_C33_PSE_EXT_STATE`` attribute identifies +the extended error state of the C33 PSE. Possible values are: + +.. kernel-doc:: include/uapi/linux/ethtool.h + :identifiers: ethtool_c33_pse_ext_state + +When set, the optional ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` attribute identifies +the extended error state of the C33 PSE. Possible values are: +Possible values are: + +.. kernel-doc:: include/uapi/linux/ethtool.h + :identifiers: ethtool_c33_pse_ext_substate_class_num_events + ethtool_c33_pse_ext_substate_error_condition + ethtool_c33_pse_ext_substate_mr_pse_enable + ethtool_c33_pse_ext_substate_option_detect_ted + ethtool_c33_pse_ext_substate_option_vport_lim + ethtool_c33_pse_ext_substate_ovld_detected + ethtool_c33_pse_ext_substate_pd_dll_power_type + ethtool_c33_pse_ext_substate_power_not_available + ethtool_c33_pse_ext_substate_short_detected + +When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` attribute +identifies the C33 PSE power limit in mW. + +When set the optional ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested attribute +identifies the C33 PSE power limit ranges through +``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MIN`` and +``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MAX``. +If the controller works with fixed classes, the min and max values will be +equal. + PSE_SET ======= @@ -1773,6 +1834,8 @@ Request contents: ``ETHTOOL_A_PSE_HEADER`` nested request header ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` u32 Control PoDL PSE Admin state ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` u32 Control PSE Admin state + ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` u32 Control PoE PSE available + power limit ====================================== ====== ============================= When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used @@ -1783,6 +1846,18 @@ to control PoDL PSE Admin functions. This option is implementing The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` implementing ``IEEE 802.3-2022`` 30.9.1.2.1 acPSEAdminControl. +When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` attribute is +used to control the available power value limit for C33 PSE in milliwatts. +This attribute corresponds to the `pse_available_power` variable described in +``IEEE 802.3-2022`` 33.2.4.4 Variables and `pse_avail_pwr` in 145.2.5.4 +Variables, which are described in power classes. + +It was decided to use milliwatts for this interface to unify it with other +power monitoring interfaces, which also use milliwatts, and to align with +various existing products that document power consumption in watts rather than +classes. If power limit configuration based on classes is needed, the +conversion can be done in user space, for example by ethtool. + RSS_GET ======= @@ -2033,6 +2108,73 @@ The attributes are propagated to the driver through the following structure: .. kernel-doc:: include/linux/ethtool.h :identifiers: ethtool_mm_cfg +MODULE_FW_FLASH_ACT +=================== + +Flashes transceiver module firmware. + +Request contents: + + ======================================= ====== =========================== + ``ETHTOOL_A_MODULE_FW_FLASH_HEADER`` nested request header + ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` string firmware image file name + ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` u32 transceiver module password + ======================================= ====== =========================== + +The firmware update process consists of three logical steps: + +1. Downloading a firmware image to the transceiver module and validating it. +2. Running the firmware image. +3. Committing the firmware image so that it is run upon reset. + +When flash command is given, those three steps are taken in that order. + +This message merely schedules the update process and returns immediately +without blocking. The process then runs asynchronously. +Since it can take several minutes to complete, during the update process +notifications are emitted from the kernel to user space updating it about +the status and progress. + +The ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` attribute encodes the firmware +image file name. The firmware image is downloaded to the transceiver module, +validated, run and committed. + +The optional ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` attribute encodes a password +that might be required as part of the transceiver module firmware update +process. + +The firmware update process can take several minutes to complete. Therefore, +during the update process notifications are emitted from the kernel to user +space updating it about the status and progress. + + + +Notification contents: + + +---------------------------------------------------+--------+----------------+ + | ``ETHTOOL_A_MODULE_FW_FLASH_HEADER`` | nested | reply header | + +---------------------------------------------------+--------+----------------+ + | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` | u32 | status | + +---------------------------------------------------+--------+----------------+ + | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` | string | status message | + +---------------------------------------------------+--------+----------------+ + | ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` | uint | progress | + +---------------------------------------------------+--------+----------------+ + | ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL`` | uint | total | + +---------------------------------------------------+--------+----------------+ + +The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` attribute encodes the current status +of the firmware update process. Possible values are: + +.. kernel-doc:: include/uapi/linux/ethtool.h + :identifiers: ethtool_module_fw_flash_status + +The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` attribute encodes a status message +string. + +The ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` and ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL`` +attributes encode the completed and total amount of work, respectively. + Request translation =================== @@ -2139,4 +2281,5 @@ are netlink only. n/a ``ETHTOOL_MSG_PLCA_GET_STATUS`` n/a ``ETHTOOL_MSG_MM_GET`` n/a ``ETHTOOL_MSG_MM_SET`` + n/a ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT`` =================================== ===================================== diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 7664c0bfe461..d1af04b952f8 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -19,6 +19,7 @@ Contents: caif/index ethtool-netlink ieee802154 + iso15765-2 j1939 kapi msg_zerocopy @@ -72,6 +73,7 @@ Contents: mac80211-injection mctp mpls-sysctl + mptcp mptcp-sysctl multiqueue multi-pf-netdev @@ -104,6 +106,7 @@ Contents: seg6-sysctl skbuff smc-sysctl + sriov statistics strparser switchdev diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index bd50df6a5a42..3616389c8c2d 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -131,6 +131,20 @@ fib_multipath_hash_fields - UNSIGNED INTEGER Default: 0x0007 (source IP, destination IP and IP protocol) +fib_multipath_hash_seed - UNSIGNED INTEGER + The seed value used when calculating hash for multipath routes. Applies + to both IPv4 and IPv6 datapath. Only present for kernels built with + CONFIG_IP_ROUTE_MULTIPATH enabled. + + When set to 0, the seed value used for multipath routing defaults to an + internal random-generated one. + + The actual hashing algorithm is not specified -- there is no guarantee + that a next hop distribution effected by a given seed will keep stable + across kernel versions. + + Default: 0 (random) + fib_sync_mem - UNSIGNED INTEGER Amount of dirty memory from fib entries that can be backlogged before synchronize_rcu is forced. @@ -1196,6 +1210,19 @@ tcp_pingpong_thresh - INTEGER Default: 1 +tcp_rto_min_us - INTEGER + Minimal TCP retransmission timeout (in microseconds). Note that the + rto_min route option has the highest precedence for configuring this + setting, followed by the TCP_BPF_RTO_MIN socket option, followed by + this tcp_rto_min_us sysctl. + + The recommended practice is to use a value less or equal to 200000 + microseconds. + + Possible Values: 1 - INT_MAX + + Default: 200000 + UDP variables ============= diff --git a/Documentation/networking/iso15765-2.rst b/Documentation/networking/iso15765-2.rst new file mode 100644 index 000000000000..0e9d96074178 --- /dev/null +++ b/Documentation/networking/iso15765-2.rst @@ -0,0 +1,386 @@ +.. SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) + +==================== +ISO 15765-2 (ISO-TP) +==================== + +Overview +======== + +ISO 15765-2, also known as ISO-TP, is a transport protocol specifically defined +for diagnostic communication on CAN. It is widely used in the automotive +industry, for example as the transport protocol for UDSonCAN (ISO 14229-3) or +emission-related diagnostic services (ISO 15031-5). + +ISO-TP can be used both on CAN CC (aka Classical CAN) and CAN FD (CAN with +Flexible Datarate) based networks. It is also designed to be compatible with a +CAN network using SAE J1939 as data link layer (however, this is not a +requirement). + +Specifications used +------------------- + +* ISO 15765-2:2024 : Road vehicles - Diagnostic communication over Controller + Area Network (DoCAN). Part 2: Transport protocol and network layer services. + +Addressing +---------- + +In its simplest form, ISO-TP is based on two kinds of addressing modes for the +nodes connected to the same network: + +* physical addressing is implemented by two node-specific addresses and is used + in 1-to-1 communication. + +* functional addressing is implemented by one node-specific address and is used + in 1-to-N communication. + +Three different addressing formats can be employed: + +* "normal" : each address is represented simply by a CAN ID. + +* "extended": each address is represented by a CAN ID plus the first byte of + the CAN payload; both the CAN ID and the byte inside the payload shall be + different between two addresses. + +* "mixed": each address is represented by a CAN ID plus the first byte of + the CAN payload; the CAN ID is different between two addresses, but the + additional byte is the same. + +Transport protocol and associated frame types +--------------------------------------------- + +When transmitting data using the ISO-TP protocol, the payload can either fit +inside one single CAN message or not, also considering the overhead the protocol +is generating and the optional extended addressing. In the first case, the data +is transmitted at once using a so-called Single Frame (SF). In the second case, +ISO-TP defines a multi-frame protocol, in which the sender provides (through a +First Frame - FF) the PDU length which is to be transmitted and also asks for a +Flow Control (FC) frame, which provides the maximum supported size of a macro +data block (``blocksize``) and the minimum time between the single CAN messages +composing such block (``stmin``). Once this information has been received, the +sender starts to send frames containing fragments of the data payload (called +Consecutive Frames - CF), stopping after every ``blocksize``-sized block to wait +confirmation from the receiver which should then send another Flow Control +frame to inform the sender about its availability to receive more data. + +How to Use ISO-TP +================= + +As with others CAN protocols, the ISO-TP stack support is built into the +Linux network subsystem for the CAN bus, aka. Linux-CAN or SocketCAN, and +thus follows the same socket API. + +Creation and basic usage of an ISO-TP socket +-------------------------------------------- + +To use the ISO-TP stack, ``#include <linux/can/isotp.h>`` shall be used. A +socket can then be created using the ``PF_CAN`` protocol family, the +``SOCK_DGRAM`` type (as the underlying protocol is datagram-based by design) +and the ``CAN_ISOTP`` protocol: + +.. code-block:: C + + s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP); + +After the socket has been successfully created, ``bind(2)`` shall be called to +bind the socket to the desired CAN interface; to do so: + +* a TX CAN ID shall be specified as part of the sockaddr supplied to the call + itself. + +* a RX CAN ID shall also be specified, unless broadcast flags have been set + through socket option (explained below). + +Once bound to an interface, the socket can be read from and written to using +the usual ``read(2)`` and ``write(2)`` system calls, as well as ``send(2)``, +``sendmsg(2)``, ``recv(2)`` and ``recvmsg(2)``. +Unlike the CAN_RAW socket API, only the ISO-TP data field (the actual payload) +is sent and received by the userspace application using these calls. The address +information and the protocol information are automatically filled by the ISO-TP +stack using the configuration supplied during socket creation. In the same way, +the stack will use the transport mechanism when required (i.e., when the size +of the data payload exceeds the MTU of the underlying CAN bus). + +The sockaddr structure used for SocketCAN has extensions for use with ISO-TP, +as specified below: + +.. code-block:: C + + struct sockaddr_can { + sa_family_t can_family; + int can_ifindex; + union { + struct { canid_t rx_id, tx_id; } tp; + ... + } can_addr; + } + +* ``can_family`` and ``can_ifindex`` serve the same purpose as for other + SocketCAN sockets. + +* ``can_addr.tp.rx_id`` specifies the receive (RX) CAN ID and will be used as + a RX filter. + +* ``can_addr.tp.tx_id`` specifies the transmit (TX) CAN ID + +ISO-TP socket options +--------------------- + +When creating an ISO-TP socket, reasonable defaults are set. Some options can +be modified with ``setsockopt(2)`` and/or read back with ``getsockopt(2)``. + +General options +~~~~~~~~~~~~~~~ + +General socket options can be passed using the ``CAN_ISOTP_OPTS`` optname: + +.. code-block:: C + + struct can_isotp_options opts; + ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_OPTS, &opts, sizeof(opts)) + +where the ``can_isotp_options`` structure has the following contents: + +.. code-block:: C + + struct can_isotp_options { + u32 flags; + u32 frame_txtime; + u8 ext_address; + u8 txpad_content; + u8 rxpad_content; + u8 rx_ext_address; + }; + +* ``flags``: modifiers to be applied to the default behaviour of the ISO-TP + stack. Following flags are available: + + * ``CAN_ISOTP_LISTEN_MODE``: listen only (do not send FC frames); normally + used as a testing feature. + + * ``CAN_ISOTP_EXTEND_ADDR``: use the byte specified in ``ext_address`` as an + additional address component. This enables the "mixed" addressing format if + used alone, or the "extended" addressing format if used in conjunction with + ``CAN_ISOTP_RX_EXT_ADDR``. + + * ``CAN_ISOTP_TX_PADDING``: enable padding for transmitted frames, using + ``txpad_content`` as value for the padding bytes. + + * ``CAN_ISOTP_RX_PADDING``: enable padding for the received frames, using + ``rxpad_content`` as value for the padding bytes. + + * ``CAN_ISOTP_CHK_PAD_LEN``: check for correct padding length on the received + frames. + + * ``CAN_ISOTP_CHK_PAD_DATA``: check padding bytes on the received frames + against ``rxpad_content``; if ``CAN_ISOTP_RX_PADDING`` is not specified, + this flag is ignored. + + * ``CAN_ISOTP_HALF_DUPLEX``: force ISO-TP socket in half duplex mode + (that is, transport mechanism can only be incoming or outgoing at the same + time, not both). + + * ``CAN_ISOTP_FORCE_TXSTMIN``: ignore stmin from received FC; normally + used as a testing feature. + + * ``CAN_ISOTP_FORCE_RXSTMIN``: ignore CFs depending on rx stmin; normally + used as a testing feature. + + * ``CAN_ISOTP_RX_EXT_ADDR``: use ``rx_ext_address`` instead of ``ext_address`` + as extended addressing byte on the reception path. If used in conjunction + with ``CAN_ISOTP_EXTEND_ADDR``, this flag effectively enables the "extended" + addressing format. + + * ``CAN_ISOTP_WAIT_TX_DONE``: wait until the frame is sent before returning + from ``write(2)`` and ``send(2)`` calls (i.e., blocking write operations). + + * ``CAN_ISOTP_SF_BROADCAST``: use 1-to-N functional addressing (cannot be + specified alongside ``CAN_ISOTP_CF_BROADCAST``). + + * ``CAN_ISOTP_CF_BROADCAST``: use 1-to-N transmission without flow control + (cannot be specified alongside ``CAN_ISOTP_SF_BROADCAST``). + NOTE: this is not covered by the ISO 15765-2 standard. + + * ``CAN_ISOTP_DYN_FC_PARMS``: enable dynamic update of flow control + parameters. + +* ``frame_txtime``: frame transmission time (defined as N_As/N_Ar inside the + ISO standard); if ``0``, the default (or the last set value) is used. + To set the transmission time to ``0``, the ``CAN_ISOTP_FRAME_TXTIME_ZERO`` + macro (equal to 0xFFFFFFFF) shall be used. + +* ``ext_address``: extended addressing byte, used if the + ``CAN_ISOTP_EXTEND_ADDR`` flag is specified. + +* ``txpad_content``: byte used as padding value for transmitted frames. + +* ``rxpad_content``: byte used as padding value for received frames. + +* ``rx_ext_address``: extended addressing byte for the reception path, used if + the ``CAN_ISOTP_RX_EXT_ADDR`` flag is specified. + +Flow Control options +~~~~~~~~~~~~~~~~~~~~ + +Flow Control (FC) options can be passed using the ``CAN_ISOTP_RECV_FC`` optname +to provide the communication parameters for receiving ISO-TP PDUs. + +.. code-block:: C + + struct can_isotp_fc_options fc_opts; + ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RECV_FC, &fc_opts, sizeof(fc_opts)); + +where the ``can_isotp_fc_options`` structure has the following contents: + +.. code-block:: C + + struct can_isotp_options { + u8 bs; + u8 stmin; + u8 wftmax; + }; + +* ``bs``: blocksize provided in flow control frames. + +* ``stmin``: minimum separation time provided in flow control frames; can + have the following values (others are reserved): + + * 0x00 - 0x7F : 0 - 127 ms + + * 0xF1 - 0xF9 : 100 us - 900 us + +* ``wftmax``: maximum number of wait frames provided in flow control frames. + +Link Layer options +~~~~~~~~~~~~~~~~~~ + +Link Layer (LL) options can be passed using the ``CAN_ISOTP_LL_OPTS`` optname: + +.. code-block:: C + + struct can_isotp_ll_options ll_opts; + ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_LL_OPTS, &ll_opts, sizeof(ll_opts)); + +where the ``can_isotp_ll_options`` structure has the following contents: + +.. code-block:: C + + struct can_isotp_ll_options { + u8 mtu; + u8 tx_dl; + u8 tx_flags; + }; + +* ``mtu``: generated and accepted CAN frame type, can be equal to ``CAN_MTU`` + for classical CAN frames or ``CANFD_MTU`` for CAN FD frames. + +* ``tx_dl``: maximum payload length for transmitted frames, can have one value + among: 8, 12, 16, 20, 24, 32, 48, 64. Values above 8 only apply to CAN FD + traffic (i.e.: ``mtu = CANFD_MTU``). + +* ``tx_flags``: flags set into ``struct canfd_frame.flags`` at frame creation. + Only applies to CAN FD traffic (i.e.: ``mtu = CANFD_MTU``). + +Transmission stmin +~~~~~~~~~~~~~~~~~~ + +The transmission minimum separation time (stmin) can be forced using the +``CAN_ISOTP_TX_STMIN`` optname and providing an stmin value in microseconds as +a 32bit unsigned integer; this will overwrite the value sent by the receiver in +flow control frames: + +.. code-block:: C + + uint32_t stmin; + ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_TX_STMIN, &stmin, sizeof(stmin)); + +Reception stmin +~~~~~~~~~~~~~~~ + +The reception minimum separation time (stmin) can be forced using the +``CAN_ISOTP_RX_STMIN`` optname and providing an stmin value in microseconds as +a 32bit unsigned integer; received Consecutive Frames (CF) which timestamps +differ less than this value will be ignored: + +.. code-block:: C + + uint32_t stmin; + ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RX_STMIN, &stmin, sizeof(stmin)); + +Multi-frame transport support +----------------------------- + +The ISO-TP stack contained inside the Linux kernel supports the multi-frame +transport mechanism defined by the standard, with the following constraints: + +* the maximum size of a PDU is defined by a module parameter, with an hard + limit imposed at build time. + +* when a transmission is in progress, subsequent calls to ``write(2)`` will + block, while calls to ``send(2)`` will either block or fail depending on the + presence of the ``MSG_DONTWAIT`` flag. + +* no support is present for sending "wait frames": whether a PDU can be fully + received or not is decided when the First Frame is received. + +Errors +------ + +Following errors are reported to userspace: + +RX path errors +~~~~~~~~~~~~~~ + +============ =============================================================== +-ETIMEDOUT timeout of data reception +-EILSEQ sequence number mismatch during a multi-frame reception +-EBADMSG data reception with wrong padding +============ =============================================================== + +TX path errors +~~~~~~~~~~~~~~ + +========== ================================================================= +-ECOMM flow control reception timeout +-EMSGSIZE flow control reception overflow +-EBADMSG flow control reception with wrong layout/padding +========== ================================================================= + +Examples +======== + +Basic node example +------------------ + +Following example implements a node using "normal" physical addressing, with +RX ID equal to 0x18DAF142 and a TX ID equal to 0x18DA42F1. All options are left +to their default. + +.. code-block:: C + + int s; + struct sockaddr_can addr; + int ret; + + s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP); + if (s < 0) + exit(1); + + addr.can_family = AF_CAN; + addr.can_ifindex = if_nametoindex("can0"); + addr.tp.tx_id = 0x18DA42F1 | CAN_EFF_FLAG; + addr.tp.rx_id = 0x18DAF142 | CAN_EFF_FLAG; + + ret = bind(s, (struct sockaddr *)&addr, sizeof(addr)); + if (ret < 0) + exit(1); + + /* Data can now be received using read(s, ...) and sent using write(s, ...) */ + +Additional examples +------------------- + +More complete (and complex) examples can be found inside the ``isotp*`` userland +tools, distributed as part of the ``can-utils`` utilities at: +https://github.com/linux-can/can-utils diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst index 69975ce25a02..fd514bba8c43 100644 --- a/Documentation/networking/mptcp-sysctl.rst +++ b/Documentation/networking/mptcp-sysctl.rst @@ -7,14 +7,6 @@ MPTCP Sysfs variables /proc/sys/net/mptcp/* Variables =============================== -enabled - BOOLEAN - Control whether MPTCP sockets can be created. - - MPTCP sockets can be created if the value is 1. This is a - per-namespace sysctl. - - Default: 1 (enabled) - add_addr_timeout - INTEGER (seconds) Set the timeout after which an ADD_ADDR control message will be resent to an MPTCP peer that has not acknowledged a previous @@ -25,16 +17,22 @@ add_addr_timeout - INTEGER (seconds) Default: 120 -close_timeout - INTEGER (seconds) - Set the make-after-break timeout: in absence of any close or - shutdown syscall, MPTCP sockets will maintain the status - unchanged for such time, after the last subflow removal, before - moving to TCP_CLOSE. +allow_join_initial_addr_port - BOOLEAN + Allow peers to send join requests to the IP address and port number used + by the initial subflow if the value is 1. This controls a flag that is + sent to the peer at connection time, and whether such join requests are + accepted or denied. - The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace - sysctl. + Joins to addresses advertised with ADD_ADDR are not affected by this + value. - Default: 60 + This is a per-namespace sysctl. + + Default: 1 + +available_schedulers - STRING + Shows the available schedulers choices that are registered. More packet + schedulers may be available, but not loaded. checksum_enabled - BOOLEAN Control whether DSS checksum can be enabled. @@ -44,18 +42,24 @@ checksum_enabled - BOOLEAN Default: 0 -allow_join_initial_addr_port - BOOLEAN - Allow peers to send join requests to the IP address and port number used - by the initial subflow if the value is 1. This controls a flag that is - sent to the peer at connection time, and whether such join requests are - accepted or denied. +close_timeout - INTEGER (seconds) + Set the make-after-break timeout: in absence of any close or + shutdown syscall, MPTCP sockets will maintain the status + unchanged for such time, after the last subflow removal, before + moving to TCP_CLOSE. - Joins to addresses advertised with ADD_ADDR are not affected by this - value. + The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace + sysctl. - This is a per-namespace sysctl. + Default: 60 - Default: 1 +enabled - BOOLEAN + Control whether MPTCP sockets can be created. + + MPTCP sockets can be created if the value is 1. This is a + per-namespace sysctl. + + Default: 1 (enabled) pm_type - INTEGER Set the default path manager type to use for each new MPTCP @@ -74,6 +78,14 @@ pm_type - INTEGER Default: 0 +scheduler - STRING + Select the scheduler of your choice. + + Support for selection of different schedulers. This is a per-namespace + sysctl. + + Default: "default" + stale_loss_cnt - INTEGER The number of MPTCP-level retransmission intervals with no traffic and pending outstanding data on a given subflow required to declare it stale. @@ -85,11 +97,3 @@ stale_loss_cnt - INTEGER This is a per-namespace sysctl. Default: 4 - -scheduler - STRING - Select the scheduler of your choice. - - Support for selection of different schedulers. This is a per-namespace - sysctl. - - Default: "default" diff --git a/Documentation/networking/mptcp.rst b/Documentation/networking/mptcp.rst new file mode 100644 index 000000000000..17f2bab61164 --- /dev/null +++ b/Documentation/networking/mptcp.rst @@ -0,0 +1,156 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Multipath TCP (MPTCP) +===================== + +Introduction +============ + +Multipath TCP or MPTCP is an extension to the standard TCP and is described in +`RFC 8684 (MPTCPv1) <https://www.rfc-editor.org/rfc/rfc8684.html>`_. It allows a +device to make use of multiple interfaces at once to send and receive TCP +packets over a single MPTCP connection. MPTCP can aggregate the bandwidth of +multiple interfaces or prefer the one with the lowest latency. It also allows a +fail-over if one path is down, and the traffic is seamlessly reinjected on other +paths. + +For more details about Multipath TCP in the Linux kernel, please see the +official website: `mptcp.dev <https://www.mptcp.dev>`_. + + +Use cases +========= + +Thanks to MPTCP, being able to use multiple paths in parallel or simultaneously +brings new use-cases, compared to TCP: + +- Seamless handovers: switching from one path to another while preserving + established connections, e.g. to be used in mobility use-cases, like on + smartphones. +- Best network selection: using the "best" available path depending on some + conditions, e.g. latency, losses, cost, bandwidth, etc. +- Network aggregation: using multiple paths at the same time to have a higher + throughput, e.g. to combine fixed and mobile networks to send files faster. + + +Concepts +======== + +Technically, when a new socket is created with the ``IPPROTO_MPTCP`` protocol +(Linux-specific), a *subflow* (or *path*) is created. This *subflow* consists of +a regular TCP connection that is used to transmit data through one interface. +Additional *subflows* can be negotiated later between the hosts. For the remote +host to be able to detect the use of MPTCP, a new field is added to the TCP +*option* field of the underlying TCP *subflow*. This field contains, amongst +other things, a ``MP_CAPABLE`` option that tells the other host to use MPTCP if +it is supported. If the remote host or any middlebox in between does not support +it, the returned ``SYN+ACK`` packet will not contain MPTCP options in the TCP +*option* field. In that case, the connection will be "downgraded" to plain TCP, +and it will continue with a single path. + +This behavior is made possible by two internal components: the path manager, and +the packet scheduler. + +Path Manager +------------ + +The Path Manager is in charge of *subflows*, from creation to deletion, and also +address announcements. Typically, it is the client side that initiates subflows, +and the server side that announces additional addresses via the ``ADD_ADDR`` and +``REMOVE_ADDR`` options. + +Path managers are controlled by the ``net.mptcp.pm_type`` sysctl knob -- see +mptcp-sysctl.rst. There are two types: the in-kernel one (type ``0``) where the +same rules are applied for all the connections (see: ``ip mptcp``) ; and the +userspace one (type ``1``), controlled by a userspace daemon (i.e. `mptcpd +<https://mptcpd.mptcp.dev/>`_) where different rules can be applied for each +connection. The path managers can be controlled via a Netlink API; see +netlink_spec/mptcp_pm.rst. + +To be able to use multiple IP addresses on a host to create multiple *subflows* +(paths), the default in-kernel MPTCP path-manager needs to know which IP +addresses can be used. This can be configured with ``ip mptcp endpoint`` for +example. + +Packet Scheduler +---------------- + +The Packet Scheduler is in charge of selecting which available *subflow(s)* to +use to send the next data packet. It can decide to maximize the use of the +available bandwidth, only to pick the path with the lower latency, or any other +policy depending on the configuration. + +Packet schedulers are controlled by the ``net.mptcp.scheduler`` sysctl knob -- +see mptcp-sysctl.rst. + + +Sockets API +=========== + +Creating MPTCP sockets +---------------------- + +On Linux, MPTCP can be used by selecting MPTCP instead of TCP when creating the +``socket``: + +.. code-block:: C + + int sd = socket(AF_INET(6), SOCK_STREAM, IPPROTO_MPTCP); + +Note that ``IPPROTO_MPTCP`` is defined as ``262``. + +If MPTCP is not supported, ``errno`` will be set to: + +- ``EINVAL``: (*Invalid argument*): MPTCP is not available, on kernels < 5.6. +- ``EPROTONOSUPPORT`` (*Protocol not supported*): MPTCP has not been compiled, + on kernels >= v5.6. +- ``ENOPROTOOPT`` (*Protocol not available*): MPTCP has been disabled using + ``net.mptcp.enabled`` sysctl knob; see mptcp-sysctl.rst. + +MPTCP is then opt-in: applications need to explicitly request it. Note that +applications can be forced to use MPTCP with different techniques, e.g. +``LD_PRELOAD`` (see ``mptcpize``), eBPF (see ``mptcpify``), SystemTAP, +``GODEBUG`` (``GODEBUG=multipathtcp=1``), etc. + +Switching to ``IPPROTO_MPTCP`` instead of ``IPPROTO_TCP`` should be as +transparent as possible for the userspace applications. + +Socket options +-------------- + +MPTCP supports most socket options handled by TCP. It is possible some less +common options are not supported, but contributions are welcome. + +Generally, the same value is propagated to all subflows, including the ones +created after the calls to ``setsockopt()``. eBPF can be used to set different +values per subflow. + +There are some MPTCP specific socket options at the ``SOL_MPTCP`` (284) level to +retrieve info. They fill the ``optval`` buffer of the ``getsockopt()`` system +call: + +- ``MPTCP_INFO``: Uses ``struct mptcp_info``. +- ``MPTCP_TCPINFO``: Uses ``struct mptcp_subflow_data``, followed by an array of + ``struct tcp_info``. +- ``MPTCP_SUBFLOW_ADDRS``: Uses ``struct mptcp_subflow_data``, followed by an + array of ``mptcp_subflow_addrs``. +- ``MPTCP_FULL_INFO``: Uses ``struct mptcp_full_info``, with one pointer to an + array of ``struct mptcp_subflow_info`` (including the + ``struct mptcp_subflow_addrs``), and one pointer to an array of + ``struct tcp_info``, followed by the content of ``struct mptcp_info``. + +Note that at the TCP level, ``TCP_IS_MPTCP`` socket option can be used to know +if MPTCP is currently being used: the value will be set to 1 if it is. + + +Design choices +============== + +A new socket type has been added for MPTCP for the userspace-facing socket. The +kernel is in charge of creating subflow sockets: they are TCP sockets where the +behavior is modified using TCP-ULP. + +MPTCP listen sockets will create "plain" *accepted* TCP sockets if the +connection request from the client didn't ask for MPTCP, making the performance +impact minimal when MPTCP is enabled by default. diff --git a/Documentation/networking/net_dim.rst b/Documentation/networking/net_dim.rst index 3bed9fd95336..8908fd7b0a8d 100644 --- a/Documentation/networking/net_dim.rst +++ b/Documentation/networking/net_dim.rst @@ -169,6 +169,48 @@ usage is not complete but it should make the outline of the usage clear. ... } + +Tuning DIM +========== + +Net DIM serves a range of network devices and delivers excellent acceleration +benefits. Yet, it has been observed that some preset configurations of DIM may +not align seamlessly with the varying specifications of network devices, and +this discrepancy has been identified as a factor to the suboptimal performance +outcomes of DIM-enabled network devices, related to a mismatch in profiles. + +To address this issue, Net DIM introduces a per-device control to modify and +access a device's ``rx-profile`` and ``tx-profile`` parameters: +Assume that the target network device is named ethx, and ethx only declares +support for RX profile setting and supports modification of ``usec`` field +and ``pkts`` field (See the data structure: +:c:type:`struct dim_cq_moder <dim_cq_moder>`). + +You can use ethtool to modify the current RX DIM profile where all +values are 64:: + + $ ethtool -C ethx rx-profile 1,1,n_2,2,n_3,n,n_n,4,n_n,n,n + +``n`` means do not modify this field, and ``_`` separates structure +elements of the profile array. + +Querying the current profiles using:: + + $ ethtool -c ethx + ... + rx-profile: + {.usec = 1, .pkts = 1, .comps = n/a,}, + {.usec = 2, .pkts = 2, .comps = n/a,}, + {.usec = 3, .pkts = 64, .comps = n/a,}, + {.usec = 64, .pkts = 4, .comps = n/a,}, + {.usec = 64, .pkts = 64, .comps = n/a,} + tx-profile: n/a + +If the network device does not support specific fields of DIM profiles, +the corresponding ``n/a`` will display. If the ``n/a`` field is being +modified, error messages will be reported. + + Dynamic Interrupt Moderation (DIM) library API ============================================== diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index 1283240d7620..f64641417c54 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -327,6 +327,12 @@ Some of the interface modes are described below: This is the Penta SGMII mode, it is similar to QSGMII but it combines 5 SGMII lines into a single link compared to 4 on QSGMII. +``PHY_INTERFACE_MODE_10G_QXGMII`` + Represents the 10G-QXGMII PHY-MAC interface as defined by the Cisco USXGMII + Multiport Copper Interface document. It supports 4 ports over a 10.3125 GHz + SerDes lane, each port having speeds of 2.5G / 1G / 100M / 10M achieved + through symbol replication. The PCS expects the standard USXGMII code word. + Pause frames / flow control =========================== diff --git a/Documentation/networking/sriov.rst b/Documentation/networking/sriov.rst new file mode 100644 index 000000000000..5deb4ff3154f --- /dev/null +++ b/Documentation/networking/sriov.rst @@ -0,0 +1,25 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +NIC SR-IOV APIs +=============== + +Modern NICs are strongly encouraged to focus on implementing the ``switchdev`` +model (see :ref:`switchdev`) to configure forwarding and security of SR-IOV +functionality. + +Legacy API +========== + +The old SR-IOV API is implemented in ``rtnetlink`` Netlink family as part of +the ``RTM_GETLINK`` and ``RTM_SETLINK`` commands. On the driver side +it consists of a number of ``ndo_set_vf_*`` and ``ndo_get_vf_*`` callbacks. + +Since the legacy APIs do not integrate well with the rest of the stack +the API is considered frozen; no new functionality or extensions +will be accepted. New drivers should not implement the uncommon callbacks; +namely the following callbacks are off limits: + + - ``ndo_get_vf_port`` + - ``ndo_set_vf_port`` + - ``ndo_set_vf_rss_query_en`` diff --git a/Documentation/networking/tcp_ao.rst b/Documentation/networking/tcp_ao.rst index 8a58321acce7..e96e62d1dab3 100644 --- a/Documentation/networking/tcp_ao.rst +++ b/Documentation/networking/tcp_ao.rst @@ -337,6 +337,15 @@ TCP-AO per-socket counters are also duplicated with per-netns counters, exposed with SNMP. Those are ``TCPAOGood``, ``TCPAOBad``, ``TCPAOKeyNotFound``, ``TCPAORequired`` and ``TCPAODroppedIcmps``. +For monitoring purposes, there are following TCP-AO trace events: +``tcp_hash_bad_header``, ``tcp_hash_ao_required``, ``tcp_ao_handshake_failure``, +``tcp_ao_wrong_maclen``, ``tcp_ao_wrong_maclen``, ``tcp_ao_key_not_found``, +``tcp_ao_rnext_request``, ``tcp_ao_synack_no_key``, ``tcp_ao_snd_sne_update``, +``tcp_ao_rcv_sne_update``. It's possible to separately enable any of them and +one can filter them by net-namespace, 4-tuple, family, L3 index, and TCP header +flags. If a segment has a TCP-AO header, the filters may also include +keyid, rnext, and maclen. SNE updates include the rolled-over numbers. + RFC 5925 very permissively specifies how TCP port matching can be done for MKTs:: |