Diffstat (limited to 'Documentation')
126 files changed, 7804 insertions, 1148 deletions
diff --git a/Documentation/ABI/testing/sysfs-class-net-peak_usb b/Documentation/ABI/testing/sysfs-class-net-peak_usb new file mode 100644 index 000000000000..9e3d0bf4d4b2 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-net-peak_usb @@ -0,0 +1,19 @@ + +What: /sys/class/net/<iface>/peak_usb/can_channel_id +Date: November 2022 +KernelVersion: 6.2 +Contact: Stephane Grosjean <s.grosjean@peak-system.com> +Description: + PEAK PCAN-USB devices support user-configurable CAN channel + identifiers. Contrary to a USB serial number, these identifiers + are writable and can be set per CAN interface. This means that + if a USB device exports multiple CAN interfaces, each of them + can be assigned a unique channel ID. + This attribute provides read-only access to the currently + configured value of the channel identifier. Depending on the + device type, the identifier has a length of 8 or 32 bit. The + value read from this attribute is always an 8 digit 32 bit + hexadecimal value in big endian format. If the device only + supports an 8 bit identifier, the upper 24 bit of the value are + set to zero. + diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 993f3b6c10ff..5e4ee29cf393 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -557,6 +557,7 @@ Format: <string> nosocket -- Disable socket memory accounting. nokmem -- Disable kernel memory accounting. + nobpf -- Disable BPF memory accounting. checkreqprot= [SELINUX] Set initial checkreqprot flag value. Format: { "0" | "1" } diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 6394f5dc2303..466c560b0c30 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -215,6 +215,12 @@ rmem_max The maximum receive socket buffer size in bytes. 
+rps_default_mask +---------------- + +The default RPS CPU mask used on newly created network devices. An empty +mask means RPS disabled by default. + tstamp_allow_data ----------------- Allow processes to receive tx timestamps looped together with the original diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst index cec2371173d7..bfff0e7e37c2 100644 --- a/Documentation/bpf/bpf_design_QA.rst +++ b/Documentation/bpf/bpf_design_QA.rst @@ -208,6 +208,10 @@ data structures and compile with kernel internal headers. Both of these kernel internals are subject to change and can break with newer kernels such that the program needs to be adapted accordingly. +New BPF functionality is generally added through the use of kfuncs instead of +new helpers. Kfuncs are not considered part of the stable API, and have their own +lifecycle expectations as described in :ref:`BPF_kfunc_lifecycle_expectations`. + Q: Are tracepoints part of the stable ABI? ------------------------------------------ A: NO. Tracepoints are tied to internal implementation details hence they are @@ -236,8 +240,8 @@ A: NO. Classic BPF programs are converted into extend BPF instructions. Q: Can BPF call arbitrary kernel functions? ------------------------------------------- -A: NO. BPF programs can only call a set of helper functions which -is defined for every program type. +A: NO. BPF programs can only call specific functions exposed as BPF helpers or +kfuncs. The set of available functions is defined for every program type. Q: Can BPF overwrite arbitrary kernel memory? --------------------------------------------- @@ -263,7 +267,12 @@ Q: New functionality via kernel modules? Q: Can BPF functionality such as new program or map types, new helpers, etc be added out of kernel module code? -A: NO. +A: Yes, through kfuncs and kptrs + +The core BPF functionality such as program types, maps and helpers cannot be +added to by modules. 
However, modules can expose functionality to BPF programs +by exporting kfuncs (which may return pointers to module-internal data +structures as kptrs). Q: Directly calling kernel function is an ABI? ---------------------------------------------- @@ -278,7 +287,8 @@ kernel functions have already been used by other kernel tcp cc (congestion-control) implementations. If any of these kernel functions has changed, both the in-tree and out-of-tree kernel tcp cc implementations have to be changed. The same goes for the bpf -programs and they have to be adjusted accordingly. +programs and they have to be adjusted accordingly. See +:ref:`BPF_kfunc_lifecycle_expectations` for details. Q: Attaching to arbitrary kernel functions is an ABI? ----------------------------------------------------- @@ -340,6 +350,7 @@ compatibility for these features? A: NO. -Unlike map value types, there are no stability guarantees for this case. The -whole API to work with allocated objects and any support for special fields -inside them is unstable (since it is exposed through kfuncs). +Unlike map value types, the API to work with allocated objects and any support +for special fields inside them is exposed through kfuncs, and thus has the same +lifecycle expectations as the kfuncs themselves. See +:ref:`BPF_kfunc_lifecycle_expectations` for details. diff --git a/Documentation/bpf/cpumasks.rst b/Documentation/bpf/cpumasks.rst new file mode 100644 index 000000000000..24bef9cbbeee --- /dev/null +++ b/Documentation/bpf/cpumasks.rst @@ -0,0 +1,393 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _cpumasks-header-label: + +================== +BPF cpumask kfuncs +================== + +1. Introduction +=============== + +``struct cpumask`` is a bitmap data structure in the kernel whose indices +reflect the CPUs on the system. Commonly, cpumasks are used to track which CPUs +a task is affinitized to, but they can also be used to e.g. 
track which cores +are associated with a scheduling domain, which cores on a machine are idle, +etc. + +BPF provides programs with a set of :ref:`kfuncs-header-label` that can be +used to allocate, mutate, query, and free cpumasks. + +2. BPF cpumask objects +====================== + +There are two different types of cpumasks that can be used by BPF programs. + +2.1 ``struct bpf_cpumask *`` +---------------------------- + +``struct bpf_cpumask *`` is a cpumask that is allocated by BPF, on behalf of a +BPF program, and whose lifecycle is entirely controlled by BPF. These cpumasks +are RCU-protected, can be mutated, can be used as kptrs, and can be safely cast +to a ``struct cpumask *``. + +2.1.1 ``struct bpf_cpumask *`` lifecycle +---------------------------------------- + +A ``struct bpf_cpumask *`` is allocated, acquired, and released, using the +following functions: + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_create + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_acquire + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_release + +For example: + +.. 
code-block:: c + + struct cpumask_map_value { + struct bpf_cpumask __kptr_ref * cpumask; + }; + + struct array_map { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, int); + __type(value, struct cpumask_map_value); + __uint(max_entries, 65536); + } cpumask_map SEC(".maps"); + + static int cpumask_map_insert(struct bpf_cpumask *mask, u32 pid) + { + struct cpumask_map_value local, *v; + long status; + struct bpf_cpumask *old; + u32 key = pid; + + local.cpumask = NULL; + status = bpf_map_update_elem(&cpumask_map, &key, &local, 0); + if (status) { + bpf_cpumask_release(mask); + return status; + } + + v = bpf_map_lookup_elem(&cpumask_map, &key); + if (!v) { + bpf_cpumask_release(mask); + return -ENOENT; + } + + old = bpf_kptr_xchg(&v->cpumask, mask); + if (old) + bpf_cpumask_release(old); + + return 0; + } + + /** + * A sample tracepoint showing how a task's cpumask can be queried and + * recorded as a kptr. + */ + SEC("tp_btf/task_newtask") + int BPF_PROG(record_task_cpumask, struct task_struct *task, u64 clone_flags) + { + struct bpf_cpumask *cpumask; + int ret; + + cpumask = bpf_cpumask_create(); + if (!cpumask) + return -ENOMEM; + + if (!bpf_cpumask_full(task->cpus_ptr)) + bpf_printk("task %s has CPU affinity", task->comm); + + bpf_cpumask_copy(cpumask, task->cpus_ptr); + return cpumask_map_insert(cpumask, task->pid); + } + +---- + +2.1.2 ``struct bpf_cpumask *`` as kptrs +--------------------------------------- + +As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can +also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in +a map, the reference can be removed from the map with bpf_kptr_xchg(), or +opportunistically acquired with bpf_cpumask_kptr_get(): + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_kptr_get + +Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map: + +.. code-block:: c + + /* struct containing the struct bpf_cpumask kptr which is stored in the map.
*/ + struct cpumasks_kfunc_map_value { + struct bpf_cpumask __kptr_ref * bpf_cpumask; + }; + + /* The map containing struct cpumasks_kfunc_map_value entries. */ + struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, int); + __type(value, struct cpumasks_kfunc_map_value); + __uint(max_entries, 1); + } cpumasks_kfunc_map SEC(".maps"); + + /* ... */ + + /** + * A simple example tracepoint program showing how a + * struct bpf_cpumask * kptr that is stored in a map can + * be acquired using the bpf_cpumask_kptr_get() kfunc. + */ + SEC("tp_btf/cgroup_mkdir") + int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path) + { + struct bpf_cpumask *kptr; + struct cpumasks_kfunc_map_value *v; + u32 key = 0; + + /* Assume a bpf_cpumask * kptr was previously stored in the map. */ + v = bpf_map_lookup_elem(&cpumasks_kfunc_map, &key); + if (!v) + return -ENOENT; + + /* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */ + kptr = bpf_cpumask_kptr_get(&v->bpf_cpumask); + if (!kptr) + /* If no bpf_cpumask was present in the map, it's because + * we're racing with another CPU that removed it with + * bpf_kptr_xchg() between the bpf_map_lookup_elem() + * above, and our call to bpf_cpumask_kptr_get(). + * bpf_cpumask_kptr_get() internally safely handles this + * race, and will return NULL if the cpumask is no longer + * present in the map by the time we invoke the kfunc. + */ + return -EBUSY; + + /* Free the reference we just took above. Note that the + * original struct bpf_cpumask * kptr is still in the map. It will + * be freed either at a later time if another context deletes + * it from the map, or automatically by the BPF subsystem if + * it's still present when the map is destroyed. + */ + bpf_cpumask_release(kptr); + + return 0; + } + +---- + +2.2 ``struct cpumask`` +---------------------- + +``struct cpumask`` is the object that actually contains the cpumask bitmap +being queried, mutated, etc. 
A ``struct bpf_cpumask`` wraps a ``struct +cpumask``, which is why it's safe to cast it as such (note however that it is +**not** safe to cast a ``struct cpumask *`` to a ``struct bpf_cpumask *``, and +the verifier will reject any program that tries to do so). + +As we'll see below, any kfunc that mutates its cpumask argument will take a +``struct bpf_cpumask *`` as that argument. Any argument that simply queries the +cpumask will instead take a ``struct cpumask *``. + +3. cpumask kfuncs +================= + +Above, we described the kfuncs that can be used to allocate, acquire, release, +etc a ``struct bpf_cpumask *``. This section of the document will describe the +kfuncs for mutating and querying cpumasks. + +3.1 Mutating cpumasks +--------------------- + +Some cpumask kfuncs are "read-only" in that they don't mutate any of their +arguments, whereas others mutate at least one argument (which means that the +argument must be a ``struct bpf_cpumask *``, as described above). + +This section will describe all of the cpumask kfuncs which mutate at least one +argument. :ref:`cpumasks-querying-label` below describes the read-only kfuncs. + +3.1.1 Setting and clearing CPUs +------------------------------- + +bpf_cpumask_set_cpu() and bpf_cpumask_clear_cpu() can be used to set and clear +a CPU in a ``struct bpf_cpumask`` respectively: + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_set_cpu bpf_cpumask_clear_cpu + +These kfuncs are pretty straightforward, and can be used, for example, as +follows: + +.. code-block:: c + + /** + * A sample tracepoint showing how a cpumask can be queried. + */ + SEC("tp_btf/task_newtask") + int BPF_PROG(test_set_clear_cpu, struct task_struct *task, u64 clone_flags) + { + struct bpf_cpumask *cpumask; + + cpumask = bpf_cpumask_create(); + if (!cpumask) + return -ENOMEM; + + bpf_cpumask_set_cpu(0, cpumask); + if (!bpf_cpumask_test_cpu(0, cast(cpumask))) + /* Should never happen. 
*/ + goto release_exit; + + bpf_cpumask_clear_cpu(0, cpumask); + if (bpf_cpumask_test_cpu(0, cast(cpumask))) + /* Should never happen. */ + goto release_exit; + + /* struct cpumask * pointers such as task->cpus_ptr can also be queried. */ + if (bpf_cpumask_test_cpu(0, task->cpus_ptr)) + bpf_printk("task %s can use CPU %d", task->comm, 0); + + release_exit: + bpf_cpumask_release(cpumask); + return 0; + } + +---- + +bpf_cpumask_test_and_set_cpu() and bpf_cpumask_test_and_clear_cpu() are +complementary kfuncs that allow callers to atomically test and set (or clear) +CPUs: + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_test_and_set_cpu bpf_cpumask_test_and_clear_cpu + +---- + +We can also set and clear entire ``struct bpf_cpumask *`` objects in one +operation using bpf_cpumask_setall() and bpf_cpumask_clear(): + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_setall bpf_cpumask_clear + +3.1.2 Operations between cpumasks +--------------------------------- + +In addition to setting and clearing individual CPUs in a single cpumask, +callers can also perform bitwise operations between multiple cpumasks using +bpf_cpumask_and(), bpf_cpumask_or(), and bpf_cpumask_xor(): + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_and bpf_cpumask_or bpf_cpumask_xor + +The following is an example of how they may be used. Note that some of the +kfuncs shown in this example will be covered in more detail below. + +.. code-block:: c + + /** + * A sample tracepoint showing how a cpumask can be mutated using + bitwise operators (and queried). + */ + SEC("tp_btf/task_newtask") + int BPF_PROG(test_and_or_xor, struct task_struct *task, u64 clone_flags) + { + struct bpf_cpumask *mask1, *mask2, *dst1, *dst2; + + mask1 = bpf_cpumask_create(); + if (!mask1) + return -ENOMEM; + + mask2 = bpf_cpumask_create(); + if (!mask2) { + bpf_cpumask_release(mask1); + return -ENOMEM; + } + + // ...Safely create the other two masks... 
+ + bpf_cpumask_set_cpu(0, mask1); + bpf_cpumask_set_cpu(1, mask2); + bpf_cpumask_and(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2); + if (!bpf_cpumask_empty((const struct cpumask *)dst1)) + /* Should never happen. */ + goto release_exit; + + bpf_cpumask_or(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2); + if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)dst1)) + /* Should never happen. */ + goto release_exit; + + if (!bpf_cpumask_test_cpu(1, (const struct cpumask *)dst1)) + /* Should never happen. */ + goto release_exit; + + bpf_cpumask_xor(dst2, (const struct cpumask *)mask1, (const struct cpumask *)mask2); + if (!bpf_cpumask_equal((const struct cpumask *)dst1, + (const struct cpumask *)dst2)) + /* Should never happen. */ + goto release_exit; + + release_exit: + bpf_cpumask_release(mask1); + bpf_cpumask_release(mask2); + bpf_cpumask_release(dst1); + bpf_cpumask_release(dst2); + return 0; + } + +---- + +The contents of an entire cpumask may be copied to another using +bpf_cpumask_copy(): + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_copy + +---- + +.. _cpumasks-querying-label: + +3.2 Querying cpumasks +--------------------- + +In addition to the above kfuncs, there is also a set of read-only kfuncs that +can be used to query the contents of cpumasks. + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_test_cpu + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_equal bpf_cpumask_intersects bpf_cpumask_subset + bpf_cpumask_empty bpf_cpumask_full + +.. kernel-doc:: kernel/bpf/cpumask.c + :identifiers: bpf_cpumask_any bpf_cpumask_any_and + +---- + +Some example usages of these querying kfuncs were shown above. We will not +replicate those examples here. 
Note, however, that all of the aforementioned +kfuncs are tested in `tools/testing/selftests/bpf/progs/cpumask_success.c`_, so +please take a look there if you're looking for more examples of how they can be +used. + +.. _tools/testing/selftests/bpf/progs/cpumask_success.c: + https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_success.c + + +4. Adding BPF cpumask kfuncs +============================ + +The set of supported BPF cpumask kfuncs are not (yet) a 1-1 match with the +cpumask operations in include/linux/cpumask.h. Any of those cpumask operations +could easily be encapsulated in a new kfunc if and when required. If you'd like +to support a new cpumask operation, please feel free to submit a patch. If you +do add a new cpumask kfunc, please document it here, and add any relevant +selftest testcases to the cpumask selftest suite. diff --git a/Documentation/bpf/graph_ds_impl.rst b/Documentation/bpf/graph_ds_impl.rst new file mode 100644 index 000000000000..61274622b71d --- /dev/null +++ b/Documentation/bpf/graph_ds_impl.rst @@ -0,0 +1,267 @@ +========================= +BPF Graph Data Structures +========================= + +This document describes implementation details of new-style "graph" data +structures (linked_list, rbtree), with particular focus on the verifier's +implementation of semantics specific to those data structures. + +Although no specific verifier code is referred to in this document, the document +assumes that the reader has general knowledge of BPF verifier internals, BPF +maps, and BPF program writing. + +Note that the intent of this document is to describe the current state of +these graph data structures. **No guarantees** of stability for either +semantics or APIs are made or implied here. + +.. 
contents:: + :local: + :depth: 2 + +Introduction +------------ + +The BPF map API has historically been the main way to expose data structures +of various types for use within BPF programs. Some data structures fit naturally +with the map API (HASH, ARRAY), others less so. Consequentially, programs +interacting with the latter group of data structures can be hard to parse +for kernel programmers without previous BPF experience. + +Luckily, some restrictions which necessitated the use of BPF map semantics are +no longer relevant. With the introduction of kfuncs, kptrs, and the any-context +BPF allocator, it is now possible to implement BPF data structures whose API +and semantics more closely match those exposed to the rest of the kernel. + +Two such data structures - linked_list and rbtree - have many verification +details in common. Because both have "root"s ("head" for linked_list) and +"node"s, the verifier code and this document refer to common functionality +as "graph_api", "graph_root", "graph_node", etc. + +Unless otherwise stated, examples and semantics below apply to both graph data +structures. + +Unstable API +------------ + +Data structures implemented using the BPF map API have historically used BPF +helper functions - either standard map API helpers like ``bpf_map_update_elem`` +or map-specific helpers. The new-style graph data structures instead use kfuncs +to define their manipulation helpers. Because there are no stability guarantees +for kfuncs, the API and semantics for these data structures can be evolved in +a way that breaks backwards compatibility if necessary. + +Root and node types for the new data structures are opaquely defined in the +``uapi/linux/bpf.h`` header. + +Locking +------- + +The new-style data structures are intrusive and are defined similarly to their +vanilla kernel counterparts: + +.. 
code-block:: c + + struct node_data { + long key; + long data; + struct bpf_rb_node node; + }; + + struct bpf_spin_lock glock; + struct bpf_rb_root groot __contains(node_data, node); + +The "root" type for both linked_list and rbtree expects to be in a map_value +which also contains a ``bpf_spin_lock`` - in the above example both global +variables are placed in a single-value arraymap. The verifier considers this +spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in +the same map_value and will enforce that the correct lock is held when +verifying BPF programs that manipulate the tree. Since this lock checking +happens at verification time, there is no runtime penalty. + +Non-owning references +--------------------- + +**Motivation** + +Consider the following BPF code: + +.. code-block:: c + + struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ + + bpf_spin_lock(&lock); + + bpf_rbtree_add(&tree, n); /* PASSED */ + + bpf_spin_unlock(&lock); + +From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new`` +has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of +``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the +program has ownership of the pointee's (object pointed to by ``n``) lifetime. +The BPF program must pass off ownership before exiting - either via +``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with +``bpf_rbtree_add``. + +(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where +"ownership is acquired" and "ownership is passed", respectively) + +What should the verifier do with ``n`` after ownership is passed off? If the +object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier +should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as +the object is no longer valid. The underlying memory may have been reused for +some other allocation, unmapped, etc. 
+ +When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less +obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``, +but that would result in programs with useful, common coding patterns being +rejected, e.g.: + +.. code-block:: c + + int x; + struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ + + bpf_spin_lock(&lock); + + bpf_rbtree_add(&tree, n); /* PASSED */ + x = n->data; + n->data = 42; + + bpf_spin_unlock(&lock); + +Both the read from and write to ``n->data`` would be rejected. The verifier +can do better, though, by taking advantage of two details: + + * Graph data structure APIs can only be used when the ``bpf_spin_lock`` + associated with the graph root is held + + * Both graph data structures have pointer stability + + * Because graph nodes are allocated with ``bpf_obj_new`` and + adding / removing from the root involves fiddling with the + ``bpf_{list,rb}_node`` field of the node struct, a graph node will + remain at the same address after either operation. + +Because the associated ``bpf_spin_lock`` must be held by any program adding +or removing, if we're in the critical section bounded by that lock, we know +that no other program can add or remove until the end of the critical section. +This combined with pointer stability means that, until the critical section +ends, we can safely access the graph node through ``n`` even after it was used +to pass ownership. + +The verifier considers such a reference a *non-owning reference*. The ref +returned by ``bpf_obj_new`` is accordingly considered an *owning reference*. +Both terms currently only have meaning in the context of graph nodes and API. + +**Details** + +Let's enumerate the properties of both types of references. 
+ +*owning reference* + + * This reference controls the lifetime of the pointee + + * Ownership of pointee must be 'released' by passing it to some graph API + kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee + + * If not released before program ends, verifier considers program invalid + + * Access to the pointee's memory will not page fault + +*non-owning reference* + + * This reference does not own the pointee + + * It cannot be used to add the graph node to a graph root, nor ``free``'d via + ``bpf_obj_drop`` + + * No explicit control of lifetime, but can infer valid lifetime based on + non-owning ref existence (see explanation below) + + * Access to the pointee's memory will not page fault + +From verifier's perspective non-owning references can only exist +between spin_lock and spin_unlock. Why? After spin_unlock another program +can do arbitrary operations on the data structure like removing and ``free``-ing +via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, +``free``'d, and reused via bpf_obj_new would point to an entirely different thing. +Or the memory could go away. + +To prevent this logic violation all non-owning references are invalidated by the +verifier after a critical section ends. This is necessary to ensure the "will +not page fault" property of non-owning references. So if the verifier hasn't +invalidated a non-owning ref, accessing it will not page fault. + +Currently ``bpf_obj_drop`` is not allowed in the critical section, so +if there's a valid non-owning ref, we must be in a critical section, and can +conclude that the ref's memory hasn't been dropped-and- ``free``'d or +dropped-and-reused. + +Any reference to a node that is in an rbtree _must_ be non-owning, since +the tree has control of the pointee's lifetime. Similarly, any ref to a node +that isn't in rbtree _must_ be owning. 
This results in a nice property: +graph API add / remove implementations don't need to check if a node +has already been added (or already removed), as the ownership model +allows the verifier to prevent such a state from being valid by simply checking +types. + +However, pointer aliasing poses an issue for the above "nice property". +Consider the following example: + +.. code-block:: c + + struct node_data *n, *m, *o, *p; + n = bpf_obj_new(typeof(*n)); /* 1 */ + + bpf_spin_lock(&lock); + + bpf_rbtree_add(&tree, n); /* 2 */ + m = bpf_rbtree_first(&tree); /* 3 */ + + o = bpf_rbtree_remove(&tree, n); /* 4 */ + p = bpf_rbtree_remove(&tree, m); /* 5 */ + + bpf_spin_unlock(&lock); + + bpf_obj_drop(o); + bpf_obj_drop(p); /* 6 */ + +Assume the tree is empty before this program runs. If we track verifier state +changes here using numbers in above comments: + + 1) n is an owning reference + + 2) n is a non-owning reference, it's been added to the tree + + 3) n and m are non-owning references, they both point to the same node + + 4) o is an owning reference, n and m non-owning, all point to same node + + 5) o and p are owning, n and m non-owning, all point to the same node + + 6) a double-free has occurred, since o and p point to same node and o was + ``free``'d in previous statement + +States 4 and 5 violate our "nice property", as there are non-owning refs to +a node which is not in an rbtree. Statement 5 will try to remove a node which +has already been removed as a result of this violation. State 6 is a dangerous +double-free. + +At a minimum we should prevent state 6 from being possible. If we can't also +prevent state 5 then we must abandon our "nice property" and check whether a +node has already been removed at runtime. + +We prevent both by generalizing the "invalidate non-owning references" behavior +of ``bpf_spin_unlock`` and doing similar invalidation after +``bpf_rbtree_remove``. 
The logic here is that any graph API kfunc which: + + * takes an arbitrary node argument + + * removes it from the data structure + + * returns an owning reference to the removed node + +may result in a state where some other non-owning reference points to the same +node. So ``remove``-type kfuncs must be considered a non-owning reference +invalidation point as well. diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index b81533d8b061..dbb39e8f9889 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -20,6 +20,7 @@ that goes into great technical depth about the BPF Architecture. syscall_api helpers kfuncs + cpumasks programs maps bpf_prog_run diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst index e672d5ec6cc7..af515de5fc38 100644 --- a/Documentation/bpf/instruction-set.rst +++ b/Documentation/bpf/instruction-set.rst @@ -7,6 +7,11 @@ eBPF Instruction Set Specification, v1.0 This document specifies version 1.0 of the eBPF instruction set. +Documentation conventions +========================= + +For brevity, this document uses the type notation "u64", "u32", etc. +to mean an unsigned integer whose width is the specified number of bits. Registers and calling convention ================================ @@ -30,20 +35,56 @@ Instruction encoding eBPF has two instruction encodings: * the basic instruction encoding, which uses 64 bits to encode an instruction -* the wide instruction encoding, which appends a second 64-bit immediate value - (imm64) after the basic instruction for a total of 128 bits. +* the wide instruction encoding, which appends a second 64-bit immediate (i.e., + constant) value after the basic instruction for a total of 128 bits. 
+ +The basic instruction encoding is as follows, where MSB and LSB mean the most significant +bits and least significant bits, respectively: + +============= ======= ======= ======= ============ +32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB) +============= ======= ======= ======= ============ +imm offset src_reg dst_reg opcode +============= ======= ======= ======= ============ + +**imm** + signed integer immediate value -The basic instruction encoding looks as follows: +**offset** + signed integer offset used with pointer arithmetic -============= ======= =============== ==================== ============ -32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB) -============= ======= =============== ==================== ============ -immediate offset source register destination register opcode -============= ======= =============== ==================== ============ +**src_reg** + the source register number (0-10), except where otherwise specified + (`64-bit immediate instructions`_ reuse this field for other purposes) + +**dst_reg** + destination register number (0-10) + +**opcode** + operation to perform Note that most instructions do not use all of the fields. Unused fields shall be cleared to zero. +As discussed below in `64-bit immediate instructions`_, a 64-bit immediate +instruction uses a 64-bit immediate value that is constructed as follows. +The 64 bits following the basic instruction contain a pseudo instruction +using the same format but with opcode, dst_reg, src_reg, and offset all set to zero, +and imm containing the high 32 bits of the immediate value. + +================= ================== +64 bits (MSB) 64 bits (LSB) +================= ================== +basic instruction pseudo instruction +================= ================== + +Thus the 64-bit immediate value is constructed as follows: + + imm64 = (next_imm << 32) | imm + +where 'next_imm' refers to the imm value of the pseudo instruction +following the basic instruction. 
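The ``imm64`` construction quoted in the hunk above can be sketched in plain C. This is an illustrative sketch only, not code from the patch: the helper names ``make_imm64``, ``split_imm64``, and ``imm64_self_check`` are hypothetical, and the sketch assumes that the low ``imm`` is zero-extended (not sign-extended) before being OR-ed in, which is one reading of ``imm64 = (next_imm << 32) | imm``.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine the 'imm' of the basic instruction (low half)
 * with the 'imm' of the following pseudo instruction (high half).
 * Assumption: both halves are treated as raw 32-bit patterns, so a negative
 * low 'imm' must not sign-extend into the upper 32 bits.
 */
static uint64_t make_imm64(int32_t imm, int32_t next_imm)
{
	return ((uint64_t)(uint32_t)next_imm << 32) | (uint64_t)(uint32_t)imm;
}

/* Hypothetical inverse: split a 64-bit immediate into its two 32-bit halves. */
static void split_imm64(uint64_t imm64, uint32_t *imm, uint32_t *next_imm)
{
	*imm = (uint32_t)(imm64 & 0xffffffffu);
	*next_imm = (uint32_t)(imm64 >> 32);
}

/* Small self-check exercising both helpers. */
void imm64_self_check(void)
{
	uint32_t lo, hi;

	assert(make_imm64(0x55667788, 0x11223344) == 0x1122334455667788ULL);
	/* A low half of -1 only fills the low 32 bits. */
	assert(make_imm64(-1, 0) == 0x00000000ffffffffULL);

	split_imm64(0x1122334455667788ULL, &lo, &hi);
	assert(lo == 0x55667788u && hi == 0x11223344u);
}
```

Calling ``imm64_self_check()`` from a test harness or a ``main()`` exercises both directions. The explicit ``(uint32_t)`` casts are the design point: without them, a negative low ``imm`` would be sign-extended and overwrite the high half.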
+ Instruction classes ------------------- @@ -71,27 +112,32 @@ For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` an ============== ====== ================= 4 bits (MSB) 1 bit 3 bits (LSB) ============== ====== ================= -operation code source instruction class +code source instruction class ============== ====== ================= -The 4th bit encodes the source operand: +**code** + the operation code, whose meaning varies by instruction class - ====== ===== ======================================== - source value description - ====== ===== ======================================== - BPF_K 0x00 use 32-bit immediate as source operand - BPF_X 0x08 use 'src_reg' register as source operand - ====== ===== ======================================== +**source** + the source operand location, which unless otherwise specified is one of: -The four MSB bits store the operation code. + ====== ===== ============================================== + source value description + ====== ===== ============================================== + BPF_K 0x00 use 32-bit 'imm' value as source operand + BPF_X 0x08 use 'src_reg' register value as source operand + ====== ===== ============================================== +**instruction class** + the instruction class (see `Instruction classes`_) Arithmetic instructions ----------------------- ``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for otherwise identical operations. -The 'code' field encodes the operation as below: +The 'code' field encodes the operation as below, where 'src' and 'dst' refer +to the values of the source and destination registers, respectively. ======== ===== ========================================================== code value description @@ -99,35 +145,49 @@ code value description BPF_ADD 0x00 dst += src BPF_SUB 0x10 dst -= src BPF_MUL 0x20 dst \*= src -BPF_DIV 0x30 dst /= src +BPF_DIV 0x30 dst = (src != 0) ? 
(dst / src) : 0 BPF_OR 0x40 dst \|= src BPF_AND 0x50 dst &= src BPF_LSH 0x60 dst <<= src BPF_RSH 0x70 dst >>= src BPF_NEG 0x80 dst = ~src -BPF_MOD 0x90 dst %= src +BPF_MOD 0x90 dst = (src != 0) ? (dst % src) : dst BPF_XOR 0xa0 dst ^= src BPF_MOV 0xb0 dst = src BPF_ARSH 0xc0 sign extending shift right BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below) ======== ===== ========================================================== +Underflow and overflow are allowed during arithmetic operations, meaning +the 64-bit or 32-bit value will wrap. If eBPF program execution would +result in division by zero, the destination register is instead set to zero. +If execution would result in modulo by zero, for ``BPF_ALU64`` the value of +the destination register is unchanged whereas for ``BPF_ALU`` the upper +32 bits of the destination register are zeroed. + ``BPF_ADD | BPF_X | BPF_ALU`` means:: - dst_reg = (u32) dst_reg + (u32) src_reg; + dst = (u32) ((u32) dst + (u32) src) + +where '(u32)' indicates that the upper 32 bits are zeroed. ``BPF_ADD | BPF_X | BPF_ALU64`` means:: - dst_reg = dst_reg + src_reg + dst = dst + src ``BPF_XOR | BPF_K | BPF_ALU`` means:: - dst_reg = (u32) dst_reg ^ (u32) imm32 + dst = (u32) dst ^ (u32) imm32 ``BPF_XOR | BPF_K | BPF_ALU64`` means:: - dst_reg = dst_reg ^ imm32 + dst = dst ^ imm32 +Also note that the division and modulo operations are unsigned. Thus, for +``BPF_ALU``, 'imm' is first interpreted as an unsigned 32-bit value, whereas +for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result +interpreted as an unsigned 64-bit value. There are no instructions for +signed division or modulo. 
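The division, modulo, and 32-bit truncation rules above can be restated as plain C helpers. This is a sketch under stated assumptions, not the kernel's implementation: the function names are hypothetical, and registers are modeled as bare ``uint64_t`` values.

```c
#include <stdint.h>

/* BPF_DIV | BPF_ALU64: unsigned division; division by zero yields 0. */
uint64_t alu64_div(uint64_t dst, uint64_t src)
{
    return src != 0 ? dst / src : 0;
}

/* BPF_MOD | BPF_ALU64: unsigned modulo; modulo by zero leaves dst unchanged. */
uint64_t alu64_mod(uint64_t dst, uint64_t src)
{
    return src != 0 ? dst % src : dst;
}

/* BPF_MOD | BPF_ALU: operands are truncated to 32 bits, so the upper
 * 32 bits of the result are zero even in the modulo-by-zero case. */
uint64_t alu32_mod(uint64_t dst, uint64_t src)
{
    uint32_t d = (uint32_t)dst;
    uint32_t s = (uint32_t)src;

    return s != 0 ? d % s : d;
}

/* BPF_ADD | BPF_X | BPF_ALU: dst = (u32) ((u32) dst + (u32) src),
 * wrapping on overflow. */
uint64_t alu32_add(uint64_t dst, uint64_t src)
{
    return (uint32_t)((uint32_t)dst + (uint32_t)src);
}

/* BPF_DIV | BPF_K | BPF_ALU64: 'imm' is sign extended to 64 bits and
 * the result is then interpreted as an unsigned 64-bit divisor. */
uint64_t alu64_div_imm(uint64_t dst, int32_t imm)
{
    return alu64_div(dst, (uint64_t)(int64_t)imm);
}
```

For instance, under this model an ``imm`` of -1 in a ``BPF_ALU64`` division becomes the divisor ``0xffffffffffffffff``, because the division itself is unsigned.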
Byte swap instructions ~~~~~~~~~~~~~~~~~~~~~~ @@ -155,11 +215,11 @@ Examples: ``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means:: - dst_reg = htole16(dst_reg) + dst = htole16(dst) ``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means:: - dst_reg = htobe64(dst_reg) + dst = htobe64(dst) Jump instructions ----------------- @@ -234,15 +294,15 @@ instructions that transfer data between a register and memory. ``BPF_MEM | <size> | BPF_STX`` means:: - *(size *) (dst_reg + off) = src_reg + *(size *) (dst + offset) = src ``BPF_MEM | <size> | BPF_ST`` means:: - *(size *) (dst_reg + off) = imm32 + *(size *) (dst + offset) = imm32 ``BPF_MEM | <size> | BPF_LDX`` means:: - dst_reg = *(size *) (src_reg + off) + dst = *(size *) (src + offset) Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``. @@ -276,11 +336,11 @@ BPF_XOR 0xa0 atomic xor ``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means:: - *(u32 *)(dst_reg + off16) += src_reg + *(u32 *)(dst + offset) += src ``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF_ADD means:: - *(u64 *)(dst_reg + off16) += src_reg + *(u64 *)(dst + offset) += src In addition to the simple atomic operations, there also is a modifier and two complex atomic operations: @@ -295,16 +355,16 @@ BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange The ``BPF_FETCH`` modifier is optional for simple atomic operations, and always set for the complex atomic operations. If the ``BPF_FETCH`` flag -is set, then the operation also overwrites ``src_reg`` with the value that +is set, then the operation also overwrites ``src`` with the value that was in memory before it was modified. -The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value -addressed by ``dst_reg + off``. +The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value +addressed by ``dst + offset``. The ``BPF_CMPXCHG`` operation atomically compares the value addressed by -``dst_reg + off`` with ``R0``. 
If they match, the value addressed by -``dst_reg + off`` is replaced with ``src_reg``. In either case, the -value that was at ``dst_reg + off`` before the operation is zero-extended +``dst + offset`` with ``R0``. If they match, the value addressed by +``dst + offset`` is replaced with ``src``. In either case, the +value that was at ``dst + offset`` before the operation is zero-extended and loaded back to ``R0``. 64-bit immediate instructions @@ -317,7 +377,7 @@ There is currently only one such instruction. ``BPF_LD | BPF_DW | BPF_IMM`` means:: - dst_reg = imm64 + dst = imm64 Legacy BPF Packet access instructions diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst index 9fd7fb539f85..ca96ef3f6896 100644 --- a/Documentation/bpf/kfuncs.rst +++ b/Documentation/bpf/kfuncs.rst @@ -1,3 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _kfuncs-header-label: + ============================= BPF Kernel Functions (kfuncs) ============================= @@ -9,7 +13,7 @@ BPF Kernel Functions or more commonly known as kfuncs are functions in the Linux kernel which are exposed for use by BPF programs. Unlike normal BPF helpers, kfuncs do not have a stable interface and can change from one kernel release to another. Hence, BPF programs need to be updated in response to changes in the -kernel. +kernel. See :ref:`BPF_kfunc_lifecycle_expectations` for more information. 2. Defining a kfunc =================== @@ -37,7 +41,7 @@ An example is given below:: __diag_ignore_all("-Wmissing-prototypes", "Global kfuncs as their definitions will be in BTF"); - struct task_struct *bpf_find_get_task_by_vpid(pid_t nr) + __bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr) { return find_get_task_by_vpid(nr); } @@ -62,7 +66,7 @@ kfunc with a __tag, where tag may be one of the supported annotations. This annotation is used to indicate a memory and size pair in the argument list. 
An example is given below:: - void bpf_memzero(void *mem, int mem__sz) + __bpf_kfunc void bpf_memzero(void *mem, int mem__sz) { ... } @@ -82,7 +86,7 @@ safety of the program. An example is given below:: - void *bpf_obj_new(u32 local_type_id__k, ...) + __bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...) { ... } @@ -121,6 +125,20 @@ flags on a set of kfuncs as follows:: This set encodes the BTF ID of each kfunc listed above, and encodes the flags along with it. Of course, it is also allowed to specify no flags. +kfunc definitions should also always be annotated with the ``__bpf_kfunc`` +macro. This prevents issues such as the compiler inlining the kfunc if it's a +static kernel function, or the function being elided in an LTO build as it's +not used in the rest of the kernel. Developers should not manually add +annotations to their kfunc to prevent these issues. If an annotation is +required to prevent such an issue with your kfunc, it is a bug and should be +added to the definition of the macro so that other kfuncs are similarly +protected. An example is given below:: + + __bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid) + { + ... + } + 2.4.1 KF_ACQUIRE flag --------------------- @@ -163,7 +181,8 @@ KF_ACQUIRE and KF_RET_NULL flags. The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It indicates that all pointer arguments are valid, and that all pointers to BTF objects have been passed in their unmodified form (that is, at a zero -offset, and without having been obtained from walking another pointer). +offset, and without having been obtained from walking another pointer, with one +exception described below). There are two types of pointers to kernel objects which are considered "valid": @@ -176,6 +195,25 @@ KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset. The definition of "valid" pointers is subject to change at any time, and has absolutely no ABI stability guarantees. 
+As mentioned above, a nested pointer obtained from walking a trusted pointer is +no longer trusted, with one exception. If a struct type has a field that is +guaranteed to be valid as long as its parent pointer is trusted, the +``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as +follows: + +.. code-block:: c + + BTF_TYPE_SAFE_NESTED(struct task_struct) { + const cpumask_t *cpus_ptr; + }; + +In other words, you must: + +1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro. + +2. Specify the type and name of the trusted nested field. This field must match + the field in the original type definition exactly. + 2.4.6 KF_SLEEPABLE flag ----------------------- @@ -200,6 +238,28 @@ single argument which must be a trusted argument or a MEM_RCU pointer. The argument may have reference count of 0 and the kfunc must take this into consideration. +.. _KF_deprecated_flag: + +2.4.9 KF_DEPRECATED flag +------------------------ + +The KF_DEPRECATED flag is used for kfuncs which are scheduled to be +changed or removed in a subsequent kernel release. A kfunc that is +marked with KF_DEPRECATED should also have any relevant information +captured in its kernel doc. Such information typically includes the +kfunc's expected remaining lifespan, a recommendation for new +functionality that can replace it if any is available, and possibly a +rationale for why it is being removed. + +Note that while on some occasions, a KF_DEPRECATED kfunc may continue to be +supported and have its KF_DEPRECATED flag removed, it is likely to be far more +difficult to remove a KF_DEPRECATED flag after it's been added than it is to +prevent it from being added in the first place. 
As described in +:ref:`BPF_kfunc_lifecycle_expectations`, users that rely on specific kfuncs are +encouraged to make their use-cases known as early as possible, and participate +in upstream discussions regarding whether to keep, change, deprecate, or remove +those kfuncs if and when such discussions occur. + 2.5 Registering the kfuncs -------------------------- @@ -223,14 +283,150 @@ type. An example is shown below:: } late_initcall(init_subsystem); -3. Core kfuncs +2.6 Specifying no-cast aliases with ___init +-------------------------------------------- + +The verifier will always enforce that the BTF type of a pointer passed to a +kfunc by a BPF program, matches the type of pointer specified in the kfunc +definition. The verifier, does, however, allow types that are equivalent +according to the C standard to be passed to the same kfunc arg, even if their +BTF_IDs differ. + +For example, for the following type definition: + +.. code-block:: c + + struct bpf_cpumask { + cpumask_t cpumask; + refcount_t usage; + }; + +The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc +taking a ``cpumask_t *`` (which is a typedef of ``struct cpumask *``). For +instance, both ``struct cpumask *`` and ``struct bpf_cpmuask *`` can be passed +to bpf_cpumask_test_cpu(). + +In some cases, this type-aliasing behavior is not desired. ``struct +nf_conn___init`` is one such example: + +.. code-block:: c + + struct nf_conn___init { + struct nf_conn ct; + }; + +The C standard would consider these types to be equivalent, but it would not +always be safe to pass either type to a trusted kfunc. ``struct +nf_conn___init`` represents an allocated ``struct nf_conn`` object that has +*not yet been initialized*, so it would therefore be unsafe to pass a ``struct +nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct +nf_conn *`` (e.g. ``bpf_ct_change_timeout()``). 
+ +In order to accommodate such requirements, the verifier will enforce strict +PTR_TO_BTF_ID type matching if two types have the exact same name, with one +being suffixed with ``___init``. + +.. _BPF_kfunc_lifecycle_expectations: + +3. kfunc lifecycle expectations +=============================== + +kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the +strict stability restrictions associated with kernel <-> user UAPIs. This means +they can be thought of as similar to EXPORT_SYMBOL_GPL, and can therefore be +modified or removed by a maintainer of the subsystem they're defined in when +it's deemed necessary. + +Like any other change to the kernel, maintainers will not change or remove a +kfunc without having a reasonable justification. Whether or not they'll choose +to change a kfunc will ultimately depend on a variety of factors, such as how +widely used the kfunc is, how long the kfunc has been in the kernel, whether an +alternative kfunc exists, what the norm is in terms of stability for the +subsystem in question, and of course what the technical cost is of continuing +to support the kfunc. + +There are several implications of this: + +a) kfuncs that are widely used or have been in the kernel for a long time will + be more difficult to justify being changed or removed by a maintainer. In + other words, kfuncs that are known to have a lot of users and provide + significant value provide stronger incentives for maintainers to invest the + time and complexity in supporting them. It is therefore important for + developers that are using kfuncs in their BPF programs to communicate and + explain how and why those kfuncs are being used, and to participate in + discussions regarding those kfuncs when they occur upstream. + +b) Unlike regular kernel symbols marked with EXPORT_SYMBOL_GPL, BPF programs + that call kfuncs are generally not part of the kernel tree. 
This means that + refactoring cannot typically change callers in-place when a kfunc changes, + as is done for e.g. an upstreamed driver being updated in place when a + kernel symbol is changed. + + Unlike with regular kernel symbols, this is expected behavior for BPF + symbols, and out-of-tree BPF programs that use kfuncs should be considered + relevant to discussions and decisions around modifying and removing those + kfuncs. The BPF community will take an active role in participating in + upstream discussions when necessary to ensure that the perspectives of such + users are taken into account. + +c) A kfunc will never have any hard stability guarantees. BPF APIs cannot and + will not ever hard-block a change in the kernel purely for stability + reasons. That being said, kfuncs are features that are meant to solve + problems and provide value to users. The decision of whether to change or + remove a kfunc is a multivariate technical decision that is made on a + case-by-case basis, and which is informed by data points such as those + mentioned above. It is expected that a kfunc being removed or changed with + no warning will not be a common occurrence or take place without sound + justification, but it is a possibility that must be accepted if one is to + use kfuncs. + +3.1 kfunc deprecation +--------------------- + +As described above, while sometimes a maintainer may find that a kfunc must be +changed or removed immediately to accommodate some changes in their subsystem, +usually kfuncs will be able to accommodate a longer and more measured +deprecation process. For example, if a new kfunc comes along which provides +superior functionality to an existing kfunc, the existing kfunc may be +deprecated for some period of time to allow users to migrate their BPF programs +to use the new one. 
Or, if a kfunc has no known users, a decision may be made +to remove the kfunc (without providing an alternative API) after some +deprecation period so as to provide users with a window to notify the kfunc +maintainer if it turns out that the kfunc is actually being used. + +It's expected that the common case will be that kfuncs will go through a +deprecation period rather than being changed or removed without warning. As +described in :ref:`KF_deprecated_flag`, the kfunc framework provides the +KF_DEPRECATED flag to kfunc developers to signal to users that a kfunc has been +deprecated. Once a kfunc has been marked with KF_DEPRECATED, the following +procedure is followed for removal: + +1. Any relevant information for deprecated kfuncs is documented in the kfunc's + kernel docs. This documentation will typically include the kfunc's expected + remaining lifespan, a recommendation for new functionality that can replace + the usage of the deprecated function (or an explanation as to why no such + replacement exists), etc. + +2. The deprecated kfunc is kept in the kernel for some period of time after it + was first marked as deprecated. This time period will be chosen on a + case-by-case basis, and will typically depend on how widespread the use of + the kfunc is, how long it has been in the kernel, and how hard it is to move + to alternatives. This deprecation time period is "best effort", and as + described :ref:`above<BPF_kfunc_lifecycle_expectations>`, circumstances may + sometimes dictate that the kfunc be removed before the full intended + deprecation period has elapsed. + +3. After the deprecation period the kfunc will be removed. At this point, BPF + programs calling the kfunc will be rejected by the verifier. + +4. Core kfuncs ============== The BPF subsystem provides a number of "core" kfuncs that are potentially applicable to a wide variety of different possible use cases and programs. Those kfuncs are documented here. 
-3.1 struct task_struct * kfuncs +4.1 struct task_struct * kfuncs ------------------------------- There are a number of kfuncs that allow ``struct task_struct *`` objects to be @@ -306,7 +502,7 @@ Here is an example of it being used: return 0; } -3.2 struct cgroup * kfuncs +4.2 struct cgroup * kfuncs -------------------------- ``struct cgroup *`` objects also have acquire and release functions: @@ -420,3 +616,10 @@ the verifier. bpf_cgroup_ancestor() can be used as follows: bpf_cgroup_release(parent); return 0; } + +4.3 struct cpumask * kfuncs +--------------------------- + +BPF provides a set of kfuncs that can be used to query, allocate, mutate, and +destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label` +for more details. diff --git a/Documentation/bpf/libbpf/libbpf_naming_convention.rst b/Documentation/bpf/libbpf/libbpf_naming_convention.rst index c5ac97f3d4c4..b5b41b61b3c0 100644 --- a/Documentation/bpf/libbpf/libbpf_naming_convention.rst +++ b/Documentation/bpf/libbpf/libbpf_naming_convention.rst @@ -83,8 +83,8 @@ This prevents from accidentally exporting a symbol, that is not supposed to be a part of ABI what, in turn, improves both libbpf developer- and user-experiences. -ABI versionning ---------------- +ABI versioning +-------------- To make future ABI extensions possible libbpf ABI is versioned. Versioning is implemented by ``libbpf.map`` version script that is @@ -148,7 +148,7 @@ API documentation convention The libbpf API is documented via comments above definitions in header files. These comments can be rendered by doxygen and sphinx for well organized html output. This section describes the -convention in which these comments should be formated. +convention in which these comments should be formatted. 
Here is an example from btf.h: diff --git a/Documentation/bpf/map_sockmap.rst b/Documentation/bpf/map_sockmap.rst new file mode 100644 index 000000000000..cc92047c6630 --- /dev/null +++ b/Documentation/bpf/map_sockmap.rst @@ -0,0 +1,498 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright Red Hat + +============================================== +BPF_MAP_TYPE_SOCKMAP and BPF_MAP_TYPE_SOCKHASH +============================================== + +.. note:: + - ``BPF_MAP_TYPE_SOCKMAP`` was introduced in kernel version 4.14 + - ``BPF_MAP_TYPE_SOCKHASH`` was introduced in kernel version 4.18 + +``BPF_MAP_TYPE_SOCKMAP`` and ``BPF_MAP_TYPE_SOCKHASH`` maps can be used to +redirect skbs between sockets or to apply policy at the socket level based on +the result of a BPF (verdict) program with the help of the BPF helpers +``bpf_sk_redirect_map()``, ``bpf_sk_redirect_hash()``, +``bpf_msg_redirect_map()`` and ``bpf_msg_redirect_hash()``. + +``BPF_MAP_TYPE_SOCKMAP`` is backed by an array that uses an integer key as the +index to look up a reference to a ``struct sock``. The map values are socket +descriptors. Similarly, ``BPF_MAP_TYPE_SOCKHASH`` is a hash backed BPF map that +holds references to sockets via their socket descriptors. + +.. note:: + The value type is either __u32 or __u64; the latter (__u64) is to support + returning socket cookies to userspace. Returning the ``struct sock *`` that + the map holds to user-space is neither safe nor useful. + +These maps may have BPF programs attached to them, specifically a parser program +and a verdict program. The parser program determines how much data has been +parsed and therefore how much data needs to be queued to come to a verdict. The +verdict program is essentially the redirect program and can return a verdict +of ``__SK_DROP``, ``__SK_PASS``, or ``__SK_REDIRECT``. + +When a socket is inserted into one of these maps, its socket callbacks are +replaced and a ``struct sk_psock`` is attached to it. 
Additionally, this +``sk_psock`` inherits the programs that are attached to the map. + +A sock object may be in multiple maps, but can only inherit a single +parse or verdict program. If adding a sock object to a map would result +in having multiple parser programs the update will return an EBUSY error. + +The supported programs to attach to these maps are: + +.. code-block:: c + + struct sk_psock_progs { + struct bpf_prog *msg_parser; + struct bpf_prog *stream_parser; + struct bpf_prog *stream_verdict; + struct bpf_prog *skb_verdict; + }; + +.. note:: + Users are not allowed to attach ``stream_verdict`` and ``skb_verdict`` + programs to the same map. + +The attach types for the map programs are: + +- ``msg_parser`` program - ``BPF_SK_MSG_VERDICT``. +- ``stream_parser`` program - ``BPF_SK_SKB_STREAM_PARSER``. +- ``stream_verdict`` program - ``BPF_SK_SKB_STREAM_VERDICT``. +- ``skb_verdict`` program - ``BPF_SK_SKB_VERDICT``. + +There are additional helpers available to use with the parser and verdict +programs: ``bpf_msg_apply_bytes()`` and ``bpf_msg_cork_bytes()``. With +``bpf_msg_apply_bytes()`` BPF programs can tell the infrastructure how many +bytes the given verdict should apply to. The helper ``bpf_msg_cork_bytes()`` +handles a different case where a BPF program cannot reach a verdict on a msg +until it receives more bytes AND the program doesn't want to forward the packet +until it is known to be good. + +Finally, the helpers ``bpf_msg_pull_data()`` and ``bpf_msg_push_data()`` are +available to ``BPF_PROG_TYPE_SK_MSG`` BPF programs to pull in data and set the +start and end pointers to given values or to add metadata to the ``struct +sk_msg_buff *msg``. + +All these helpers will be described in more detail below. + +Usage +===== +Kernel BPF +---------- +bpf_msg_redirect_map() +^^^^^^^^^^^^^^^^^^^^^^ +.. 
code-block:: c + + long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags) + +This helper is used in programs implementing policies at the socket level. If +the message ``msg`` is allowed to pass (i.e., if the verdict BPF program +returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type +``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces +can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used +to select the ingress path otherwise the egress path is selected. This is the +only flag supported for now. + +Returns ``SK_PASS`` on success, or ``SK_DROP`` on error. + +bpf_sk_redirect_map() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags) + +Redirect the packet to the socket referenced by ``map`` (of type +``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces +can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used +to select the ingress path otherwise the egress path is selected. This is the +only flag supported for now. + +Returns ``SK_PASS`` on success, or ``SK_DROP`` on error. + +bpf_map_lookup_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + +socket entries of type ``struct sock *`` can be retrieved using the +``bpf_map_lookup_elem()`` helper. + +bpf_sock_map_update() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) + +Add an entry to, or update a ``map`` referencing sockets. The ``skops`` is used +as a new value for the entry associated to ``key``. The ``flags`` argument can +be one of the following: + +- ``BPF_ANY``: Create a new element or update an existing element. +- ``BPF_NOEXIST``: Create a new element only if it did not exist. 
+- ``BPF_EXIST``: Update an existing element. + +If the ``map`` has BPF programs (parser and verdict), those will be inherited +by the socket being added. If the socket is already attached to BPF programs, +this results in an error. + +Returns 0 on success, or a negative error in case of failure. + +bpf_sock_hash_update() +^^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) + +Add an entry to, or update a sockhash ``map`` referencing sockets. The ``skops`` +is used as a new value for the entry associated to ``key``. + +The ``flags`` argument can be one of the following: + +- ``BPF_ANY``: Create a new element or update an existing element. +- ``BPF_NOEXIST``: Create a new element only if it did not exist. +- ``BPF_EXIST``: Update an existing element. + +If the ``map`` has BPF programs (parser and verdict), those will be inherited +by the socket being added. If the socket is already attached to BPF programs, +this results in an error. + +Returns 0 on success, or a negative error in case of failure. + +bpf_msg_redirect_hash() +^^^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags) + +This helper is used in programs implementing policies at the socket level. If +the message ``msg`` is allowed to pass (i.e., if the verdict BPF program returns +``SK_PASS``), redirect it to the socket referenced by ``map`` (of type +``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress +interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in +``flags`` is used to select the ingress path otherwise the egress path is +selected. This is the only flag supported for now. + +Returns ``SK_PASS`` on success, or ``SK_DROP`` on error. + +bpf_sk_redirect_hash() +^^^^^^^^^^^^^^^^^^^^^^ +.. 
code-block:: c + + long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags) + +This helper is used in programs implementing policies at the skb socket level. +If the sk_buff ``skb`` is allowed to pass (i.e., if the verdict BPF program +returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type +``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress +interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in +``flags`` is used to select the ingress path otherwise the egress path is +selected. This is the only flag supported for now. + +Returns ``SK_PASS`` on success, or ``SK_DROP`` on error. + +bpf_msg_apply_bytes() +^^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes) + +For socket policies, apply the verdict of the BPF program to the next (number +of ``bytes``) of message ``msg``. For example, this helper can be used in the +following cases: + +- A single ``sendmsg()`` or ``sendfile()`` system call contains multiple + logical messages that the BPF program is supposed to read and for which it + should apply a verdict. +- A BPF program only cares to read the first ``bytes`` of a ``msg``. If the + message has a large payload, then setting up and calling the BPF program + repeatedly for all bytes, even though the verdict is already known, would + create unnecessary overhead. + +Returns 0 + +bpf_msg_cork_bytes() +^^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes) + +For socket policies, prevent the execution of the verdict BPF program for +message ``msg`` until the number of ``bytes`` have been accumulated. + +This can be used when one needs a specific number of bytes before a verdict can +be assigned, even if the data spans multiple ``sendmsg()`` or ``sendfile()`` +calls. + +Returns 0 + +bpf_msg_pull_data() +^^^^^^^^^^^^^^^^^^^^^^ +.. 
code-block:: c + + long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags) + +For socket policies, pull in non-linear data from user space for ``msg`` and set +pointers ``msg->data`` and ``msg->data_end`` to ``start`` and ``end`` bytes +offsets into ``msg``, respectively. + +If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only +parse data that the (``data``, ``data_end``) pointers have already consumed. +For ``sendmsg()`` hooks this is likely the first scatterlist element. But for +calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be +the range (**0**, **0**) because the data is shared with user space and by +default the objective is to avoid allowing user space to modify data while (or +after) BPF verdict is being decided. This helper can be used to pull in data +and to set the start and end pointers to given values. Data will be copied if +necessary (i.e., if data was not linear and if start and end pointers do not +point to the same chunk). + +A call to this helper is susceptible to change the underlying packet buffer. +Therefore, at load time, all checks on pointers previously done by the verifier +are invalidated and must be performed again, if the helper is used in +combination with direct packet access. + +All values for ``flags`` are reserved for future usage, and must be left at +zero. + +Returns 0 on success, or a negative error in case of failure. + +bpf_map_lookup_elem() +^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: c + + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + +Look up a socket entry in the sockmap or sockhash map. + +Returns the socket entry associated to ``key``, or NULL if no entry was found. + +bpf_map_update_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags) + +Add or update a socket entry in a sockmap or sockhash. 
+ +The flags argument can be one of the following: + +- BPF_ANY: Create a new element or update an existing element. +- BPF_NOEXIST: Create a new element only if it did not exist. +- BPF_EXIST: Update an existing element. + +Returns 0 on success, or a negative error in case of failure. + +bpf_map_delete_elem() +^^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_map_delete_elem(struct bpf_map *map, const void *key) + +Delete a socket entry from a sockmap or a sockhash. + +Returns 0 on success, or a negative error in case of failure. + +User space +---------- +bpf_map_update_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags) + +Sockmap entries can be added or updated using the ``bpf_map_update_elem()`` +function. The ``key`` parameter is the index value of the sockmap array. And the +``value`` parameter is the FD value of that socket. + +Under the hood, the sockmap update function uses the socket FD value to +retrieve the associated socket and its attached psock. + +The flags argument can be one of the following: + +- BPF_ANY: Create a new element or update an existing element. +- BPF_NOEXIST: Create a new element only if it did not exist. +- BPF_EXIST: Update an existing element. + +bpf_map_lookup_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_lookup_elem(int fd, const void *key, void *value) + +Sockmap entries can be retrieved using the ``bpf_map_lookup_elem()`` function. + +.. note:: + The entry returned is a socket cookie rather than a socket itself. + +bpf_map_delete_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_delete_elem(int fd, const void *key) + +Sockmap entries can be deleted using the ``bpf_map_delete_elem()`` +function. + +Returns 0 on success, or negative error in case of failure. 
+ +Examples +======== + +Kernel BPF +---------- +Several examples of the use of sockmap APIs can be found in: + +- `tools/testing/selftests/bpf/progs/test_sockmap_kern.h`_ +- `tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`_ +- `tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`_ +- `tools/testing/selftests/bpf/progs/test_sockmap_listen.c`_ +- `tools/testing/selftests/bpf/progs/test_sockmap_update.c`_ + +The following code snippet shows how to declare a sockmap. + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_SOCKMAP); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); + } sock_map_rx SEC(".maps"); + +The following code snippet shows a sample parser program. + +.. code-block:: c + + SEC("sk_skb/stream_parser") + int bpf_prog_parser(struct __sk_buff *skb) + { + return skb->len; + } + +The following code snippet shows a simple verdict program that interacts with a +sockmap to redirect traffic to another socket based on the local port. + +.. code-block:: c + + SEC("sk_skb/stream_verdict") + int bpf_prog_verdict(struct __sk_buff *skb) + { + __u32 lport = skb->local_port; + __u32 idx = 0; + + if (lport == 10000) + return bpf_sk_redirect_map(skb, &sock_map_rx, idx, 0); + + return SK_PASS; + } + +The following code snippet shows how to declare a sockhash map. + +.. code-block:: c + + struct socket_key { + __u32 src_ip; + __u32 dst_ip; + __u32 src_port; + __u32 dst_port; + }; + + struct { + __uint(type, BPF_MAP_TYPE_SOCKHASH); + __uint(max_entries, 1); + __type(key, struct socket_key); + __type(value, __u64); + } sock_hash_rx SEC(".maps"); + +The following code snippet shows a simple verdict program that interacts with a +sockhash to redirect traffic to another socket based on a hash of some of the +skb parameters. + +.. 
code-block:: c + + static inline + void extract_socket_key(struct __sk_buff *skb, struct socket_key *key) + { + key->src_ip = skb->remote_ip4; + key->dst_ip = skb->local_ip4; + key->src_port = skb->remote_port >> 16; + key->dst_port = (bpf_htonl(skb->local_port)) >> 16; + } + + SEC("sk_skb/stream_verdict") + int bpf_prog_verdict(struct __sk_buff *skb) + { + struct socket_key key; + + extract_socket_key(skb, &key); + + return bpf_sk_redirect_hash(skb, &sock_hash_rx, &key, 0); + } + +User space +---------- +Several examples of the use of sockmap APIs can be found in: + +- `tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`_ +- `tools/testing/selftests/bpf/test_sockmap.c`_ +- `tools/testing/selftests/bpf/test_maps.c`_ + +The following code sample shows how to create a sockmap, attach a parser and +verdict program, as well as add a socket entry. + +.. code-block:: c + + int create_sample_sockmap(int sock, int parse_prog_fd, int verdict_prog_fd) + { + int index = 0; + int map, err; + + map = bpf_map_create(BPF_MAP_TYPE_SOCKMAP, NULL, sizeof(int), sizeof(int), 1, NULL); + if (map < 0) { + fprintf(stderr, "Failed to create sockmap: %s\n", strerror(errno)); + return -1; + } + + err = bpf_prog_attach(parse_prog_fd, map, BPF_SK_SKB_STREAM_PARSER, 0); + if (err){ + fprintf(stderr, "Failed to attach_parser_prog_to_map: %s\n", strerror(errno)); + goto out; + } + + err = bpf_prog_attach(verdict_prog_fd, map, BPF_SK_SKB_STREAM_VERDICT, 0); + if (err){ + fprintf(stderr, "Failed to attach_verdict_prog_to_map: %s\n", strerror(errno)); + goto out; + } + + err = bpf_map_update_elem(map, &index, &sock, BPF_NOEXIST); + if (err) { + fprintf(stderr, "Failed to update sockmap: %s\n", strerror(errno)); + goto out; + } + + out: + close(map); + return err; + } + +References +=========== + +- https://github.com/jrfastab/linux-kernel-xdp/commit/c89fd73cb9d2d7f3c716c3e00836f07b1aeb261f +- https://lwn.net/Articles/731133/ +- http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf +- 
https://lwn.net/Articles/748628/ +- https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com/ + +.. _`tools/testing/selftests/bpf/progs/test_sockmap_kern.h`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_kern.h +.. _`tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c +.. _`tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c +.. _`tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c +.. _`tools/testing/selftests/bpf/test_sockmap.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_sockmap.c +.. _`tools/testing/selftests/bpf/test_maps.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_maps.c +.. _`tools/testing/selftests/bpf/progs/test_sockmap_listen.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_listen.c +.. _`tools/testing/selftests/bpf/progs/test_sockmap_update.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_update.c diff --git a/Documentation/bpf/map_xskmap.rst b/Documentation/bpf/map_xskmap.rst index 7093b8208451..dc143edd9233 100644 --- a/Documentation/bpf/map_xskmap.rst +++ b/Documentation/bpf/map_xskmap.rst @@ -178,7 +178,7 @@ The following code snippet shows how to update an XSKMAP with an XSK entry. 
For an example on how to create AF_XDP sockets, please see the AF_XDP-example and AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository. -For a detailed explaination of the AF_XDP interface please see: +For a detailed explanation of the AF_XDP interface please see: - `libxdp-readme`_. - `AF_XDP`_ kernel documentation. diff --git a/Documentation/bpf/other.rst b/Documentation/bpf/other.rst index 3d61963403b4..7e6b12018802 100644 --- a/Documentation/bpf/other.rst +++ b/Documentation/bpf/other.rst @@ -6,4 +6,5 @@ Other :maxdepth: 1 ringbuf - llvm_reloc
\ No newline at end of file + llvm_reloc + graph_ds_impl diff --git a/Documentation/bpf/ringbuf.rst b/Documentation/bpf/ringbuf.rst index 6a615cd62bda..a99cd05d79d4 100644 --- a/Documentation/bpf/ringbuf.rst +++ b/Documentation/bpf/ringbuf.rst @@ -124,7 +124,7 @@ buffer. Currently 4 are supported: - ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer; - ``BPF_RB_RING_SIZE`` returns the size of ring buffer; -- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical possition +- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical position of consumer/producer, respectively. Returned values are momentarily snapshots of ring buffer state and could be @@ -146,7 +146,7 @@ Design and Implementation This reserve/commit schema allows a natural way for multiple producers, either on different CPUs or even on the same CPU/in the same BPF program, to reserve independent records and work with them without blocking other producers. This -means that if BPF program was interruped by another BPF program sharing the +means that if BPF program was interrupted by another BPF program sharing the same ring buffer, they will both get a record reserved (provided there is enough space left) and can work with it and submit it independently. This applies to NMI context as well, except that due to using a spinlock during diff --git a/Documentation/bpf/verifier.rst b/Documentation/bpf/verifier.rst index d4326caf01f9..f0ec19db301c 100644 --- a/Documentation/bpf/verifier.rst +++ b/Documentation/bpf/verifier.rst @@ -192,7 +192,7 @@ checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. As well as range-checking, the tracked information is also used for enforcing alignment of pointer accesses. For instance, on most systems the packet pointer is 2 bytes after a 4-byte alignment. 
If a program adds 14 bytes to that to jump -over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting +over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through that pointer are safe. @@ -316,6 +316,301 @@ Pruning considers not only the registers but also the stack (and any spilled registers it may hold). They must all be safe for the branch to be pruned. This is implemented in states_equal(). +Some technical details about state pruning implementation could be found below. + +Register liveness tracking +-------------------------- + +In order to make state pruning effective, liveness state is tracked for each +register and stack slot. The basic idea is to track which registers and stack +slots are actually used during subsequent execution of the program, until +program exit is reached. Registers and stack slots that were never used could be +removed from the cached state thus making more states equivalent to a cached +state. This could be illustrated by the following program:: + + 0: call bpf_get_prandom_u32() + 1: r1 = 0 + 2: if r0 == 0 goto +1 + 3: r0 = 1 + --- checkpoint --- + 4: r0 = r1 + 5: exit + +Suppose that a state cache entry is created at instruction #4 (such entries are +also called "checkpoints" in the text below). The verifier could reach the +instruction with one of two possible register states: + +* r0 = 1, r1 = 0 +* r0 = 0, r1 = 0 + +However, only the value of register ``r1`` is important to successfully finish +verification. The goal of the liveness tracking algorithm is to spot this fact +and figure out that both states are actually equivalent.
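The idea can be sketched with a toy model in C. This is only an illustration under simplifying assumptions, not verifier code: ``struct toy_state``, ``toy_states_equal()`` and the per-register ``live`` flag are hypothetical names, and each register is modelled as a single known constant. A cached state matches the current one when every register that is still read after the checkpoint holds the same value; registers that are never read again are skipped.

```c
#include <stdbool.h>

#define TOY_NUM_REGS 2	/* r0 and r1 only, as in the program above */

/* Hypothetical, heavily simplified stand-in for a verifier state. */
struct toy_state {
	long regs[TOY_NUM_REGS];	/* known constant value of each register */
	bool live[TOY_NUM_REGS];	/* true if read before being overwritten */
};

/*
 * Compare the current state against a cached one, looking only at the
 * registers that the cached state marks as live.
 */
static bool toy_states_equal(const struct toy_state *cached,
			     const struct toy_state *cur)
{
	for (int i = 0; i < TOY_NUM_REGS; i++) {
		if (!cached->live[i])
			continue;	/* dead register: its value is irrelevant */
		if (cached->regs[i] != cur->regs[i])
			return false;
	}
	return true;
}
```

With r0 dead and r1 live at instruction #4, the states ``{ r0 = 1, r1 = 0 }`` and ``{ r0 = 0, r1 = 0 }`` compare equal in this model, so the second path would not need to be explored again.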
+ +Data structures +~~~~~~~~~~~~~~~ + +Liveness is tracked using the following data structures:: + + enum bpf_reg_liveness { + REG_LIVE_NONE = 0, + REG_LIVE_READ32 = 0x1, + REG_LIVE_READ64 = 0x2, + REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64, + REG_LIVE_WRITTEN = 0x4, + REG_LIVE_DONE = 0x8, + }; + + struct bpf_reg_state { + ... + struct bpf_reg_state *parent; + ... + enum bpf_reg_liveness live; + ... + }; + + struct bpf_stack_state { + struct bpf_reg_state spilled_ptr; + ... + }; + + struct bpf_func_state { + struct bpf_reg_state regs[MAX_BPF_REG]; + ... + struct bpf_stack_state *stack; + }; + + struct bpf_verifier_state { + struct bpf_func_state *frame[MAX_CALL_FRAMES]; + struct bpf_verifier_state *parent; + ... + }; + +* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new + verifier state creation; + +* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is + defined by some instruction verified between this verifier state's parent and + verifier state itself; + +* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot) + is read by some child state of this verifier state; + +* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid + processing the same verifier state multiple times and for some sanity checks; + +* ``->live`` field values are formed by combining ``enum bpf_reg_liveness`` + values using bitwise or. + +Register parentage chains +~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to propagate information between parent and child states, a *register +parentage chain* is established. Each register or stack slot is linked to a +corresponding register or stack slot in its parent state via a ``->parent`` +pointer. This link is established upon state creation in ``is_state_visited()`` +and might be modified by ``set_callee_state()`` called from +``__check_func_call()``.
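As a rough illustration of how the ``->live`` flags and ``->parent`` links work together, the toy C model below walks a parentage chain and sets read marks on parent states until a write mark stops the walk. It is a sketch, not the kernel implementation: ``struct toy_reg``, the ``TOY_LIVE_*`` constants and ``toy_mark_reg_read()`` are hypothetical names, the 32-bit/64-bit read distinction is dropped, and only a single register is modelled.

```c
#include <stddef.h>

/* Hypothetical, reduced version of enum bpf_reg_liveness. */
enum toy_liveness {
	TOY_LIVE_NONE    = 0,
	TOY_LIVE_READ    = 0x2,
	TOY_LIVE_WRITTEN = 0x4,
};

/* One register as seen in one verifier state. */
struct toy_reg {
	struct toy_reg *parent;	/* same register in the parent state */
	int live;		/* bitwise or of TOY_LIVE_* values */
};

/*
 * Record that the register is read in 'state': put read marks on the
 * parent states, stopping at a state that wrote the register or at a
 * parent that already carries a read mark.
 */
static void toy_mark_reg_read(struct toy_reg *state)
{
	struct toy_reg *parent = state->parent;

	while (parent) {
		if (state->live & TOY_LIVE_WRITTEN)
			break;	/* value was defined here, older states are screened off */
		if (parent->live & TOY_LIVE_READ)
			break;	/* already marked, nothing new to propagate */
		parent->live |= TOY_LIVE_READ;
		state = parent;
		parent = state->parent;
	}
}
```

In this model a read in a descendant state marks every ancestor up to and including the state that carries the write mark, and leaves older states untouched.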
+ +The rules for correspondence between registers / stack slots are as follows: + +* For the current stack frame, registers and stack slots of the new state are + linked to the registers and stack slots of the parent state with the same + indices. + +* For the outer stack frames, only callee saved registers (r6-r9) and stack + slots are linked to the registers and stack slots of the parent state with the + same indices. + +* When a function call is processed, a new ``struct bpf_func_state`` instance is + allocated; it encapsulates a new set of registers and stack slots. For this + new frame, parent links for r6-r9 and stack slots are set to nil, parent links + for r1-r5 are set to match caller r1-r5 parent links. + +This could be illustrated by the following diagram (arrows stand for +``->parent`` pointers):: + + ... ; Frame #0, some instructions + --- checkpoint #0 --- + 1 : r6 = 42 ; Frame #0 + --- checkpoint #1 --- + 2 : call foo() ; Frame #0 + ... ; Frame #1, instructions from foo() + --- checkpoint #2 --- + ... ; Frame #1, instructions from foo() + --- checkpoint #3 --- + exit ; Frame #1, return from foo() + 3 : r1 = r6 ; Frame #0 <- current state + + +-------------------------------+-------------------------------+ + | Frame #0 | Frame #1 | + Checkpoint +-------------------------------+-------------------------------+ + #0 | r0 | r1-r5 | r6-r9 | fp-8 ... | + +-------------------------------+ + ^ ^ ^ ^ + | | | | + Checkpoint +-------------------------------+ + #1 | r0 | r1-r5 | r6-r9 | fp-8 ... | + +-------------------------------+ + ^ ^ ^ + |_______|_______|_______________ + | | | + nil nil | | | nil nil + | | | | | | | + Checkpoint +-------------------------------+-------------------------------+ + #2 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ...
| + +-------------------------------+-------------------------------+ + ^ ^ ^ ^ ^ + nil nil | | | | | + | | | | | | | + Checkpoint +-------------------------------+-------------------------------+ + #3 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... | + +-------------------------------+-------------------------------+ + ^ ^ + nil nil | | + | | | | + Current +-------------------------------+ + state | r0 | r1-r5 | r6-r9 | fp-8 ... | + +-------------------------------+ + \ + r6 read mark is propagated via these links + all the way up to checkpoint #1. + The checkpoint #1 contains a write mark for r6 + because of instruction (1), thus read propagation + does not reach checkpoint #0 (see section below). + +Liveness marks tracking +~~~~~~~~~~~~~~~~~~~~~~~ + +For each processed instruction, the verifier tracks read and written registers +and stack slots. The main idea of the algorithm is that read marks propagate +back along the state parentage chain until they hit a write mark, which 'screens +off' earlier states from the read. The information about reads is propagated by +function ``mark_reg_read()`` which could be summarized as follows:: + + mark_reg_read(struct bpf_reg_state *state, ...): + parent = state->parent + while parent: + if state->live & REG_LIVE_WRITTEN: + break + if parent->live & REG_LIVE_READ64: + break + parent->live |= REG_LIVE_READ64 + state = parent + parent = state->parent + +Notes: + +* The read marks are applied to the **parent** state while write marks are + applied to the **current** state. The write mark on a register or stack slot + means that it is updated by some instruction in the straight-line code leading + from the parent state to the current state. + +* Details about REG_LIVE_READ32 are omitted. + +* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`) + might override the first parent link. 
Please refer to the comments in the + ``propagate_liveness()`` and ``mark_reg_read()`` source code for further + details. + +Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are +applied conservatively: stack slots are marked as written only if write size +corresponds to the size of the register, e.g. see function ``save_register_state()``. + +Consider the following example:: + + 0: (*u64)(r10 - 8) = 0 ; define 8 bytes of fp-8 + --- checkpoint #0 --- + 1: (*u32)(r10 - 8) = 1 ; redefine lower 4 bytes + 2: r1 = (*u32)(r10 - 8) ; read lower 4 bytes defined at (1) + 3: r2 = (*u32)(r10 - 4) ; read upper 4 bytes defined at (0) + +As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should +it be otherwise, the algorithm above wouldn't be able to propagate the read mark +from (3) to checkpoint #0. + +Once the ``BPF_EXIT`` instruction is reached ``update_branch_counts()`` is +called to update the ``->branches`` counter for each verifier state in a chain +of parent verifier states. When the ``->branches`` counter reaches zero the +verifier state becomes a valid entry in a set of cached verifier states. + +Each entry of the verifier states cache is post-processed by a function +``clean_live_states()``. This function marks all registers and stack slots +without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``. +Registers/stack slots marked in this way are ignored in function ``stacksafe()`` +called from ``states_equal()`` when a state cache entry is considered for +equivalence with a current state. + +Now it is possible to explain how the example from the beginning of the section +works:: + + 0: call bpf_get_prandom_u32() + 1: r1 = 0 + 2: if r0 == 0 goto +1 + 3: r0 = 1 + --- checkpoint[0] --- + 4: r0 = r1 + 5: exit + +* At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }`` + is pushed to states processing queue (pc stands for program counter). 
+ +* At instruction #4: + + * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``; + * ``checkpoint[0].r0`` is marked as written; + * ``checkpoint[0].r1`` is marked as read; + +* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed + by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a + read mark and all other registers and stack slots are marked as ``NOT_INIT`` + or ``STACK_INVALID``. + +* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue + and is compared against a cached state ``{ r1 == 0, pc == 4 }``; the states + are considered equivalent. + +.. _read_marks_for_cache_hits: + +Read marks propagation for cache hits +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Another point is the handling of read marks when a previously verified state is +found in the states cache. Upon a cache hit the verifier must behave in the same way +as if the current state was verified to the program exit. This means that all +read marks, present on registers and stack slots of the cached state, must be +propagated over the parentage chain of the current state. The example below shows +why this is important. Function ``propagate_liveness()`` handles this case. + +Consider the following state parentage chain (S is a starting state, A-E are +derived states, -> arrows show which state is derived from which):: + + r1 read + <------------- A[r1] == 0 + C[r1] == 0 + S ---> A ---> B ---> exit E[r1] == 1 + | + ` ---> C ---> D + | + ` ---> E ^ + |___ suppose all these + ^ states are at insn #Y + | + suppose all these + states are at insn #X + +* Chain of states ``S -> A -> B -> exit`` is verified first. + +* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is + propagated up to state ``A``. + +* When chain of states ``C -> D`` is verified the state ``D`` turns out to be + equivalent to state ``B``.
+ +* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state + ``C`` might get mistakenly marked as equivalent to state ``E`` even though + values for register ``r1`` differ between ``C`` and ``E``. + Understanding eBPF verifier messages ==================================== diff --git a/Documentation/conf.py b/Documentation/conf.py index d927737e3c10..8b4e5451a02d 100644 --- a/Documentation/conf.py +++ b/Documentation/conf.py @@ -116,6 +116,9 @@ if major >= 3: # include/linux/linkage.h: "asmlinkage", + + # include/linux/btf.h + "__bpf_kfunc", ] else: diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 77eb775b8b42..7a3a08d81f11 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -127,6 +127,7 @@ Documents that don't fit elsewhere or which have yet to be categorized. :maxdepth: 1 librs + netlink .. only:: subproject and html diff --git a/Documentation/core-api/netlink.rst b/Documentation/core-api/netlink.rst new file mode 100644 index 000000000000..e4a938a05cc9 --- /dev/null +++ b/Documentation/core-api/netlink.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: BSD-3-Clause + +.. _kernel_netlink: + +=================================== +Netlink notes for kernel developers +=================================== + +General guidance +================ + +Attribute enums +--------------- + +Older families often define "null" attributes and commands with value +of ``0`` and named ``unspec``. This is supported (``type: unused``) +but should be avoided in new families. The ``unspec`` enum values are +not used in practice, so just set the value of the first attribute to ``1``. + +Message enums +------------- + +Use the same command IDs for requests and replies. This makes it easier +to match them up, and we have plenty of ID space. + +Use separate command IDs for notifications. 
This makes it easier to +sort the notifications from replies (and present them to the user +application via a different API than replies). + +Answer requests +--------------- + +Older families do not reply to all of the commands, especially NEW / ADD +commands. User only gets information whether the operation succeeded or +not via the ACK. Try to find useful data to return. Once the command is +added whether it replies with a full message or only an ACK is uAPI and +cannot be changed. It's better to err on the side of replying. + +Specifically NEW and ADD commands should reply with information identifying +the created object such as the allocated object's ID (without having to +resort to using ``NLM_F_ECHO``). + +NLM_F_ECHO +---------- + +Make sure to pass the request info to genl_notify() to allow ``NLM_F_ECHO`` +to take effect. This is useful for programs that need precise feedback +from the kernel (for example for logging purposes). + +Support dump consistency +------------------------ + +If iterating over objects during dump may skip over objects or repeat +them - make sure to report dump inconsistency with ``NLM_F_DUMP_INTR``. +This is usually implemented by maintaining a generation id for the +structure and recording it in the ``seq`` member of struct netlink_callback. + +Netlink specification +===================== + +Documentation of the Netlink specification parts which are only relevant +to the kernel space. + +Globals +------- + +kernel-policy +~~~~~~~~~~~~~ + +Defines if the kernel validation policy is per operation (``per-op``) +or for the entire family (``global``). New families should use ``per-op`` +(default) to be able to narrow down the attributes accepted by a specific +command. + +checks +------ + +Documentation for the ``checks`` sub-sections of attribute specs. + +unterminated-ok +~~~~~~~~~~~~~~~ + +Accept strings without the null-termination (for legacy families only). +Switches from the ``NLA_NUL_STRING`` to ``NLA_STRING`` policy type. 
+ +max-len +~~~~~~~ + +Defines max length for a binary or string attribute (corresponding +to the ``len`` member of struct nla_policy). For string attributes terminating +null character is not counted towards ``max-len``. + +The field may either be a literal integer value or a name of a defined +constant. String types may reduce the constant by one +(i.e. specify ``max-len: CONST - 1``) to reserve space for the terminating +character so implementations should recognize such pattern. + +min-len +~~~~~~~ + +Similar to ``max-len`` but defines minimum length. diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst index d8c341fe383e..3ed13bc9a195 100644 --- a/Documentation/core-api/packing.rst +++ b/Documentation/core-api/packing.rst @@ -161,6 +161,6 @@ xxx_packing() that calls it using the proper QUIRK_* one-hot bits set. The packing() function returns an int-encoded error code, which protects the programmer against incorrect API use. The errors are not expected to occur -durring runtime, therefore it is reasonable for xxx_packing() to return void +during runtime, therefore it is reasonable for xxx_packing() to return void and simply swallow those errors. Optionally it can dump stack or print the error description. 
diff --git a/Documentation/devicetree/bindings/mfd/mscc,ocelot.yaml b/Documentation/devicetree/bindings/mfd/mscc,ocelot.yaml index 1d1fee1a16c1..8bd1abfc44d9 100644 --- a/Documentation/devicetree/bindings/mfd/mscc,ocelot.yaml +++ b/Documentation/devicetree/bindings/mfd/mscc,ocelot.yaml @@ -57,6 +57,15 @@ patternProperties: enum: - mscc,ocelot-miim + "^ethernet-switch@[0-9a-f]+$": + type: object + $ref: /schemas/net/mscc,vsc7514-switch.yaml + unevaluatedProperties: false + properties: + compatible: + enum: + - mscc,vsc7512-switch + required: - compatible - reg diff --git a/Documentation/devicetree/bindings/net/amlogic,g12a-mdio-mux.yaml b/Documentation/devicetree/bindings/net/amlogic,g12a-mdio-mux.yaml new file mode 100644 index 000000000000..ec5c038ce6a0 --- /dev/null +++ b/Documentation/devicetree/bindings/net/amlogic,g12a-mdio-mux.yaml @@ -0,0 +1,80 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/amlogic,g12a-mdio-mux.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: MDIO bus multiplexer/glue of Amlogic G12a SoC family + +description: + This is a special case of a MDIO bus multiplexer. It allows to choose between + the internal mdio bus leading to the embedded 10/100 PHY or the external + MDIO bus. 
+ +maintainers: + - Neil Armstrong <neil.armstrong@linaro.org> + +allOf: + - $ref: mdio-mux.yaml# + +properties: + compatible: + const: amlogic,g12a-mdio-mux + + reg: + maxItems: 1 + + clocks: + items: + - description: peripheral clock + - description: platform crystal + - description: SoC 50MHz MPLL + + clock-names: + items: + - const: pclk + - const: clkin0 + - const: clkin1 + +required: + - compatible + - reg + - clocks + - clock-names + +unevaluatedProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + mdio-multiplexer@4c000 { + compatible = "amlogic,g12a-mdio-mux"; + reg = <0x4c000 0xa4>; + clocks = <&clkc_eth_phy>, <&xtal>, <&clkc_mpll>; + clock-names = "pclk", "clkin0", "clkin1"; + mdio-parent-bus = <&mdio0>; + #address-cells = <1>; + #size-cells = <0>; + + mdio@0 { + reg = <0>; + #address-cells = <1>; + #size-cells = <0>; + }; + + mdio@1 { + reg = <1>; + #address-cells = <1>; + #size-cells = <0>; + + ethernet-phy@8 { + compatible = "ethernet-phy-id0180.3301", + "ethernet-phy-ieee802.3-c22"; + interrupts = <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>; + reg = <8>; + max-speed = <100>; + }; + }; + }; +... diff --git a/Documentation/devicetree/bindings/net/amlogic,gxl-mdio-mux.yaml b/Documentation/devicetree/bindings/net/amlogic,gxl-mdio-mux.yaml new file mode 100644 index 000000000000..27ae004dbea0 --- /dev/null +++ b/Documentation/devicetree/bindings/net/amlogic,gxl-mdio-mux.yaml @@ -0,0 +1,64 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/amlogic,gxl-mdio-mux.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Amlogic GXL MDIO bus multiplexer + +maintainers: + - Jerome Brunet <jbrunet@baylibre.com> + +description: + This is a special case of a MDIO bus multiplexer.
It allows to choose between + the internal mdio bus leading to the embedded 10/100 PHY or the external + MDIO bus on the Amlogic GXL SoC family. + +allOf: + - $ref: mdio-mux.yaml# + +properties: + compatible: + const: amlogic,gxl-mdio-mux + + reg: + maxItems: 1 + + clocks: + maxItems: 1 + + clock-names: + items: + - const: ref + +required: + - compatible + - reg + - clocks + - clock-names + +unevaluatedProperties: false + +examples: + - | + eth_phy_mux: mdio@558 { + compatible = "amlogic,gxl-mdio-mux"; + reg = <0x558 0xc>; + #address-cells = <1>; + #size-cells = <0>; + clocks = <&refclk>; + clock-names = "ref"; + mdio-parent-bus = <&mdio0>; + + external_mdio: mdio@0 { + reg = <0x0>; + #address-cells = <1>; + #size-cells = <0>; + }; + + internal_mdio: mdio@1 { + reg = <0x1>; + #address-cells = <1>; + #size-cells = <0>; + }; + }; diff --git a/Documentation/devicetree/bindings/net/asix,ax88796c.yaml b/Documentation/devicetree/bindings/net/asix,ax88796c.yaml index 699ebf452479..164d1ff9e83c 100644 --- a/Documentation/devicetree/bindings/net/asix,ax88796c.yaml +++ b/Documentation/devicetree/bindings/net/asix,ax88796c.yaml @@ -19,6 +19,7 @@ description: | allOf: - $ref: ethernet-controller.yaml# + - $ref: /schemas/spi/spi-peripheral-props.yaml properties: compatible: @@ -39,8 +40,8 @@ properties: it should be marked GPIO_ACTIVE_LOW. 
maxItems: 1 + controller-data: true local-mac-address: true - mac-address: true required: diff --git a/Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml b/Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml index 1eb98c9a1a26..d3f45d29fa0a 100644 --- a/Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml +++ b/Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml @@ -30,13 +30,17 @@ properties: - items: - enum: + - renesas,r8a779a0-canfd # R-Car V3U + - renesas,r8a779g0-canfd # R-Car V4H + - const: renesas,rcar-gen4-canfd # R-Car Gen4 + + - items: + - enum: - renesas,r9a07g043-canfd # RZ/G2UL and RZ/Five - renesas,r9a07g044-canfd # RZ/G2{L,LC} - renesas,r9a07g054-canfd # RZ/V2L - const: renesas,rzg2l-canfd # RZ/G2L family - - const: renesas,r8a779a0-canfd # R-Car V3U - reg: maxItems: 1 @@ -60,7 +64,7 @@ properties: $ref: /schemas/types.yaml#/definitions/flag description: The controller can operate in either CAN FD only mode (default) or - Classical CAN only mode. The mode is global to both the channels. + Classical CAN only mode. The mode is global to all channels. Specify this property to put the controller in Classical CAN only mode. assigned-clocks: @@ -80,6 +84,10 @@ patternProperties: The controller supports multiple channels and each is represented as a child node. Each channel can be enabled/disabled individually. 
+ properties: + phys: + maxItems: 1 + additionalProperties: false required: @@ -159,7 +167,7 @@ allOf: properties: compatible: contains: - const: renesas,r8a779a0-canfd + const: renesas,rcar-gen4-canfd then: patternProperties: "^channel[2-7]$": false diff --git a/Documentation/devicetree/bindings/net/dsa/arrow,xrs700x.yaml b/Documentation/devicetree/bindings/net/dsa/arrow,xrs700x.yaml index 2a6d126606ca..9565a7402146 100644 --- a/Documentation/devicetree/bindings/net/dsa/arrow,xrs700x.yaml +++ b/Documentation/devicetree/bindings/net/dsa/arrow,xrs700x.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Arrow SpeedChips XRS7000 Series Switch allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports maintainers: - George McCollister <george.mccollister@gmail.com> diff --git a/Documentation/devicetree/bindings/net/dsa/brcm,b53.yaml b/Documentation/devicetree/bindings/net/dsa/brcm,b53.yaml index 1219b830b1a4..5bef4128d175 100644 --- a/Documentation/devicetree/bindings/net/dsa/brcm,b53.yaml +++ b/Documentation/devicetree/bindings/net/dsa/brcm,b53.yaml @@ -66,7 +66,7 @@ required: - reg allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports - if: properties: compatible: diff --git a/Documentation/devicetree/bindings/net/dsa/brcm,sf2.yaml b/Documentation/devicetree/bindings/net/dsa/brcm,sf2.yaml index d159ac78cec1..eed16e216fb6 100644 --- a/Documentation/devicetree/bindings/net/dsa/brcm,sf2.yaml +++ b/Documentation/devicetree/bindings/net/dsa/brcm,sf2.yaml @@ -85,11 +85,16 @@ properties: ports: type: object - properties: - brcm,use-bcm-hdr: - description: if present, indicates that the switch port has Broadcom - tags enabled (per-packet metadata) - type: boolean + patternProperties: + '^port@[0-9a-f]$': + $ref: dsa-port.yaml# + unevaluatedProperties: false + + properties: + brcm,use-bcm-hdr: + description: if present, indicates that the switch port has Broadcom + tags enabled (per-packet metadata) + type: boolean 
required: - reg diff --git a/Documentation/devicetree/bindings/net/dsa/dsa-port.yaml b/Documentation/devicetree/bindings/net/dsa/dsa-port.yaml index b173fceb8998..480120469953 100644 --- a/Documentation/devicetree/bindings/net/dsa/dsa-port.yaml +++ b/Documentation/devicetree/bindings/net/dsa/dsa-port.yaml @@ -4,18 +4,19 @@ $id: http://devicetree.org/schemas/net/dsa/dsa-port.yaml# $schema: http://devicetree.org/meta-schemas/core.yaml# -title: Ethernet Switch port +title: Generic DSA Switch Port maintainers: - Andrew Lunn <andrew@lunn.ch> - Florian Fainelli <f.fainelli@gmail.com> - - Vivien Didelot <vivien.didelot@gmail.com> + - Vladimir Oltean <olteanv@gmail.com> description: - Ethernet switch port Description + A DSA switch port is a component of a switch that manages one MAC, and can + pass Ethernet frames. It can act as a standard Ethernet switch port, or have + DSA-specific functionality. -allOf: - - $ref: /schemas/net/ethernet-controller.yaml# +$ref: /schemas/net/ethernet-switch-port.yaml# properties: reg: @@ -58,25 +59,6 @@ properties: - rtl8_4t - seville - phy-handle: true - - phy-mode: true - - fixed-link: true - - mac-address: true - - sfp: true - - managed: true - - rx-internal-delay-ps: true - - tx-internal-delay-ps: true - -required: - - reg - # CPU and DSA ports must have phylink-compatible link descriptions if: oneOf: diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.yaml b/Documentation/devicetree/bindings/net/dsa/dsa.yaml index 5469ae8a4389..8d971813bab6 100644 --- a/Documentation/devicetree/bindings/net/dsa/dsa.yaml +++ b/Documentation/devicetree/bindings/net/dsa/dsa.yaml @@ -9,7 +9,7 @@ title: Ethernet Switch maintainers: - Andrew Lunn <andrew@lunn.ch> - Florian Fainelli <f.fainelli@gmail.com> - - Vivien Didelot <vivien.didelot@gmail.com> + - Vladimir Oltean <olteanv@gmail.com> description: This binding represents Ethernet Switches which have a dedicated CPU @@ -18,10 +18,9 @@ description: select: false -properties: - $nodename: -
pattern: "^(ethernet-)?switch(@.*)?$" +$ref: /schemas/net/ethernet-switch.yaml# +properties: dsa,member: minItems: 2 maxItems: 2 @@ -32,30 +31,28 @@ properties: (single device hanging off a CPU port) must not specify this property $ref: /schemas/types.yaml#/definitions/uint32-array -patternProperties: - "^(ethernet-)?ports$": - type: object - properties: - '#address-cells': - const: 1 - '#size-cells': - const: 0 +additionalProperties: true + +$defs: + ethernet-ports: + description: A DSA switch without any extra port properties + $ref: '#/' patternProperties: - "^(ethernet-)?port@[0-9]+$": + "^(ethernet-)?ports$": type: object - description: Ethernet switch ports - - $ref: dsa-port.yaml# - - unevaluatedProperties: false - -oneOf: - - required: - - ports - - required: - - ethernet-ports - -additionalProperties: true + additionalProperties: false + + properties: + '#address-cells': + const: 1 + '#size-cells': + const: 0 + + patternProperties: + "^(ethernet-)?port@[0-9]+$": + description: Ethernet switch ports + $ref: dsa-port.yaml# + unevaluatedProperties: false ... 
diff --git a/Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml b/Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml index 447589b01e8e..4021b054f684 100644 --- a/Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml +++ b/Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Hirschmann Hellcreek TSN Switch allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports maintainers: - Andrew Lunn <andrew@lunn.ch> diff --git a/Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml b/Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml index f2e9ff3f580b..449ee0735012 100644 --- a/Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml +++ b/Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml @@ -24,56 +24,46 @@ description: | There is only the standalone version of MT7531. - Port 5 on MT7530 has got various ways of configuration. - - For standalone MT7530: + Port 5 on MT7530 has got various ways of configuration: - Port 5 can be used as a CPU port. - - PHY 0 or 4 of the switch can be muxed to connect to the gmac of the SoC - which port 5 is wired to. Usually used for connecting the wan port - directly to the CPU to achieve 2 Gbps routing in total. + - PHY 0 or 4 of the switch can be muxed to gmac5 of the switch. Therefore, + the gmac of the SoC which is wired to port 5 can connect to the PHY. + This is usually used for connecting the wan port directly to the CPU to + achieve 2 Gbps routing in total. - The driver looks up the reg on the ethernet-phy node which the phy-handle - property refers to on the gmac node to mux the specified phy. + The driver looks up the reg on the ethernet-phy node, which the phy-handle + property on the gmac node refers to, to mux the specified phy. 
The driver requires the gmac of the SoC to have "mediatek,eth-mac" as the - compatible string and the reg must be 1. So, for now, only gmac1 of an + compatible string and the reg must be 1. So, for now, only gmac1 of a MediaTek SoC can benefit this. Banana Pi BPI-R2 suits this. - Check out example 5 for a similar configuration. - - - Port 5 can be wired to an external phy. Port 5 becomes a DSA slave. - Check out example 7 for a similar configuration. - - For multi-chip module MT7530: - - - Port 5 can be used as a CPU port. - - - PHY 0 or 4 of the switch can be muxed to connect to gmac1 of the SoC. - Usually used for connecting the wan port directly to the CPU to achieve 2 - Gbps routing in total. - - The driver looks up the reg on the ethernet-phy node which the phy-handle - property refers to on the gmac node to mux the specified phy. For the MT7621 SoCs, rgmii2 group must be claimed with rgmii2 function. + Check out example 5. - - In case of an external phy wired to gmac1 of the SoC, port 5 must not be - enabled. + - For the multi-chip module MT7530, in case of an external phy wired to + gmac1 of the SoC, port 5 must not be enabled. In case of muxing PHY 0 or 4, the external phy must not be enabled. For the MT7621 SoCs, rgmii2 group must be claimed with rgmii2 function. + Check out example 6. - - Port 5 can be muxed to an external phy. Port 5 becomes a DSA slave. - The external phy must be wired TX to TX to gmac1 of the SoC for this to - work. Ubiquiti EdgeRouter X SFP is wired this way. + - Port 5 can be wired to an external phy. Port 5 becomes a DSA slave. - Muxing PHY 0 or 4 won't work when the external phy is connected TX to TX. + For the multi-chip module MT7530, the external phy must be wired TX to TX + to gmac1 of the SoC for this to work. Ubiquiti EdgeRouter X SFP is wired + this way. + + For the multi-chip module MT7530, muxing PHY 0 or 4 won't work when the + external phy is connected TX to TX. 
For the MT7621 SoCs, rgmii2 group must be claimed with gpio function. + Check out example 7. properties: @@ -157,9 +147,6 @@ patternProperties: patternProperties: "^(ethernet-)?port@[0-9]+$": type: object - description: Ethernet switch ports - - unevaluatedProperties: false properties: reg: @@ -168,7 +155,6 @@ patternProperties: for user ports. allOf: - - $ref: dsa-port.yaml# - if: required: [ ethernet ] then: @@ -238,7 +224,7 @@ $defs: - sgmii allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports - if: required: - mediatek,mcm @@ -605,7 +591,7 @@ examples: label = "lan4"; }; - /* Commented out, phy4 is muxed to gmac1. + /* Commented out, phy4 is connected to gmac1. port@4 { reg = <4>; label = "wan"; diff --git a/Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml b/Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml index 4da75b1f9533..a4b53434c85c 100644 --- a/Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml +++ b/Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml @@ -11,7 +11,7 @@ maintainers: - Woojung Huh <Woojung.Huh@microchip.com> allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports - $ref: /schemas/spi/spi-peripheral-props.yaml# properties: diff --git a/Documentation/devicetree/bindings/net/dsa/microchip,lan937x.yaml b/Documentation/devicetree/bindings/net/dsa/microchip,lan937x.yaml index b34de303966b..8d7e878b84dc 100644 --- a/Documentation/devicetree/bindings/net/dsa/microchip,lan937x.yaml +++ b/Documentation/devicetree/bindings/net/dsa/microchip,lan937x.yaml @@ -10,7 +10,7 @@ maintainers: - UNGLinuxDriver@microchip.com allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports properties: compatible: diff --git a/Documentation/devicetree/bindings/net/dsa/mscc,ocelot.yaml b/Documentation/devicetree/bindings/net/dsa/mscc,ocelot.yaml index 347a0e1b3d3f..fe02d05196e4 100644 --- a/Documentation/devicetree/bindings/net/dsa/mscc,ocelot.yaml +++ 
b/Documentation/devicetree/bindings/net/dsa/mscc,ocelot.yaml @@ -78,7 +78,7 @@ required: - reg allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports - if: properties: compatible: diff --git a/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml b/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml index df98a16e4e75..9a64ed658745 100644 --- a/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml +++ b/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml @@ -13,7 +13,7 @@ description: depends on the SPI bus master driver. allOf: - - $ref: "dsa.yaml#" + - $ref: dsa.yaml#/$defs/ethernet-ports - $ref: /schemas/spi/spi-peripheral-props.yaml# maintainers: diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.yaml b/Documentation/devicetree/bindings/net/dsa/qca8k.yaml index 978162df51f7..389892592aac 100644 --- a/Documentation/devicetree/bindings/net/dsa/qca8k.yaml +++ b/Documentation/devicetree/bindings/net/dsa/qca8k.yaml @@ -66,15 +66,11 @@ properties: With the legacy mapping the reg corresponding to the internal mdio is the switch reg with an offset of -1. 
+$ref: "dsa.yaml#" + patternProperties: "^(ethernet-)?ports$": type: object - properties: - '#address-cells': - const: 1 - '#size-cells': - const: 0 - patternProperties: "^(ethernet-)?port@[0-6]$": type: object @@ -116,7 +112,7 @@ required: - compatible - reg -additionalProperties: true +unevaluatedProperties: false examples: - | @@ -148,8 +144,6 @@ examples: switch@10 { compatible = "qca,qca8337"; - #address-cells = <1>; - #size-cells = <0>; reset-gpios = <&gpio 42 GPIO_ACTIVE_LOW>; reg = <0x10>; @@ -209,8 +203,6 @@ examples: switch@10 { compatible = "qca,qca8337"; - #address-cells = <1>; - #size-cells = <0>; reset-gpios = <&gpio 42 GPIO_ACTIVE_LOW>; reg = <0x10>; diff --git a/Documentation/devicetree/bindings/net/dsa/realtek.yaml b/Documentation/devicetree/bindings/net/dsa/realtek.yaml index 1a7d45a8ad66..cfd69c2604ea 100644 --- a/Documentation/devicetree/bindings/net/dsa/realtek.yaml +++ b/Documentation/devicetree/bindings/net/dsa/realtek.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Realtek switches for unmanaged switches allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports maintainers: - Linus Walleij <linus.walleij@linaro.org> diff --git a/Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml b/Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml index 0a0d62b6c00e..833d2f68daa1 100644 --- a/Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml +++ b/Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml @@ -14,7 +14,7 @@ description: | handles 4 ports + 1 CPU management port. 
allOf: - - $ref: dsa.yaml# + - $ref: dsa.yaml#/$defs/ethernet-ports properties: compatible: diff --git a/Documentation/devicetree/bindings/net/ethernet-switch-port.yaml b/Documentation/devicetree/bindings/net/ethernet-switch-port.yaml new file mode 100644 index 000000000000..d5cf7e40e3c3 --- /dev/null +++ b/Documentation/devicetree/bindings/net/ethernet-switch-port.yaml @@ -0,0 +1,26 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/ethernet-switch-port.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Generic Ethernet Switch Port + +maintainers: + - Andrew Lunn <andrew@lunn.ch> + - Florian Fainelli <f.fainelli@gmail.com> + - Vladimir Oltean <olteanv@gmail.com> + +description: + An Ethernet switch port is a component of a switch that manages one MAC, and + can pass Ethernet frames. + +$ref: ethernet-controller.yaml# + +properties: + reg: + description: Port number + +additionalProperties: true + +... diff --git a/Documentation/devicetree/bindings/net/ethernet-switch.yaml b/Documentation/devicetree/bindings/net/ethernet-switch.yaml new file mode 100644 index 000000000000..a04f8ef744aa --- /dev/null +++ b/Documentation/devicetree/bindings/net/ethernet-switch.yaml @@ -0,0 +1,62 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/ethernet-switch.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Generic Ethernet Switch + +maintainers: + - Andrew Lunn <andrew@lunn.ch> + - Florian Fainelli <f.fainelli@gmail.com> + - Vladimir Oltean <olteanv@gmail.com> + +description: + Ethernet switches are multi-port Ethernet controllers. Each port has + its own number and is represented as its own Ethernet controller. + The minimum required functionality is to pass packets to software. + They may or may not be able to forward packets automonously between + ports. 
+ +select: false + +properties: + $nodename: + pattern: "^(ethernet-)?switch(@.*)?$" + +patternProperties: + "^(ethernet-)?ports$": + type: object + unevaluatedProperties: false + + properties: + '#address-cells': + const: 1 + '#size-cells': + const: 0 + + patternProperties: + "^(ethernet-)?port@[0-9]+$": + type: object + description: Ethernet switch ports + +oneOf: + - required: + - ports + - required: + - ethernet-ports + +additionalProperties: true + +$defs: + base: + description: An ethernet switch without any extra port properties + $ref: '#/' + + patternProperties: + "^(ethernet-)?port@[0-9]+$": + description: Ethernet switch ports + $ref: ethernet-switch-port.yaml# + unevaluatedProperties: false + +... diff --git a/Documentation/devicetree/bindings/net/fsl,fec.yaml b/Documentation/devicetree/bindings/net/fsl,fec.yaml index 77e5f32cb62f..e6f2045f05de 100644 --- a/Documentation/devicetree/bindings/net/fsl,fec.yaml +++ b/Documentation/devicetree/bindings/net/fsl,fec.yaml @@ -51,6 +51,7 @@ properties: - fsl,imx8mm-fec - fsl,imx8mn-fec - fsl,imx8mp-fec + - fsl,imx93-fec - const: fsl,imx8mq-fec - const: fsl,imx6sx-fec - items: diff --git a/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml b/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml new file mode 100644 index 000000000000..d71fa9de2b64 --- /dev/null +++ b/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml @@ -0,0 +1,47 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/maxlinear,gpy2xx.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: MaxLinear GPY2xx PHY + +maintainers: + - Andrew Lunn <andrew@lunn.ch> + - Michael Walle <michael@walle.cc> + +allOf: + - $ref: ethernet-phy.yaml# + +properties: + maxlinear,use-broken-interrupts: + description: | + Interrupts are broken on some GPY2xx PHYs in that they keep the + interrupt line asserted even after the interrupt status register is + cleared. 
Thus it is blocking the interrupt line which is usually bad + for shared lines. By default interrupts are disabled for this PHY and + polling mode is used. If one can live with the consequences, this + property can be used to enable interrupt handling. + + Affected PHYs (as far as known) are GPY215B and GPY215C. + type: boolean + +dependencies: + maxlinear,use-broken-interrupts: [ interrupts ] + +unevaluatedProperties: false + +examples: + - | + ethernet { + #address-cells = <1>; + #size-cells = <0>; + + ethernet-phy@0 { + reg = <0>; + interrupts-extended = <&intc 0>; + maxlinear,use-broken-interrupts; + }; + }; + +... diff --git a/Documentation/devicetree/bindings/net/mdio-mux-meson-g12a.txt b/Documentation/devicetree/bindings/net/mdio-mux-meson-g12a.txt deleted file mode 100644 index 3a96cbed9294..000000000000 --- a/Documentation/devicetree/bindings/net/mdio-mux-meson-g12a.txt +++ /dev/null @@ -1,48 +0,0 @@ -Properties for the MDIO bus multiplexer/glue of Amlogic G12a SoC family. - -This is a special case of a MDIO bus multiplexer. It allows to choose between -the internal mdio bus leading to the embedded 10/100 PHY or the external -MDIO bus. - -Required properties in addition to the generic multiplexer properties: -- compatible : amlogic,g12a-mdio-mux -- reg: physical address and length of the multiplexer/glue registers -- clocks: list of clock phandle, one for each entry clock-names. -- clock-names: should contain the following: - * "pclk" : peripheral clock. 
- * "clkin0" : platform crytal - * "clkin1" : SoC 50MHz MPLL - -Example : - -mdio_mux: mdio-multiplexer@4c000 { - compatible = "amlogic,g12a-mdio-mux"; - reg = <0x0 0x4c000 0x0 0xa4>; - clocks = <&clkc CLKID_ETH_PHY>, - <&xtal>, - <&clkc CLKID_MPLL_5OM>; - clock-names = "pclk", "clkin0", "clkin1"; - mdio-parent-bus = <&mdio0>; - #address-cells = <1>; - #size-cells = <0>; - - ext_mdio: mdio@0 { - reg = <0>; - #address-cells = <1>; - #size-cells = <0>; - }; - - int_mdio: mdio@1 { - reg = <1>; - #address-cells = <1>; - #size-cells = <0>; - - internal_ephy: ethernet-phy@8 { - compatible = "ethernet-phy-id0180.3301", - "ethernet-phy-ieee802.3-c22"; - interrupts = <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>; - reg = <8>; - max-speed = <100>; - }; - }; -}; diff --git a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt index df9e844dd6bc..2681168777a1 100644 --- a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt +++ b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt @@ -158,6 +158,7 @@ KSZ9031: no link will be established. KSZ9131: +LAN8841: All skew control options are specified in picoseconds. The increment step is 100ps. Unlike KSZ9031, the values represent picoseccond delays. 
diff --git a/Documentation/devicetree/bindings/net/motorcomm,yt8xxx.yaml b/Documentation/devicetree/bindings/net/motorcomm,yt8xxx.yaml new file mode 100644 index 000000000000..157e3bbcaf6f --- /dev/null +++ b/Documentation/devicetree/bindings/net/motorcomm,yt8xxx.yaml @@ -0,0 +1,117 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/motorcomm,yt8xxx.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: MotorComm yt8xxx Ethernet PHY + +maintainers: + - Frank Sae <frank.sae@motor-comm.com> + +allOf: + - $ref: ethernet-phy.yaml# + +properties: + compatible: + enum: + - ethernet-phy-id4f51.e91a + - ethernet-phy-id4f51.e91b + + rx-internal-delay-ps: + description: | + RGMII RX Clock Delay used only when PHY operates in RGMII mode with + internal delay (phy-mode is 'rgmii-id' or 'rgmii-rxid') in pico-seconds. + enum: [ 0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1650, + 1800, 1900, 1950, 2050, 2100, 2200, 2250, 2350, 2500, 2650, 2800, + 2950, 3100, 3250, 3400, 3550, 3700, 3850, 4000, 4150 ] + default: 1950 + + tx-internal-delay-ps: + description: | + RGMII TX Clock Delay used only when PHY operates in RGMII mode with + internal delay (phy-mode is 'rgmii-id' or 'rgmii-txid') in pico-seconds. + enum: [ 0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1650, 1800, + 1950, 2100, 2250 ] + default: 1950 + + motorcomm,clk-out-frequency-hz: + description: clock output on clock output pin. + enum: [0, 25000000, 125000000] + default: 0 + + motorcomm,keep-pll-enabled: + description: | + If set, keep the PLL enabled even if there is no link. Useful if you + want to use the clock output without an ethernet link. + type: boolean + + motorcomm,auto-sleep-disabled: + description: | + If set, PHY will not enter sleep mode and close AFE after unplug cable + for a timer. 
+ type: boolean + + motorcomm,tx-clk-adj-enabled: + description: | + This configuration is mainly to adapt to VF2 with JH7110 SoC. + Useful if you want to use tx-clk-xxxx-inverted to adj the delay of tx clk. + type: boolean + + motorcomm,tx-clk-10-inverted: + description: | + Use original or inverted RGMII Transmit PHY Clock to drive the RGMII + Transmit PHY Clock delay train configuration when speed is 10Mbps. + type: boolean + + motorcomm,tx-clk-100-inverted: + description: | + Use original or inverted RGMII Transmit PHY Clock to drive the RGMII + Transmit PHY Clock delay train configuration when speed is 100Mbps. + type: boolean + + motorcomm,tx-clk-1000-inverted: + description: | + Use original or inverted RGMII Transmit PHY Clock to drive the RGMII + Transmit PHY Clock delay train configuration when speed is 1000Mbps. + type: boolean + +unevaluatedProperties: false + +examples: + - | + mdio { + #address-cells = <1>; + #size-cells = <0>; + phy-mode = "rgmii-id"; + ethernet-phy@4 { + /* Only needed to make DT lint tools work. Do not copy/paste + * into real DTS files. + */ + compatible = "ethernet-phy-id4f51.e91a"; + + reg = <4>; + rx-internal-delay-ps = <2100>; + tx-internal-delay-ps = <150>; + motorcomm,clk-out-frequency-hz = <0>; + motorcomm,keep-pll-enabled; + motorcomm,auto-sleep-disabled; + }; + }; + - | + mdio { + #address-cells = <1>; + #size-cells = <0>; + phy-mode = "rgmii"; + ethernet-phy@5 { + /* Only needed to make DT lint tools work. Do not copy/paste + * into real DTS files. 
+ */ + compatible = "ethernet-phy-id4f51.e91a"; + + reg = <5>; + motorcomm,clk-out-frequency-hz = <125000000>; + motorcomm,keep-pll-enabled; + motorcomm,auto-sleep-disabled; + }; + }; diff --git a/Documentation/devicetree/bindings/net/mscc,vsc7514-switch.yaml b/Documentation/devicetree/bindings/net/mscc,vsc7514-switch.yaml index ee0a504bdb24..8ee2c7d7ff42 100644 --- a/Documentation/devicetree/bindings/net/mscc,vsc7514-switch.yaml +++ b/Documentation/devicetree/bindings/net/mscc,vsc7514-switch.yaml @@ -18,14 +18,52 @@ description: | packets using CPU. Additionally, PTP is supported as well as FDMA for faster packet extraction/injection. -properties: - $nodename: - pattern: "^switch@[0-9a-f]+$" +allOf: + - if: + properties: + compatible: + const: mscc,vsc7514-switch + then: + $ref: ethernet-switch.yaml# + required: + - interrupts + - interrupt-names + properties: + reg: + minItems: 21 + reg-names: + minItems: 21 + ethernet-ports: + patternProperties: + "^port@[0-9a-f]+$": + $ref: ethernet-switch-port.yaml# + unevaluatedProperties: false + + - if: + properties: + compatible: + const: mscc,vsc7512-switch + then: + $ref: /schemas/net/dsa/dsa.yaml# + properties: + reg: + maxItems: 20 + reg-names: + maxItems: 20 + ethernet-ports: + patternProperties: + "^port@[0-9a-f]+$": + $ref: /schemas/net/dsa/dsa-port.yaml# + unevaluatedProperties: false +properties: compatible: - const: mscc,vsc7514-switch + enum: + - mscc,vsc7512-switch + - mscc,vsc7514-switch reg: + minItems: 20 items: - description: system target - description: rewriter target @@ -50,6 +88,7 @@ properties: - description: fdma target reg-names: + minItems: 20 items: - const: sys - const: rew @@ -87,59 +126,16 @@ properties: - const: xtr - const: fdma - ethernet-ports: - type: object - - properties: - '#address-cells': - const: 1 - '#size-cells': - const: 0 - - additionalProperties: false - - patternProperties: - "^port@[0-9a-f]+$": - type: object - description: Ethernet ports handled by the switch - - $ref: 
ethernet-controller.yaml# - - unevaluatedProperties: false - - properties: - reg: - description: Switch port number - - phy-handle: true - - phy-mode: true - - fixed-link: true - - mac-address: true - - required: - - reg - - phy-mode - - oneOf: - - required: - - phy-handle - - required: - - fixed-link - required: - compatible - reg - reg-names - - interrupts - - interrupt-names - ethernet-ports -additionalProperties: false +unevaluatedProperties: false examples: + # VSC7514 (Switchdev) - | switch@1010000 { compatible = "mscc,vsc7514-switch"; @@ -187,5 +183,51 @@ examples: }; }; }; + # VSC7512 (DSA) + - | + ethernet-switch@1{ + compatible = "mscc,vsc7512-switch"; + reg = <0x71010000 0x10000>, + <0x71030000 0x10000>, + <0x71080000 0x100>, + <0x710e0000 0x10000>, + <0x711e0000 0x100>, + <0x711f0000 0x100>, + <0x71200000 0x100>, + <0x71210000 0x100>, + <0x71220000 0x100>, + <0x71230000 0x100>, + <0x71240000 0x100>, + <0x71250000 0x100>, + <0x71260000 0x100>, + <0x71270000 0x100>, + <0x71280000 0x100>, + <0x71800000 0x80000>, + <0x71880000 0x10000>, + <0x71040000 0x10000>, + <0x71050000 0x10000>, + <0x71060000 0x10000>; + reg-names = "sys", "rew", "qs", "ptp", "port0", "port1", + "port2", "port3", "port4", "port5", "port6", + "port7", "port8", "port9", "port10", "qsys", + "ana", "s0", "s1", "s2"; + + ethernet-ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + reg = <0>; + ethernet = <&mac_sw>; + phy-handle = <&phy0>; + phy-mode = "internal"; + }; + port@1 { + reg = <1>; + phy-handle = <&phy1>; + phy-mode = "internal"; + }; + }; + }; ... 
diff --git a/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml b/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml index 04df496af7e6..63409cbff5ad 100644 --- a/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml +++ b/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml @@ -4,7 +4,7 @@ $id: http://devicetree.org/schemas/net/nxp,dwmac-imx.yaml# $schema: http://devicetree.org/meta-schemas/core.yaml# -title: NXP i.MX8 DWMAC glue layer +title: NXP i.MX8/9 DWMAC glue layer maintainers: - Clark Wang <xiaoning.wang@nxp.com> @@ -19,6 +19,7 @@ select: enum: - nxp,imx8mp-dwmac-eqos - nxp,imx8dxl-dwmac-eqos + - nxp,imx93-dwmac-eqos required: - compatible @@ -32,6 +33,7 @@ properties: - enum: - nxp,imx8mp-dwmac-eqos - nxp,imx8dxl-dwmac-eqos + - nxp,imx93-dwmac-eqos - const: snps,dwmac-5.10a clocks: diff --git a/Documentation/devicetree/bindings/net/rfkill-gpio.yaml b/Documentation/devicetree/bindings/net/rfkill-gpio.yaml new file mode 100644 index 000000000000..9630c8466fac --- /dev/null +++ b/Documentation/devicetree/bindings/net/rfkill-gpio.yaml @@ -0,0 +1,51 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/rfkill-gpio.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: GPIO controlled rfkill switch + +maintainers: + - Johannes Berg <johannes@sipsolutions.net> + - Philipp Zabel <p.zabel@pengutronix.de> + +properties: + compatible: + const: rfkill-gpio + + label: + description: rfkill switch name, defaults to node name + + radio-type: + description: rfkill radio type + enum: + - bluetooth + - fm + - gps + - nfc + - ultrawideband + - wimax + - wlan + - wwan + + shutdown-gpios: + maxItems: 1 + +required: + - compatible + - radio-type + - shutdown-gpios + +additionalProperties: false + +examples: + - | + #include <dt-bindings/gpio/gpio.h> + + rfkill { + compatible = "rfkill-gpio"; + label = "rfkill-pcie-wlan"; + radio-type = "wlan"; + shutdown-gpios = <&gpio2 25 
GPIO_ACTIVE_HIGH>; + }; diff --git a/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml b/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml index 42fb72b6909d..04936632fcbb 100644 --- a/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml @@ -49,11 +49,11 @@ properties: - rockchip,rk3368-gmac - rockchip,rk3399-gmac - rockchip,rv1108-gmac - - rockchip,rv1126-gmac - items: - enum: - rockchip,rk3568-gmac - rockchip,rk3588-gmac + - rockchip,rv1126-gmac - const: snps,dwmac-4.20a clocks: diff --git a/Documentation/devicetree/bindings/net/snps,dwmac.yaml b/Documentation/devicetree/bindings/net/snps,dwmac.yaml index e88a86623fce..16b7d2904696 100644 --- a/Documentation/devicetree/bindings/net/snps,dwmac.yaml +++ b/Documentation/devicetree/bindings/net/snps,dwmac.yaml @@ -552,7 +552,7 @@ required: dependencies: snps,reset-active-low: ["snps,reset-gpio"] - snps,reset-delay-us: ["snps,reset-gpio"] + snps,reset-delays-us: ["snps,reset-gpio"] allOf: - $ref: "ethernet-controller.yaml#" diff --git a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml index 821974815dec..900063411a20 100644 --- a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml +++ b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml @@ -57,6 +57,7 @@ properties: - ti,am654-cpsw-nuss - ti,j7200-cpswxg-nuss - ti,j721e-cpsw-nuss + - ti,j721e-cpswxg-nuss - ti,am642-cpsw-nuss reg: @@ -111,7 +112,7 @@ properties: const: 0 patternProperties: - "^port@[1-4]$": + "^port@[1-8]$": type: object description: CPSWxG NUSS external ports @@ -121,7 +122,7 @@ properties: properties: reg: minimum: 1 - maximum: 4 + maximum: 8 description: CPSW port number phys: @@ -186,12 +187,36 @@ allOf: properties: compatible: contains: - const: ti,j7200-cpswxg-nuss + const: ti,j721e-cpswxg-nuss then: properties: ethernet-ports: patternProperties: - 
"^port@[3-4]$": false + "^port@[5-8]$": false + "^port@[1-4]$": + properties: + reg: + minimum: 1 + maximum: 4 + + - if: + not: + properties: + compatible: + contains: + enum: + - ti,j721e-cpswxg-nuss + - ti,j7200-cpswxg-nuss + then: + properties: + ethernet-ports: + patternProperties: + "^port@[3-8]$": false + "^port@[1-2]$": + properties: + reg: + minimum: 1 + maximum: 2 additionalProperties: false diff --git a/Documentation/devicetree/bindings/net/ti,k3-am654-cpts.yaml b/Documentation/devicetree/bindings/net/ti,k3-am654-cpts.yaml index 6230f576134b..3e910d3b24a0 100644 --- a/Documentation/devicetree/bindings/net/ti,k3-am654-cpts.yaml +++ b/Documentation/devicetree/bindings/net/ti,k3-am654-cpts.yaml @@ -93,6 +93,14 @@ properties: description: Number of timestamp Generator function outputs (TS_GENFx) + ti,pps: + $ref: /schemas/types.yaml#/definitions/uint32-array + minItems: 2 + maxItems: 2 + description: | + The pair of HWx_TS_PUSH input and TS_GENFy output indexes used for + PPS events generation. Platform/board specific. + refclk-mux: type: object additionalProperties: false diff --git a/Documentation/devicetree/bindings/net/wireless/esp,esp8089.yaml b/Documentation/devicetree/bindings/net/wireless/esp,esp8089.yaml index 5557676e9d4b..0ea84d6fe73e 100644 --- a/Documentation/devicetree/bindings/net/wireless/esp,esp8089.yaml +++ b/Documentation/devicetree/bindings/net/wireless/esp,esp8089.yaml @@ -29,15 +29,15 @@ additionalProperties: false examples: - | - mmc { - #address-cells = <1>; - #size-cells = <0>; - - wifi@1 { - compatible = "esp,esp8089"; - reg = <1>; - esp,crystal-26M-en = <2>; - }; - }; + mmc { + #address-cells = <1>; + #size-cells = <0>; + + wifi@1 { + compatible = "esp,esp8089"; + reg = <1>; + esp,crystal-26M-en = <2>; + }; + }; ... 
diff --git a/Documentation/devicetree/bindings/net/wireless/ieee80211.yaml b/Documentation/devicetree/bindings/net/wireless/ieee80211.yaml index e68ed9423150..d89f7a3f88a7 100644 --- a/Documentation/devicetree/bindings/net/wireless/ieee80211.yaml +++ b/Documentation/devicetree/bindings/net/wireless/ieee80211.yaml @@ -1,6 +1,5 @@ # SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) # Copyright (c) 2018-2019 The Linux Foundation. All rights reserved. - %YAML 1.2 --- $id: http://devicetree.org/schemas/net/wireless/ieee80211.yaml# diff --git a/Documentation/devicetree/bindings/net/wireless/marvell-8xxx.txt b/Documentation/devicetree/bindings/net/wireless/marvell-8xxx.txt index 9bf9bbac16e2..cdc303caf5f4 100644 --- a/Documentation/devicetree/bindings/net/wireless/marvell-8xxx.txt +++ b/Documentation/devicetree/bindings/net/wireless/marvell-8xxx.txt @@ -1,4 +1,4 @@ -Marvell 8787/8897/8997 (sd8787/sd8897/sd8997/pcie8997) SDIO/PCIE devices +Marvell 8787/8897/8978/8997 (sd8787/sd8897/sd8978/sd8997/pcie8997) SDIO/PCIE devices ------ This node provides properties for controlling the Marvell SDIO/PCIE wireless device. @@ -10,7 +10,9 @@ Required properties: - compatible : should be one of the following: * "marvell,sd8787" * "marvell,sd8897" + * "marvell,sd8978" * "marvell,sd8997" + * "nxp,iw416" * "pci11ab,2b42" * "pci1b4b,2b42" diff --git a/Documentation/devicetree/bindings/net/wireless/mediatek,mt76.yaml b/Documentation/devicetree/bindings/net/wireless/mediatek,mt76.yaml index f0c78f994491..7d526ff53fb7 100644 --- a/Documentation/devicetree/bindings/net/wireless/mediatek,mt76.yaml +++ b/Documentation/devicetree/bindings/net/wireless/mediatek,mt76.yaml @@ -1,6 +1,5 @@ # SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) # Copyright (c) 2018-2019 The Linux Foundation. All rights reserved. 
- %YAML 1.2 --- $id: http://devicetree.org/schemas/net/wireless/mediatek,mt76.yaml# diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath11k.yaml b/Documentation/devicetree/bindings/net/wireless/qcom,ath11k.yaml index 556eb523606a..7d5f982a3d09 100644 --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath11k.yaml +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath11k.yaml @@ -1,6 +1,5 @@ # SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) # Copyright (c) 2018-2019 The Linux Foundation. All rights reserved. - %YAML 1.2 --- $id: http://devicetree.org/schemas/net/wireless/qcom,ath11k.yaml# @@ -21,6 +20,7 @@ properties: - qcom,ipq8074-wifi - qcom,ipq6018-wifi - qcom,wcn6750-wifi + - qcom,ipq5018-wifi reg: maxItems: 1 @@ -262,10 +262,10 @@ allOf: examples: - | - q6v5_wcss: q6v5_wcss@CD00000 { + q6v5_wcss: remoteproc@cd00000 { compatible = "qcom,ipq8074-wcss-pil"; - reg = <0xCD00000 0x4040>, - <0x4AB000 0x20>; + reg = <0xcd00000 0x4040>, + <0x4ab000 0x20>; reg-names = "qdsp6", "rmb"; }; @@ -386,7 +386,7 @@ examples: #address-cells = <2>; #size-cells = <2>; - qcn9074_0: qcn9074_0@51100000 { + qcn9074_0: wifi@51100000 { no-map; reg = <0x0 0x51100000 0x0 0x03500000>; }; @@ -463,6 +463,6 @@ examples: qcom,smem-states = <&wlan_smp2p_out 0>; qcom,smem-state-names = "wlan-smp2p-out"; wifi-firmware { - iommus = <&apps_smmu 0x1c02 0x1>; + iommus = <&apps_smmu 0x1c02 0x1>; }; }; diff --git a/Documentation/devicetree/bindings/net/wireless/silabs,wfx.yaml b/Documentation/devicetree/bindings/net/wireless/silabs,wfx.yaml index 583db5d42226..84e5659e50ef 100644 --- a/Documentation/devicetree/bindings/net/wireless/silabs,wfx.yaml +++ b/Documentation/devicetree/bindings/net/wireless/silabs,wfx.yaml @@ -2,7 +2,6 @@ # Copyright (c) 2020, Silicon Laboratories, Inc. 
%YAML 1.2 --- - $id: http://devicetree.org/schemas/net/wireless/silabs,wfx.yaml# $schema: http://devicetree.org/meta-schemas/core.yaml# diff --git a/Documentation/devicetree/bindings/net/wireless/ti,wlcore.yaml b/Documentation/devicetree/bindings/net/wireless/ti,wlcore.yaml index e31456730e9f..f799a1e52173 100644 --- a/Documentation/devicetree/bindings/net/wireless/ti,wlcore.yaml +++ b/Documentation/devicetree/bindings/net/wireless/ti,wlcore.yaml @@ -90,47 +90,47 @@ examples: // For wl12xx family: spi1 { - #address-cells = <1>; - #size-cells = <0>; - - wlcore1: wlcore@1 { - compatible = "ti,wl1271"; - reg = <1>; - spi-max-frequency = <48000000>; - interrupts = <8 IRQ_TYPE_LEVEL_HIGH>; - vwlan-supply = <&vwlan_fixed>; - clock-xtal; - ref-clock-frequency = <38400000>; - }; + #address-cells = <1>; + #size-cells = <0>; + + wlcore1: wlcore@1 { + compatible = "ti,wl1271"; + reg = <1>; + spi-max-frequency = <48000000>; + interrupts = <8 IRQ_TYPE_LEVEL_HIGH>; + vwlan-supply = <&vwlan_fixed>; + clock-xtal; + ref-clock-frequency = <38400000>; + }; }; // For wl18xx family: spi2 { - #address-cells = <1>; - #size-cells = <0>; - - wlcore2: wlcore@0 { - compatible = "ti,wl1835"; - reg = <0>; - spi-max-frequency = <48000000>; - interrupts = <27 IRQ_TYPE_EDGE_RISING>; - vwlan-supply = <&vwlan_fixed>; - }; + #address-cells = <1>; + #size-cells = <0>; + + wlcore2: wlcore@0 { + compatible = "ti,wl1835"; + reg = <0>; + spi-max-frequency = <48000000>; + interrupts = <27 IRQ_TYPE_EDGE_RISING>; + vwlan-supply = <&vwlan_fixed>; + }; }; // SDIO example: mmc3 { - vmmc-supply = <&wlan_en_reg>; - bus-width = <4>; - cap-power-off-card; - keep-power-in-suspend; - - #address-cells = <1>; - #size-cells = <0>; - - wlcore3: wlcore@2 { - compatible = "ti,wl1835"; - reg = <2>; - interrupts = <19 IRQ_TYPE_LEVEL_HIGH>; - }; + vmmc-supply = <&wlan_en_reg>; + bus-width = <4>; + cap-power-off-card; + keep-power-in-suspend; + + #address-cells = <1>; + #size-cells = <0>; + + wlcore3: wlcore@2 { + compatible 
= "ti,wl1835"; + reg = <2>; + interrupts = <19 IRQ_TYPE_LEVEL_HIGH>; + }; }; diff --git a/Documentation/devicetree/bindings/vendor-prefixes.yaml b/Documentation/devicetree/bindings/vendor-prefixes.yaml index 852910c0ce30..5c3c6abb983b 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.yaml +++ b/Documentation/devicetree/bindings/vendor-prefixes.yaml @@ -785,6 +785,8 @@ patternProperties: description: MaxBotix Inc. "^maxim,.*": description: Maxim Integrated Products + "^maxlinear,.*": + description: MaxLinear Inc. "^mbvl,.*": description: Mobiveil Inc. "^mcube,.*": @@ -855,6 +857,8 @@ patternProperties: description: Moortec Semiconductor Ltd. "^mosaixtech,.*": description: Mosaix Technologies, Inc. + "^motorcomm,.*": + description: MotorComm, Inc. "^motorola,.*": description: Motorola, Inc. "^moxa,.*": diff --git a/Documentation/isdn/interface_capi.rst b/Documentation/isdn/interface_capi.rst index fe2421444b76..4d63b34b35cf 100644 --- a/Documentation/isdn/interface_capi.rst +++ b/Documentation/isdn/interface_capi.rst @@ -323,7 +323,7 @@ If the lowest bit of showcapimsgs is set, kernelcapi logs controller and application up and down events. In addition, every registered CAPI controller has an associated traceflag -parameter controlling how CAPI messages sent from and to tha controller are +parameter controlling how CAPI messages sent from and to the controller are logged. The traceflag parameter is initialized with the value of the showcapimsgs parameter when the controller is registered, but can later be changed via the MANUFACTURER_REQ command KCAPI_CMD_TRACE. diff --git a/Documentation/isdn/m_isdn.rst b/Documentation/isdn/m_isdn.rst index 9957de349e69..5847a164287e 100644 --- a/Documentation/isdn/m_isdn.rst +++ b/Documentation/isdn/m_isdn.rst @@ -3,7 +3,7 @@ mISDN Driver ============ mISDN is a new modular ISDN driver, in the long term it should replace -the old I4L driver architecture for passiv ISDN cards. 
+the old I4L driver architecture for passive ISDN cards. It was designed to allow a broad range of applications and interfaces but only have the basic function in kernel, the interface to the user space is based on sockets with a own address family AF_ISDN. diff --git a/Documentation/netlink/genetlink-c.yaml b/Documentation/netlink/genetlink-c.yaml new file mode 100644 index 000000000000..bbcfa2472b04 --- /dev/null +++ b/Documentation/netlink/genetlink-c.yaml @@ -0,0 +1,331 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://kernel.org/schemas/netlink/genetlink-c.yaml# +$schema: https://json-schema.org/draft-07/schema + +# Common defines +$defs: + uint: + type: integer + minimum: 0 + len-or-define: + type: [ string, integer ] + pattern: ^[0-9A-Za-z_]+( - 1)?$ + minimum: 0 + +# Schema for specs +title: Protocol +description: Specification of a genetlink protocol +type: object +required: [ name, doc, attribute-sets, operations ] +additionalProperties: False +properties: + name: + description: Name of the genetlink family. + type: string + doc: + type: string + version: + description: Generic Netlink family version. Default is 1. + type: integer + minimum: 1 + protocol: + description: Schema compatibility level. Default is "genetlink". + enum: [ genetlink, genetlink-c ] + # Start genetlink-c + uapi-header: + description: Path to the uAPI header, default is linux/${family-name}.h + type: string + c-family-name: + description: Name of the define for the family name. + type: string + c-version-name: + description: Name of the define for the verion of the family. + type: string + max-by-define: + description: Makes the number of attributes and commands be specified by a define, not an enum value. + type: boolean + # End genetlink-c + + definitions: + description: List of type and constant definitions (enums, flags, defines). 
+ type: array + items: + type: object + required: [ type, name ] + additionalProperties: False + properties: + name: + type: string + header: + description: For C-compatible languages, header which already defines this value. + type: string + type: + enum: [ const, enum, flags ] + doc: + type: string + # For const + value: + description: For const - the value. + type: [ string, integer ] + # For enum and flags + value-start: + description: For enum or flags the literal initializer for the first value. + type: [ string, integer ] + entries: + description: For enum or flags array of values. + type: array + items: + oneOf: + - type: string + - type: object + required: [ name ] + additionalProperties: False + properties: + name: + type: string + value: + type: integer + doc: + type: string + render-max: + description: Render the max members for this enum. + type: boolean + # Start genetlink-c + enum-name: + description: Name for enum, if empty no name will be used. + type: [ string, "null" ] + name-prefix: + description: For enum the prefix of the values, optional. + type: string + # End genetlink-c + + attribute-sets: + description: Definition of attribute spaces for this family. + type: array + items: + description: Definition of a single attribute space. + type: object + required: [ name, attributes ] + additionalProperties: False + properties: + name: + description: | + Name used when referring to this space in other definitions, not used outside of the spec. + type: string + name-prefix: + description: | + Prefix for the C enum name of the attributes. Default family[name]-set[name]-a- + type: string + enum-name: + description: Name for the enum type of the attribute. + type: string + doc: + description: Documentation of the space. + type: string + subset-of: + description: | + Name of another space which this is a logical part of. Sub-spaces can be used to define + a limited group of attributes which are used in a nest. 
+ type: string + # Start genetlink-c + attr-cnt-name: + description: The explicit name for constant holding the count of attributes (last attr + 1). + type: string + attr-max-name: + description: The explicit name for last member of attribute enum. + type: string + # End genetlink-c + attributes: + description: List of attributes in the space. + type: array + items: + type: object + required: [ name, type ] + additionalProperties: False + properties: + name: + type: string + type: &attr-type + enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64, + string, nest, array-nest, nest-type-value ] + doc: + description: Documentation of the attribute. + type: string + value: + description: Value for the enum item representing this attribute in the uAPI. + $ref: '#/$defs/uint' + type-value: + description: Name of the value extracted from the type of a nest-type-value attribute. + type: array + items: + type: string + byte-order: + enum: [ little-endian, big-endian ] + multi-attr: + type: boolean + nested-attributes: + description: Name of the space (sub-space) used inside the attribute. + type: string + enum: + description: Name of the enum type used for the attribute. + type: string + enum-as-flags: + description: | + Treat the enum as flags. In most cases enum is either used as flags or as values. + Sometimes, however, both forms are necessary, in which case header contains the enum + form while specific attributes may request to convert the values into a bitfield. + type: boolean + checks: + description: Kernel input validation. + type: object + additionalProperties: False + properties: + flags-mask: + description: Name of the flags constant on which to base mask (unsigned scalar types only). + type: string + min: + description: Min value for an integer attribute. + type: integer + min-len: + description: Min length for a binary attribute. + $ref: '#/$defs/len-or-define' + max-len: + description: Max length for a string or a binary attribute. 
+ $ref: '#/$defs/len-or-define' + sub-type: *attr-type + + # Make sure name-prefix does not appear in subsets (subsets inherit naming) + dependencies: + name-prefix: + not: + required: [ subset-of ] + subset-of: + not: + required: [ name-prefix ] + + operations: + description: Operations supported by the protocol. + type: object + required: [ list ] + additionalProperties: False + properties: + enum-model: + description: | + The model of assigning values to the operations. + "unified" is the recommended model where all message types belong + to a single enum. + "directional" has the messages sent to the kernel and from the kernel + enumerated separately. + enum: [ unified ] + name-prefix: + description: | + Prefix for the C enum name of the command. The name is formed by concatenating + the prefix with the upper case name of the command, with dashes replaced by underscores. + type: string + enum-name: + description: Name for the enum type with commands. + type: string + async-prefix: + description: Same as name-prefix but used to render notifications and events to separate enum. + type: string + async-enum: + description: Name for the enum type with notifications/events. + type: string + list: + description: List of commands + type: array + items: + type: object + additionalProperties: False + required: [ name, doc ] + properties: + name: + description: Name of the operation, also defining its C enum value in uAPI. + type: string + doc: + description: Documentation for the command. + type: string + value: + description: Value for the enum in the uAPI. + $ref: '#/$defs/uint' + attribute-set: + description: | + Attribute space from which attributes directly in the requests and replies + to this command are defined. + type: string + flags: &cmd_flags + description: Command flags. + type: array + items: + enum: [ admin-perm ] + dont-validate: + description: Kernel attribute validation flags. 
+ type: array + items: + enum: [ strict, dump ] + do: &subop-type + description: Main command handler. + type: object + additionalProperties: False + properties: + request: &subop-attr-list + description: Definition of the request message for a given command. + type: object + additionalProperties: False + properties: + attributes: + description: | + Names of attributes from the attribute-set (not full attribute + definitions, just names). + type: array + items: + type: string + reply: *subop-attr-list + pre: + description: Hook for a function to run before the main callback (pre_doit or start). + type: string + post: + description: Hook for a function to run after the main callback (post_doit or done). + type: string + dump: *subop-type + notify: + description: Name of the command sharing the reply type with this notification. + type: string + event: + type: object + additionalProperties: False + properties: + attributes: + description: Explicit list of the attributes for the notification. + type: array + items: + type: string + mcgrp: + description: Name of the multicast group generating given notification. + type: string + mcast-groups: + description: List of multicast groups. + type: object + required: [ list ] + additionalProperties: False + properties: + list: + description: List of groups. + type: array + items: + type: object + required: [ name ] + additionalProperties: False + properties: + name: + description: | + The name for the group, used to form the define and the value of the define. + type: string + # Start genetlink-c + c-define-name: + description: Override for the name of the define in C uAPI. 
+ type: string + # End genetlink-c + flags: *cmd_flags diff --git a/Documentation/netlink/genetlink-legacy.yaml b/Documentation/netlink/genetlink-legacy.yaml new file mode 100644 index 000000000000..5642925c4ceb --- /dev/null +++ b/Documentation/netlink/genetlink-legacy.yaml @@ -0,0 +1,361 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://kernel.org/schemas/netlink/genetlink-legacy.yaml# +$schema: https://json-schema.org/draft-07/schema + +# Common defines +$defs: + uint: + type: integer + minimum: 0 + len-or-define: + type: [ string, integer ] + pattern: ^[0-9A-Za-z_]+( - 1)?$ + minimum: 0 + +# Schema for specs +title: Protocol +description: Specification of a genetlink protocol +type: object +required: [ name, doc, attribute-sets, operations ] +additionalProperties: False +properties: + name: + description: Name of the genetlink family. + type: string + doc: + type: string + version: + description: Generic Netlink family version. Default is 1. + type: integer + minimum: 1 + protocol: + description: Schema compatibility level. Default is "genetlink". + enum: [ genetlink, genetlink-c, genetlink-legacy ] # Trim + # Start genetlink-c + uapi-header: + description: Path to the uAPI header, default is linux/${family-name}.h + type: string + c-family-name: + description: Name of the define for the family name. + type: string + c-version-name: + description: Name of the define for the verion of the family. + type: string + max-by-define: + description: Makes the number of attributes and commands be specified by a define, not an enum value. + type: boolean + # End genetlink-c + # Start genetlink-legacy + kernel-policy: + description: | + Defines if the input policy in the kernel is global, per-operation, or split per operation type. + Default is split. + enum: [ split, per-op, global ] + # End genetlink-legacy + + definitions: + description: List of type and constant definitions (enums, flags, defines). 
+ type: array + items: + type: object + required: [ type, name ] + additionalProperties: False + properties: + name: + type: string + header: + description: For C-compatible languages, header which already defines this value. + type: string + type: + enum: [ const, enum, flags, struct ] # Trim + doc: + type: string + # For const + value: + description: For const - the value. + type: [ string, integer ] + # For enum and flags + value-start: + description: For enum or flags the literal initializer for the first value. + type: [ string, integer ] + entries: + description: For enum or flags array of values. + type: array + items: + oneOf: + - type: string + - type: object + required: [ name ] + additionalProperties: False + properties: + name: + type: string + value: + type: integer + doc: + type: string + render-max: + description: Render the max members for this enum. + type: boolean + # Start genetlink-c + enum-name: + description: Name for enum, if empty no name will be used. + type: [ string, "null" ] + name-prefix: + description: For enum the prefix of the values, optional. + type: string + # End genetlink-c + # Start genetlink-legacy + members: + description: List of struct members. Only scalars and strings members allowed. + type: array + items: + type: object + required: [ name, type ] + additionalProperties: False + properties: + name: + type: string + type: + enum: [ u8, u16, u32, u64, s8, s16, s32, s64, string ] + len: + $ref: '#/$defs/len-or-define' + # End genetlink-legacy + + attribute-sets: + description: Definition of attribute spaces for this family. + type: array + items: + description: Definition of a single attribute space. + type: object + required: [ name, attributes ] + additionalProperties: False + properties: + name: + description: | + Name used when referring to this space in other definitions, not used outside of the spec. + type: string + name-prefix: + description: | + Prefix for the C enum name of the attributes. 
Default family[name]-set[name]-a- + type: string + enum-name: + description: Name for the enum type of the attribute. + type: string + doc: + description: Documentation of the space. + type: string + subset-of: + description: | + Name of another space which this is a logical part of. Sub-spaces can be used to define + a limited group of attributes which are used in a nest. + type: string + # Start genetlink-c + attr-cnt-name: + description: The explicit name for constant holding the count of attributes (last attr + 1). + type: string + attr-max-name: + description: The explicit name for last member of attribute enum. + type: string + # End genetlink-c + attributes: + description: List of attributes in the space. + type: array + items: + type: object + required: [ name, type ] + additionalProperties: False + properties: + name: + type: string + type: &attr-type + enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64, + string, nest, array-nest, nest-type-value ] + doc: + description: Documentation of the attribute. + type: string + value: + description: Value for the enum item representing this attribute in the uAPI. + $ref: '#/$defs/uint' + type-value: + description: Name of the value extracted from the type of a nest-type-value attribute. + type: array + items: + type: string + byte-order: + enum: [ little-endian, big-endian ] + multi-attr: + type: boolean + nested-attributes: + description: Name of the space (sub-space) used inside the attribute. + type: string + enum: + description: Name of the enum type used for the attribute. + type: string + enum-as-flags: + description: | + Treat the enum as flags. In most cases enum is either used as flags or as values. + Sometimes, however, both forms are necessary, in which case header contains the enum + form while specific attributes may request to convert the values into a bitfield. + type: boolean + checks: + description: Kernel input validation. 
+ type: object + additionalProperties: False + properties: + flags-mask: + description: Name of the flags constant on which to base mask (unsigned scalar types only). + type: string + min: + description: Min value for an integer attribute. + type: integer + min-len: + description: Min length for a binary attribute. + $ref: '#/$defs/len-or-define' + max-len: + description: Max length for a string or a binary attribute. + $ref: '#/$defs/len-or-define' + sub-type: *attr-type + + # Make sure name-prefix does not appear in subsets (subsets inherit naming) + dependencies: + name-prefix: + not: + required: [ subset-of ] + subset-of: + not: + required: [ name-prefix ] + + operations: + description: Operations supported by the protocol. + type: object + required: [ list ] + additionalProperties: False + properties: + enum-model: + description: | + The model of assigning values to the operations. + "unified" is the recommended model where all message types belong + to a single enum. + "directional" has the messages sent to the kernel and from the kernel + enumerated separately. + enum: [ unified, directional ] # Trim + name-prefix: + description: | + Prefix for the C enum name of the command. The name is formed by concatenating + the prefix with the upper case name of the command, with dashes replaced by underscores. + type: string + enum-name: + description: Name for the enum type with commands. + type: string + async-prefix: + description: Same as name-prefix but used to render notifications and events to separate enum. + type: string + async-enum: + description: Name for the enum type with notifications/events. + type: string + list: + description: List of commands + type: array + items: + type: object + additionalProperties: False + required: [ name, doc ] + properties: + name: + description: Name of the operation, also defining its C enum value in uAPI. + type: string + doc: + description: Documentation for the command. 
+ type: string + value: + description: Value for the enum in the uAPI. + $ref: '#/$defs/uint' + attribute-set: + description: | + Attribute space from which attributes directly in the requests and replies + to this command are defined. + type: string + flags: &cmd_flags + description: Command flags. + type: array + items: + enum: [ admin-perm ] + dont-validate: + description: Kernel attribute validation flags. + type: array + items: + enum: [ strict, dump ] + do: &subop-type + description: Main command handler. + type: object + additionalProperties: False + properties: + request: &subop-attr-list + description: Definition of the request message for a given command. + type: object + additionalProperties: False + properties: + attributes: + description: | + Names of attributes from the attribute-set (not full attribute + definitions, just names). + type: array + items: + type: string + # Start genetlink-legacy + value: + description: | + ID of this message if value for request and response differ, + i.e. requests and responses have different message enums. + $ref: '#/$defs/uint' + # End genetlink-legacy + reply: *subop-attr-list + pre: + description: Hook for a function to run before the main callback (pre_doit or start). + type: string + post: + description: Hook for a function to run after the main callback (post_doit or done). + type: string + dump: *subop-type + notify: + description: Name of the command sharing the reply type with this notification. + type: string + event: + type: object + additionalProperties: False + properties: + attributes: + description: Explicit list of the attributes for the notification. + type: array + items: + type: string + mcgrp: + description: Name of the multicast group generating given notification. + type: string + mcast-groups: + description: List of multicast groups. + type: object + required: [ list ] + additionalProperties: False + properties: + list: + description: List of groups. 
+ type: array + items: + type: object + required: [ name ] + additionalProperties: False + properties: + name: + description: | + The name for the group, used to form the define and the value of the define. + type: string + # Start genetlink-c + c-define-name: + description: Override for the name of the define in C uAPI. + type: string + # End genetlink-c + flags: *cmd_flags diff --git a/Documentation/netlink/genetlink.yaml b/Documentation/netlink/genetlink.yaml new file mode 100644 index 000000000000..62a922755ce2 --- /dev/null +++ b/Documentation/netlink/genetlink.yaml @@ -0,0 +1,296 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://kernel.org/schemas/netlink/genetlink-legacy.yaml# +$schema: https://json-schema.org/draft-07/schema + +# Common defines +$defs: + uint: + type: integer + minimum: 0 + len-or-define: + type: [ string, integer ] + pattern: ^[0-9A-Za-z_]+( - 1)?$ + minimum: 0 + +# Schema for specs +title: Protocol +description: Specification of a genetlink protocol +type: object +required: [ name, doc, attribute-sets, operations ] +additionalProperties: False +properties: + name: + description: Name of the genetlink family. + type: string + doc: + type: string + version: + description: Generic Netlink family version. Default is 1. + type: integer + minimum: 1 + protocol: + description: Schema compatibility level. Default is "genetlink". + enum: [ genetlink ] + + definitions: + description: List of type and constant definitions (enums, flags, defines). + type: array + items: + type: object + required: [ type, name ] + additionalProperties: False + properties: + name: + type: string + header: + description: For C-compatible languages, header which already defines this value. + type: string + type: + enum: [ const, enum, flags ] + doc: + type: string + # For const + value: + description: For const - the value. 
+ type: [ string, integer ] + # For enum and flags + value-start: + description: For enum or flags the literal initializer for the first value. + type: [ string, integer ] + entries: + description: For enum or flags array of values. + type: array + items: + oneOf: + - type: string + - type: object + required: [ name ] + additionalProperties: False + properties: + name: + type: string + value: + type: integer + doc: + type: string + render-max: + description: Render the max members for this enum. + type: boolean + + attribute-sets: + description: Definition of attribute spaces for this family. + type: array + items: + description: Definition of a single attribute space. + type: object + required: [ name, attributes ] + additionalProperties: False + properties: + name: + description: | + Name used when referring to this space in other definitions, not used outside of the spec. + type: string + name-prefix: + description: | + Prefix for the C enum name of the attributes. Default family[name]-set[name]-a- + type: string + enum-name: + description: Name for the enum type of the attribute. + type: string + doc: + description: Documentation of the space. + type: string + subset-of: + description: | + Name of another space which this is a logical part of. Sub-spaces can be used to define + a limited group of attributes which are used in a nest. + type: string + attributes: + description: List of attributes in the space. + type: array + items: + type: object + required: [ name, type ] + additionalProperties: False + properties: + name: + type: string + type: &attr-type + enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64, + string, nest, array-nest, nest-type-value ] + doc: + description: Documentation of the attribute. + type: string + value: + description: Value for the enum item representing this attribute in the uAPI. + $ref: '#/$defs/uint' + type-value: + description: Name of the value extracted from the type of a nest-type-value attribute. 
+ type: array + items: + type: string + byte-order: + enum: [ little-endian, big-endian ] + multi-attr: + type: boolean + nested-attributes: + description: Name of the space (sub-space) used inside the attribute. + type: string + enum: + description: Name of the enum type used for the attribute. + type: string + enum-as-flags: + description: | + Treat the enum as flags. In most cases enum is either used as flags or as values. + Sometimes, however, both forms are necessary, in which case header contains the enum + form while specific attributes may request to convert the values into a bitfield. + type: boolean + checks: + description: Kernel input validation. + type: object + additionalProperties: False + properties: + flags-mask: + description: Name of the flags constant on which to base mask (unsigned scalar types only). + type: string + min: + description: Min value for an integer attribute. + type: integer + min-len: + description: Min length for a binary attribute. + $ref: '#/$defs/len-or-define' + max-len: + description: Max length for a string or a binary attribute. + $ref: '#/$defs/len-or-define' + sub-type: *attr-type + + # Make sure name-prefix does not appear in subsets (subsets inherit naming) + dependencies: + name-prefix: + not: + required: [ subset-of ] + subset-of: + not: + required: [ name-prefix ] + + operations: + description: Operations supported by the protocol. + type: object + required: [ list ] + additionalProperties: False + properties: + enum-model: + description: | + The model of assigning values to the operations. + "unified" is the recommended model where all message types belong + to a single enum. + "directional" has the messages sent to the kernel and from the kernel + enumerated separately. + enum: [ unified ] + name-prefix: + description: | + Prefix for the C enum name of the command. The name is formed by concatenating + the prefix with the upper case name of the command, with dashes replaced by underscores. 
+ type: string + enum-name: + description: Name for the enum type with commands. + type: string + async-prefix: + description: Same as name-prefix but used to render notifications and events to separate enum. + type: string + async-enum: + description: Name for the enum type with notifications/events. + type: string + list: + description: List of commands + type: array + items: + type: object + additionalProperties: False + required: [ name, doc ] + properties: + name: + description: Name of the operation, also defining its C enum value in uAPI. + type: string + doc: + description: Documentation for the command. + type: string + value: + description: Value for the enum in the uAPI. + $ref: '#/$defs/uint' + attribute-set: + description: | + Attribute space from which attributes directly in the requests and replies + to this command are defined. + type: string + flags: &cmd_flags + description: Command flags. + type: array + items: + enum: [ admin-perm ] + dont-validate: + description: Kernel attribute validation flags. + type: array + items: + enum: [ strict, dump ] + do: &subop-type + description: Main command handler. + type: object + additionalProperties: False + properties: + request: &subop-attr-list + description: Definition of the request message for a given command. + type: object + additionalProperties: False + properties: + attributes: + description: | + Names of attributes from the attribute-set (not full attribute + definitions, just names). + type: array + items: + type: string + reply: *subop-attr-list + pre: + description: Hook for a function to run before the main callback (pre_doit or start). + type: string + post: + description: Hook for a function to run after the main callback (post_doit or done). + type: string + dump: *subop-type + notify: + description: Name of the command sharing the reply type with this notification. 
+ type: string + event: + type: object + additionalProperties: False + properties: + attributes: + description: Explicit list of the attributes for the notification. + type: array + items: + type: string + mcgrp: + description: Name of the multicast group generating given notification. + type: string + mcast-groups: + description: List of multicast groups. + type: object + required: [ list ] + additionalProperties: False + properties: + list: + description: List of groups. + type: array + items: + type: object + required: [ name ] + additionalProperties: False + properties: + name: + description: | + The name for the group, used to form the define and the value of the define. + type: string + flags: *cmd_flags diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml new file mode 100644 index 000000000000..08b776908d15 --- /dev/null +++ b/Documentation/netlink/specs/ethtool.yaml @@ -0,0 +1,397 @@ +name: ethtool + +protocol: genetlink-legacy + +doc: Partial family for Ethtool Netlink. 
+ +attribute-sets: + - + name: header + attributes: + - + name: dev-index + type: u32 + value: 1 + - + name: dev-name + type: string + - + name: flags + type: u32 + + - + name: bitset-bit + attributes: + - + name: index + type: u32 + value: 1 + - + name: name + type: string + - + name: value + type: flag + - + name: bitset-bits + attributes: + - + name: bit + type: nest + nested-attributes: bitset-bit + value: 1 + - + name: bitset + attributes: + - + name: nomask + type: flag + value: 1 + - + name: size + type: u32 + - + name: bits + type: nest + nested-attributes: bitset-bits + + - + name: string + attributes: + - + name: index + type: u32 + value: 1 + - + name: value + type: string + - + name: strings + attributes: + - + name: string + type: nest + value: 1 + multi-attr: true + nested-attributes: string + - + name: stringset + attributes: + - + name: id + type: u32 + value: 1 + - + name: count + type: u32 + - + name: strings + type: nest + multi-attr: true + nested-attributes: strings + - + name: stringsets + attributes: + - + name: stringset + type: nest + multi-attr: true + value: 1 + nested-attributes: stringset + - + name: strset + attributes: + - + name: header + value: 1 + type: nest + nested-attributes: header + - + name: stringsets + type: nest + nested-attributes: stringsets + - + name: counts-only + type: flag + + - + name: privflags + attributes: + - + name: header + value: 1 + type: nest + nested-attributes: header + - + name: flags + type: nest + nested-attributes: bitset + + - + name: rings + attributes: + - + name: header + value: 1 + type: nest + nested-attributes: header + - + name: rx-max + type: u32 + - + name: rx-mini-max + type: u32 + - + name: rx-jumbo-max + type: u32 + - + name: tx-max + type: u32 + - + name: rx + type: u32 + - + name: rx-mini + type: u32 + - + name: rx-jumbo + type: u32 + - + name: tx + type: u32 + - + name: rx-buf-len + type: u32 + - + name: tcp-data-split + type: u8 + - + name: cqe-size + type: u32 + - + name: tx-push + 
type: u8 + - + name: rx-push + type: u8 + + - + name: mm-stat + attributes: + - + name: pad + value: 1 + type: pad + - + name: reassembly-errors + type: u64 + - + name: smd-errors + type: u64 + - + name: reassembly-ok + type: u64 + - + name: rx-frag-count + type: u64 + - + name: tx-frag-count + type: u64 + - + name: hold-count + type: u64 + - + name: mm + attributes: + - + name: header + value: 1 + type: nest + nested-attributes: header + - + name: pmac-enabled + type: u8 + - + name: tx-enabled + type: u8 + - + name: tx-active + type: u8 + - + name: tx-min-frag-size + type: u32 + - + name: rx-min-frag-size + type: u32 + - + name: verify-enabled + type: u8 + - + name: verify-status + type: u8 + - + name: verify-time + type: u32 + - + name: max-verify-time + type: u32 + - + name: stats + type: nest + nested-attributes: mm-stat + +operations: + enum-model: directional + list: + - + name: strset-get + doc: Get string set from the kernel. + + attribute-set: strset + + do: &strset-get-op + request: + value: 1 + attributes: + - header + - stringsets + - counts-only + reply: + value: 1 + attributes: + - header + - stringsets + dump: *strset-get-op + + # TODO: fill in the requests in between + + - + name: privflags-get + doc: Get device private flags. + + attribute-set: privflags + + do: &privflag-get-op + request: + value: 13 + attributes: + - header + reply: + value: 14 + attributes: + - header + - flags + dump: *privflag-get-op + - + name: privflags-set + doc: Set device private flags. + + attribute-set: privflags + + do: + request: + attributes: + - header + - flags + - + name: privflags-ntf + doc: Notification for change in device private flags. + notify: privflags-get + + - + name: rings-get + doc: Get ring params. 
+ + attribute-set: rings + + do: &ring-get-op + request: + attributes: + - header + reply: + attributes: + - header + - rx-max + - rx-mini-max + - rx-jumbo-max + - tx-max + - rx + - rx-mini + - rx-jumbo + - tx + - rx-buf-len + - tcp-data-split + - cqe-size + - tx-push + - rx-push + dump: *ring-get-op + - + name: rings-set + doc: Set ring params. + + attribute-set: rings + + do: + request: + attributes: + - header + - rx + - rx-mini + - rx-jumbo + - tx + - rx-buf-len + - tcp-data-split + - cqe-size + - tx-push + - rx-push + - + name: rings-ntf + doc: Notification for change in ring params. + notify: rings-get + + # TODO: fill in the requests in between + + - + name: mm-get + doc: Get MAC Merge configuration and state + + attribute-set: mm + + do: &mm-get-op + request: + value: 42 + attributes: + - header + reply: + value: 42 + attributes: + - header + - pmac-enabled + - tx-enabled + - tx-active + - tx-min-frag-size + - rx-min-frag-size + - verify-enabled + - verify-time + - max-verify-time + - stats + dump: *mm-get-op + - + name: mm-set + doc: Set MAC Merge configuration + + attribute-set: mm + + do: + request: + attributes: + - header + - verify-enabled + - verify-time + - tx-enabled + - pmac-enabled + - tx-min-frag-size + - + name: mm-ntf + doc: Notification for change in MAC Merge configuration. + notify: mm-get diff --git a/Documentation/netlink/specs/fou.yaml b/Documentation/netlink/specs/fou.yaml new file mode 100644 index 000000000000..266c386eedf3 --- /dev/null +++ b/Documentation/netlink/specs/fou.yaml @@ -0,0 +1,128 @@ +name: fou + +protocol: genetlink-legacy + +doc: | + Foo-over-UDP. 
+ +c-family-name: fou-genl-name +c-version-name: fou-genl-version +max-by-define: true +kernel-policy: global + +definitions: + - + type: enum + name: encap_type + name-prefix: fou-encap- + enum-name: + entries: [ unspec, direct, gue ] + +attribute-sets: + - + name: fou + name-prefix: fou-attr- + attributes: + - + name: unspec + type: unused + - + name: port + type: u16 + byte-order: big-endian + - + name: af + type: u8 + - + name: ipproto + type: u8 + - + name: type + type: u8 + - + name: remcsum_nopartial + type: flag + - + name: local_v4 + type: u32 + - + name: local_v6 + type: binary + checks: + min-len: 16 + - + name: peer_v4 + type: u32 + - + name: peer_v6 + type: binary + checks: + min-len: 16 + - + name: peer_port + type: u16 + byte-order: big-endian + - + name: ifindex + type: s32 + +operations: + list: + - + name: unspec + doc: unused + + - + name: add + doc: Add port. + attribute-set: fou + + dont-validate: [ strict, dump ] + flags: [ admin-perm ] + + do: + request: &all_attrs + attributes: + - port + - ipproto + - type + - remcsum_nopartial + - local_v4 + - peer_v4 + - local_v6 + - peer_v6 + - peer_port + - ifindex + + - + name: del + doc: Delete port. + attribute-set: fou + + dont-validate: [ strict, dump ] + flags: [ admin-perm ] + + do: + request: &select_attrs + attributes: + - af + - ifindex + - port + - peer_port + - local_v4 + - peer_v4 + - local_v6 + - peer_v6 + + - + name: get + doc: Get tunnel info. + attribute-set: fou + dont-validate: [ strict, dump ] + + do: + request: *select_attrs + reply: *all_attrs + + dump: + reply: *all_attrs diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml new file mode 100644 index 000000000000..b4dcdae54ffd --- /dev/null +++ b/Documentation/netlink/specs/netdev.yaml @@ -0,0 +1,100 @@ +name: netdev + +doc: + netdev configuration over generic netlink. 
+ +definitions: + - + type: flags + name: xdp-act + entries: + - + name: basic + doc: + XDP features set supported by all drivers + (XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX) + - + name: redirect + doc: + The netdev supports XDP_REDIRECT + - + name: ndo-xmit + doc: + This feature informs if netdev implements ndo_xdp_xmit callback. + - + name: xsk-zerocopy + doc: + This feature informs if netdev supports AF_XDP in zero copy mode. + - + name: hw-offload + doc: + This feature informs if netdev supports XDP hw offloading. + - + name: rx-sg + doc: + This feature informs if netdev implements non-linear XDP buffer + support in the driver napi callback. + - + name: ndo-xmit-sg + doc: + This feature informs if netdev implements non-linear XDP buffer + support in ndo_xdp_xmit callback. + +attribute-sets: + - + name: dev + attributes: + - + name: ifindex + doc: netdev ifindex + type: u32 + value: 1 + checks: + min: 1 + - + name: pad + type: pad + - + name: xdp-features + doc: Bitmask of enabled xdp-features. + type: u64 + enum: xdp-act + enum-as-flags: true + +operations: + list: + - + name: dev-get + doc: Get / dump information about a netdev. + value: 1 + attribute-set: dev + do: + request: + attributes: + - ifindex + reply: &dev-all + attributes: + - ifindex + - xdp-features + dump: + reply: *dev-all + - + name: dev-add-ntf + doc: Notification about device appearing. + notify: dev-get + mcgrp: mgmt + - + name: dev-del-ntf + doc: Notification about device disappearing. + notify: dev-get + mcgrp: mgmt + - + name: dev-change-ntf + doc: Notification about device configuration being changed. 
+ notify: dev-get + mcgrp: mgmt + +mcast-groups: + list: + - + name: mgmt diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 60b217b436be..247c6c4127e9 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -419,7 +419,7 @@ XDP_UMEM_REG setsockopt ----------------------- This setsockopt registers a UMEM to a socket. This is the area that -contain all the buffers that packet can recide in. The call takes a +contain all the buffers that packet can reside in. The call takes a pointer to the beginning of this area and the size of it. Moreover, it also has parameter called chunk_size that is the size that the UMEM is divided into. It can only be 2K or 4K at the moment. If you have an @@ -592,7 +592,7 @@ A: When a netdev of a physical NIC is initialized, Linux usually A number of other ways are possible all up to the capabilities of the NIC you have. -Q: Can I use the XSKMAP to implement a switch betwen different umems +Q: Can I use the XSKMAP to implement a switch between different umems in copy mode? A: The short answer is no, that is not supported at the moment. The diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst index ac249ac8fcf2..982215723582 100644 --- a/Documentation/networking/arcnet-hardware.rst +++ b/Documentation/networking/arcnet-hardware.rst @@ -1902,7 +1902,7 @@ of 32 possible I/O Base addresses using the following tables:: 6 | 10 The I/O address is sum of all switches set to "1". Remember that -the I/O address space bellow 0x200 is RESERVED for mainboard, so +the I/O address space below 0x200 is RESERVED for mainboard, so switch 1 should be ALWAYS SET TO OFF. 
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst index b85563ea3682..8a0dcb1894b4 100644 --- a/Documentation/networking/batman-adv.rst +++ b/Documentation/networking/batman-adv.rst @@ -159,7 +159,7 @@ Please send us comments, experiences, questions, anything :) IRC: #batadv on ircs://irc.hackint.org/ Mailing-list: - b.a.t.m.a.n@open-mesh.org (optional subscription at + b.a.t.m.a.n@lists.open-mesh.org (optional subscription at https://lists.open-mesh.org/mailman3/postorius/lists/b.a.t.m.a.n.lists.open-mesh.org/) You can also contact the Authors: diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst index 90121deef217..d7e1ada905b2 100644 --- a/Documentation/networking/can.rst +++ b/Documentation/networking/can.rst @@ -931,7 +931,7 @@ ival1: ival2: Throttle the received message rate down to the value of ival2. This is useful to reduce messages for the application when the signal inside the - CAN frame is stateless as state changes within the ival2 periode may get + CAN frame is stateless as state changes within the ival2 period may get lost. Broadcast Manager Multiplex Message Receive Filter diff --git a/Documentation/networking/can_ucan_protocol.rst b/Documentation/networking/can_ucan_protocol.rst index 638ac1ee7914..935d872ae87c 100644 --- a/Documentation/networking/can_ucan_protocol.rst +++ b/Documentation/networking/can_ucan_protocol.rst @@ -50,7 +50,7 @@ Setup Packet ``wIndex`` USB Interface Index (0 for device commands) ``wLength`` * Host to Device - Number of bytes to transmit * Device to Host - Maximum Number of bytes to - receive. If the device send less. Commom ZLP + receive. If the device send less. Common ZLP semantics are used. 
================= ===================================================== diff --git a/Documentation/networking/cdc_mbim.rst b/Documentation/networking/cdc_mbim.rst index 0048409c06b4..37f968acc473 100644 --- a/Documentation/networking/cdc_mbim.rst +++ b/Documentation/networking/cdc_mbim.rst @@ -93,7 +93,7 @@ MBIM function can be looked up using sysfs. For example:: USB configuration descriptors ----------------------------- The wMaxControlMessage field of the CDC MBIM functional descriptor -limits the maximum control message size. The managament application is +limits the maximum control message size. The management application is responsible for negotiating a control message size complying with the requirements in section 9.3.1 of [1], taking this descriptor field into consideration. diff --git a/Documentation/networking/device_drivers/atm/iphase.rst b/Documentation/networking/device_drivers/atm/iphase.rst index 92d9b757d75a..388c7101e2cb 100644 --- a/Documentation/networking/device_drivers/atm/iphase.rst +++ b/Documentation/networking/device_drivers/atm/iphase.rst @@ -4,7 +4,7 @@ ATM (i)Chip IA Linux Driver Source ================================== - READ ME FISRT + READ ME FIRST -------------------------------------------------------------------------------- diff --git a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst index 40c92ea272af..1a4fc6607582 100644 --- a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst +++ b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst @@ -577,7 +577,7 @@ CTU CAN FD IP Core and Driver Development Acknowledgment * Linux driver development * continuous integration platform architect and GHDL updates - * theses `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_ + * thesis `Open-source and 
Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_ * Jiri Novak <jnovak@fel.cvut.cz> @@ -603,7 +603,7 @@ CTU CAN FD IP Core and Driver Development Acknowledgment * Jan Charvat * implemented CTU CAN FD functional model for QEMU which has been integrated into QEMU mainline (`docs/system/devices/can.rst <https://www.qemu.org/docs/master/system/devices/can.html>`_) - * Bachelor theses Model of CAN FD Communication Controller for QEMU Emulator + * Bachelor thesis Model of CAN FD Communication Controller for QEMU Emulator Notes ----- diff --git a/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg index b371650788f4..381323423b4c 100644 --- a/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg +++ b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg @@ -129,10 +129,10 @@ </g> </g> <text transform="matrix(.264583 0 0 .264583 91.8919 139.964)" x="26.959213" y="9.11724" fill="#2aa1ff" filter="url(#filter1204-6-2-9-1-3-1)" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="26.959213" y="9.11724" text-align="center">Set</tspan><tspan x="26.959213" y="22.31724" text-align="center">abort</tspan></text> - <text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccesfull</tspan></text> + <text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" 
xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccessful</tspan></text> <g font-size="12px" stroke-width="3.77953" text-anchor="middle"> <text transform="matrix(.264583 0 0 .264583 68.5988 118.913)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">starts</tspan></text> - <text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">succesfull</tspan></text> + <text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">successful</tspan></text> <text transform="matrix(.264583 0 0 .264583 107.77 145.476)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">sborted</tspan></text> </g> <g stroke-width="3.77953" text-anchor="middle"> diff --git a/Documentation/networking/device_drivers/ethernet/3com/vortex.rst b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst index e89e4192af88..a060f84c4f96 100644 --- a/Documentation/networking/device_drivers/ethernet/3com/vortex.rst +++ b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst @@ -254,7 +254,7 @@ Media 
selection A number of the older NICs such as the 3c590 and 3c900 series have 10base2 and AUI interfaces. -Prior to January, 2001 this driver would autoeselect the 10base2 or AUI +Prior to January, 2001 this driver would autoselect the 10base2 or AUI port if it didn't detect activity on the 10baseT port. It would then get stuck on the 10base2 port and a driver reload was necessary to switch back to 10baseT. This behaviour could not be prevented with a diff --git a/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst index 595ddef1c8b3..099280a261be 100644 --- a/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst +++ b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst @@ -270,7 +270,7 @@ RX flow rules (ntuple filters) ethtool -K ethX ntuple <on|off> - When disabling ntuple filters, all the user programed filters are + When disabling ntuple filters, all the user programmed filters are flushed from the driver cache and hardware. All needed filters must be re-added when ntuple is re-enabled. @@ -418,7 +418,7 @@ Default value: 0xFFFF 0 Disable interrupt throttling. 1 Enable interrupt throttling and use specified tx and rx rates. 0xFFFF Auto throttling mode. Driver will choose the best RX and TX - interrupt throtting settings based on link speed. + interrupt throttling settings based on link speed. ====== ============================================================== aq_itr_tx - TX interrupt throttle rate @@ -456,7 +456,7 @@ AQ_CFG_RX_PAGEORDER Default value: 0 -RX page order override. Thats a power of 2 number of RX pages allocated for +RX page order override. That's a power of 2 number of RX pages allocated for each descriptor. Received descriptor size is still limited by AQ_CFG_RX_FRAME_MAX. 
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst index 1d2f55feca24..e2a36d0d88ef 100644 --- a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst +++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst @@ -11,7 +11,7 @@ Overview -------- The DPAA2 MAC / PHY support consists of a set of APIs that help DPAA2 network -drivers (dpaa2-eth, dpaa2-ethsw) interract with the PHY library. +drivers (dpaa2-eth, dpaa2-ethsw) interact with the PHY library. DPAA2 Software Architecture --------------------------- diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst index 5196905582c5..392969ac88ad 100644 --- a/Documentation/networking/device_drivers/ethernet/index.rst +++ b/Documentation/networking/device_drivers/ethernet/index.rst @@ -39,7 +39,7 @@ Contents: intel/ice marvell/octeontx2 marvell/octeon_ep - mellanox/mlx5 + mellanox/mlx5/index microsoft/netvsc neterion/s2io netronome/nfp diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst index b481b81f3be5..5efea4dd1251 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst @@ -901,15 +901,17 @@ To enable/disable UDP Segmentation Offload, issue the following command:: # ethtool -K <ethX> tx-udp-segmentation [off|on] + GNSS module ----------- -Allows user to read messages from the GNSS module and write supported commands. -If the module is physically present, driver creates 2 TTYs for each supported -device in /dev, ttyGNSS_<device>:<function>_0 and _1. First one (_0) is RW and -the second one is RO. 
-The protocol of write commands is dependent on the GNSS module as the driver -writes raw bytes from the TTY to the GNSS i2c. Please refer to the module -documentation for details. +Requires kernel compiled with CONFIG_GNSS=y or CONFIG_GNSS=m. +Allows user to read messages from the GNSS hardware module and write supported +commands. If the module is physically present, a GNSS device is spawned: +``/dev/gnss<id>``. +The protocol of write command is dependent on the GNSS hardware module as the +driver writes raw bytes by the GNSS object to the receiver through i2c. Please +refer to the hardware GNSS module documentation for configuration details. + Performance Optimization ======================== diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst index dd5cd69467be..5ba9015336e2 100644 --- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst +++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst @@ -127,7 +127,7 @@ Type1: Type2: - RVU PF0 ie admin function creates these VFs and maps them to loopback block's channels. - A set of two VFs (VF0 & VF1, VF2 & VF3 .. so on) works as a pair ie pkts sent out of - VF0 will be received by VF1 and viceversa. + VF0 will be received by VF1 and vice versa. - These VFs can be used by applications or virtual machines to communicate between them without sending traffic outside. There is no switch present in HW, hence the support for loopback VFs. diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst deleted file mode 100644 index 6969652f593c..000000000000 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst +++ /dev/null @@ -1,746 +0,0 @@ -.. 
SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB - -================================================= -Mellanox ConnectX(R) mlx5 core VPI Network Driver -================================================= - -Copyright (c) 2019, Mellanox Technologies LTD. - -Contents -======== - -- `Enabling the driver and kconfig options`_ -- `Devlink info`_ -- `Devlink parameters`_ -- `Bridge offload`_ -- `mlx5 subfunction`_ -- `mlx5 function attributes`_ -- `Devlink health reporters`_ -- `mlx5 tracepoints`_ - -Enabling the driver and kconfig options -======================================= - -| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out) -| at build time via kernel Kconfig flags. -| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags -| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y. -| For the list of advanced features, please see below. - -**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko) - -| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config. -| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib). - - -**CONFIG_MLX5_CORE_EN=(y/n)** - -| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads. -| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be -| built-in into mlx5_core.ko. - - -**CONFIG_MLX5_EN_ARFS=(y/n)** - -| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering. -| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4 - - -**CONFIG_MLX5_EN_RXNFC=(y/n)** - -| Enables ethtool receive network flow classification, which allows user defined -| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API. 
- - -**CONFIG_MLX5_CORE_EN_DCB=(y/n)**: - -| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_. - - -**CONFIG_MLX5_MPFS=(y/n)** - -| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC. -| MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing -| user configured unicast MAC addresses to the requesting PF. - - -**CONFIG_MLX5_ESWITCH=(y/n)** - -| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering -| and switching for the enabled VFs and PF in two available modes: -| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_. -| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_. - - -**CONFIG_MLX5_CORE_IPOIB=(y/n)** - -| IPoIB offloads & acceleration support. -| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma -| IPoIB ulp netdevice. - - -**CONFIG_MLX5_FPGA=(y/n)** - -| Build support for the Innova family of network cards by Mellanox Technologies. -| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board. -| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow -| building sandbox-specific client drivers. - - -**CONFIG_MLX5_EN_IPSEC=(y/n)** - -| Enables `IPSec XFRM cryptography-offload acceleration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_. - -**CONFIG_MLX5_EN_TLS=(y/n)** - -| TLS cryptography-offload acceleration. 
- - -**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko) - -| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support. - -**CONFIG_MLX5_SF=(y/n)** - -| Build support for subfunction. -| Subfunctons are more light weight than PCI SRIOV VFs. Choosing this option -| will enable support for creating subfunction devices. - -**External options** ( Choose if the corresponding mlx5 feature is required ) - -- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled -- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled. -- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool). - -Devlink info -============ - -The devlink info reports the running and stored firmware versions on device. -It also prints the device PSID which represents the HCA board type ID. - -User command example:: - - $ devlink dev info pci/0000:00:06.0 - pci/0000:00:06.0: - driver mlx5_core - versions: - fixed: - fw.psid MT_0000000009 - running: - fw.version 16.26.0100 - stored: - fw.version 16.26.0100 - -Devlink parameters -================== - -flow_steering_mode: Device flow steering mode ---------------------------------------------- -The flow steering mode parameter controls the flow steering mode of the driver. -Two modes are supported: -1. 'dmfs' - Device managed flow steering. -2. 'smfs' - Software/Driver managed flow steering. - -In DMFS mode, the HW steering entities are created and managed through the -Firmware. -In SMFS mode, the HW steering entities are created and managed though by -the driver directly into hardware without firmware intervention. - -SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode. 
- -User command examples: - -- Set SMFS flow steering mode:: - - $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime - -- Read device flow steering mode:: - - $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode - pci/0000:06:00.0: - name flow_steering_mode type driver-specific - values: - cmode runtime value smfs - -enable_roce: RoCE enablement state ----------------------------------- -RoCE enablement state controls driver support for RoCE traffic. -When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well-known UDP RoCE port is handled as raw ethernet traffic. - -To change RoCE enablement state, a user must change the driverinit cmode value and run devlink reload. - -User command examples: - -- Disable RoCE:: - - $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit - $ devlink dev reload pci/0000:06:00.0 - -- Read RoCE enablement state:: - - $ devlink dev param show pci/0000:06:00.0 name enable_roce - pci/0000:06:00.0: - name enable_roce type generic - values: - cmode driverinit value true - -esw_port_metadata: Eswitch port metadata state ----------------------------------------------- -When applicable, disabling eswitch metadata can increase packet rate -up to 20% depending on the use case and packet sizes. - -Eswitch port metadata state controls whether to internally tag packets with -metadata. Metadata tagging must be enabled for multi-port RoCE, failover -between representors and stacked devices. -By default metadata is enabled on the supported devices in E-switch. -Metadata is applicable only for E-switch in switchdev mode and -users may disable it when NONE of the below use cases will be in use: -1. HCA is in Dual/multi-port RoCE mode. -2. VF/SF representor bonding (Usually used for Live migration) -3. Stacked devices - -When metadata is disabled, the above use cases will fail to initialize if -users try to enable them. 
- -- Show eswitch port metadata:: - - $ devlink dev param show pci/0000:06:00.0 name esw_port_metadata - pci/0000:06:00.0: - name esw_port_metadata type driver-specific - values: - cmode runtime value true - -- Disable eswitch port metadata:: - - $ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime - -- Change eswitch mode to switchdev mode where after choosing the metadata value:: - - $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev - -Bridge offload -============== -The mlx5 driver implements support for offloading bridge rules when in switchdev -mode. Linux bridge FDBs are automatically offloaded when mlx5 switchdev -representor is attached to bridge. - -- Change device to switchdev mode:: - - $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev - -- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1':: - - $ ip link set enp8s0f0 master bridge1 - -VLANs ------ -Following bridge VLAN functions are supported by mlx5: - -- VLAN filtering (including multiple VLANs per port):: - - $ ip link set bridge1 type bridge vlan_filtering 1 - $ bridge vlan add dev enp8s0f0 vid 2-3 - -- VLAN push on bridge ingress:: - - $ bridge vlan add dev enp8s0f0 vid 3 pvid - -- VLAN pop on bridge egress:: - - $ bridge vlan add dev enp8s0f0 vid 3 untagged - -mlx5 subfunction -================ -mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface. - -A subfunction has its own function capabilities and its own resources. This -means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These -queues are neither shared nor stolen from the parent PCI function. - -When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA -resources neither shared nor stolen from the parent PCI function. 
- -A subfunction has a dedicated window in PCI BAR space that is not shared -with the other subfunctions or the parent PCI function. This ensures that all -devices (netdev, rdma, vdpa, etc.) of the subfunction accesses only assigned -PCI BAR space. - -A subfunction supports eswitch representation through which it supports tc -offloads. The user configures eswitch to send/receive packets from/to -the subfunction port. - -Subfunctions share PCI level resources such as PCI MSI-X IRQs with -other subfunctions and/or with its parent PCI function. - -Example mlx5 software, system, and device view:: - - _______ - | admin | - | user |---------- - |_______| | - | | - ____|____ __|______ _________________ - | | | | | | - | devlink | | tc tool | | user | - | tool | |_________| | applications | - |_________| | |_________________| - | | | | - | | | | Userspace - +---------|-------------|-------------------|----------|--------------------+ - | | +----------+ +----------+ Kernel - | | | netdev | | rdma dev | - | | +----------+ +----------+ - (devlink port add/del | ^ ^ - port function set) | | | - | | +---------------| - _____|___ | | _______|_______ - | | | | | mlx5 class | - | devlink | +------------+ | | drivers | - | kernel | | rep netdev | | |(mlx5_core,ib) | - |_________| +------------+ | |_______________| - | | | ^ - (devlink ops) | | (probe/remove) - _________|________ | | ____|________ - | subfunction | | +---------------+ | subfunction | - | management driver|----- | subfunction |---| driver | - | (mlx5_core) | | auxiliary dev | | (mlx5_core) | - |__________________| +---------------+ |_____________| - | ^ - (sf add/del, vhca events) | - | (device add/del) - _____|____ ____|________ - | | | subfunction | - | PCI NIC |--- activate/deactivate events--->| host driver | - |__________| | (mlx5_core) | - |_____________| - -Subfunction is created using devlink port interface. 
- -- Change device to switchdev mode:: - - $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev - -- Add a devlink port of subfunction flavour:: - - $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 - pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false - function: - hw_addr 00:00:00:00:00:00 state inactive opstate detached - -- Show a devlink port of the subfunction:: - - $ devlink port show pci/0000:06:00.0/32768 - pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 - function: - hw_addr 00:00:00:00:00:00 state inactive opstate detached - -- Delete a devlink port of subfunction after use:: - - $ devlink port del pci/0000:06:00.0/32768 - -mlx5 function attributes -======================== -The mlx5 driver provides a mechanism to setup PCI VF/SF function attributes in -a unified way for SmartNIC and non-SmartNIC. - -This is supported only when the eswitch mode is set to switchdev. Port function -configuration of the PCI VF/SF is supported through devlink eswitch port. - -Port function attributes should be set before PCI VF/SF is enumerated by the -driver. - -MAC address setup ------------------ -mlx5 driver support devlink port function attr mechanism to setup MAC -address. (refer to Documentation/networking/devlink/devlink-port.rst) - -RoCE capability setup ---------------------- -Not all mlx5 PCI devices/SFs require RoCE capability. - -When RoCE capability is disabled, it saves 1 Mbytes worth of system memory per -PCI devices/SF. - -mlx5 driver support devlink port function attr mechanism to setup RoCE -capability. (refer to Documentation/networking/devlink/devlink-port.rst) - -migratable capability setup ---------------------------- -User who wants mlx5 PCI VFs to be able to perform live migration need to -explicitly enable the VF migratable capability. 
- -mlx5 driver support devlink port function attr mechanism to setup migratable -capability. (refer to Documentation/networking/devlink/devlink-port.rst) - -SF state setup --------------- -To use the SF, the user must activate the SF using the SF function state -attribute. - -- Get the state of the SF identified by its unique devlink port index:: - - $ devlink port show ens2f0npf0sf88 - pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false - function: - hw_addr 00:00:00:00:88:88 state inactive opstate detached - -- Activate the function and verify its state is active:: - - $ devlink port function set ens2f0npf0sf88 state active - - $ devlink port show ens2f0npf0sf88 - pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false - function: - hw_addr 00:00:00:00:88:88 state active opstate detached - -Upon function activation, the PF driver instance gets the event from the device -that a particular SF was activated. It's the cue to put the device on bus, probe -it and instantiate the devlink instance and class specific auxiliary devices -for it. 
- -- Show the auxiliary device and port of the subfunction:: - - $ devlink dev show - devlink dev show auxiliary/mlx5_core.sf.4 - - $ devlink port show auxiliary/mlx5_core.sf.4/1 - auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false - - $ rdma link show mlx5_0/1 - link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88 - - $ rdma dev show - 8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112 - 13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112 - -- Subfunction auxiliary device and class device hierarchy:: - - mlx5_core.sf.4 - (subfunction auxiliary device) - /\ - / \ - / \ - / \ - / \ - mlx5_core.eth.4 mlx5_core.rdma.4 - (sf eth aux dev) (sf rdma aux dev) - | | - | | - p0sf88 mlx5_0 - (sf netdev) (sf rdma device) - -Additionally, the SF port also gets the event when the driver attaches to the -auxiliary device of the subfunction. This results in changing the operational -state of the function. This provides visibility to the user to decide when is it -safe to delete the SF port for graceful termination of the subfunction. - -- Show the SF port operational state:: - - $ devlink port show ens2f0npf0sf88 - pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false - function: - hw_addr 00:00:00:00:88:88 state active opstate attached - -Devlink health reporters -======================== - -tx reporter ------------ -The tx reporter is responsible for reporting and recovering of the following two error scenarios: - -- tx timeout - Report on kernel tx timeout detection. - Recover by searching lost interrupts. -- tx error completion - Report on error tx completion. - Recover by flushing the tx queue and reset it. - -tx reporter also support on demand diagnose callback, on which it provides -real time information of its send queues status. 
- -User commands examples: - -- Diagnose send queues status:: - - $ devlink health diagnose pci/0000:82:00.0 reporter tx - -NOTE: This command has valid output only when interface is up, otherwise the command has empty output. - -- Show number of tx errors indicated, number of recover flows ended successfully, - is autorecover enabled and graceful period from last recover:: - - $ devlink health show pci/0000:82:00.0 reporter tx - -rx reporter ------------ -The rx reporter is responsible for reporting and recovering of the following two error scenarios: - -- rx queues' initialization (population) timeout - Population of rx queues' descriptors on ring initialization is done - in napi context via triggering an irq. In case of a failure to get - the minimum amount of descriptors, a timeout would occur, and - descriptors could be recovered by polling the EQ (Event Queue). -- rx completions with errors (reported by HW on interrupt context) - Report on rx completion error. - Recover (if needed) by flushing the related queue and reset it. - -rx reporter also supports on demand diagnose callback, on which it -provides real time information of its receive queues' status. - -- Diagnose rx queues' status and corresponding completion queue:: - - $ devlink health diagnose pci/0000:82:00.0 reporter rx - -NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output. - -- Show number of rx errors indicated, number of recover flows ended successfully, - is autorecover enabled, and graceful period from last recover:: - - $ devlink health show pci/0000:82:00.0 reporter rx - -fw reporter ------------ -The fw reporter implements `diagnose` and `dump` callbacks. -It follows symptoms of fw error such as fw syndrome by triggering -fw core dump and storing it into the dump buffer. -The fw reporter diagnose command can be triggered any time by the user to check -current fw status. 
- -User commands examples: - -- Check fw heath status:: - - $ devlink health diagnose pci/0000:82:00.0 reporter fw - -- Read FW core dump if already stored or trigger new one:: - - $ devlink health dump show pci/0000:82:00.0 reporter fw - -NOTE: This command can run only on the PF which has fw tracer ownership, -running it on other PF or any VF will return "Operation not permitted". - -fw fatal reporter ------------------ -The fw fatal reporter implements `dump` and `recover` callbacks. -It follows fatal errors indications by CR-space dump and recover flow. -The CR-space dump uses vsc interface which is valid even if the FW command -interface is not functional, which is the case in most FW fatal errors. -The recover function runs recover flow which reloads the driver and triggers fw -reset if needed. -On firmware error, the health buffer is dumped into the dmesg. The log -level is derived from the error's severity (given in health buffer). - -User commands examples: - -- Run fw recover flow manually:: - - $ devlink health recover pci/0000:82:00.0 reporter fw_fatal - -- Read FW CR-space dump if already stored or trigger new one:: - - $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal - -NOTE: This command can run only on PF. - -mlx5 tracepoints -================ - -mlx5 driver provides internal tracepoints for tracking and debugging using -kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst). - -For the list of support mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`. - -tc and eswitch offloads tracepoints: - -- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5:: - - $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... 
- tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT - -- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5:: - - $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL - -- mlx5e_stats_flower: trace flower stats request:: - - $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217 - -- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5:: - - $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1 - -- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events:: - - $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1 - -Bridge offloads tracepoints: - -- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5:: - - $ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event - $ cat /sys/kernel/debug/tracing/trace - ... 
- kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0 - -- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5:: - - $ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event - $ cat /sys/kernel/debug/tracing/trace - ... - ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16 - -- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in - mlx5:: - - $ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event - $ cat /sys/kernel/debug/tracing/trace - ... - kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0 - -- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5 - representor:: - - $ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event - $ cat /sys/kernel/debug/tracing/trace - ... - ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6 - -- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5 - representor:: - - $ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event - $ cat /sys/kernel/debug/tracing/trace - ... - bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8 - -- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper - device:: - - $ echo mlx5:mlx5_esw_bridge_vport_init >> set_event - $ cat /sys/kernel/debug/tracing/trace - ... - ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1 - -- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper - device:: - - $ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event - $ cat /sys/kernel/debug/tracing/trace - ... 
- ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1 - -Eswitch QoS tracepoints: - -- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport:: - - $ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - <...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3 - -- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport:: - - $ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - <...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3 - -- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport:: - - $ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - <...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3 - -- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group:: - - $ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - <...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 - -- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group:: - - $ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - <...>-27303 [020] .... 
76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000 - -- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group:: - - $ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - <...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1 - -SF tracepoints: - -- mlx5_sf_add: trace addition of the SF port:: - - $ echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88 - -- mlx5_sf_free: trace freeing of the SF port:: - - $ echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 - -- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context:: - - $ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88 - -- mlx5_sf_hwc_free: trace freeing of the hardware SF context:: - - $ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000 - -- mlx5_sf_hwc_deferred_free : trace deferred freeing of the hardware SF context:: - - $ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - devlink-9519 [046] ..... 
24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000 - -- mlx5_sf_vhca_event: trace SF vhca event and state:: - - $ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1 - -- mlx5_sf_dev_add : trace SF device add event:: - - $ echo mlx5:mlx5_sf_dev_add>> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88 - -- mlx5_sf_dev_del : trace SF device delete event:: - - $ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event - $ cat /sys/kernel/debug/tracing/trace - ... - kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88 diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst new file mode 100644 index 000000000000..4cd8e869762b --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -0,0 +1,1302 @@ +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +.. include:: <isonum.txt> + +================ +Ethtool counters +================ + +:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + +Contents +======== + +- `Overview`_ +- `Groups`_ +- `Types`_ +- `Descriptions`_ + +Overview +======== + +There are several counter groups based on where the counter is being counted. In +addition, each group of counters may have different counter types. 
+ +These counter groups are based on which component in a networking setup, +illustrated below, that they describe:: + + ---------------------------------------- + | | + ---------------------------------------- ---------------------------------------- | + | Hypervisor | | VM | | + | | | | | + | ------------------- --------------- | | ------------------- --------------- | | + | | Ethernet driver | | RDMA driver | | | | Ethernet driver | | RDMA driver | | | + | ------------------- --------------- | | ------------------- --------------- | | + | | | | | | | | | + | ------------------- | | ------------------- | | + | | | | | |-- + ---------------------------------------- ---------------------------------------- + | | + ------------- ----------------------------- + | | + ------ ------ ------ ------ ------ ------ ------ + -----| PF |----------------------| VF |-| VF |-| VF |----- --| PF |--- --| PF |--- --| PF |--- + | ------ ------ ------ ------ | | ------ | | ------ | | ------ | + | | | | | | | | + | | | | | | | | + | | | | | | | | + | eSwitch | | eSwitch | | eSwitch | | eSwitch | + ---------------------------------------------------------- ----------- ----------- ----------- + ------------------------------------------------------------------------------- + | | + | | + | Uplink (no counters) | + ------------------------------------------------------------------------------- + --------------------------------------------------------------- + | | + | | + | MPFS (no counters) | + --------------------------------------------------------------- + | + | + | Port + +Groups +====== + +Ring + Software counters populated by the driver stack. + +Netdev + An aggregation of software ring counters. + +vPort counters + Traffic counters and drops due to steering or no buffers. May indicate issues + with NIC. These counters include Ethernet traffic counters (including Raw + Ethernet) and RDMA/RoCE traffic counters. 
+ +Physical port counters + Counters that collect statistics about the PFs and VFs. May indicate issues + with the NIC, link, or network. This measuring point holds information on + standardized counters like IEEE 802.3, RFC 2863, RFC 2819, RFC 3635 and + additional counters like flow control, FEC and more. Physical port counters + are not exposed to virtual machines. + +Priority Port Counters + A set of the physical port counters, per priority per port. + +Types +===== + +Counters are divided into three types. + +Traffic Informative Counters + Counters which count traffic. These counters can be used for load estimation + or for general debug. + +Traffic Acceleration Counters + Counters which count traffic that was accelerated by the Mellanox driver or by + hardware. The counters are an additional layer to the informative counter set, + and the same traffic is counted in both informative and acceleration counters. + +.. [#accel] Traffic acceleration counter. + +Error Counters + Increment of these counters might indicate a problem. Each of these counters + has an explanation and corrective action. + +Statistics can be fetched via the `ip link` or `ethtool` commands. `ethtool` +provides more detailed information:: + + ip -s link show <if-name> + ethtool -S <if-name> + +Descriptions +============ + +XSK, PTP, and QoS counters that are similar to counters defined previously will +not be separately listed. For example, `ptp_tx[i]_packets` will not be +explicitly documented since `tx[i]_packets` describes the behavior of both +counters, except `ptp_tx[i]_packets` is only counted when precision time +protocol is used. + +Ring / Netdev Counter +---------------------------- +The following counters are available per ring or software port. + +These counters provide information on the amount of traffic that was accelerated +by the NIC. They count the accelerated traffic in addition to the +standard counters, which also count it (i.e. accelerated traffic is counted twice). 
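As a rough illustration of how per-ring counters relate to the software port (netdev) counters, the sketch below parses text in the `name: value` layout printed by `ethtool -S` and sums each indexed ring counter (for example `rx0_packets`, `rx1_packets`) into the matching un-indexed name (`rx_packets`). The sample text, its values, and the helper names `parse_stats` and `sum_rings` are made up for illustration; they are not part of the driver or of ethtool, and a real device reports many more counters.

```python
import re
from collections import defaultdict

# Hypothetical sample in the "name: value" layout of `ethtool -S <if-name>`.
SAMPLE = """\
NIC statistics:
     rx_packets: 30
     rx_bytes: 4200
     rx0_packets: 10
     rx1_packets: 20
     rx0_bytes: 1400
     rx1_bytes: 2800
"""

# A ring counter is "rx" or "tx", a ring index, then "_<suffix>".
RING_RE = re.compile(r"^(rx|tx)(\d+)_(.+)$")


def parse_stats(text):
    """Return {counter_name: int_value} for every 'name: value' line."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        # Header lines such as "NIC statistics:" have no numeric value.
        if sep and name and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def sum_rings(stats):
    """Sum indexed ring counters into their un-indexed counter name."""
    totals = defaultdict(int)
    for name, value in stats.items():
        match = RING_RE.match(name)
        if match:
            direction, _ring, suffix = match.groups()
            totals[f"{direction}_{suffix}"] += value
    return dict(totals)


stats = parse_stats(SAMPLE)
ring_totals = sum_rings(stats)
```

With the sample above, comparing `ring_totals["rx_packets"]` against `stats["rx_packets"]` is one way to sanity-check that the software port counter matches the sum over its rings.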
+ +The counter names in the table below refer to both ring and port counters. The +notation for ring counters includes the [i] index without the brackets. The +notation for port counters doesn't include the [i]. A counter name +`rx[i]_packets` will be printed as `rx0_packets` for ring 0 and `rx_packets` for +the software port. + +.. flat-table:: Ring / Software Port Counter Table + :widths: 2 3 1 + + * - Counter + - Description + - Type + + * - `rx[i]_packets` + - The number of packets received on ring i. + - Informative + + * - `rx[i]_bytes` + - The number of bytes received on ring i. + - Informative + + * - `tx[i]_packets` + - The number of packets transmitted on ring i. + - Informative + + * - `tx[i]_bytes` + - The number of bytes transmitted on ring i. + - Informative + + * - `tx[i]_recover` + - The number of times the SQ was recovered. + - Error + + * - `tx[i]_cqes` + - The number of CQE events issued on the SQ of ring i. + - Informative + + * - `tx[i]_cqe_err` + - The number of error CQEs encountered on the SQ for ring i. + - Error + + * - `tx[i]_tso_packets` + - The number of TSO packets transmitted on ring i [#accel]_. + - Acceleration + + * - `tx[i]_tso_bytes` + - The number of TSO bytes transmitted on ring i [#accel]_. + - Acceleration + + * - `tx[i]_tso_inner_packets` + - The number of TSO packets indicated to carry inner + encapsulation that were transmitted on ring i [#accel]_. + - Acceleration + + * - `tx[i]_tso_inner_bytes` + - The number of TSO bytes indicated to carry inner + encapsulation that were transmitted on ring i [#accel]_. + - Acceleration + + * - `rx[i]_gro_packets` + - Number of received packets processed using hardware-accelerated GRO. The + number of hardware GRO offloaded packets received on ring i. + - Acceleration + + * - `rx[i]_gro_bytes` + - Number of received bytes processed using hardware-accelerated GRO. The + number of hardware GRO offloaded bytes received on ring i. 
+ - Acceleration + + * - `rx[i]_gro_skbs` + - The number of receive SKBs constructed while performing + hardware-accelerated GRO. + - Informative + + * - `rx[i]_gro_match_packets` + - Number of received packets processed using hardware-accelerated GRO that + met the flow table match criteria. + - Informative + + * - `rx[i]_gro_large_hds` + - Number of receive packets using hardware-accelerated GRO that have large + headers that require additional memory to be allocated. + - Informative + + * - `rx[i]_lro_packets` + - The number of LRO packets received on ring i [#accel]_. + - Acceleration + + * - `rx[i]_lro_bytes` + - The number of LRO bytes received on ring i [#accel]_. + - Acceleration + + * - `rx[i]_ecn_mark` + - The number of received packets where the ECN mark was turned on. + - Informative + + * - `rx_oversize_pkts_buffer` + - The number of dropped received packets due to length which arrived to RQ + and exceed software buffer size allocated by the device for incoming + traffic. It might imply that the device MTU is larger than the software + buffers size. + - Error + + * - `rx_oversize_pkts_sw_drop` + - Number of received packets dropped in software because the CQE data is + larger than the MTU size. + - Error + + * - `rx[i]_csum_unnecessary` + - Packets received with a `CHECKSUM_UNNECESSARY` on ring i [#accel]_. + - Acceleration + + * - `rx[i]_csum_unnecessary_inner` + - Packets received with inner encapsulation with a `CHECKSUM_UNNECESSARY` + on ring i [#accel]_. + - Acceleration + + * - `rx[i]_csum_none` + - Packets received with a `CHECKSUM_NONE` on ring i [#accel]_. + - Acceleration + + * - `rx[i]_csum_complete` + - Packets received with a `CHECKSUM_COMPLETE` on ring i [#accel]_. + - Acceleration + + * - `rx[i]_csum_complete_tail` + - Number of received packets that had checksum calculation computed, + potentially needed padding, and were able to do so with + `CHECKSUM_PARTIAL`. 
+ - Informative + + * - `rx[i]_csum_complete_tail_slow` + - Number of received packets that need padding larger than eight bytes for + the checksum. + - Informative + + * - `tx[i]_csum_partial` + - Packets transmitted with a `CHECKSUM_PARTIAL` on ring i [#accel]_. + - Acceleration + + * - `tx[i]_csum_partial_inner` + - Packets transmitted with inner encapsulation with a `CHECKSUM_PARTIAL` on + ring i [#accel]_. + - Acceleration + + * - `tx[i]_csum_none` + - Packets transmitted with no hardware checksum acceleration on ring i. + - Informative + + * - `tx[i]_stopped` / `tx_queue_stopped` [#ring_global]_ + - Events where SQ was full on ring i. If this counter increases, check + the number of buffers allocated for transmission. + - Informative + + * - `tx[i]_wake` / `tx_queue_wake` [#ring_global]_ + - Events where SQ was full and has become not full on ring i. + - Informative + + * - `tx[i]_dropped` / `tx_queue_dropped` [#ring_global]_ + - Packets transmitted that were dropped due to DMA mapping failure on + ring i. If this counter increases, check the number of buffers + allocated for transmission. + - Error + + * - `tx[i]_nop` + - The number of nop WQEs (empty WQEs) inserted into the SQ (related to + ring i) because the end of the cyclic buffer was reached. When nearing + the end of the cyclic buffer, the driver may add those empty WQEs to + avoid handling a state where a WQE starts at the end of the queue and ends + at the beginning of the queue. This is a normal condition. + - Informative + + * - `tx[i]_added_vlan_packets` + - The number of packets sent where VLAN tag insertion was offloaded to the + hardware. + - Acceleration + + * - `rx[i]_removed_vlan_packets` + - The number of packets received where VLAN tag stripping was offloaded to + the hardware. + - Acceleration + + * - `rx[i]_wqe_err` + - The number of wrong opcodes received on ring i. 
+ - Error + + * - `rx[i]_mpwqe_frag` + - The number of WQEs that failed to allocate a compound page and hence + fragmented MPWQEs (Multi-Packet WQEs) were used on ring i. If this + counter rises, it may suggest that there is not enough memory for large + pages, so the driver allocated fragmented pages. This is not an abnormal + condition. + - Informative + + * - `rx[i]_mpwqe_filler_cqes` + - The number of filler CQE events that were issued on ring i. + - Informative + + * - `rx[i]_mpwqe_filler_strides` + - The number of strides consumed by filler CQEs on ring i. + - Informative + + * - `tx[i]_mpwqe_blks` + - The number of send blocks processed from Multi-Packet WQEs (mpwqe). + - Informative + + * - `tx[i]_mpwqe_pkts` + - The number of send packets processed from Multi-Packet WQEs (mpwqe). + - Informative + + * - `rx[i]_cqe_compress_blks` + - The number of receive blocks with CQE compression on ring i [#accel]_. + - Acceleration + + * - `rx[i]_cqe_compress_pkts` + - The number of receive packets with CQE compression on ring i [#accel]_. + - Acceleration + + * - `rx[i]_cache_reuse` + - The number of events of successful reuse of a page from a driver's + internal page cache. + - Acceleration + + * - `rx[i]_cache_full` + - The number of events of full internal page cache where the driver can't put a + page back to the cache for recycling (page will be freed). + - Acceleration + + * - `rx[i]_cache_empty` + - The number of events where the cache was empty - no page to give. The driver + allocates a new page. + - Acceleration + + * - `rx[i]_cache_busy` + - The number of events where the cache head was busy and could not be recycled. + The driver allocated a new page. + - Acceleration + + * - `rx[i]_cache_waive` + - The number of cache evacuations. This can occur when a page moves to + another NUMA node or when a page was pfmemalloc-ed and should be freed as soon + as possible. + - Acceleration + + * - `rx[i]_arfs_err` + - Number of flow rules that failed to be added to the flow table. 
+ - Error + + * - `rx[i]_recover` + - The number of times the RQ was recovered. + - Error + + * - `tx[i]_xmit_more` + - The number of packets sent with `xmit_more` indication set on the skbuff + (no doorbell). + - Acceleration + + * - `ch[i]_poll` + - The number of invocations of NAPI poll of channel i. + - Informative + + * - `ch[i]_arm` + - The number of times the NAPI poll function completed and armed the + completion queues on channel i. + - Informative + + * - `ch[i]_aff_change` + - The number of times the NAPI poll function explicitly stopped execution + on a CPU due to a change in affinity, on channel i. + - Informative + + * - `ch[i]_events` + - The number of hard interrupt events on the completion queues of channel i. + - Informative + + * - `ch[i]_eq_rearm` + - The number of times the EQ was recovered. + - Error + + * - `ch[i]_force_irq` + - Number of times NAPI is triggered by XSK wakeups by posting a NOP to + ICOSQ. + - Acceleration + + * - `rx[i]_congst_umr` + - The number of times an outstanding UMR request is delayed due to + congestion, on ring i. + - Informative + + * - `rx_pp_alloc_fast` + - Number of successful fast path allocations. + - Informative + + * - `rx_pp_alloc_slow` + - Number of slow path order-0 allocations. + - Informative + + * - `rx_pp_alloc_slow_high_order` + - Number of slow path high order allocations. + - Informative + + * - `rx_pp_alloc_empty` + - Counter is incremented when ptr ring is empty, so a slow path allocation + was forced. + - Informative + + * - `rx_pp_alloc_refill` + - Counter is incremented when an allocation which triggered a refill of the + cache. + - Informative + + * - `rx_pp_alloc_waive` + - Counter is incremented when pages obtained from the ptr ring that cannot + be added to the cache due to a NUMA mismatch. + - Informative + + * - `rx_pp_recycle_cached` + - Counter is incremented when recycling placed page in the page pool cache. 
+ - Informative + + * - `rx_pp_recycle_cache_full` + - Counter is incremented when the page pool cache was full. + - Informative + + * - `rx_pp_recycle_ring` + - Counter is incremented when a page is placed into the ptr ring. + - Informative + + * - `rx_pp_recycle_ring_full` + - Counter is incremented when a page is released from the page pool because the ptr + ring was full. + - Informative + + * - `rx_pp_recycle_released_ref` + - Counter is incremented when a page is released (and not recycled) because + refcnt > 1. + - Informative + + * - `rx[i]_xsk_buff_alloc_err` + - The number of times allocating an skb or XSK buffer failed in the XSK RQ + context. + - Error + + * - `rx[i]_xsk_arfs_err` + - aRFS (accelerated Receive Flow Steering) does not occur in the XSK RQ + context, so this counter should never increment. + - Error + + * - `rx[i]_xdp_tx_xmit` + - The number of packets forwarded back to the port due to XDP program + `XDP_TX` action (bouncing). These packets are not counted by other + software counters. These packets are counted by physical port and vPort + counters. + - Informative + + * - `rx[i]_xdp_tx_mpwqe` + - Number of multi-packet WQEs transmitted by the netdev and `XDP_TX`-ed by + the netdev during the RQ context. + - Acceleration + + * - `rx[i]_xdp_tx_inlnw` + - Number of WQE data segments transmitted where the data could be inlined + in the WQE and then `XDP_TX`-ed during the RQ context. + - Acceleration + + * - `rx[i]_xdp_tx_nops` + - Number of NOP WQEBBs (WQE building blocks) posted to the XDP SQ. + - Acceleration + + * - `rx[i]_xdp_tx_full` + - The number of packets that should have been forwarded back to the port + due to `XDP_TX` action but were dropped due to a full tx queue. These packets + are not counted by other software counters. These packets are counted by + physical port and vPort counters. You may open more rx queues and spread + rx traffic over all queues and/or increase rx ring size. 
+ - Error + + * - `rx[i]_xdp_tx_err` + - The number of times an `XDP_TX` error such as frame too long and frame + too short occurred on `XDP_TX` ring of RX ring. + - Error + + * - `rx[i]_xdp_tx_cqes` / `rx_xdp_tx_cqe` [#ring_global]_ + - The number of completions received on the CQ of the `XDP_TX` ring. + - Informative + + * - `rx[i]_xdp_drop` + - The number of packets dropped due to XDP program `XDP_DROP` action. these + packets are not counted by other software counters. These packets are + counted by physical port and vPort counters. + - Informative + + * - `rx[i]_xdp_redirect` + - The number of times an XDP redirect action was triggered on ring i. + - Acceleration + + * - `tx[i]_xdp_xmit` + - The number of packets redirected to the interface(due to XDP redirect). + These packets are not counted by other software counters. These packets + are counted by physical port and vPort counters. + - Informative + + * - `tx[i]_xdp_full` + - The number of packets redirected to the interface(due to XDP redirect), + but were dropped due to full tx queue. these packets are not counted by + other software counters. you may enlarge tx queues. + - Informative + + * - `tx[i]_xdp_mpwqe` + - Number of multi-packet WQEs offloaded onto the NIC that were + `XDP_REDIRECT`-ed from other netdevs. + - Acceleration + + * - `tx[i]_xdp_inlnw` + - Number of WQE data segments where the data could be inlined in the WQE + where the data segments were `XDP_REDIRECT`-ed from other netdevs. + - Acceleration + + * - `tx[i]_xdp_nops` + - Number of NOP WQEBBs (WQE building blocks) posted to the SQ that were + `XDP_REDIRECT`-ed from other netdevs. + - Acceleration + + * - `tx[i]_xdp_err` + - The number of packets redirected to the interface(due to XDP redirect) + but were dropped due to error such as frame too long and frame too short. + - Error + + * - `tx[i]_xdp_cqes` + - The number of completions received for packets redirected to the + interface(due to XDP redirect) on the CQ. 
+ - Informative + + * - `tx[i]_xsk_xmit` + - The number of packets transmitted using XSK zerocopy functionality. + - Acceleration + + * - `tx[i]_xsk_mpwqe` + - Number of multi-packet WQEs offloaded onto the NIC that were + `XDP_REDIRECT`-ed from other netdevs. + - Acceleration + + * - `tx[i]_xsk_inlnw` + - Number of WQE data segments where the data could be inlined in the WQE + that are transmitted using XSK zerocopy. + - Acceleration + + * - `tx[i]_xsk_full` + - Number of times doorbell is rung in XSK zerocopy mode when SQ is full. + - Error + + * - `tx[i]_xsk_err` + - Number of errors that occurred in XSK zerocopy mode such as if the data + size is larger than the MTU size. + - Error + + * - `tx[i]_xsk_cqes` + - Number of CQEs processed in XSK zerocopy mode. + - Acceleration + + * - `tx_tls_ctx` + - Number of TLS TX HW offload contexts added to device for encryption. + - Acceleration + + * - `tx_tls_del` + - Number of TLS TX HW offload contexts removed from device (connection + closed). + - Acceleration + + * - `tx_tls_pool_alloc` + - Number of times a unit of work is successfully allocated in the TLS HW + offload pool. + - Acceleration + + * - `tx_tls_pool_free` + - Number of times a unit of work is freed in the TLS HW offload pool. + - Acceleration + + * - `rx_tls_ctx` + - Number of TLS RX HW offload contexts added to device for decryption. + - Acceleration + + * - `rx_tls_del` + - Number of TLS RX HW offload contexts deleted from device (connection has + finished). + - Acceleration + + * - `rx[i]_tls_decrypted_packets` + - Number of successfully decrypted RX packets which were part of a TLS + stream. + - Acceleration + + * - `rx[i]_tls_decrypted_bytes` + - Number of TLS payload bytes in RX packets which were successfully + decrypted. + - Acceleration + + * - `rx[i]_tls_resync_req_pkt` + - Number of received TLS packets with a resync request. + - Acceleration + + * - `rx[i]_tls_resync_req_start` + - Number of times the TLS async resync request was started. 
+ - Acceleration + + * - `rx[i]_tls_resync_req_end` + - Number of times the TLS async resync request properly ended with + providing the HW tracked tcp-seq. + - Acceleration + + * - `rx[i]_tls_resync_req_skip` + - Number of times the TLS async resync request procedure was started but + not properly ended. + - Error + + * - `rx[i]_tls_resync_res_ok` + - Number of times the TLS resync response call to the driver was + successfully handled. + - Acceleration + + * - `rx[i]_tls_resync_res_retry` + - Number of times the TLS resync response call to the driver was + reattempted when ICOSQ is full. + - Error + + * - `rx[i]_tls_resync_res_skip` + - Number of times the TLS resync response call to the driver was terminated + unsuccessfully. + - Error + + * - `rx[i]_tls_err` + - Number of times when CQE TLS offload was problematic. + - Error + + * - `tx[i]_tls_encrypted_packets` + - The number of send packets that are TLS encrypted by the kernel. + - Acceleration + + * - `tx[i]_tls_encrypted_bytes` + - The number of send bytes that are TLS encrypted by the kernel. + - Acceleration + + * - `tx[i]_tls_ooo` + - Number of times out of order TLS SQE fragments were handled on ring i. + - Acceleration + + * - `tx[i]_tls_dump_packets` + - Number of TLS decrypted packets copied over from NIC over DMA. + - Acceleration + + * - `tx[i]_tls_dump_bytes` + - Number of TLS decrypted bytes copied over from NIC over DMA. + - Acceleration + + * - `tx[i]_tls_resync_bytes` + - Number of TLS bytes requested to be resynchronized in order to be + decrypted. + - Acceleration + + * - `tx[i]_tls_skip_no_sync_data` + - Number of TLS send data that can safely be skipped / do not need to be + decrypted. + - Acceleration + + * - `tx[i]_tls_drop_no_sync_data` + - Number of TLS send data that were dropped due to retransmission of TLS + data. 
+ - Acceleration + + * - `ptp_cq[i]_abort` + - Number of times a CQE has to be skipped in precision time protocol due to + a skew between the port timestamp and CQE timestamp being greater than + 128 seconds. + - Error + + * - `ptp_cq[i]_abort_abs_diff_ns` + - Accumulation of time differences between the port timestamp and CQE + timestamp when the difference is greater than 128 seconds in precision + time protocol. + - Error + +.. [#ring_global] The corresponding ring and global counters do not share the + same name (i.e. do not follow the common naming scheme). + +vPort Counters +-------------- +Counters on the NIC port that is connected to a eSwitch. + +.. flat-table:: vPort Counter Table + :widths: 2 3 1 + + * - Counter + - Description + - Type + + * - `rx_vport_unicast_packets` + - Unicast packets received, steered to a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `rx_vport_unicast_bytes` + - Unicast bytes received, steered to a port including Raw Ethernet QP/DPDK + traffic, excluding RDMA traffic. + - Informative + + * - `tx_vport_unicast_packets` + - Unicast packets transmitted, steered from a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `tx_vport_unicast_bytes` + - Unicast bytes transmitted, steered from a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `rx_vport_multicast_packets` + - Multicast packets received, steered to a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `rx_vport_multicast_bytes` + - Multicast bytes received, steered to a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `tx_vport_multicast_packets` + - Multicast packets transmitted, steered from a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. 
+ - Informative + + * - `tx_vport_multicast_bytes` + - Multicast bytes transmitted, steered from a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `rx_vport_broadcast_packets` + - Broadcast packets received, steered to a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `rx_vport_broadcast_bytes` + - Broadcast bytes received, steered to a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `tx_vport_broadcast_packets` + - Broadcast packets transmitted, steered from a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `tx_vport_broadcast_bytes` + - Broadcast bytes transmitted, steered from a port including Raw Ethernet + QP/DPDK traffic, excluding RDMA traffic. + - Informative + + * - `rx_vport_rdma_unicast_packets` + - RDMA unicast packets received, steered to a port (counters counts + RoCE/UD/RC traffic) [#accel]_. + - Acceleration + + * - `rx_vport_rdma_unicast_bytes` + - RDMA unicast bytes received, steered to a port (counters counts + RoCE/UD/RC traffic) [#accel]_. + - Acceleration + + * - `tx_vport_rdma_unicast_packets` + - RDMA unicast packets transmitted, steered from a port (counters counts + RoCE/UD/RC traffic) [#accel]_. + - Acceleration + + * - `tx_vport_rdma_unicast_bytes` + - RDMA unicast bytes transmitted, steered from a port (counters counts + RoCE/UD/RC traffic) [#accel]_. + - Acceleration + + * - `rx_vport_rdma_multicast_packets` + - RDMA multicast packets received, steered to a port (counters counts + RoCE/UD/RC traffic) [#accel]_. + - Acceleration + + * - `rx_vport_rdma_multicast_bytes` + - RDMA multicast bytes received, steered to a port (counters counts + RoCE/UD/RC traffic) [#accel]_. + - Acceleration + + * - `tx_vport_rdma_multicast_packets` + - RDMA multicast packets transmitted, steered from a port (counters counts + RoCE/UD/RC traffic) [#accel]_. 
+ - Acceleration + + * - `tx_vport_rdma_multicast_bytes` + - RDMA multicast bytes transmitted, steered from a port (counters counts + RoCE/UD/RC traffic) [#accel]_. + - Acceleration + + * - `rx_steer_missed_packets` + - Number of packets that was received by the NIC, however was discarded + because it did not match any flow in the NIC flow table. + - Error + + * - `rx_packets` + - Representor only: packets received, that were handled by the hypervisor. + - Informative + + * - `rx_bytes` + - Representor only: bytes received, that were handled by the hypervisor. + - Informative + + * - `tx_packets` + - Representor only: packets transmitted, that were handled by the + hypervisor. + - Informative + + * - `tx_bytes` + - Representor only: bytes transmitted, that were handled by the hypervisor. + - Informative + + * - `dev_internal_queue_oob` + - The number of dropped packets due to lack of receive WQEs for an internal + device RQ. + - Error + +Physical Port Counters +---------------------- +The physical port counters are the counters on the external port connecting the +adapter to the network. This measuring point holds information on standardized +counters like IEEE 802.3, RFC2863, RFC 2819, RFC 3635 and additional counters +like flow control, FEC and more. + +.. flat-table:: Physical Port Counter Table + :widths: 2 3 1 + + * - Counter + - Description + - Type + + * - `rx_packets_phy` + - The number of packets received on the physical port. This counter doesn’t + include packets that were discarded due to FCS, frame size and similar + errors. + - Informative + + * - `tx_packets_phy` + - The number of packets transmitted on the physical port. + - Informative + + * - `rx_bytes_phy` + - The number of bytes received on the physical port, including Ethernet + header and FCS. + - Informative + + * - `tx_bytes_phy` + - The number of bytes transmitted on the physical port. 
+ - Informative + + * - `rx_multicast_phy` + - The number of multicast packets received on the physical port. + - Informative + + * - `tx_multicast_phy` + - The number of multicast packets transmitted on the physical port. + - Informative + + * - `rx_broadcast_phy` + - The number of broadcast packets received on the physical port. + - Informative + + * - `tx_broadcast_phy` + - The number of broadcast packets transmitted on the physical port. + - Informative + + * - `rx_crc_errors_phy` + - The number of dropped received packets due to FCS (Frame Check Sequence) + error on the physical port. If this counter is increased in high rate, + check the link quality using `rx_symbol_error_phy` and + `rx_corrected_bits_phy` counters below. + - Error + + * - `rx_in_range_len_errors_phy` + - The number of received packets dropped due to length/type errors on a + physical port. + - Error + + * - `rx_out_of_range_len_phy` + - The number of received packets dropped due to length greater than allowed + on a physical port. If this counter is increasing, it implies that the + peer connected to the adapter has a larger MTU configured. Using same MTU + configuration shall resolve this issue. + - Error + + * - `rx_oversize_pkts_phy` + - The number of dropped received packets due to length which exceed MTU + size on a physical port. If this counter is increasing, it implies that + the peer connected to the adapter has a larger MTU configured. Using same + MTU configuration shall resolve this issue. + - Error + + * - `rx_symbol_err_phy` + - The number of received packets dropped due to physical coding errors + (symbol errors) on a physical port. + - Error + + * - `rx_mac_control_phy` + - The number of MAC control packets received on the physical port. + - Informative + + * - `tx_mac_control_phy` + - The number of MAC control packets transmitted on the physical port. + - Informative + + * - `rx_pause_ctrl_phy` + - The number of link layer pause packets received on a physical port. 
If + this counter is increasing, it implies that the network is congested and + cannot absorb the traffic coming from the adapter. + - Informative + + * - `tx_pause_ctrl_phy` + - The number of link layer pause packets transmitted on a physical port. If + this counter is increasing, it implies that the NIC is congested and + cannot absorb the traffic coming from the network. + - Informative + + * - `rx_unsupported_op_phy` + - The number of MAC control packets received with unsupported opcode on a + physical port. + - Error + + * - `rx_discards_phy` + - The number of received packets dropped due to lack of buffers on a + physical port. If this counter is increasing, it implies that the adapter + is congested and cannot absorb the traffic coming from the network. + - Error + + * - `tx_discards_phy` + - The number of packets which were discarded on transmission, even though no + errors were detected. The drop might occur due to link in down state, + head of line drop, pause from the network, etc. + - Error + + * - `tx_errors_phy` + - The number of transmitted packets dropped due to a length which exceeds + MTU size on a physical port. + - Error + + * - `rx_undersize_pkts_phy` + - The number of received packets dropped due to length which is shorter + than 64 bytes on a physical port. If this counter is increasing, it + implies that the peer connected to the adapter has a non-standard MTU + configured or a malformed packet has arrived. + - Error + + * - `rx_fragments_phy` + - The number of received packets dropped due to a length which is shorter + than 64 bytes and has FCS error on a physical port. If this counter is + increasing, it implies that the peer connected to the adapter has a + non-standard MTU configured. + - Error + + * - `rx_jabbers_phy` + - The number of received packets dropped due to a length which is longer than 64 + bytes and had FCS error on a physical port.
+ - Error + + * - `rx_64_bytes_phy` + - The number of packets received on the physical port with size of 64 bytes. + - Informative + + * - `rx_65_to_127_bytes_phy` + - The number of packets received on the physical port with size of 65 to + 127 bytes. + - Informative + + * - `rx_128_to_255_bytes_phy` + - The number of packets received on the physical port with size of 128 to + 255 bytes. + - Informative + + * - `rx_256_to_511_bytes_phy` + - The number of packets received on the physical port with size of 256 to + 511 bytes. + - Informative + + * - `rx_512_to_1023_bytes_phy` + - The number of packets received on the physical port with size of 512 to + 1023 bytes. + - Informative + + * - `rx_1024_to_1518_bytes_phy` + - The number of packets received on the physical port with size of 1024 to + 1518 bytes. + - Informative + + * - `rx_1519_to_2047_bytes_phy` + - The number of packets received on the physical port with size of 1519 to + 2047 bytes. + - Informative + + * - `rx_2048_to_4095_bytes_phy` + - The number of packets received on the physical port with size of 2048 to + 4095 bytes. + - Informative + + * - `rx_4096_to_8191_bytes_phy` + - The number of packets received on the physical port with size of 4096 to + 8191 bytes. + - Informative + + * - `rx_8192_to_10239_bytes_phy` + - The number of packets received on the physical port with size of 8192 to + 10239 bytes. + - Informative + + * - `link_down_events_phy` + - The number of times where the link operative state changed to down. In + case this counter is increasing it may imply port flapping. You may + need to replace the cable/transceiver. + - Error + + * - `rx_out_of_buffer` + - Number of times receive queue had no software buffers allocated for the + adapter's incoming traffic. + - Error + + * - `module_bus_stuck` + - The number of times that module's I\ :sup:`2`\C bus (data or clock) + short-wire was detected. You may need to replace the cable/transceiver.
+ - Error + + * - `module_high_temp` + - The number of times that the module temperature was too high. If this + issue persist, you may need to check the ambient temperature or replace + the cable/transceiver module. + - Error + + * - `module_bad_shorted` + - The number of times that the module cables were shorted. You may need to + replace the cable/transceiver module. + - Error + + * - `module_unplug` + - The number of times that module was ejected. + - Informative + + * - `rx_buffer_passed_thres_phy` + - The number of events where the port receive buffer was over 85% full. + - Informative + + * - `tx_pause_storm_warning_events` + - The number of times the device was sending pauses for a long period of + time. + - Informative + + * - `tx_pause_storm_error_events` + - The number of times the device was sending pauses for a long period of + time, reaching time out and disabling transmission of pause frames. on + the period where pause frames were disabled, drop could have been + occurred. + - Error + + * - `rx[i]_buff_alloc_err` + - Failed to allocate a buffer to received packet (or SKB) on ring i. + - Error + + * - `rx_bits_phy` + - This counter provides information on the total amount of traffic that + could have been received and can be used as a guideline to measure the + ratio of errored traffic in `rx_pcs_symbol_err_phy` and + `rx_corrected_bits_phy`. + - Informative + + * - `rx_pcs_symbol_err_phy` + - This counter counts the number of symbol errors that wasn’t corrected by + FEC correction algorithm or that FEC algorithm was not active on this + interface. If this counter is increasing, it implies that the link + between the NIC and the network is suffering from high BER, and that + traffic is lost. You may need to replace the cable/transceiver. The error + rate is the number of `rx_pcs_symbol_err_phy` divided by the number of + `rx_bits_phy` on a specific time frame. 
+ - Error + + * - `rx_corrected_bits_phy` + - The number of corrected bits on this port according to active FEC + (RS/FC). If this counter is increasing, it implies that the link between + the NIC and the network is suffering from high BER. The corrected bit + rate is the number of `rx_corrected_bits_phy` divided by the number of + `rx_bits_phy` on a specific time frame. + - Error + + * - `rx_err_lane_[l]_phy` + - This counter counts the number of physical raw errors per lane l index. + The counter counts errors before FEC corrections. If this counter is + increasing, it implies that the link between the NIC and the network is + suffering from high BER, and that traffic might be lost. You may need to + replace the cable/transceiver. Please check in accordance with + `rx_corrected_bits_phy`. + - Error + + * - `rx_global_pause` + - The number of pause packets received on the physical port. If this + counter is increasing, it implies that the network is congested and + cannot absorb the traffic coming from the adapter. Note: This counter is + only enabled when global pause mode is enabled. + - Informative + + * - `rx_global_pause_duration` + - The duration of pause received (in microSec) on the physical port. The + counter represents the time the port did not send any traffic. If this + counter is increasing, it implies that the network is congested and + cannot absorb the traffic coming from the adapter. Note: This counter is + only enabled when global pause mode is enabled. + - Informative + + * - `tx_global_pause` + - The number of pause packets transmitted on a physical port. If this + counter is increasing, it implies that the adapter is congested and + cannot absorb the traffic coming from the network. Note: This counter is + only enabled when global pause mode is enabled. + - Informative + + * - `tx_global_pause_duration` + - The duration of pause transmitter (in microSec) on the physical port. 
+ Note: This counter is only enabled when global pause mode is enabled. + - Informative + + * - `rx_global_pause_transition` + - The number of times a transition from Xoff to Xon on the physical port + has occurred. Note: This counter is only enabled when global pause mode + is enabled. + - Informative + + * - `rx_if_down_packets` + - The number of received packets that were dropped due to interface down. + - Informative + +Priority Port Counters +---------------------- +The following counters are physical port counters that are counted per L2 +priority (0-7). + +**Note:** `p` in the counter name represents the priority. + +.. flat-table:: Priority Port Counter Table + :widths: 2 3 1 + + * - Counter + - Description + - Type + + * - `rx_prio[p]_bytes` + - The number of bytes received with priority p on the physical port. + - Informative + + * - `rx_prio[p]_packets` + - The number of packets received with priority p on the physical port. + - Informative + + * - `tx_prio[p]_bytes` + - The number of bytes transmitted on priority p on the physical port. + - Informative + + * - `tx_prio[p]_packets` + - The number of packets transmitted on priority p on the physical port. + - Informative + + * - `rx_prio[p]_pause` + - The number of pause packets received with priority p on a physical port. + If this counter is increasing, it implies that the network is congested + and cannot absorb the traffic coming from the adapter. Note: This counter + is available only if PFC was enabled on priority p. + - Informative + + * - `rx_prio[p]_pause_duration` + - The duration of pause received (in microSec) on priority p on the + physical port. The counter represents the time the port did not send any + traffic on this priority. If this counter is increasing, it implies that + the network is congested and cannot absorb the traffic coming from the + adapter. Note: This counter is available only if PFC was enabled on + priority p. 
+ - Informative + + * - `rx_prio[p]_pause_transition` + - The number of times a transition from Xoff to Xon on priority p on the + physical port has occurred. Note: This counter is available only if PFC + was enabled on priority p. + - Informative + + * - `tx_prio[p]_pause` + - The number of pause packets transmitted on priority p on a physical port. + If this counter is increasing, it implies that the adapter is congested + and cannot absorb the traffic coming from the network. Note: This counter + is available only if PFC was enabled on priority p. + - Informative + + * - `tx_prio[p]_pause_duration` + - The duration of pause transmitter (in microSec) on priority p on the + physical port. Note: This counter is available only if PFC was enabled on + priority p. + - Informative + + * - `rx_prio[p]_buf_discard` + - The number of packets discarded by device due to lack of per host receive + buffers. + - Informative + + * - `rx_prio[p]_cong_discard` + - The number of packets discarded by device due to per host congestion. + - Informative + + * - `rx_prio[p]_marked` + - The number of packets ecn marked by device due to per host congestion. + - Informative + + * - `rx_prio[p]_discards` + - The number of packets discarded by device due to lack of receive buffers. + - Informative + +Device Counters +--------------- +.. flat-table:: Device Counter Table + :widths: 2 3 1 + + * - Counter + - Description + - Type + + * - `rx_pci_signal_integrity` + - Counts physical layer PCIe signal integrity errors, the number of + transitions to recovery due to Framing errors and CRC (dlp and tlp). If + this counter is raising, try moving the adapter card to a different slot + to rule out a bad PCI slot. Validate that you are running with the latest + firmware available and latest server BIOS version. 
+ - Error + + * - `tx_pci_signal_integrity` + - Counts physical layer PCIe signal integrity errors, the number of + transition to recovery initiated by the other side (moving to recovery + due to getting TS/EIEOS). If this counter is raising, try moving the + adapter card to a different slot to rule out a bad PCI slot. Validate + that you are running with the latest firmware available and latest server + BIOS version. + - Error + + * - `outbound_pci_buffer_overflow` + - The number of packets dropped due to pci buffer overflow. If this counter + is raising in high rate, it might indicate that the receive traffic rate + for a host is larger than the PCIe bus and therefore a congestion occurs. + - Informative + + * - `outbound_pci_stalled_rd` + - The percentage (in the range 0...100) of time within the last second that + the NIC had outbound non-posted reads requests but could not perform the + operation due to insufficient posted credits. + - Informative + + * - `outbound_pci_stalled_wr` + - The percentage (in the range 0...100) of time within the last second that + the NIC had outbound posted writes requests but could not perform the + operation due to insufficient posted credits. + - Informative + + * - `outbound_pci_stalled_rd_events` + - The number of seconds where `outbound_pci_stalled_rd` was above 30%. + - Informative + + * - `outbound_pci_stalled_wr_events` + - The number of seconds where `outbound_pci_stalled_wr` was above 30%. + - Informative + + * - `dev_out_of_buffer` + - The number of times the device owned queue had not enough buffers + allocated. + - Error diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst new file mode 100644 index 000000000000..9b5c40ba7f0d --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst @@ -0,0 +1,224 @@ +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +.. 
include:: <isonum.txt> + +======= +Devlink +======= + +:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + +Contents +======== + +- `Info`_ +- `Parameters`_ +- `Health reporters`_ + +Info +==== + +The devlink info reports the running and stored firmware versions on the device. +It also prints the device PSID which represents the HCA board type ID. + +User command example:: + + $ devlink dev info pci/0000:00:06.0 + pci/0000:00:06.0: + driver mlx5_core + versions: + fixed: + fw.psid MT_0000000009 + running: + fw.version 16.26.0100 + stored: + fw.version 16.26.0100 + +Parameters +========== + +flow_steering_mode: Device flow steering mode +--------------------------------------------- +The flow steering mode parameter controls the flow steering mode of the driver. +Two modes are supported: +1. 'dmfs' - Device managed flow steering. +2. 'smfs' - Software/Driver managed flow steering. + +In DMFS mode, the HW steering entities are created and managed through the +Firmware. +In SMFS mode, the HW steering entities are created and managed by +the driver directly in hardware without firmware intervention. + +SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode. + +User command examples: + +- Set SMFS flow steering mode:: + + $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime + +- Read device flow steering mode:: + + $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode + pci/0000:06:00.0: + name flow_steering_mode type driver-specific + values: + cmode runtime value smfs + +enable_roce: RoCE enablement state +---------------------------------- +If the device supports RoCE disablement, RoCE enablement state controls device +support for RoCE capability. Otherwise, the control occurs in the driver stack. +When RoCE is disabled at the driver level, only raw ethernet QPs are supported.
+ +To change RoCE enablement state, a user must change the driverinit cmode value +and run devlink reload. + +User command examples: + +- Disable RoCE:: + + $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit + $ devlink dev reload pci/0000:06:00.0 + +- Read RoCE enablement state:: + + $ devlink dev param show pci/0000:06:00.0 name enable_roce + pci/0000:06:00.0: + name enable_roce type generic + values: + cmode driverinit value true + +esw_port_metadata: Eswitch port metadata state +---------------------------------------------- +When applicable, disabling eswitch metadata can increase packet rate +up to 20% depending on the use case and packet sizes. + +Eswitch port metadata state controls whether to internally tag packets with +metadata. Metadata tagging must be enabled for multi-port RoCE, failover +between representors and stacked devices. +By default metadata is enabled on the supported devices in E-switch. +Metadata is applicable only for E-switch in switchdev mode and +users may disable it when NONE of the below use cases will be in use: +1. HCA is in Dual/multi-port RoCE mode. +2. VF/SF representor bonding (Usually used for Live migration) +3. Stacked devices + +When metadata is disabled, the above use cases will fail to initialize if +users try to enable them. 
+ +- Show eswitch port metadata:: + + $ devlink dev param show pci/0000:06:00.0 name esw_port_metadata + pci/0000:06:00.0: + name esw_port_metadata type driver-specific + values: + cmode runtime value true + +- Disable eswitch port metadata:: + + $ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime + +- Change eswitch mode to switchdev mode after choosing the metadata value:: + + $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev + +Health reporters +================ + +tx reporter +----------- +The tx reporter is responsible for reporting and recovering from the following two error scenarios: + +- tx timeout + Report on kernel tx timeout detection. + Recover by searching for lost interrupts. +- tx error completion + Report on error tx completion. + Recover by flushing the tx queue and resetting it. + +The tx reporter also supports an on-demand diagnose callback, which provides +real-time information on the status of its send queues. + +User commands examples: + +- Diagnose send queues status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter tx + +NOTE: This command has valid output only when the interface is up; otherwise the command has empty output. + +- Show number of tx errors indicated, number of recover flows ended successfully, + is autorecover enabled and graceful period from last recover:: + + $ devlink health show pci/0000:82:00.0 reporter tx + +rx reporter +----------- +The rx reporter is responsible for reporting and recovering from the following two error scenarios: + +- rx queues' initialization (population) timeout + Population of rx queues' descriptors on ring initialization is done + in napi context via triggering an irq. In case of a failure to get + the minimum amount of descriptors, a timeout would occur, and + descriptors could be recovered by polling the EQ (Event Queue). +- rx completions with errors (reported by HW on interrupt context) + Report on rx completion error.
+ Recover (if needed) by flushing the related queue and resetting it. + +rx reporter also supports on demand diagnose callback, on which it +provides real time information of its receive queues' status. + +- Diagnose rx queues' status and corresponding completion queue:: + + $ devlink health diagnose pci/0000:82:00.0 reporter rx + +NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output. + +- Show number of rx errors indicated, number of recover flows ended successfully, + is autorecover enabled, and grace period from last recover:: + + $ devlink health show pci/0000:82:00.0 reporter rx + +fw reporter +----------- +The fw reporter implements `diagnose` and `dump` callbacks. +It follows symptoms of fw error such as fw syndrome by triggering +fw core dump and storing it into the dump buffer. +The fw reporter diagnose command can be triggered any time by the user to check +current fw status. + +User commands examples: + +- Check fw health status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter fw + +- Read FW core dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.0 reporter fw + +NOTE: This command can run only on the PF which has fw tracer ownership, +running it on other PF or any VF will return "Operation not permitted". + +fw fatal reporter +----------------- +The fw fatal reporter implements `dump` and `recover` callbacks. +It follows fatal error indications by CR-space dump and recover flow. +The CR-space dump uses vsc interface which is valid even if the FW command +interface is not functional, which is the case in most FW fatal errors. +The recover function runs recover flow which reloads the driver and triggers fw +reset if needed. +On firmware error, the health buffer is dumped into the dmesg. The log +level is derived from the error's severity (given in health buffer). 
+ +User commands examples: + +- Run fw recover flow manually:: + + $ devlink health recover pci/0000:82:00.0 reporter fw_fatal + +- Read FW CR-space dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal + +NOTE: This command can run only on PF. diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst new file mode 100644 index 000000000000..3fdcd6b61ccf --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst @@ -0,0 +1,26 @@ +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +.. include:: <isonum.txt> + +Mellanox ConnectX(R) mlx5 core VPI Network Driver +================================================= + +:Copyright: |copy| 2019, Mellanox Technologies LTD. +:Copyright: |copy| 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + +Contents: + +.. toctree:: + :maxdepth: 2 + + kconfig + devlink + switchdev + tracepoints + counters + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst new file mode 100644 index 000000000000..43b1f7e87ec4 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst @@ -0,0 +1,168 @@ +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +.. include:: <isonum.txt> + +======================================= +Enabling the driver and kconfig options +======================================= + +:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + +| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out) +| at build time via kernel Kconfig flags. 
+| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags +| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y. +| For the list of advanced features, please see below. + +**CONFIG_MLX5_BRIDGE=(y/n)** + +| Enable :ref:`Ethernet Bridging (BRIDGE) offloading support <mlx5_bridge_offload>`. +| This will provide the ability to add representors of mlx5 uplink and VF +| ports to Bridge and offloading rules for traffic between such ports. +| Supports VLANs (trunk and access modes). + + +**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko) + +| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config. +| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib). + + +**CONFIG_MLX5_CORE_EN=(y/n)** + +| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads. +| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be +| built-in into mlx5_core.ko. + + +**CONFIG_MLX5_CORE_EN_DCB=(y/n)**: + +| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_. + + +**CONFIG_MLX5_CORE_IPOIB=(y/n)** + +| IPoIB offloads & acceleration support. +| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma +| IPoIB ulp netdevice. + + +**CONFIG_MLX5_CLS_ACT=(y/n)** + +| Enables offload support for TC classifier action (NET_CLS_ACT). +| Works in both native NIC mode and Switchdev SRIOV mode. +| Flow-based classifiers, such as those registered through +| `tc-flower(8)`, are processed by the device, rather than the +| host. Actions that would then overwrite matching classification +| results would then be instant due to the offload. + + +**CONFIG_MLX5_EN_ARFS=(y/n)** + +| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering. 
+| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4 + + +**CONFIG_MLX5_EN_IPSEC=(y/n)** + +| Enables `IPSec XFRM cryptography-offload acceleration <https://support.mellanox.com/s/article/ConnectX-6DX-Bluefield-2-IPsec-HW-Full-Offload-Configuration-Guide>`_. + + +**CONFIG_MLX5_EN_MACSEC=(y/n)** + +| Build support for MACsec cryptography-offload acceleration in the NIC. + + +**CONFIG_MLX5_EN_RXNFC=(y/n)** + +| Enables ethtool receive network flow classification, which allows user defined +| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API. + + +**CONFIG_MLX5_EN_TLS=(y/n)** + +| TLS cryptography-offload acceleration. + + +**CONFIG_MLX5_ESWITCH=(y/n)** + +| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering +| and switching for the enabled VFs and PF in two available modes: +| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_. +| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_. + + +**CONFIG_MLX5_FPGA=(y/n)** + +| Build support for the Innova family of network cards by Mellanox Technologies. +| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board. +| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow +| building sandbox-specific client drivers. + + +**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko) + +| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support. + + +**CONFIG_MLX5_MPFS=(y/n)** + +| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC. 
+| MPFS is required when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing +| user configured unicast MAC addresses to the requesting PF. + + +**CONFIG_MLX5_SF=(y/n)** + +| Build support for subfunction. +| Subfunctions are more lightweight than PCI SRIOV VFs. Choosing this option +| will enable support for creating subfunction devices. + + +**CONFIG_MLX5_SF_MANAGER=(y/n)** + +| Build support for subfunction port in the NIC. A Mellanox subfunction +| port is managed through devlink. A subfunction supports RDMA, netdevice +| and vdpa device. It is similar to a SRIOV VF but it doesn't require +| SRIOV support. + + +**CONFIG_MLX5_SW_STEERING=(y/n)** + +| Build support for software-managed steering in the NIC. + + +**CONFIG_MLX5_TC_CT=(y/n)** + +| Support offloading connection tracking rules via tc ct action. + + +**CONFIG_MLX5_TC_SAMPLE=(y/n)** + +| Support offloading sample rules via tc sample action. + + +**CONFIG_MLX5_VDPA=(y/n)** + +| Support library for Mellanox VDPA drivers. Provides code that is +| common for all types of VDPA drivers. The following drivers are planned: +| net, block. + + +**CONFIG_MLX5_VDPA_NET=(y/n)** + +| VDPA network driver for ConnectX6 and newer. Provides offloading +| of virtio net datapath such that descriptors put on the ring will +| be executed by the hardware. It also supports a variety of stateless +| offloads depending on the actual device used and firmware version. + + +**CONFIG_MLX5_VFIO_PCI=(y/n)** + +| This provides migration support for MLX5 devices using the VFIO framework. + + +**External options** ( Choose if the corresponding mlx5 feature is required ) + +- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool). +- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled. +- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled. 
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst new file mode 100644 index 000000000000..01deedb71597 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst @@ -0,0 +1,239 @@ +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +.. include:: <isonum.txt> + +========= +Switchdev +========= + +:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + +.. _mlx5_bridge_offload: + +Bridge offload +============== + +The mlx5 driver implements support for offloading bridge rules when in switchdev +mode. Linux bridge FDBs are automatically offloaded when mlx5 switchdev +representor is attached to bridge. + +- Change device to switchdev mode:: + + $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev + +- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1':: + + $ ip link set enp8s0f0 master bridge1 + +VLANs +----- + +Following bridge VLAN functions are supported by mlx5: + +- VLAN filtering (including multiple VLANs per port):: + + $ ip link set bridge1 type bridge vlan_filtering 1 + $ bridge vlan add dev enp8s0f0 vid 2-3 + +- VLAN push on bridge ingress:: + + $ bridge vlan add dev enp8s0f0 vid 3 pvid + +- VLAN pop on bridge egress:: + + $ bridge vlan add dev enp8s0f0 vid 3 untagged + +Subfunction +=========== + +mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface. + +A subfunction has its own function capabilities and its own resources. This +means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These +queues are neither shared nor stolen from the parent PCI function. + +When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA +resources neither shared nor stolen from the parent PCI function. 
+ +A subfunction has a dedicated window in PCI BAR space that is not shared +with the other subfunctions or the parent PCI function. This ensures that all +devices (netdev, rdma, vdpa, etc.) of the subfunction accesses only assigned +PCI BAR space. + +A subfunction supports eswitch representation through which it supports tc +offloads. The user configures eswitch to send/receive packets from/to +the subfunction port. + +Subfunctions share PCI level resources such as PCI MSI-X IRQs with +other subfunctions and/or with its parent PCI function. + +Example mlx5 software, system, and device view:: + + _______ + | admin | + | user |---------- + |_______| | + | | + ____|____ __|______ _________________ + | | | | | | + | devlink | | tc tool | | user | + | tool | |_________| | applications | + |_________| | |_________________| + | | | | + | | | | Userspace + +---------|-------------|-------------------|----------|--------------------+ + | | +----------+ +----------+ Kernel + | | | netdev | | rdma dev | + | | +----------+ +----------+ + (devlink port add/del | ^ ^ + port function set) | | | + | | +---------------| + _____|___ | | _______|_______ + | | | | | mlx5 class | + | devlink | +------------+ | | drivers | + | kernel | | rep netdev | | |(mlx5_core,ib) | + |_________| +------------+ | |_______________| + | | | ^ + (devlink ops) | | (probe/remove) + _________|________ | | ____|________ + | subfunction | | +---------------+ | subfunction | + | management driver|----- | subfunction |---| driver | + | (mlx5_core) | | auxiliary dev | | (mlx5_core) | + |__________________| +---------------+ |_____________| + | ^ + (sf add/del, vhca events) | + | (device add/del) + _____|____ ____|________ + | | | subfunction | + | PCI NIC |--- activate/deactivate events--->| host driver | + |__________| | (mlx5_core) | + |_____________| + +Subfunction is created using devlink port interface. 
+ +- Change device to switchdev mode:: + + $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev + +- Add a devlink port of subfunction flavour:: + + $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 + pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false + function: + hw_addr 00:00:00:00:00:00 state inactive opstate detached + +- Show a devlink port of the subfunction:: + + $ devlink port show pci/0000:06:00.0/32768 + pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 + function: + hw_addr 00:00:00:00:00:00 state inactive opstate detached + +- Delete a devlink port of subfunction after use:: + + $ devlink port del pci/0000:06:00.0/32768 + +Function attributes +=================== + +The mlx5 driver provides a mechanism to set up PCI VF/SF function attributes in +a unified way for SmartNIC and non-SmartNIC. + +This is supported only when the eswitch mode is set to switchdev. Port function +configuration of the PCI VF/SF is supported through devlink eswitch port. + +Port function attributes should be set before PCI VF/SF is enumerated by the +driver. + +MAC address setup +----------------- + +mlx5 driver supports devlink port function attr mechanism to set up MAC +address. (refer to Documentation/networking/devlink/devlink-port.rst) + +RoCE capability setup +~~~~~~~~~~~~~~~~~~~~~ +Not all mlx5 PCI devices/SFs require RoCE capability. + +When RoCE capability is disabled, it saves 1 Mbyte worth of system memory per +PCI device/SF. + +mlx5 driver supports devlink port function attr mechanism to set up RoCE +capability. (refer to Documentation/networking/devlink/devlink-port.rst) + +migratable capability setup +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Users who want mlx5 PCI VFs to be able to perform live migration need to +explicitly enable the VF migratable capability. + +mlx5 driver supports devlink port function attr mechanism to set up migratable +capability. 
(refer to Documentation/networking/devlink/devlink-port.rst) + +SF state setup +-------------- + +To use the SF, the user must activate the SF using the SF function state +attribute. + +- Get the state of the SF identified by its unique devlink port index:: + + $ devlink port show ens2f0npf0sf88 + pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false + function: + hw_addr 00:00:00:00:88:88 state inactive opstate detached + +- Activate the function and verify its state is active:: + + $ devlink port function set ens2f0npf0sf88 state active + + $ devlink port show ens2f0npf0sf88 + pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false + function: + hw_addr 00:00:00:00:88:88 state active opstate detached + +Upon function activation, the PF driver instance gets the event from the device +that a particular SF was activated. It's the cue to put the device on bus, probe +it and instantiate the devlink instance and class specific auxiliary devices +for it. 
+ +- Show the auxiliary device and port of the subfunction:: + + $ devlink dev show + devlink dev show auxiliary/mlx5_core.sf.4 + + $ devlink port show auxiliary/mlx5_core.sf.4/1 + auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false + + $ rdma link show mlx5_0/1 + link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88 + + $ rdma dev show + 8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112 + 13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112 + +- Subfunction auxiliary device and class device hierarchy:: + + mlx5_core.sf.4 + (subfunction auxiliary device) + /\ + / \ + / \ + / \ + / \ + mlx5_core.eth.4 mlx5_core.rdma.4 + (sf eth aux dev) (sf rdma aux dev) + | | + | | + p0sf88 mlx5_0 + (sf netdev) (sf rdma device) + +Additionally, the SF port also gets the event when the driver attaches to the +auxiliary device of the subfunction. This results in changing the operational +state of the function. This provides visibility to the user to decide when is it +safe to delete the SF port for graceful termination of the subfunction. + +- Show the SF port operational state:: + + $ devlink port show ens2f0npf0sf88 + pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false + function: + hw_addr 00:00:00:00:88:88 state active opstate attached diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst new file mode 100644 index 000000000000..a9d3e123adc4 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst @@ -0,0 +1,229 @@ +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +.. 
include:: <isonum.txt> + +=========== +Tracepoints +=========== + +:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + +mlx5 driver provides internal tracepoints for tracking and debugging using +kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst). + +For the list of supported mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`. + +tc and eswitch offloads tracepoints: + +- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5:: + + $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT + +- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5:: + + $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL + +- mlx5e_stats_flower: trace flower stats request:: + + $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217 + +- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5:: + + $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1 + +- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events:: + + $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... 
+ kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1 + +Bridge offloads tracepoints: + +- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5:: + + $ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event + $ cat /sys/kernel/debug/tracing/trace + ... + kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0 + +- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5:: + + $ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event + $ cat /sys/kernel/debug/tracing/trace + ... + ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16 + +- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in + mlx5:: + + $ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event + $ cat /sys/kernel/debug/tracing/trace + ... + kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0 + +- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5 + representor:: + + $ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event + $ cat /sys/kernel/debug/tracing/trace + ... + ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6 + +- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5 + representor:: + + $ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event + $ cat /sys/kernel/debug/tracing/trace + ... + bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8 + +- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper + device:: + + $ echo mlx5:mlx5_esw_bridge_vport_init >> set_event + $ cat /sys/kernel/debug/tracing/trace + ... 
+ ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1 + +- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper + device:: + + $ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event + $ cat /sys/kernel/debug/tracing/trace + ... + ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1 + +Eswitch QoS tracepoints: + +- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport:: + + $ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + <...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3 + +- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport:: + + $ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + <...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3 + +- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport:: + + $ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + <...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3 + +- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group:: + + $ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + <...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 + +- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group:: + + $ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... 
+ <...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000 + +- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group:: + + $ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + <...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1 + +SF tracepoints: + +- mlx5_sf_add: trace addition of the SF port:: + + $ echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88 + +- mlx5_sf_free: trace freeing of the SF port:: + + $ echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 + +- mlx5_sf_activate: trace activation of the SF port:: + + $ echo mlx5:mlx5_sf_activate >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + devlink-29841 [008] ..... 3669.635095: mlx5_sf_activate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 + +- mlx5_sf_deactivate: trace deactivation of the SF port:: + + $ echo mlx5:mlx5_sf_deactivate >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + devlink-29994 [008] ..... 4015.969467: mlx5_sf_deactivate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 + +- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context:: + + $ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + devlink-9775 [031] ..... 
26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88 + +- mlx5_sf_hwc_free: trace freeing of the hardware SF context:: + + $ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000 + +- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context:: + + $ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000 + +- mlx5_sf_update_state: trace state updates for SF contexts:: + + $ echo mlx5:mlx5_sf_update_state >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + kworker/u20:3-29490 [009] ..... 4141.453530: mlx5_sf_update_state: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 state=2 + +- mlx5_sf_vhca_event: trace SF vhca event and state:: + + $ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1 + +- mlx5_sf_dev_add: trace SF device add event:: + + $ echo mlx5:mlx5_sf_dev_add>> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88 + +- mlx5_sf_dev_del: trace SF device delete event:: + + $ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace + ... + kworker/u128:3-9093 [044] ..... 
24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88 diff --git a/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst index 0eabbc347d6c..6ec7d686efab 100644 --- a/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst +++ b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst @@ -83,7 +83,7 @@ Configuring the Driver MTU --- -Jumbo frame support is available with a maximim size of 9194 bytes. +Jumbo frame support is available with a maximum size of 9194 bytes. Interrupt coalescing -------------------- diff --git a/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst index f24adfab6a1b..25fd9aa284e2 100644 --- a/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst +++ b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst @@ -124,7 +124,7 @@ Multicast flooding ================== CPU port mcast_flooding is always on -Turning flooding on/off on swithch ports: +Turning flooding on/off on switch ports: bridge link set dev sw0p1 mcast_flood on/off Access and Trunk port diff --git a/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst index 1241ecac73bd..464dce938ed1 100644 --- a/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst +++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst @@ -174,7 +174,7 @@ Multicast flooding ================== CPU port mcast_flooding is always on -Turning flooding on/off on swithch ports: +Turning flooding on/off on switch ports: bridge link set dev sw0p1 mcast_flood on/off Access and Trunk port diff --git a/Documentation/networking/device_drivers/wwan/iosm.rst 
b/Documentation/networking/device_drivers/wwan/iosm.rst index aceb0223eb46..6f9e955af984 100644 --- a/Documentation/networking/device_drivers/wwan/iosm.rst +++ b/Documentation/networking/device_drivers/wwan/iosm.rst @@ -69,7 +69,7 @@ wwan0-X network device The IOSM driver exposes IP link interface "wwan0-X" of type "wwan" for IP traffic. Iproute network utility is used for creating "wwan0-X" network interface and for associating it with MBIM IP session. The Driver supports -upto 8 IP sessions for simultaneous IP communication. +up to 8 IP sessions for simultaneous IP communication. The userspace management application is responsible for creating new IP link prior to establishing MBIM IP session where the SessionId is greater than 0. diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst index e37f77734b5b..e0b8cfed610a 100644 --- a/Documentation/networking/devlink/devlink-health.rst +++ b/Documentation/networking/devlink/devlink-health.rst @@ -33,7 +33,7 @@ Device driver can provide specific callbacks for each "health reporter", e.g.: * Recovery procedures * Diagnostics procedures * Object dump procedures - * OOB initial parameters + * Out Of Box initial parameters Different parts of the driver can register different types of health reporters with different handlers. @@ -46,12 +46,31 @@ Once an error is reported, devlink health will perform the following actions: * A log is being send to the kernel trace events buffer * Health status and statistics are being updated for the reporter instance * Object dump is being taken and saved at the reporter instance (as long as - there is no other dump which is already stored) + auto-dump is set and there is no other dump which is already stored) * Auto recovery attempt is being done. Depends on: - Auto-recovery configuration - Grace period vs. 
time passed since last recover +Devlink formatted message +========================= + +To handle devlink health diagnose and health dump requests, devlink creates a +formatted message structure ``devlink_fmsg`` and sends it to the driver's callback +to fill the data in using the devlink fmsg API. + +Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in +json-like format. The API allows the driver to add nested attributes such as +object, object pair and value array, in addition to attributes such as name and +value. + +Driver should use this API to fill the fmsg context in a format which will be +translated by the devlink to the netlink message later. When it needs to send +the data using SKBs to the netlink layer, it fragments the data between +different SKBs. In order to do this fragmentation, it uses virtual nest +attributes, to avoid actual nesting use which cannot be divided between +different SKBs. + User Interface ============== diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst index 625efb3777d5..10f282c2117c 100644 --- a/Documentation/networking/devlink/ice.rst +++ b/Documentation/networking/devlink/ice.rst @@ -285,7 +285,7 @@ features are enabled after the hierarchy is exported, but before any changes are made. This feature is also dependent on switchdev being enabled in the system. -It's required bacause devlink-rate requires devlink-port objects to be +It's required because devlink-rate requires devlink-port objects to be present, and those objects are only created in switchdev mode. If the driver is set to the switchdev mode, it will export internal @@ -320,7 +320,7 @@ nodes and nodes with children also can't be deleted. * - ``tx_weight`` - allows for usage of Weighted Fair Queuing arbitration scheme among siblings. This arbitration scheme can be used simultaneously with - the strict priority. Range 1-200. Only relative values mater for + the strict priority. Range 1-200. 
Only relative values matter for arbitration. ``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index fee4d3968309..b49749e2b9a6 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -66,3 +66,4 @@ parameters, info versions, and other features it supports. prestera iosm octeontx2 + sfc diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 29ad304e6fba..3321117cf605 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -54,6 +54,24 @@ parameters. - Control the number of large groups (size > 1) in the FDB table. * The default value is 15, and the range is between 1 and 1024. + * - ``esw_multiport`` + - Boolean + - runtime + - Control MultiPort E-Switch shared fdb mode. + + An experimental mode where a single E-Switch is used and all the vports + and physical ports on the NIC are connected to it. + + An example is to send traffic from a VF that is created on PF0 to an + uplink that is natively associated with the uplink of PF1 + + Note: Future devices, ConnectX-8 and onward, will eventually have this + as the default to allow forwarding between all NIC ports in a single + E-switch environment and the dual E-switch mode will likely get + deprecated. + + Default: disabled + The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst index ec5e6d79b2e2..88482725422c 100644 --- a/Documentation/networking/devlink/netdevsim.rst +++ b/Documentation/networking/devlink/netdevsim.rst @@ -95,5 +95,5 @@ Driver-specific Traps * - ``fid_miss`` - ``exception`` - When a packet enters the device it is classified to a filtering - indentifier (FID) based on the ingress port and VLAN. 
This trap is used + identifier (FID) based on the ingress port and VLAN. This trap is used to trap packets for which a FID could not be found diff --git a/Documentation/networking/devlink/prestera.rst b/Documentation/networking/devlink/prestera.rst index 49409d1d3081..96b1124e614b 100644 --- a/Documentation/networking/devlink/prestera.rst +++ b/Documentation/networking/devlink/prestera.rst @@ -138,4 +138,4 @@ Driver-specific Traps - Drops packets with zero (0) IPV4 source address. * - ``met_red`` - ``drop`` - - Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwith. + - Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth. diff --git a/Documentation/networking/devlink/sfc.rst b/Documentation/networking/devlink/sfc.rst new file mode 100644 index 000000000000..db64a1bd9733 --- /dev/null +++ b/Documentation/networking/devlink/sfc.rst @@ -0,0 +1,57 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +sfc devlink support +=================== + +This document describes the devlink features implemented by the ``sfc`` +device driver for the ef100 device. + +Info versions +============= + +The ``sfc`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``fw.mgmt.suc`` + - running + - For boards where the management function is split between multiple + control units, this is the SUC control unit's firmware version. + * - ``fw.mgmt.cmc`` + - running + - For boards where the management function is split between multiple + control units, this is the CMC control unit's firmware version. + * - ``fpga.rev`` + - running + - FPGA design revision. + * - ``fpga.app`` + - running + - Datapath programmable logic version. + * - ``fw.app`` + - running + - Datapath software/microcode/firmware version. 
+ * - ``coproc.boot`` + - running + - SmartNIC application co-processor (APU) first stage boot loader version. + * - ``coproc.uboot`` + - running + - SmartNIC application co-processor (APU) co-operating system loader version. + * - ``coproc.main`` + - running + - SmartNIC application co-processor (APU) main operating system version. + * - ``coproc.recovery`` + - running + - SmartNIC application co-processor (APU) recovery operating system version. + * - ``fw.exprom`` + - running + - Expansion ROM version. For boards where the expansion ROM is split between + multiple images (e.g. PXE and UEFI), this is specifically the PXE boot + ROM version. + * - ``fw.uefi`` + - running + - UEFI driver version (No UNDI support). diff --git a/Documentation/networking/dsa/configuration.rst b/Documentation/networking/dsa/configuration.rst index 827701f8cbfe..d2934c40f0f1 100644 --- a/Documentation/networking/dsa/configuration.rst +++ b/Documentation/networking/dsa/configuration.rst @@ -5,7 +5,7 @@ DSA switch configuration from userspace ======================================= The DSA switch configuration is not integrated into the main userspace -network configuration suites by now and has to be performed manualy. +network configuration suites by now and has to be performed manually. .. _dsa-config-showcases: diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index f10f8eb44255..e1bc6186d7ea 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -106,7 +106,7 @@ modifying a bitmap, the former changes the bit set in mask to values set in value and preserves the rest; the latter sets the bits set in the bitmap and clears the rest.
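The two bitset request semantics described in the context lines just above (value/mask pair versus a plain list) can be illustrated with a minimal sketch. This is not kernel code: the function name ``apply_bitset`` and the use of Python integers as bitmaps are assumptions made purely for illustration.

```python
from typing import Optional


def apply_bitset(current: int, value: int, mask: Optional[int] = None) -> int:
    """Return the bitmap that results from applying a bitset request.

    Hypothetical helper, for illustration only.
    """
    if mask is None:
        # List (no-mask) form: the bits in `value` are set, all others cleared.
        return value
    # Value/mask form: bits selected by `mask` take their state from `value`,
    # every other bit keeps its current state.
    return (current & ~mask) | (value & mask)
```

With a current bitmap of ``0b1010``, a value/mask request of ``0b0001``/``0b0011`` only touches the two low bits, while a no-mask request replaces the whole bitmap.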
-Compact form: nested (bitset) atrribute contents: +Compact form: nested (bitset) attribute contents: ============================ ====== ============================ ``ETHTOOL_A_BITSET_NOMASK`` flag no mask, only a list @@ -223,6 +223,8 @@ Userspace to kernel: ``ETHTOOL_MSG_PSE_SET`` set PSE parameters ``ETHTOOL_MSG_PSE_GET`` get PSE parameters ``ETHTOOL_MSG_RSS_GET`` get RSS settings + ``ETHTOOL_MSG_MM_GET`` get MAC merge layer state + ``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters ===================================== ================================= Kernel to userspace: @@ -265,6 +267,7 @@ Kernel to userspace: ``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters ``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters ``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings + ``ETHTOOL_MSG_MM_GET_REPLY`` MAC merge layer status ======================================== ================================= ``GET`` requests are sent by userspace applications to retrieve device @@ -780,7 +783,7 @@ Kernel response contents: ``ETHTOOL_A_FEATURES_ACTIVE`` bitset diff old vs. new active ==================================== ====== ========================== -Request constains only one bitset which can be either value/mask pair (request +Request contains only one bitset which can be either value/mask pair (request to change specific feature bits and leave the rest) or only a value (request to set all features to specified set). @@ -871,6 +874,7 @@ Kernel response contents: ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode + ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode ==================================== ====== =========================== ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with @@ -880,8 +884,8 @@ separate buffers. 
The device configuration must make it possible to receive full memory pages of data, for example because MTU is high enough or through HW-GRO. -``ETHTOOL_A_RINGS_TX_PUSH`` flag is used to enable descriptor fast -path to send packets. In ordinary path, driver fills descriptors in DRAM and +``ETHTOOL_A_RINGS_[RX|TX]_PUSH`` flag is used to enable descriptor fast +path to send or receive packets. In ordinary path, driver fills descriptors in DRAM and notifies NIC hardware. In fast path, driver pushes descriptors to the device through MMIO writes, thus reducing the latency. However, enabling this feature may increase the CPU cost. Drivers may enforce additional per-packet @@ -903,6 +907,7 @@ Request contents: ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode + ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode ==================================== ====== =========================== Kernel checks that requested ring sizes do not exceed limits reported by @@ -1004,6 +1009,9 @@ Kernel response contents: ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval ``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx ``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx + ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx + ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx + ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx =========================================== ====== ======================= Attributes are only included in reply if their value is not zero or the @@ -1022,6 +1030,17 @@ each packet event resets the timer. In this mode timer is used to force the interrupt if queue goes idle, while busy queues depend on the packet limit to trigger interrupts. +Tx aggregation consists of copying frames into a contiguous buffer so that they +can be submitted as a single IO operation. 
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` +describes the maximum size in bytes for the submitted buffer. +``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` describes the maximum number of frames +that can be aggregated into a single buffer. +``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` describes the amount of time in usecs, +counted since the first packet arrival in an aggregated block, after which the +block should be sent. +This feature is mainly of interest for specific USB devices which do not cope +well with frequent small-sized URB transmissions. + COALESCE_SET ============ @@ -1055,6 +1074,9 @@ Request contents: ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval ``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx ``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx + ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx + ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx + ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx =========================================== ====== ======================= Request is rejected if it attributes declared as unsupported by driver (i.e. @@ -1072,8 +1094,18 @@ Request contents: ===================================== ====== ========================== ``ETHTOOL_A_PAUSE_HEADER`` nested request header + ``ETHTOOL_A_PAUSE_STATS_SRC`` u32 source of statistics ===================================== ====== ========================== +``ETHTOOL_A_PAUSE_STATS_SRC`` is optional. It takes values from: + +.. kernel-doc:: include/uapi/linux/ethtool.h + :identifiers: ethtool_mac_stats_src + +If absent from the request, stats will be provided with +an ``ETHTOOL_A_PAUSE_STATS_SRC`` attribute in the response equal to +``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
+ Kernel response contents: ===================================== ====== ========================== @@ -1488,6 +1520,7 @@ Request contents: ======================================= ====== ========================== ``ETHTOOL_A_STATS_HEADER`` nested request header + ``ETHTOOL_A_STATS_SRC`` u32 source of statistics ``ETHTOOL_A_STATS_GROUPS`` bitset requested groups of stats ======================================= ====== ========================== @@ -1496,6 +1529,8 @@ Kernel response contents: +-----------------------------------+--------+--------------------------------+ | ``ETHTOOL_A_STATS_HEADER`` | nested | reply header | +-----------------------------------+--------+--------------------------------+ + | ``ETHTOOL_A_STATS_SRC`` | u32 | source of statistics | + +-----------------------------------+--------+--------------------------------+ | ``ETHTOOL_A_STATS_GRP`` | nested | one or more group of stats | +-+---------------------------------+--------+--------------------------------+ | | ``ETHTOOL_A_STATS_GRP_ID`` | u32 | group ID - ``ETHTOOL_STATS_*`` | @@ -1557,6 +1592,11 @@ Low and high bounds are inclusive, for example: etherStatsPkts512to1023Octets 512 1023 ============================= ==== ==== +``ETHTOOL_A_STATS_SRC`` is optional. Similar to ``PAUSE_GET``, it takes values +from ``enum ethtool_mac_stats_src``. If absent from the request, stats will be +provided with an ``ETHTOOL_A_STATS_SRC`` attribute in the response equal to +``ETHTOOL_MAC_STATS_SRC_AGGREGATE``. + PHC_VCLOCKS_GET =============== @@ -1716,6 +1756,225 @@ being used. Current supported options are toeplitz, xor or crc32. ETHTOOL_A_RSS_INDIR attribute returns RSS indrection table where each byte indicates queue number. +PLCA_GET_CFG +============ + +Gets the IEEE 802.3cg-2019 Clause 148 Physical Layer Collision Avoidance +(PLCA) Reconciliation Sublayer (RS) attributes. 
+ +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_PLCA_HEADER`` nested request header + ===================================== ====== ========================== + +Kernel response contents: + + ====================================== ====== ============================= + ``ETHTOOL_A_PLCA_HEADER`` nested reply header + ``ETHTOOL_A_PLCA_VERSION`` u16 Supported PLCA management + interface standard/version + ``ETHTOOL_A_PLCA_ENABLED`` u8 PLCA Admin State + ``ETHTOOL_A_PLCA_NODE_ID`` u32 PLCA unique local node ID + ``ETHTOOL_A_PLCA_NODE_CNT`` u32 Number of PLCA nodes on the + network, including the + coordinator + ``ETHTOOL_A_PLCA_TO_TMR`` u32 Transmit Opportunity Timer + value in bit-times (BT) + ``ETHTOOL_A_PLCA_BURST_CNT`` u32 Number of additional packets + the node is allowed to send + within a single TO + ``ETHTOOL_A_PLCA_BURST_TMR`` u32 Time to wait for the MAC to + transmit a new frame before + terminating the burst + ====================================== ====== ============================= + +When set, the optional ``ETHTOOL_A_PLCA_VERSION`` attribute indicates which +standard and version the PLCA management interface complies with. When not set, +the interface is vendor-specific and (possibly) supplied by the driver. +The OPEN Alliance SIG specifies a standard register map for 10BASE-T1S PHYs +embedding the PLCA Reconciliation Sublayer. See "10BASE-T1S PLCA Management +Registers" at https://www.opensig.org/about/specifications/. + +When set, the optional ``ETHTOOL_A_PLCA_ENABLED`` attribute indicates the +administrative state of the PLCA RS. When not set, the node operates in "plain" +CSMA/CD mode. This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.1 +aPLCAAdminState / 30.16.1.2.1 acPLCAAdminControl. + +When set, the optional ``ETHTOOL_A_PLCA_NODE_ID`` attribute indicates the +configured local node ID of the PHY.
This ID determines which transmit +opportunity (TO) is reserved for the node to transmit into. This option is +corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.4 aPLCALocalNodeID. The valid +range for this attribute is [0 .. 255] where 255 means "not configured". + +When set, the optional ``ETHTOOL_A_PLCA_NODE_CNT`` attribute indicates the +configured maximum number of PLCA nodes on the mixing-segment. This number +determines the total number of transmit opportunities generated during a +PLCA cycle. This attribute is relevant only for the PLCA coordinator, which is +the node with aPLCALocalNodeID set to 0. Follower nodes ignore this setting. +This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.3 +aPLCANodeCount. The valid range for this attribute is [1 .. 255]. + +When set, the optional ``ETHTOOL_A_PLCA_TO_TMR`` attribute indicates the +configured value of the transmit opportunity timer in bit-times. This value +must be set equal across all nodes sharing the medium for PLCA to work +correctly. This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.5 +aPLCATransmitOpportunityTimer. The valid range for this attribute is +[0 .. 255]. + +When set, the optional ``ETHTOOL_A_PLCA_BURST_CNT`` attribute indicates the +configured number of extra packets that the node is allowed to send during a +single transmit opportunity. By default, this attribute is 0, meaning that +the node can only send a single frame per TO. When greater than 0, the PLCA RS +keeps the TO after any transmission, waiting for the MAC to send a new frame +for up to aPLCABurstTimer BTs. This can only happen a number of times per PLCA +cycle up to the value of this parameter. After that, the burst is over and the +normal counting of TOs resumes. This option is corresponding to +``IEEE 802.3cg-2019`` 30.16.1.1.6 aPLCAMaxBurstCount. The valid range for this +attribute is [0 .. 255]. 
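The valid ranges stated above for the node ID, node count, transmit opportunity timer and burst count can be collected into a small range check. The following is only a sketch: the dictionary keys, the helper names ``check_plca_cfg`` and ``is_coordinator``, and the error strings are hypothetical and do not correspond to any kernel or ethtool interface.

```python
# Inclusive ranges exactly as stated in the attribute descriptions above.
# The short key names are assumptions chosen for this illustration.
PLCA_RANGES = {
    "node_id": (0, 255),    # 255 means "not configured"
    "node_cnt": (1, 255),   # only relevant on the coordinator
    "to_tmr": (0, 255),     # in bit-times
    "burst_cnt": (0, 255),  # 0 means one frame per transmit opportunity
}

NODE_ID_NOT_CONFIGURED = 255


def check_plca_cfg(cfg):
    """Return a list of human-readable problems found in a PLCA config dict."""
    problems = []
    for name, val in cfg.items():
        if name not in PLCA_RANGES:
            problems.append(f"{name}: unknown parameter")
            continue
        lo, hi = PLCA_RANGES[name]
        if not lo <= val <= hi:
            problems.append(f"{name}: {val} outside [{lo} .. {hi}]")
    return problems


def is_coordinator(cfg):
    """The coordinator is the node whose local node ID is 0."""
    return cfg.get("node_id") == 0
```

A check of this kind only covers the documented ranges; whether a particular PHY accepts a given value is still up to the driver and hardware.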
+ +When set, the optional ``ETHTOOL_A_PLCA_BURST_TMR`` attribute indicates how +many bit-times the PLCA RS waits for the MAC to initiate a new transmission +when aPLCAMaxBurstCount is greater than 0. If the MAC fails to send a new +frame within this time, the burst ends and the counting of TOs resumes. +Otherwise, the new frame is sent as part of the current burst. This option +is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.7 aPLCABurstTimer. The +valid range for this attribute is [0 .. 255]. Although, the value should be +set greater than the Inter-Frame-Gap (IFG) time of the MAC (plus some margin) +for PLCA burst mode to work as intended. + +PLCA_SET_CFG +============ + +Sets PLCA RS parameters. + +Request contents: + + ====================================== ====== ============================= + ``ETHTOOL_A_PLCA_HEADER`` nested request header + ``ETHTOOL_A_PLCA_ENABLED`` u8 PLCA Admin State + ``ETHTOOL_A_PLCA_NODE_ID`` u8 PLCA unique local node ID + ``ETHTOOL_A_PLCA_NODE_CNT`` u8 Number of PLCA nodes on the + network, including the + coordinator + ``ETHTOOL_A_PLCA_TO_TMR`` u8 Transmit Opportunity Timer + value in bit-times (BT) + ``ETHTOOL_A_PLCA_BURST_CNT`` u8 Number of additional packets + the node is allowed to send + within a single TO + ``ETHTOOL_A_PLCA_BURST_TMR`` u8 Time to wait for the MAC to + transmit a new frame before + terminating the burst + ====================================== ====== ============================= + +For a description of each attribute, see ``PLCA_GET_CFG``. + +PLCA_GET_STATUS +=============== + +Gets PLCA RS status information.
+ +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_PLCA_HEADER`` nested request header + ===================================== ====== ========================== + +Kernel response contents: + + ====================================== ====== ============================= + ``ETHTOOL_A_PLCA_HEADER`` nested reply header + ``ETHTOOL_A_PLCA_STATUS`` u8 PLCA RS operational status + ====================================== ====== ============================= + +When set, the ``ETHTOOL_A_PLCA_STATUS`` attribute indicates whether the node is +detecting the presence of the BEACON on the network. This flag is +corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.2 aPLCAStatus. + +MM_GET +====== + +Retrieve 802.3 MAC Merge parameters. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_MM_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ================================= ====== =================================== + ``ETHTOOL_A_MM_HEADER`` nested request header + ``ETHTOOL_A_MM_PMAC_ENABLED`` bool set if RX of preemptible and SMD-V + frames is enabled + ``ETHTOOL_A_MM_TX_ENABLED`` bool set if TX of preemptible frames is + administratively enabled (might be + inactive if verification failed) + ``ETHTOOL_A_MM_TX_ACTIVE`` bool set if TX of preemptible frames is + operationally enabled + ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE`` u32 minimum size of transmitted + non-final fragments, in octets + ``ETHTOOL_A_MM_RX_MIN_FRAG_SIZE`` u32 minimum size of received non-final + fragments, in octets + ``ETHTOOL_A_MM_VERIFY_ENABLED`` bool set if TX of SMD-V frames is + administratively enabled + ``ETHTOOL_A_MM_VERIFY_STATUS`` u8 state of the verification function + ``ETHTOOL_A_MM_VERIFY_TIME`` u32 delay between verification attempts + ``ETHTOOL_A_MM_MAX_VERIFY_TIME`` u32 maximum verification interval +
supported by device + ``ETHTOOL_A_MM_STATS`` nested IEEE 802.3-2018 subclause 30.14.1 + oMACMergeEntity statistics counters + ================================= ====== =================================== + +The attributes are populated by the device driver through the following +structure: + +.. kernel-doc:: include/linux/ethtool.h + :identifiers: ethtool_mm_state + +The ``ETHTOOL_A_MM_VERIFY_STATUS`` will report one of the values from + +.. kernel-doc:: include/uapi/linux/ethtool.h + :identifiers: ethtool_mm_verify_status + +If ``ETHTOOL_A_MM_VERIFY_ENABLED`` was passed as false in the ``MM_SET`` +command, ``ETHTOOL_A_MM_VERIFY_STATUS`` will report either +``ETHTOOL_MM_VERIFY_STATUS_INITIAL`` or ``ETHTOOL_MM_VERIFY_STATUS_DISABLED``, +otherwise it should report one of the other states. + +It is recommended that drivers start with the pMAC disabled, and enable it upon +user space request. It is also recommended that user space does not depend upon +the default values from ``ETHTOOL_MSG_MM_GET`` requests. + +``ETHTOOL_A_MM_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in +``ETHTOOL_A_HEADER_FLAGS``. The attribute will be empty if driver did not +report any statistics. Drivers fill in the statistics in the following +structure: + +.. kernel-doc:: include/linux/ethtool.h + :identifiers: ethtool_mm_stats + +MM_SET +====== + +Modifies the configuration of the 802.3 MAC Merge layer. + +Request contents: + + ================================= ====== ========================== + ``ETHTOOL_A_MM_VERIFY_TIME`` u32 see MM_GET description + ``ETHTOOL_A_MM_VERIFY_ENABLED`` bool see MM_GET description + ``ETHTOOL_A_MM_TX_ENABLED`` bool see MM_GET description + ``ETHTOOL_A_MM_PMAC_ENABLED`` bool see MM_GET description + ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE`` u32 see MM_GET description + ================================= ====== ========================== + +The attributes are propagated to the driver through the following structure: + +.. 
kernel-doc:: include/linux/ethtool.h + :identifiers: ethtool_mm_cfg + Request translation =================== @@ -1817,4 +2076,9 @@ are netlink only. n/a ``ETHTOOL_MSG_PHC_VCLOCKS_GET`` n/a ``ETHTOOL_MSG_MODULE_GET`` n/a ``ETHTOOL_MSG_MODULE_SET`` + n/a ``ETHTOOL_MSG_PLCA_GET_CFG`` + n/a ``ETHTOOL_MSG_PLCA_SET_CFG`` + n/a ``ETHTOOL_MSG_PLCA_GET_STATUS`` + n/a ``ETHTOOL_MSG_MM_GET`` + n/a ``ETHTOOL_MSG_MM_SET`` =================================== ===================================== diff --git a/Documentation/networking/gtp.rst b/Documentation/networking/gtp.rst index 1563fb94b289..9a7835cc1437 100644 --- a/Documentation/networking/gtp.rst +++ b/Documentation/networking/gtp.rst @@ -162,7 +162,7 @@ Local GTP-U entity and tunnel identification GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152 for GTPv1-U and 3386 for GTPv0-U. -There is only one GTP-U entity (and therefor SGSN/GGSN/S-GW/PDN-GW +There is only one GTP-U entity (and therefore SGSN/GGSN/S-GW/PDN-GW instance) per IP address. Tunnel Endpoint Identifier (TEID) are unique per GTP-U entity. diff --git a/Documentation/networking/ieee802154.rst b/Documentation/networking/ieee802154.rst index f27856d77c8b..c652d383fe10 100644 --- a/Documentation/networking/ieee802154.rst +++ b/Documentation/networking/ieee802154.rst @@ -70,7 +70,7 @@ Like with WiFi, there are several types of devices implementing IEEE 802.15.4. exports a management (e.g. MLME) and data API. 2) 'SoftMAC' or just radio. These types of devices are just radio transceivers possibly with some kinds of acceleration like automatic CRC computation and -comparation, automagic ACK handling, address matching, etc. +comparison, automagic ACK handling, address matching, etc. Those types of devices require different approach to be hooked into Linux kernel. 
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 4f2d1f682a18..4ddcae33c336 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -120,6 +120,7 @@ Contents: xfrm_proc xfrm_sync xfrm_sysctl + xdp-rx-metadata .. only:: subproject and html diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 7fbd060d6047..87dd1c5283e6 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -50,7 +50,7 @@ ip_no_pmtu_disc - INTEGER Default: FALSE min_pmtu - INTEGER - default 552 - minimum Path MTU. Unless this is changed mannually, + default 552 - minimum Path MTU. Unless this is changed manually, each cached pmtu will never be lower than this setting. ip_forward_use_pmtu - BOOLEAN @@ -156,6 +156,9 @@ route/max_size - INTEGER From linux kernel 3.6 onwards, this is deprecated for ipv4 as route cache is no longer used. + From linux kernel 6.3 onwards, this is deprecated for ipv6 + as garbage collection manages cached route entries. + neigh/default/gc_thresh1 - INTEGER Minimum number of entries to keep. Garbage collector will not purge entries if there are fewer than this number. @@ -1589,6 +1592,14 @@ proxy_arp_pvlan - BOOLEAN Hewlett-Packard call it Source-Port filtering or port-isolation. Ericsson call it MAC-Forced Forwarding (RFC Draft). +proxy_delay - INTEGER + Delay proxy response. + + Delay response to a neighbor solicitation when proxy_arp + or proxy_ndp is enabled. A random value between [0, proxy_delay) + will be chosen, setting to zero means reply with no delay. + Value in jiffies. Defaults to 80. + shared_media - BOOLEAN Send(router) or accept(host) RFC1620 shared media redirects. Overrides secure_redirects. @@ -2075,7 +2086,7 @@ skip_notify_on_dev_down - BOOLEAN nexthop_compat_mode - BOOLEAN New nexthop API provides a means for managing nexthops independent of - prefixes. 
Backwards compatibilty with old route format is enabled by + prefixes. Backwards compatibility with old route format is enabled by default which means route dumps and notifications contain the new nexthop attribute but also the full, expanded nexthop definition. Further, updates or deletes of a nexthop configuration generate route @@ -2808,7 +2819,7 @@ pf_expose - INTEGER can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled, a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming SCTP_PF state and a SCTP_PF-state transport info can be got via - SCTP_GET_PEER_ADDR_INFO sockopt; When it's diabled, no + SCTP_GET_PEER_ADDR_INFO sockopt; When it's disabled, no SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO sockopt. diff --git a/Documentation/networking/ipvlan.rst b/Documentation/networking/ipvlan.rst index 0000c1d383bc..895d0ccfd596 100644 --- a/Documentation/networking/ipvlan.rst +++ b/Documentation/networking/ipvlan.rst @@ -61,7 +61,7 @@ e.g. IPvlan has two modes of operation - L2 and L3. For a given master device, you can select one of these two modes and all slaves on that master will operate in the same (selected) mode. The RX mode is almost identical except -that in L3 mode the slaves wont receive any multicast / broadcast traffic. +that in L3 mode the slaves won't receive any multicast / broadcast traffic. L3 mode is more restrictive since routing is controlled from the other (mostly) default namespace. diff --git a/Documentation/networking/j1939.rst b/Documentation/networking/j1939.rst index b705d2801e9c..e4bd7aa1f5aa 100644 --- a/Documentation/networking/j1939.rst +++ b/Documentation/networking/j1939.rst @@ -116,7 +116,7 @@ format, the Group Extension is set in the PS-field. ---------------------------------------- 23 ... 16 15 ... 8 ============== ======================== - F0h ... FFh GE (Group Extenstion) + F0h ... 
FFh GE (Group Extension) ============== ======================== On the other hand, when using PDU1 format, the PS-field contains a so-called diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst index 3a662f2b4d6e..f4e1b4e07adc 100644 --- a/Documentation/networking/net_failover.rst +++ b/Documentation/networking/net_failover.rst @@ -90,7 +90,7 @@ virtio-net interface, and ens11 is the slave 'primary' VF passthrough interface. One point to note here is that some user space network configuration daemons like systemd-networkd, ifupdown, etc, do not understand the 'net_failover' device; and on the first boot, the VM might end up with both 'failover' device -and VF accquiring IP addresses (either same or different) from the DHCP server. +and VF acquiring IP addresses (either same or different) from the DHCP server. This will result in lack of connectivity to the VM. So some tweaks might be needed to these network configuration daemons to make sure that an IP is received only on the 'failover' device. diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst index 1f5c4a04027c..dd0518e002f6 100644 --- a/Documentation/networking/netconsole.rst +++ b/Documentation/networking/netconsole.rst @@ -167,7 +167,7 @@ following format which is the same as /dev/kmsg:: Non printable characters in <message text> are escaped using "\xff" notation. If the message contains optional dictionary, verbatim -newline is used as the delimeter. +newline is used as the delimiter. If a message doesn't fit in certain number of bytes (currently 1000), the message is split into multiple fragments by netconsole. 
These diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst index 5db8c263b0c6..30f1344e7cca 100644 --- a/Documentation/networking/page_pool.rst +++ b/Documentation/networking/page_pool.rst @@ -11,7 +11,7 @@ Basic use involves replacing alloc_pages() calls with the page_pool_alloc_pages() call. Drivers should use page_pool_dev_alloc_pages() replacing dev_alloc_pages(). -API keeps track of inflight pages, in order to let API user know +API keeps track of in-flight pages, in order to let API user know when it is safe to free a page_pool object. Thus, API users must run page_pool_release_page() when a page is leaving the page_pool or call page_pool_put_page() where appropriate in order to maintain correct @@ -19,7 +19,7 @@ accounting. API user must call page_pool_put_page() once on a page, as it will either recycle the page, or in case of refcnt > 1, it will -release the DMA mapping and inflight state accounting. +release the DMA mapping and in-flight state accounting. Architecture overview ===================== @@ -88,7 +88,7 @@ a page will cause no race conditions is enough. directly into the pool fast cache. * page_pool_release_page(): Unmap the page (if mapped) and account for it on - inflight counters. + in-flight counters. * page_pool_dev_alloc_pages(): Get a page from the page allocator or page_pool caches. diff --git a/Documentation/networking/phonet.rst b/Documentation/networking/phonet.rst index 8668dcbc5e6a..d705cc5b09fc 100644 --- a/Documentation/networking/phonet.rst +++ b/Documentation/networking/phonet.rst @@ -131,7 +131,7 @@ Phonet resources, as follow:: Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O control request, or when the socket is closed. -Note that no more than one socket can be subcribed to any given +Note that no more than one socket can be subscribed to any given resource at a time. If not, ioctl() will return EBUSY. 
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index d11329a08984..b7ac4c64cf67 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -315,7 +315,7 @@ Some of the interface modes are described below: only the port id, but also so-called "extensions". The only documented extension so-far in the specification is the inclusion of timestamps, for PTP-enabled PHYs. This mode isn't compatible with QSGMII, but offers the - same capabilities in terms of link speed and negociation. + same capabilities in terms of link speed and negotiation. ``PHY_INTERFACE_MODE_1000BASEKX`` This is 1000BASE-X as defined by IEEE 802.3 Clause 36 with Clause 73 diff --git a/Documentation/networking/regulatory.rst b/Documentation/networking/regulatory.rst index 16782a95b74a..5e42c8a175c3 100644 --- a/Documentation/networking/regulatory.rst +++ b/Documentation/networking/regulatory.rst @@ -66,7 +66,7 @@ An example:: iw reg set CR This will request the kernel to set the regulatory domain to -the specificied alpha2. The kernel in turn will then ask userspace +the specified alpha2. The kernel in turn will then ask userspace to provide a regulatory domain for the alpha2 specified by the user by sending a uevent. @@ -158,7 +158,7 @@ kmalloc() a structure big enough to hold your regulatory domain structure and you should then fill it with your data. Finally you simply call regulatory_hint() with the regulatory domain structure in it. -Bellow is a simple example, with a regulatory domain cached using the stack. +Below is a simple example, with a regulatory domain cached using the stack. Your implementation may vary (read EEPROM cache instead, for example). 
Example cache of some regulatory domain:: diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst index e1af54424192..ec1323d92c96 100644 --- a/Documentation/networking/rxrpc.rst +++ b/Documentation/networking/rxrpc.rst @@ -1069,7 +1069,7 @@ The kernel interface functions are as follows: This value can be used to determine if the remote client has been restarted as it shouldn't change otherwise. - (#) Set the maxmimum lifespan on a call:: + (#) Set the maximum lifespan on a call:: void rxrpc_kernel_set_max_life(struct socket *sock, struct rxrpc_call *call, diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst index 423d138b5ff3..213637474478 100644 --- a/Documentation/networking/snmp_counter.rst +++ b/Documentation/networking/snmp_counter.rst @@ -980,7 +980,7 @@ How many reply packets of the SYN cookies the TCP stack receives. The MSS decoded from the SYN cookie is invalid. When this counter is updated, the received packet won't be treated as a SYN cookie and the -TcpExtSyncookiesRecv counter wont be updated. +TcpExtSyncookiesRecv counter won't be updated. 
Challenge ACK ============= @@ -1681,7 +1681,7 @@ RST to nstat-b:: nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP -Send 3 SYN repeatly to nstat-b:: +Send 3 SYN repeatedly to nstat-b:: nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst index c9aeb70dafa2..551b3cc29a41 100644 --- a/Documentation/networking/statistics.rst +++ b/Documentation/networking/statistics.rst @@ -171,6 +171,7 @@ statistics are supported in the following commands: - `ETHTOOL_MSG_PAUSE_GET` - `ETHTOOL_MSG_FEC_GET` + - `ETHTOOL_MSG_MM_GET` debugfs ------- diff --git a/Documentation/networking/sysfs-tagging.rst b/Documentation/networking/sysfs-tagging.rst index 83647e10c207..65307130ab63 100644 --- a/Documentation/networking/sysfs-tagging.rst +++ b/Documentation/networking/sysfs-tagging.rst @@ -43,6 +43,6 @@ Users of this interface: - current_ns() which returns current's namespace - netlink_ns() which returns a socket's namespace - - initial_ns() which returns the initial namesapce + - initial_ns() which returns the initial namespace - call kobj_ns_exit() when an individual tag is no longer valid diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst new file mode 100644 index 000000000000..aac63fc2d08b --- /dev/null +++ b/Documentation/networking/xdp-rx-metadata.rst @@ -0,0 +1,110 @@ +=============== +XDP RX Metadata +=============== + +This document describes how an eXpress Data Path (XDP) program can access +hardware metadata related to a packet using a set of helper functions, +and how it can pass that metadata on to other consumers. + +General Design +============== + +XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame. +Every device driver that wishes to expose additional packet metadata can +implement these kfuncs. 
The set of kfuncs is declared in ``include/net/xdp.h`` +via ``XDP_METADATA_KFUNC_xxx``. + +Currently, the following kfuncs are supported. In the future, as more +metadata is supported, this set will grow: + +.. kernel-doc:: net/core/xdp.c + :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash + +An XDP program can use these kfuncs to read the metadata into stack +variables for its own consumption. Or, to pass the metadata on to other +consumers, an XDP program can store it into the metadata area carried +ahead of the packet. + +Not all kfuncs have to be implemented by the device driver; when not +implemented, the default ones that return ``-EOPNOTSUPP`` will be used. + +Within an XDP frame, the metadata layout (accessed via ``xdp_buff``) is +as follows:: + + +----------+-----------------+------+ + | headroom | custom metadata | data | + +----------+-----------------+------+ + ^ ^ + | | + xdp_buff->data_meta xdp_buff->data + +An XDP program can store individual metadata items into this ``data_meta`` +area in whichever format it chooses. Later consumers of the metadata +will have to agree on the format by some out of band contract (like for +the AF_XDP use case, see below). + +AF_XDP +====== + +:doc:`af_xdp` use-case implies that there is a contract between the BPF +program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and +the final consumer. Thus the BPF program manually allocates a fixed number of +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset +of kfuncs to populate it. The userspace ``XSK`` consumer computes +``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata. +Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and +``METADATA_SIZE`` is an application-specific constant (``AF_XDP`` receive +descriptor does _not_ explicitly carry the size of the metadata). 
+ +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer):: + + +----------+-----------------+------+ + | headroom | custom metadata | data | + +----------+-----------------+------+ + ^ + | + rx_desc->address + +XDP_PASS +======== + +This is the path where the packets processed by the XDP program are passed +into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff`` +contents. Currently, every driver has custom kernel code to parse +the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb`` +conversion, and the XDP metadata is not used by the kernel when building +``skbs``. However, TC-BPF programs can access the XDP metadata area using +the ``data_meta`` pointer. + +In the future, we'd like to support a case where an XDP program +can override some of the metadata used for building ``skbs``. + +bpf_redirect_map +================ + +``bpf_redirect_map`` can redirect the frame to a different device. +Some devices (like virtual ethernet links) support running a second XDP +program after the redirect. However, the final consumer doesn't have +access to the original hardware descriptor and can't access any of +the original metadata. The same applies to XDP programs installed +into devmaps and cpumaps. + +This means that for redirected packets only custom metadata is +currently supported, which has to be prepared by the initial XDP program +before redirect. If the frame is eventually passed to the kernel, the +``skb`` created from such a frame won't have any hardware metadata populated +in its ``skb``. If such a packet is later redirected into an ``XSK``, +that will also only have access to the custom metadata. + +bpf_tail_call +============= + +Adding programs that access metadata kfuncs to the ``BPF_MAP_TYPE_PROG_ARRAY`` +is currently not supported. 
+ +Example +======= + +See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and +``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of +BPF program that handles XDP metadata. diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst index c43ace79e320..83abdfef4ec3 100644 --- a/Documentation/networking/xfrm_device.rst +++ b/Documentation/networking/xfrm_device.rst @@ -64,7 +64,7 @@ Callbacks to implement /* from include/linux/netdevice.h */ struct xfrmdev_ops { /* Crypto and Packet offload callbacks */ - int (*xdo_dev_state_add) (struct xfrm_state *x); + int (*xdo_dev_state_add) (struct xfrm_state *x, struct netlink_ext_ack *extack); void (*xdo_dev_state_delete) (struct xfrm_state *x); void (*xdo_dev_state_free) (struct xfrm_state *x); bool (*xdo_dev_offload_ok) (struct sk_buff *skb, @@ -73,7 +73,7 @@ Callbacks to implement /* Solely packet offload callbacks */ void (*xdo_dev_state_update_curlft) (struct xfrm_state *x); - int (*xdo_dev_policy_add) (struct xfrm_policy *x); + int (*xdo_dev_policy_add) (struct xfrm_policy *x, struct netlink_ext_ack *extack); void (*xdo_dev_policy_delete) (struct xfrm_policy *x); void (*xdo_dev_policy_free) (struct xfrm_policy *x); }; diff --git a/Documentation/userspace-api/netlink/c-code-gen.rst b/Documentation/userspace-api/netlink/c-code-gen.rst new file mode 100644 index 000000000000..89de42c13350 --- /dev/null +++ b/Documentation/userspace-api/netlink/c-code-gen.rst @@ -0,0 +1,107 @@ +.. SPDX-License-Identifier: BSD-3-Clause + +============================== +Netlink spec C code generation +============================== + +This document describes how Netlink specifications are used to render +C code (uAPI, policies etc.). It also defines the additional properties +allowed in older families by the ``genetlink-c`` protocol level, +to control the naming. + +For brevity this document refers to ``name`` properties of various +objects by the object type. 
For example ``$attr`` is the value +of ``name`` in an attribute, and ``$family`` is the name of the +family (the global ``name`` property). + +The upper case is used to denote literal values, e.g. ``$family-CMD`` +means the concatenation of ``$family``, a dash character, and the literal +``CMD``. + +The names of ``#defines`` and enum values are always converted to upper case, +and with dashes (``-``) replaced by underscores (``_``). + +If the constructed name is a C keyword, an extra underscore is +appended (``do`` -> ``do_``). + +Globals +======= + +``c-family-name`` controls the name of the ``#define`` for the family +name, default is ``$family-FAMILY-NAME``. + +``c-version-name`` controls the name of the ``#define`` for the version +of the family, default is ``$family-FAMILY-VERSION``. + +``max-by-define`` selects if max values for enums are defined as a +``#define`` rather than inside the enum. + +Definitions +=========== + +Constants +--------- + +Every constant is rendered as a ``#define``. +The name of the constant is ``$family-$constant`` and the value +is rendered as a string or integer according to its type in the spec. + +Enums and flags +--------------- + +Enums are named ``$family-$enum``. The full name can be set directly +or suppressed by specifying the ``enum-name`` property. +Default entry name is ``$family-$enum-$entry``. +If ``name-prefix`` is specified it replaces the ``$family-$enum`` +portion of the entry name. + +Boolean ``render-max`` controls creation of the max values +(which are enabled by default for attribute enums). + +Attributes +========== + +Each attribute set (excluding fractional sets) is rendered as an enum. + +Attribute enums are traditionally unnamed in netlink headers. +If naming is desired ``enum-name`` can be used to specify the name. + +The default attribute name prefix is ``$family-A`` if the name of the set +is the same as the name of the family and ``$family-A-$set`` if the names +differ. 
The prefix can be overridden by the ``name-prefix`` property of a set. +The rest of the section will refer to the prefix as ``$pfx``. + +Attributes are named ``$pfx-$attribute``. + +Attribute enums end with two special values ``__$pfx-MAX`` and ``$pfx-MAX`` +which are used for sizing attribute tables. +These two names can be specified directly with the ``attr-cnt-name`` +and ``attr-max-name`` properties respectively. + +If ``max-by-define`` is set to ``true`` at the global level ``attr-max-name`` +will be specified as a ``#define`` rather than an enum value. + +Operations +========== + +Operations are named ``$family-CMD-$operation``. +If ``name-prefix`` is specified it replaces the ``$family-CMD`` +portion of the name. + +Similarly to attribute enums operation enums end with special count and max +attributes. For operations those attributes can be renamed with +``cmd-cnt-name`` and ``cmd-max-name``. Max will be a define if ``max-by-define`` +is ``true``. + +Multicast groups +================ + +Each multicast group gets a define rendered into the kernel uAPI header. +The name of the define is ``$family-MCGRP-$group``, and can be overwritten +with the ``c-define-name`` property. + +Code generation +=============== + +uAPI header is assumed to come from ``<linux/$family.h>`` in the default header +search path. It can be changed using the ``uapi-header`` global property. diff --git a/Documentation/userspace-api/netlink/genetlink-legacy.rst b/Documentation/userspace-api/netlink/genetlink-legacy.rst new file mode 100644 index 000000000000..3bf0bcdf21d8 --- /dev/null +++ b/Documentation/userspace-api/netlink/genetlink-legacy.rst @@ -0,0 +1,178 @@ +.. 
SPDX-License-Identifier: BSD-3-Clause + +================================================================= +Netlink specification support for legacy Generic Netlink families +================================================================= + +This document describes the many additional quirks and properties +required to describe older Generic Netlink families which form +the ``genetlink-legacy`` protocol level. + +The spec is a work in progress, some of the quirks are just documented +for future reference. + +Specification (defined) +======================= + +Attribute type nests +-------------------- + +New Netlink families should use ``multi-attr`` to define arrays. +Older families (e.g. ``genetlink`` control family) attempted to +define array types reusing attribute type to carry information. + +For reference the ``multi-attr`` array may look like this:: + + [ARRAY-ATTR] + [INDEX (optionally)] + [MEMBER1] + [MEMBER2] + [SOME-OTHER-ATTR] + [ARRAY-ATTR] + [INDEX (optionally)] + [MEMBER1] + [MEMBER2] + +where ``ARRAY-ATTR`` is the array entry type. + +array-nest +~~~~~~~~~~ + +``array-nest`` creates the following structure:: + + [SOME-OTHER-ATTR] + [ARRAY-ATTR] + [ENTRY] + [MEMBER1] + [MEMBER2] + [ENTRY] + [MEMBER1] + [MEMBER2] + +It wraps the entire array in an extra attribute (hence limiting its size +to 64kB). The ``ENTRY`` nests are special and have the index of the entry +as their type instead of normal attribute type. + +type-value +~~~~~~~~~~ + +``type-value`` is a construct which uses attribute types to carry +information about a single object (often used when array is dumped +entry-by-entry). + +``type-value`` can have multiple levels of nesting, for example +genetlink's policy dumps create the following structures:: + + [POLICY-IDX] + [ATTR-IDX] + [POLICY-INFO-ATTR1] + [POLICY-INFO-ATTR2] + +Where the first level of nest has the policy index as it's attribute +type, it contains a single nest which has the attribute index as its +type. 
Inside the attr-index nest are the policy attributes. Modern +Netlink families should have instead defined this as a flat structure, +the nesting serves no good purpose here. + +Operations +========== + +Enum (message ID) model +----------------------- + +unified +~~~~~~~ + +Modern families use the ``unified`` message ID model, which uses +a single enumeration for all messages within family. Requests and +responses share the same message ID. Notifications have separate +IDs from the same space. For example given the following list +of operations: + +.. code-block:: yaml + + - + name: a + value: 1 + do: ... + - + name: b + do: ... + - + name: c + value: 4 + notify: a + - + name: d + do: ... + +Requests and responses for operation ``a`` will have the ID of 1, +the requests and responses of ``b`` - 2 (since there is no explicit +``value`` it's previous operation ``+ 1``). Notification ``c`` will +use the ID of 4, operation ``d`` 5 etc. + +directional +~~~~~~~~~~~ + +The ``directional`` model splits the ID assignment by the direction of +the message. Messages from and to the kernel can't be confused with +each other so this conserves the ID space (at the cost of making +the programming more cumbersome). + +In this case ``value`` attribute should be specified in the ``request`` +``reply`` sections of the operations (if an operation has both ``do`` +and ``dump`` the IDs are shared, ``value`` should be set in ``do``). +For notifications the ``value`` is provided at the op level but it +only allocates a ``reply`` (i.e. a "from-kernel" ID). Let's look +at an example: + +.. code-block:: yaml + + - + name: a + do: + request: + value: 2 + attributes: ... + reply: + value: 1 + attributes: ... + - + name: b + notify: a + - + name: c + notify: a + value: 7 + - + name: d + do: ... + +In this case ``a`` will use 2 when sending the message to the kernel +and expects message with ID 1 in response. Notification ``b`` allocates +a "from-kernel" ID which is 2. 
``c`` allocates "from-kernel" ID of 7. +If operation ``d`` does not set ``values`` explicitly in the spec +it will be allocated 3 for the request (``a`` is the previous operation +with a request section and the value of 2) and 8 for response (``c`` is +the previous operation in the "from-kernel" direction). + +Other quirks (todo) +=================== + +Structures +---------- + +Legacy families can define C structures both to be used as the contents +of an attribute and as a fixed message header. The plan is to define +the structs in ``definitions`` and link the appropriate attrs. + +Multi-message DO +---------------- + +New Netlink families should never respond to a DO operation with multiple +replies, with ``NLM_F_MULTI`` set. Use a filtered dump instead. + +At the spec level we can define a ``dumps`` property for the ``do``, +perhaps with values of ``combine`` and ``multi-object`` depending +on how the parsing should be implemented (parse into a single reply +vs list of objects i.e. pretty much a dump). diff --git a/Documentation/userspace-api/netlink/index.rst b/Documentation/userspace-api/netlink/index.rst index b0c21538d97d..26f3720cb3be 100644 --- a/Documentation/userspace-api/netlink/index.rst +++ b/Documentation/userspace-api/netlink/index.rst @@ -10,3 +10,9 @@ Netlink documentation for users. :maxdepth: 2 intro + intro-specs + specs + c-code-gen + genetlink-legacy + +See also :ref:`Documentation/core-api/netlink.rst <kernel_netlink>`. diff --git a/Documentation/userspace-api/netlink/intro-specs.rst b/Documentation/userspace-api/netlink/intro-specs.rst new file mode 100644 index 000000000000..a3b847eafff7 --- /dev/null +++ b/Documentation/userspace-api/netlink/intro-specs.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: BSD-3-Clause + +===================================== +Using Netlink protocol specifications +===================================== + +This document is a quick starting guide for using Netlink protocol +specifications. 
For more detailed description of the specs see :doc:`specs`. + +Simple CLI +========== + +Kernel comes with a simple CLI tool which should be useful when +developing Netlink related code. The tool is implemented in Python +and can use a YAML specification to issue Netlink requests +to the kernel. Only Generic Netlink is supported. + +The tool is located at ``tools/net/ynl/cli.py``. It accepts +a handul of arguments, the most important ones are: + + - ``--spec`` - point to the spec file + - ``--do $name`` / ``--dump $name`` - issue request ``$name`` + - ``--json $attrs`` - provide attributes for the request + - ``--subscribe $group`` - receive notifications from ``$group`` + +YAML specs can be found under ``Documentation/netlink/specs/``. + +Example use:: + + $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/ethtool.yaml \ + --do rings-get \ + --json '{"header":{"dev-index": 18}}' + {'header': {'dev-index': 18, 'dev-name': 'eni1np1'}, + 'rx': 0, + 'rx-jumbo': 0, + 'rx-jumbo-max': 4096, + 'rx-max': 4096, + 'rx-mini': 0, + 'rx-mini-max': 4096, + 'tx': 0, + 'tx-max': 4096, + 'tx-push': 0} + +The input arguments are parsed as JSON, while the output is only +Python-pretty-printed. This is because some Netlink types can't +be expressed as JSON directly. If such attributes are needed in +the input some hacking of the script will be necessary. + +The spec and Netlink internals are factored out as a standalone +library - it should be easy to write Python tools / tests reusing +code from ``cli.py``. + +Generating kernel code +====================== + +``tools/net/ynl/ynl-regen.sh`` scans the kernel tree in search of +auto-generated files which need to be updated. Using this tool is the easiest +way to generate / update auto-generated code. + +By default code is re-generated only if spec is newer than the source, +to force regeneration use ``-f``. 
+ +``ynl-regen.sh`` searches for ``YNL-GEN`` in the contents of files +(note that it only scans files in the git index, that is only files +tracked by git!) For instance the ``fou_nl.c`` kernel source contains:: + + /* Documentation/netlink/specs/fou.yaml */ + /* YNL-GEN kernel source */ + +``ynl-regen.sh`` will find this marker and replace the file with +kernel source based on fou.yaml. + +The simplest way to generate a new file based on a spec is to add +the two marker lines like above to a file, add that file to git, +and run the regeneration tool. Grep the tree for ``YNL-GEN`` +to see other examples. + +The code generation itself is performed by ``tools/net/ynl/ynl-gen-c.py`` +but it takes a few arguments so calling it directly for each file +quickly becomes tedious. diff --git a/Documentation/userspace-api/netlink/specs.rst b/Documentation/userspace-api/netlink/specs.rst new file mode 100644 index 000000000000..6ffe8137cd90 --- /dev/null +++ b/Documentation/userspace-api/netlink/specs.rst @@ -0,0 +1,425 @@ +.. SPDX-License-Identifier: BSD-3-Clause + +========================================= +Netlink protocol specifications (in YAML) +========================================= + +Netlink protocol specifications are complete, machine readable descriptions of +Netlink protocols written in YAML. The goal of the specifications is to allow +separating Netlink parsing from user space logic and minimize the amount of +hand written Netlink code for each new family, command, attribute. +Netlink specs should be complete and not depend on any other spec +or C header file, making it easy to use in languages which can't include +kernel headers directly. + +Internally kernel uses the YAML specs to generate: + + - the C uAPI header + - documentation of the protocol as a ReST file + - policy tables for input attribute validation + - operation tables + +YAML specifications can be found under ``Documentation/netlink/specs/`` + +This document describes details of the schema. 
+See :doc:`intro-specs` for a practical starting guide. + +Compatibility levels +==================== + +There are four schema levels for Netlink specs, from the simplest used +by new families to the most complex covering all the quirks of the old ones. +Each next level inherits the attributes of the previous level, meaning that +user capable of parsing more complex ``genetlink`` schemas is also compatible +with simpler ones. The levels are: + + - ``genetlink`` - most streamlined, should be used by all new families + - ``genetlink-c`` - superset of ``genetlink`` with extra attributes allowing + customization of define and enum type and value names; this schema should + be equivalent to ``genetlink`` for all implementations which don't interact + directly with C uAPI headers + - ``genetlink-legacy`` - Generic Netlink catch all schema supporting quirks of + all old genetlink families, strange attribute formats, binary structures etc. + - ``netlink-raw`` - catch all schema supporting pre-Generic Netlink protocols + such as ``NETLINK_ROUTE`` + +The definition of the schemas (in ``jsonschema``) can be found +under ``Documentation/netlink/``. + +Schema structure +================ + +YAML schema has the following conceptual sections: + + - globals + - definitions + - attributes + - operations + - multicast groups + +Most properties in the schema accept (or in fact require) a ``doc`` +sub-property documenting the defined object. + +The following sections describe the properties of the most modern ``genetlink`` +schema. See the documentation of :doc:`genetlink-c <c-code-gen>` +for information on how C names are derived from name properties. + +genetlink +========= + +Globals +------- + +Attributes listed directly at the root level of the spec file. + +name +~~~~ + +Name of the family. Name identifies the family in a unique way, since +the Family IDs are allocated dynamically. + +version +~~~~~~~ + +Generic Netlink family version, default is 1. 
+ +protocol +~~~~~~~~ + +The schema level, default is ``genetlink``, which is the only value +allowed for new ``genetlink`` families. + +definitions +----------- + +Array of type and constant definitions. + +name +~~~~ + +Name of the type / constant. + +type +~~~~ + +One of the following types: + + - const - a single, standalone constant + - enum - defines an integer enumeration, with values for each entry + incrementing by 1, (e.g. 0, 1, 2, 3) + - flags - defines an integer enumeration, with values for each entry + occupying a bit, starting from bit 0, (e.g. 1, 2, 4, 8) + +value +~~~~~ + +The value for the ``const``. + +value-start +~~~~~~~~~~~ + +The first value for ``enum`` and ``flags``, allows overriding the default +start value of ``0`` (for ``enum``) and starting bit (for ``flags``). +For ``flags`` ``value-start`` selects the starting bit, not the shifted value. + +Sparse enumerations are not supported. + +entries +~~~~~~~ + +Array of names of the entries for ``enum`` and ``flags``. + +header +~~~~~~ + +For C-compatible languages, header which already defines this value. +In case the definition is shared by multiple families (e.g. ``IFNAMSIZ``) +code generators for C-compatible languages may prefer to add an appropriate +include instead of rendering a new definition. + +attribute-sets +-------------- + +This property contains information about netlink attributes of the family. +All families have at least one attribute set, most have multiple. +``attribute-sets`` is an array, with each entry describing a single set. + +Note that the spec is "flattened" and is not meant to visually resemble +the format of the netlink messages (unlike certain ad-hoc documentation +formats seen in kernel comments). In the spec subordinate attribute sets +are not defined inline as a nest, but defined in a separate attribute set +referred to with a ``nested-attributes`` property of the container. 
+ +Spec may also contain fractional sets - sets which contain a ``subset-of`` +property. Such sets describe a section of a full set, allowing narrowing down +which attributes are allowed in a nest or refining the validation criteria. +Fractional sets can only be used in nests. They are not rendered to the uAPI +in any fashion. + +name +~~~~ + +Uniquely identifies the attribute set, operations and nested attributes +refer to the sets by the ``name``. + +subset-of +~~~~~~~~~ + +Re-defines a portion of another set (a fractional set). +Allows narrowing down fields and changing validation criteria +or even types of attributes depending on the nest in which they +are contained. The ``value`` of each attribute in the fractional +set is implicitly the same as in the main set. + +attributes +~~~~~~~~~~ + +List of attributes in the set. + +Attribute properties +-------------------- + +name +~~~~ + +Identifies the attribute, unique within the set. + +type +~~~~ + +Netlink attribute type, see :ref:`attr_types`. + +.. _assign_val: + +value +~~~~~ + +Numerical attribute ID, used in serialized Netlink messages. +The ``value`` property can be skipped, in which case the attribute ID +will be the value of the previous attribute plus one (recursively) +and ``0`` for the first attribute in the attribute set. + +Note that the ``value`` of an attribute is defined only in its main set. + +enum +~~~~ + +For integer types specifies that values in the attribute belong +to an ``enum`` or ``flags`` from the ``definitions`` section. + +enum-as-flags +~~~~~~~~~~~~~ + +Treat ``enum`` as ``flags`` regardless of its type in ``definitions``. +When both ``enum`` and ``flags`` forms are needed ``definitions`` should +contain an ``enum`` and attributes which need the ``flags`` form should +use this attribute. + +nested-attributes +~~~~~~~~~~~~~~~~~ + +Identifies the attribute space for attributes nested within given attribute. +Only valid for complex attributes which may have sub-attributes. 
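The implicit numbering rule for ``value`` described above (previous value plus one, ``0`` for the first attribute in a set) can be sketched in a few lines of Python. This is a minimal illustration, not code from ``tools/net/ynl``; the function name ``assign_values`` and the list-of-dicts input shape are assumptions made for the example.

```python
def assign_values(entries, first_value=0):
    """Resolve explicit and implicit ``value`` properties, in spec order.

    ``entries`` is a hypothetical list of dicts, each with a ``name`` and an
    optional ``value``, mirroring how attributes are listed in a YAML spec.
    """
    assigned = {}
    next_value = first_value
    for entry in entries:
        # An explicit value wins; otherwise continue from the previous entry.
        value = entry.get("value", next_value)
        assigned[entry["name"]] = value
        next_value = value + 1
    return assigned


# Attribute set with one explicit value in the middle (made-up names).
attrs = [{"name": "x"}, {"name": "y"}, {"name": "z", "value": 5}, {"name": "w"}]
print(assign_values(attrs))

# The ``unified`` operation list from the genetlink-legacy text in this patch.
unified_ops = [
    {"name": "a", "value": 1},
    {"name": "b"},
    {"name": "c", "value": 4},
    {"name": "d"},
]
print(assign_values(unified_ops))
```

Because the spec text says operation values follow the same enumeration rules as attribute values, the same helper reproduces the IDs quoted for the ``unified`` example (1, 2, 4 and 5).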
+ +multi-attr (arrays) +~~~~~~~~~~~~~~~~~~~ + +Boolean property signifying that the attribute may be present multiple times. +Allowing an attribute to repeat is the recommended way of implementing arrays +(no extra nesting). + +byte-order +~~~~~~~~~~ + +For integer types specifies attribute byte order - ``little-endian`` +or ``big-endian``. + +checks +~~~~~~ + +Input validation constraints used by the kernel. User space should query +the policy of the running kernel using Generic Netlink introspection, +rather than depend on what is specified in the spec file. + +The validation policy in the kernel is formed by combining the type +definition (``type`` and ``nested-attributes``) and the ``checks``. + +operations +---------- + +This section describes messages passed between the kernel and the user space. +There are three types of entries in this section - operations, notifications +and events. + +Operations describe the most common request - response communication. User +sends a request and kernel replies. Each operation may contain any combination +of the two modes familiar to netlink users - ``do`` and ``dump``. +``do`` and ``dump`` in turn contain a combination of ``request`` and +``response`` properties. If no explicit message with attributes is passed +in a given direction (e.g. a ``dump`` which does not accept filter, or a ``do`` +of a SET operation to which the kernel responds with just the netlink error +code) ``request`` or ``response`` section can be skipped. +``request`` and ``response`` sections list the attributes allowed in a message. +The list contains only the names of attributes from a set referred +to by the ``attribute-set`` property. + +Notifications and events both refer to the asynchronous messages sent by +the kernel to members of a multicast group. The difference between the +two is that a notification shares its contents with a GET operation +(the name of the GET operation is specified in the ``notify`` property). 
+This arrangement is commonly used for notifications about +objects where the notification carries the full object definition. + +Events are more focused and carry only a subset of information rather than full +object state (a made up example would be a link state change event with just +the interface name and the new link state). Events contain the ``event`` +property. Events are considered less idiomatic for netlink and notifications +should be preferred. + +list +~~~~ + +The only property of ``operations`` for ``genetlink``, holds the list of +operations, notifications etc. + +Operation properties +-------------------- + +name +~~~~ + +Identifies the operation. + +value +~~~~~ + +Numerical message ID, used in serialized Netlink messages. +The same enumeration rules are applied as to +:ref:`attribute values<assign_val>`. + +attribute-set +~~~~~~~~~~~~~ + +Specifies the attribute set contained within the message. + +do +~~~ + +Specification for the ``doit`` request. Should contain ``request``, ``reply`` +or both of these properties, each holding a :ref:`attr_list`. + +dump +~~~~ + +Specification for the ``dumpit`` request. Should contain ``request``, ``reply`` +or both of these properties, each holding a :ref:`attr_list`. + +notify +~~~~~~ + +Designates the message as a notification. Contains the name of the operation +(possibly the same as the operation holding this property) which shares +the contents with the notification (``do``). + +event +~~~~~ + +Specification of attributes in the event, holds a :ref:`attr_list`. +``event`` property is mutually exclusive with ``notify``. + +mcgrp +~~~~~ + +Used with ``event`` and ``notify``, specifies which multicast group +message belongs to. + +.. _attr_list: + +Message attribute list +---------------------- + +``request``, ``reply`` and ``event`` properties have a single ``attributes`` +property which holds the list of attribute names. 
+ +Messages can also define ``pre`` and ``post`` properties which will be rendered +as ``pre_doit`` and ``post_doit`` calls in the kernel (these properties should +be ignored by user space). + +mcast-groups +------------ + +This section lists the multicast groups of the family. + +list +~~~~ + +The only property of ``mcast-groups`` for ``genetlink``, holds the list +of groups. + +Multicast group properties +-------------------------- + +name +~~~~ + +Uniquely identifies the multicast group in the family. Similarly to +Family ID, Multicast Group ID needs to be resolved at runtime, based +on the name. + +.. _attr_types: + +Attribute types +=============== + +This section describes the attribute types supported by the ``genetlink`` +compatibility level. Refer to documentation of different levels for additional +attribute types. + +Scalar integer types +-------------------- + +Fixed-width integer types: +``u8``, ``u16``, ``u32``, ``u64``, ``s8``, ``s16``, ``s32``, ``s64``. + +Note that types smaller than 32 bit should be avoided as using them +does not save any memory in Netlink messages (due to alignment). +See :ref:`pad_type` for padding of 64 bit attributes. + +The payload of the attribute is the integer in host order unless ``byte-order`` +specifies otherwise. + +.. _pad_type: + +pad +--- + +Special attribute type used for padding attributes which require alignment +bigger than standard 4B alignment required by netlink (e.g. 64 bit integers). +There can only be a single attribute of the ``pad`` type in any attribute set +and it should be automatically used for padding when needed. + +flag +---- + +Attribute with no payload, its presence is the entire information. + +binary +------ + +Raw binary data attribute, the contents are opaque to generic code. + +string +------ + +Character string. Unless ``checks`` has ``unterminated-ok`` set to ``true`` +the string is required to be null terminated. 
+``max-len`` in ``checks`` indicates the longest possible string, +if not present the length of the string is unbounded. + +Note that ``max-len`` does not count the terminating character. + +nest +---- + +Attribute containing other (nested) attributes. +``nested-attributes`` specifies which attribute set is used inside.
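The ``directional`` message ID allocation walked through in the ``genetlink-legacy`` part of this patch can be checked with a short Python sketch. It is an illustration under stated assumptions, not the kernel tooling: the function name ``assign_directional_ids`` and the dict layout (mirroring the YAML) are hypothetical, the starting ID for a direction with no explicit ``value`` is assumed to be ``0`` because the text does not state it, and an operation without explicit values is assumed to consume one ID in each direction.

```python
def assign_directional_ids(ops, first_id=0):
    """Return {name: (request_id, reply_id)}; None marks an unused direction."""
    next_request = first_id  # next free "to-kernel" ID (assumed start)
    next_reply = first_id    # next free "from-kernel" ID (assumed start)
    ids = {}
    for op in ops:
        request_id = reply_id = None
        if "notify" in op or "event" in op:
            # Notifications set ``value`` at the op level and only
            # allocate a "from-kernel" ID.
            reply_id = op.get("value", next_reply)
            next_reply = reply_id + 1
        else:
            # If both ``do`` and ``dump`` exist the IDs are shared and
            # ``value`` is expected in ``do``.
            mode = op.get("do") or op.get("dump") or {}
            request_id = mode.get("request", {}).get("value", next_request)
            reply_id = mode.get("reply", {}).get("value", next_reply)
            next_request = request_id + 1
            next_reply = reply_id + 1
        ids[op["name"]] = (request_id, reply_id)
    return ids


# The operation list from the ``directional`` example, attributes omitted.
ops = [
    {"name": "a", "do": {"request": {"value": 2}, "reply": {"value": 1}}},
    {"name": "b", "notify": "a"},
    {"name": "c", "notify": "a", "value": 7},
    {"name": "d", "do": {}},
]
print(assign_directional_ids(ops))
```

With this input the sketch yields 2/1 for ``a``, a from-kernel ID of 2 for ``b``, 7 for ``c``, and 3/8 for ``d``, which matches the values given in the text.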