Age | Commit message (Collapse) | Author |
|
We will support arbitrary SYN Cookie with BPF.
If BPF prog validates ACK and kfunc allocates a reqsk, it will
be carried to TCP stack as skb->sk with req->syncookie 1.
In skb_steal_sock(), we need to check inet_reqsk(sk)->syncookie
to see if the reqsk is created by kfunc. However, inet_reqsk()
is not available in sock.h.
Let's move skb_steal_sock() to request_sock.h.
While at it, we refactor skb_steal_sock() so it returns early if
skb->sk is NULL to minimise the following patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240115205514.68364-3-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
We will support arbitrary SYN Cookie with BPF.
When BPF prog validates ACK and kfunc allocates a reqsk, we need
to call tcp_ns_to_ts() to calculate an offset of TSval for later
use:
time
t0 : Send SYN+ACK
-> tsval = Initial TSval (Random Number)
t1 : Recv ACK of 3WHS
-> tsoff = TSecr - tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns())
= Initial TSval - t1
t2 : Send ACK
-> tsval = t2 + tsoff
= Initial TSval + (t2 - t1)
= Initial TSval + Time Delta (x)
(x) Note that the time delta does not include the initial RTT
from t0 to t1.
Let's move tcp_ns_to_ts() to tcp.h.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240115205514.68364-2-kuniyu@amazon.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
It is possible for bpf_kfunc_call_test_release() to be called from
bpf_map_free_deferred() when bpf_testmod is already unloaded and
perf_test_stuct.cnt which it tries to decrease is no longer in memory.
This patch tries to fix the issue by waiting for all references to be
dropped in bpf_testmod_exit().
The issue can be triggered by running 'test_progs -t map_kptr' in 6.5,
but is obscured in 6.6 by d119357d07435 ("rcu-tasks: Treat only
synchronous grace periods urgently").
Fixes: 65eb006d85a2 ("bpf: Move kernel test kfuncs to bpf_testmod")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/82f55c0e-0ec8-4fe1-8d8c-b1de07558ad9@linux.dev
Link: https://lore.kernel.org/bpf/20240110085737.8895-1-asavkov@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
There exists the following warning when building bpftool:
CC prog.o
prog.c: In function ‘profile_open_perf_events’:
prog.c:2301:24: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
2301 | sizeof(int), obj->rodata->num_cpu * obj->rodata->num_metric);
| ^~~
prog.c:2301:24: note: earlier argument should specify number of elements, later size of each element
Tested with the latest upstream GCC which contains a new warning option
-Wcalloc-transposed-args. The first argument to calloc is documented to
be number of elements in array, while the second argument is size of each
element, just switch the first and second arguments of calloc() to silence
the build warning, compile tested only.
Fixes: 47c09d6a9f67 ("bpftool: Introduce "prog profile" command")
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20240116061920.31172-1-yangtiezhu@loongson.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Few minor improvements for bpf_cmp() macro:
. reduce number of args in __bpf_cmp()
. rename NOFLIP to UNLIKELY
. add a comment about 64-bit truncation in "i" constraint
. use "ri" constraint for sizeof(rhs) <= 4
. improve error message for bpf_cmp_likely()
Before:
progs/iters_task_vma.c:31:7: error: variable 'ret' is uninitialized when used here [-Werror,-Wuninitialized]
31 | if (bpf_cmp_likely(seen, <==, 1000))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../bpf/bpf_experimental.h:325:3: note: expanded from macro 'bpf_cmp_likely'
325 | ret;
| ^~~
progs/iters_task_vma.c:31:7: note: variable 'ret' is declared here
../bpf/bpf_experimental.h:310:3: note: expanded from macro 'bpf_cmp_likely'
310 | bool ret;
| ^
After:
progs/iters_task_vma.c:31:7: error: invalid operand for instruction
31 | if (bpf_cmp_likely(seen, <==, 1000))
| ^
../bpf/bpf_experimental.h:324:17: note: expanded from macro 'bpf_cmp_likely'
324 | asm volatile("r0 " #OP " invalid compare");
| ^
<inline asm>:1:5: note: instantiated into assembly here
1 | r0 <== invalid compare
| ^
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240112220134.71209-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In verifier.rst, I found an incorrect statement (maybe a typo) in section
'Liveness marks tracking'. Basically, the wrong register is attributed
to have a read mark. This may confuse the user.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240111052136.3440417-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a selftest with a 4 bytes BPF_ST of 0 where the store is not
8-byte aligned. The goal is to ensure that STACK_ZERO is properly
marked in stack slots and the STACK_ZERO value can propagate
properly during the load.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240110051355.2737232-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
With patch set [1], precision backtracing supports register spill/fill
to/from the stack. The patch [2] allows initial imprecise register spill
with content 0. This is a common case for cpuv3 and lower for
initializing the stack variables with pattern
r1 = 0
*(u64 *)(r10 - 8) = r1
and the [2] has demonstrated good verification improvement.
For cpuv4, the initialization could be
*(u64 *)(r10 - 8) = 0
The current verifier marks the r10-8 contents with STACK_ZERO.
Similar to [2], let us permit the above insn to behave like
imprecise register spill which can reduce number of verified states.
The change is in function check_stack_write_fixed_off().
Before this patch, spilled zero will be marked as STACK_ZERO
which can provide precise values. In check_stack_write_var_off(),
STACK_ZERO will be maintained if writing a const zero
so later it can provide precise values if needed.
The above handling of '*(u64 *)(r10 - 8) = 0' as a spill
will have issues in check_stack_write_var_off() as the spill
will be converted to STACK_MISC and the precise value 0
is lost. To fix this issue, if the spill slots with const
zero and the BPF_ST write also with const zero, the spill slots
are preserved, which can later provide precise values
if needed. Without the change in check_stack_write_var_off(),
the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset'
will fail.
I checked cpuv3 and cpuv4 with and without this patch with veristat.
There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0'
is only generated with cpuv4.
For cpuv4:
$ ../veristat -C old.cpuv4.csv new.cpuv4.csv -e file,prog,insns,states -f 'insns_diff!=0'
File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF)
------------------------------------------ ------------------- --------- --------- --------------- ---------- ---------- -------------
local_storage_bench.bpf.linked3.o get_local 228 168 -60 (-26.32%) 17 14 -3 (-17.65%)
pyperf600_bpf_loop.bpf.linked3.o on_event 6066 4889 -1177 (-19.40%) 403 321 -82 (-20.35%)
test_cls_redirect.bpf.linked3.o cls_redirect 35483 35387 -96 (-0.27%) 2179 2177 -2 (-0.09%)
test_l4lb_noinline.bpf.linked3.o balancer_ingress 4494 4522 +28 (+0.62%) 217 219 +2 (+0.92%)
test_l4lb_noinline_dynptr.bpf.linked3.o balancer_ingress 1432 1455 +23 (+1.61%) 92 94 +2 (+2.17%)
test_xdp_noinline.bpf.linked3.o balancer_ingress_v6 3462 3458 -4 (-0.12%) 216 216 +0 (+0.00%)
verifier_iterating_callbacks.bpf.linked3.o widening 52 41 -11 (-21.15%) 4 3 -1 (-25.00%)
xdp_synproxy_kern.bpf.linked3.o syncookie_tc 12412 11719 -693 (-5.58%) 345 330 -15 (-4.35%)
xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 12478 11794 -684 (-5.48%) 346 331 -15 (-4.34%)
test_l4lb_noinline and test_l4lb_noinline_dynptr has minor regression, but
pyperf600_bpf_loop and local_storage_bench gets pretty good improvement.
[1] https://lore.kernel.org/all/20231205184248.1502704-1-andrii@kernel.org/
[2] https://lore.kernel.org/all/20231205184248.1502704-9-andrii@kernel.org/
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240110051348.2737007-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The previous commit implemented assigning IDs to registers holding
scalars before spill. Add the test cases to check the new functionality.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-10-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, when a scalar bounded register is spilled to the stack, its
ID is preserved, but only if was already assigned, i.e. if this register
was MOVed before.
Assign an ID on spill if none is set, so that equal scalars could be
tracked if a register is spilled to the stack and filled into another
register.
One test is adjusted to reflect the change in register IDs.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-9-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Put calculation of the register value width into a dedicated function.
This function will also be used in a following commit.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Link: https://lore.kernel.org/r/20240108205209.838365-8-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Extract the common code that generates a register ID for src_reg before
MOV if needed into a new function. This function will also be used in
a following commit.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-7-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When a range check is performed on a register that was 32-bit spilled to
the stack, the IDs of the two instances of the register are the same, so
the range should also be the same.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-6-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adjust the check in bpf_get_spilled_reg to take into account spilled
registers narrower than 64 bits. That allows find_equal_scalars to
properly adjust the range of all spilled registers that have the same
ID. Before this change, it was possible for a register and a spilled
register to have the same IDs but different ranges if the spill was
narrower than 64 bits and a range check was performed on the register.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-5-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Verify that infinite loop detection logic separates states with
identical register states but different imprecise scalars spilled to
stack.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-4-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Current infinite loops detection mechanism is speculative:
- first, states_maybe_looping() check is done which simply does memcmp
for R1-R10 in current frame;
- second, states_equal(..., exact=false) is called. With exact=false
states_equal() would compare scalars for equality only if in old
state scalar has precision mark.
Such logic might be problematic if compiler makes some unlucky stack
spill/fill decisions. An artificial example of a false positive looks
as follows:
r0 = ... unknown scalar ...
r0 &= 0xff;
*(u64 *)(r10 - 8) = r0;
r0 = 0;
loop:
r0 = *(u64 *)(r10 - 8);
if r0 > 10 goto exit_;
r0 += 1;
*(u64 *)(r10 - 8) = r0;
r0 = 0;
goto loop;
This commit updates call to states_equal to use exact=true, forcing
all scalar comparisons to be exact.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-3-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The u64_offset_to_skb_data test is supposed to make a 64-bit fill, but
instead makes a 16-bit one. Fix the test according to its intention and
update the comments accordingly (umax is no longer 0xffff). The 16-bit
fill is covered by u16_offset_to_skb_data.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-2-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
reviews.llvm.org was LLVM's Phabricator instances for code review. It
has been abandoned in favor of GitHub pull requests. While the majority
of links in the kernel sources still work because of the work Fangrui
has done turning the dynamic Phabricator instance into a static archive,
there are some issues with that work, so preemptively convert all the
links in the kernel sources to point to the commit on GitHub.
Most of the commits have the corresponding differential review link in
the commit message itself so there should not be any loss of fidelity in
the relevant information.
Additionally, fix a typo in the xdpwall.c print ("LLMV" -> "LLVM") while
in the area.
Link: https://discourse.llvm.org/t/update-on-github-pull-requests/71540/172
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240111-bpf-update-llvm-phabricator-links-v2-1-9a7ae976bd64@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Various tests specify extra testing prog_flags when loading BPF
programs, like BPF_F_TEST_RND_HI32, and more recently also
BPF_F_TEST_REG_INVARIANTS. While BPF_F_TEST_RND_HI32 is old enough to
not cause much problem on older kernels, BPF_F_TEST_REG_INVARIANTS is
very fresh and unconditionally specifying it causes selftests to fail on
even slightly outdated kernels.
This breaks libbpf CI test against 4.9 and 5.15 kernels, it can break
some local development (done outside of VM), etc.
To prevent this, and guard against similar problems in the future, do
runtime detection of supported "testing flags", and only provide those
that host kernel recognizes.
Acked-by: Song Liu <song@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240109231738.575844-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The discussion of what the actual conformance groups should be
is still in progress, so this is just part 1 which only uses
"legacy" for deprecated instructions and "basic" for everything
else. Subsequent patches will add more groups as discussion
continues.
Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240108214231.5280-1-dthaler1968@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Fix spelling errors as reported by codespell.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: bpf@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240106065545.16855-1-rdunlap@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add ability to iterate multiple decl_tag types pointed to the same
function argument. Use this to support multiple __arg_xxx tags per
global subprog argument.
We leave btf_find_decl_tag_value() intact, but change its implementation
to use a new btf_find_next_decl_tag() which can be straightforwardly
used to find next BTF type ID of a matching btf_decl_tag type.
btf_prepare_func_args() is switched from btf_find_decl_tag_value() to
btf_find_next_decl_tag() to gain multiple tags per argument support.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add btf_arg_tag flags enum to be able to record multiple tags per
argument. Also streamline pointer argument processing some more.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move scalar arg processing in btf_prepare_func_args() after all pointer
arg processing is done. This makes it easier to do validation. One
example of unintended behavior right now is ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Seeing:
process_subtest:PASS:Can't alloc specs array 0 nsec
... in verbose successful test log is very confusing. Use smaller
identifier-like test tag to denote that we are asserting specs array
allocation success.
Now it's much less distracting:
process_subtest:PASS:specs_alloc 0 nsec
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Hou Tao says:
====================
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark [1]. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. However the performance
gain comes with costs: both the kasan and kcsan checks on the pointer
will be unavailable. Initially the inline is implemented in do_jit() for
x86-64 directly, but I think it will more portable to implement the
inline in verifier.
Patch #1 supports inlining bpf_kptr_xchg() helper and enables it on
x86-4. Patch #2 factors out a helper for newly-added test in patch #3.
Patch #3 tests whether the inlining of bpf_kptr_xchg() is expected.
Please see individual patches for more details. And comments are always
welcome.
Change Log:
v3:
* rebased on bpf-next tree
* patch 1 & 2: Add Rvb-by and Ack-by tags from Eduard
* patch 3: use inline assembly and naked function instead of c code
(suggested by Eduard)
v2: https://lore.kernel.org/bpf/20231223104042.1432300-1-houtao@huaweicloud.com/
* rebased on bpf-next tree
* drop patch #1 in v1 due to discussion in [2]
* patch #1: add the motivation in the commit message, merge patch #1
and #3 into the new patch in v2. (Daniel)
* patch #2/#3: newly-added patch to test the inlining of
bpf_kptr_xchg() (Eduard)
v1: https://lore.kernel.org/bpf/95b8c2cd-44d5-5fe1-60b5-7e8218779566@huaweicloud.com/
[1]: https://lore.kernel.org/bpf/20231221141501.3588586-1-houtao@huaweicloud.com/
[2]: https://lore.kernel.org/bpf/fd94efb9-4a56-c982-dc2e-c66be5202cb7@huaweicloud.com/
====================
Link: https://lore.kernel.org/r/20240105104819.3916743-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The test uses bpf_prog_get_info_by_fd() to obtain the xlated
instructions of the program first. Since these instructions have
already been rewritten by the verifier, the tests then checks whether
the rewritten instructions are as expected. And to ensure LLVM generates
code exactly as expected, use inline assembly and a naked function.
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Both test_verifier and test_progs use get_xlated_program(), so moving
the helper into testing_helpers.h to reuse it.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. The inline also has
downside: both the kasan and kcsan checks on the pointer will be
unavailable.
bpf_kptr_xchg() can be inlined by converting the calling of
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) JIT backend supports atomic_xchg() on pointer-sized word
2) For the specific arch, the implementation of xchg is the same as
atomic_xchg() on pointer-sized words.
It seems most 64-bit JIT backends satisfies these two conditions. But
as a precaution, defining a weak function bpf_jit_supports_ptr_xchg()
to state whether such conversion is safe and only supporting inline for
64-bit host.
For x86-64, it supports BPF_XCHG atomic operation and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enabling the
inline of bpf_kptr_xchg() on x86-64 first.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Eric Dumazet says:
====================
inet_diag: remove three mutexes in diag dumps
Surprisingly, inet_diag operations are serialized over a stack
of three mutexes, giving legacy /proc based files an unfair
advantage on modern hosts.
This series removes all of them, making inet_diag operations
(eg iproute2/ss) fully parallel.
1-2) Two first patches are adding data-race annotations
and can be backported to stable kernels.
3-4) inet_diag_table_mutex can be replaced with RCU protection,
if we add corresponding protection against module unload.
5-7) sock_diag_table_mutex can be replaced with RCU protection,
if we add corresponding protection against module unload.
8) sock_diag_mutex is removed, as the old bug it was
working around has been fixed more elegantly.
9) inet_diag_dump_icsk() can skip over empty buckets to reduce
spinlock contention.
====================
Link: https://lore.kernel.org/r/20240122112603.3270097-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
After the removal of inet_diag_table_mutex, sock_diag_table_mutex
and sock_diag_mutex, I was able so see spinlock contention from
inet_diag_dump_icsk() when running 100 parallel invocations.
It is time to skip over empty buckets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
sock_diag_rcv() is still serializing its operations using
a mutex, for no good reason.
This came with commit 0a9c73014415 ("[INET_DIAG]: Fix oops
in netlink_rcv_skb"), but the root cause has been fixed
with commit cd40b7d3983c ("[NET]: make netlink user -> kernel
interface synchronious")
Remove this mutex to let multiple threads run concurrently.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
TCPDIAG_GETSOCK and DCCPDIAG_GETSOCK diag are serialized
on sock_diag_table_mutex.
This is to make sure inet_diag module is not unloaded
while diag was ongoing.
It is time to get rid of this mutex and use RCU protection,
allowing full parallelism.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
sock_diag_broadcast_destroy_work() and __sock_diag_cmd()
are currently using sock_diag_table_mutex to protect
against concurrent sock_diag_handlers[] changes.
This makes inet_diag dump serialized, thus less scalable
than legacy /proc files.
It is time to switch to full RCU protection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Following patch is going to use RCU instead of
sock_diag_table_mutex acquisition.
This patch is a preparation, no change of behavior yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
inet_diag_lock_handler() current implementation uses a mutex
to protect inet_diag_table[] array against concurrent changes.
This makes inet_diag dump serialized, thus less scalable
than legacy /proc files.
It is time to switch to full RCU protection.
As a bonus, if a target is statically linked instead of being
modular, inet_diag_lock_handler() & inet_diag_unlock_handler()
reduce to reads only.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Following patch is going to use RCU instead of
inet_diag_table_mutex acquisition.
This patch is a preparation, no change of behavior yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
inet_diag_lock_handler() reads inet_diag_table[proto] locklessly.
Use READ_ONCE()/WRITE_ONCE() annotations to avoid potential issues.
Fixes: d523a328fb02 ("[INET]: Fix inet_diag dead-lock regression")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
__sock_diag_cmd() and sock_diag_bind() read sock_diag_handlers[family]
without a lock held.
Use READ_ONCE()/WRITE_ONCE() annotations to avoid potential issues.
Fixes: 8ef874bfc729 ("sock_diag: Move the sock_ code to net/core/")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
Previous releases - regressions:
- Revert "net: rtnetlink: Enslave device before bringing it up",
breaks the case inverse to the one it was trying to fix
- net: dsa: fix oob access in DSA's netdevice event handler
dereference netdev_priv() before check its a DSA port
- sched: track device in tcf_block_get/put_ext() only for clsact
binder types
- net: tls, fix WARNING in __sk_msg_free when record becomes full
during splice and MORE hint set
- sfp-bus: fix SFP mode detect from bitrate
- drv: stmmac: prevent DSA tags from breaking COE
Previous releases - always broken:
- bpf: fix no forward progress in in bpf_iter_udp if output buffer is
too small
- bpf: reject variable offset alu on registers with a type of
PTR_TO_FLOW_KEYS to prevent oob access
- netfilter: tighten input validation
- net: add more sanity check in virtio_net_hdr_to_skb()
- rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid
infinite loop
- amt: do not use the portion of skb->cb area which may get clobbered
- mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option
Misc:
- spring cleanup of inactive maintainers"
* tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
i40e: Include types.h to some headers
ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
selftests: mlxsw: qos_pfc: Remove wrong description
mlxsw: spectrum_router: Register netdevice notifier before nexthop
mlxsw: spectrum_acl_tcam: Fix stack corruption
mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
ethtool: netlink: Add missing ethnl_ops_begin/complete
selftests: bonding: Add more missing config options
selftests: netdevsim: add a config file
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
bpf: extract bpf_ctx_convert_map logic and make it more reusable
libbpf: feature-detect arg:ctx tag support in kernel
ipvs: avoid stat macros calls from preemptible context
netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
netfilter: nf_tables: skip dead set elements in netlink dump
netfilter: nf_tables: do not allow mismatch field size and set key length
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
"This removes the currently unused CLASS_DDC support (controllers set
the flag, but there is no client to use it).
Also, CLASS_SPD support gets simplified to prepare removal in the
future. Class based instantiation is not recommended these days
anyhow.
Furthermore, I2C core now creates a debugfs directory per I2C adapter.
Current bus driver users were converted to use it.
Finally, quite some driver updates. Standing out are patches for the
wmt-driver which is refactored to support more variants.
This is the rebased pull request where a large series for the
designware driver was dropped"
* tag 'i2c-for-6.8-rc1-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (38 commits)
MAINTAINERS: use proper email for my I2C work
i2c: stm32f7: add support for stm32mp25 soc
i2c: stm32f7: perform I2C_ISR read once at beginning of event isr
dt-bindings: i2c: document st,stm32mp25-i2c compatible
i2c: stm32f7: simplify status messages in case of errors
i2c: stm32f7: perform most of irq job in threaded handler
i2c: stm32f7: use dev_err_probe upon calls of devm_request_irq
i2c: i801: Add lis3lv02d for Dell XPS 15 7590
i2c: i801: Add lis3lv02d for Dell Precision 3540
i2c: wmt: Reduce redundant: REG_CR setting
i2c: wmt: Reduce redundant: function parameter
i2c: wmt: Reduce redundant: clock mode setting
i2c: wmt: Reduce redundant: wait event complete
i2c: wmt: Reduce redundant: bus busy check
i2c: mux: reg: Remove class-based device auto-detection support
i2c: make i2c_bus_type const
dt-bindings: at24: add ROHM BR24G04
eeprom: at24: use of_match_ptr()
i2c: cpm: Remove linux,i2c-index conversion from be32
i2c: imx: Make SDA actually optional for bus recovering
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"There are three new drivers this cycle. Also the cmos driver is
getting fixes for longstanding wakeup issues on AMD.
New drivers:
- Analog Devices MAX31335
- Nuvoton ma35d1
- Texas Instrument TPS6594 PMIC RTC
Drivers:
- cmos: use ACPI alarm instead of HPET on recent AMD platforms
- nuvoton: add NCT3015Y-R and NCT3018Y-R support
- rv8803: proper suspend/resume and wakeup-source support"
* tag 'rtc-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (26 commits)
rtc: nuvoton: Compatible with NCT3015Y-R and NCT3018Y-R
rtc: da9063: Use dev_err_probe()
rtc: da9063: Use device_get_match_data()
rtc: da9063: Make IRQ as optional
rtc: max31335: Fix comparison in max31335_volatile_reg()
rtc: max31335: use regmap_update_bits_check
rtc: max31335: remove unecessary locking
rtc: max31335: add driver support
dt-bindings: rtc: max31335: add max31335 bindings
rtc: rv8803: add wakeup-source support
rtc: ac100: remove misuses of kernel-doc
rtc: class: Remove usage of the deprecated ida_simple_xx() API
rtc: MAINTAINERS: drop Alessandro Zummo
rtc: ma35d1: remove hardcoded UIE support
dt-bindings: rtc: qcom-pm8xxx: fix inconsistent example
rtc: rv8803: Add power management support
rtc: ds3232: avoid unused-const-variable warning
rtc: lpc24xx: add missing dependency
rtc: tps6594: Add driver for TPS6594 RTC
rtc: Add driver for Nuvoton ma35d1 rtc controller
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- a new driver for Adafruit Seesaw gamepad device
- Zforce touchscreen will handle standard device properties for axis
swap/inversion
- handling of advanced sensitivity settings in Microchip CAP11xx
capacitive sensor driver
- more drivers have been converted to use newer gpiod API
- support for dedicated wakeup IRQs in gpio-keys dirver
- support for slider gestures and OTP variants in iqs269a driver
- atkbd will report keyboard version as 0xab83 in cases when GET ID
command was skipped (to deal with problematic firmware on newer
laptops), restoring the previous behavior
- other assorted cleanups and changes
* tag 'input-for-v6.8-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (44 commits)
Input: atkbd - use ab83 as id when skipping the getid command
Input: driver for Adafruit Seesaw Gamepad
dt-bindings: input: bindings for Adafruit Seesaw Gamepad
Input: da9063_onkey - avoid explicitly setting input's parent
Input: da9063_onkey - avoid using OF-specific APIs
Input: iqs269a - add support for OTP variants
dt-bindings: input: iqs269a: Add bindings for OTP variants
Input: iqs269a - add support for slider gestures
dt-bindings: input: iqs269a: Add bindings for slider gestures
Input: gpio-keys - filter gpio_keys -EPROBE_DEFER error messages
Input: zforce_ts - accept standard touchscreen properties
dt-bindings: touchscreen: neonode,zforce: Use standard properties
dt-bindings: touchscreen: convert neonode,zforce to json-schema
dt-bindings: input: convert drv266x to json-schema
Input: da9063 - use dev_err_probe()
Input: da9063 - drop redundant prints in probe()
Input: da9063 - simplify obtaining OF match data
Input: as5011 - convert to GPIO descriptor
Input: omap-keypad - drop optional GPIO support
Input: tca6416-keypad - drop unused include
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
Pull phy updates from Vinod Koul:
"New Support:
- Qualcomm SM8650 UFS, PCIe and USB/DP Combo PHY, eUSB2 PHY, SDX75
USB3, X1E80100 USB3 support
- Mediatek MT8195 support
- Rockchip RK3128 usb2 support
- TI SGMII mode for J784S4
Updates:
- Qualcomm v7 register offsets updates
- Mediatek tphy support for force phy mode switch"
* tag 'phy-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: (34 commits)
phy: ti: j721e-wiz: Add SGMII support in WIZ driver for J784S4
phy: ti: gmii-sel: Enable SGMII mode for J784S4
phy: qcom-qmp-usb: Add Qualcomm X1E80100 USB3 PHY support
dt-bindings: phy: qcom,sc8280xp-qmp-usb3-uni: Add X1E80100 USB PHY binding
phy: qcom-qmp-combo: Add x1e80100 USB/DP combo phys
dt-bindings: phy: qcom,sc8280xp-qmp-usb43dp-phy: Document X1E80100 compatible
dt-bindings: phy: qcom: snps-eusb2: Document the X1E80100 compatible
phy: mediatek: tphy: add support force phy mode switch
dt-bindings: phy: mediatek: tphy: add a property for force-mode switch
phy: phy-can-transceiver: insert space after include
phy: qualcomm: phy-qcom-qmp-ufs: Rectify SM8550 UFS HS-G4 PHY Settings
dt-bindings: phy: qcom,sc8280xp-qmp-usb43dp-phy: fix path to header
phy: renesas: phy-rcar-gen2: use select for GENERIC_PHY
phy: qcom-qmp: qserdes-txrx: Add v7 register offsets
phy: qcom-qmp: qserdes-txrx: Add V6 N4 register offsets
phy: qcom-qmp: qserdes-com: Add v7 register offsets
phy: qcom-qmp: pcs-usb: Add v7 register offsets
phy: qcom-qmp: pcs: Add v7 register offsets
phy: qcom-qmp: qserdes-txrx: Add some more v6.20 register offsets
phy: qcom-qmp: qserdes-com: Add some more v6 register offsets
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire
Pull soundwire updates from Vinod Koul:
- Core: add concept of controller_id to deal with clear Controller /
Manager hierarchy
- bunch of qcom driver refactoring for qcom_swrm_stream_alloc_ports(),
qcom_swrm_stream_alloc_ports() and setting controller id to hw master
id
* tag 'soundwire-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
soundwire: amd: drop bus freq calculation and set 'max_clk_freq'
soundwire: generic_bandwidth_allocation use bus->params.max_dr_freq
soundwire: qcom: set controller id to hw master id
soundwire: fix initializing sysfs for same devices on different buses
soundwire: bus: introduce controller_id
soundwire: stream: constify sdw_port_config when adding devices
soundwire: qcom: move sconfig in qcom_swrm_stream_alloc_ports() out of critical section
soundwire: qcom: drop unneeded qcom_swrm_stream_alloc_ports() cleanup
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"Apart from some regular driver fixes there's a relatively big revert
of the locking changes that were introduced to GPIOLIB in this merge
window.
This is because it turned out that some legacy GPIO interfaces - that
need to translate a number from the global GPIO numberspace to the
address of the relevant descriptor, thus running a GPIO device lookup
and taking the GPIO device list lock - are still used in old code from
atomic context resulting in "scheduling while atomic" errors.
I'll try to make the read-only part of the list access entirely
lockless using SRCU but this will take some time so let's go back to
the old global spinlock for now.
Summary:
- revert the changes aiming to use a read-write semaphore to protect
the list of GPIO devices due to calls to legacy API taking that
lock from atomic context in old code
- fix inverted logic in DEFINE_FREE() for GPIO device references
- check the return value of bgpio_init() in gpio-mlxbf3
- fix node address in the DT bindings example for gpio-xilinx
- fix signedness bug in gpio-rtd
- fix kernel-doc warnings in gpio-en7523"
* tag 'gpio-fixes-for-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: revert the attempt to protect the GPIO device list with an rwsem
gpio: EN7523: fix kernel-doc warnings
gpiolib: Fix scope-based gpio_device refcounting
gpio: mlxbf3: add an error code check in mlxbf3_gpio_probe
dt-bindings: gpio: xilinx: Fix node address in gpio
gpio: rtd: Fix signedness bug in probe
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/ukleinek/linux
Pull pwm fixes from Uwe Kleine-König:
- fix a duplicate cleanup in an error path introduced in
this merge window
- fix an out-of-bounds access
In practise it doesn't happen - otherwise someone would have noticed
since v5.17-rc1 I guess - because the device tree binding for the two
drivers using of_pwm_single_xlate() only have args->args_count == 1.
A device-tree that doesn't conform to the respective bindings could
trigger that easily however.
- correct the request callback of the jz4740 pwm driver which used
dev_err_probe() long after .probe() completed.
This is conceptually wrong because dev_err_probe() might call
device_set_deferred_probe_reason() which is nonsensical after the
driver is bound.
* tag 'pwm/for-6.8-2' of gitolite.kernel.org:pub/scm/linux/kernel/git/ukleinek/linux:
pwm: jz4740: Don't use dev_err_probe() in .request()
pwm: Fix out-of-bounds access in of_pwm_single_xlate()
pwm: bcm2835: Remove duplicate call to clk_rate_exclusive_put()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight
Pull backlight updates from Lee Jones:
"New Drivers:
- Add support for Monolithic Power Systems MP3309C WLED Step-up Converter
Fix-ups:
- Use/convert to new/better APIs/helpers/MACROs instead of
hand-rolling implementations
- Device Tree Binding updates
- Demote non-kerneldoc header comments
- Improve error handling; return proper error values, simplify, avoid
duplicates, etc
- Convert over to the new (kinda) GPIOD API
Bug Fixes:
- Fix uninitialised local variable"
* tag 'backlight-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
backlight: hx8357: Convert to agnostic GPIO API
backlight: ili922x: Add an error code check in ili922x_write()
backlight: ili922x: Drop kernel-doc for local macros
backlight: mp3309c: Fix uninitialized local variable
backlight: pwm_bl: Use dev_err_probe
backlight: mp3309c: Add support for MPS MP3309C
dt-bindings: backlight: mp3309c: Remove two required properties
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix kerneldoc warnings (Randy Dunlap)
- better bounds checking in swiotlb (ZhangPeng)
* tag 'dma-mapping-6.8-2024-01-18' of git://git.infradead.org/users/hch/dma-mapping:
dma-debug: fix kernel-doc warnings
swiotlb: check alloc_size before the allocation of a new memory pool
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock update from Mike Rapoport:
"Code readability improvement.
Use NUMA_NO_NODE instead of -1 as return value of
memblock_search_pfn_nid() to improve code readability
and consistency with the callers of that function"
* tag 'memblock-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: Return NUMA_NO_NODE instead of -1 to improve code readability
|