diff options
author | David S. Miller <davem@davemloft.net> | 2021-02-16 13:14:06 -0800 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2021-02-16 13:14:06 -0800 |
commit | b8af417e4d93caeefb89bbfbd56ec95dedd8dab5 (patch) | |
tree | 1c8d22e1aec330238830a43cc8aee0cf768ae1c7 | |
parent | 9ec5eea5b6acfae7279203097eeec5d02d01d9b7 (diff) | |
parent | 45159b27637b0fef6d5ddb86fc7c46b13c77960f (diff) |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-02-16
The following pull-request contains BPF updates for your *net-next* tree.
There's a small merge conflict between 7eeba1706eba ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f8161 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:
[...]
lock_sock(sk);
err = tcp_zerocopy_receive(sk, &zc, &tss);
err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
&zc, &len, err);
release_sock(sk);
[...]
We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).
The main changes are:
1) Adds support of pointers to types with known size among global function
args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.
2) Add bpf_iter for task_vma which can be used to generate information similar
to /proc/pid/maps, from Song Liu.
3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
rewriting bind user ports from BPF side below the ip_unprivileged_port_start
range, both from Stanislav Fomichev.
4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
as well as per-cpu maps for the latter, from Alexei Starovoitov.
5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
for sleepable programs, both from KP Singh.
6) Extend verifier to enable variable offset read/write access to the BPF
program stack, from Andrei Matei.
7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
query device MTU from programs, from Jesper Dangaard Brouer.
8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
tracing programs, from Florent Revest.
9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
otherwise too many passes are required, from Gary Lin.
10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
verification both related to zero-extension handling, from Ilya Leoshkevich.
11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.
12) Batch of AF_XDP selftest cleanups and small performance improvement
for libbpf's xsk map redirect for newer kernels, from Björn Töpel.
13) Follow-up BPF doc and verifier improvements around atomics with
BPF_FETCH, from Brendan Jackman.
14) Permit zero-sized data sections e.g. if ELF .rodata section contains
read-only data from local variables, from Yonghong Song.
15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
156 files changed, 5662 insertions, 1489 deletions
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst index 2df7b067ab93..0e15f9b05c9d 100644 --- a/Documentation/bpf/bpf_design_QA.rst +++ b/Documentation/bpf/bpf_design_QA.rst @@ -208,6 +208,12 @@ data structures and compile with kernel internal headers. Both of these kernel internals are subject to change and can break with newer kernels such that the program needs to be adapted accordingly. +Q: Are tracepoints part of the stable ABI? +------------------------------------------ +A: NO. Tracepoints are tied to internal implementation details hence they are +subject to change and can break with newer kernels. BPF programs need to change +accordingly when this happens. + Q: How much stack space a BPF program uses? ------------------------------------------- A: Currently all program types are limited to 512 bytes of stack diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst index 5b613d2a5f1a..2ed89abbf9a4 100644 --- a/Documentation/bpf/bpf_devel_QA.rst +++ b/Documentation/bpf/bpf_devel_QA.rst @@ -501,16 +501,19 @@ All LLVM releases can be found at: http://releases.llvm.org/ Q: Got it, so how do I build LLVM manually anyway? -------------------------------------------------- -A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have -that set up, proceed with building the latest LLVM and clang version +A: We recommend that developers who want the fastest incremental builds +use the Ninja build system, you can find it in your system's package +manager, usually the package is ninja or ninja-build. + +You need ninja, cmake and gcc-c++ as build requisites for LLVM. Once you +have that set up, proceed with building the latest LLVM and clang version from the git repositories:: $ git clone https://github.com/llvm/llvm-project.git - $ mkdir -p llvm-project/llvm/build/install + $ mkdir -p llvm-project/llvm/build $ cd llvm-project/llvm/build $ cmake .. -G "Ninja" -DLLVM_TARGETS_TO_BUILD="BPF;X86" \ -DLLVM_ENABLE_PROJECTS="clang" \ - -DBUILD_SHARED_LIBS=OFF \ -DCMAKE_BUILD_TYPE=Release \ -DLLVM_BUILD_RUNTIME=OFF $ ninja diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst index f6d8f90e9a56..251c6bd73d15 100644 --- a/Documentation/networking/filter.rst +++ b/Documentation/networking/filter.rst @@ -1048,12 +1048,12 @@ Unlike classic BPF instruction set, eBPF has generic load/store operations:: Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. It also includes atomic operations, which use the immediate field for extra -encoding. +encoding:: .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg -The basic atomic operations supported are: +The basic atomic operations supported are:: BPF_ADD BPF_AND @@ -1066,33 +1066,35 @@ memory location addresed by ``dst_reg + off`` is atomically modified, with immediate, then these operations also overwrite ``src_reg`` with the value that was in memory before it was modified. -The more special operations are: +The more special operations are:: BPF_XCHG This atomically exchanges ``src_reg`` with the value addressed by ``dst_reg + -off``. +off``. :: BPF_CMPXCHG This atomically compares the value addressed by ``dst_reg + off`` with -``R0``. If they match it is replaced with ``src_reg``, The value that was there -before is loaded back to ``R0``. +``R0``. If they match it is replaced with ``src_reg``. In either case, the +value that was there before is zero-extended and loaded back to ``R0``. Note that 1 and 2 byte atomic operations are not supported. -Except ``BPF_ADD`` _without_ ``BPF_FETCH`` (for legacy reasons), all 4 byte -atomic operations require alu32 mode. Clang enables this mode by default in -architecture v3 (``-mcpu=v3``). For older versions it can be enabled with +Clang can generate atomic instructions by default when ``-mcpu=v3`` is +enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction +Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable +the atomics features, while keeping a lower ``-mcpu`` version, you can use ``-Xclang -target-feature -Xclang +alu32``. -You may encounter BPF_XADD - this is a legacy name for BPF_ATOMIC, referring to -the exclusive-add operation encoded when the immediate field is zero. +You may encounter ``BPF_XADD`` - this is a legacy name for ``BPF_ATOMIC``, +referring to the exclusive-add operation encoded when the immediate field is +zero. -eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists +eBPF has one 16-byte instruction: ``BPF_LD | BPF_DW | BPF_IMM`` which consists of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single instruction that loads 64-bit immediate value into a dst_reg. -Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads +Classic BPF has similar instruction: ``BPF_LD | BPF_W | BPF_IMM`` which loads 32-bit immediate value into a register. eBPF verifier @@ -1082,6 +1082,17 @@ ifdef CONFIG_STACK_VALIDATION endif endif +PHONY += resolve_btfids_clean + +resolve_btfids_O = $(abspath $(objtree))/tools/bpf/resolve_btfids + +# tools/bpf/resolve_btfids directory might not exist +# in output directory, skip its clean in that case +resolve_btfids_clean: +ifneq ($(wildcard $(resolve_btfids_O)),) + $(Q)$(MAKE) -sC $(srctree)/tools/bpf/resolve_btfids O=$(resolve_btfids_O) clean +endif + ifdef CONFIG_BPF ifdef CONFIG_DEBUG_INFO_BTF ifeq ($(has_libelf),1) @@ -1491,7 +1502,7 @@ vmlinuxclean: $(Q)$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean $(Q)$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) clean) -clean: archclean vmlinuxclean +clean: archclean vmlinuxclean resolve_btfids_clean # mrproper - Delete all generated files, including .config # diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 1d4d50199293..79e7a0ec1da5 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -869,8 +869,31 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt, } } +static int emit_nops(u8 **pprog, int len) +{ + u8 *prog = *pprog; + int i, noplen, cnt = 0; + + while (len > 0) { + noplen = len; + + if (noplen > ASM_NOP_MAX) + noplen = ASM_NOP_MAX; + + for (i = 0; i < noplen; i++) + EMIT1(ideal_nops[noplen][i]); + len -= noplen; + } + + *pprog = prog; + + return cnt; +} + +#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp))) + static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, - int oldproglen, struct jit_context *ctx) + int oldproglen, struct jit_context *ctx, bool jmp_padding) { bool tail_call_reachable = bpf_prog->aux->tail_call_reachable; struct bpf_insn *insn = bpf_prog->insnsi; @@ -880,7 +903,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, bool seen_exit = false; u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY]; int i, cnt = 0, excnt = 0; - int proglen = 0; + int ilen, proglen = 0; u8 *prog = temp; int err; @@ -894,17 +917,24 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, bpf_prog_was_classic(bpf_prog), tail_call_reachable, bpf_prog->aux->func_idx != 0); push_callee_regs(&prog, callee_regs_used); - addrs[0] = prog - temp; + + ilen = prog - temp; + if (image) + memcpy(image + proglen, temp, ilen); + proglen += ilen; + addrs[0] = proglen; + prog = temp; for (i = 1; i <= insn_cnt; i++, insn++) { const s32 imm32 = insn->imm; u32 dst_reg = insn->dst_reg; u32 src_reg = insn->src_reg; u8 b2 = 0, b3 = 0; + u8 *start_of_ldx; s64 jmp_offset; u8 jmp_cond; - int ilen; u8 *func; + int nops; switch (insn->code) { /* ALU */ @@ -1249,12 +1279,30 @@ st: if (is_imm8(insn->off)) case BPF_LDX | BPF_PROBE_MEM | BPF_W: case BPF_LDX | BPF_MEM | BPF_DW: case BPF_LDX | BPF_PROBE_MEM | BPF_DW: + if (BPF_MODE(insn->code) == BPF_PROBE_MEM) { + /* test src_reg, src_reg */ + maybe_emit_mod(&prog, src_reg, src_reg, true); /* always 1 byte */ + EMIT2(0x85, add_2reg(0xC0, src_reg, src_reg)); + /* jne start_of_ldx */ + EMIT2(X86_JNE, 0); + /* xor dst_reg, dst_reg */ + emit_mov_imm32(&prog, false, dst_reg, 0); + /* jmp byte_after_ldx */ + EMIT2(0xEB, 0); + + /* populate jmp_offset for JNE above */ + temp[4] = prog - temp - 5 /* sizeof(test + jne) */; + start_of_ldx = prog; + } emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off); if (BPF_MODE(insn->code) == BPF_PROBE_MEM) { struct exception_table_entry *ex; u8 *_insn = image + proglen; s64 delta; + /* populate jmp_offset for JMP above */ + start_of_ldx[-1] = prog - start_of_ldx; + if (!bpf_prog->aux->extable) break; @@ -1502,6 +1550,30 @@ emit_cond_jmp: /* Convert BPF opcode to x86 */ } jmp_offset = addrs[i + insn->off] - addrs[i]; if (is_imm8(jmp_offset)) { + if (jmp_padding) { + /* To keep the jmp_offset valid, the extra bytes are + * padded before the jump insn, so we substract the + * 2 bytes of jmp_cond insn from INSN_SZ_DIFF. + * + * If the previous pass already emits an imm8 + * jmp_cond, then this BPF insn won't shrink, so + * "nops" is 0. + * + * On the other hand, if the previous pass emits an + * imm32 jmp_cond, the extra 4 bytes(*) is padded to + * keep the image from shrinking further. + * + * (*) imm32 jmp_cond is 6 bytes, and imm8 jmp_cond + * is 2 bytes, so the size difference is 4 bytes. + */ + nops = INSN_SZ_DIFF - 2; + if (nops != 0 && nops != 4) { + pr_err("unexpected jmp_cond padding: %d bytes\n", + nops); + return -EFAULT; + } + cnt += emit_nops(&prog, nops); + } EMIT2(jmp_cond, jmp_offset); } else if (is_simm32(jmp_offset)) { EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset); @@ -1524,11 +1596,55 @@ emit_cond_jmp: /* Convert BPF opcode to x86 */ else jmp_offset = addrs[i + insn->off] - addrs[i]; - if (!jmp_offset) - /* Optimize out nop jumps */ + if (!jmp_offset) { + /* + * If jmp_padding is enabled, the extra nops will + * be inserted. Otherwise, optimize out nop jumps. + */ + if (jmp_padding) { + /* There are 3 possible conditions. + * (1) This BPF_JA is already optimized out in + * the previous run, so there is no need + * to pad any extra byte (0 byte). + * (2) The previous pass emits an imm8 jmp, + * so we pad 2 bytes to match the previous + * insn size. + * (3) Similarly, the previous pass emits an + * imm32 jmp, and 5 bytes is padded. + */ + nops = INSN_SZ_DIFF; + if (nops != 0 && nops != 2 && nops != 5) { + pr_err("unexpected nop jump padding: %d bytes\n", + nops); + return -EFAULT; + } + cnt += emit_nops(&prog, nops); + } break; + } emit_jmp: if (is_imm8(jmp_offset)) { + if (jmp_padding) { + /* To avoid breaking jmp_offset, the extra bytes + * are padded before the actual jmp insn, so + * 2 bytes is substracted from INSN_SZ_DIFF. + * + * If the previous pass already emits an imm8 + * jmp, there is nothing to pad (0 byte). + * + * If it emits an imm32 jmp (5 bytes) previously + * and now an imm8 jmp (2 bytes), then we pad + * (5 - 2 = 3) bytes to stop the image from + * shrinking further. + */ + nops = INSN_SZ_DIFF - 2; + if (nops != 0 && nops != 3) { + pr_err("unexpected jump padding: %d bytes\n", + nops); + return -EFAULT; + } + cnt += emit_nops(&prog, INSN_SZ_DIFF - 2); + } EMIT2(0xEB, jmp_offset); } else if (is_simm32(jmp_offset)) { EMIT1_off32(0xE9, jmp_offset); @@ -1624,17 +1740,25 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, struct bpf_prog *p, int stack_size, bool mod_ret) { u8 *prog = *pprog; + u8 *jmp_insn; int cnt = 0; - if (p->aux->sleepable) { - if (emit_call(&prog, __bpf_prog_enter_sleepable, prog)) - return -EINVAL; - } else { - if (emit_call(&prog, __bpf_prog_enter, prog)) + /* arg1: mov rdi, progs[i] */ + emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, (u32) (long) p); + if (emit_call(&prog, + p->aux->sleepable ? __bpf_prog_enter_sleepable : + __bpf_prog_enter, prog)) return -EINVAL; - /* remember prog start time returned by __bpf_prog_enter */ - emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); - } + /* remember prog start time returned by __bpf_prog_enter */ + emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); + + /* if (__bpf_prog_enter*(prog) == 0) + * goto skip_exec_of_prog; + */ + EMIT3(0x48, 0x85, 0xC0); /* test rax,rax */ + /* emit 2 nops that will be replaced with JE insn */ + jmp_insn = prog; + emit_nops(&prog, 2); /* arg1: lea rdi, [rbp - stack_size] */ EMIT4(0x48, 0x8D, 0x7D, -stack_size); @@ -1654,43 +1778,23 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, if (mod_ret) emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); - if (p->aux->sleepable) { - if (emit_call(&prog, __bpf_prog_exit_sleepable, prog)) - return -EINVAL; - } else { - /* arg1: mov rdi, progs[i] */ - emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, - (u32) (long) p); - /* arg2: mov rsi, rbx <- start time in nsec */ - emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); - if (emit_call(&prog, __bpf_prog_exit, prog)) + /* replace 2 nops with JE insn, since jmp target is known */ + jmp_insn[0] = X86_JE; + jmp_insn[1] = prog - jmp_insn - 2; + + /* arg1: mov rdi, progs[i] */ + emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, (u32) (long) p); + /* arg2: mov rsi, rbx <- start time in nsec */ + emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); + if (emit_call(&prog, + p->aux->sleepable ? __bpf_prog_exit_sleepable : + __bpf_prog_exit, prog)) return -EINVAL; - } *pprog = prog; return 0; } -static void emit_nops(u8 **pprog, unsigned int len) -{ - unsigned int i, noplen; - u8 *prog = *pprog; - int cnt = 0; - - while (len > 0) { - noplen = len; - - if (noplen > ASM_NOP_MAX) - noplen = ASM_NOP_MAX; - - for (i = 0; i < noplen; i++) - EMIT1(ideal_nops[noplen][i]); - len -= noplen; - } - - *pprog = prog; -} - static void emit_align(u8 **pprog, u32 align) { u8 *target, *prog = *pprog; @@ -2065,6 +2169,9 @@ struct x64_jit_data { struct jit_context ctx; }; +#define MAX_PASSES 20 +#define PADDING_PASSES (MAX_PASSES - 5) + struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { struct bpf_binary_header *header = NULL; @@ -2074,6 +2181,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) struct jit_context ctx = {}; bool tmp_blinded = false; bool extra_pass = false; + bool padding = false; u8 *image = NULL; int *addrs; int pass; @@ -2110,6 +2218,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) image = jit_data->image; header = jit_data->header; extra_pass = true; + padding = true; goto skip_init_addrs; } addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL); @@ -2135,8 +2244,10 @@ skip_init_addrs: * may converge on the last pass. In such case do one more * pass to emit the final image. */ - for (pass = 0; pass < 20 || image; pass++) { - proglen = do_jit(prog, addrs, image, oldproglen, &ctx); + for (pass = 0; pass < MAX_PASSES || image; pass++) { + if (!padding && pass >= PADDING_PASSES) + padding = true; + proglen = do_jit(prog, addrs, image, oldproglen, &ctx, padding); if (proglen <= 0) { out_image: image = NULL; diff --git a/drivers/net/veth.c b/drivers/net/veth.c index 99caae7d1641..aa1a66ad2ce5 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -35,6 +35,7 @@ #define VETH_XDP_HEADROOM (XDP_PACKET_HEADROOM + NET_IP_ALIGN) #define VETH_XDP_TX_BULK_SIZE 16 +#define VETH_XDP_BATCH 16 struct veth_stats { u64 rx_drops; @@ -562,20 +563,13 @@ static int veth_xdp_tx(struct veth_rq *rq, struct xdp_buff *xdp, return 0; } -static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, - struct xdp_frame *frame, - struct veth_xdp_tx_bq *bq, - struct veth_stats *stats) +static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq, + struct xdp_frame *frame, + struct veth_xdp_tx_bq *bq, + struct veth_stats *stats) { - void *hard_start = frame->data - frame->headroom; - int len = frame->len, delta = 0; struct xdp_frame orig_frame; struct bpf_prog *xdp_prog; - unsigned int headroom; - struct sk_buff *skb; - - /* bpf_xdp_adjust_head() assures BPF cannot access xdp_frame area */ - hard_start -= sizeof(struct xdp_frame); rcu_read_lock(); xdp_prog = rcu_dereference(rq->xdp_prog); @@ -590,8 +584,8 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, switch (act) { case XDP_PASS: - delta = frame->data - xdp.data; - len = xdp.data_end - xdp.data; + if (xdp_update_frame_from_buff(&xdp, frame)) + goto err_xdp; break; case XDP_TX: orig_frame = *frame; @@ -629,19 +623,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, } rcu_read_unlock(); - headroom = sizeof(struct xdp_frame) + frame->headroom - delta; - skb = veth_build_skb(hard_start, headroom, len, frame->frame_sz); - if (!skb) { - xdp_return_frame(frame); - stats->rx_drops++; - goto err; - } - - xdp_release_frame(frame); - xdp_scrub_frame(frame); - skb->protocol = eth_type_trans(skb, rq->dev); -err: - return skb; + return frame; err_xdp: rcu_read_unlock(); xdp_return_frame(frame); @@ -649,6 +631,37 @@ xdp_xmit: return NULL; } +/* frames array contains VETH_XDP_BATCH at most */ +static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames, + int n_xdpf, struct veth_xdp_tx_bq *bq, + struct veth_stats *stats) +{ + void *skbs[VETH_XDP_BATCH]; + int i; + + if (xdp_alloc_skb_bulk(skbs, n_xdpf, + GFP_ATOMIC | __GFP_ZERO) < 0) { + for (i = 0; i < n_xdpf; i++) + xdp_return_frame(frames[i]); + stats->rx_drops += n_xdpf; + + return; + } + + for (i = 0; i < n_xdpf; i++) { + struct sk_buff *skb = skbs[i]; + + skb = __xdp_build_skb_from_frame(frames[i], skb, + rq->dev); + if (!skb) { + xdp_return_frame(frames[i]); + stats->rx_drops++; + continue; + } + napi_gro_receive(&rq->xdp_napi, skb); + } +} + static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, struct veth_xdp_tx_bq *bq, @@ -796,32 +809,45 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, struct veth_xdp_tx_bq *bq, struct veth_stats *stats) { - int i, done = 0; + int i, done = 0, n_xdpf = 0; + void *xdpf[VETH_XDP_BATCH]; for (i = 0; i < budget; i++) { void *ptr = __ptr_ring_consume(&rq->xdp_ring); - struct sk_buff *skb; if (!ptr) break; if (veth_is_xdp_frame(ptr)) { + /* ndo_xdp_xmit */ struct xdp_frame *frame = veth_ptr_to_xdp(ptr); stats->xdp_bytes += frame->len; - skb = veth_xdp_rcv_one(rq, frame, bq, stats); + frame = veth_xdp_rcv_one(rq, frame, bq, stats); + if (frame) { + /* XDP_PASS */ + xdpf[n_xdpf++] = frame; + if (n_xdpf == VETH_XDP_BATCH) { + veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, + bq, stats); + n_xdpf = 0; + } + } } else { - skb = ptr; + /* ndo_start_xmit */ + struct sk_buff *skb = ptr; + stats->xdp_bytes += skb->len; skb = veth_xdp_rcv_skb(rq, skb, bq, stats); + if (skb) + napi_gro_receive(&rq->xdp_napi, skb); } - - if (skb) - napi_gro_receive(&rq->xdp_napi, skb); - done++; } + if (n_xdpf) + veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats); + u64_stats_update_begin(&rq->stats.syncp); rq->stats.vs.xdp_redirect += stats->xdp_redirect; rq->stats.vs.xdp_bytes += stats->xdp_bytes; diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 72e69a0e1e8c..c42e02b4d84b 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -23,8 +23,8 @@ struct ctl_table_header; #ifdef CONFIG_CGROUP_BPF -extern struct static_key_false cgroup_bpf_enabled_key; -#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key) +extern struct static_key_false cgroup_bpf_enabled_key[MAX_BPF_ATTACH_TYPE]; +#define cgroup_bpf_enabled(type) static_branch_unlikely(&cgroup_bpf_enabled_key[type]) DECLARE_PER_CPU(struct bpf_cgroup_storage*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); @@ -125,7 +125,8 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk, int __cgroup_bpf_run_filter_sock_addr(struct sock *sk, struct sockaddr *uaddr, enum bpf_attach_type type, - void *t_ctx); + void *t_ctx, + u32 *flags); int __cgroup_bpf_run_filter_sock_ops(struct sock *sk, struct bpf_sock_ops_kern *sock_ops, @@ -147,6 +148,10 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, int __user *optlen, int max_optlen, int retval); +int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level, + int optname, void *optval, + int *optlen, int retval); + static inline enum bpf_cgroup_storage_type cgroup_storage_type( struct bpf_map *map) { @@ -185,7 +190,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled) \ + if (cgroup_bpf_enabled(BPF_CGROUP_INET_INGRESS)) \ __ret = __cgroup_bpf_run_filter_skb(sk, skb, \ BPF_CGROUP_INET_INGRESS); \ \ @@ -195,7 +200,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled && sk && sk == skb->sk) { \ + if (cgroup_bpf_enabled(BPF_CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \ typeof(sk) __sk = sk_to_full_sk(sk); \ if (sk_fullsock(__sk)) \ __ret = __cgroup_bpf_run_filter_skb(__sk, skb, \ @@ -207,7 +212,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_SK_PROG(sk, type) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled) { \ + if (cgroup_bpf_enabled(type)) { \ __ret = __cgroup_bpf_run_filter_sk(sk, type); \ } \ __ret; \ @@ -227,33 +232,53 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type) \ ({ \ + u32 __unused_flags; \ int __ret = 0; \ - if (cgroup_bpf_enabled) \ + if (cgroup_bpf_enabled(type)) \ __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type, \ - NULL); \ + NULL, \ + &__unused_flags); \ __ret; \ }) #define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type, t_ctx) \ ({ \ + u32 __unused_flags; \ int __ret = 0; \ - if (cgroup_bpf_enabled) { \ + if (cgroup_bpf_enabled(type)) { \ lock_sock(sk); \ __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type, \ - t_ctx); \ + t_ctx, \ + &__unused_flags); \ release_sock(sk); \ } \ __ret; \ }) -#define BPF_CGROUP_RUN_PROG_INET4_BIND_LOCK(sk, uaddr) \ - BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET4_BIND, NULL) - -#define BPF_CGROUP_RUN_PROG_INET6_BIND_LOCK(sk, uaddr) \ - BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET6_BIND, NULL) +/* BPF_CGROUP_INET4_BIND and BPF_CGROUP_INET6_BIND can return extra flags + * via upper bits of return code. The only flag that is supported + * (at bit position 0) is to indicate CAP_NET_BIND_SERVICE capability check + * should be bypassed (BPF_RET_BIND_NO_CAP_NET_BIND_SERVICE). + */ +#define BPF_CGROUP_RUN_PROG_INET_BIND_LOCK(sk, uaddr, type, bind_flags) \ +({ \ + u32 __flags = 0; \ + int __ret = 0; \ + if (cgroup_bpf_enabled(type)) { \ + lock_sock(sk); \ + __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type, \ + NULL, &__flags); \ + release_sock(sk); \ + if (__flags & BPF_RET_BIND_NO_CAP_NET_BIND_SERVICE) \ + *bind_flags |= BIND_NO_CAP_NET_BIND_SERVICE; \ + } \ + __ret; \ +}) -#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (cgroup_bpf_enabled && \ - sk->sk_prot->pre_connect) +#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) \ + ((cgroup_bpf_enabled(BPF_CGROUP_INET4_CONNECT) || \ + cgroup_bpf_enabled(BPF_CGROUP_INET6_CONNECT)) && \ + (sk)->sk_prot->pre_connect) #define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) \ BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_CONNECT) @@ -297,7 +322,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled) \ + if (cgroup_bpf_enabled(BPF_CGROUP_SOCK_OPS)) \ __ret = __cgroup_bpf_run_filter_sock_ops(sk, \ sock_ops, \ BPF_CGROUP_SOCK_OPS); \ @@ -307,7 +332,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled && (sock_ops)->sk) { \ + if (cgroup_bpf_enabled(BPF_CGROUP_SOCK_OPS) && (sock_ops)->sk) { \ typeof(sk) __sk = sk_to_full_sk((sock_ops)->sk); \ if (__sk && sk_fullsock(__sk)) \ __ret = __cgroup_bpf_run_filter_sock_ops(__sk, \ @@ -320,7 +345,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type, major, minor, access) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled) \ + if (cgroup_bpf_enabled(BPF_CGROUP_DEVICE)) \ __ret = __cgroup_bpf_check_dev_permission(type, major, minor, \ access, \ BPF_CGROUP_DEVICE); \ @@ -332,7 +357,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, count, pos) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled) \ + if (cgroup_bpf_enabled(BPF_CGROUP_SYSCTL)) \ __ret = __cgroup_bpf_run_filter_sysctl(head, table, write, \ buf, count, pos, \ BPF_CGROUP_SYSCTL); \ @@ -343,7 +368,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, kernel_optval) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled) \ + if (cgroup_bpf_enabled(BPF_CGROUP_SETSOCKOPT)) \ __ret = __cgroup_bpf_run_filter_setsockopt(sock, level, \ optname, optval, \ optlen, \ @@ -354,7 +379,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled) \ + if (cgroup_bpf_enabled(BPF_CGROUP_GETSOCKOPT)) \ get_user(__ret, optlen); \ __ret; \ }) @@ -363,11 +388,24 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, max_optlen, retval) \ ({ \ int __ret = retval; \ - if (cgroup_bpf_enabled) \ - __ret = __cgroup_bpf_run_filter_getsockopt(sock, level, \ - optname, optval, \ - optlen, max_optlen, \ - retval); \ + if (cgroup_bpf_enabled(BPF_CGROUP_GETSOCKOPT)) \ + if (!(sock)->sk_prot->bpf_bypass_getsockopt || \ + !INDIRECT_CALL_INET_1((sock)->sk_prot->bpf_bypass_getsockopt, \ + tcp_bpf_bypass_getsockopt, \ + level, optname)) \ + __ret = __cgroup_bpf_run_filter_getsockopt( \ + sock, level, optname, optval, optlen, \ + max_optlen, retval); \ + __ret; \ +}) + +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sock, level, optname, optval, \ + optlen, retval) \ +({ \ + int __ret = retval; \ + if (cgroup_bpf_enabled(BPF_CGROUP_GETSOCKOPT)) \ + __ret = __cgroup_bpf_run_filter_getsockopt_kern( \ + sock, level, optname, optval, optlen, retval); \ __ret; \ }) @@ -427,15 +465,14 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map, return 0; } -#define cgroup_bpf_enabled (0) +#define cgroup_bpf_enabled(type) (0) #define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type, t_ctx) ({ 0; }) #define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0) #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; }) #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; }) #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; }) #define BPF_CGROUP_RUN_PROG_INET_SOCK_RELEASE(sk) ({ 0; }) -#define BPF_CGROUP_RUN_PROG_INET4_BIND_LOCK(sk, uaddr) ({ 0; }) -#define BPF_CGROUP_RUN_PROG_INET6_BIND_LOCK(sk, uaddr) ({ 0; }) +#define BPF_CGROUP_RUN_PROG_INET_BIND_LOCK(sk, uaddr, type, flags) ({ 0; }) #define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk) ({ 0; }) #define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk) ({ 0; }) #define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) ({ 0; }) @@ -452,6 +489,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map, #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) ({ 0; }) #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, \ optlen, max_optlen, retval) ({ retval; }) +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sock, level, optname, optval, \ + optlen, retval) ({ retval; }) #define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \ kernel_optval) ({ 0; }) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 1aac2af12fed..cccaef1088ea 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -14,7 +14,6 @@ #include <linux/numa.h> #include <linux/mm_types.h> #include <linux/wait.h> -#include <linux/u64_stats_sync.h> #include <linux/refcount.h> #include <linux/mutex.h> #include <linux/module.h> @@ -507,12 +506,6 @@ enum bpf_cgroup_storage_type { */ #define MAX_BPF_FUNC_ARGS 12 -struct bpf_prog_stats { - u64 cnt; - u64 nsecs; - struct u64_stats_sync syncp; -} __aligned(2 * sizeof(u64)); - struct btf_func_model { u8 ret_size; u8 nr_args; @@ -536,7 +529,7 @@ struct btf_func_model { /* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50 * bytes on x86. Pick a number to fit into BPF_IMAGE_SIZE / 2 */ -#define BPF_MAX_TRAMP_PROGS 40 +#define BPF_MAX_TRAMP_PROGS 38 struct bpf_tramp_progs { struct bpf_prog *progs[BPF_MAX_TRAMP_PROGS]; @@ -568,10 +561,10 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, struct bpf_tramp_progs *tprogs, void *orig_call); /* these two functions are called from generated trampoline */ -u64 notrace __bpf_prog_enter(void); +u64 notrace __bpf_prog_enter(struct bpf_prog *prog); void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start); -void notrace __bpf_prog_enter_sleepable(void); -void notrace __bpf_prog_exit_sleepable(void); +u64 notrace __bpf_prog_enter_sleepable(struct bpf_prog *prog); +void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start); struct bpf_ksym { unsigned long start; @@ -845,7 +838,6 @@ struct bpf_prog_aux { u32 linfo_idx; u32 num_exentries; struct exception_table_entry *extable; - struct bpf_prog_stats __percpu *stats; union { struct work_struct work; struct rcu_head rcu; @@ -1073,6 +1065,34 @@ int bpf_prog_array_copy(struct bpf_prog_array *old_array, struct bpf_prog *include_prog, struct bpf_prog_array **new_array); +/* BPF program asks to bypass CAP_NET_BIND_SERVICE in bind. */ +#define BPF_RET_BIND_NO_CAP_NET_BIND_SERVICE (1 << 0) +/* BPF program asks to set CN on the packet. */ +#define BPF_RET_SET_CN (1 << 0) + +#define BPF_PROG_RUN_ARRAY_FLAGS(array, ctx, func, ret_flags) \ + ({ \ + struct bpf_prog_array_item *_item; \ + struct bpf_prog *_prog; \ + struct bpf_prog_array *_array; \ + u32 _ret = 1; \ + u32 func_ret; \ + migrate_disable(); \ + rcu_read_lock(); \ + _array = rcu_dereference(array); \ + _item = &_array->items[0]; \ + while ((_prog = READ_ONCE(_item->prog))) { \ + bpf_cgroup_storage_set(_item->cgroup_storage); \ + func_ret = func(_prog, ctx); \ + _ret &= (func_ret & 1); \ + *(ret_flags) |= (func_ret >> 1); \ + _item++; \ + } \ + rcu_read_unlock(); \ + migrate_enable(); \ + _ret; \ + }) + #define __BPF_PROG_RUN_ARRAY(array, ctx, func, check_non_null) \ ({ \ struct bpf_prog_array_item *_item; \ @@ -1120,25 +1140,11 @@ _out: \ */ #define BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(array, ctx, func) \ ({ \ - struct bpf_prog_array_item *_item; \ - struct bpf_prog *_prog; \ - struct bpf_prog_array *_array; \ - u32 ret; \ - u32 _ret = 1; \ - u32 _cn = 0; \ - migrate_disable(); \ - rcu_read_lock(); \ - _array = rcu_dereference(array); \ - _item = &_array->items[0]; \ - while ((_prog = READ_ONCE(_item->prog))) { \ - bpf_cgroup_storage_set(_item->cgroup_storage); \ - ret = func(_prog, ctx); \ - _ret &= (ret & 1); \ - _cn |= (ret & 2); \ - _item++; \ - } \ - rcu_read_unlock(); \ - migrate_enable(); \ + u32 _flags = 0; \ + bool _cn; \ + u32 _ret; \ + _ret = BPF_PROG_RUN_ARRAY_FLAGS(array, ctx, func, &_flags); \ + _cn = _flags & BPF_RET_SET_CN; \ if (_ret) \ _ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS); \ else \ @@ -1276,6 +1282,11 @@ static inline bool bpf_allow_ptr_leaks(void) return perfmon_capable(); } +static inline bool bpf_allow_uninit_stack(void) +{ + return perfmon_capable(); +} + static inline bool bpf_allow_ptr_to_map_access(void) { return perfmon_capable(); @@ -1874,6 +1885,7 @@ extern const struct bpf_func_proto bpf_per_cpu_ptr_proto; extern const struct bpf_func_proto bpf_this_cpu_ptr_proto; extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto; extern const struct bpf_func_proto bpf_sock_from_file_proto; +extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto; const struct bpf_func_proto *bpf_tracing_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index dfe6f85d97dd..971b33aca13d 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -195,7 +195,7 @@ struct bpf_func_state { * 0 = main function, 1 = first callee. */ u32 frameno; - /* subprog number == index within subprog_stack_depth + /* subprog number == index within subprog_info * zero == main subprog */ u32 subprogno; @@ -404,6 +404,7 @@ struct bpf_verifier_env { u32 used_btf_cnt; /* number of used BTF objects */ u32 id_gen; /* used to generate unique reg IDs */ bool allow_ptr_leaks; + bool allow_uninit_stack; bool allow_ptr_to_map_access; bool bpf_capable; bool bypass_spec_v1; @@ -470,6 +471,8 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt); int check_ctx_reg(struct bpf_verifier_env *env, const struct bpf_reg_state *reg, int regno); +int check_mem_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg, + u32 regno, u32 mem_size); /* this lives here instead of in bpf.h because it needs to dereference tgt_prog */ static inline u64 bpf_trampoline_compute_key(const struct bpf_prog *tgt_prog, diff --git a/include/linux/filter.h b/include/linux/filter.h index 7fdce5407214..3b00fc906ccd 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -22,6 +22,7 @@ #include <linux/vmalloc.h> #include <linux/sockptr.h> #include <crypto/sha1.h> +#include <linux/u64_stats_sync.h> #include <net/sch_generic.h> @@ -539,6 +540,13 @@ struct bpf_binary_header { u8 image[] __aligned(BPF_IMAGE_ALIGNMENT); }; +struct bpf_prog_stats { + u64 cnt; + u64 nsecs; + u64 misses; + struct u64_stats_sync syncp; +} __aligned(2 * sizeof(u64)); + struct bpf_prog { u16 pages; /* Number of allocated pages */ u16 jited:1, /* Is our filter JIT'ed? */ @@ -557,10 +565,12 @@ struct bpf_prog { u32 len; /* Number of filter blocks */ u32 jited_len; /* Size of jited insns in bytes */ u8 tag[BPF_TAG_SIZE]; - struct bpf_prog_aux *aux; /* Auxiliary fields */ - struct sock_fprog_kern *orig_prog; /* Original BPF program */ + struct bpf_prog_stats __percpu *stats; + int __percpu *active; unsigned int (*bpf_func)(const void *ctx, const struct bpf_insn *insn); + struct bpf_prog_aux *aux; /* Auxiliary fields */ + struct sock_fprog_kern *orig_prog; /* Original BPF program */ /* Instructions for interpreter */ struct sock_filter insns[0]; struct bpf_insn insnsi[]; @@ -581,7 +591,7 @@ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key); struct bpf_prog_stats *__stats; \ u64 __start = sched_clock(); \ __ret = dfunc(ctx, (prog)->insnsi, (prog)->bpf_func); \ - __stats = this_cpu_ptr(prog->aux->stats); \ + __stats = this_cpu_ptr(prog->stats); \ u64_stats_update_begin(&__stats->syncp); \ __stats->cnt++; \ __stats->nsecs += sched_clock() - __start; \ @@ -1298,6 +1308,11 @@ struct bpf_sysctl_kern { u64 tmp_reg; }; +#define BPF_SOCKOPT_KERN_BUF_SIZE 32 +struct bpf_sockopt_buf { + u8 data[BPF_SOCKOPT_KERN_BUF_SIZE]; +}; + struct bpf_sockopt_kern { struct sock *sk; u8 *optval; diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h index a8345c8a613d..c1c76a70a6ce 100644 --- a/include/linux/indirect_call_wrapper.h +++ b/include/linux/indirect_call_wrapper.h @@ -62,4 +62,10 @@ #define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__) #endif +#if IS_ENABLED(CONFIG_INET) +#define INDIRECT_CALL_INET_1(f, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__) +#else +#define INDIRECT_CALL_INET_1(f, f1, ...) f(__VA_ARGS__) +#endif + #endif diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index bfadf3b82f9c..ddf4cfc12615 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3931,14 +3931,42 @@ int xdp_umem_query(struct net_device *dev, u16 queue_id); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); +int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb); +static __always_inline bool __is_skb_forwardable(const struct net_device *dev, + const struct sk_buff *skb, + const bool check_mtu) +{ + const u32 vlan_hdr_len = 4; /* VLAN_HLEN */ + unsigned int len; + + if (!(dev->flags & IFF_UP)) + return false; + + if (!check_mtu) + return true; + + len = dev->mtu + dev->hard_header_len + vlan_hdr_len; + if (skb->len <= len) + return true; + + /* if TSO is enabled, we don't care about the length as the packet + * could be forwarded without being segmented before + */ + if (skb_is_gso(skb)) + return true; + + return false; +} + static __always_inline int ____dev_forward_skb(struct net_device *dev, - struct sk_buff *skb) + struct sk_buff *skb, + const bool check_mtu) { if (skb_orphan_frags(skb, GFP_ATOMIC) || - unlikely(!is_skb_forwardable(dev, skb))) { + unlikely(!__is_skb_forwardable(dev, skb, check_mtu))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index fec0c5ac1c4f..8edbbf5f2f93 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -390,7 +390,6 @@ static inline struct sk_psock *sk_psock_get(struct sock *sk) } void sk_psock_stop(struct sock *sk, struct sk_psock *psock); -void sk_psock_destroy(struct rcu_head *rcu); void sk_psock_drop(struct sock *sk, struct sk_psock *psock); static inline void sk_psock_put(struct sock *sk, struct sk_psock *psock) diff --git a/include/net/inet_common.h b/include/net/inet_common.h index cb2818862919..cad2a611efde 100644 --- a/include/net/inet_common.h +++ b/include/net/inet_common.h @@ -41,6 +41,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len); #define BIND_WITH_LOCK (1 << 1) /* Called from BPF program. */ #define BIND_FROM_BPF (1 << 2) +/* Skip CAP_NET_BIND_SERVICE check. */ +#define BIND_NO_CAP_NET_BIND_SERVICE (1 << 3) int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len, u32 flags); int inet_getname(struct socket *sock, struct sockaddr *uaddr, diff --git a/include/net/sock.h b/include/net/sock.h index 855c068c6c86..636810ddcd9b 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1174,6 +1174,8 @@ struct proto { int (*backlog_rcv) (struct sock *sk, struct sk_buff *skb); + bool (*bpf_bypass_getsockopt)(int level, + int optname); void (*release_cb)(struct sock *sk); diff --git a/include/net/tcp.h b/include/net/tcp.h index 484eb2362645..963cd86d12dd 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -403,6 +403,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait); int tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen); +bool tcp_bpf_bypass_getsockopt(int level, int optname); int tcp_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval, unsigned int optlen); void tcp_set_keepalive(struct sock *sk, int val); diff --git a/include/net/xdp.h b/include/net/xdp.h index 0cf3976ce77c..a5bc214a49d9 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -164,6 +164,12 @@ void xdp_warn(const char *msg, const char *func, const int line); #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__) struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp); +struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf, + struct sk_buff *skb, + struct net_device *dev); +struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf, + struct net_device *dev); +int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp); static inline void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp) diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h index cd74bffed5c6..a23be89119aa 100644 --- a/include/trace/bpf_probe.h +++ b/include/trace/bpf_probe.h @@ -55,8 +55,7 @@ /* tracepoints with more than 12 arguments will hit build error */ #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__) -#undef DECLARE_EVENT_CLASS -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ +#define __BPF_DECLARE_TRACE(call, proto, args) \ static notrace void \ __bpf_trace_##call(void *__data, proto) \ { \ @@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto) \ CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args)); \ } +#undef DECLARE_EVENT_CLASS +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) + /* * This part is compiled out, it is only here as a build time check * to make sure that if the tracepoint handling changes, the @@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), size) #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \ DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args)) +#undef DECLARE_TRACE +#define DECLARE_TRACE(call, proto, args) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ + __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0) + #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) #undef DEFINE_EVENT_WRITABLE diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c001766adcbc..4c24daa43bac 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1656,22 +1656,30 @@ union bpf_attr { * networking traffic statistics as it provides a global socket * identifier that can be assumed unique. * Return - * A 8-byte long non-decreasing number on success, or 0 if the - * socket field is missing inside *skb*. + * A 8-byte long unique number on success, or 0 if the socket + * field is missing inside *skb*. * * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) * Description * Equivalent to bpf_get_socket_cookie() helper that accepts * *skb*, but gets socket from **struct bpf_sock_addr** context. * Return - * A 8-byte long non-decreasing number. + * A 8-byte long unique number. * * u64 bpf_get_socket_cookie(struct bpf_sock_ops *ctx) * Description * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts * *skb*, but gets socket from **struct bpf_sock_ops** context. * Return - * A 8-byte long non-decreasing number. + * A 8-byte long unique number. + * + * u64 bpf_get_socket_cookie(struct sock *sk) + * Description + * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts + * *sk*, but gets socket from a BTF **struct sock**. This helper + * also works for sleepable programs. + * Return + * A 8-byte long unique number or 0 if *sk* is NULL. * * u32 bpf_get_socket_uid(struct sk_buff *skb) * Return @@ -2231,6 +2239,9 @@ union bpf_attr { * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the * packet is not forwarded or needs assist from full stack * + * If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU + * was exceeded and output params->mtu_result contains the MTU. + * * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) * Description * Add an entry to, or update a sockhash *map* referencing sockets. @@ -3836,6 +3847,69 @@ union bpf_attr { * Return * A pointer to a struct socket on success or NULL if the file is * not a socket. + * + * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) + * Description + + * Check ctx packet size against exceeding MTU of net device (based + * on *ifindex*). This helper will likely be used in combination + * with helpers that adjust/change the packet size. + * + * The argument *len_diff* can be used for querying with a planned + * size change. This allows to check MTU prior to changing packet + * ctx. Providing an *len_diff* adjustment that is larger than the + * actual packet size (resulting in negative packet size) will in + * principle not exceed the MTU, why it is not considered a + * failure. Other BPF-helpers are needed for performing the + * planned size change, why the responsability for catch a negative + * packet size belong in those helpers. + * + * Specifying *ifindex* zero means the MTU check is performed + * against the current net device. This is practical if this isn't + * used prior to redirect. + * + * The Linux kernel route table can configure MTUs on a more + * specific per route level, which is not provided by this helper. + * For route level MTU checks use the **bpf_fib_lookup**\ () + * helper. + * + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** for tc cls_act programs. + * + * The *flags* argument can be a combination of one or more of the + * following values: + * + * **BPF_MTU_CHK_SEGS** + * This flag will only works for *ctx* **struct sk_buff**. + * If packet context contains extra packet segment buffers + * (often knows as GSO skb), then MTU check is harder to + * check at this point, because in transmit path it is + * possible for the skb packet to get re-segmented + * (depending on net device features). This could still be + * a MTU violation, so this flag enables performing MTU + * check against segments, with a different violation + * return code to tell it apart. Check cannot use len_diff. + * + * On return *mtu_len* pointer contains the MTU value of the net + * device. Remember the net device configured MTU is the L3 size, + * which is returned here and XDP and TX length operate at L2. + * Helper take this into account for you, but remember when using + * MTU value in your BPF-code. On input *mtu_len* must be a valid + * pointer and be initialized (to zero), else verifier will reject + * BPF program. + * + * Return + * * 0 on success, and populate MTU value in *mtu_len* pointer. + * + * * < 0 if any input argument is invalid (*mtu_len* not updated) + * + * MTU violations return positive values, but also populate MTU + * value in *mtu_len* pointer, as this can be needed for + * implementing PMTU handing: + * + * * **BPF_MTU_CHK_RET_FRAG_NEEDED** + * * **BPF_MTU_CHK_RET_SEGS_TOOBIG** + * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -4001,6 +4075,7 @@ union bpf_attr { FN(ktime_get_coarse_ns), \ FN(ima_inode_hash), \ FN(sock_from_file), \ + FN(check_mtu), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -4501,6 +4576,7 @@ struct bpf_prog_info { __aligned_u64 prog_tags; __u64 run_time_ns; __u64 run_cnt; + __u64 recursion_misses; } __attribute__((aligned(8))); struct bpf_map_info { @@ -4981,9 +5057,13 @@ struct bpf_fib_lookup { __be16 sport; __be16 dport; - /* total length of packet from network header - used for MTU check */ - __u16 tot_len; + union { /* used for MTU check */ + /* input to lookup */ + __u16 tot_len; /* L3 length from network hdr (iph->tot_len) */ + /* output: MTU value */ + __u16 mtu_result; + }; /* input: L3 device index for lookup * output: device index from FIB lookup */ @@ -5029,6 +5109,17 @@ struct bpf_redir_neigh { }; }; +/* bpf_check_mtu flags*/ +enum bpf_check_mtu_flags { + BPF_MTU_CHK_SEGS = (1U << 0), +}; + +enum bpf_check_mtu_ret { + BPF_MTU_CHK_RET_SUCCESS, /* check and lookup successful */ + BPF_MTU_CHK_RET_FRAG_NEEDED, /* fragmentation required to fwd */ + BPF_MTU_CHK_RET_SEGS_TOOBIG, /* GSO re-segmentation needed to fwd */ +}; + enum bpf_task_fd_type { BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */ BPF_FD_TYPE_TRACEPOINT, /* tp name */ diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index 5454161407f1..a0d9eade9c80 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -287,7 +287,7 @@ int bpf_iter_reg_target(const struct bpf_iter_reg *reg_info) { struct bpf_iter_target_info *tinfo; - tinfo = kmalloc(sizeof(*tinfo), GFP_KERNEL); + tinfo = kzalloc(sizeof(*tinfo), GFP_KERNEL); if (!tinfo) return -ENOMEM; diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c index 1b6b9349cb85..d99e89f113c4 100644 --- a/kernel/bpf/bpf_lru_list.c +++ b/kernel/bpf/bpf_lru_list.c @@ -502,13 +502,14 @@ struct bpf_lru_node *bpf_lru_pop_free(struct bpf_lru *lru, u32 hash) static void bpf_common_lru_push_free(struct bpf_lru *lru, struct bpf_lru_node *node) { + u8 node_type = READ_ONCE(node->type); unsigned long flags; - if (WARN_ON_ONCE(node->type == BPF_LRU_LIST_T_FREE) || - WARN_ON_ONCE(node->type == BPF_LRU_LOCAL_LIST_T_FREE)) + if (WARN_ON_ONCE(node_type == BPF_LRU_LIST_T_FREE) || + WARN_ON_ONCE(node_type == BPF_LRU_LOCAL_LIST_T_FREE)) return; - if (node->type == BPF_LRU_LOCAL_LIST_T_PENDING) { + if (node_type == BPF_LRU_LOCAL_LIST_T_PENDING) { struct bpf_lru_locallist *loc_l; loc_l = per_cpu_ptr(lru->common_lru.local_list, node->cpu); diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 8962f988514f..2efeb5f4b343 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3540,11 +3540,6 @@ static s32 btf_datasec_check_meta(struct btf_verifier_env *env, return -EINVAL; } - if (!btf_type_vlen(t)) { - btf_verifier_log_type(env, t, "vlen == 0"); - return -EINVAL; - } - if (!t->size) { btf_verifier_log_type(env, t, "size == 0"); return -EINVAL; @@ -5296,15 +5291,16 @@ int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *pr * Only PTR_TO_CTX and SCALAR_VALUE states are recognized. */ int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog, - struct bpf_reg_state *reg) + struct bpf_reg_state *regs) { struct bpf_verifier_log *log = &env->log; struct bpf_prog *prog = env->prog; struct btf *btf = prog->aux->btf; const struct btf_param *args; - const struct btf_type *t; - u32 i, nargs, btf_id; + const struct btf_type *t, *ref_t; + u32 i, nargs, btf_id, type_size; const char *tname; + bool is_global; if (!prog->aux->func_info) return -EINVAL; @@ -5338,38 +5334,57 @@ int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog, bpf_log(log, "Function %s has %d > 5 args\n", tname, nargs); goto out; } + + is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL; /* check that BTF function arguments match actual types that the * verifier sees. */ for (i = 0; i < nargs; i++) { + struct bpf_reg_state *reg = ®s[i + 1]; + t = btf_type_by_id(btf, args[i].type); while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); if (btf_type_is_int(t) || btf_type_is_enum(t)) { - if (reg[i + 1].type == SCALAR_VALUE) + if (reg->type == SCALAR_VALUE) continue; bpf_log(log, "R%d is not a scalar\n", i + 1); goto out; } if (btf_type_is_ptr(t)) { - if (reg[i + 1].type == SCALAR_VALUE) { - bpf_log(log, "R%d is not a pointer\n", i + 1); - goto out; - } /* If function expects ctx type in BTF check that caller * is passing PTR_TO_CTX. */ if (btf_get_prog_ctx_type(log, btf, t, prog->type, i)) { - if (reg[i + 1].type != PTR_TO_CTX) { + if (reg->type != PTR_TO_CTX) { bpf_log(log, "arg#%d expected pointer to ctx, but got %s\n", i, btf_kind_str[BTF_INFO_KIND(t->info)]); goto out; } - if (check_ctx_reg(env, ®[i + 1], i + 1)) + if (check_ctx_reg(env, reg, i + 1)) goto out; continue; } + + if (!is_global) + goto out; + + t = btf_type_skip_modifiers(btf, t->type, NULL); + + ref_t = btf_resolve_size(btf, t, &type_size); + if (IS_ERR(ref_t)) { + bpf_log(log, + "arg#%d reference type('%s %s') size cannot be determined: %ld\n", + i, btf_type_str(t), btf_name_by_offset(btf, t->name_off), + PTR_ERR(ref_t)); + goto out; + } + + if (check_mem_reg(env, reg, i + 1, type_size)) + goto out; + + continue; } bpf_log(log, "Unrecognized arg#%d type %s\n", i, btf_kind_str[BTF_INFO_KIND(t->info)]); @@ -5393,14 +5408,14 @@ out: * (either PTR_TO_CTX or SCALAR_VALUE). */ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog, - struct bpf_reg_state *reg) + struct bpf_reg_state *regs) { struct bpf_verifier_log *log = &env->log; struct bpf_prog *prog = env->prog; enum bpf_prog_type prog_type = prog->type; struct btf *btf = prog->aux->btf; const struct btf_param *args; - const struct btf_type *t; + const struct btf_type *t, *ref_t; u32 i, nargs, btf_id; const char *tname; @@ -5464,16 +5479,35 @@ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog, * Only PTR_TO_CTX and SCALAR are supported atm. */ for (i = 0; i < nargs; i++) { + struct bpf_reg_state *reg = ®s[i + 1]; + t = btf_type_by_id(btf, args[i].type); while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); if (btf_type_is_int(t) || btf_type_is_enum(t)) { - reg[i + 1].type = SCALAR_VALUE; + reg->type = SCALAR_VALUE; continue; } - if (btf_type_is_ptr(t) && - btf_get_prog_ctx_type(log, btf, t, prog_type, i)) { - reg[i + 1].type = PTR_TO_CTX; + if (btf_type_is_ptr(t)) { + if (btf_get_prog_ctx_type(log, btf, t, prog_type, i)) { + reg->type = PTR_TO_CTX; + continue; + } + + t = btf_type_skip_modifiers(btf, t->type, NULL); + + ref_t = btf_resolve_size(btf, t, ®->mem_size); + if (IS_ERR(ref_t)) { + bpf_log(log, + "arg#%d reference type('%s %s') size cannot be determined: %ld\n", + i, btf_type_str(t), btf_name_by_offset(btf, t->name_off), + PTR_ERR(ref_t)); + return -EINVAL; + } + + reg->type = PTR_TO_MEM_OR_NULL; + reg->id = ++env->id_gen; + continue; } bpf_log(log, "Arg#%d type %s in %s() is not supported yet.\n", diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 6aa9e10c6335..b567ca46555c 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -19,7 +19,7 @@ #include "../cgroup/cgroup-internal.h" -DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key); +DEFINE_STATIC_KEY_ARRAY_FALSE(cgroup_bpf_enabled_key, MAX_BPF_ATTACH_TYPE); EXPORT_SYMBOL(cgroup_bpf_enabled_key); void cgroup_bpf_offline(struct cgroup *cgrp) @@ -128,7 +128,7 @@ static void cgroup_bpf_release(struct work_struct *work) if (pl->link) bpf_cgroup_link_auto_detach(pl->link); kfree(pl); - static_branch_dec(&cgroup_bpf_enabled_key); + static_branch_dec(&cgroup_bpf_enabled_key[type]); } old_array = rcu_dereference_protected( cgrp->bpf.effective[type], @@ -499,7 +499,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, if (old_prog) bpf_prog_put(old_prog); else - static_branch_inc(&cgroup_bpf_enabled_key); + static_branch_inc(&cgroup_bpf_enabled_key[type]); bpf_cgroup_storages_link(new_storage, cgrp, type); return 0; @@ -698,7 +698,7 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, cgrp->bpf.flags[type] = 0; if (old_prog) bpf_prog_put(old_prog); - static_branch_dec(&cgroup_bpf_enabled_key); + static_branch_dec(&cgroup_bpf_enabled_key[type]); return 0; cleanup: @@ -1055,6 +1055,8 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk); * @uaddr: sockaddr struct provided by user * @type: The type of program to be exectuted * @t_ctx: Pointer to attach type specific context + * @flags: Pointer to u32 which contains higher bits of BPF program + * return value (OR'ed together). * * socket is expected to be of type INET or INET6. * @@ -1064,7 +1066,8 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk); int __cgroup_bpf_run_filter_sock_addr(struct sock *sk, struct sockaddr *uaddr, enum bpf_attach_type type, - void *t_ctx) + void *t_ctx, + u32 *flags) { struct bpf_sock_addr_kern ctx = { .sk = sk, @@ -1087,7 +1090,8 @@ int __cgroup_bpf_run_filter_sock_addr(struct sock *sk, } cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); - ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN); + ret = BPF_PROG_RUN_ARRAY_FLAGS(cgrp->bpf.effective[type], &ctx, + BPF_PROG_RUN, flags); return ret == 1 ? 0 : -EPERM; } @@ -1298,7 +1302,8 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, return empty; } -static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen, + struct bpf_sockopt_buf *buf) { if (unlikely(max_optlen < 0)) return -EINVAL; @@ -1310,6 +1315,15 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) max_optlen = PAGE_SIZE; } + if (max_optlen <= sizeof(buf->data)) { + /* When the optval fits into BPF_SOCKOPT_KERN_BUF_SIZE + * bytes avoid the cost of kzalloc. + */ + ctx->optval = buf->data; + ctx->optval_end = ctx->optval + max_optlen; + return max_optlen; + } + ctx->optval = kzalloc(max_optlen, GFP_USER); if (!ctx->optval) return -ENOMEM; @@ -1319,16 +1333,26 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) return max_optlen; } -static void sockopt_free_buf(struct bpf_sockopt_kern *ctx) +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx, + struct bpf_sockopt_buf *buf) { + if (ctx->optval == buf->data) + return; kfree(ctx->optval); } +static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx, + struct bpf_sockopt_buf *buf) +{ + return ctx->optval != buf->data; +} + int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, int *optname, char __user *optval, int *optlen, char **kernel_optval) { struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); + struct bpf_sockopt_buf buf = {}; struct bpf_sockopt_kern ctx = { .sk = sk, .level = *level, @@ -1340,8 +1364,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, * attached to the hook so we don't waste time allocating * memory and locking the socket. */ - if (!cgroup_bpf_enabled || - __cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_SETSOCKOPT)) + if (__cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_SETSOCKOPT)) return 0; /* Allocate a bit more than the initial user buffer for @@ -1350,7 +1373,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, */ max_optlen = max_t(int, 16, *optlen); - max_optlen = sockopt_alloc_buf(&ctx, max_optlen); + max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf); if (max_optlen < 0) return max_optlen; @@ -1390,14 +1413,31 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, */ if (ctx.optlen != 0) { *optlen = ctx.optlen; - *kernel_optval = ctx.optval; + /* We've used bpf_sockopt_kern->buf as an intermediary + * storage, but the BPF program indicates that we need + * to pass this data to the kernel setsockopt handler. + * No way to export on-stack buf, have to allocate a + * new buffer. + */ + if (!sockopt_buf_allocated(&ctx, &buf)) { + void *p = kmalloc(ctx.optlen, GFP_USER); + + if (!p) { + ret = -ENOMEM; + goto out; + } + memcpy(p, ctx.optval, ctx.optlen); + *kernel_optval = p; + } else { + *kernel_optval = ctx.optval; + } /* export and don't free sockopt buf */ return 0; } } out: - sockopt_free_buf(&ctx); + sockopt_free_buf(&ctx, &buf); return ret; } @@ -1407,6 +1447,7 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, int retval) { struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); + struct bpf_sockopt_buf buf = {}; struct bpf_sockopt_kern ctx = { .sk = sk, .level = level, @@ -1419,13 +1460,12 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, * attached to the hook so we don't waste time allocating * memory and locking the socket. */ - if (!cgroup_bpf_enabled || - __cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_GETSOCKOPT)) + if (__cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_GETSOCKOPT)) return retval; ctx.optlen = max_optlen; - max_optlen = sockopt_alloc_buf(&ctx, max_optlen); + max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf); if (max_optlen < 0) return max_optlen; @@ -1488,9 +1528,55 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, ret = ctx.retval; out: - sockopt_free_buf(&ctx); + sockopt_free_buf(&ctx, &buf); return ret; } + +int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level, + int optname, void *optval, + int *optlen, int retval) +{ + struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); + struct bpf_sockopt_kern ctx = { + .sk = sk, + .level = level, + .optname = optname, + .retval = retval, + .optlen = *optlen, + .optval = optval, + .optval_end = optval + *optlen, + }; + int ret; + + /* Note that __cgroup_bpf_run_filter_getsockopt doesn't copy + * user data back into BPF buffer when reval != 0. This is + * done as an optimization to avoid extra copy, assuming + * kernel won't populate the data in case of an error. + * Here we always pass the data and memset() should + * be called if that data shouldn't be "exported". + */ + + ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT], + &ctx, BPF_PROG_RUN); + if (!ret) + return -EPERM; + + if (ctx.optlen > *optlen) + return -EFAULT; + + /* BPF programs only allowed to set retval to 0, not some + * arbitrary value. + */ + if (ctx.retval != 0 && ctx.retval != retval) + return -EFAULT; + + /* BPF programs can shrink the buffer, export the modifications. + */ + if (ctx.optlen != 0) + *optlen = ctx.optlen; + + return ctx.retval; +} #endif static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp, diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 5bbd4884ff7a..0ae015ad1e05 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -91,6 +91,12 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag vfree(fp); return NULL; } + fp->active = alloc_percpu_gfp(int, GFP_KERNEL_ACCOUNT | gfp_extra_flags); + if (!fp->active) { + vfree(fp); + kfree(aux); + return NULL; + } fp->pages = size / PAGE_SIZE; fp->aux = aux; @@ -114,8 +120,9 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) if (!prog) return NULL; - prog->aux->stats = alloc_percpu_gfp(struct bpf_prog_stats, gfp_flags); - if (!prog->aux->stats) { + prog->stats = alloc_percpu_gfp(struct bpf_prog_stats, gfp_flags); + if (!prog->stats) { + free_percpu(prog->active); kfree(prog->aux); vfree(prog); return NULL; @@ -124,7 +131,7 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) for_each_possible_cpu(cpu) { struct bpf_prog_stats *pstats; - pstats = per_cpu_ptr(prog->aux->stats, cpu); + pstats = per_cpu_ptr(prog->stats, cpu); u64_stats_init(&pstats->syncp); } return prog; @@ -238,6 +245,8 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, * reallocated structure. */ fp_old->aux = NULL; + fp_old->stats = NULL; + fp_old->active = NULL; __bpf_prog_free(fp_old); } @@ -249,10 +258,11 @@ void __bpf_prog_free(struct bpf_prog *fp) if (fp->aux) { mutex_destroy(&fp->aux->used_maps_mutex); mutex_destroy(&fp->aux->dst_mutex); - free_percpu(fp->aux->stats); kfree(fp->aux->poke_tab); kfree(fp->aux); } + free_percpu(fp->stats); + free_percpu(fp->active); vfree(fp); } diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 747313698178..5d1469de6921 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -141,49 +141,6 @@ static void cpu_map_kthread_stop(struct work_struct *work) kthread_stop(rcpu->kthread); } -static struct sk_buff *cpu_map_build_skb(struct xdp_frame *xdpf, - struct sk_buff *skb) -{ - unsigned int hard_start_headroom; - unsigned int frame_size; - void *pkt_data_start; - - /* Part of headroom was reserved to xdpf */ - hard_start_headroom = sizeof(struct xdp_frame) + xdpf->headroom; - - /* Memory size backing xdp_frame data already have reserved - * room for build_skb to place skb_shared_info in tailroom. - */ - frame_size = xdpf->frame_sz; - - pkt_data_start = xdpf->data - hard_start_headroom; - skb = build_skb_around(skb, pkt_data_start, frame_size); - if (unlikely(!skb)) - return NULL; - - skb_reserve(skb, hard_start_headroom); - __skb_put(skb, xdpf->len); - if (xdpf->metasize) - skb_metadata_set(skb, xdpf->metasize); - - /* Essential SKB info: protocol and skb->dev */ - skb->protocol = eth_type_trans(skb, xdpf->dev_rx); - - /* Optional SKB info, currently missing: - * - HW checksum info (skb->ip_summed) - * - HW RX hash (skb_set_hash) - * - RX ring dev queue index (skb_record_rx_queue) - */ - - /* Until page_pool get SKB return path, release DMA here */ - xdp_release_frame(xdpf); - - /* Allow SKB to reuse area used by xdp_frame */ - xdp_scrub_frame(xdpf); - - return skb; -} - static void __cpu_map_ring_cleanup(struct ptr_ring *ring) { /* The tear-down procedure should have made sure that queue is @@ -350,7 +307,8 @@ static int cpu_map_kthread_run(void *data) struct sk_buff *skb = skbs[i]; int ret; - skb = cpu_map_build_skb(xdpf, skb); + skb = __xdp_build_skb_from_frame(xdpf, skb, + xdpf->dev_rx); if (!skb) { xdp_return_frame(xdpf); continue; diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index f6e9c68afdd4..85d9d1b72a33 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -802,9 +802,7 @@ static int dev_map_notification(struct notifier_block *notifier, break; /* will be freed in free_netdev() */ - netdev->xdp_bulkq = - __alloc_percpu_gfp(sizeof(struct xdp_dev_bulk_queue), - sizeof(void *), GFP_ATOMIC); + netdev->xdp_bulkq = alloc_percpu(struct xdp_dev_bulk_queue); if (!netdev->xdp_bulkq) return NOTIFY_BAD; diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c index 19ff8fed7f4b..3acc7e0b6916 100644 --- a/kernel/bpf/disasm.c +++ b/kernel/bpf/disasm.c @@ -161,7 +161,7 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs, insn->dst_reg, insn->off, insn->src_reg); else if (BPF_MODE(insn->code) == BPF_ATOMIC && - (insn->imm == BPF_ADD || insn->imm == BPF_ADD || + (insn->imm == BPF_ADD || insn->imm == BPF_AND || insn->imm == BPF_OR || insn->imm == BPF_XOR)) { verbose(cbs->private_data, "(%02x) lock *(%s *)(r%d %+d) %s r%d\n", insn->code, diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index c1ac7f964bc9..d63912e73ad9 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -1148,7 +1148,7 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key, /* unknown flags */ return -EINVAL; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); key_size = map->key_size; @@ -1202,7 +1202,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, /* unknown flags */ return -EINVAL; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); key_size = map->key_size; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 41ca280b1dc1..308427fe03a3 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -720,14 +720,6 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_spin_lock_proto; case BPF_FUNC_spin_unlock: return &bpf_spin_unlock_proto; - case BPF_FUNC_trace_printk: - if (!perfmon_capable()) - return NULL; - return bpf_get_trace_printk_proto(); - case BPF_FUNC_snprintf_btf: - if (!perfmon_capable()) - return NULL; - return &bpf_snprintf_btf_proto; case BPF_FUNC_jiffies64: return &bpf_jiffies64_proto; case BPF_FUNC_per_cpu_ptr: @@ -742,6 +734,8 @@ bpf_base_func_proto(enum bpf_func_id func_id) return NULL; switch (func_id) { + case BPF_FUNC_trace_printk: + return bpf_get_trace_printk_proto(); case BPF_FUNC_get_current_task: return &bpf_get_current_task_proto; case BPF_FUNC_probe_read_user: @@ -752,6 +746,8 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_probe_read_user_str_proto; case BPF_FUNC_probe_read_kernel_str: return &bpf_probe_read_kernel_str_proto; + case BPF_FUNC_snprintf_btf: + return &bpf_snprintf_btf_proto; default: return NULL; } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e5999d86c76e..c859bc46d06c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1731,25 +1731,28 @@ static int bpf_prog_release(struct inode *inode, struct file *filp) static void bpf_prog_get_stats(const struct bpf_prog *prog, struct bpf_prog_stats *stats) { - u64 nsecs = 0, cnt = 0; + u64 nsecs = 0, cnt = 0, misses = 0; int cpu; for_each_possible_cpu(cpu) { const struct bpf_prog_stats *st; unsigned int start; - u64 tnsecs, tcnt; + u64 tnsecs, tcnt, tmisses; - st = per_cpu_ptr(prog->aux->stats, cpu); + st = per_cpu_ptr(prog->stats, cpu); do { start = u64_stats_fetch_begin_irq(&st->syncp); tnsecs = st->nsecs; tcnt = st->cnt; + tmisses = st->misses; } while (u64_stats_fetch_retry_irq(&st->syncp, start)); nsecs += tnsecs; cnt += tcnt; + misses += tmisses; } stats->nsecs = nsecs; stats->cnt = cnt; + stats->misses = misses; } #ifdef CONFIG_PROC_FS @@ -1768,14 +1771,16 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp) "memlock:\t%llu\n" "prog_id:\t%u\n" "run_time_ns:\t%llu\n" - "run_cnt:\t%llu\n", + "run_cnt:\t%llu\n" + "recursion_misses:\t%llu\n", prog->type, prog->jited, prog_tag, prog->pages * 1ULL << PAGE_SHIFT, prog->aux->id, stats.nsecs, - stats.cnt); + stats.cnt, + stats.misses); } #endif @@ -3438,6 +3443,7 @@ static int bpf_prog_get_info_by_fd(struct file *file, bpf_prog_get_stats(prog, &stats); info.run_time_ns = stats.nsecs; info.run_cnt = stats.cnt; + info.recursion_misses = stats.misses; if (!bpf_capable()) { info.jited_prog_len = 0; diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 175b7b42bfc4..b68cb5d6d6eb 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -286,9 +286,248 @@ static const struct seq_operations task_file_seq_ops = { .show = task_file_seq_show, }; +struct bpf_iter_seq_task_vma_info { + /* The first field must be struct bpf_iter_seq_task_common. + * this is assumed by {init, fini}_seq_pidns() callback functions. + */ + struct bpf_iter_seq_task_common common; + struct task_struct *task; + struct vm_area_struct *vma; + u32 tid; + unsigned long prev_vm_start; + unsigned long prev_vm_end; +}; + +enum bpf_task_vma_iter_find_op { + task_vma_iter_first_vma, /* use mm->mmap */ + task_vma_iter_next_vma, /* use curr_vma->vm_next */ + task_vma_iter_find_vma, /* use find_vma() to find next vma */ +}; + +static struct vm_area_struct * +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info) +{ + struct pid_namespace *ns = info->common.ns; + enum bpf_task_vma_iter_find_op op; + struct vm_area_struct *curr_vma; + struct task_struct *curr_task; + u32 curr_tid = info->tid; + + /* If this function returns a non-NULL vma, it holds a reference to + * the task_struct, and holds read lock on vma->mm->mmap_lock. + * If this function returns NULL, it does not hold any reference or + * lock. + */ + if (info->task) { + curr_task = info->task; + curr_vma = info->vma; + /* In case of lock contention, drop mmap_lock to unblock + * the writer. + * + * After relock, call find(mm, prev_vm_end - 1) to find + * new vma to process. + * + * +------+------+-----------+ + * | VMA1 | VMA2 | VMA3 | + * +------+------+-----------+ + * | | | | + * 4k 8k 16k 400k + * + * For example, curr_vma == VMA2. Before unlock, we set + * + * prev_vm_start = 8k + * prev_vm_end = 16k + * + * There are a few cases: + * + * 1) VMA2 is freed, but VMA3 exists. + * + * find_vma() will return VMA3, just process VMA3. + * + * 2) VMA2 still exists. + * + * find_vma() will return VMA2, process VMA2->next. + * + * 3) no more vma in this mm. + * + * Process the next task. + * + * 4) find_vma() returns a different vma, VMA2'. + * + * 4.1) If VMA2 covers same range as VMA2', skip VMA2', + * because we already covered the range; + * 4.2) VMA2 and VMA2' covers different ranges, process + * VMA2'. + */ + if (mmap_lock_is_contended(curr_task->mm)) { + info->prev_vm_start = curr_vma->vm_start; + info->prev_vm_end = curr_vma->vm_end; + op = task_vma_iter_find_vma; + mmap_read_unlock(curr_task->mm); + if (mmap_read_lock_killable(curr_task->mm)) + goto finish; + } else { + op = task_vma_iter_next_vma; + } + } else { +again: + curr_task = task_seq_get_next(ns, &curr_tid, true); + if (!curr_task) { + info->tid = curr_tid + 1; + goto finish; + } + + if (curr_tid != info->tid) { + info->tid = curr_tid; + /* new task, process the first vma */ + op = task_vma_iter_first_vma; + } else { + /* Found the same tid, which means the user space + * finished data in previous buffer and read more. + * We dropped mmap_lock before returning to user + * space, so it is necessary to use find_vma() to + * find the next vma to process. + */ + op = task_vma_iter_find_vma; + } + + if (!curr_task->mm) + goto next_task; + + if (mmap_read_lock_killable(curr_task->mm)) + goto finish; + } + + switch (op) { + case task_vma_iter_first_vma: + curr_vma = curr_task->mm->mmap; + break; + case task_vma_iter_next_vma: + curr_vma = curr_vma->vm_next; + break; + case task_vma_iter_find_vma: + /* We dropped mmap_lock so it is necessary to use find_vma + * to find the next vma. This is similar to the mechanism + * in show_smaps_rollup(). + */ + curr_vma = find_vma(curr_task->mm, info->prev_vm_end - 1); + /* case 1) and 4.2) above just use curr_vma */ + + /* check for case 2) or case 4.1) above */ + if (curr_vma && + curr_vma->vm_start == info->prev_vm_start && + curr_vma->vm_end == info->prev_vm_end) + curr_vma = curr_vma->vm_next; + break; + } + if (!curr_vma) { + /* case 3) above, or case 2) 4.1) with vma->next == NULL */ + mmap_read_unlock(curr_task->mm); + goto next_task; + } + info->task = curr_task; + info->vma = curr_vma; + return curr_vma; + +next_task: + put_task_struct(curr_task); + info->task = NULL; + curr_tid++; + goto again; + +finish: + if (curr_task) + put_task_struct(curr_task); + info->task = NULL; + info->vma = NULL; + return NULL; +} + +static void *task_vma_seq_start(struct seq_file *seq, loff_t *pos) +{ + struct bpf_iter_seq_task_vma_info *info = seq->private; + struct vm_area_struct *vma; + + vma = task_vma_seq_get_next(info); + if (vma && *pos == 0) + ++*pos; + + return vma; +} + +static void *task_vma_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + struct bpf_iter_seq_task_vma_info *info = seq->private; + + ++*pos; + return task_vma_seq_get_next(info); +} + +struct bpf_iter__task_vma { + __bpf_md_ptr(struct bpf_iter_meta *, meta); + __bpf_md_ptr(struct task_struct *, task); + __bpf_md_ptr(struct vm_area_struct *, vma); +}; + +DEFINE_BPF_ITER_FUNC(task_vma, struct bpf_iter_meta *meta, + struct task_struct *task, struct vm_area_struct *vma) + +static int __task_vma_seq_show(struct seq_file *seq, bool in_stop) +{ + struct bpf_iter_seq_task_vma_info *info = seq->private; + struct bpf_iter__task_vma ctx; + struct bpf_iter_meta meta; + struct bpf_prog *prog; + + meta.seq = seq; + prog = bpf_iter_get_info(&meta, in_stop); + if (!prog) + return 0; + + ctx.meta = &meta; + ctx.task = info->task; + ctx.vma = info->vma; + return bpf_iter_run_prog(prog, &ctx); +} + +static int task_vma_seq_show(struct seq_file *seq, void *v) +{ + return __task_vma_seq_show(seq, false); +} + +static void task_vma_seq_stop(struct seq_file *seq, void *v) +{ + struct bpf_iter_seq_task_vma_info *info = seq->private; + + if (!v) { + (void)__task_vma_seq_show(seq, true); + } else { + /* info->vma has not been seen by the BPF program. If the + * user space reads more, task_vma_seq_get_next should + * return this vma again. Set prev_vm_start to ~0UL, + * so that we don't skip the vma returned by the next + * find_vma() (case task_vma_iter_find_vma in + * task_vma_seq_get_next()). + */ + info->prev_vm_start = ~0UL; + info->prev_vm_end = info->vma->vm_end; + mmap_read_unlock(info->task->mm); + put_task_struct(info->task); + info->task = NULL; + } +} + +static const struct seq_operations task_vma_seq_ops = { + .start = task_vma_seq_start, + .next = task_vma_seq_next, + .stop = task_vma_seq_stop, + .show = task_vma_seq_show, +}; + BTF_ID_LIST(btf_task_file_ids) BTF_ID(struct, task_struct) BTF_ID(struct, file) +BTF_ID(struct, vm_area_struct) static const struct bpf_iter_seq_info task_seq_info = { .seq_ops = &task_seq_ops, @@ -328,6 +567,26 @@ static struct bpf_iter_reg task_file_reg_info = { .seq_info = &task_file_seq_info, }; +static const struct bpf_iter_seq_info task_vma_seq_info = { + .seq_ops = &task_vma_seq_ops, + .init_seq_private = init_seq_pidns, + .fini_seq_private = fini_seq_pidns, + .seq_priv_size = sizeof(struct bpf_iter_seq_task_vma_info), +}; + +static struct bpf_iter_reg task_vma_reg_info = { + .target = "task_vma", + .feature = BPF_ITER_RESCHED, + .ctx_arg_info_size = 2, + .ctx_arg_info = { + { offsetof(struct bpf_iter__task_vma, task), + PTR_TO_BTF_ID_OR_NULL }, + { offsetof(struct bpf_iter__task_vma, vma), + PTR_TO_BTF_ID_OR_NULL }, + }, + .seq_info = &task_vma_seq_info, +}; + static int __init task_iter_init(void) { int ret; @@ -339,6 +598,12 @@ static int __init task_iter_init(void) task_file_reg_info.ctx_arg_info[0].btf_id = btf_task_file_ids[0]; task_file_reg_info.ctx_arg_info[1].btf_id = btf_task_file_ids[1]; - return bpf_iter_reg_target(&task_file_reg_info); + ret = bpf_iter_reg_target(&task_file_reg_info); + if (ret) + return ret; + + task_vma_reg_info.ctx_arg_info[0].btf_id = btf_task_file_ids[0]; + task_vma_reg_info.ctx_arg_info[1].btf_id = btf_task_file_ids[2]; + return bpf_iter_reg_target(&task_vma_reg_info); } late_initcall(task_iter_init); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 35c5887d82ff..7bc3b3209224 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -381,55 +381,100 @@ out: mutex_unlock(&trampoline_mutex); } +#define NO_START_TIME 1 +static u64 notrace bpf_prog_start_time(void) +{ + u64 start = NO_START_TIME; + + if (static_branch_unlikely(&bpf_stats_enabled_key)) { + start = sched_clock(); + if (unlikely(!start)) + start = NO_START_TIME; + } + return start; +} + +static void notrace inc_misses_counter(struct bpf_prog *prog) +{ + struct bpf_prog_stats *stats; + + stats = this_cpu_ptr(prog->stats); + u64_stats_update_begin(&stats->syncp); + stats->misses++; + u64_stats_update_end(&stats->syncp); +} + /* The logic is similar to BPF_PROG_RUN, but with an explicit * rcu_read_lock() and migrate_disable() which are required * for the trampoline. The macro is split into - * call _bpf_prog_enter + * call __bpf_prog_enter * call prog->bpf_func * call __bpf_prog_exit + * + * __bpf_prog_enter returns: + * 0 - skip execution of the bpf prog + * 1 - execute bpf prog + * [2..MAX_U64] - excute bpf prog and record execution time. + * This is start time. */ -u64 notrace __bpf_prog_enter(void) +u64 notrace __bpf_prog_enter(struct bpf_prog *prog) __acquires(RCU) { - u64 start = 0; - rcu_read_lock(); migrate_disable(); - if (static_branch_unlikely(&bpf_stats_enabled_key)) - start = sched_clock(); - return start; + if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) { + inc_misses_counter(prog); + return 0; + } + return bpf_prog_start_time(); } -void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) - __releases(RCU) +static void notrace update_prog_stats(struct bpf_prog *prog, + u64 start) { struct bpf_prog_stats *stats; if (static_branch_unlikely(&bpf_stats_enabled_key) && - /* static_key could be enabled in __bpf_prog_enter - * and disabled in __bpf_prog_exit. + /* static_key could be enabled in __bpf_prog_enter* + * and disabled in __bpf_prog_exit*. * And vice versa. - * Hence check that 'start' is not zero. + * Hence check that 'start' is valid. */ - start) { - stats = this_cpu_ptr(prog->aux->stats); + start > NO_START_TIME) { + stats = this_cpu_ptr(prog->stats); u64_stats_update_begin(&stats->syncp); stats->cnt++; stats->nsecs += sched_clock() - start; u64_stats_update_end(&stats->syncp); } +} + +void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) + __releases(RCU) +{ + update_prog_stats(prog, start); + __this_cpu_dec(*(prog->active)); migrate_enable(); rcu_read_unlock(); } -void notrace __bpf_prog_enter_sleepable(void) +u64 notrace __bpf_prog_enter_sleepable(struct bpf_prog *prog) { rcu_read_lock_trace(); + migrate_disable(); might_fault(); + if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) { + inc_misses_counter(prog); + return 0; + } + return bpf_prog_start_time(); } -void notrace __bpf_prog_exit_sleepable(void) +void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start) { + update_prog_stats(prog, start); + __this_cpu_dec(*(prog->active)); + migrate_enable(); rcu_read_unlock_trace(); } diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1cffd4e84725..36d1e7339ede 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -228,6 +228,12 @@ static void bpf_map_key_store(struct bpf_insn_aux_data *aux, u64 state) (poisoned ? BPF_MAP_KEY_POISON : 0ULL); } +static bool bpf_pseudo_call(const struct bpf_insn *insn) +{ + return insn->code == (BPF_JMP | BPF_CALL) && + insn->src_reg == BPF_PSEUDO_CALL; +} + struct bpf_call_arg_meta { struct bpf_map *map_ptr; bool raw_mode; @@ -1073,6 +1079,51 @@ static void mark_reg_known_zero(struct bpf_verifier_env *env, __mark_reg_known_zero(regs + regno); } +static void mark_ptr_not_null_reg(struct bpf_reg_state *reg) +{ + switch (reg->type) { + case PTR_TO_MAP_VALUE_OR_NULL: { + const struct bpf_map *map = reg->map_ptr; + + if (map->inner_map_meta) { + reg->type = CONST_PTR_TO_MAP; + reg->map_ptr = map->inner_map_meta; + } else if (map->map_type == BPF_MAP_TYPE_XSKMAP) { + reg->type = PTR_TO_XDP_SOCK; + } else if (map->map_type == BPF_MAP_TYPE_SOCKMAP || + map->map_type == BPF_MAP_TYPE_SOCKHASH) { + reg->type = PTR_TO_SOCKET; + } else { + reg->type = PTR_TO_MAP_VALUE; + } + break; + } + case PTR_TO_SOCKET_OR_NULL: + reg->type = PTR_TO_SOCKET; + break; + case PTR_TO_SOCK_COMMON_OR_NULL: + reg->type = PTR_TO_SOCK_COMMON; + break; + case PTR_TO_TCP_SOCK_OR_NULL: + reg->type = PTR_TO_TCP_SOCK; + break; + case PTR_TO_BTF_ID_OR_NULL: + reg->type = PTR_TO_BTF_ID; + break; + case PTR_TO_MEM_OR_NULL: + reg->type = PTR_TO_MEM; + break; + case PTR_TO_RDONLY_BUF_OR_NULL: + reg->type = PTR_TO_RDONLY_BUF; + break; + case PTR_TO_RDWR_BUF_OR_NULL: + reg->type = PTR_TO_RDWR_BUF; + break; + default: + WARN_ON("unknown nullable register type"); + } +} + static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg) { return type_is_pkt_pointer(reg->type); @@ -1486,9 +1537,7 @@ static int check_subprogs(struct bpf_verifier_env *env) /* determine subprog starts. The end is one before the next starts */ for (i = 0; i < insn_cnt; i++) { - if (insn[i].code != (BPF_JMP | BPF_CALL)) - continue; - if (insn[i].src_reg != BPF_PSEUDO_CALL) + if (!bpf_pseudo_call(insn + i)) continue; if (!env->bpf_capable) { verbose(env, @@ -2271,12 +2320,14 @@ static void save_register_state(struct bpf_func_state *state, state->stack[spi].slot_type[i] = STACK_SPILL; } -/* check_stack_read/write functions track spill/fill of registers, +/* check_stack_{read,write}_fixed_off functions track spill/fill of registers, * stack boundary and alignment are checked in check_mem_access() */ -static int check_stack_write(struct bpf_verifier_env *env, - struct bpf_func_state *state, /* func where register points to */ - int off, int size, int value_regno, int insn_idx) +static int check_stack_write_fixed_off(struct bpf_verifier_env *env, + /* stack frame we're writing to */ + struct bpf_func_state *state, + int off, int size, int value_regno, + int insn_idx) { struct bpf_func_state *cur; /* state of the current function */ int i, slot = -off - 1, spi = slot / BPF_REG_SIZE, err; @@ -2402,9 +2453,175 @@ static int check_stack_write(struct bpf_verifier_env *env, return 0; } -static int check_stack_read(struct bpf_verifier_env *env, - struct bpf_func_state *reg_state /* func where register points to */, - int off, int size, int value_regno) +/* Write the stack: 'stack[ptr_regno + off] = value_regno'. 'ptr_regno' is + * known to contain a variable offset. + * This function checks whether the write is permitted and conservatively + * tracks the effects of the write, considering that each stack slot in the + * dynamic range is potentially written to. + * + * 'off' includes 'regno->off'. + * 'value_regno' can be -1, meaning that an unknown value is being written to + * the stack. + * + * Spilled pointers in range are not marked as written because we don't know + * what's going to be actually written. This means that read propagation for + * future reads cannot be terminated by this write. + * + * For privileged programs, uninitialized stack slots are considered + * initialized by this write (even though we don't know exactly what offsets + * are going to be written to). The idea is that we don't want the verifier to + * reject future reads that access slots written to through variable offsets. + */ +static int check_stack_write_var_off(struct bpf_verifier_env *env, + /* func where register points to */ + struct bpf_func_state *state, + int ptr_regno, int off, int size, + int value_regno, int insn_idx) +{ + struct bpf_func_state *cur; /* state of the current function */ + int min_off, max_off; + int i, err; + struct bpf_reg_state *ptr_reg = NULL, *value_reg = NULL; + bool writing_zero = false; + /* set if the fact that we're writing a zero is used to let any + * stack slots remain STACK_ZERO + */ + bool zero_used = false; + + cur = env->cur_state->frame[env->cur_state->curframe]; + ptr_reg = &cur->regs[ptr_regno]; + min_off = ptr_reg->smin_value + off; + max_off = ptr_reg->smax_value + off + size; + if (value_regno >= 0) + value_reg = &cur->regs[value_regno]; + if (value_reg && register_is_null(value_reg)) + writing_zero = true; + + err = realloc_func_state(state, round_up(-min_off, BPF_REG_SIZE), + state->acquired_refs, true); + if (err) + return err; + + + /* Variable offset writes destroy any spilled pointers in range. */ + for (i = min_off; i < max_off; i++) { + u8 new_type, *stype; + int slot, spi; + + slot = -i - 1; + spi = slot / BPF_REG_SIZE; + stype = &state->stack[spi].slot_type[slot % BPF_REG_SIZE]; + + if (!env->allow_ptr_leaks + && *stype != NOT_INIT + && *stype != SCALAR_VALUE) { + /* Reject the write if there's are spilled pointers in + * range. If we didn't reject here, the ptr status + * would be erased below (even though not all slots are + * actually overwritten), possibly opening the door to + * leaks. + */ + verbose(env, "spilled ptr in range of var-offset stack write; insn %d, ptr off: %d", + insn_idx, i); + return -EINVAL; + } + + /* Erase all spilled pointers. */ + state->stack[spi].spilled_ptr.type = NOT_INIT; + + /* Update the slot type. */ + new_type = STACK_MISC; + if (writing_zero && *stype == STACK_ZERO) { + new_type = STACK_ZERO; + zero_used = true; + } + /* If the slot is STACK_INVALID, we check whether it's OK to + * pretend that it will be initialized by this write. The slot + * might not actually be written to, and so if we mark it as + * initialized future reads might leak uninitialized memory. + * For privileged programs, we will accept such reads to slots + * that may or may not be written because, if we're reject + * them, the error would be too confusing. + */ + if (*stype == STACK_INVALID && !env->allow_uninit_stack) { + verbose(env, "uninit stack in range of var-offset write prohibited for !root; insn %d, off: %d", + insn_idx, i); + return -EINVAL; + } + *stype = new_type; + } + if (zero_used) { + /* backtracking doesn't work for STACK_ZERO yet. */ + err = mark_chain_precision(env, value_regno); + if (err) + return err; + } + return 0; +} + +/* When register 'dst_regno' is assigned some values from stack[min_off, + * max_off), we set the register's type according to the types of the + * respective stack slots. If all the stack values are known to be zeros, then + * so is the destination reg. Otherwise, the register is considered to be + * SCALAR. This function does not deal with register filling; the caller must + * ensure that all spilled registers in the stack range have been marked as + * read. + */ +static void mark_reg_stack_read(struct bpf_verifier_env *env, + /* func where src register points to */ + struct bpf_func_state *ptr_state, + int min_off, int max_off, int dst_regno) +{ + struct bpf_verifier_state *vstate = env->cur_state; + struct bpf_func_state *state = vstate->frame[vstate->curframe]; + int i, slot, spi; + u8 *stype; + int zeros = 0; + + for (i = min_off; i < max_off; i++) { + slot = -i - 1; + spi = slot / BPF_REG_SIZE; + stype = ptr_state->stack[spi].slot_type; + if (stype[slot % BPF_REG_SIZE] != STACK_ZERO) + break; + zeros++; + } + if (zeros == max_off - min_off) { + /* any access_size read into register is zero extended, + * so the whole register == const_zero + */ + __mark_reg_const_zero(&state->regs[dst_regno]); + /* backtracking doesn't support STACK_ZERO yet, + * so mark it precise here, so that later + * backtracking can stop here. + * Backtracking may not need this if this register + * doesn't participate in pointer adjustment. + * Forward propagation of precise flag is not + * necessary either. This mark is only to stop + * backtracking. Any register that contributed + * to const 0 was marked precise before spill. + */ + state->regs[dst_regno].precise = true; + } else { + /* have read misc data from the stack */ + mark_reg_unknown(env, state->regs, dst_regno); + } + state->regs[dst_regno].live |= REG_LIVE_WRITTEN; +} + +/* Read the stack at 'off' and put the results into the register indicated by + * 'dst_regno'. It handles reg filling if the addressed stack slot is a + * spilled reg. + * + * 'dst_regno' can be -1, meaning that the read value is not going to a + * register. + * + * The access is assumed to be within the current stack bounds. + */ +static int check_stack_read_fixed_off(struct bpf_verifier_env *env, + /* func where src register points to */ + struct bpf_func_state *reg_state, + int off, int size, int dst_regno) { struct bpf_verifier_state *vstate = env->cur_state; struct bpf_func_state *state = vstate->frame[vstate->curframe]; @@ -2412,11 +2629,6 @@ static int check_stack_read(struct bpf_verifier_env *env, struct bpf_reg_state *reg; u8 *stype; - if (reg_state->allocated_stack <= slot) { - verbose(env, "invalid read from stack off %d+0 size %d\n", - off, size); - return -EACCES; - } stype = reg_state->stack[spi].slot_type; reg = ®_state->stack[spi].spilled_ptr; @@ -2427,9 +2639,9 @@ static int check_stack_read(struct bpf_verifier_env *env, verbose(env, "invalid size of register fill\n"); return -EACCES; } - if (value_regno >= 0) { - mark_reg_unknown(env, state->regs, value_regno); - state->regs[value_regno].live |= REG_LIVE_WRITTEN; + if (dst_regno >= 0) { + mark_reg_unknown(env, state->regs, dst_regno); + state->regs[dst_regno].live |= REG_LIVE_WRITTEN; } mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64); return 0; @@ -2441,16 +2653,16 @@ static int check_stack_read(struct bpf_verifier_env *env, } } - if (value_regno >= 0) { + if (dst_regno >= 0) { /* restore register state from stack */ - state->regs[value_regno] = *reg; + state->regs[dst_regno] = *reg; /* mark reg as written since spilled pointer state likely * has its liveness marks cleared by is_state_visited() * which resets stack/reg liveness for state transitions */ - state->regs[value_regno].live |= REG_LIVE_WRITTEN; + state->regs[dst_regno].live |= REG_LIVE_WRITTEN; } else if (__is_pointer_value(env->allow_ptr_leaks, reg)) { - /* If value_regno==-1, the caller is asking us whether + /* If dst_regno==-1, the caller is asking us whether * it is acceptable to use this value as a SCALAR_VALUE * (e.g. for XADD). * We must not allow unprivileged callers to do that @@ -2462,70 +2674,167 @@ static int check_stack_read(struct bpf_verifier_env *env, } mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64); } else { - int zeros = 0; + u8 type; for (i = 0; i < size; i++) { - if (stype[(slot - i) % BPF_REG_SIZE] == STACK_MISC) + type = stype[(slot - i) % BPF_REG_SIZE]; + if (type == STACK_MISC) continue; - if (stype[(slot - i) % BPF_REG_SIZE] == STACK_ZERO) { - zeros++; + if (type == STACK_ZERO) continue; - } verbose(env, "invalid read from stack off %d+%d size %d\n", off, i, size); return -EACCES; } mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64); - if (value_regno >= 0) { - if (zeros == size) { - /* any size read into register is zero extended, - * so the whole register == const_zero - */ - __mark_reg_const_zero(&state->regs[value_regno]); - /* backtracking doesn't support STACK_ZERO yet, - * so mark it precise here, so that later - * backtracking can stop here. - * Backtracking may not need this if this register - * doesn't participate in pointer adjustment. - * Forward propagation of precise flag is not - * necessary either. This mark is only to stop - * backtracking. Any register that contributed - * to const 0 was marked precise before spill. - */ - state->regs[value_regno].precise = true; - } else { - /* have read misc data from the stack */ - mark_reg_unknown(env, state->regs, value_regno); - } - state->regs[value_regno].live |= REG_LIVE_WRITTEN; - } + if (dst_regno >= 0) + mark_reg_stack_read(env, reg_state, off, off + size, dst_regno); } return 0; } -static int check_stack_access(struct bpf_verifier_env *env, - const struct bpf_reg_state *reg, - int off, int size) +enum stack_access_src { + ACCESS_DIRECT = 1, /* the access is performed by an instruction */ + ACCESS_HELPER = 2, /* the access is performed by a helper */ +}; + +static int check_stack_range_initialized(struct bpf_verifier_env *env, + int regno, int off, int access_size, + bool zero_size_allowed, + enum stack_access_src type, + struct bpf_call_arg_meta *meta); + +static struct bpf_reg_state *reg_state(struct bpf_verifier_env *env, int regno) +{ + return cur_regs(env) + regno; +} + +/* Read the stack at 'ptr_regno + off' and put the result into the register + * 'dst_regno'. + * 'off' includes the pointer register's fixed offset(i.e. 'ptr_regno.off'), + * but not its variable offset. + * 'size' is assumed to be <= reg size and the access is assumed to be aligned. + * + * As opposed to check_stack_read_fixed_off, this function doesn't deal with + * filling registers (i.e. reads of spilled register cannot be detected when + * the offset is not fixed). We conservatively mark 'dst_regno' as containing + * SCALAR_VALUE. That's why we assert that the 'ptr_regno' has a variable + * offset; for a fixed offset check_stack_read_fixed_off should be used + * instead. + */ +static int check_stack_read_var_off(struct bpf_verifier_env *env, + int ptr_regno, int off, int size, int dst_regno) { - /* Stack accesses must be at a fixed offset, so that we - * can determine what type of data were returned. See - * check_stack_read(). + /* The state of the source register. */ + struct bpf_reg_state *reg = reg_state(env, ptr_regno); + struct bpf_func_state *ptr_state = func(env, reg); + int err; + int min_off, max_off; + + /* Note that we pass a NULL meta, so raw access will not be permitted. */ - if (!tnum_is_const(reg->var_off)) { + err = check_stack_range_initialized(env, ptr_regno, off, size, + false, ACCESS_DIRECT, NULL); + if (err) + return err; + + min_off = reg->smin_value + off; + max_off = reg->smax_value + off; + mark_reg_stack_read(env, ptr_state, min_off, max_off + size, dst_regno); + return 0; +} + +/* check_stack_read dispatches to check_stack_read_fixed_off or + * check_stack_read_var_off. + * + * The caller must ensure that the offset falls within the allocated stack + * bounds. + * + * 'dst_regno' is a register which will receive the value from the stack. It + * can be -1, meaning that the read value is not going to a register. + */ +static int check_stack_read(struct bpf_verifier_env *env, + int ptr_regno, int off, int size, + int dst_regno) +{ + struct bpf_reg_state *reg = reg_state(env, ptr_regno); + struct bpf_func_state *state = func(env, reg); + int err; + /* Some accesses are only permitted with a static offset. */ + bool var_off = !tnum_is_const(reg->var_off); + + /* The offset is required to be static when reads don't go to a + * register, in order to not leak pointers (see + * check_stack_read_fixed_off). + */ + if (dst_regno < 0 && var_off) { char tn_buf[48]; tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, "variable stack access var_off=%s off=%d size=%d\n", + verbose(env, "variable offset stack pointer cannot be passed into helper function; var_off=%s off=%d size=%d\n", tn_buf, off, size); return -EACCES; } + /* Variable offset is prohibited for unprivileged mode for simplicity + * since it requires corresponding support in Spectre masking for stack + * ALU. See also retrieve_ptr_limit(). + */ + if (!env->bypass_spec_v1 && var_off) { + char tn_buf[48]; - if (off >= 0 || off < -MAX_BPF_STACK) { - verbose(env, "invalid stack off=%d size=%d\n", off, size); + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); + verbose(env, "R%d variable offset stack access prohibited for !root, var_off=%s\n", + ptr_regno, tn_buf); return -EACCES; } - return 0; + if (!var_off) { + off += reg->var_off.value; + err = check_stack_read_fixed_off(env, state, off, size, + dst_regno); + } else { + /* Variable offset stack reads need more conservative handling + * than fixed offset ones. Note that dst_regno >= 0 on this + * branch. + */ + err = check_stack_read_var_off(env, ptr_regno, off, size, + dst_regno); + } + return err; +} + + +/* check_stack_write dispatches to check_stack_write_fixed_off or + * check_stack_write_var_off. + * + * 'ptr_regno' is the register used as a pointer into the stack. + * 'off' includes 'ptr_regno->off', but not its variable offset (if any). + * 'value_regno' is the register whose value we're writing to the stack. It can + * be -1, meaning that we're not writing from a register. + * + * The caller must ensure that the offset falls within the maximum stack size. + */ +static int check_stack_write(struct bpf_verifier_env *env, + int ptr_regno, int off, int size, + int value_regno, int insn_idx) +{ + struct bpf_reg_state *reg = reg_state(env, ptr_regno); + struct bpf_func_state *state = func(env, reg); + int err; + + if (tnum_is_const(reg->var_off)) { + off += reg->var_off.value; + err = check_stack_write_fixed_off(env, state, off, size, + value_regno, insn_idx); + } else { + /* Variable offset stack reads need more conservative handling + * than fixed offset ones. + */ + err = check_stack_write_var_off(env, state, + ptr_regno, off, size, + value_regno, insn_idx); + } + return err; } static int check_map_access_type(struct bpf_verifier_env *env, u32 regno, @@ -2858,11 +3167,6 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, return -EACCES; } -static struct bpf_reg_state *reg_state(struct bpf_verifier_env *env, int regno) -{ - return cur_regs(env) + regno; -} - static bool is_pointer_value(struct bpf_verifier_env *env, int regno) { return __is_pointer_value(env->allow_ptr_leaks, reg_state(env, regno)); @@ -2981,8 +3285,8 @@ static int check_ptr_alignment(struct bpf_verifier_env *env, break; case PTR_TO_STACK: pointer_desc = "stack "; - /* The stack spill tracking logic in check_stack_write() - * and check_stack_read() relies on stack accesses being + /* The stack spill tracking logic in check_stack_write_fixed_off() + * and check_stack_read_fixed_off() relies on stack accesses being * aligned. */ strict = true; @@ -3074,9 +3378,7 @@ process_func: continue_func: subprog_end = subprog[idx + 1].start; for (; i < subprog_end; i++) { - if (insn[i].code != (BPF_JMP | BPF_CALL)) - continue; - if (insn[i].src_reg != BPF_PSEUDO_CALL) + if (!bpf_pseudo_call(insn + i)) continue; /* remember insn and function to return to */ ret_insn[frame] = i + 1; @@ -3400,6 +3702,91 @@ static int check_ptr_to_map_access(struct bpf_verifier_env *env, return 0; } +/* Check that the stack access at the given offset is within bounds. The + * maximum valid offset is -1. + * + * The minimum valid offset is -MAX_BPF_STACK for writes, and + * -state->allocated_stack for reads. + */ +static int check_stack_slot_within_bounds(int off, + struct bpf_func_state *state, + enum bpf_access_type t) +{ + int min_valid_off; + + if (t == BPF_WRITE) + min_valid_off = -MAX_BPF_STACK; + else + min_valid_off = -state->allocated_stack; + + if (off < min_valid_off || off > -1) + return -EACCES; + return 0; +} + +/* Check that the stack access at 'regno + off' falls within the maximum stack + * bounds. + * + * 'off' includes `regno->offset`, but not its dynamic part (if any). + */ +static int check_stack_access_within_bounds( + struct bpf_verifier_env *env, + int regno, int off, int access_size, + enum stack_access_src src, enum bpf_access_type type) +{ + struct bpf_reg_state *regs = cur_regs(env); + struct bpf_reg_state *reg = regs + regno; + struct bpf_func_state *state = func(env, reg); + int min_off, max_off; + int err; + char *err_extra; + + if (src == ACCESS_HELPER) + /* We don't know if helpers are reading or writing (or both). */ + err_extra = " indirect access to"; + else if (type == BPF_READ) + err_extra = " read from"; + else + err_extra = " write to"; + + if (tnum_is_const(reg->var_off)) { + min_off = reg->var_off.value + off; + if (access_size > 0) + max_off = min_off + access_size - 1; + else + max_off = min_off; + } else { + if (reg->smax_value >= BPF_MAX_VAR_OFF || + reg->smin_value <= -BPF_MAX_VAR_OFF) { + verbose(env, "invalid unbounded variable-offset%s stack R%d\n", + err_extra, regno); + return -EACCES; + } + min_off = reg->smin_value + off; + if (access_size > 0) + max_off = reg->smax_value + off + access_size - 1; + else + max_off = min_off; + } + + err = check_stack_slot_within_bounds(min_off, state, type); + if (!err) + err = check_stack_slot_within_bounds(max_off, state, type); + + if (err) { + if (tnum_is_const(reg->var_off)) { + verbose(env, "invalid%s stack R%d off=%d size=%d\n", + err_extra, regno, off, access_size); + } else { + char tn_buf[48]; + + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); + verbose(env, "invalid variable-offset%s stack R%d var_off=%s size=%d\n", + err_extra, regno, tn_buf, access_size); + } + } + return err; +} /* check whether memory at (regno + off) is accessible for t = (read | write) * if t==write, value_regno is a register which value is stored into memory @@ -3515,8 +3902,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn } } else if (reg->type == PTR_TO_STACK) { - off += reg->var_off.value; - err = check_stack_access(env, reg, off, size); + /* Basic bounds checks. */ + err = check_stack_access_within_bounds(env, regno, off, size, ACCESS_DIRECT, t); if (err) return err; @@ -3525,12 +3912,12 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn if (err) return err; - if (t == BPF_WRITE) - err = check_stack_write(env, state, off, size, - value_regno, insn_idx); - else - err = check_stack_read(env, state, off, size, + if (t == BPF_READ) + err = check_stack_read(env, regno, off, size, value_regno); + else + err = check_stack_write(env, regno, off, size, + value_regno, insn_idx); } else if (reg_is_pkt_pointer(reg)) { if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) { verbose(env, "cannot write into packet\n"); @@ -3665,9 +4052,26 @@ static int check_atomic(struct bpf_verifier_env *env, int insn_idx, struct bpf_i return -EACCES; } + if (insn->imm & BPF_FETCH) { + if (insn->imm == BPF_CMPXCHG) + load_reg = BPF_REG_0; + else + load_reg = insn->src_reg; + + /* check and record load of old value */ + err = check_reg_arg(env, load_reg, DST_OP); + if (err) + return err; + } else { + /* This instruction accesses a memory location but doesn't + * actually load it into a register. + */ + load_reg = -1; + } + /* check whether we can read the memory */ err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off, - BPF_SIZE(insn->code), BPF_READ, -1, true); + BPF_SIZE(insn->code), BPF_READ, load_reg, true); if (err) return err; @@ -3677,65 +4081,56 @@ static int check_atomic(struct bpf_verifier_env *env, int insn_idx, struct bpf_i if (err) return err; - if (!(insn->imm & BPF_FETCH)) - return 0; - - if (insn->imm == BPF_CMPXCHG) - load_reg = BPF_REG_0; - else - load_reg = insn->src_reg; - - /* check and record load of old value */ - err = check_reg_arg(env, load_reg, DST_OP); - if (err) - return err; - return 0; } -static int __check_stack_boundary(struct bpf_verifier_env *env, u32 regno, - int off, int access_size, - bool zero_size_allowed) +/* When register 'regno' is used to read the stack (either directly or through + * a helper function) make sure that it's within stack boundary and, depending + * on the access type, that all elements of the stack are initialized. + * + * 'off' includes 'regno->off', but not its dynamic part (if any). + * + * All registers that have been spilled on the stack in the slots within the + * read offsets are marked as read. + */ +static int check_stack_range_initialized( + struct bpf_verifier_env *env, int regno, int off, + int access_size, bool zero_size_allowed, + enum stack_access_src type, struct bpf_call_arg_meta *meta) { struct bpf_reg_state *reg = reg_state(env, regno); + struct bpf_func_state *state = func(env, reg); + int err, min_off, max_off, i, j, slot, spi; + char *err_extra = type == ACCESS_HELPER ? " indirect" : ""; + enum bpf_access_type bounds_check_type; + /* Some accesses can write anything into the stack, others are + * read-only. + */ + bool clobber = false; - if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 || - access_size < 0 || (access_size == 0 && !zero_size_allowed)) { - if (tnum_is_const(reg->var_off)) { - verbose(env, "invalid stack type R%d off=%d access_size=%d\n", - regno, off, access_size); - } else { - char tn_buf[48]; - - tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, "invalid stack type R%d var_off=%s access_size=%d\n", - regno, tn_buf, access_size); - } + if (access_size == 0 && !zero_size_allowed) { + verbose(env, "invalid zero-sized read\n"); return -EACCES; } - return 0; -} -/* when register 'regno' is passed into function that will read 'access_size' - * bytes from that pointer, make sure that it's within stack boundary - * and all elements of stack are initialized. - * Unlike most pointer bounds-checking functions, this one doesn't take an - * 'off' argument, so it has to add in reg->off itself. - */ -static int check_stack_boundary(struct bpf_verifier_env *env, int regno, - int access_size, bool zero_size_allowed, - struct bpf_call_arg_meta *meta) -{ - struct bpf_reg_state *reg = reg_state(env, regno); - struct bpf_func_state *state = func(env, reg); - int err, min_off, max_off, i, j, slot, spi; + if (type == ACCESS_HELPER) { + /* The bounds checks for writes are more permissive than for + * reads. However, if raw_mode is not set, we'll do extra + * checks below. + */ + bounds_check_type = BPF_WRITE; + clobber = true; + } else { + bounds_check_type = BPF_READ; + } + err = check_stack_access_within_bounds(env, regno, off, access_size, + type, bounds_check_type); + if (err) + return err; + if (tnum_is_const(reg->var_off)) { - min_off = max_off = reg->var_off.value + reg->off; - err = __check_stack_boundary(env, regno, min_off, access_size, - zero_size_allowed); - if (err) - return err; + min_off = max_off = reg->var_off.value + off; } else { /* Variable offset is prohibited for unprivileged mode for * simplicity since it requires corresponding support in @@ -3746,8 +4141,8 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno, char tn_buf[48]; tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, "R%d indirect variable offset stack access prohibited for !root, var_off=%s\n", - regno, tn_buf); + verbose(env, "R%d%s variable offset stack access prohibited for !root, var_off=%s\n", + regno, err_extra, tn_buf); return -EACCES; } /* Only initialized buffer on stack is allowed to be accessed @@ -3759,28 +4154,8 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno, if (meta && meta->raw_mode) meta = NULL; - if (reg->smax_value >= BPF_MAX_VAR_OFF || - reg->smax_value <= -BPF_MAX_VAR_OFF) { - verbose(env, "R%d unbounded indirect variable offset stack access\n", - regno); - return -EACCES; - } - min_off = reg->smin_value + reg->off; - max_off = reg->smax_value + reg->off; - err = __check_stack_boundary(env, regno, min_off, access_size, - zero_size_allowed); - if (err) { - verbose(env, "R%d min value is outside of stack bound\n", - regno); - return err; - } - err = __check_stack_boundary(env, regno, max_off, access_size, - zero_size_allowed); - if (err) { - verbose(env, "R%d max value is outside of stack bound\n", - regno); - return err; - } + min_off = reg->smin_value + off; + max_off = reg->smax_value + off; } if (meta && meta->raw_mode) { @@ -3800,8 +4175,10 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno, if (*stype == STACK_MISC) goto mark; if (*stype == STACK_ZERO) { - /* helper can write anything into the stack */ - *stype = STACK_MISC; + if (clobber) { + /* helper can write anything into the stack */ + *stype = STACK_MISC; + } goto mark; } @@ -3812,22 +4189,24 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno, if (state->stack[spi].slot_type[0] == STACK_SPILL && (state->stack[spi].spilled_ptr.type == SCALAR_VALUE || env->allow_ptr_leaks)) { - __mark_reg_unknown(env, &state->stack[spi].spilled_ptr); - for (j = 0; j < BPF_REG_SIZE; j++) - state->stack[spi].slot_type[j] = STACK_MISC; + if (clobber) { + __mark_reg_unknown(env, &state->stack[spi].spilled_ptr); + for (j = 0; j < BPF_REG_SIZE; j++) + state->stack[spi].slot_type[j] = STACK_MISC; + } goto mark; } err: if (tnum_is_const(reg->var_off)) { - verbose(env, "invalid indirect read from stack off %d+%d size %d\n", - min_off, i - min_off, access_size); + verbose(env, "invalid%s read from stack R%d off %d+%d size %d\n", + err_extra, regno, min_off, i - min_off, access_size); } else { char tn_buf[48]; tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, "invalid indirect read from stack var_off %s+%d size %d\n", - tn_buf, i - min_off, access_size); + verbose(env, "invalid%s read from stack R%d var_off %s+%d size %d\n", + err_extra, regno, tn_buf, i - min_off, access_size); } return -EACCES; mark: @@ -3876,8 +4255,10 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno, "rdwr", &env->prog->aux->max_rdwr_access); case PTR_TO_STACK: - return check_stack_boundary(env, regno, access_size, - zero_size_allowed, meta); + return check_stack_range_initialized( + env, + regno, reg->off, access_size, + zero_size_allowed, ACCESS_HELPER, meta); default: /* scalar_value or invalid ptr */ /* Allow zero-byte read from NULL, regardless of pointer type */ if (zero_size_allowed && access_size == 0 && @@ -3891,6 +4272,29 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno, } } +int check_mem_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg, + u32 regno, u32 mem_size) +{ + if (register_is_null(reg)) + return 0; + + if (reg_type_may_be_null(reg->type)) { + /* Assuming that the register contains a value check if the memory + * access is safe. Temporarily save and restore the register's state as + * the conversion shouldn't be visible to a caller. + */ + const struct bpf_reg_state saved_reg = *reg; + int rv; + + mark_ptr_not_null_reg(reg); + rv = check_helper_mem_access(env, regno, mem_size, true, NULL); + *reg = saved_reg; + return rv; + } + + return check_helper_mem_access(env, regno, mem_size, true, NULL); +} + /* Implementation details: * bpf_map_lookup returns PTR_TO_MAP_VALUE_OR_NULL * Two bpf_map_lookups (even with the same key) will have different reg->id. @@ -4875,8 +5279,9 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, subprog); clear_caller_saved_regs(env, caller->regs); - /* All global functions return SCALAR_VALUE */ + /* All global functions return a 64-bit SCALAR_VALUE */ mark_reg_unknown(env, caller->regs, BPF_REG_0); + caller->regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG; /* continue with next insn after call */ return 0; @@ -5541,6 +5946,41 @@ do_sim: return !ret ? -EFAULT : 0; } +/* check that stack access falls within stack limits and that 'reg' doesn't + * have a variable offset. + * + * Variable offset is prohibited for unprivileged mode for simplicity since it + * requires corresponding support in Spectre masking for stack ALU. See also + * retrieve_ptr_limit(). + * + * + * 'off' includes 'reg->off'. + */ +static int check_stack_access_for_ptr_arithmetic( + struct bpf_verifier_env *env, + int regno, + const struct bpf_reg_state *reg, + int off) +{ + if (!tnum_is_const(reg->var_off)) { + char tn_buf[48]; + + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); + verbose(env, "R%d variable stack access prohibited for !root, var_off=%s off=%d\n", + regno, tn_buf, off); + return -EACCES; + } + + if (off >= 0 || off < -MAX_BPF_STACK) { + verbose(env, "R%d stack pointer arithmetic goes out of range, " + "prohibited for !root; off=%d\n", regno, off); + return -EACCES; + } + + return 0; +} + + /* Handles arithmetic on a pointer and a scalar: computes new min/max and var_off. * Caller should also handle BPF_MOV case separately. * If we return -EACCES, caller may want to try again treating pointer as a @@ -5784,10 +6224,9 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, "prohibited for !root\n", dst); return -EACCES; } else if (dst_reg->type == PTR_TO_STACK && - check_stack_access(env, dst_reg, dst_reg->off + - dst_reg->var_off.value, 1)) { - verbose(env, "R%d stack pointer arithmetic goes out of range, " - "prohibited for !root\n", dst); + check_stack_access_for_ptr_arithmetic( + env, dst, dst_reg, dst_reg->off + + dst_reg->var_off.value)) { return -EACCES; } } @@ -6266,7 +6705,7 @@ static void scalar32_min_max_rsh(struct bpf_reg_state *dst_reg, * 3) the signed bounds cross zero, so they tell us nothing * about the result * If the value in dst_reg is known nonnegative, then again the - * unsigned bounts capture the signed bounds. + * unsigned bounds capture the signed bounds. * Thus, in all cases it suffices to blow away our signed bounds * and rely on inferring new ones from the unsigned bounds and * var_off of the result. @@ -6297,7 +6736,7 @@ static void scalar_min_max_rsh(struct bpf_reg_state *dst_reg, * 3) the signed bounds cross zero, so they tell us nothing * about the result * If the value in dst_reg is known nonnegative, then again the - * unsigned bounts capture the signed bounds. + * unsigned bounds capture the signed bounds. * Thus, in all cases it suffices to blow away our signed bounds * and rely on inferring new ones from the unsigned bounds and * var_off of the result. @@ -7367,43 +7806,19 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, } if (is_null) { reg->type = SCALAR_VALUE; - } else if (reg->type == PTR_TO_MAP_VALUE_OR_NULL) { - const struct bpf_map *map = reg->map_ptr; - - if (map->inner_map_meta) { - reg->type = CONST_PTR_TO_MAP; - reg->map_ptr = map->inner_map_meta; - } else if (map->map_type == BPF_MAP_TYPE_XSKMAP) { - reg->type = PTR_TO_XDP_SOCK; - } else if (map->map_type == BPF_MAP_TYPE_SOCKMAP || - map->map_type == BPF_MAP_TYPE_SOCKHASH) { - reg->type = PTR_TO_SOCKET; - } else { - reg->type = PTR_TO_MAP_VALUE; - } - } else if (reg->type == PTR_TO_SOCKET_OR_NULL) { - reg->type = PTR_TO_SOCKET; - } else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) { - reg->type = PTR_TO_SOCK_COMMON; - } else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) { - reg->type = PTR_TO_TCP_SOCK; - } else if (reg->type == PTR_TO_BTF_ID_OR_NULL) { - reg->type = PTR_TO_BTF_ID; - } else if (reg->type == PTR_TO_MEM_OR_NULL) { - reg->type = PTR_TO_MEM; - } else if (reg->type == PTR_TO_RDONLY_BUF_OR_NULL) { - reg->type = PTR_TO_RDONLY_BUF; - } else if (reg->type == PTR_TO_RDWR_BUF_OR_NULL) { - reg->type = PTR_TO_RDWR_BUF; - } - if (is_null) { /* We don't need id and ref_obj_id from this point * onwards anymore, thus we should better reset it, * so that state pruning has chances to take effect. */ reg->id = 0; reg->ref_obj_id = 0; - } else if (!reg_may_point_to_spin_lock(reg)) { + + return; + } + + mark_ptr_not_null_reg(reg); + + if (!reg_may_point_to_spin_lock(reg)) { /* For not-NULL ptr, reg->ref_obj_id will be reset * in release_reg_references(). * @@ -7986,6 +8401,9 @@ static int check_return_code(struct bpf_verifier_env *env) env->prog->expected_attach_type == BPF_CGROUP_INET4_GETSOCKNAME || env->prog->expected_attach_type == BPF_CGROUP_INET6_GETSOCKNAME) range = tnum_range(1, 1); + if (env->prog->expected_attach_type == BPF_CGROUP_INET4_BIND || + env->prog->expected_attach_type == BPF_CGROUP_INET6_BIND) + range = tnum_range(0, 3); break; case BPF_PROG_TYPE_CGROUP_SKB: if (env->prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) { @@ -10015,15 +10433,22 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, case BPF_MAP_TYPE_HASH: case BPF_MAP_TYPE_LRU_HASH: case BPF_MAP_TYPE_ARRAY: + case BPF_MAP_TYPE_PERCPU_HASH: + case BPF_MAP_TYPE_PERCPU_ARRAY: + case BPF_MAP_TYPE_LRU_PERCPU_HASH: + case BPF_MAP_TYPE_ARRAY_OF_MAPS: + case BPF_MAP_TYPE_HASH_OF_MAPS: if (!is_preallocated_map(map)) { verbose(env, - "Sleepable programs can only use preallocated hash maps\n"); + "Sleepable programs can only use preallocated maps\n"); return -EINVAL; } break; + case BPF_MAP_TYPE_RINGBUF: + break; default: verbose(env, - "Sleepable programs can only use array and hash maps\n"); + "Sleepable programs can only use array, hash, and ringbuf maps\n"); return -EINVAL; } @@ -10581,6 +11006,7 @@ static int opt_subreg_zext_lo32_rnd_hi32(struct bpf_verifier_env *env, for (i = 0; i < len; i++) { int adj_idx = i + delta; struct bpf_insn insn; + u8 load_reg; insn = insns[adj_idx]; if (!aux[adj_idx].zext_dst) { @@ -10623,9 +11049,27 @@ static int opt_subreg_zext_lo32_rnd_hi32(struct bpf_verifier_env *env, if (!bpf_jit_needs_zext()) continue; + /* zext_dst means that we want to zero-extend whatever register + * the insn defines, which is dst_reg most of the time, with + * the notable exception of BPF_STX + BPF_ATOMIC + BPF_FETCH. + */ + if (BPF_CLASS(insn.code) == BPF_STX && + BPF_MODE(insn.code) == BPF_ATOMIC) { + /* BPF_STX + BPF_ATOMIC insns without BPF_FETCH do not + * define any registers, therefore zext_dst cannot be + * set. + */ + if (WARN_ON(!(insn.imm & BPF_FETCH))) + return -EINVAL; + load_reg = insn.imm == BPF_CMPXCHG ? BPF_REG_0 + : insn.src_reg; + } else { + load_reg = insn.dst_reg; + } + zext_patch[0] = insn; - zext_patch[1].dst_reg = insn.dst_reg; - zext_patch[1].src_reg = insn.dst_reg; + zext_patch[1].dst_reg = load_reg; + zext_patch[1].src_reg = load_reg; patch = zext_patch; patch_len = 2; apply_patch_buffer: @@ -10841,8 +11285,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) return 0; for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) { - if (insn->code != (BPF_JMP | BPF_CALL) || - insn->src_reg != BPF_PSEUDO_CALL) + if (!bpf_pseudo_call(insn)) continue; /* Upon error here we cannot fall back to interpreter but * need a hard reject of the program. Thus -EFAULT is @@ -10883,7 +11326,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) /* BPF_PROG_RUN doesn't call subprogs directly, * hence main prog stats include the runtime of subprogs. * subprogs don't have IDs and not reachable via prog_get_next_id - * func[i]->aux->stats will never be accessed and stays NULL + * func[i]->stats will never be accessed and stays NULL */ func[i] = bpf_prog_alloc_no_stats(bpf_prog_size(len), GFP_USER); if (!func[i]) @@ -10971,8 +11414,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) for (i = 0; i < env->subprog_cnt; i++) { insn = func[i]->insnsi; for (j = 0; j < func[i]->len; j++, insn++) { - if (insn->code != (BPF_JMP | BPF_CALL) || - insn->src_reg != BPF_PSEUDO_CALL) + if (!bpf_pseudo_call(insn)) continue; subprog = insn->off; insn->imm = BPF_CAST_CALL(func[subprog]->bpf_func) - @@ -11017,8 +11459,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) * later look the same as if they were interpreted only. */ for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) { - if (insn->code != (BPF_JMP | BPF_CALL) || - insn->src_reg != BPF_PSEUDO_CALL) + if (!bpf_pseudo_call(insn)) continue; insn->off = env->insn_aux_data[i].call_imm; subprog = find_subprog(env, i + insn->off + 1); @@ -11047,8 +11488,7 @@ out_undo_insn: /* cleanup main prog to be interpreted */ prog->jit_requested = 0; for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) { - if (insn->code != (BPF_JMP | BPF_CALL) || - insn->src_reg != BPF_PSEUDO_CALL) + if (!bpf_pseudo_call(insn)) continue; insn->off = 0; insn->imm = env->insn_aux_data[i].call_imm; @@ -11083,8 +11523,7 @@ static int fixup_call_args(struct bpf_verifier_env *env) return -EINVAL; } for (i = 0; i < prog->len; i++, insn++) { - if (insn->code != (BPF_JMP | BPF_CALL) || - insn->src_reg != BPF_PSEUDO_CALL) + if (!bpf_pseudo_call(insn)) continue; depth = get_callee_stack_depth(env, insn, i); if (depth < 0) @@ -11547,6 +11986,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog) mark_reg_known_zero(env, regs, i); else if (regs[i].type == SCALAR_VALUE) mark_reg_unknown(env, regs, i); + else if (regs[i].type == PTR_TO_MEM_OR_NULL) { + const u32 mem_size = regs[i].mem_size; + + mark_reg_known_zero(env, regs, i); + regs[i].mem_size = mem_size; + regs[i].id = ++env->id_gen; + } } } else { /* 1st arg to a function */ @@ -12125,6 +12571,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, env->strict_alignment = false; env->allow_ptr_leaks = bpf_allow_ptr_leaks(); + env->allow_uninit_stack = bpf_allow_uninit_stack(); env->allow_ptr_to_map_access = bpf_allow_ptr_to_map_access(); env->bypass_spec_v1 = bpf_bypass_spec_v1(); env->bypass_spec_v4 = bpf_bypass_spec_v4(); diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 764400260eb6..b0c45d923f0f 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1188,6 +1188,10 @@ BTF_SET_END(btf_allowlist_d_path) static bool bpf_d_path_allowed(const struct bpf_prog *prog) { + if (prog->type == BPF_PROG_TYPE_TRACING && + prog->expected_attach_type == BPF_TRACE_ITER) + return true; + if (prog->type == BPF_PROG_TYPE_LSM) return bpf_lsm_is_sleepable_hook(prog->aux->attach_btf_id); @@ -1757,6 +1761,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_sk_storage_delete_tracing_proto; case BPF_FUNC_sock_from_file: return &bpf_sock_from_file_proto; + case BPF_FUNC_get_socket_cookie: + return &bpf_get_socket_ptr_cookie_proto; #endif case BPF_FUNC_seq_printf: return prog->expected_attach_type == BPF_TRACE_ITER ? diff --git a/lib/test_bpf.c b/lib/test_bpf.c index 49ec9e8d8aed..4dc4dcbecd12 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -345,7 +345,7 @@ static int __bpf_fill_ja(struct bpf_test *self, unsigned int len, static int bpf_fill_maxinsns11(struct bpf_test *self) { - /* Hits 70 passes on x86_64, so cannot get JITed there. */ + /* Hits 70 passes on x86_64 and triggers NOPs padding. */ return __bpf_fill_ja(self, BPF_MAXINSNS, 68); } @@ -5318,15 +5318,10 @@ static struct bpf_test tests[] = { { "BPF_MAXINSNS: Jump, gap, jump, ...", { }, -#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_X86) - CLASSIC | FLAG_NO_DATA | FLAG_EXPECTED_FAIL, -#else CLASSIC | FLAG_NO_DATA, -#endif { }, { { 0, 0xababcbac } }, .fill_helper = bpf_fill_maxinsns11, - .expected_errcode = -ENOTSUPP, }, { "BPF_MAXINSNS: jump over MSH", diff --git a/net/core/dev.c b/net/core/dev.c index ea9b46318d23..6c5967e80132 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2217,28 +2217,14 @@ static inline void net_timestamp_set(struct sk_buff *skb) bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) { - unsigned int len; - - if (!(dev->flags & IFF_UP)) - return false; - - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; - if (skb->len <= len) - return true; - - /* if TSO is enabled, we don't care about the length as the packet - * could be forwarded without being segmented before - */ - if (skb_is_gso(skb)) - return true; - - return false; + return __is_skb_forwardable(dev, skb, true); } EXPORT_SYMBOL_GPL(is_skb_forwardable); -int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +static int __dev_forward_skb2(struct net_device *dev, struct sk_buff *skb, + bool check_mtu) { - int ret = ____dev_forward_skb(dev, skb); + int ret = ____dev_forward_skb(dev, skb, check_mtu); if (likely(!ret)) { skb->protocol = eth_type_trans(skb, dev); @@ -2247,6 +2233,11 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) return ret; } + +int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +{ + return __dev_forward_skb2(dev, skb, true); +} EXPORT_SYMBOL_GPL(__dev_forward_skb); /** @@ -2273,6 +2264,11 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) } EXPORT_SYMBOL_GPL(dev_forward_skb); +int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb) +{ + return __dev_forward_skb2(dev, skb, false) ?: netif_rx_internal(skb); +} + static inline int deliver_skb(struct sk_buff *skb, struct packet_type *pt_prev, struct net_device *orig_dev) diff --git a/net/core/filter.c b/net/core/filter.c index 3b728ab79a61..adfdad234674 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2083,13 +2083,13 @@ static const struct bpf_func_proto bpf_csum_level_proto = { static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb) { - return dev_forward_skb(dev, skb); + return dev_forward_skb_nomtu(dev, skb); } static inline int __bpf_rx_skb_no_mac(struct net_device *dev, struct sk_buff *skb) { - int ret = ____dev_forward_skb(dev, skb); + int ret = ____dev_forward_skb(dev, skb, false); if (likely(!ret)) { skb->dev = dev; @@ -2480,7 +2480,7 @@ int skb_do_redirect(struct sk_buff *skb) goto out_drop; dev = ops->ndo_get_peer_dev(dev); if (unlikely(!dev || - !is_skb_forwardable(dev, skb) || + !(dev->flags & IFF_UP) || net_eq(net, dev_net(dev)))) goto out_drop; skb->dev = dev; @@ -3552,11 +3552,7 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, return 0; } -static u32 __bpf_skb_max_len(const struct sk_buff *skb) -{ - return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len : - SKB_MAX_ALLOC; -} +#define BPF_SKB_MAX_LEN SKB_MAX_ALLOC BPF_CALL_4(sk_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, u32, mode, u64, flags) @@ -3605,7 +3601,7 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, { u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); - u32 len_max = __bpf_skb_max_len(skb); + u32 len_max = BPF_SKB_MAX_LEN; __be16 proto = skb->protocol; bool shrink = len_diff < 0; u32 off; @@ -3688,7 +3684,7 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len) static inline int __bpf_skb_change_tail(struct sk_buff *skb, u32 new_len, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; @@ -3764,7 +3760,7 @@ static const struct bpf_func_proto sk_skb_change_tail_proto = { static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, u64 flags) { - u32 max_len = __bpf_skb_max_len(skb); + u32 max_len = BPF_SKB_MAX_LEN; u32 new_len = skb->len + head_room; int ret; @@ -4631,6 +4627,18 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +BPF_CALL_1(bpf_get_socket_ptr_cookie, struct sock *, sk) +{ + return sk ? sock_gen_cookie(sk) : 0; +} + +const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto = { + .func = bpf_get_socket_ptr_cookie, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, +}; + BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx) { return __sock_gen_cookie(ctx->sk); @@ -5291,12 +5299,14 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = { #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6) static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, const struct neighbour *neigh, - const struct net_device *dev) + const struct net_device *dev, u32 mtu) { memcpy(params->dmac, neigh->ha, ETH_ALEN); memcpy(params->smac, dev->dev_addr, ETH_ALEN); params->h_vlan_TCI = 0; params->h_vlan_proto = 0; + if (mtu) + params->mtu_result = mtu; /* union with tot_len */ return 0; } @@ -5312,8 +5322,8 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct net_device *dev; struct fib_result res; struct flowi4 fl4; + u32 mtu = 0; int err; - u32 mtu; dev = dev_get_by_index_rcu(net, params->ifindex); if (unlikely(!dev)) @@ -5380,8 +5390,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; + } } nhc = res.nhc; @@ -5415,7 +5427,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (!neigh) return BPF_FIB_LKUP_RET_NO_NEIGH; - return bpf_fib_set_fwd_params(params, neigh, dev); + return bpf_fib_set_fwd_params(params, neigh, dev, mtu); } #endif @@ -5432,7 +5444,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, struct flowi6 fl6; int strict = 0; int oif, err; - u32 mtu; + u32 mtu = 0; /* link local addresses are never forwarded */ if (rt6_need_strict(dst) || rt6_need_strict(src)) @@ -5507,8 +5519,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (check_mtu) { mtu = ipv6_stub->ip6_mtu_from_fib6(&res, dst, src); - if (params->tot_len > mtu) + if (params->tot_len > mtu) { + params->mtu_result = mtu; /* union with tot_len */ return BPF_FIB_LKUP_RET_FRAG_NEEDED; + } } if (res.nh->fib_nh_lws) @@ -5528,7 +5542,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, if (!neigh) return BPF_FIB_LKUP_RET_NO_NEIGH; - return bpf_fib_set_fwd_params(params, neigh, dev); + return bpf_fib_set_fwd_params(params, neigh, dev, mtu); } #endif @@ -5571,6 +5585,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, { struct net *net = dev_net(skb->dev); int rc = -EAFNOSUPPORT; + bool check_mtu = false; if (plen < sizeof(*params)) return -EINVAL; @@ -5578,25 +5593,33 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT)) return -EINVAL; + if (params->tot_len) + check_mtu = true; + switch (params->family) { #if IS_ENABLED(CONFIG_INET) case AF_INET: - rc = bpf_ipv4_fib_lookup(net, params, flags, false); + rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu); break; #endif #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: - rc = bpf_ipv6_fib_lookup(net, params, flags, false); + rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu); break; #endif } - if (!rc) { + if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) { struct net_device *dev; + /* When tot_len isn't provided by user, check skb + * against MTU of FIB lookup resulting net_device + */ dev = dev_get_by_index_rcu(net, params->ifindex); if (!is_skb_forwardable(dev, skb)) rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; + + params->mtu_result = dev->mtu; /* union with tot_len */ } return rc; @@ -5612,6 +5635,116 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = { .arg4_type = ARG_ANYTHING, }; +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr, + u32 ifindex) +{ + struct net *netns = dev_net(dev_curr); + + /* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */ + if (ifindex == 0) + return dev_curr; + + return dev_get_by_index_rcu(netns, ifindex); +} + +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) +{ + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; + struct net_device *dev = skb->dev; + int skb_len, dev_len; + int mtu; + + if (unlikely(flags & ~(BPF_MTU_CHK_SEGS))) + return -EINVAL; + + if (unlikely(flags & BPF_MTU_CHK_SEGS && len_diff)) + return -EINVAL; + + dev = __dev_via_ifindex(dev, ifindex); + if (unlikely(!dev)) + return -ENODEV; + + mtu = READ_ONCE(dev->mtu); + + dev_len = mtu + dev->hard_header_len; + skb_len = skb->len + len_diff; /* minus result pass check */ + if (skb_len <= dev_len) { + ret = BPF_MTU_CHK_RET_SUCCESS; + goto out; + } + /* At this point, skb->len exceed MTU, but as it include length of all + * segments, it can still be below MTU. The SKB can possibly get + * re-segmented in transmit path (see validate_xmit_skb). Thus, user + * must choose if segs are to be MTU checked. + */ + if (skb_is_gso(skb)) { + ret = BPF_MTU_CHK_RET_SUCCESS; + + if (flags & BPF_MTU_CHK_SEGS && + !skb_gso_validate_network_len(skb, mtu)) + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; + } +out: + /* BPF verifier guarantees valid pointer */ + *mtu_len = mtu; + + return ret; +} + +BPF_CALL_5(bpf_xdp_check_mtu, struct xdp_buff *, xdp, + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) +{ + struct net_device *dev = xdp->rxq->dev; + int xdp_len = xdp->data_end - xdp->data; + int ret = BPF_MTU_CHK_RET_SUCCESS; + int mtu, dev_len; + + /* XDP variant doesn't support multi-buffer segment check (yet) */ + if (unlikely(flags)) + return -EINVAL; + + dev = __dev_via_ifindex(dev, ifindex); + if (unlikely(!dev)) + return -ENODEV; + + mtu = READ_ONCE(dev->mtu); + + /* Add L2-header as dev MTU is L3 size */ + dev_len = mtu + dev->hard_header_len; + + xdp_len += len_diff; /* minus result pass check */ + if (xdp_len > dev_len) + ret = BPF_MTU_CHK_RET_FRAG_NEEDED; + + /* BPF verifier guarantees valid pointer */ + *mtu_len = mtu; + + return ret; +} + +static const struct bpf_func_proto bpf_skb_check_mtu_proto = { + .func = bpf_skb_check_mtu, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_INT, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +static const struct bpf_func_proto bpf_xdp_check_mtu_proto = { + .func = bpf_xdp_check_mtu, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_INT, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len) { @@ -7021,6 +7154,14 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_CGROUP_INET6_BIND: case BPF_CGROUP_INET4_CONNECT: case BPF_CGROUP_INET6_CONNECT: + case BPF_CGROUP_UDP4_RECVMSG: + case BPF_CGROUP_UDP6_RECVMSG: + case BPF_CGROUP_UDP4_SENDMSG: + case BPF_CGROUP_UDP6_SENDMSG: + case BPF_CGROUP_INET4_GETPEERNAME: + case BPF_CGROUP_INET6_GETPEERNAME: + case BPF_CGROUP_INET4_GETSOCKNAME: + case BPF_CGROUP_INET6_GETSOCKNAME: return &bpf_sock_addr_setsockopt_proto; default: return NULL; @@ -7031,6 +7172,14 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_CGROUP_INET6_BIND: case BPF_CGROUP_INET4_CONNECT: case BPF_CGROUP_INET6_CONNECT: + case BPF_CGROUP_UDP4_RECVMSG: + case BPF_CGROUP_UDP6_RECVMSG: + case BPF_CGROUP_UDP4_SENDMSG: + case BPF_CGROUP_UDP6_SENDMSG: + case BPF_CGROUP_INET4_GETPEERNAME: + case BPF_CGROUP_INET6_GETPEERNAME: + case BPF_CGROUP_INET4_GETSOCKNAME: + case BPF_CGROUP_INET6_GETSOCKNAME: return &bpf_sock_addr_getsockopt_proto; default: return NULL; @@ -7181,6 +7330,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_socket_uid_proto; case BPF_FUNC_fib_lookup: return &bpf_skb_fib_lookup_proto; + case BPF_FUNC_check_mtu: + return &bpf_skb_check_mtu_proto; case BPF_FUNC_sk_fullsock: return &bpf_sk_fullsock_proto; case BPF_FUNC_sk_storage_get: @@ -7250,6 +7401,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_xdp_adjust_tail_proto; case BPF_FUNC_fib_lookup: return &bpf_xdp_fib_lookup_proto; + case BPF_FUNC_check_mtu: + return &bpf_xdp_check_mtu_proto; #ifdef CONFIG_INET case BPF_FUNC_sk_lookup_udp: return &bpf_xdp_sk_lookup_udp_proto; diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 25cdbb20f3a0..1261512d6807 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -669,14 +669,13 @@ static void sk_psock_destroy_deferred(struct work_struct *gc) kfree(psock); } -void sk_psock_destroy(struct rcu_head *rcu) +static void sk_psock_destroy(struct rcu_head *rcu) { struct sk_psock *psock = container_of(rcu, struct sk_psock, rcu); INIT_WORK(&psock->gc, sk_psock_destroy_deferred); schedule_work(&psock->gc); } -EXPORT_SYMBOL_GPL(sk_psock_destroy); void sk_psock_drop(struct sock *sk, struct sk_psock *psock) { diff --git a/net/core/xdp.c b/net/core/xdp.c index 3a8c9ab4ecbe..05354976c1fc 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -513,3 +513,73 @@ void xdp_warn(const char *msg, const char *func, const int line) WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg); }; EXPORT_SYMBOL_GPL(xdp_warn); + +int xdp_alloc_skb_bulk(void **skbs, int n_skb, gfp_t gfp) +{ + n_skb = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, + n_skb, skbs); + if (unlikely(!n_skb)) + return -ENOMEM; + + return 0; +} +EXPORT_SYMBOL_GPL(xdp_alloc_skb_bulk); + +struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf, + struct sk_buff *skb, + struct net_device *dev) +{ + unsigned int headroom, frame_size; + void *hard_start; + + /* Part of headroom was reserved to xdpf */ + headroom = sizeof(*xdpf) + xdpf->headroom; + + /* Memory size backing xdp_frame data already have reserved + * room for build_skb to place skb_shared_info in tailroom. + */ + frame_size = xdpf->frame_sz; + + hard_start = xdpf->data - headroom; + skb = build_skb_around(skb, hard_start, frame_size); + if (unlikely(!skb)) + return NULL; + + skb_reserve(skb, headroom); + __skb_put(skb, xdpf->len); + if (xdpf->metasize) + skb_metadata_set(skb, xdpf->metasize); + + /* Essential SKB info: protocol and skb->dev */ + skb->protocol = eth_type_trans(skb, dev); + + /* Optional SKB info, currently missing: + * - HW checksum info (skb->ip_summed) + * - HW RX hash (skb_set_hash) + * - RX ring dev queue index (skb_record_rx_queue) + */ + + /* Until page_pool get SKB return path, release DMA here */ + xdp_release_frame(xdpf); + + /* Allow SKB to reuse area used by xdp_frame */ + xdp_scrub_frame(xdpf); + + return skb; +} +EXPORT_SYMBOL_GPL(__xdp_build_skb_from_frame); + +struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf, + struct net_device *dev) +{ + struct sk_buff *skb; + + skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); + if (unlikely(!skb)) + return NULL; + + memset(skb, 0, offsetof(struct sk_buff, tail)); + + return __xdp_build_skb_from_frame(xdpf, skb, dev); +} +EXPORT_SYMBOL_GPL(xdp_build_skb_from_frame); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 2ff5d8058ce6..a02ce89b56b5 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -438,6 +438,7 @@ EXPORT_SYMBOL(inet_release); int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) { struct sock *sk = sock->sk; + u32 flags = BIND_WITH_LOCK; int err; /* If the socket has its own bind function then use it. (RAW) */ @@ -450,11 +451,12 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) /* BPF prog is run before any checks are done so that if the prog * changes context in a wrong way it will be caught. */ - err = BPF_CGROUP_RUN_PROG_INET4_BIND_LOCK(sk, uaddr); + err = BPF_CGROUP_RUN_PROG_INET_BIND_LOCK(sk, uaddr, + BPF_CGROUP_INET4_BIND, &flags); if (err) return err; - return __inet_bind(sk, uaddr, addr_len, BIND_WITH_LOCK); + return __inet_bind(sk, uaddr, addr_len, flags); } EXPORT_SYMBOL(inet_bind); @@ -499,7 +501,8 @@ int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len, snum = ntohs(addr->sin_port); err = -EACCES; - if (snum && inet_port_requires_bind_service(net, snum) && + if (!(flags & BIND_NO_CAP_NET_BIND_SERVICE) && + snum && inet_port_requires_bind_service(net, snum) && !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) goto out; @@ -777,18 +780,19 @@ int inet_getname(struct socket *sock, struct sockaddr *uaddr, return -ENOTCONN; sin->sin_port = inet->inet_dport; sin->sin_addr.s_addr = inet->inet_daddr; + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, (struct sockaddr *)sin, + BPF_CGROUP_INET4_GETPEERNAME, + NULL); } else { __be32 addr = inet->inet_rcv_saddr; if (!addr) addr = inet->inet_saddr; sin->sin_port = inet->inet_sport; sin->sin_addr.s_addr = addr; - } - if (cgroup_bpf_enabled) BPF_CGROUP_RUN_SA_PROG_LOCK(sk, (struct sockaddr *)sin, - peer ? BPF_CGROUP_INET4_GETPEERNAME : - BPF_CGROUP_INET4_GETSOCKNAME, + BPF_CGROUP_INET4_GETSOCKNAME, NULL); + } memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); return sizeof(*sin); } diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 7a6b58ae408d..a3422e42784e 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4162,6 +4162,8 @@ static int do_tcp_getsockopt(struct sock *sk, int level, return -EINVAL; lock_sock(sk); err = tcp_zerocopy_receive(sk, &zc, &tss); + err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname, + &zc, &len, err); release_sock(sk); if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags)) goto zerocopy_rcv_cmsg; @@ -4208,6 +4210,18 @@ zerocopy_rcv_out: return 0; } +bool tcp_bpf_bypass_getsockopt(int level, int optname) +{ + /* TCP do_tcp_getsockopt has optimized getsockopt implementation + * to avoid extra socket lock for TCP_ZEROCOPY_RECEIVE. + */ + if (level == SOL_TCP && optname == TCP_ZEROCOPY_RECEIVE) + return true; + + return false; +} +EXPORT_SYMBOL(tcp_bpf_bypass_getsockopt); + int tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen) { diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 611039207d30..daad4f99db32 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2796,6 +2796,7 @@ struct proto tcp_prot = { .shutdown = tcp_shutdown, .setsockopt = tcp_setsockopt, .getsockopt = tcp_getsockopt, + .bpf_bypass_getsockopt = tcp_bpf_bypass_getsockopt, .keepalive = tcp_set_keepalive, .recvmsg = tcp_recvmsg, .sendmsg = tcp_sendmsg, diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 48208fb4e895..4a0478b17243 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1130,7 +1130,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) rcu_read_unlock(); } - if (cgroup_bpf_enabled && !connected) { + if (cgroup_bpf_enabled(BPF_CGROUP_UDP4_SENDMSG) && !connected) { err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, (struct sockaddr *)usin, &ipc.addr); if (err) @@ -1864,9 +1864,8 @@ try_again: memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); *addr_len = sizeof(*sin); - if (cgroup_bpf_enabled) - BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, - (struct sockaddr *)sin); + BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, + (struct sockaddr *)sin); } if (udp_sk(sk)->gro_enabled) diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index 0e9994e0ecd7..1fb75f01756c 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -295,7 +295,8 @@ static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len, return -EINVAL; snum = ntohs(addr->sin6_port); - if (snum && inet_port_requires_bind_service(net, snum) && + if (!(flags & BIND_NO_CAP_NET_BIND_SERVICE) && + snum && inet_port_requires_bind_service(net, snum) && !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) return -EACCES; @@ -439,6 +440,7 @@ out_unlock: int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) { struct sock *sk = sock->sk; + u32 flags = BIND_WITH_LOCK; int err = 0; /* If the socket has its own bind function then use it. */ @@ -451,11 +453,12 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) /* BPF prog is run before any checks are done so that if the prog * changes context in a wrong way it will be caught. */ - err = BPF_CGROUP_RUN_PROG_INET6_BIND_LOCK(sk, uaddr); + err = BPF_CGROUP_RUN_PROG_INET_BIND_LOCK(sk, uaddr, + BPF_CGROUP_INET6_BIND, &flags); if (err) return err; - return __inet6_bind(sk, uaddr, addr_len, BIND_WITH_LOCK); + return __inet6_bind(sk, uaddr, addr_len, flags); } EXPORT_SYMBOL(inet6_bind); @@ -527,18 +530,19 @@ int inet6_getname(struct socket *sock, struct sockaddr *uaddr, sin->sin6_addr = sk->sk_v6_daddr; if (np->sndflow) sin->sin6_flowinfo = np->flow_label; + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, (struct sockaddr *)sin, + BPF_CGROUP_INET6_GETPEERNAME, + NULL); } else { if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) sin->sin6_addr = np->saddr; else sin->sin6_addr = sk->sk_v6_rcv_saddr; sin->sin6_port = inet->inet_sport; - } - if (cgroup_bpf_enabled) BPF_CGROUP_RUN_SA_PROG_LOCK(sk, (struct sockaddr *)sin, - peer ? BPF_CGROUP_INET6_GETPEERNAME : - BPF_CGROUP_INET6_GETSOCKNAME, + BPF_CGROUP_INET6_GETSOCKNAME, NULL); + } sin->sin6_scope_id = ipv6_iface_scope_id(&sin->sin6_addr, sk->sk_bound_dev_if); return sizeof(*sin); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index d093ef3ef060..bd44ded7e50c 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -2124,6 +2124,7 @@ struct proto tcpv6_prot = { .shutdown = tcp_shutdown, .setsockopt = tcp_setsockopt, .getsockopt = tcp_getsockopt, + .bpf_bypass_getsockopt = tcp_bpf_bypass_getsockopt, .keepalive = tcp_set_keepalive, .recvmsg = tcp_recvmsg, .sendmsg = tcp_sendmsg, diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index d75429205084..d25e5a9252fd 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -409,9 +409,8 @@ try_again: } *addr_len = sizeof(*sin6); - if (cgroup_bpf_enabled) - BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, - (struct sockaddr *)sin6); + BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, + (struct sockaddr *)sin6); } if (udp_sk(sk)->gro_enabled) @@ -1462,7 +1461,7 @@ do_udp_sendmsg: fl6.saddr = np->saddr; fl6.fl6_sport = inet->inet_sport; - if (cgroup_bpf_enabled && !connected) { + if (cgroup_bpf_enabled(BPF_CGROUP_UDP6_SENDMSG) && !connected) { err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, (struct sockaddr *)sin6, &fl6.saddr); if (err) diff --git a/net/socket.c b/net/socket.c index 33e8b6c4e1d3..7f0617ab5437 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2126,6 +2126,9 @@ SYSCALL_DEFINE5(setsockopt, int, fd, int, level, int, optname, return __sys_setsockopt(fd, level, optname, optval, optlen); } +INDIRECT_CALLABLE_DECLARE(bool tcp_bpf_bypass_getsockopt(int level, + int optname)); + /* * Get a socket option. Because we don't know the option lengths we have * to pass a user mode parameter for the protocols to sort out. diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4a83117507f5..4faabd1ecfd1 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -184,12 +184,13 @@ static void xsk_copy_xdp(struct xdp_buff *to, struct xdp_buff *from, u32 len) memcpy(to_buf, from_buf, len + metalen); } -static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len, - bool explicit_free) +static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) { struct xdp_buff *xsk_xdp; int err; + u32 len; + len = xdp->data_end - xdp->data; if (len > xsk_pool_get_rx_frame_size(xs->pool)) { xs->rx_dropped++; return -ENOSPC; @@ -207,8 +208,6 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len, xsk_buff_free(xsk_xdp); return err; } - if (explicit_free) - xdp_return_buff(xdp); return 0; } @@ -230,11 +229,8 @@ static bool xsk_is_bound(struct xdp_sock *xs) return false; } -static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, - bool explicit_free) +static int xsk_rcv_check(struct xdp_sock *xs, struct xdp_buff *xdp) { - u32 len; - if (!xsk_is_bound(xs)) return -EINVAL; @@ -242,11 +238,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, return -EINVAL; sk_mark_napi_id_once_xdp(&xs->sk, xdp); - len = xdp->data_end - xdp->data; - - return xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL ? - __xsk_rcv_zc(xs, xdp, len) : - __xsk_rcv(xs, xdp, len, explicit_free); + return 0; } static void xsk_flush(struct xdp_sock *xs) @@ -261,18 +253,41 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) int err; spin_lock_bh(&xs->rx_lock); - err = xsk_rcv(xs, xdp, false); - xsk_flush(xs); + err = xsk_rcv_check(xs, xdp); + if (!err) { + err = __xsk_rcv(xs, xdp); + xsk_flush(xs); + } spin_unlock_bh(&xs->rx_lock); return err; } +static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) +{ + int err; + u32 len; + + err = xsk_rcv_check(xs, xdp); + if (err) + return err; + + if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL) { + len = xdp->data_end - xdp->data; + return __xsk_rcv_zc(xs, xdp, len); + } + + err = __xsk_rcv(xs, xdp); + if (!err) + xdp_return_buff(xdp); + return err; +} + int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp) { struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list); int err; - err = xsk_rcv(xs, xdp, true); + err = xsk_rcv(xs, xdp); if (err) return err; diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c index 20598eea658c..8de01aaac4a0 100644 --- a/net/xdp/xsk_buff_pool.c +++ b/net/xdp/xsk_buff_pool.c @@ -119,8 +119,8 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool) } } -static int __xp_assign_dev(struct xsk_buff_pool *pool, - struct net_device *netdev, u16 queue_id, u16 flags) +int xp_assign_dev(struct xsk_buff_pool *pool, + struct net_device *netdev, u16 queue_id, u16 flags) { bool force_zc, force_copy; struct netdev_bpf bpf; @@ -191,12 +191,6 @@ err_unreg_pool: return err; } -int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *dev, - u16 queue_id, u16 flags) -{ - return __xp_assign_dev(pool, dev, queue_id, flags); -} - int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem, struct net_device *dev, u16 queue_id) { @@ -210,7 +204,7 @@ int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem, if (pool->uses_need_wakeup) flags |= XDP_USE_NEED_WAKEUP; - return __xp_assign_dev(pool, dev, queue_id, flags); + return xp_assign_dev(pool, dev, queue_id, flags); } void xp_clear_dev(struct xsk_buff_pool *pool) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 26fc96ca619e..45ceca4e2c70 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -183,6 +183,14 @@ BPF_EXTRA_CFLAGS := $(ARM_ARCH_SELECTOR) TPROGS_CFLAGS += $(ARM_ARCH_SELECTOR) endif +ifeq ($(ARCH), mips) +TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__ +ifdef CONFIG_MACH_LOONGSON64 +BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-loongson64 +BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-generic +endif +endif + TPROGS_CFLAGS += -Wall -O2 TPROGS_CFLAGS += -Wmissing-prototypes TPROGS_CFLAGS += -Wstrict-prototypes @@ -208,7 +216,7 @@ TPROGLDLIBS_xdpsock += -pthread -lcap TPROGLDLIBS_xsk_fwd += -pthread # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: -# make M=samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang +# make M=samples/bpf LLC=~/git/llvm-project/llvm/build/bin/llc CLANG=~/git/llvm-project/llvm/build/bin/clang LLC ?= llc CLANG ?= clang OPT ?= opt diff --git a/samples/bpf/README.rst b/samples/bpf/README.rst index dd34b2d26f1c..60c6494adb1b 100644 --- a/samples/bpf/README.rst +++ b/samples/bpf/README.rst @@ -62,20 +62,26 @@ To generate a smaller llc binary one can use:: -DLLVM_TARGETS_TO_BUILD="BPF" +We recommend that developers who want the fastest incremental builds +use the Ninja build system, you can find it in your system's package +manager, usually the package is ninja or ninja-build. + Quick sniplet for manually compiling LLVM and clang -(build dependencies are cmake and gcc-c++):: +(build dependencies are ninja, cmake and gcc-c++):: - $ git clone http://llvm.org/git/llvm.git - $ cd llvm/tools - $ git clone --depth 1 http://llvm.org/git/clang.git - $ cd ..; mkdir build; cd build - $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" - $ make -j $(getconf _NPROCESSORS_ONLN) + $ git clone https://github.com/llvm/llvm-project.git + $ mkdir -p llvm-project/llvm/build + $ cd llvm-project/llvm/build + $ cmake .. -G "Ninja" -DLLVM_TARGETS_TO_BUILD="BPF;X86" \ + -DLLVM_ENABLE_PROJECTS="clang" \ + -DCMAKE_BUILD_TYPE=Release \ + -DLLVM_BUILD_RUNTIME=OFF + $ ninja It is also possible to point make to the newly compiled 'llc' or 'clang' command via redefining LLC or CLANG on the make command line:: - make M=samples/bpf LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang + make M=samples/bpf LLC=~/git/llvm-project/llvm/build/bin/llc CLANG=~/git/llvm-project/llvm/build/bin/clang Cross compiling samples ----------------------- diff --git a/samples/bpf/bpf_insn.h b/samples/bpf/bpf_insn.h index db67a2847395..aee04534483a 100644 --- a/samples/bpf/bpf_insn.h +++ b/samples/bpf/bpf_insn.h @@ -134,15 +134,31 @@ struct bpf_insn; .off = OFF, \ .imm = 0 }) -/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */ - -#define BPF_STX_XADD(SIZE, DST, SRC, OFF) \ +/* + * Atomic operations: + * + * BPF_ADD *(uint *) (dst_reg + off16) += src_reg + * BPF_AND *(uint *) (dst_reg + off16) &= src_reg + * BPF_OR *(uint *) (dst_reg + off16) |= src_reg + * BPF_XOR *(uint *) (dst_reg + off16) ^= src_reg + * BPF_ADD | BPF_FETCH src_reg = atomic_fetch_add(dst_reg + off16, src_reg); + * BPF_AND | BPF_FETCH src_reg = atomic_fetch_and(dst_reg + off16, src_reg); + * BPF_OR | BPF_FETCH src_reg = atomic_fetch_or(dst_reg + off16, src_reg); + * BPF_XOR | BPF_FETCH src_reg = atomic_fetch_xor(dst_reg + off16, src_reg); + * BPF_XCHG src_reg = atomic_xchg(dst_reg + off16, src_reg) + * BPF_CMPXCHG r0 = atomic_cmpxchg(dst_reg + off16, r0, src_reg) + */ + +#define BPF_ATOMIC_OP(SIZE, OP, DST, SRC, OFF) \ ((struct bpf_insn) { \ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_ATOMIC, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ - .imm = BPF_ADD }) + .imm = OP }) + +/* Legacy alias */ +#define BPF_STX_XADD(SIZE, DST, SRC, OFF) BPF_ATOMIC_OP(SIZE, BPF_ADD, DST, SRC, OFF) /* Memory store, *(uint *) (dst_reg + off16) = imm32 */ diff --git a/samples/bpf/cookie_uid_helper_example.c b/samples/bpf/cookie_uid_helper_example.c index c5ff7a13918c..cc3bce8d3aac 100644 --- a/samples/bpf/cookie_uid_helper_example.c +++ b/samples/bpf/cookie_uid_helper_example.c @@ -313,7 +313,7 @@ int main(int argc, char *argv[]) print_table(); printf("\n"); sleep(1); - }; + } } else if (cfg_test_cookie) { udp_client(); } diff --git a/samples/bpf/xdp_redirect_map_kern.c b/samples/bpf/xdp_redirect_map_kern.c index 6489352ab7a4..a92b8e567bdd 100644 --- a/samples/bpf/xdp_redirect_map_kern.c +++ b/samples/bpf/xdp_redirect_map_kern.c @@ -19,12 +19,22 @@ #include <linux/ipv6.h> #include <bpf/bpf_helpers.h> +/* The 2nd xdp prog on egress does not support skb mode, so we define two + * maps, tx_port_general and tx_port_native. + */ struct { __uint(type, BPF_MAP_TYPE_DEVMAP); __uint(key_size, sizeof(int)); __uint(value_size, sizeof(int)); __uint(max_entries, 100); -} tx_port SEC(".maps"); +} tx_port_general SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_DEVMAP); + __uint(key_size, sizeof(int)); + __uint(value_size, sizeof(struct bpf_devmap_val)); + __uint(max_entries, 100); +} tx_port_native SEC(".maps"); /* Count RX packets, as XDP bpf_prog doesn't get direct TX-success * feedback. Redirect TX errors can be caught via a tracepoint. @@ -36,6 +46,14 @@ struct { __uint(max_entries, 1); } rxcnt SEC(".maps"); +/* map to store egress interface mac address */ +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, u32); + __type(value, __be64); + __uint(max_entries, 1); +} tx_mac SEC(".maps"); + static void swap_src_dst_mac(void *data) { unsigned short *p = data; @@ -52,17 +70,16 @@ static void swap_src_dst_mac(void *data) p[5] = dst[2]; } -SEC("xdp_redirect_map") -int xdp_redirect_map_prog(struct xdp_md *ctx) +static __always_inline int xdp_redirect_map(struct xdp_md *ctx, void *redirect_map) { void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; struct ethhdr *eth = data; int rc = XDP_DROP; - int vport, port = 0, m = 0; long *value; u32 key = 0; u64 nh_off; + int vport; nh_off = sizeof(*eth); if (data + nh_off > data_end) @@ -79,7 +96,40 @@ int xdp_redirect_map_prog(struct xdp_md *ctx) swap_src_dst_mac(data); /* send packet out physical port */ - return bpf_redirect_map(&tx_port, vport, 0); + return bpf_redirect_map(redirect_map, vport, 0); +} + +SEC("xdp_redirect_general") +int xdp_redirect_map_general(struct xdp_md *ctx) +{ + return xdp_redirect_map(ctx, &tx_port_general); +} + +SEC("xdp_redirect_native") +int xdp_redirect_map_native(struct xdp_md *ctx) +{ + return xdp_redirect_map(ctx, &tx_port_native); +} + +SEC("xdp_devmap/map_prog") +int xdp_redirect_map_egress(struct xdp_md *ctx) +{ + void *data_end = (void *)(long)ctx->data_end; + void *data = (void *)(long)ctx->data; + struct ethhdr *eth = data; + __be64 *mac; + u32 key = 0; + u64 nh_off; + + nh_off = sizeof(*eth); + if (data + nh_off > data_end) + return XDP_DROP; + + mac = bpf_map_lookup_elem(&tx_mac, &key); + if (mac) + __builtin_memcpy(eth->h_source, mac, ETH_ALEN); + + return XDP_PASS; } /* Redirect require an XDP bpf_prog loaded on the TX device */ diff --git a/samples/bpf/xdp_redirect_map_user.c b/samples/bpf/xdp_redirect_map_user.c index 31131b6e7782..0e8192688dfc 100644 --- a/samples/bpf/xdp_redirect_map_user.c +++ b/samples/bpf/xdp_redirect_map_user.c @@ -14,6 +14,10 @@ #include <unistd.h> #include <libgen.h> #include <sys/resource.h> +#include <sys/ioctl.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <netinet/in.h> #include "bpf_util.h" #include <bpf/bpf.h> @@ -22,6 +26,7 @@ static int ifindex_in; static int ifindex_out; static bool ifindex_out_xdp_dummy_attached = true; +static bool xdp_devmap_attached; static __u32 prog_id; static __u32 dummy_prog_id; @@ -83,6 +88,32 @@ static void poll_stats(int interval, int ifindex) } } +static int get_mac_addr(unsigned int ifindex_out, void *mac_addr) +{ + char ifname[IF_NAMESIZE]; + struct ifreq ifr; + int fd, ret = -1; + + fd = socket(AF_INET, SOCK_DGRAM, 0); + if (fd < 0) + return ret; + + if (!if_indextoname(ifindex_out, ifname)) + goto err_out; + + strcpy(ifr.ifr_name, ifname); + + if (ioctl(fd, SIOCGIFHWADDR, &ifr) != 0) + goto err_out; + + memcpy(mac_addr, ifr.ifr_hwaddr.sa_data, 6 * sizeof(char)); + ret = 0; + +err_out: + close(fd); + return ret; +} + static void usage(const char *prog) { fprintf(stderr, @@ -90,24 +121,26 @@ static void usage(const char *prog) "OPTS:\n" " -S use skb-mode\n" " -N enforce native mode\n" - " -F force loading prog\n", + " -F force loading prog\n" + " -X load xdp program on egress\n", prog); } int main(int argc, char **argv) { struct bpf_prog_load_attr prog_load_attr = { - .prog_type = BPF_PROG_TYPE_XDP, + .prog_type = BPF_PROG_TYPE_UNSPEC, }; - struct bpf_program *prog, *dummy_prog; + struct bpf_program *prog, *dummy_prog, *devmap_prog; + int prog_fd, dummy_prog_fd, devmap_prog_fd = 0; + int tx_port_map_fd, tx_mac_map_fd; + struct bpf_devmap_val devmap_val; struct bpf_prog_info info = {}; __u32 info_len = sizeof(info); - int prog_fd, dummy_prog_fd; - const char *optstr = "FSN"; + const char *optstr = "FSNX"; struct bpf_object *obj; int ret, opt, key = 0; char filename[256]; - int tx_port_map_fd; while ((opt = getopt(argc, argv, optstr)) != -1) { switch (opt) { @@ -120,14 +153,21 @@ int main(int argc, char **argv) case 'F': xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST; break; + case 'X': + xdp_devmap_attached = true; + break; default: usage(basename(argv[0])); return 1; } } - if (!(xdp_flags & XDP_FLAGS_SKB_MODE)) + if (!(xdp_flags & XDP_FLAGS_SKB_MODE)) { xdp_flags |= XDP_FLAGS_DRV_MODE; + } else if (xdp_devmap_attached) { + printf("Load xdp program on egress with SKB mode not supported yet\n"); + return 1; + } if (optind == argc) { printf("usage: %s <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n", argv[0]); @@ -150,24 +190,28 @@ int main(int argc, char **argv) if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd)) return 1; - prog = bpf_program__next(NULL, obj); - dummy_prog = bpf_program__next(prog, obj); - if (!prog || !dummy_prog) { - printf("finding a prog in obj file failed\n"); - return 1; + if (xdp_flags & XDP_FLAGS_SKB_MODE) { + prog = bpf_object__find_program_by_name(obj, "xdp_redirect_map_general"); + tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, "tx_port_general"); + } else { + prog = bpf_object__find_program_by_name(obj, "xdp_redirect_map_native"); + tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, "tx_port_native"); + } + dummy_prog = bpf_object__find_program_by_name(obj, "xdp_redirect_dummy_prog"); + if (!prog || dummy_prog < 0 || tx_port_map_fd < 0) { + printf("finding prog/dummy_prog/tx_port_map in obj file failed\n"); + goto out; } - /* bpf_prog_load_xattr gives us the pointer to first prog's fd, - * so we're missing only the fd for dummy prog - */ + prog_fd = bpf_program__fd(prog); dummy_prog_fd = bpf_program__fd(dummy_prog); - if (prog_fd < 0 || dummy_prog_fd < 0) { + if (prog_fd < 0 || dummy_prog_fd < 0 || tx_port_map_fd < 0) { printf("bpf_prog_load_xattr: %s\n", strerror(errno)); return 1; } - tx_port_map_fd = bpf_object__find_map_fd_by_name(obj, "tx_port"); + tx_mac_map_fd = bpf_object__find_map_fd_by_name(obj, "tx_mac"); rxcnt_map_fd = bpf_object__find_map_fd_by_name(obj, "rxcnt"); - if (tx_port_map_fd < 0 || rxcnt_map_fd < 0) { + if (tx_mac_map_fd < 0 || rxcnt_map_fd < 0) { printf("bpf_object__find_map_fd_by_name failed\n"); return 1; } @@ -199,11 +243,39 @@ int main(int argc, char **argv) } dummy_prog_id = info.id; + /* Load 2nd xdp prog on egress. */ + if (xdp_devmap_attached) { + unsigned char mac_addr[6]; + + devmap_prog = bpf_object__find_program_by_name(obj, "xdp_redirect_map_egress"); + if (!devmap_prog) { + printf("finding devmap_prog in obj file failed\n"); + goto out; + } + devmap_prog_fd = bpf_program__fd(devmap_prog); + if (devmap_prog_fd < 0) { + printf("finding devmap_prog fd failed\n"); + goto out; + } + + if (get_mac_addr(ifindex_out, mac_addr) < 0) { + printf("get interface %d mac failed\n", ifindex_out); + goto out; + } + + ret = bpf_map_update_elem(tx_mac_map_fd, &key, mac_addr, 0); + if (ret) { + perror("bpf_update_elem tx_mac_map_fd"); + goto out; + } + } + signal(SIGINT, int_exit); signal(SIGTERM, int_exit); - /* populate virtual to physical port map */ - ret = bpf_map_update_elem(tx_port_map_fd, &key, &ifindex_out, 0); + devmap_val.ifindex = ifindex_out; + devmap_val.bpf_prog.fd = devmap_prog_fd; + ret = bpf_map_update_elem(tx_port_map_fd, &key, &devmap_val, 0); if (ret) { perror("bpf_update_elem"); goto out; diff --git a/tools/bpf/bpf_dbg.c b/tools/bpf/bpf_dbg.c index a0ebcdf59c31..a07dfc479270 100644 --- a/tools/bpf/bpf_dbg.c +++ b/tools/bpf/bpf_dbg.c @@ -890,7 +890,7 @@ static int bpf_run_stepping(struct sock_filter *f, uint16_t bpf_len, bool stop = false; int i = 1; - while (bpf_curr.Rs == false && stop == false) { + while (!bpf_curr.Rs && !stop) { bpf_safe_regs(); if (i++ == next) diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile index 45ac2f9e0aa9..8ced1655fea6 100644 --- a/tools/bpf/bpftool/Makefile +++ b/tools/bpf/bpftool/Makefile @@ -75,8 +75,6 @@ endif INSTALL ?= install RM ?= rm -f -CLANG ?= clang -LLVM_STRIP ?= llvm-strip FEATURE_USER = .bpftool FEATURE_TESTS = libbfd disassembler-four-args reallocarray zlib libcap \ diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index 1fe3ba255bad..f2b915b20546 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -368,6 +368,8 @@ static void print_prog_header_json(struct bpf_prog_info *info) jsonw_uint_field(json_wtr, "run_time_ns", info->run_time_ns); jsonw_uint_field(json_wtr, "run_cnt", info->run_cnt); } + if (info->recursion_misses) + jsonw_uint_field(json_wtr, "recursion_misses", info->recursion_misses); } static void print_prog_json(struct bpf_prog_info *info, int fd) @@ -446,6 +448,8 @@ static void print_prog_header_plain(struct bpf_prog_info *info) if (info->run_time_ns) printf(" run_time_ns %lld run_cnt %lld", info->run_time_ns, info->run_cnt); + if (info->recursion_misses) + printf(" recursion_misses %lld", info->recursion_misses); printf("\n"); } diff --git a/tools/bpf/resolve_btfids/.gitignore b/tools/bpf/resolve_btfids/.gitignore index a026df7dc280..16913fffc985 100644 --- a/tools/bpf/resolve_btfids/.gitignore +++ b/tools/bpf/resolve_btfids/.gitignore @@ -1,4 +1,3 @@ -/FEATURE-DUMP.libbpf -/bpf_helper_defs.h /fixdep /resolve_btfids +/libbpf/ diff --git a/tools/bpf/resolve_btfids/Makefile b/tools/bpf/resolve_btfids/Makefile index bf656432ad73..bb9fa8de7e62 100644 --- a/tools/bpf/resolve_btfids/Makefile +++ b/tools/bpf/resolve_btfids/Makefile @@ -2,11 +2,7 @@ include ../../scripts/Makefile.include include ../../scripts/Makefile.arch -ifeq ($(srctree),) -srctree := $(patsubst %/,%,$(dir $(CURDIR))) -srctree := $(patsubst %/,%,$(dir $(srctree))) -srctree := $(patsubst %/,%,$(dir $(srctree))) -endif +srctree := $(abspath $(CURDIR)/../../../) ifeq ($(V),1) Q = @@ -22,28 +18,29 @@ AR = $(HOSTAR) CC = $(HOSTCC) LD = $(HOSTLD) ARCH = $(HOSTARCH) +RM ?= rm OUTPUT ?= $(srctree)/tools/bpf/resolve_btfids/ LIBBPF_SRC := $(srctree)/tools/lib/bpf/ SUBCMD_SRC := $(srctree)/tools/lib/subcmd/ -BPFOBJ := $(OUTPUT)/libbpf.a -SUBCMDOBJ := $(OUTPUT)/libsubcmd.a +BPFOBJ := $(OUTPUT)/libbpf/libbpf.a +SUBCMDOBJ := $(OUTPUT)/libsubcmd/libsubcmd.a BINARY := $(OUTPUT)/resolve_btfids BINARY_IN := $(BINARY)-in.o all: $(BINARY) -$(OUTPUT): +$(OUTPUT) $(OUTPUT)/libbpf $(OUTPUT)/libsubcmd: $(call msg,MKDIR,,$@) - $(Q)mkdir -p $(OUTPUT) + $(Q)mkdir -p $(@) -$(SUBCMDOBJ): fixdep FORCE - $(Q)$(MAKE) -C $(SUBCMD_SRC) OUTPUT=$(OUTPUT) +$(SUBCMDOBJ): fixdep FORCE | $(OUTPUT)/libsubcmd + $(Q)$(MAKE) -C $(SUBCMD_SRC) OUTPUT=$(abspath $(dir $@))/ $(abspath $@) -$(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT) +$(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT)/libbpf $(Q)$(MAKE) $(submake_extras) -C $(LIBBPF_SRC) OUTPUT=$(abspath $(dir $@))/ $(abspath $@) CFLAGS := -g \ @@ -57,24 +54,27 @@ LIBS = -lelf -lz export srctree OUTPUT CFLAGS Q include $(srctree)/tools/build/Makefile.include -$(BINARY_IN): fixdep FORCE +$(BINARY_IN): fixdep FORCE | $(OUTPUT) $(Q)$(MAKE) $(build)=resolve_btfids $(BINARY): $(BPFOBJ) $(SUBCMDOBJ) $(BINARY_IN) $(call msg,LINK,$@) $(Q)$(CC) $(BINARY_IN) $(LDFLAGS) -o $@ $(BPFOBJ) $(SUBCMDOBJ) $(LIBS) -libsubcmd-clean: - $(Q)$(MAKE) -C $(SUBCMD_SRC) OUTPUT=$(OUTPUT) clean - -libbpf-clean: - $(Q)$(MAKE) -C $(LIBBPF_SRC) OUTPUT=$(OUTPUT) clean +clean_objects := $(wildcard $(OUTPUT)/*.o \ + $(OUTPUT)/.*.o.cmd \ + $(OUTPUT)/.*.o.d \ + $(OUTPUT)/libbpf \ + $(OUTPUT)/libsubcmd \ + $(OUTPUT)/resolve_btfids) -clean: libsubcmd-clean libbpf-clean fixdep-clean +ifneq ($(clean_objects),) +clean: fixdep-clean $(call msg,CLEAN,$(BINARY)) - $(Q)$(RM) -f $(BINARY); \ - $(RM) -rf $(if $(OUTPUT),$(OUTPUT),.)/feature; \ - find $(if $(OUTPUT),$(OUTPUT),.) -name \*.o -or -name \*.o.cmd -or -name \*.o.d | xargs $(RM) + $(Q)$(RM) -rf $(clean_objects) +else +clean: +endif tags: $(call msg,GEN,,tags) diff --git a/tools/bpf/runqslower/Makefile b/tools/bpf/runqslower/Makefile index 4d5ca54fcd4c..9d9fb6209be1 100644 --- a/tools/bpf/runqslower/Makefile +++ b/tools/bpf/runqslower/Makefile @@ -3,9 +3,6 @@ include ../../scripts/Makefile.include OUTPUT ?= $(abspath .output)/ -CLANG ?= clang -LLC ?= llc -LLVM_STRIP ?= llvm-strip BPFTOOL_OUTPUT := $(OUTPUT)bpftool/ DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)bpftool BPFTOOL ?= $(DEFAULT_BPFTOOL) diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile index 89ba522e377d..3e55edb3ea54 100644 --- a/tools/build/feature/Makefile +++ b/tools/build/feature/Makefile @@ -1,4 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 +include ../../scripts/Makefile.include + FILES= \ test-all.bin \ test-backtrace.bin \ @@ -76,8 +78,6 @@ FILES= \ FILES := $(addprefix $(OUTPUT),$(FILES)) PKG_CONFIG ?= $(CROSS_COMPILE)pkg-config -LLVM_CONFIG ?= llvm-config -CLANG ?= clang all: $(FILES) diff --git a/tools/include/linux/types.h b/tools/include/linux/types.h index 154eb4e3ca7c..e9c5a215837d 100644 --- a/tools/include/linux/types.h +++ b/tools/include/linux/types.h @@ -6,7 +6,10 @@ #include <stddef.h> #include <stdint.h> +#ifndef __SANE_USERSPACE_TYPES__ #define __SANE_USERSPACE_TYPES__ /* For PPC64, to get LL64 types */ +#endif + #include <asm/types.h> #include <asm/posix_types.h> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index c001766adcbc..4c24daa43bac 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1656,22 +1656,30 @@ union bpf_attr { * networking traffic statistics as it provides a global socket * identifier that can be assumed unique. * Return - * A 8-byte long non-decreasing number on success, or 0 if the - * socket field is missing inside *skb*. + * A 8-byte long unique number on success, or 0 if the socket + * field is missing inside *skb*. * * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) * Description * Equivalent to bpf_get_socket_cookie() helper that accepts * *skb*, but gets socket from **struct bpf_sock_addr** context. * Return - * A 8-byte long non-decreasing number. + * A 8-byte long unique number. * * u64 bpf_get_socket_cookie(struct bpf_sock_ops *ctx) * Description * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts * *skb*, but gets socket from **struct bpf_sock_ops** context. * Return - * A 8-byte long non-decreasing number. + * A 8-byte long unique number. + * + * u64 bpf_get_socket_cookie(struct sock *sk) + * Description + * Equivalent to **bpf_get_socket_cookie**\ () helper that accepts + * *sk*, but gets socket from a BTF **struct sock**. This helper + * also works for sleepable programs. + * Return + * A 8-byte long unique number or 0 if *sk* is NULL. * * u32 bpf_get_socket_uid(struct sk_buff *skb) * Return @@ -2231,6 +2239,9 @@ union bpf_attr { * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the * packet is not forwarded or needs assist from full stack * + * If lookup fails with BPF_FIB_LKUP_RET_FRAG_NEEDED, then the MTU + * was exceeded and output params->mtu_result contains the MTU. + * * long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) * Description * Add an entry to, or update a sockhash *map* referencing sockets. @@ -3836,6 +3847,69 @@ union bpf_attr { * Return * A pointer to a struct socket on success or NULL if the file is * not a socket. + * + * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) + * Description + + * Check ctx packet size against exceeding MTU of net device (based + * on *ifindex*). This helper will likely be used in combination + * with helpers that adjust/change the packet size. + * + * The argument *len_diff* can be used for querying with a planned + * size change. This allows to check MTU prior to changing packet + * ctx. Providing an *len_diff* adjustment that is larger than the + * actual packet size (resulting in negative packet size) will in + * principle not exceed the MTU, why it is not considered a + * failure. Other BPF-helpers are needed for performing the + * planned size change, why the responsability for catch a negative + * packet size belong in those helpers. + * + * Specifying *ifindex* zero means the MTU check is performed + * against the current net device. This is practical if this isn't + * used prior to redirect. + * + * The Linux kernel route table can configure MTUs on a more + * specific per route level, which is not provided by this helper. + * For route level MTU checks use the **bpf_fib_lookup**\ () + * helper. + * + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** for tc cls_act programs. + * + * The *flags* argument can be a combination of one or more of the + * following values: + * + * **BPF_MTU_CHK_SEGS** + * This flag will only works for *ctx* **struct sk_buff**. + * If packet context contains extra packet segment buffers + * (often knows as GSO skb), then MTU check is harder to + * check at this point, because in transmit path it is + * possible for the skb packet to get re-segmented + * (depending on net device features). This could still be + * a MTU violation, so this flag enables performing MTU + * check against segments, with a different violation + * return code to tell it apart. Check cannot use len_diff. + * + * On return *mtu_len* pointer contains the MTU value of the net + * device. Remember the net device configured MTU is the L3 size, + * which is returned here and XDP and TX length operate at L2. + * Helper take this into account for you, but remember when using + * MTU value in your BPF-code. On input *mtu_len* must be a valid + * pointer and be initialized (to zero), else verifier will reject + * BPF program. + * + * Return + * * 0 on success, and populate MTU value in *mtu_len* pointer. + * + * * < 0 if any input argument is invalid (*mtu_len* not updated) + * + * MTU violations return positive values, but also populate MTU + * value in *mtu_len* pointer, as this can be needed for + * implementing PMTU handing: + * + * * **BPF_MTU_CHK_RET_FRAG_NEEDED** + * * **BPF_MTU_CHK_RET_SEGS_TOOBIG** + * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -4001,6 +4075,7 @@ union bpf_attr { FN(ktime_get_coarse_ns), \ FN(ima_inode_hash), \ FN(sock_from_file), \ + FN(check_mtu), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -4501,6 +4576,7 @@ struct bpf_prog_info { __aligned_u64 prog_tags; __u64 run_time_ns; __u64 run_cnt; + __u64 recursion_misses; } __attribute__((aligned(8))); struct bpf_map_info { @@ -4981,9 +5057,13 @@ struct bpf_fib_lookup { __be16 sport; __be16 dport; - /* total length of packet from network header - used for MTU check */ - __u16 tot_len; + union { /* used for MTU check */ + /* input to lookup */ + __u16 tot_len; /* L3 length from network hdr (iph->tot_len) */ + /* output: MTU value */ + __u16 mtu_result; + }; /* input: L3 device index for lookup * output: device index from FIB lookup */ @@ -5029,6 +5109,17 @@ struct bpf_redir_neigh { }; }; +/* bpf_check_mtu flags*/ +enum bpf_check_mtu_flags { + BPF_MTU_CHK_SEGS = (1U << 0), +}; + +enum bpf_check_mtu_ret { + BPF_MTU_CHK_RET_SUCCESS, /* check and lookup successful */ + BPF_MTU_CHK_RET_FRAG_NEEDED, /* fragmentation required to fwd */ + BPF_MTU_CHK_RET_SEGS_TOOBIG, /* GSO re-segmentation needed to fwd */ +}; + enum bpf_task_fd_type { BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */ BPF_FD_TYPE_TRACEPOINT, /* tp name */ diff --git a/tools/include/uapi/linux/bpf_perf_event.h b/tools/include/uapi/linux/bpf_perf_event.h index 8f95303f9d80..eb1b9d21250c 100644 --- a/tools/include/uapi/linux/bpf_perf_event.h +++ b/tools/include/uapi/linux/bpf_perf_event.h @@ -13,6 +13,7 @@ struct bpf_perf_event_data { bpf_user_pt_regs_t regs; __u64 sample_period; + __u64 addr; }; #endif /* _UAPI__LINUX_BPF_PERF_EVENT_H__ */ diff --git a/tools/include/uapi/linux/tcp.h b/tools/include/uapi/linux/tcp.h new file mode 100644 index 000000000000..13ceeb395eb8 --- /dev/null +++ b/tools/include/uapi/linux/tcp.h @@ -0,0 +1,357 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +/* + * INET An implementation of the TCP/IP protocol suite for the LINUX + * operating system. INET is implemented using the BSD Socket + * interface as the means of communication with the user level. + * + * Definitions for the TCP protocol. + * + * Version: @(#)tcp.h 1.0.2 04/28/93 + * + * Author: Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#ifndef _UAPI_LINUX_TCP_H +#define _UAPI_LINUX_TCP_H + +#include <linux/types.h> +#include <asm/byteorder.h> +#include <linux/socket.h> + +struct tcphdr { + __be16 source; + __be16 dest; + __be32 seq; + __be32 ack_seq; +#if defined(__LITTLE_ENDIAN_BITFIELD) + __u16 res1:4, + doff:4, + fin:1, + syn:1, + rst:1, + psh:1, + ack:1, + urg:1, + ece:1, + cwr:1; +#elif defined(__BIG_ENDIAN_BITFIELD) + __u16 doff:4, + res1:4, + cwr:1, + ece:1, + urg:1, + ack:1, + psh:1, + rst:1, + syn:1, + fin:1; +#else +#error "Adjust your <asm/byteorder.h> defines" +#endif + __be16 window; + __sum16 check; + __be16 urg_ptr; +}; + +/* + * The union cast uses a gcc extension to avoid aliasing problems + * (union is compatible to any of its members) + * This means this part of the code is -fstrict-aliasing safe now. + */ +union tcp_word_hdr { + struct tcphdr hdr; + __be32 words[5]; +}; + +#define tcp_flag_word(tp) ( ((union tcp_word_hdr *)(tp))->words [3]) + +enum { + TCP_FLAG_CWR = __constant_cpu_to_be32(0x00800000), + TCP_FLAG_ECE = __constant_cpu_to_be32(0x00400000), + TCP_FLAG_URG = __constant_cpu_to_be32(0x00200000), + TCP_FLAG_ACK = __constant_cpu_to_be32(0x00100000), + TCP_FLAG_PSH = __constant_cpu_to_be32(0x00080000), + TCP_FLAG_RST = __constant_cpu_to_be32(0x00040000), + TCP_FLAG_SYN = __constant_cpu_to_be32(0x00020000), + TCP_FLAG_FIN = __constant_cpu_to_be32(0x00010000), + TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0F000000), + TCP_DATA_OFFSET = __constant_cpu_to_be32(0xF0000000) +}; + +/* + * TCP general constants + */ +#define TCP_MSS_DEFAULT 536U /* IPv4 (RFC1122, RFC2581) */ +#define TCP_MSS_DESIRED 1220U /* IPv6 (tunneled), EDNS0 (RFC3226) */ + +/* TCP socket options */ +#define TCP_NODELAY 1 /* Turn off Nagle's algorithm. */ +#define TCP_MAXSEG 2 /* Limit MSS */ +#define TCP_CORK 3 /* Never send partially complete segments */ +#define TCP_KEEPIDLE 4 /* Start keeplives after this period */ +#define TCP_KEEPINTVL 5 /* Interval between keepalives */ +#define TCP_KEEPCNT 6 /* Number of keepalives before death */ +#define TCP_SYNCNT 7 /* Number of SYN retransmits */ +#define TCP_LINGER2 8 /* Life time of orphaned FIN-WAIT-2 state */ +#define TCP_DEFER_ACCEPT 9 /* Wake up listener only when data arrive */ +#define TCP_WINDOW_CLAMP 10 /* Bound advertised window */ +#define TCP_INFO 11 /* Information about this connection. */ +#define TCP_QUICKACK 12 /* Block/reenable quick acks */ +#define TCP_CONGESTION 13 /* Congestion control algorithm */ +#define TCP_MD5SIG 14 /* TCP MD5 Signature (RFC2385) */ +#define TCP_THIN_LINEAR_TIMEOUTS 16 /* Use linear timeouts for thin streams*/ +#define TCP_THIN_DUPACK 17 /* Fast retrans. after 1 dupack */ +#define TCP_USER_TIMEOUT 18 /* How long for loss retry before timeout */ +#define TCP_REPAIR 19 /* TCP sock is under repair right now */ +#define TCP_REPAIR_QUEUE 20 +#define TCP_QUEUE_SEQ 21 +#define TCP_REPAIR_OPTIONS 22 +#define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */ +#define TCP_TIMESTAMP 24 +#define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */ +#define TCP_CC_INFO 26 /* Get Congestion Control (optional) info */ +#define TCP_SAVE_SYN 27 /* Record SYN headers for new connections */ +#define TCP_SAVED_SYN 28 /* Get SYN headers recorded for connection */ +#define TCP_REPAIR_WINDOW 29 /* Get/set window parameters */ +#define TCP_FASTOPEN_CONNECT 30 /* Attempt FastOpen with connect */ +#define TCP_ULP 31 /* Attach a ULP to a TCP connection */ +#define TCP_MD5SIG_EXT 32 /* TCP MD5 Signature with extensions */ +#define TCP_FASTOPEN_KEY 33 /* Set the key for Fast Open (cookie) */ +#define TCP_FASTOPEN_NO_COOKIE 34 /* Enable TFO without a TFO cookie */ +#define TCP_ZEROCOPY_RECEIVE 35 +#define TCP_INQ 36 /* Notify bytes available to read as a cmsg on read */ + +#define TCP_CM_INQ TCP_INQ + +#define TCP_TX_DELAY 37 /* delay outgoing packets by XX usec */ + + +#define TCP_REPAIR_ON 1 +#define TCP_REPAIR_OFF 0 +#define TCP_REPAIR_OFF_NO_WP -1 /* Turn off without window probes */ + +struct tcp_repair_opt { + __u32 opt_code; + __u32 opt_val; +}; + +struct tcp_repair_window { + __u32 snd_wl1; + __u32 snd_wnd; + __u32 max_window; + + __u32 rcv_wnd; + __u32 rcv_wup; +}; + +enum { + TCP_NO_QUEUE, + TCP_RECV_QUEUE, + TCP_SEND_QUEUE, + TCP_QUEUES_NR, +}; + +/* why fastopen failed from client perspective */ +enum tcp_fastopen_client_fail { + TFO_STATUS_UNSPEC, /* catch-all */ + TFO_COOKIE_UNAVAILABLE, /* if not in TFO_CLIENT_NO_COOKIE mode */ + TFO_DATA_NOT_ACKED, /* SYN-ACK did not ack SYN data */ + TFO_SYN_RETRANSMITTED, /* SYN-ACK did not ack SYN data after timeout */ +}; + +/* for TCP_INFO socket option */ +#define TCPI_OPT_TIMESTAMPS 1 +#define TCPI_OPT_SACK 2 +#define TCPI_OPT_WSCALE 4 +#define TCPI_OPT_ECN 8 /* ECN was negociated at TCP session init */ +#define TCPI_OPT_ECN_SEEN 16 /* we received at least one packet with ECT */ +#define TCPI_OPT_SYN_DATA 32 /* SYN-ACK acked data in SYN sent or rcvd */ + +/* + * Sender's congestion state indicating normal or abnormal situations + * in the last round of packets sent. The state is driven by the ACK + * information and timer events. + */ +enum tcp_ca_state { + /* + * Nothing bad has been observed recently. + * No apparent reordering, packet loss, or ECN marks. + */ + TCP_CA_Open = 0, +#define TCPF_CA_Open (1<<TCP_CA_Open) + /* + * The sender enters disordered state when it has received DUPACKs or + * SACKs in the last round of packets sent. This could be due to packet + * loss or reordering but needs further information to confirm packets + * have been lost. + */ + TCP_CA_Disorder = 1, +#define TCPF_CA_Disorder (1<<TCP_CA_Disorder) + /* + * The sender enters Congestion Window Reduction (CWR) state when it + * has received ACKs with ECN-ECE marks, or has experienced congestion + * or packet discard on the sender host (e.g. qdisc). + */ + TCP_CA_CWR = 2, +#define TCPF_CA_CWR (1<<TCP_CA_CWR) + /* + * The sender is in fast recovery and retransmitting lost packets, + * typically triggered by ACK events. + */ + TCP_CA_Recovery = 3, +#define TCPF_CA_Recovery (1<<TCP_CA_Recovery) + /* + * The sender is in loss recovery triggered by retransmission timeout. + */ + TCP_CA_Loss = 4 +#define TCPF_CA_Loss (1<<TCP_CA_Loss) +}; + +struct tcp_info { + __u8 tcpi_state; + __u8 tcpi_ca_state; + __u8 tcpi_retransmits; + __u8 tcpi_probes; + __u8 tcpi_backoff; + __u8 tcpi_options; + __u8 tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4; + __u8 tcpi_delivery_rate_app_limited:1, tcpi_fastopen_client_fail:2; + + __u32 tcpi_rto; + __u32 tcpi_ato; + __u32 tcpi_snd_mss; + __u32 tcpi_rcv_mss; + + __u32 tcpi_unacked; + __u32 tcpi_sacked; + __u32 tcpi_lost; + __u32 tcpi_retrans; + __u32 tcpi_fackets; + + /* Times. */ + __u32 tcpi_last_data_sent; + __u32 tcpi_last_ack_sent; /* Not remembered, sorry. */ + __u32 tcpi_last_data_recv; + __u32 tcpi_last_ack_recv; + + /* Metrics. */ + __u32 tcpi_pmtu; + __u32 tcpi_rcv_ssthresh; + __u32 tcpi_rtt; + __u32 tcpi_rttvar; + __u32 tcpi_snd_ssthresh; + __u32 tcpi_snd_cwnd; + __u32 tcpi_advmss; + __u32 tcpi_reordering; + + __u32 tcpi_rcv_rtt; + __u32 tcpi_rcv_space; + + __u32 tcpi_total_retrans; + + __u64 tcpi_pacing_rate; + __u64 tcpi_max_pacing_rate; + __u64 tcpi_bytes_acked; /* RFC4898 tcpEStatsAppHCThruOctetsAcked */ + __u64 tcpi_bytes_received; /* RFC4898 tcpEStatsAppHCThruOctetsReceived */ + __u32 tcpi_segs_out; /* RFC4898 tcpEStatsPerfSegsOut */ + __u32 tcpi_segs_in; /* RFC4898 tcpEStatsPerfSegsIn */ + + __u32 tcpi_notsent_bytes; + __u32 tcpi_min_rtt; + __u32 tcpi_data_segs_in; /* RFC4898 tcpEStatsDataSegsIn */ + __u32 tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */ + + __u64 tcpi_delivery_rate; + + __u64 tcpi_busy_time; /* Time (usec) busy sending data */ + __u64 tcpi_rwnd_limited; /* Time (usec) limited by receive window */ + __u64 tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */ + + __u32 tcpi_delivered; + __u32 tcpi_delivered_ce; + + __u64 tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */ + __u64 tcpi_bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans */ + __u32 tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */ + __u32 tcpi_reord_seen; /* reordering events seen */ + + __u32 tcpi_rcv_ooopack; /* Out-of-order packets received */ + + __u32 tcpi_snd_wnd; /* peer's advertised receive window after + * scaling (bytes) + */ +}; + +/* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */ +enum { + TCP_NLA_PAD, + TCP_NLA_BUSY, /* Time (usec) busy sending data */ + TCP_NLA_RWND_LIMITED, /* Time (usec) limited by receive window */ + TCP_NLA_SNDBUF_LIMITED, /* Time (usec) limited by send buffer */ + TCP_NLA_DATA_SEGS_OUT, /* Data pkts sent including retransmission */ + TCP_NLA_TOTAL_RETRANS, /* Data pkts retransmitted */ + TCP_NLA_PACING_RATE, /* Pacing rate in bytes per second */ + TCP_NLA_DELIVERY_RATE, /* Delivery rate in bytes per second */ + TCP_NLA_SND_CWND, /* Sending congestion window */ + TCP_NLA_REORDERING, /* Reordering metric */ + TCP_NLA_MIN_RTT, /* minimum RTT */ + TCP_NLA_RECUR_RETRANS, /* Recurring retransmits for the current pkt */ + TCP_NLA_DELIVERY_RATE_APP_LMT, /* delivery rate application limited ? */ + TCP_NLA_SNDQ_SIZE, /* Data (bytes) pending in send queue */ + TCP_NLA_CA_STATE, /* ca_state of socket */ + TCP_NLA_SND_SSTHRESH, /* Slow start size threshold */ + TCP_NLA_DELIVERED, /* Data pkts delivered incl. out-of-order */ + TCP_NLA_DELIVERED_CE, /* Like above but only ones w/ CE marks */ + TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */ + TCP_NLA_BYTES_RETRANS, /* Data bytes retransmitted */ + TCP_NLA_DSACK_DUPS, /* DSACK blocks received */ + TCP_NLA_REORD_SEEN, /* reordering events seen */ + TCP_NLA_SRTT, /* smoothed RTT in usecs */ + TCP_NLA_TIMEOUT_REHASH, /* Timeout-triggered rehash attempts */ + TCP_NLA_BYTES_NOTSENT, /* Bytes in write queue not yet sent */ + TCP_NLA_EDT, /* Earliest departure time (CLOCK_MONOTONIC) */ +}; + +/* for TCP_MD5SIG socket option */ +#define TCP_MD5SIG_MAXKEYLEN 80 + +/* tcp_md5sig extension flags for TCP_MD5SIG_EXT */ +#define TCP_MD5SIG_FLAG_PREFIX 0x1 /* address prefix length */ +#define TCP_MD5SIG_FLAG_IFINDEX 0x2 /* ifindex set */ + +struct tcp_md5sig { + struct __kernel_sockaddr_storage tcpm_addr; /* address associated */ + __u8 tcpm_flags; /* extension flags */ + __u8 tcpm_prefixlen; /* address prefix */ + __u16 tcpm_keylen; /* key length */ + int tcpm_ifindex; /* device index for scope */ + __u8 tcpm_key[TCP_MD5SIG_MAXKEYLEN]; /* key (binary) */ +}; + +/* INET_DIAG_MD5SIG */ +struct tcp_diag_md5sig { + __u8 tcpm_family; + __u8 tcpm_prefixlen; + __u16 tcpm_keylen; + __be32 tcpm_addr[4]; + __u8 tcpm_key[TCP_MD5SIG_MAXKEYLEN]; +}; + +/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */ + +#define TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT 0x1 +struct tcp_zerocopy_receive { + __u64 address; /* in: address of mapping */ + __u32 length; /* in/out: number of bytes to map/mapped */ + __u32 recv_skip_hint; /* out: amount of bytes to skip */ + __u32 inq; /* out: amount of bytes in read queue */ + __s32 err; /* out: socket error */ + __u64 copybuf_address; /* in: copybuf address (small reads) */ + __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */ + __u32 flags; /* in: flags */ +}; +#endif /* _UAPI_LINUX_TCP_H */ diff --git a/tools/lib/bpf/.gitignore b/tools/lib/bpf/.gitignore index 8a81b3679d2b..5d4cfac671d5 100644 --- a/tools/lib/bpf/.gitignore +++ b/tools/lib/bpf/.gitignore @@ -1,7 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only libbpf_version.h libbpf.pc -FEATURE-DUMP.libbpf libbpf.so.* TAGS tags diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile index 55bd78b3496f..887a494ad5fc 100644 --- a/tools/lib/bpf/Makefile +++ b/tools/lib/bpf/Makefile @@ -58,28 +58,7 @@ ifndef VERBOSE VERBOSE = 0 endif -FEATURE_USER = .libbpf -FEATURE_TESTS = libelf zlib bpf -FEATURE_DISPLAY = libelf zlib bpf - INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/include/uapi -FEATURE_CHECK_CFLAGS-bpf = $(INCLUDES) - -check_feat := 1 -NON_CHECK_FEAT_TARGETS := clean TAGS tags cscope help -ifdef MAKECMDGOALS -ifeq ($(filter-out $(NON_CHECK_FEAT_TARGETS),$(MAKECMDGOALS)),) - check_feat := 0 -endif -endif - -ifeq ($(check_feat),1) -ifeq ($(FEATURES_DUMP),) -include $(srctree)/tools/build/Makefile.feature -else -include $(FEATURES_DUMP) -endif -endif export prefix libdir src obj @@ -157,7 +136,7 @@ all: fixdep all_cmd: $(CMD_TARGETS) check -$(BPF_IN_SHARED): force elfdep zdep bpfdep $(BPF_HELPER_DEFS) +$(BPF_IN_SHARED): force $(BPF_HELPER_DEFS) @(test -f ../../include/uapi/linux/bpf.h -a -f ../../../include/uapi/linux/bpf.h && ( \ (diff -B ../../include/uapi/linux/bpf.h ../../../include/uapi/linux/bpf.h >/dev/null) || \ echo "Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'" >&2 )) || true @@ -175,7 +154,7 @@ $(BPF_IN_SHARED): force elfdep zdep bpfdep $(BPF_HELPER_DEFS) echo "Warning: Kernel ABI header at 'tools/include/uapi/linux/if_xdp.h' differs from latest version at 'include/uapi/linux/if_xdp.h'" >&2 )) || true $(Q)$(MAKE) $(build)=libbpf OUTPUT=$(SHARED_OBJDIR) CFLAGS="$(CFLAGS) $(SHLIB_FLAGS)" -$(BPF_IN_STATIC): force elfdep zdep bpfdep $(BPF_HELPER_DEFS) +$(BPF_IN_STATIC): force $(BPF_HELPER_DEFS) $(Q)$(MAKE) $(build)=libbpf OUTPUT=$(STATIC_OBJDIR) $(BPF_HELPER_DEFS): $(srctree)/tools/include/uapi/linux/bpf.h @@ -264,34 +243,16 @@ install_pkgconfig: $(PC_FILE) install: install_lib install_pkgconfig install_headers -### Cleaning rules - -config-clean: - $(call QUIET_CLEAN, feature-detect) - $(Q)$(MAKE) -C $(srctree)/tools/build/feature/ clean >/dev/null - -clean: config-clean +clean: $(call QUIET_CLEAN, libbpf) $(RM) -rf $(CMD_TARGETS) \ *~ .*.d .*.cmd LIBBPF-CFLAGS $(BPF_HELPER_DEFS) \ $(SHARED_OBJDIR) $(STATIC_OBJDIR) \ $(addprefix $(OUTPUT), \ *.o *.a *.so *.so.$(LIBBPF_MAJOR_VERSION) *.pc) - $(call QUIET_CLEAN, core-gen) $(RM) $(OUTPUT)FEATURE-DUMP.libbpf - - -PHONY += force elfdep zdep bpfdep cscope tags +PHONY += force cscope tags force: -elfdep: - @if [ "$(feature-libelf)" != "1" ]; then echo "No libelf found"; exit 1 ; fi - -zdep: - @if [ "$(feature-zlib)" != "1" ]; then echo "No zlib found"; exit 1 ; fi - -bpfdep: - @if [ "$(feature-bpf)" != "1" ]; then echo "BPF API too old"; exit 1 ; fi - cscope: ls *.c *.h > cscope.files cscope -b -q -I $(srctree)/include -f cscope.out diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 9970a288dda5..d9c10830d749 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -858,6 +858,7 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, Elf_Scn *scn = NULL; Elf *elf = NULL; GElf_Ehdr ehdr; + size_t shstrndx; if (elf_version(EV_CURRENT) == EV_NONE) { pr_warn("failed to init libelf for %s\n", path); @@ -882,7 +883,14 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, pr_warn("failed to get EHDR from %s\n", path); goto done; } - if (!elf_rawdata(elf_getscn(elf, ehdr.e_shstrndx), NULL)) { + + if (elf_getshdrstrndx(elf, &shstrndx)) { + pr_warn("failed to get section names section index for %s\n", + path); + goto done; + } + + if (!elf_rawdata(elf_getscn(elf, shstrndx), NULL)) { pr_warn("failed to get e_shstrndx from %s\n", path); goto done; } @@ -897,7 +905,7 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, idx, path); goto done; } - name = elf_strptr(elf, ehdr.e_shstrndx, sh.sh_name); + name = elf_strptr(elf, shstrndx, sh.sh_name); if (!name) { pr_warn("failed to get section(%d) name from %s\n", idx, path); diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 2abbc3800568..d43cc3f29dae 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -884,24 +884,24 @@ static int bpf_map__init_kern_struct_ops(struct bpf_map *map, if (btf_is_ptr(mtype)) { struct bpf_program *prog; - mtype = skip_mods_and_typedefs(btf, mtype->type, &mtype_id); + prog = st_ops->progs[i]; + if (!prog) + continue; + kern_mtype = skip_mods_and_typedefs(kern_btf, kern_mtype->type, &kern_mtype_id); - if (!btf_is_func_proto(mtype) || - !btf_is_func_proto(kern_mtype)) { - pr_warn("struct_ops init_kern %s: non func ptr %s is not supported\n", + + /* mtype->type must be a func_proto which was + * guaranteed in bpf_object__collect_st_ops_relos(), + * so only check kern_mtype for func_proto here. + */ + if (!btf_is_func_proto(kern_mtype)) { + pr_warn("struct_ops init_kern %s: kernel member %s is not a func ptr\n", map->name, mname); return -ENOTSUP; } - prog = st_ops->progs[i]; - if (!prog) { - pr_debug("struct_ops init_kern %s: func ptr %s is not set\n", - map->name, mname); - continue; - } - prog->attach_btf_id = kern_type_id; prog->expected_attach_type = kern_member_idx; diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c index e3e41ceeb1bc..ffbb588724d8 100644 --- a/tools/lib/bpf/xsk.c +++ b/tools/lib/bpf/xsk.c @@ -46,6 +46,11 @@ #define PF_XDP AF_XDP #endif +enum xsk_prog { + XSK_PROG_FALLBACK, + XSK_PROG_REDIRECT_FLAGS, +}; + struct xsk_umem { struct xsk_ring_prod *fill_save; struct xsk_ring_cons *comp_save; @@ -351,6 +356,54 @@ int xsk_umem__create_v0_0_2(struct xsk_umem **umem_ptr, void *umem_area, COMPAT_VERSION(xsk_umem__create_v0_0_2, xsk_umem__create, LIBBPF_0.0.2) DEFAULT_VERSION(xsk_umem__create_v0_0_4, xsk_umem__create, LIBBPF_0.0.4) +static enum xsk_prog get_xsk_prog(void) +{ + enum xsk_prog detected = XSK_PROG_FALLBACK; + struct bpf_load_program_attr prog_attr; + struct bpf_create_map_attr map_attr; + __u32 size_out, retval, duration; + char data_in = 0, data_out; + struct bpf_insn insns[] = { + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_MOV64_IMM(BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_3, XDP_PASS), + BPF_EMIT_CALL(BPF_FUNC_redirect_map), + BPF_EXIT_INSN(), + }; + int prog_fd, map_fd, ret; + + memset(&map_attr, 0, sizeof(map_attr)); + map_attr.map_type = BPF_MAP_TYPE_XSKMAP; + map_attr.key_size = sizeof(int); + map_attr.value_size = sizeof(int); + map_attr.max_entries = 1; + + map_fd = bpf_create_map_xattr(&map_attr); + if (map_fd < 0) + return detected; + + insns[0].imm = map_fd; + + memset(&prog_attr, 0, sizeof(prog_attr)); + prog_attr.prog_type = BPF_PROG_TYPE_XDP; + prog_attr.insns = insns; + prog_attr.insns_cnt = ARRAY_SIZE(insns); + prog_attr.license = "GPL"; + + prog_fd = bpf_load_program_xattr(&prog_attr, NULL, 0); + if (prog_fd < 0) { + close(map_fd); + return detected; + } + + ret = bpf_prog_test_run(prog_fd, 0, &data_in, 1, &data_out, &size_out, &retval, &duration); + if (!ret && retval == XDP_PASS) + detected = XSK_PROG_REDIRECT_FLAGS; + close(prog_fd); + close(map_fd); + return detected; +} + static int xsk_load_xdp_prog(struct xsk_socket *xsk) { static const int log_buf_size = 16 * 1024; @@ -358,7 +411,7 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk) char log_buf[log_buf_size]; int err, prog_fd; - /* This is the C-program: + /* This is the fallback C-program: * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) * { * int ret, index = ctx->rx_queue_index; @@ -414,9 +467,31 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk) /* The jumps are to this instruction */ BPF_EXIT_INSN(), }; - size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn); - prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, prog, insns_cnt, + /* This is the post-5.3 kernel C-program: + * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) + * { + * return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS); + * } + */ + struct bpf_insn prog_redirect_flags[] = { + /* r2 = *(u32 *)(r1 + 16) */ + BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 16), + /* r1 = xskmap[] */ + BPF_LD_MAP_FD(BPF_REG_1, ctx->xsks_map_fd), + /* r3 = XDP_PASS */ + BPF_MOV64_IMM(BPF_REG_3, 2), + /* call bpf_redirect_map */ + BPF_EMIT_CALL(BPF_FUNC_redirect_map), + BPF_EXIT_INSN(), + }; + size_t insns_cnt[] = {sizeof(prog) / sizeof(struct bpf_insn), + sizeof(prog_redirect_flags) / sizeof(struct bpf_insn), + }; + struct bpf_insn *progs[] = {prog, prog_redirect_flags}; + enum xsk_prog option = get_xsk_prog(); + + prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, progs[option], insns_cnt[option], "LGPL-2.1 or BSD-2-Clause", 0, log_buf, log_buf_size); if (prog_fd < 0) { @@ -442,7 +517,7 @@ static int xsk_get_max_queues(struct xsk_socket *xsk) struct ifreq ifr = {}; int fd, err, ret; - fd = socket(AF_INET, SOCK_DGRAM, 0); + fd = socket(AF_LOCAL, SOCK_DGRAM, 0); if (fd < 0) return -errno; diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf index 62f3deb1d3a8..f4df7534026d 100644 --- a/tools/perf/Makefile.perf +++ b/tools/perf/Makefile.perf @@ -176,7 +176,6 @@ endef LD += $(EXTRA_LDFLAGS) PKG_CONFIG = $(CROSS_COMPILE)pkg-config -LLVM_CONFIG ?= llvm-config RM = rm -f LN = ln -f diff --git a/tools/scripts/Makefile.include b/tools/scripts/Makefile.include index 1358e89cdf7d..4255e71f72b7 100644 --- a/tools/scripts/Makefile.include +++ b/tools/scripts/Makefile.include @@ -69,6 +69,13 @@ HOSTCC ?= gcc HOSTLD ?= ld endif +# Some tools require Clang, LLC and/or LLVM utils +CLANG ?= clang +LLC ?= llc +LLVM_CONFIG ?= llvm-config +LLVM_OBJCOPY ?= llvm-objcopy +LLVM_STRIP ?= llvm-strip + ifeq ($(CC_NO_CLANG), 1) EXTRA_WARNINGS += -Wstrict-aliasing=3 endif diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index f5b7ef93618c..c0c48fdb9ac1 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -17,7 +17,6 @@ test_sockmap test_lirc_mode2_user get_cgroup_id_user test_skb_cgroup_id_user -test_socket_cookie test_cgroup_storage test_flow_dissector flow_dissector_load @@ -26,7 +25,6 @@ test_tcpnotify_user test_libbpf test_tcp_check_syncookie_user test_sysctl -test_current_pid_tgid_new_ns xdping test_cpp *.skel.h diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 0552b07717b6..044bfdcf5b74 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -19,8 +19,6 @@ ifneq ($(wildcard $(GENHDR)),) GENFLAGS := -DHAVE_GENHDR endif -CLANG ?= clang -LLVM_OBJCOPY ?= llvm-objcopy BPF_GCC ?= $(shell command -v bpf-gcc;) SAN_CFLAGS ?= CFLAGS += -g -rdynamic -Wall -O2 $(GENFLAGS) $(SAN_CFLAGS) \ @@ -33,11 +31,10 @@ LDLIBS += -lcap -lelf -lz -lrt -lpthread # Order correspond to 'make run_tests' order TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \ test_verifier_log test_dev_cgroup \ - test_sock test_sockmap get_cgroup_id_user test_socket_cookie \ + test_sock test_sockmap get_cgroup_id_user \ test_cgroup_storage \ test_netcnt test_tcpnotify_user test_sysctl \ - test_progs-no_alu32 \ - test_current_pid_tgid_new_ns + test_progs-no_alu32 # Also test bpf-gcc, if present ifneq ($(BPF_GCC),) @@ -188,7 +185,6 @@ $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c $(OUTPUT)/test_skb_cgroup_id_user: cgroup_helpers.c $(OUTPUT)/test_sock: cgroup_helpers.c $(OUTPUT)/test_sock_addr: cgroup_helpers.c -$(OUTPUT)/test_socket_cookie: cgroup_helpers.c $(OUTPUT)/test_sockmap: cgroup_helpers.c $(OUTPUT)/test_tcpnotify_user: cgroup_helpers.c trace_helpers.c $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c diff --git a/tools/testing/selftests/bpf/README.rst b/tools/testing/selftests/bpf/README.rst index ca064180d4d0..fd148b8410fa 100644 --- a/tools/testing/selftests/bpf/README.rst +++ b/tools/testing/selftests/bpf/README.rst @@ -6,6 +6,30 @@ General instructions on running selftests can be found in __ /Documentation/bpf/bpf_devel_QA.rst#q-how-to-run-bpf-selftests +========================= +Running Selftests in a VM +========================= + +It's now possible to run the selftests using ``tools/testing/selftests/bpf/vmtest.sh``. +The script tries to ensure that the tests are run with the same environment as they +would be run post-submit in the CI used by the Maintainers. + +This script downloads a suitable Kconfig and VM userspace image from the system used by +the CI. It builds the kernel (without overwriting your existing Kconfig), recompiles the +bpf selftests, runs them (by default ``tools/testing/selftests/bpf/test_progs``) and +saves the resulting output (by default in ``~/.bpf_selftests``). + +For more information on about using the script, run: + +.. code-block:: console + + $ tools/testing/selftests/bpf/vmtest.sh -h + +.. note:: The script uses pahole and clang based on host environment setting. + If you want to change pahole and llvm, you can change `PATH` environment + variable in the beginning of script. + +.. note:: The script currently only supports x86_64. Additional information about selftest failures are documented here. diff --git a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c index da87c7f31891..bde6c9d4cbd4 100644 --- a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c +++ b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c @@ -319,7 +319,7 @@ static void ringbuf_custom_process_ring(struct ringbuf_custom *r) smp_store_release(r->consumer_pos, cons_pos); else break; - }; + } } static void *ringbuf_custom_consumer(void *input) diff --git a/tools/testing/selftests/bpf/bpf_sockopt_helpers.h b/tools/testing/selftests/bpf/bpf_sockopt_helpers.h new file mode 100644 index 000000000000..11f3a0976174 --- /dev/null +++ b/tools/testing/selftests/bpf/bpf_sockopt_helpers.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#include <sys/socket.h> +#include <bpf/bpf_helpers.h> + +int get_set_sk_priority(void *ctx) +{ + int prio; + + /* Verify that context allows calling bpf_getsockopt and + * bpf_setsockopt by reading and writing back socket + * priority. + */ + + if (bpf_getsockopt(ctx, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio))) + return 0; + if (bpf_setsockopt(ctx, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio))) + return 0; + + return 1; +} diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h index 6a9053162cf2..91f0fac632f4 100644 --- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h @@ -177,6 +177,7 @@ struct tcp_congestion_ops { * after all the ca_state processing. (optional) */ void (*cong_control)(struct sock *sk, const struct rate_sample *rs); + void *owner; }; #define min(a, b) ((a) < (b) ? (a) : (b)) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h index b83ea448bc79..89c6d58e5dd6 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h @@ -28,6 +28,12 @@ TRACE_EVENT(bpf_testmod_test_read, __entry->pid, __entry->comm, __entry->off, __entry->len) ); +/* A bare tracepoint with no event associated with it */ +DECLARE_TRACE(bpf_testmod_test_write_bare, + TP_PROTO(struct task_struct *task, struct bpf_testmod_test_write_ctx *ctx), + TP_ARGS(task, ctx) +); + #endif /* _BPF_TESTMOD_EVENTS_H */ #undef TRACE_INCLUDE_PATH diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 0b991e115d1f..141d8da687d2 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -31,9 +31,28 @@ bpf_testmod_test_read(struct file *file, struct kobject *kobj, EXPORT_SYMBOL(bpf_testmod_test_read); ALLOW_ERROR_INJECTION(bpf_testmod_test_read, ERRNO); +noinline ssize_t +bpf_testmod_test_write(struct file *file, struct kobject *kobj, + struct bin_attribute *bin_attr, + char *buf, loff_t off, size_t len) +{ + struct bpf_testmod_test_write_ctx ctx = { + .buf = buf, + .off = off, + .len = len, + }; + + trace_bpf_testmod_test_write_bare(current, &ctx); + + return -EIO; /* always fail */ +} +EXPORT_SYMBOL(bpf_testmod_test_write); +ALLOW_ERROR_INJECTION(bpf_testmod_test_write, ERRNO); + static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = { - .attr = { .name = "bpf_testmod", .mode = 0444, }, + .attr = { .name = "bpf_testmod", .mode = 0666, }, .read = bpf_testmod_test_read, + .write = bpf_testmod_test_write, }; static int bpf_testmod_init(void) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h index b81adfedb4f6..b3892dc40111 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h @@ -11,4 +11,10 @@ struct bpf_testmod_test_read_ctx { size_t len; }; +struct bpf_testmod_test_write_ctx { + char *buf; + loff_t off; + size_t len; +}; + #endif /* _BPF_TESTMOD_H */ diff --git a/tools/testing/selftests/bpf/prog_tests/atomic_bounds.c b/tools/testing/selftests/bpf/prog_tests/atomic_bounds.c new file mode 100644 index 000000000000..69bd7853e8f1 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/atomic_bounds.c @@ -0,0 +1,17 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <test_progs.h> + +#include "atomic_bounds.skel.h" + +void test_atomic_bounds(void) +{ + struct atomic_bounds *skel; + __u32 duration = 0; + + skel = atomic_bounds__open_and_load(); + if (CHECK(!skel, "skel_load", "couldn't load program\n")) + return; + + atomic_bounds__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c b/tools/testing/selftests/bpf/prog_tests/bind_perm.c new file mode 100644 index 000000000000..d0f06e40c16d --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c @@ -0,0 +1,109 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <test_progs.h> +#include "bind_perm.skel.h" + +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/capability.h> + +static int duration; + +void try_bind(int family, int port, int expected_errno) +{ + struct sockaddr_storage addr = {}; + struct sockaddr_in6 *sin6; + struct sockaddr_in *sin; + int fd = -1; + + fd = socket(family, SOCK_STREAM, 0); + if (CHECK(fd < 0, "fd", "errno %d", errno)) + goto close_socket; + + if (family == AF_INET) { + sin = (struct sockaddr_in *)&addr; + sin->sin_family = family; + sin->sin_port = htons(port); + } else { + sin6 = (struct sockaddr_in6 *)&addr; + sin6->sin6_family = family; + sin6->sin6_port = htons(port); + } + + errno = 0; + bind(fd, (struct sockaddr *)&addr, sizeof(addr)); + ASSERT_EQ(errno, expected_errno, "bind"); + +close_socket: + if (fd >= 0) + close(fd); +} + +bool cap_net_bind_service(cap_flag_value_t flag) +{ + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE; + cap_flag_value_t original_value; + bool was_effective = false; + cap_t caps; + + caps = cap_get_proc(); + if (CHECK(!caps, "cap_get_proc", "errno %d", errno)) + goto free_caps; + + if (CHECK(cap_get_flag(caps, CAP_NET_BIND_SERVICE, CAP_EFFECTIVE, + &original_value), + "cap_get_flag", "errno %d", errno)) + goto free_caps; + + was_effective = (original_value == CAP_SET); + + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service, + flag), + "cap_set_flag", "errno %d", errno)) + goto free_caps; + + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", errno)) + goto free_caps; + +free_caps: + CHECK(cap_free(caps), "cap_free", "errno %d", errno); + return was_effective; +} + +void test_bind_perm(void) +{ + bool cap_was_effective; + struct bind_perm *skel; + int cgroup_fd; + + cgroup_fd = test__join_cgroup("/bind_perm"); + if (CHECK(cgroup_fd < 0, "cg-join", "errno %d", errno)) + return; + + skel = bind_perm__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel")) + goto close_cgroup_fd; + + skel->links.bind_v4_prog = bpf_program__attach_cgroup(skel->progs.bind_v4_prog, cgroup_fd); + if (!ASSERT_OK_PTR(skel, "bind_v4_prog")) + goto close_skeleton; + + skel->links.bind_v6_prog = bpf_program__attach_cgroup(skel->progs.bind_v6_prog, cgroup_fd); + if (!ASSERT_OK_PTR(skel, "bind_v6_prog")) + goto close_skeleton; + + cap_was_effective = cap_net_bind_service(CAP_CLEAR); + + try_bind(AF_INET, 110, EACCES); + try_bind(AF_INET6, 110, EACCES); + + try_bind(AF_INET, 111, 0); + try_bind(AF_INET6, 111, 0); + + if (cap_was_effective) + cap_net_bind_service(CAP_SET); + +close_skeleton: + bind_perm__destroy(skel); +close_cgroup_fd: + close(cgroup_fd); +} diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index 0e586368948d..74c45d557a2b 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -7,6 +7,7 @@ #include "bpf_iter_task.skel.h" #include "bpf_iter_task_stack.skel.h" #include "bpf_iter_task_file.skel.h" +#include "bpf_iter_task_vma.skel.h" #include "bpf_iter_task_btf.skel.h" #include "bpf_iter_tcp4.skel.h" #include "bpf_iter_tcp6.skel.h" @@ -64,6 +65,22 @@ free_link: bpf_link__destroy(link); } +static int read_fd_into_buffer(int fd, char *buf, int size) +{ + int bufleft = size; + int len; + + do { + len = read(fd, buf, bufleft); + if (len > 0) { + buf += len; + bufleft -= len; + } + } while (len > 0); + + return len < 0 ? len : size - bufleft; +} + static void test_ipv6_route(void) { struct bpf_iter_ipv6_route *skel; @@ -177,7 +194,7 @@ static int do_btf_read(struct bpf_iter_task_btf *skel) { struct bpf_program *prog = skel->progs.dump_task_struct; struct bpf_iter_task_btf__bss *bss = skel->bss; - int iter_fd = -1, len = 0, bufleft = TASKBUFSZ; + int iter_fd = -1, err; struct bpf_link *link; char *buf = taskbuf; int ret = 0; @@ -190,14 +207,7 @@ static int do_btf_read(struct bpf_iter_task_btf *skel) if (CHECK(iter_fd < 0, "create_iter", "create_iter failed\n")) goto free_link; - do { - len = read(iter_fd, buf, bufleft); - if (len > 0) { - buf += len; - bufleft -= len; - } - } while (len > 0); - + err = read_fd_into_buffer(iter_fd, buf, TASKBUFSZ); if (bss->skip) { printf("%s:SKIP:no __builtin_btf_type_id\n", __func__); ret = 1; @@ -205,7 +215,7 @@ static int do_btf_read(struct bpf_iter_task_btf *skel) goto free_link; } - if (CHECK(len < 0, "read", "read failed: %s\n", strerror(errno))) + if (CHECK(err < 0, "read", "read failed: %s\n", strerror(errno))) goto free_link; CHECK(strstr(taskbuf, "(struct task_struct)") == NULL, @@ -1133,6 +1143,92 @@ static void test_buf_neg_offset(void) bpf_iter_test_kern6__destroy(skel); } +#define CMP_BUFFER_SIZE 1024 +static char task_vma_output[CMP_BUFFER_SIZE]; +static char proc_maps_output[CMP_BUFFER_SIZE]; + +/* remove \0 and \t from str, and only keep the first line */ +static void str_strip_first_line(char *str) +{ + char *dst = str, *src = str; + + do { + if (*src == ' ' || *src == '\t') + src++; + else + *(dst++) = *(src++); + + } while (*src != '\0' && *src != '\n'); + + *dst = '\0'; +} + +#define min(a, b) ((a) < (b) ? (a) : (b)) + +static void test_task_vma(void) +{ + int err, iter_fd = -1, proc_maps_fd = -1; + struct bpf_iter_task_vma *skel; + int len, read_size = 4; + char maps_path[64]; + + skel = bpf_iter_task_vma__open(); + if (CHECK(!skel, "bpf_iter_task_vma__open", "skeleton open failed\n")) + return; + + skel->bss->pid = getpid(); + + err = bpf_iter_task_vma__load(skel); + if (CHECK(err, "bpf_iter_task_vma__load", "skeleton load failed\n")) + goto out; + + skel->links.proc_maps = bpf_program__attach_iter( + skel->progs.proc_maps, NULL); + + if (CHECK(IS_ERR(skel->links.proc_maps), "bpf_program__attach_iter", + "attach iterator failed\n")) { + skel->links.proc_maps = NULL; + goto out; + } + + iter_fd = bpf_iter_create(bpf_link__fd(skel->links.proc_maps)); + if (CHECK(iter_fd < 0, "create_iter", "create_iter failed\n")) + goto out; + + /* Read CMP_BUFFER_SIZE (1kB) from bpf_iter. Read in small chunks + * to trigger seq_file corner cases. The expected output is much + * longer than 1kB, so the while loop will terminate. + */ + len = 0; + while (len < CMP_BUFFER_SIZE) { + err = read_fd_into_buffer(iter_fd, task_vma_output + len, + min(read_size, CMP_BUFFER_SIZE - len)); + if (CHECK(err < 0, "read_iter_fd", "read_iter_fd failed\n")) + goto out; + len += err; + } + + /* read CMP_BUFFER_SIZE (1kB) from /proc/pid/maps */ + snprintf(maps_path, 64, "/proc/%u/maps", skel->bss->pid); + proc_maps_fd = open(maps_path, O_RDONLY); + if (CHECK(proc_maps_fd < 0, "open_proc_maps", "open_proc_maps failed\n")) + goto out; + err = read_fd_into_buffer(proc_maps_fd, proc_maps_output, CMP_BUFFER_SIZE); + if (CHECK(err < 0, "read_prog_maps_fd", "read_prog_maps_fd failed\n")) + goto out; + + /* strip and compare the first line of the two files */ + str_strip_first_line(task_vma_output); + str_strip_first_line(proc_maps_output); + + CHECK(strcmp(task_vma_output, proc_maps_output), "compare_output", + "found mismatch\n"); +out: + close(proc_maps_fd); + close(iter_fd); + bpf_iter_task_vma__destroy(skel); +} + void test_bpf_iter(void) { if (test__start_subtest("btf_id_or_null")) @@ -1149,6 +1245,8 @@ void test_bpf_iter(void) test_task_stack(); if (test__start_subtest("task_file")) test_task_file(); + if (test__start_subtest("task_vma")) + test_task_vma(); if (test__start_subtest("task_btf")) test_task_btf(); if (test__start_subtest("tcp4")) diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c index 9a8f47fc0b91..37c5494a0381 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c @@ -2,6 +2,7 @@ /* Copyright (c) 2019 Facebook */ #include <linux/err.h> +#include <netinet/tcp.h> #include <test_progs.h> #include "bpf_dctcp.skel.h" #include "bpf_cubic.skel.h" diff --git a/tools/testing/selftests/bpf/prog_tests/btf.c b/tools/testing/selftests/bpf/prog_tests/btf.c index 8ae97e2a4b9d..6a7ee7420701 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf.c +++ b/tools/testing/selftests/bpf/prog_tests/btf.c @@ -914,7 +914,7 @@ static struct btf_raw_test raw_tests[] = { .err_str = "Member exceeds struct_size", }, -/* Test member exeeds the size of struct +/* Test member exceeds the size of struct * * struct A { * int m; @@ -948,7 +948,7 @@ static struct btf_raw_test raw_tests[] = { .err_str = "Member exceeds struct_size", }, -/* Test member exeeds the size of struct +/* Test member exceeds the size of struct * * struct A { * int m; @@ -3509,6 +3509,27 @@ static struct btf_raw_test raw_tests[] = { .value_type_id = 3 /* arr_t */, .max_entries = 4, }, +/* + * elf .rodata section size 4 and btf .rodata section vlen 0. + */ +{ + .descr = "datasec: vlen == 0", + .raw_types = { + /* int */ + BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [1] */ + /* .rodata section */ + BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 0), 4), + /* [2] */ + BTF_END_RAW, + }, + BTF_STR_SEC("\0.rodata"), + .map_type = BPF_MAP_TYPE_ARRAY, + .key_size = sizeof(int), + .value_size = sizeof(int), + .key_type_id = 1, + .value_type_id = 1, + .max_entries = 1, +}, }; /* struct btf_raw_test raw_tests[] */ diff --git a/tools/testing/selftests/bpf/prog_tests/check_mtu.c b/tools/testing/selftests/bpf/prog_tests/check_mtu.c new file mode 100644 index 000000000000..36af1c138faf --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/check_mtu.c @@ -0,0 +1,216 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Jesper Dangaard Brouer */ + +#include <linux/if_link.h> /* before test_progs.h, avoid bpf_util.h redefines */ +#include <test_progs.h> +#include "test_check_mtu.skel.h" +#include "network_helpers.h" + +#include <stdlib.h> +#include <inttypes.h> + +#define IFINDEX_LO 1 + +static __u32 duration; /* Hint: needed for CHECK macro */ + +static int read_mtu_device_lo(void) +{ + const char *filename = "/sys/class/net/lo/mtu"; + char buf[11] = {}; + int value, n, fd; + + fd = open(filename, 0, O_RDONLY); + if (fd == -1) + return -1; + + n = read(fd, buf, sizeof(buf)); + close(fd); + + if (n == -1) + return -2; + + value = strtoimax(buf, NULL, 10); + if (errno == ERANGE) + return -3; + + return value; +} + +static void test_check_mtu_xdp_attach(void) +{ + struct bpf_link_info link_info; + __u32 link_info_len = sizeof(link_info); + struct test_check_mtu *skel; + struct bpf_program *prog; + struct bpf_link *link; + int err = 0; + int fd; + + skel = test_check_mtu__open_and_load(); + if (CHECK(!skel, "open and load skel", "failed")) + return; /* Exit if e.g. helper unknown to kernel */ + + prog = skel->progs.xdp_use_helper_basic; + + link = bpf_program__attach_xdp(prog, IFINDEX_LO); + if (CHECK(IS_ERR(link), "link_attach", "failed: %ld\n", PTR_ERR(link))) + goto out; + skel->links.xdp_use_helper_basic = link; + + memset(&link_info, 0, sizeof(link_info)); + fd = bpf_link__fd(link); + err = bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + if (CHECK(err, "link_info", "failed: %d\n", err)) + goto out; + + CHECK(link_info.type != BPF_LINK_TYPE_XDP, "link_type", + "got %u != exp %u\n", link_info.type, BPF_LINK_TYPE_XDP); + CHECK(link_info.xdp.ifindex != IFINDEX_LO, "link_ifindex", + "got %u != exp %u\n", link_info.xdp.ifindex, IFINDEX_LO); + + err = bpf_link__detach(link); + CHECK(err, "link_detach", "failed %d\n", err); + +out: + test_check_mtu__destroy(skel); +} + +static void test_check_mtu_run_xdp(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = XDP_PASS; + __u32 mtu_result = 0; + char buf[256] = {}; + int err; + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0, "bpf_prog_test_run", + "prog_name:%s (err %d errno %d retval %d)\n", + prog_name, err, errno, tattr.retval); + + CHECK(tattr.retval != retval_expect, "retval", + "progname:%s unexpected retval=%d expected=%d\n", + prog_name, tattr.retval, retval_expect); + + /* Extract MTU that BPF-prog got */ + mtu_result = skel->bss->global_bpf_mtu_xdp; + ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user"); +} + + +static void test_check_mtu_xdp(__u32 mtu, __u32 ifindex) +{ + struct test_check_mtu *skel; + int err; + + skel = test_check_mtu__open(); + if (CHECK(!skel, "skel_open", "failed")) + return; + + /* Update "constants" in BPF-prog *BEFORE* libbpf load */ + skel->rodata->GLOBAL_USER_MTU = mtu; + skel->rodata->GLOBAL_USER_IFINDEX = ifindex; + + err = test_check_mtu__load(skel); + if (CHECK(err, "skel_load", "failed: %d\n", err)) + goto cleanup; + + test_check_mtu_run_xdp(skel, skel->progs.xdp_use_helper, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_exceed_mtu, mtu); + test_check_mtu_run_xdp(skel, skel->progs.xdp_minus_delta, mtu); + +cleanup: + test_check_mtu__destroy(skel); +} + +static void test_check_mtu_run_tc(struct test_check_mtu *skel, + struct bpf_program *prog, + __u32 mtu_expect) +{ + const char *prog_name = bpf_program__name(prog); + int retval_expect = BPF_OK; + __u32 mtu_result = 0; + char buf[256] = {}; + int err; + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .prog_fd = bpf_program__fd(prog), + }; + + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(err != 0, "bpf_prog_test_run", + "prog_name:%s (err %d errno %d retval %d)\n", + prog_name, err, errno, tattr.retval); + + CHECK(tattr.retval != retval_expect, "retval", + "progname:%s unexpected retval=%d expected=%d\n", + prog_name, tattr.retval, retval_expect); + + /* Extract MTU that BPF-prog got */ + mtu_result = skel->bss->global_bpf_mtu_tc; + ASSERT_EQ(mtu_result, mtu_expect, "MTU-compare-user"); +} + + +static void test_check_mtu_tc(__u32 mtu, __u32 ifindex) +{ + struct test_check_mtu *skel; + int err; + + skel = test_check_mtu__open(); + if (CHECK(!skel, "skel_open", "failed")) + return; + + /* Update "constants" in BPF-prog *BEFORE* libbpf load */ + skel->rodata->GLOBAL_USER_MTU = mtu; + skel->rodata->GLOBAL_USER_IFINDEX = ifindex; + + err = test_check_mtu__load(skel); + if (CHECK(err, "skel_load", "failed: %d\n", err)) + goto cleanup; + + test_check_mtu_run_tc(skel, skel->progs.tc_use_helper, mtu); + test_check_mtu_run_tc(skel, skel->progs.tc_exceed_mtu, mtu); + test_check_mtu_run_tc(skel, skel->progs.tc_exceed_mtu_da, mtu); + test_check_mtu_run_tc(skel, skel->progs.tc_minus_delta, mtu); +cleanup: + test_check_mtu__destroy(skel); +} + +void test_check_mtu(void) +{ + __u32 mtu_lo; + + if (test__start_subtest("bpf_check_mtu XDP-attach")) + test_check_mtu_xdp_attach(); + + mtu_lo = read_mtu_device_lo(); + if (CHECK(mtu_lo < 0, "reading MTU value", "failed (err:%d)", mtu_lo)) + return; + + if (test__start_subtest("bpf_check_mtu XDP-run")) + test_check_mtu_xdp(mtu_lo, 0); + + if (test__start_subtest("bpf_check_mtu XDP-run ifindex-lookup")) + test_check_mtu_xdp(mtu_lo, IFINDEX_LO); + + if (test__start_subtest("bpf_check_mtu TC-run")) + test_check_mtu_tc(mtu_lo, 0); + + if (test__start_subtest("bpf_check_mtu TC-run ifindex-lookup")) + test_check_mtu_tc(mtu_lo, IFINDEX_LO); +} diff --git a/tools/testing/selftests/bpf/prog_tests/cls_redirect.c b/tools/testing/selftests/bpf/prog_tests/cls_redirect.c index 9781d85cb223..e075d03ab630 100644 --- a/tools/testing/selftests/bpf/prog_tests/cls_redirect.c +++ b/tools/testing/selftests/bpf/prog_tests/cls_redirect.c @@ -7,6 +7,7 @@ #include <string.h> #include <linux/pkt_cls.h> +#include <netinet/tcp.h> #include <test_progs.h> diff --git a/tools/testing/selftests/bpf/prog_tests/fexit_stress.c b/tools/testing/selftests/bpf/prog_tests/fexit_stress.c index 3b9dbf7433f0..7c9b62e971f1 100644 --- a/tools/testing/selftests/bpf/prog_tests/fexit_stress.c +++ b/tools/testing/selftests/bpf/prog_tests/fexit_stress.c @@ -2,8 +2,8 @@ /* Copyright (c) 2019 Facebook */ #include <test_progs.h> -/* x86-64 fits 55 JITed and 43 interpreted progs into half page */ -#define CNT 40 +/* that's kernel internal BPF_MAX_TRAMP_PROGS define */ +#define CNT 38 void test_fexit_stress(void) { diff --git a/tools/testing/selftests/bpf/prog_tests/global_func_args.c b/tools/testing/selftests/bpf/prog_tests/global_func_args.c new file mode 100644 index 000000000000..8bcc2869102f --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/global_func_args.c @@ -0,0 +1,60 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "test_progs.h" +#include "network_helpers.h" + +static __u32 duration; + +static void test_global_func_args0(struct bpf_object *obj) +{ + int err, i, map_fd, actual_value; + const char *map_name = "values"; + + map_fd = bpf_find_map(__func__, obj, map_name); + if (CHECK(map_fd < 0, "bpf_find_map", "cannot find BPF map %s: %s\n", + map_name, strerror(errno))) + return; + + struct { + const char *descr; + int expected_value; + } tests[] = { + {"passing NULL pointer", 0}, + {"returning value", 1}, + {"reading local variable", 100 }, + {"writing local variable", 101 }, + {"reading global variable", 42 }, + {"writing global variable", 43 }, + {"writing to pointer-to-pointer", 1 }, + }; + + for (i = 0; i < ARRAY_SIZE(tests); ++i) { + const int expected_value = tests[i].expected_value; + + err = bpf_map_lookup_elem(map_fd, &i, &actual_value); + + CHECK(err || actual_value != expected_value, tests[i].descr, + "err %d result %d expected %d\n", err, actual_value, expected_value); + } +} + +void test_global_func_args(void) +{ + const char *file = "./test_global_func_args.o"; + __u32 retval; + struct bpf_object *obj; + int err, prog_fd; + + err = bpf_prog_load(file, BPF_PROG_TYPE_CGROUP_SKB, &obj, &prog_fd); + if (CHECK(err, "load program", "error %d loading %s\n", err, file)) + return; + + err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4), + NULL, NULL, &retval, &duration); + CHECK(err || retval, "pass global func args run", + "err %d errno %d retval %d duration %d\n", + err, errno, retval, duration); + + test_global_func_args0(obj); + + bpf_object__close(obj); +} diff --git a/tools/testing/selftests/bpf/prog_tests/module_attach.c b/tools/testing/selftests/bpf/prog_tests/module_attach.c index 50796b651f72..5bc53d53d86e 100644 --- a/tools/testing/selftests/bpf/prog_tests/module_attach.c +++ b/tools/testing/selftests/bpf/prog_tests/module_attach.c @@ -21,9 +21,34 @@ static int trigger_module_test_read(int read_sz) return 0; } +static int trigger_module_test_write(int write_sz) +{ + int fd, err; + char *buf = malloc(write_sz); + + if (!buf) + return -ENOMEM; + + memset(buf, 'a', write_sz); + buf[write_sz-1] = '\0'; + + fd = open("/sys/kernel/bpf_testmod", O_WRONLY); + err = -errno; + if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err)) { + free(buf); + return err; + } + + write(fd, buf, write_sz); + close(fd); + free(buf); + return 0; +} + void test_module_attach(void) { const int READ_SZ = 456; + const int WRITE_SZ = 457; struct test_module_attach* skel; struct test_module_attach__bss *bss; int err; @@ -48,8 +73,10 @@ void test_module_attach(void) /* trigger tracepoint */ ASSERT_OK(trigger_module_test_read(READ_SZ), "trigger_read"); + ASSERT_OK(trigger_module_test_write(WRITE_SZ), "trigger_write"); ASSERT_EQ(bss->raw_tp_read_sz, READ_SZ, "raw_tp"); + ASSERT_EQ(bss->raw_tp_bare_write_sz, WRITE_SZ, "raw_tp_bare"); ASSERT_EQ(bss->tp_btf_read_sz, READ_SZ, "tp_btf"); ASSERT_EQ(bss->fentry_read_sz, READ_SZ, "fentry"); ASSERT_EQ(bss->fentry_manual_read_sz, READ_SZ, "fentry_manual"); diff --git a/tools/testing/selftests/bpf/prog_tests/ns_current_pid_tgid.c b/tools/testing/selftests/bpf/prog_tests/ns_current_pid_tgid.c index e74dc501b27f..31a3114906e2 100644 --- a/tools/testing/selftests/bpf/prog_tests/ns_current_pid_tgid.c +++ b/tools/testing/selftests/bpf/prog_tests/ns_current_pid_tgid.c @@ -1,85 +1,87 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2020 Carlos Neira cneirabustos@gmail.com */ + +#define _GNU_SOURCE #include <test_progs.h> +#include "test_ns_current_pid_tgid.skel.h" #include <sys/stat.h> #include <sys/types.h> #include <unistd.h> #include <sys/syscall.h> +#include <sched.h> +#include <sys/wait.h> +#include <sys/mount.h> +#include <sys/fcntl.h> -struct bss { - __u64 dev; - __u64 ino; - __u64 pid_tgid; - __u64 user_pid_tgid; -}; +#define STACK_SIZE (1024 * 1024) +static char child_stack[STACK_SIZE]; -void test_ns_current_pid_tgid(void) +static int test_current_pid_tgid(void *args) { - const char *probe_name = "raw_tracepoint/sys_enter"; - const char *file = "test_ns_current_pid_tgid.o"; - int err, key = 0, duration = 0; - struct bpf_link *link = NULL; - struct bpf_program *prog; - struct bpf_map *bss_map; - struct bpf_object *obj; - struct bss bss; + struct test_ns_current_pid_tgid__bss *bss; + struct test_ns_current_pid_tgid *skel; + int err = -1, duration = 0; + pid_t tgid, pid; struct stat st; - __u64 id; - - obj = bpf_object__open_file(file, NULL); - if (CHECK(IS_ERR(obj), "obj_open", "err %ld\n", PTR_ERR(obj))) - return; - err = bpf_object__load(obj); - if (CHECK(err, "obj_load", "err %d errno %d\n", err, errno)) + skel = test_ns_current_pid_tgid__open_and_load(); + if (CHECK(!skel, "skel_open_load", "failed to load skeleton\n")) goto cleanup; - bss_map = bpf_object__find_map_by_name(obj, "test_ns_.bss"); - if (CHECK(!bss_map, "find_bss_map", "failed\n")) + pid = syscall(SYS_gettid); + tgid = getpid(); + + err = stat("/proc/self/ns/pid", &st); + if (CHECK(err, "stat", "failed /proc/self/ns/pid: %d\n", err)) goto cleanup; - prog = bpf_object__find_program_by_title(obj, probe_name); - if (CHECK(!prog, "find_prog", "prog '%s' not found\n", - probe_name)) + bss = skel->bss; + bss->dev = st.st_dev; + bss->ino = st.st_ino; + bss->user_pid = 0; + bss->user_tgid = 0; + + err = test_ns_current_pid_tgid__attach(skel); + if (CHECK(err, "skel_attach", "skeleton attach failed: %d\n", err)) goto cleanup; - memset(&bss, 0, sizeof(bss)); - pid_t tid = syscall(SYS_gettid); - pid_t pid = getpid(); + /* trigger tracepoint */ + usleep(1); + ASSERT_EQ(bss->user_pid, pid, "pid"); + ASSERT_EQ(bss->user_tgid, tgid, "tgid"); + err = 0; - id = (__u64) tid << 32 | pid; - bss.user_pid_tgid = id; +cleanup: + test_ns_current_pid_tgid__destroy(skel); - if (CHECK_FAIL(stat("/proc/self/ns/pid", &st))) { - perror("Failed to stat /proc/self/ns/pid"); - goto cleanup; - } + return err; +} - bss.dev = st.st_dev; - bss.ino = st.st_ino; +static void test_ns_current_pid_tgid_new_ns(void) +{ + int wstatus, duration = 0; + pid_t cpid; - err = bpf_map_update_elem(bpf_map__fd(bss_map), &key, &bss, 0); - if (CHECK(err, "setting_bss", "failed to set bss : %d\n", err)) - goto cleanup; + /* Create a process in a new namespace, this process + * will be the init process of this new namespace hence will be pid 1. + */ + cpid = clone(test_current_pid_tgid, child_stack + STACK_SIZE, + CLONE_NEWPID | SIGCHLD, NULL); - link = bpf_program__attach_raw_tracepoint(prog, "sys_enter"); - if (CHECK(IS_ERR(link), "attach_raw_tp", "err %ld\n", - PTR_ERR(link))) { - link = NULL; - goto cleanup; - } + if (CHECK(cpid == -1, "clone", strerror(errno))) + return; - /* trigger some syscalls */ - usleep(1); + if (CHECK(waitpid(cpid, &wstatus, 0) == -1, "waitpid", strerror(errno))) + return; - err = bpf_map_lookup_elem(bpf_map__fd(bss_map), &key, &bss); - if (CHECK(err, "set_bss", "failed to get bss : %d\n", err)) - goto cleanup; + if (CHECK(WEXITSTATUS(wstatus) != 0, "newns_pidtgid", "failed")) + return; +} - if (CHECK(id != bss.pid_tgid, "Compare user pid/tgid vs. bpf pid/tgid", - "User pid/tgid %llu BPF pid/tgid %llu\n", id, bss.pid_tgid)) - goto cleanup; -cleanup: - bpf_link__destroy(link); - bpf_object__close(obj); +void test_ns_current_pid_tgid(void) +{ + if (test__start_subtest("ns_current_pid_tgid_root_ns")) + test_current_pid_tgid(NULL); + if (test__start_subtest("ns_current_pid_tgid_new_ns")) + test_ns_current_pid_tgid_new_ns(); } diff --git a/tools/testing/selftests/bpf/prog_tests/recursion.c b/tools/testing/selftests/bpf/prog_tests/recursion.c new file mode 100644 index 000000000000..0e378d63fe18 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/recursion.c @@ -0,0 +1,41 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2021 Facebook */ +#include <test_progs.h> +#include "recursion.skel.h" + +void test_recursion(void) +{ + struct bpf_prog_info prog_info = {}; + __u32 prog_info_len = sizeof(prog_info); + struct recursion *skel; + int key = 0; + int err; + + skel = recursion__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_open_and_load")) + return; + + err = recursion__attach(skel); + if (!ASSERT_OK(err, "skel_attach")) + goto out; + + ASSERT_EQ(skel->bss->pass1, 0, "pass1 == 0"); + bpf_map_lookup_elem(bpf_map__fd(skel->maps.hash1), &key, 0); + ASSERT_EQ(skel->bss->pass1, 1, "pass1 == 1"); + bpf_map_lookup_elem(bpf_map__fd(skel->maps.hash1), &key, 0); + ASSERT_EQ(skel->bss->pass1, 2, "pass1 == 2"); + + ASSERT_EQ(skel->bss->pass2, 0, "pass2 == 0"); + bpf_map_lookup_elem(bpf_map__fd(skel->maps.hash2), &key, 0); + ASSERT_EQ(skel->bss->pass2, 1, "pass2 == 1"); + bpf_map_lookup_elem(bpf_map__fd(skel->maps.hash2), &key, 0); + ASSERT_EQ(skel->bss->pass2, 2, "pass2 == 2"); + + err = bpf_obj_get_info_by_fd(bpf_program__fd(skel->progs.on_lookup), + &prog_info, &prog_info_len); + if (!ASSERT_OK(err, "get_prog_info")) + goto out; + ASSERT_EQ(prog_info.recursion_misses, 2, "recursion_misses"); +out: + recursion__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/prog_tests/socket_cookie.c b/tools/testing/selftests/bpf/prog_tests/socket_cookie.c new file mode 100644 index 000000000000..232db28dde18 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/socket_cookie.c @@ -0,0 +1,76 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Google LLC. +// Copyright (c) 2018 Facebook + +#include <test_progs.h> +#include "socket_cookie_prog.skel.h" +#include "network_helpers.h" + +static int duration; + +struct socket_cookie { + __u64 cookie_key; + __u32 cookie_value; +}; + +void test_socket_cookie(void) +{ + int server_fd = 0, client_fd = 0, cgroup_fd = 0, err = 0; + socklen_t addr_len = sizeof(struct sockaddr_in6); + struct socket_cookie_prog *skel; + __u32 cookie_expected_value; + struct sockaddr_in6 addr; + struct socket_cookie val; + + skel = socket_cookie_prog__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_open")) + return; + + cgroup_fd = test__join_cgroup("/socket_cookie"); + if (CHECK(cgroup_fd < 0, "join_cgroup", "cgroup creation failed\n")) + goto out; + + skel->links.set_cookie = bpf_program__attach_cgroup( + skel->progs.set_cookie, cgroup_fd); + if (!ASSERT_OK_PTR(skel->links.set_cookie, "prog_attach")) + goto close_cgroup_fd; + + skel->links.update_cookie_sockops = bpf_program__attach_cgroup( + skel->progs.update_cookie_sockops, cgroup_fd); + if (!ASSERT_OK_PTR(skel->links.update_cookie_sockops, "prog_attach")) + goto close_cgroup_fd; + + skel->links.update_cookie_tracing = bpf_program__attach( + skel->progs.update_cookie_tracing); + if (!ASSERT_OK_PTR(skel->links.update_cookie_tracing, "prog_attach")) + goto close_cgroup_fd; + + server_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0); + if (CHECK(server_fd < 0, "start_server", "errno %d\n", errno)) + goto close_cgroup_fd; + + client_fd = connect_to_fd(server_fd, 0); + if (CHECK(client_fd < 0, "connect_to_fd", "errno %d\n", errno)) + goto close_server_fd; + + err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.socket_cookies), + &client_fd, &val); + if (!ASSERT_OK(err, "map_lookup(socket_cookies)")) + goto close_client_fd; + + err = getsockname(client_fd, (struct sockaddr *)&addr, &addr_len); + if (!ASSERT_OK(err, "getsockname")) + goto close_client_fd; + + cookie_expected_value = (ntohs(addr.sin6_port) << 8) | 0xFF; + ASSERT_EQ(val.cookie_value, cookie_expected_value, "cookie_value"); + +close_client_fd: + close(client_fd); +close_server_fd: + close(server_fd); +close_cgroup_fd: + close(cgroup_fd); +out: + socket_cookie_prog__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c index 85f73261fab0..b8b48cac2ac3 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 // Copyright (c) 2020 Cloudflare #include <error.h> +#include <netinet/tcp.h> #include "test_progs.h" #include "test_skmsg_load_helpers.skel.h" diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c index b25c9c45c148..d5b44b135c00 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c +++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c @@ -2,6 +2,12 @@ #include <test_progs.h> #include "cgroup_helpers.h" +#include <linux/tcp.h> + +#ifndef SOL_TCP +#define SOL_TCP IPPROTO_TCP +#endif + #define SOL_CUSTOM 0xdeadbeef static int getsetsockopt(void) @@ -11,6 +17,7 @@ static int getsetsockopt(void) char u8[4]; __u32 u32; char cc[16]; /* TCP_CA_NAME_MAX */ + struct tcp_zerocopy_receive zc; } buf = {}; socklen_t optlen; char *big_buf = NULL; @@ -154,6 +161,27 @@ static int getsetsockopt(void) goto err; } + /* TCP_ZEROCOPY_RECEIVE triggers */ + memset(&buf, 0, sizeof(buf)); + optlen = sizeof(buf.zc); + err = getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen); + if (err) { + log_err("Unexpected getsockopt(TCP_ZEROCOPY_RECEIVE) err=%d errno=%d", + err, errno); + goto err; + } + + memset(&buf, 0, sizeof(buf)); + buf.zc.address = 12345; /* rejected by BPF */ + optlen = sizeof(buf.zc); + errno = 0; + err = getsockopt(fd, SOL_TCP, TCP_ZEROCOPY_RECEIVE, &buf, &optlen); + if (errno != EPERM) { + log_err("Unexpected getsockopt(TCP_ZEROCOPY_RECEIVE) err=%d errno=%d", + err, errno); + goto err; + } + free(big_buf); close(fd); return 0; diff --git a/tools/testing/selftests/bpf/prog_tests/stack_var_off.c b/tools/testing/selftests/bpf/prog_tests/stack_var_off.c new file mode 100644 index 000000000000..2ce9deefa59c --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/stack_var_off.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <test_progs.h> +#include "test_stack_var_off.skel.h" + +/* Test read and writes to the stack performed with offsets that are not + * statically known. + */ +void test_stack_var_off(void) +{ + int duration = 0; + struct test_stack_var_off *skel; + + skel = test_stack_var_off__open_and_load(); + if (CHECK(!skel, "skel_open", "failed to open skeleton\n")) + return; + + /* Give pid to bpf prog so it doesn't trigger for anyone else. */ + skel->bss->test_pid = getpid(); + /* Initialize the probe's input. */ + skel->bss->input[0] = 2; + skel->bss->input[1] = 42; /* This will be returned in probe_res. */ + + if (!ASSERT_OK(test_stack_var_off__attach(skel), "skel_attach")) + goto cleanup; + + /* Trigger probe. */ + usleep(1); + + if (CHECK(skel->bss->probe_res != 42, "check_probe_res", + "wrong probe res: %d\n", skel->bss->probe_res)) + goto cleanup; + +cleanup: + test_stack_var_off__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c index 32e4348b714b..7e13129f593a 100644 --- a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c +++ b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c @@ -61,6 +61,14 @@ void test_test_global_funcs(void) { "test_global_func6.o" , "modified ctx ptr R2" }, { "test_global_func7.o" , "foo() doesn't return scalar" }, { "test_global_func8.o" }, + { "test_global_func9.o" }, + { "test_global_func10.o", "invalid indirect read from stack" }, + { "test_global_func11.o", "Caller passes invalid args into func#1" }, + { "test_global_func12.o", "invalid mem access 'mem_or_null'" }, + { "test_global_func13.o", "Caller passes invalid args into func#1" }, + { "test_global_func14.o", "reference type('FWD S') size cannot be determined" }, + { "test_global_func15.o", "At program exit the register R0 has value" }, + { "test_global_func16.o", "invalid indirect read from stack" }, }; libbpf_print_fn_t old_print_fn = NULL; int err, i, duration = 0; diff --git a/tools/testing/selftests/bpf/prog_tests/test_ima.c b/tools/testing/selftests/bpf/prog_tests/test_ima.c index 61fca681d524..b54bc0c351b7 100644 --- a/tools/testing/selftests/bpf/prog_tests/test_ima.c +++ b/tools/testing/selftests/bpf/prog_tests/test_ima.c @@ -9,6 +9,7 @@ #include <unistd.h> #include <sys/wait.h> #include <test_progs.h> +#include <linux/ring_buffer.h> #include "ima.skel.h" @@ -31,9 +32,18 @@ static int run_measured_process(const char *measured_dir, u32 *monitored_pid) return -EINVAL; } +static u64 ima_hash_from_bpf; + +static int process_sample(void *ctx, void *data, size_t len) +{ + ima_hash_from_bpf = *((u64 *)data); + return 0; +} + void test_test_ima(void) { char measured_dir_template[] = "/tmp/ima_measuredXXXXXX"; + struct ring_buffer *ringbuf; const char *measured_dir; char cmd[256]; @@ -44,6 +54,11 @@ void test_test_ima(void) if (CHECK(!skel, "skel_load", "skeleton failed\n")) goto close_prog; + ringbuf = ring_buffer__new(bpf_map__fd(skel->maps.ringbuf), + process_sample, NULL, NULL); + if (!ASSERT_OK_PTR(ringbuf, "ringbuf")) + goto close_prog; + err = ima__attach(skel); if (CHECK(err, "attach", "attach failed: %d\n", err)) goto close_prog; @@ -60,11 +75,9 @@ void test_test_ima(void) if (CHECK(err, "run_measured_process", "err = %d\n", err)) goto close_clean; - CHECK(skel->data->ima_hash_ret < 0, "ima_hash_ret", - "ima_hash_ret = %ld\n", skel->data->ima_hash_ret); - - CHECK(skel->bss->ima_hash == 0, "ima_hash", - "ima_hash = %lu\n", skel->bss->ima_hash); + err = ring_buffer__consume(ringbuf); + ASSERT_EQ(err, 1, "num_samples_or_err"); + ASSERT_NEQ(ima_hash_from_bpf, 0, "ima_hash"); close_clean: snprintf(cmd, sizeof(cmd), "./ima_setup.sh cleanup %s", measured_dir); diff --git a/tools/testing/selftests/bpf/prog_tests/test_local_storage.c b/tools/testing/selftests/bpf/prog_tests/test_local_storage.c index 3bfcf00c0a67..d2c16eaae367 100644 --- a/tools/testing/selftests/bpf/prog_tests/test_local_storage.c +++ b/tools/testing/selftests/bpf/prog_tests/test_local_storage.c @@ -113,7 +113,7 @@ static bool check_syscall_operations(int map_fd, int obj_fd) void test_test_local_storage(void) { - char tmp_dir_path[64] = "/tmp/local_storageXXXXXX"; + char tmp_dir_path[] = "/tmp/local_storageXXXXXX"; int err, serv_sk = -1, task_fd = -1, rm_fd = -1; struct local_storage *skel = NULL; char tmp_exec_path[64]; diff --git a/tools/testing/selftests/bpf/prog_tests/trampoline_count.c b/tools/testing/selftests/bpf/prog_tests/trampoline_count.c index 781c8d11604b..f3022d934e2d 100644 --- a/tools/testing/selftests/bpf/prog_tests/trampoline_count.c +++ b/tools/testing/selftests/bpf/prog_tests/trampoline_count.c @@ -4,7 +4,7 @@ #include <sys/prctl.h> #include <test_progs.h> -#define MAX_TRAMP_PROGS 40 +#define MAX_TRAMP_PROGS 38 struct inst { struct bpf_object *obj; @@ -52,7 +52,7 @@ void test_trampoline_count(void) struct bpf_link *link; char comm[16] = {}; - /* attach 'allowed' 40 trampoline programs */ + /* attach 'allowed' trampoline programs */ for (i = 0; i < MAX_TRAMP_PROGS; i++) { obj = bpf_object__open_file(object, NULL); if (CHECK(IS_ERR(obj), "obj_open_file", "err %ld\n", PTR_ERR(obj))) { diff --git a/tools/testing/selftests/bpf/progs/atomic_bounds.c b/tools/testing/selftests/bpf/progs/atomic_bounds.c new file mode 100644 index 000000000000..e5fff7fc7f8f --- /dev/null +++ b/tools/testing/selftests/bpf/progs/atomic_bounds.c @@ -0,0 +1,24 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include <stdbool.h> + +#ifdef ENABLE_ATOMICS_TESTS +bool skip_tests __attribute((__section__(".data"))) = false; +#else +bool skip_tests = true; +#endif + +SEC("fentry/bpf_fentry_test1") +int BPF_PROG(sub, int x) +{ +#ifdef ENABLE_ATOMICS_TESTS + int a = 0; + int b = __sync_fetch_and_add(&a, 1); + /* b is certainly 0 here. Can the verifier tell? */ + while (b) + continue; +#endif + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/bind_perm.c b/tools/testing/selftests/bpf/progs/bind_perm.c new file mode 100644 index 000000000000..7bd2a027025d --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bind_perm.c @@ -0,0 +1,45 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/stddef.h> +#include <linux/bpf.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> + +static __always_inline int bind_prog(struct bpf_sock_addr *ctx, int family) +{ + struct bpf_sock *sk; + + sk = ctx->sk; + if (!sk) + return 0; + + if (sk->family != family) + return 0; + + if (ctx->type != SOCK_STREAM) + return 0; + + /* Return 1 OR'ed with the first bit set to indicate + * that CAP_NET_BIND_SERVICE should be bypassed. + */ + if (ctx->user_port == bpf_htons(111)) + return (1 | 2); + + return 1; +} + +SEC("cgroup/bind4") +int bind_v4_prog(struct bpf_sock_addr *ctx) +{ + return bind_prog(ctx, AF_INET); +} + +SEC("cgroup/bind6") +int bind_v6_prog(struct bpf_sock_addr *ctx) +{ + return bind_prog(ctx, AF_INET6); +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h index 6a1255465fd6..3d83b185c4bc 100644 --- a/tools/testing/selftests/bpf/progs/bpf_iter.h +++ b/tools/testing/selftests/bpf/progs/bpf_iter.h @@ -7,6 +7,7 @@ #define bpf_iter__netlink bpf_iter__netlink___not_used #define bpf_iter__task bpf_iter__task___not_used #define bpf_iter__task_file bpf_iter__task_file___not_used +#define bpf_iter__task_vma bpf_iter__task_vma___not_used #define bpf_iter__tcp bpf_iter__tcp___not_used #define tcp6_sock tcp6_sock___not_used #define bpf_iter__udp bpf_iter__udp___not_used @@ -26,6 +27,7 @@ #undef bpf_iter__netlink #undef bpf_iter__task #undef bpf_iter__task_file +#undef bpf_iter__task_vma #undef bpf_iter__tcp #undef tcp6_sock #undef bpf_iter__udp @@ -67,6 +69,12 @@ struct bpf_iter__task_file { struct file *file; } __attribute__((preserve_access_index)); +struct bpf_iter__task_vma { + struct bpf_iter_meta *meta; + struct task_struct *task; + struct vm_area_struct *vma; +} __attribute__((preserve_access_index)); + struct bpf_iter__bpf_map { struct bpf_iter_meta *meta; struct bpf_map *map; diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c new file mode 100644 index 000000000000..11d1aa37cf11 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c @@ -0,0 +1,58 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ +#include "bpf_iter.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +char _license[] SEC("license") = "GPL"; + +/* Copied from mm.h */ +#define VM_READ 0x00000001 +#define VM_WRITE 0x00000002 +#define VM_EXEC 0x00000004 +#define VM_MAYSHARE 0x00000080 + +/* Copied from kdev_t.h */ +#define MINORBITS 20 +#define MINORMASK ((1U << MINORBITS) - 1) +#define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS)) +#define MINOR(dev) ((unsigned int) ((dev) & MINORMASK)) + +#define D_PATH_BUF_SIZE 1024 +char d_path_buf[D_PATH_BUF_SIZE] = {}; +__u32 pid = 0; + +SEC("iter/task_vma") int proc_maps(struct bpf_iter__task_vma *ctx) +{ + struct vm_area_struct *vma = ctx->vma; + struct seq_file *seq = ctx->meta->seq; + struct task_struct *task = ctx->task; + struct file *file; + char perm_str[] = "----"; + + if (task == (void *)0 || vma == (void *)0) + return 0; + + file = vma->vm_file; + if (task->tgid != pid) + return 0; + perm_str[0] = (vma->vm_flags & VM_READ) ? 'r' : '-'; + perm_str[1] = (vma->vm_flags & VM_WRITE) ? 'w' : '-'; + perm_str[2] = (vma->vm_flags & VM_EXEC) ? 'x' : '-'; + perm_str[3] = (vma->vm_flags & VM_MAYSHARE) ? 's' : 'p'; + BPF_SEQ_PRINTF(seq, "%08llx-%08llx %s ", vma->vm_start, vma->vm_end, perm_str); + + if (file) { + __u32 dev = file->f_inode->i_sb->s_dev; + + bpf_d_path(&file->f_path, d_path_buf, D_PATH_BUF_SIZE); + + BPF_SEQ_PRINTF(seq, "%08llx ", vma->vm_pgoff << 12); + BPF_SEQ_PRINTF(seq, "%02x:%02x %u", MAJOR(dev), MINOR(dev), + file->f_inode->i_ino); + BPF_SEQ_PRINTF(seq, "\t%s\n", d_path_buf); + } else { + BPF_SEQ_PRINTF(seq, "%08llx 00:00 0\n", 0ULL); + } + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/connect_force_port4.c b/tools/testing/selftests/bpf/progs/connect_force_port4.c index 7396308677a3..a979aaef2a76 100644 --- a/tools/testing/selftests/bpf/progs/connect_force_port4.c +++ b/tools/testing/selftests/bpf/progs/connect_force_port4.c @@ -10,6 +10,8 @@ #include <bpf/bpf_helpers.h> #include <bpf/bpf_endian.h> +#include <bpf_sockopt_helpers.h> + char _license[] SEC("license") = "GPL"; int _version SEC("version") = 1; @@ -58,6 +60,9 @@ int connect4(struct bpf_sock_addr *ctx) SEC("cgroup/getsockname4") int getsockname4(struct bpf_sock_addr *ctx) { + if (!get_set_sk_priority(ctx)) + return 1; + /* Expose local server as 1.2.3.4:60000 to client. */ if (ctx->user_port == bpf_htons(60123)) { ctx->user_ip4 = bpf_htonl(0x01020304); @@ -71,6 +76,9 @@ int getpeername4(struct bpf_sock_addr *ctx) { struct svc_addr *orig; + if (!get_set_sk_priority(ctx)) + return 1; + /* Expose service 1.2.3.4:60000 as peer instead of backend. */ if (ctx->user_port == bpf_htons(60123)) { orig = bpf_sk_storage_get(&service_mapping, ctx->sk, 0, 0); diff --git a/tools/testing/selftests/bpf/progs/connect_force_port6.c b/tools/testing/selftests/bpf/progs/connect_force_port6.c index c1a2b555e9ad..afc8f1c5a9d6 100644 --- a/tools/testing/selftests/bpf/progs/connect_force_port6.c +++ b/tools/testing/selftests/bpf/progs/connect_force_port6.c @@ -9,6 +9,8 @@ #include <bpf/bpf_helpers.h> #include <bpf/bpf_endian.h> +#include <bpf_sockopt_helpers.h> + char _license[] SEC("license") = "GPL"; int _version SEC("version") = 1; @@ -63,6 +65,9 @@ int connect6(struct bpf_sock_addr *ctx) SEC("cgroup/getsockname6") int getsockname6(struct bpf_sock_addr *ctx) { + if (!get_set_sk_priority(ctx)) + return 1; + /* Expose local server as [fc00::1]:60000 to client. */ if (ctx->user_port == bpf_htons(60124)) { ctx->user_ip6[0] = bpf_htonl(0xfc000000); @@ -79,6 +84,9 @@ int getpeername6(struct bpf_sock_addr *ctx) { struct svc_addr *orig; + if (!get_set_sk_priority(ctx)) + return 1; + /* Expose service [fc00::1]:60000 as peer instead of backend. */ if (ctx->user_port == bpf_htons(60124)) { orig = bpf_sk_storage_get(&service_mapping, ctx->sk, 0, 0); diff --git a/tools/testing/selftests/bpf/progs/ima.c b/tools/testing/selftests/bpf/progs/ima.c index 86b21aff4bc5..96060ff4ffc6 100644 --- a/tools/testing/selftests/bpf/progs/ima.c +++ b/tools/testing/selftests/bpf/progs/ima.c @@ -9,20 +9,37 @@ #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> -long ima_hash_ret = -1; -u64 ima_hash = 0; u32 monitored_pid = 0; +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 1 << 12); +} ringbuf SEC(".maps"); + char _license[] SEC("license") = "GPL"; SEC("lsm.s/bprm_committed_creds") -int BPF_PROG(ima, struct linux_binprm *bprm) +void BPF_PROG(ima, struct linux_binprm *bprm) { - u32 pid = bpf_get_current_pid_tgid() >> 32; + u64 ima_hash = 0; + u64 *sample; + int ret; + u32 pid; + + pid = bpf_get_current_pid_tgid() >> 32; + if (pid == monitored_pid) { + ret = bpf_ima_inode_hash(bprm->file->f_inode, &ima_hash, + sizeof(ima_hash)); + if (ret < 0 || ima_hash == 0) + return; + + sample = bpf_ringbuf_reserve(&ringbuf, sizeof(u64), 0); + if (!sample) + return; - if (pid == monitored_pid) - ima_hash_ret = bpf_ima_inode_hash(bprm->file->f_inode, - &ima_hash, sizeof(ima_hash)); + *sample = ima_hash; + bpf_ringbuf_submit(sample, 0); + } - return 0; + return; } diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c index ff4d343b94b5..33694ef8acfa 100644 --- a/tools/testing/selftests/bpf/progs/lsm.c +++ b/tools/testing/selftests/bpf/progs/lsm.c @@ -30,6 +30,53 @@ struct { __type(value, __u64); } lru_hash SEC(".maps"); +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} percpu_array SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_HASH); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} percpu_hash SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_LRU_PERCPU_HASH); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} lru_percpu_hash SEC(".maps"); + +struct inner_map { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, int); + __type(value, __u64); +} inner_map SEC(".maps"); + +struct outer_arr { + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); + __uint(max_entries, 1); + __uint(key_size, sizeof(int)); + __uint(value_size, sizeof(int)); + __array(values, struct inner_map); +} outer_arr SEC(".maps") = { + .values = { [0] = &inner_map }, +}; + +struct outer_hash { + __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS); + __uint(max_entries, 1); + __uint(key_size, sizeof(int)); + __array(values, struct inner_map); +} outer_hash SEC(".maps") = { + .values = { [0] = &inner_map }, +}; + char _license[] SEC("license") = "GPL"; int monitored_pid = 0; @@ -61,6 +108,7 @@ SEC("lsm.s/bprm_committed_creds") int BPF_PROG(test_void_hook, struct linux_binprm *bprm) { __u32 pid = bpf_get_current_pid_tgid() >> 32; + struct inner_map *inner_map; char args[64]; __u32 key = 0; __u64 *value; @@ -80,6 +128,27 @@ int BPF_PROG(test_void_hook, struct linux_binprm *bprm) value = bpf_map_lookup_elem(&lru_hash, &key); if (value) *value = 0; + value = bpf_map_lookup_elem(&percpu_array, &key); + if (value) + *value = 0; + value = bpf_map_lookup_elem(&percpu_hash, &key); + if (value) + *value = 0; + value = bpf_map_lookup_elem(&lru_percpu_hash, &key); + if (value) + *value = 0; + inner_map = bpf_map_lookup_elem(&outer_arr, &key); + if (inner_map) { + value = bpf_map_lookup_elem(inner_map, &key); + if (value) + *value = 0; + } + inner_map = bpf_map_lookup_elem(&outer_hash, &key); + if (inner_map) { + value = bpf_map_lookup_elem(inner_map, &key); + if (value) + *value = 0; + } return 0; } diff --git a/tools/testing/selftests/bpf/progs/recursion.c b/tools/testing/selftests/bpf/progs/recursion.c new file mode 100644 index 000000000000..49f679375b9d --- /dev/null +++ b/tools/testing/selftests/bpf/progs/recursion.c @@ -0,0 +1,46 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2021 Facebook */ + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +char _license[] SEC("license") = "GPL"; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 1); + __type(key, int); + __type(value, long); +} hash1 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 1); + __type(key, int); + __type(value, long); +} hash2 SEC(".maps"); + +int pass1 = 0; +int pass2 = 0; + +SEC("fentry/__htab_map_lookup_elem") +int BPF_PROG(on_lookup, struct bpf_map *map) +{ + int key = 0; + + if (map == (void *)&hash1) { + pass1++; + return 0; + } + if (map == (void *)&hash2) { + pass2++; + /* htab_map_gen_lookup() will inline below call + * into direct call to __htab_map_lookup_elem() + */ + bpf_map_lookup_elem(&hash2, &key); + return 0; + } + + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/recvmsg4_prog.c b/tools/testing/selftests/bpf/progs/recvmsg4_prog.c new file mode 100644 index 000000000000..3d1ae8b3402f --- /dev/null +++ b/tools/testing/selftests/bpf/progs/recvmsg4_prog.c @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/stddef.h> +#include <linux/bpf.h> +#include <linux/in.h> +#include <sys/socket.h> + +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> + +#include <bpf_sockopt_helpers.h> + +#define SERV4_IP 0xc0a801feU /* 192.168.1.254 */ +#define SERV4_PORT 4040 + +SEC("cgroup/recvmsg4") +int recvmsg4_prog(struct bpf_sock_addr *ctx) +{ + struct bpf_sock *sk; + __u32 user_ip4; + __u16 user_port; + + sk = ctx->sk; + if (!sk) + return 1; + + if (sk->family != AF_INET) + return 1; + + if (ctx->type != SOCK_STREAM && ctx->type != SOCK_DGRAM) + return 1; + + if (!get_set_sk_priority(ctx)) + return 1; + + ctx->user_ip4 = bpf_htonl(SERV4_IP); + ctx->user_port = bpf_htons(SERV4_PORT); + + return 1; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/recvmsg6_prog.c b/tools/testing/selftests/bpf/progs/recvmsg6_prog.c new file mode 100644 index 000000000000..27dfb21b21b4 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/recvmsg6_prog.c @@ -0,0 +1,48 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/stddef.h> +#include <linux/bpf.h> +#include <linux/in6.h> +#include <sys/socket.h> + +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> + +#include <bpf_sockopt_helpers.h> + +#define SERV6_IP_0 0xfaceb00c /* face:b00c:1234:5678::abcd */ +#define SERV6_IP_1 0x12345678 +#define SERV6_IP_2 0x00000000 +#define SERV6_IP_3 0x0000abcd +#define SERV6_PORT 6060 + +SEC("cgroup/recvmsg6") +int recvmsg6_prog(struct bpf_sock_addr *ctx) +{ + struct bpf_sock *sk; + __u32 user_ip4; + __u16 user_port; + + sk = ctx->sk; + if (!sk) + return 1; + + if (sk->family != AF_INET6) + return 1; + + if (ctx->type != SOCK_STREAM && ctx->type != SOCK_DGRAM) + return 1; + + if (!get_set_sk_priority(ctx)) + return 1; + + ctx->user_ip6[0] = bpf_htonl(SERV6_IP_0); + ctx->user_ip6[1] = bpf_htonl(SERV6_IP_1); + ctx->user_ip6[2] = bpf_htonl(SERV6_IP_2); + ctx->user_ip6[3] = bpf_htonl(SERV6_IP_3); + ctx->user_port = bpf_htons(SERV6_PORT); + + return 1; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/sendmsg4_prog.c b/tools/testing/selftests/bpf/progs/sendmsg4_prog.c index 092d9da536f3..ac5abc34cde8 100644 --- a/tools/testing/selftests/bpf/progs/sendmsg4_prog.c +++ b/tools/testing/selftests/bpf/progs/sendmsg4_prog.c @@ -8,6 +8,8 @@ #include <bpf/bpf_helpers.h> #include <bpf/bpf_endian.h> +#include <bpf_sockopt_helpers.h> + #define SRC1_IP4 0xAC100001U /* 172.16.0.1 */ #define SRC2_IP4 0x00000000U #define SRC_REWRITE_IP4 0x7f000004U @@ -21,9 +23,14 @@ int _version SEC("version") = 1; SEC("cgroup/sendmsg4") int sendmsg_v4_prog(struct bpf_sock_addr *ctx) { + int prio; + if (ctx->type != SOCK_DGRAM) return 0; + if (!get_set_sk_priority(ctx)) + return 0; + /* Rewrite source. */ if (ctx->msg_src_ip4 == bpf_htonl(SRC1_IP4) || ctx->msg_src_ip4 == bpf_htonl(SRC2_IP4)) { diff --git a/tools/testing/selftests/bpf/progs/sendmsg6_prog.c b/tools/testing/selftests/bpf/progs/sendmsg6_prog.c index 255a432bc163..24694b1a8d82 100644 --- a/tools/testing/selftests/bpf/progs/sendmsg6_prog.c +++ b/tools/testing/selftests/bpf/progs/sendmsg6_prog.c @@ -8,6 +8,8 @@ #include <bpf/bpf_helpers.h> #include <bpf/bpf_endian.h> +#include <bpf_sockopt_helpers.h> + #define SRC_REWRITE_IP6_0 0 #define SRC_REWRITE_IP6_1 0 #define SRC_REWRITE_IP6_2 0 @@ -28,6 +30,9 @@ int sendmsg_v6_prog(struct bpf_sock_addr *ctx) if (ctx->type != SOCK_DGRAM) return 0; + if (!get_set_sk_priority(ctx)) + return 0; + /* Rewrite source. */ if (ctx->msg_src_ip6[3] == bpf_htonl(1) || ctx->msg_src_ip6[3] == bpf_htonl(0)) { diff --git a/tools/testing/selftests/bpf/progs/socket_cookie_prog.c b/tools/testing/selftests/bpf/progs/socket_cookie_prog.c index 0cb5656a22b0..35630a5aaf5f 100644 --- a/tools/testing/selftests/bpf/progs/socket_cookie_prog.c +++ b/tools/testing/selftests/bpf/progs/socket_cookie_prog.c @@ -1,11 +1,13 @@ // SPDX-License-Identifier: GPL-2.0 // Copyright (c) 2018 Facebook -#include <linux/bpf.h> -#include <sys/socket.h> +#include "vmlinux.h" #include <bpf/bpf_helpers.h> #include <bpf/bpf_endian.h> +#include <bpf/bpf_tracing.h> + +#define AF_INET6 10 struct socket_cookie { __u64 cookie_key; @@ -19,6 +21,14 @@ struct { __type(value, struct socket_cookie); } socket_cookies SEC(".maps"); +/* + * These three programs get executed in a row on connect() syscalls. The + * userspace side of the test creates a client socket, issues a connect() on it + * and then checks that the local storage associated with this socket has: + * cookie_value == local_port << 8 | 0xFF + * The different parts of this cookie_value are appended by those hooks if they + * all agree on the output of bpf_get_socket_cookie(). + */ SEC("cgroup/connect6") int set_cookie(struct bpf_sock_addr *ctx) { @@ -32,16 +42,16 @@ int set_cookie(struct bpf_sock_addr *ctx) if (!p) return 1; - p->cookie_value = 0xFF; + p->cookie_value = 0xF; p->cookie_key = bpf_get_socket_cookie(ctx); return 1; } SEC("sockops") -int update_cookie(struct bpf_sock_ops *ctx) +int update_cookie_sockops(struct bpf_sock_ops *ctx) { - struct bpf_sock *sk; + struct bpf_sock *sk = ctx->sk; struct socket_cookie *p; if (ctx->family != AF_INET6) @@ -50,21 +60,40 @@ int update_cookie(struct bpf_sock_ops *ctx) if (ctx->op != BPF_SOCK_OPS_TCP_CONNECT_CB) return 1; - if (!ctx->sk) + if (!sk) return 1; - p = bpf_sk_storage_get(&socket_cookies, ctx->sk, 0, 0); + p = bpf_sk_storage_get(&socket_cookies, sk, 0, 0); if (!p) return 1; if (p->cookie_key != bpf_get_socket_cookie(ctx)) return 1; - p->cookie_value = (ctx->local_port << 8) | p->cookie_value; + p->cookie_value |= (ctx->local_port << 8); return 1; } -int _version SEC("version") = 1; +SEC("fexit/inet_stream_connect") +int BPF_PROG(update_cookie_tracing, struct socket *sock, + struct sockaddr *uaddr, int addr_len, int flags) +{ + struct socket_cookie *p; + + if (uaddr->sa_family != AF_INET6) + return 0; + + p = bpf_sk_storage_get(&socket_cookies, sock->sk, 0, 0); + if (!p) + return 0; + + if (p->cookie_key != bpf_get_socket_cookie(sock->sk)) + return 0; + + p->cookie_value |= 0xF0; + + return 0; +} char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/sockopt_sk.c b/tools/testing/selftests/bpf/progs/sockopt_sk.c index 712df7b49cb1..d3597f81e6e9 100644 --- a/tools/testing/selftests/bpf/progs/sockopt_sk.c +++ b/tools/testing/selftests/bpf/progs/sockopt_sk.c @@ -1,8 +1,8 @@ // SPDX-License-Identifier: GPL-2.0 #include <string.h> -#include <netinet/in.h> -#include <netinet/tcp.h> +#include <linux/tcp.h> #include <linux/bpf.h> +#include <netinet/in.h> #include <bpf/bpf_helpers.h> char _license[] SEC("license") = "GPL"; @@ -12,6 +12,10 @@ __u32 _version SEC("version") = 1; #define PAGE_SIZE 4096 #endif +#ifndef SOL_TCP +#define SOL_TCP IPPROTO_TCP +#endif + #define SOL_CUSTOM 0xdeadbeef struct sockopt_sk { @@ -57,6 +61,21 @@ int _getsockopt(struct bpf_sockopt *ctx) return 1; } + if (ctx->level == SOL_TCP && ctx->optname == TCP_ZEROCOPY_RECEIVE) { + /* Verify that TCP_ZEROCOPY_RECEIVE triggers. + * It has a custom implementation for performance + * reasons. + */ + + if (optval + sizeof(struct tcp_zerocopy_receive) > optval_end) + return 0; /* EPERM, bounds check */ + + if (((struct tcp_zerocopy_receive *)optval)->address != 0) + return 0; /* EPERM, unexpected data */ + + return 1; + } + if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) { if (optval + 1 > optval_end) return 0; /* EPERM, bounds check */ diff --git a/tools/testing/selftests/bpf/progs/test_check_mtu.c b/tools/testing/selftests/bpf/progs/test_check_mtu.c new file mode 100644 index 000000000000..b7787b43f9db --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_check_mtu.c @@ -0,0 +1,198 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Jesper Dangaard Brouer */ + +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <linux/if_ether.h> + +#include <stddef.h> +#include <stdint.h> + +char _license[] SEC("license") = "GPL"; + +/* Userspace will update with MTU it can see on device */ +static volatile const int GLOBAL_USER_MTU; +static volatile const __u32 GLOBAL_USER_IFINDEX; + +/* BPF-prog will update these with MTU values it can see */ +__u32 global_bpf_mtu_xdp = 0; +__u32 global_bpf_mtu_tc = 0; + +SEC("xdp") +int xdp_use_helper_basic(struct xdp_md *ctx) +{ + __u32 mtu_len = 0; + + if (bpf_check_mtu(ctx, 0, &mtu_len, 0, 0)) + return XDP_ABORTED; + + return XDP_PASS; +} + +SEC("xdp") +int xdp_use_helper(struct xdp_md *ctx) +{ + int retval = XDP_PASS; /* Expected retval on successful test */ + __u32 mtu_len = 0; + __u32 ifindex = 0; + int delta = 0; + + /* When ifindex is zero, save net_device lookup and use ctx netdev */ + if (GLOBAL_USER_IFINDEX > 0) + ifindex = GLOBAL_USER_IFINDEX; + + if (bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0)) { + /* mtu_len is also valid when check fail */ + retval = XDP_ABORTED; + goto out; + } + + if (mtu_len != GLOBAL_USER_MTU) + retval = XDP_DROP; + +out: + global_bpf_mtu_xdp = mtu_len; + return retval; +} + +SEC("xdp") +int xdp_exceed_mtu(struct xdp_md *ctx) +{ + void *data_end = (void *)(long)ctx->data_end; + void *data = (void *)(long)ctx->data; + __u32 ifindex = GLOBAL_USER_IFINDEX; + __u32 data_len = data_end - data; + int retval = XDP_ABORTED; /* Fail */ + __u32 mtu_len = 0; + int delta; + int err; + + /* Exceed MTU with 1 via delta adjust */ + delta = GLOBAL_USER_MTU - (data_len - ETH_HLEN) + 1; + + err = bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0); + if (err) { + retval = XDP_PASS; /* Success in exceeding MTU check */ + if (err != BPF_MTU_CHK_RET_FRAG_NEEDED) + retval = XDP_DROP; + } + + global_bpf_mtu_xdp = mtu_len; + return retval; +} + +SEC("xdp") +int xdp_minus_delta(struct xdp_md *ctx) +{ + int retval = XDP_PASS; /* Expected retval on successful test */ + void *data_end = (void *)(long)ctx->data_end; + void *data = (void *)(long)ctx->data; + __u32 ifindex = GLOBAL_USER_IFINDEX; + __u32 data_len = data_end - data; + __u32 mtu_len = 0; + int delta; + + /* Borderline test case: Minus delta exceeding packet length allowed */ + delta = -((data_len - ETH_HLEN) + 1); + + /* Minus length (adjusted via delta) still pass MTU check, other helpers + * are responsible for catching this, when doing actual size adjust + */ + if (bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0)) + retval = XDP_ABORTED; + + global_bpf_mtu_xdp = mtu_len; + return retval; +} + +SEC("classifier") +int tc_use_helper(struct __sk_buff *ctx) +{ + int retval = BPF_OK; /* Expected retval on successful test */ + __u32 mtu_len = 0; + int delta = 0; + + if (bpf_check_mtu(ctx, 0, &mtu_len, delta, 0)) { + retval = BPF_DROP; + goto out; + } + + if (mtu_len != GLOBAL_USER_MTU) + retval = BPF_REDIRECT; +out: + global_bpf_mtu_tc = mtu_len; + return retval; +} + +SEC("classifier") +int tc_exceed_mtu(struct __sk_buff *ctx) +{ + __u32 ifindex = GLOBAL_USER_IFINDEX; + int retval = BPF_DROP; /* Fail */ + __u32 skb_len = ctx->len; + __u32 mtu_len = 0; + int delta; + int err; + + /* Exceed MTU with 1 via delta adjust */ + delta = GLOBAL_USER_MTU - (skb_len - ETH_HLEN) + 1; + + err = bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0); + if (err) { + retval = BPF_OK; /* Success in exceeding MTU check */ + if (err != BPF_MTU_CHK_RET_FRAG_NEEDED) + retval = BPF_DROP; + } + + global_bpf_mtu_tc = mtu_len; + return retval; +} + +SEC("classifier") +int tc_exceed_mtu_da(struct __sk_buff *ctx) +{ + /* SKB Direct-Access variant */ + void *data_end = (void *)(long)ctx->data_end; + void *data = (void *)(long)ctx->data; + __u32 ifindex = GLOBAL_USER_IFINDEX; + __u32 data_len = data_end - data; + int retval = BPF_DROP; /* Fail */ + __u32 mtu_len = 0; + int delta; + int err; + + /* Exceed MTU with 1 via delta adjust */ + delta = GLOBAL_USER_MTU - (data_len - ETH_HLEN) + 1; + + err = bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0); + if (err) { + retval = BPF_OK; /* Success in exceeding MTU check */ + if (err != BPF_MTU_CHK_RET_FRAG_NEEDED) + retval = BPF_DROP; + } + + global_bpf_mtu_tc = mtu_len; + return retval; +} + +SEC("classifier") +int tc_minus_delta(struct __sk_buff *ctx) +{ + int retval = BPF_OK; /* Expected retval on successful test */ + __u32 ifindex = GLOBAL_USER_IFINDEX; + __u32 skb_len = ctx->len; + __u32 mtu_len = 0; + int delta; + + /* Borderline test case: Minus delta exceeding packet length allowed */ + delta = -((skb_len - ETH_HLEN) + 1); + + /* Minus length (adjusted via delta) still pass MTU check, other helpers + * are responsible for catching this, when doing actual size adjust + */ + if (bpf_check_mtu(ctx, ifindex, &mtu_len, delta, 0)) + retval = BPF_DROP; + + global_bpf_mtu_xdp = mtu_len; + return retval; +} diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect.c b/tools/testing/selftests/bpf/progs/test_cls_redirect.c index c9f8464996ea..3c1e042962e6 100644 --- a/tools/testing/selftests/bpf/progs/test_cls_redirect.c +++ b/tools/testing/selftests/bpf/progs/test_cls_redirect.c @@ -70,6 +70,7 @@ typedef struct { uint64_t errors_total_encap_adjust_failed; uint64_t errors_total_encap_buffer_too_small; uint64_t errors_total_redirect_loop; + uint64_t errors_total_encap_mtu_violate; } metrics_t; typedef enum { @@ -407,6 +408,7 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr); int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead; uint16_t proto = ETH_P_IP; + uint32_t mtu_len = 0; /* Loop protection: the inner packet's TTL is decremented as a safeguard * against any forwarding loop. As the only interesting field is the TTL @@ -479,6 +481,11 @@ static INLINING ret_t forward_with_gre(struct __sk_buff *skb, encap_headers_t *e } } + if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) { + metrics->errors_total_encap_mtu_violate++; + return TC_ACT_SHOT; + } + if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET, BPF_F_ADJ_ROOM_FIXED_GSO | BPF_F_ADJ_ROOM_NO_CSUM_RESET) || diff --git a/tools/testing/selftests/bpf/progs/test_global_func10.c b/tools/testing/selftests/bpf/progs/test_global_func10.c new file mode 100644 index 000000000000..61c2ae92ce41 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func10.c @@ -0,0 +1,29 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <stddef.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +struct Small { + int x; +}; + +struct Big { + int x; + int y; +}; + +__noinline int foo(const struct Big *big) +{ + if (big == 0) + return 0; + + return bpf_get_prandom_u32() < big->y; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + const struct Small small = {.x = skb->len }; + + return foo((struct Big *)&small) ? 1 : 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_global_func11.c b/tools/testing/selftests/bpf/progs/test_global_func11.c new file mode 100644 index 000000000000..28488047c849 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func11.c @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <stddef.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +struct S { + int x; +}; + +__noinline int foo(const struct S *s) +{ + return s ? bpf_get_prandom_u32() < s->x : 0; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + return foo(skb); +} diff --git a/tools/testing/selftests/bpf/progs/test_global_func12.c b/tools/testing/selftests/bpf/progs/test_global_func12.c new file mode 100644 index 000000000000..62343527cc59 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func12.c @@ -0,0 +1,21 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <stddef.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +struct S { + int x; +}; + +__noinline int foo(const struct S *s) +{ + return bpf_get_prandom_u32() < s->x; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + const struct S s = {.x = skb->len }; + + return foo(&s); +} diff --git a/tools/testing/selftests/bpf/progs/test_global_func13.c b/tools/testing/selftests/bpf/progs/test_global_func13.c new file mode 100644 index 000000000000..ff8897c1ac22 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func13.c @@ -0,0 +1,24 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <stddef.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +struct S { + int x; +}; + +__noinline int foo(const struct S *s) +{ + if (s) + return bpf_get_prandom_u32() < s->x; + + return 0; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + const struct S *s = (const struct S *)(0xbedabeda); + + return foo(s); +} diff --git a/tools/testing/selftests/bpf/progs/test_global_func14.c b/tools/testing/selftests/bpf/progs/test_global_func14.c new file mode 100644 index 000000000000..698c77199ebf --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func14.c @@ -0,0 +1,21 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <stddef.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +struct S; + +__noinline int foo(const struct S *s) +{ + if (s) + return bpf_get_prandom_u32() < *(const int *) s; + + return 0; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + + return foo(NULL); +} diff --git a/tools/testing/selftests/bpf/progs/test_global_func15.c b/tools/testing/selftests/bpf/progs/test_global_func15.c new file mode 100644 index 000000000000..c19c435988d5 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func15.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <stddef.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +__noinline int foo(unsigned int *v) +{ + if (v) + *v = bpf_get_prandom_u32(); + + return 0; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + unsigned int v = 1; + + foo(&v); + + return v; +} diff --git a/tools/testing/selftests/bpf/progs/test_global_func16.c b/tools/testing/selftests/bpf/progs/test_global_func16.c new file mode 100644 index 000000000000..0312d1e8d8c0 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func16.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <stddef.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +__noinline int foo(int (*arr)[10]) +{ + if (arr) + return (*arr)[9]; + + return 0; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + int array[10]; + + const int rv = foo(&array); + + return rv ? 1 : 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_global_func9.c b/tools/testing/selftests/bpf/progs/test_global_func9.c new file mode 100644 index 000000000000..bd233ddede98 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func9.c @@ -0,0 +1,132 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <stddef.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +struct S { + int x; +}; + +struct C { + int x; + int y; +}; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, struct S); +} map SEC(".maps"); + +enum E { + E_ITEM +}; + +static int global_data_x = 100; +static int volatile global_data_y = 500; + +__noinline int foo(const struct S *s) +{ + if (s) + return bpf_get_prandom_u32() < s->x; + + return 0; +} + +__noinline int bar(int *x) +{ + if (x) + *x &= bpf_get_prandom_u32(); + + return 0; +} +__noinline int baz(volatile int *x) +{ + if (x) + *x &= bpf_get_prandom_u32(); + + return 0; +} + +__noinline int qux(enum E *e) +{ + if (e) + return *e; + + return 0; +} + +__noinline int quux(int (*arr)[10]) +{ + if (arr) + return (*arr)[9]; + + return 0; +} + +__noinline int quuz(int **p) +{ + if (p) + *p = NULL; + + return 0; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + int result = 0; + + { + const struct S s = {.x = skb->len }; + + result |= foo(&s); + } + + { + const __u32 key = 1; + const struct S *s = bpf_map_lookup_elem(&map, &key); + + result |= foo(s); + } + + { + const struct C c = {.x = skb->len, .y = skb->family }; + + result |= foo((const struct S *)&c); + } + + { + result |= foo(NULL); + } + + { + bar(&result); + bar(&global_data_x); + } + + { + result |= baz(&global_data_y); + } + + { + enum E e = E_ITEM; + + result |= qux(&e); + } + + { + int array[10] = {0}; + + result |= quux(&array); + } + + { + int *p; + + result |= quuz(&p); + } + + return result ? 1 : 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_global_func_args.c b/tools/testing/selftests/bpf/progs/test_global_func_args.c new file mode 100644 index 000000000000..cae309538a9e --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_global_func_args.c @@ -0,0 +1,91 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/bpf.h> + +#include <bpf/bpf_helpers.h> + +struct S { + int v; +}; + +static volatile struct S global_variable; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 7); + __type(key, __u32); + __type(value, int); +} values SEC(".maps"); + +static void save_value(__u32 index, int value) +{ + bpf_map_update_elem(&values, &index, &value, 0); +} + +__noinline int foo(__u32 index, struct S *s) +{ + if (s) { + save_value(index, s->v); + return ++s->v; + } + + save_value(index, 0); + + return 1; +} + +__noinline int bar(__u32 index, volatile struct S *s) +{ + if (s) { + save_value(index, s->v); + return ++s->v; + } + + save_value(index, 0); + + return 1; +} + +__noinline int baz(struct S **s) +{ + if (s) + *s = 0; + + return 0; +} + +SEC("cgroup_skb/ingress") +int test_cls(struct __sk_buff *skb) +{ + __u32 index = 0; + + { + const int v = foo(index++, 0); + + save_value(index++, v); + } + + { + struct S s = { .v = 100 }; + + foo(index++, &s); + save_value(index++, s.v); + } + + { + global_variable.v = 42; + bar(index++, &global_variable); + save_value(index++, global_variable.v); + } + + { + struct S v, *p = &v; + + baz(&p); + save_value(index++, !p); + } + + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_module_attach.c b/tools/testing/selftests/bpf/progs/test_module_attach.c index efd1e287ac17..bd37ceec5587 100644 --- a/tools/testing/selftests/bpf/progs/test_module_attach.c +++ b/tools/testing/selftests/bpf/progs/test_module_attach.c @@ -17,6 +17,16 @@ int BPF_PROG(handle_raw_tp, return 0; } +__u32 raw_tp_bare_write_sz = 0; + +SEC("raw_tp/bpf_testmod_test_write_bare") +int BPF_PROG(handle_raw_tp_bare, + struct task_struct *task, struct bpf_testmod_test_write_ctx *write_ctx) +{ + raw_tp_bare_write_sz = BPF_CORE_READ(write_ctx, len); + return 0; +} + __u32 tp_btf_read_sz = 0; SEC("tp_btf/bpf_testmod_test_read") diff --git a/tools/testing/selftests/bpf/progs/test_ns_current_pid_tgid.c b/tools/testing/selftests/bpf/progs/test_ns_current_pid_tgid.c index 1dca70a6de2f..0763d49f9c42 100644 --- a/tools/testing/selftests/bpf/progs/test_ns_current_pid_tgid.c +++ b/tools/testing/selftests/bpf/progs/test_ns_current_pid_tgid.c @@ -5,31 +5,21 @@ #include <stdint.h> #include <bpf/bpf_helpers.h> -static volatile struct { - __u64 dev; - __u64 ino; - __u64 pid_tgid; - __u64 user_pid_tgid; -} res; +__u64 user_pid = 0; +__u64 user_tgid = 0; +__u64 dev = 0; +__u64 ino = 0; -SEC("raw_tracepoint/sys_enter") -int trace(void *ctx) +SEC("tracepoint/syscalls/sys_enter_nanosleep") +int handler(const void *ctx) { - __u64 ns_pid_tgid, expected_pid; struct bpf_pidns_info nsdata; - __u32 key = 0; - if (bpf_get_ns_current_pid_tgid(res.dev, res.ino, &nsdata, - sizeof(struct bpf_pidns_info))) + if (bpf_get_ns_current_pid_tgid(dev, ino, &nsdata, sizeof(struct bpf_pidns_info))) return 0; - ns_pid_tgid = (__u64)nsdata.tgid << 32 | nsdata.pid; - expected_pid = res.user_pid_tgid; - - if (expected_pid != ns_pid_tgid) - return 0; - - res.pid_tgid = ns_pid_tgid; + user_pid = nsdata.pid; + user_tgid = nsdata.tgid; return 0; } diff --git a/tools/testing/selftests/bpf/progs/test_stack_var_off.c b/tools/testing/selftests/bpf/progs/test_stack_var_off.c new file mode 100644 index 000000000000..665e6ae09d37 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_stack_var_off.c @@ -0,0 +1,51 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +int probe_res; + +char input[4] = {}; +int test_pid; + +SEC("tracepoint/syscalls/sys_enter_nanosleep") +int probe(void *ctx) +{ + /* This BPF program performs variable-offset reads and writes on a + * stack-allocated buffer. + */ + char stack_buf[16]; + unsigned long len; + unsigned long last; + + if ((bpf_get_current_pid_tgid() >> 32) != test_pid) + return 0; + + /* Copy the input to the stack. */ + __builtin_memcpy(stack_buf, input, 4); + + /* The first byte in the buffer indicates the length. */ + len = stack_buf[0] & 0xf; + last = (len - 1) & 0xf; + + /* Append something to the buffer. The offset where we write is not + * statically known; this is a variable-offset stack write. + */ + stack_buf[len] = 42; + + /* Index into the buffer at an unknown offset. This is a + * variable-offset stack read. + * + * Note that if it wasn't for the preceding variable-offset write, this + * read would be rejected because the stack slot cannot be verified as + * being initialized. With the preceding variable-offset write, the + * stack slot still cannot be verified, but the write inhibits the + * respective check on the reasoning that, if there was a + * variable-offset to a higher-or-equal spot, we're probably reading + * what we just wrote. + */ + probe_res = stack_buf[last]; + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/test_current_pid_tgid_new_ns.c b/tools/testing/selftests/bpf/test_current_pid_tgid_new_ns.c deleted file mode 100644 index ec53b1ef90d2..000000000000 --- a/tools/testing/selftests/bpf/test_current_pid_tgid_new_ns.c +++ /dev/null @@ -1,160 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* Copyright (c) 2020 Carlos Neira cneirabustos@gmail.com */ -#define _GNU_SOURCE -#include <sys/stat.h> -#include <sys/types.h> -#include <unistd.h> -#include <sys/syscall.h> -#include <sched.h> -#include <sys/wait.h> -#include <sys/mount.h> -#include "test_progs.h" - -#define CHECK_NEWNS(condition, tag, format...) ({ \ - int __ret = !!(condition); \ - if (__ret) { \ - printf("%s:FAIL:%s ", __func__, tag); \ - printf(format); \ - } else { \ - printf("%s:PASS:%s\n", __func__, tag); \ - } \ - __ret; \ -}) - -struct bss { - __u64 dev; - __u64 ino; - __u64 pid_tgid; - __u64 user_pid_tgid; -}; - -int main(int argc, char **argv) -{ - pid_t pid; - int exit_code = 1; - struct stat st; - - printf("Testing bpf_get_ns_current_pid_tgid helper in new ns\n"); - - if (stat("/proc/self/ns/pid", &st)) { - perror("stat failed on /proc/self/ns/pid ns\n"); - printf("%s:FAILED\n", argv[0]); - return exit_code; - } - - if (CHECK_NEWNS(unshare(CLONE_NEWPID | CLONE_NEWNS), - "unshare CLONE_NEWPID | CLONE_NEWNS", "error errno=%d\n", errno)) - return exit_code; - - pid = fork(); - if (pid == -1) { - perror("Fork() failed\n"); - printf("%s:FAILED\n", argv[0]); - return exit_code; - } - - if (pid > 0) { - int status; - - usleep(5); - waitpid(pid, &status, 0); - return 0; - } else { - - pid = fork(); - if (pid == -1) { - perror("Fork() failed\n"); - printf("%s:FAILED\n", argv[0]); - return exit_code; - } - - if (pid > 0) { - int status; - waitpid(pid, &status, 0); - return 0; - } else { - if (CHECK_NEWNS(mount("none", "/proc", NULL, MS_PRIVATE|MS_REC, NULL), - "Unmounting proc", "Cannot umount proc! errno=%d\n", errno)) - return exit_code; - - if (CHECK_NEWNS(mount("proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, NULL), - "Mounting proc", "Cannot mount proc! errno=%d\n", errno)) - return exit_code; - - const char *probe_name = "raw_tracepoint/sys_enter"; - const char *file = "test_ns_current_pid_tgid.o"; - struct bpf_link *link = NULL; - struct bpf_program *prog; - struct bpf_map *bss_map; - struct bpf_object *obj; - int exit_code = 1; - int err, key = 0; - struct bss bss; - struct stat st; - __u64 id; - - obj = bpf_object__open_file(file, NULL); - if (CHECK_NEWNS(IS_ERR(obj), "obj_open", "err %ld\n", PTR_ERR(obj))) - return exit_code; - - err = bpf_object__load(obj); - if (CHECK_NEWNS(err, "obj_load", "err %d errno %d\n", err, errno)) - goto cleanup; - - bss_map = bpf_object__find_map_by_name(obj, "test_ns_.bss"); - if (CHECK_NEWNS(!bss_map, "find_bss_map", "failed\n")) - goto cleanup; - - prog = bpf_object__find_program_by_title(obj, probe_name); - if (CHECK_NEWNS(!prog, "find_prog", "prog '%s' not found\n", - probe_name)) - goto cleanup; - - memset(&bss, 0, sizeof(bss)); - pid_t tid = syscall(SYS_gettid); - pid_t pid = getpid(); - - id = (__u64) tid << 32 | pid; - bss.user_pid_tgid = id; - - if (CHECK_NEWNS(stat("/proc/self/ns/pid", &st), - "stat new ns", "Failed to stat /proc/self/ns/pid errno=%d\n", errno)) - goto cleanup; - - bss.dev = st.st_dev; - bss.ino = st.st_ino; - - err = bpf_map_update_elem(bpf_map__fd(bss_map), &key, &bss, 0); - if (CHECK_NEWNS(err, "setting_bss", "failed to set bss : %d\n", err)) - goto cleanup; - - link = bpf_program__attach_raw_tracepoint(prog, "sys_enter"); - if (CHECK_NEWNS(IS_ERR(link), "attach_raw_tp", "err %ld\n", - PTR_ERR(link))) { - link = NULL; - goto cleanup; - } - - /* trigger some syscalls */ - usleep(1); - - err = bpf_map_lookup_elem(bpf_map__fd(bss_map), &key, &bss); - if (CHECK_NEWNS(err, "set_bss", "failed to get bss : %d\n", err)) - goto cleanup; - - if (CHECK_NEWNS(id != bss.pid_tgid, "Compare user pid/tgid vs. bpf pid/tgid", - "User pid/tgid %llu BPF pid/tgid %llu\n", id, bss.pid_tgid)) - goto cleanup; - - exit_code = 0; - printf("%s:PASS\n", argv[0]); -cleanup: - if (!link) { - bpf_link__destroy(link); - link = NULL; - } - bpf_object__close(obj); - } - } - return 0; -} diff --git a/tools/testing/selftests/bpf/test_flow_dissector.c b/tools/testing/selftests/bpf/test_flow_dissector.c index 01f0c634d548..571cc076dd7d 100644 --- a/tools/testing/selftests/bpf/test_flow_dissector.c +++ b/tools/testing/selftests/bpf/test_flow_dissector.c @@ -503,7 +503,7 @@ static int do_rx(int fd) if (rbuf != cfg_payload_char) error(1, 0, "recv: payload mismatch"); num++; - }; + } return num; } diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c index 213628ee721c..6396932b97e2 100644 --- a/tools/testing/selftests/bpf/test_progs.c +++ b/tools/testing/selftests/bpf/test_progs.c @@ -390,7 +390,7 @@ static void unload_bpf_testmod(void) return; } fprintf(env.stderr, "Failed to unload bpf_testmod.ko from kernel: %d\n", -errno); - exit(1); + return; } if (env.verbosity > VERBOSE_NONE) fprintf(stdout, "Successfully unloaded bpf_testmod.ko.\n"); diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h index e49e2fdde942..f7c2fd89d01a 100644 --- a/tools/testing/selftests/bpf/test_progs.h +++ b/tools/testing/selftests/bpf/test_progs.h @@ -16,7 +16,6 @@ typedef __u16 __sum16; #include <linux/if_packet.h> #include <linux/ip.h> #include <linux/ipv6.h> -#include <netinet/tcp.h> #include <linux/filter.h> #include <linux/perf_event.h> #include <linux/socket.h> diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c index dcb83ab02919..aa3f185fcb89 100644 --- a/tools/testing/selftests/bpf/test_sock_addr.c +++ b/tools/testing/selftests/bpf/test_sock_addr.c @@ -31,6 +31,8 @@ #define CONNECT6_PROG_PATH "./connect6_prog.o" #define SENDMSG4_PROG_PATH "./sendmsg4_prog.o" #define SENDMSG6_PROG_PATH "./sendmsg6_prog.o" +#define RECVMSG4_PROG_PATH "./recvmsg4_prog.o" +#define RECVMSG6_PROG_PATH "./recvmsg6_prog.o" #define BIND4_PROG_PATH "./bind4_prog.o" #define BIND6_PROG_PATH "./bind6_prog.o" @@ -94,10 +96,10 @@ static int sendmsg_deny_prog_load(const struct sock_addr_test *test); static int recvmsg_allow_prog_load(const struct sock_addr_test *test); static int recvmsg_deny_prog_load(const struct sock_addr_test *test); static int sendmsg4_rw_asm_prog_load(const struct sock_addr_test *test); -static int recvmsg4_rw_asm_prog_load(const struct sock_addr_test *test); +static int recvmsg4_rw_c_prog_load(const struct sock_addr_test *test); static int sendmsg4_rw_c_prog_load(const struct sock_addr_test *test); static int sendmsg6_rw_asm_prog_load(const struct sock_addr_test *test); -static int recvmsg6_rw_asm_prog_load(const struct sock_addr_test *test); +static int recvmsg6_rw_c_prog_load(const struct sock_addr_test *test); static int sendmsg6_rw_c_prog_load(const struct sock_addr_test *test); static int sendmsg6_rw_v4mapped_prog_load(const struct sock_addr_test *test); static int sendmsg6_rw_wildcard_prog_load(const struct sock_addr_test *test); @@ -573,8 +575,8 @@ static struct sock_addr_test tests[] = { LOAD_REJECT, }, { - "recvmsg4: rewrite IP & port (asm)", - recvmsg4_rw_asm_prog_load, + "recvmsg4: rewrite IP & port (C)", + recvmsg4_rw_c_prog_load, BPF_CGROUP_UDP4_RECVMSG, BPF_CGROUP_UDP4_RECVMSG, AF_INET, @@ -587,8 +589,8 @@ static struct sock_addr_test tests[] = { SUCCESS, }, { - "recvmsg6: rewrite IP & port (asm)", - recvmsg6_rw_asm_prog_load, + "recvmsg6: rewrite IP & port (C)", + recvmsg6_rw_c_prog_load, BPF_CGROUP_UDP6_RECVMSG, BPF_CGROUP_UDP6_RECVMSG, AF_INET6, @@ -786,45 +788,9 @@ static int sendmsg4_rw_asm_prog_load(const struct sock_addr_test *test) return load_insns(test, insns, sizeof(insns) / sizeof(struct bpf_insn)); } -static int recvmsg4_rw_asm_prog_load(const struct sock_addr_test *test) +static int recvmsg4_rw_c_prog_load(const struct sock_addr_test *test) { - struct sockaddr_in src4_rw_addr; - - if (mk_sockaddr(AF_INET, SERV4_IP, SERV4_PORT, - (struct sockaddr *)&src4_rw_addr, - sizeof(src4_rw_addr)) == -1) - return -1; - - struct bpf_insn insns[] = { - BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), - - /* if (sk.family == AF_INET && */ - BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6, - offsetof(struct bpf_sock_addr, family)), - BPF_JMP_IMM(BPF_JNE, BPF_REG_7, AF_INET, 6), - - /* sk.type == SOCK_DGRAM) { */ - BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6, - offsetof(struct bpf_sock_addr, type)), - BPF_JMP_IMM(BPF_JNE, BPF_REG_7, SOCK_DGRAM, 4), - - /* user_ip4 = src4_rw_addr.sin_addr */ - BPF_MOV32_IMM(BPF_REG_7, src4_rw_addr.sin_addr.s_addr), - BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7, - offsetof(struct bpf_sock_addr, user_ip4)), - - /* user_port = src4_rw_addr.sin_port */ - BPF_MOV32_IMM(BPF_REG_7, src4_rw_addr.sin_port), - BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7, - offsetof(struct bpf_sock_addr, user_port)), - /* } */ - - /* return 1 */ - BPF_MOV64_IMM(BPF_REG_0, 1), - BPF_EXIT_INSN(), - }; - - return load_insns(test, insns, sizeof(insns) / sizeof(struct bpf_insn)); + return load_path(test, RECVMSG4_PROG_PATH); } static int sendmsg4_rw_c_prog_load(const struct sock_addr_test *test) @@ -890,37 +856,9 @@ static int sendmsg6_rw_asm_prog_load(const struct sock_addr_test *test) return sendmsg6_rw_dst_asm_prog_load(test, SERV6_REWRITE_IP); } -static int recvmsg6_rw_asm_prog_load(const struct sock_addr_test *test) +static int recvmsg6_rw_c_prog_load(const struct sock_addr_test *test) { - struct sockaddr_in6 src6_rw_addr; - - if (mk_sockaddr(AF_INET6, SERV6_IP, SERV6_PORT, - (struct sockaddr *)&src6_rw_addr, - sizeof(src6_rw_addr)) == -1) - return -1; - - struct bpf_insn insns[] = { - BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), - - /* if (sk.family == AF_INET6) { */ - BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6, - offsetof(struct bpf_sock_addr, family)), - BPF_JMP_IMM(BPF_JNE, BPF_REG_7, AF_INET6, 10), - - STORE_IPV6(user_ip6, src6_rw_addr.sin6_addr.s6_addr32), - - /* user_port = dst6_rw_addr.sin6_port */ - BPF_MOV32_IMM(BPF_REG_7, src6_rw_addr.sin6_port), - BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7, - offsetof(struct bpf_sock_addr, user_port)), - /* } */ - - /* return 1 */ - BPF_MOV64_IMM(BPF_REG_0, 1), - BPF_EXIT_INSN(), - }; - - return load_insns(test, insns, sizeof(insns) / sizeof(struct bpf_insn)); + return load_path(test, RECVMSG6_PROG_PATH); } static int sendmsg6_rw_v4mapped_prog_load(const struct sock_addr_test *test) diff --git a/tools/testing/selftests/bpf/test_socket_cookie.c b/tools/testing/selftests/bpf/test_socket_cookie.c deleted file mode 100644 index ca7ca87e91aa..000000000000 --- a/tools/testing/selftests/bpf/test_socket_cookie.c +++ /dev/null @@ -1,208 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -// Copyright (c) 2018 Facebook - -#include <string.h> -#include <unistd.h> - -#include <arpa/inet.h> -#include <netinet/in.h> -#include <sys/types.h> -#include <sys/socket.h> - -#include <bpf/bpf.h> -#include <bpf/libbpf.h> - -#include "bpf_rlimit.h" -#include "cgroup_helpers.h" - -#define CG_PATH "/foo" -#define SOCKET_COOKIE_PROG "./socket_cookie_prog.o" - -struct socket_cookie { - __u64 cookie_key; - __u32 cookie_value; -}; - -static int start_server(void) -{ - struct sockaddr_in6 addr; - int fd; - - fd = socket(AF_INET6, SOCK_STREAM, 0); - if (fd == -1) { - log_err("Failed to create server socket"); - goto out; - } - - memset(&addr, 0, sizeof(addr)); - addr.sin6_family = AF_INET6; - addr.sin6_addr = in6addr_loopback; - addr.sin6_port = 0; - - if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) == -1) { - log_err("Failed to bind server socket"); - goto close_out; - } - - if (listen(fd, 128) == -1) { - log_err("Failed to listen on server socket"); - goto close_out; - } - - goto out; - -close_out: - close(fd); - fd = -1; -out: - return fd; -} - -static int connect_to_server(int server_fd) -{ - struct sockaddr_storage addr; - socklen_t len = sizeof(addr); - int fd; - - fd = socket(AF_INET6, SOCK_STREAM, 0); - if (fd == -1) { - log_err("Failed to create client socket"); - goto out; - } - - if (getsockname(server_fd, (struct sockaddr *)&addr, &len)) { - log_err("Failed to get server addr"); - goto close_out; - } - - if (connect(fd, (const struct sockaddr *)&addr, len) == -1) { - log_err("Fail to connect to server"); - goto close_out; - } - - goto out; - -close_out: - close(fd); - fd = -1; -out: - return fd; -} - -static int validate_map(struct bpf_map *map, int client_fd) -{ - __u32 cookie_expected_value; - struct sockaddr_in6 addr; - socklen_t len = sizeof(addr); - struct socket_cookie val; - int err = 0; - int map_fd; - - if (!map) { - log_err("Map not found in BPF object"); - goto err; - } - - map_fd = bpf_map__fd(map); - - err = bpf_map_lookup_elem(map_fd, &client_fd, &val); - - err = getsockname(client_fd, (struct sockaddr *)&addr, &len); - if (err) { - log_err("Can't get client local addr"); - goto out; - } - - cookie_expected_value = (ntohs(addr.sin6_port) << 8) | 0xFF; - if (val.cookie_value != cookie_expected_value) { - log_err("Unexpected value in map: %x != %x", val.cookie_value, - cookie_expected_value); - goto err; - } - - goto out; -err: - err = -1; -out: - return err; -} - -static int run_test(int cgfd) -{ - enum bpf_attach_type attach_type; - struct bpf_prog_load_attr attr; - struct bpf_program *prog; - struct bpf_object *pobj; - const char *prog_name; - int server_fd = -1; - int client_fd = -1; - int prog_fd = -1; - int err = 0; - - memset(&attr, 0, sizeof(attr)); - attr.file = SOCKET_COOKIE_PROG; - attr.prog_type = BPF_PROG_TYPE_UNSPEC; - attr.prog_flags = BPF_F_TEST_RND_HI32; - - err = bpf_prog_load_xattr(&attr, &pobj, &prog_fd); - if (err) { - log_err("Failed to load %s", attr.file); - goto out; - } - - bpf_object__for_each_program(prog, pobj) { - prog_name = bpf_program__section_name(prog); - - if (libbpf_attach_type_by_name(prog_name, &attach_type)) - goto err; - - err = bpf_prog_attach(bpf_program__fd(prog), cgfd, attach_type, - BPF_F_ALLOW_OVERRIDE); - if (err) { - log_err("Failed to attach prog %s", prog_name); - goto out; - } - } - - server_fd = start_server(); - if (server_fd == -1) - goto err; - - client_fd = connect_to_server(server_fd); - if (client_fd == -1) - goto err; - - if (validate_map(bpf_map__next(NULL, pobj), client_fd)) - goto err; - - goto out; -err: - err = -1; -out: - close(client_fd); - close(server_fd); - bpf_object__close(pobj); - printf("%s\n", err ? "FAILED" : "PASSED"); - return err; -} - -int main(int argc, char **argv) -{ - int cgfd = -1; - int err = 0; - - cgfd = cgroup_setup_and_join(CG_PATH); - if (cgfd < 0) - goto err; - - if (run_test(cgfd)) - goto err; - - goto out; -err: - err = -1; -out: - close(cgfd); - cleanup_cgroup_environment(); - return err; -} diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index f8569f04064b..58b5a349d3ba 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -88,6 +88,10 @@ struct bpf_test { int fixup_map_event_output[MAX_FIXUPS]; int fixup_map_reuseport_array[MAX_FIXUPS]; int fixup_map_ringbuf[MAX_FIXUPS]; + /* Expected verifier log output for result REJECT or VERBOSE_ACCEPT. + * Can be a tab-separated sequence of expected strings. An empty string + * means no log verification. + */ const char *errstr; const char *errstr_unpriv; uint32_t insn_processed; @@ -297,6 +301,78 @@ static void bpf_fill_scale(struct bpf_test *self) } } +static int bpf_fill_torturous_jumps_insn_1(struct bpf_insn *insn) +{ + unsigned int len = 259, hlen = 128; + int i; + + insn[0] = BPF_EMIT_CALL(BPF_FUNC_get_prandom_u32); + for (i = 1; i <= hlen; i++) { + insn[i] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, i, hlen); + insn[i + hlen] = BPF_JMP_A(hlen - i); + } + insn[len - 2] = BPF_MOV64_IMM(BPF_REG_0, 1); + insn[len - 1] = BPF_EXIT_INSN(); + + return len; +} + +static int bpf_fill_torturous_jumps_insn_2(struct bpf_insn *insn) +{ + unsigned int len = 4100, jmp_off = 2048; + int i, j; + + insn[0] = BPF_EMIT_CALL(BPF_FUNC_get_prandom_u32); + for (i = 1; i <= jmp_off; i++) { + insn[i] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, i, jmp_off); + } + insn[i++] = BPF_JMP_A(jmp_off); + for (; i <= jmp_off * 2 + 1; i+=16) { + for (j = 0; j < 16; j++) { + insn[i + j] = BPF_JMP_A(16 - j - 1); + } + } + + insn[len - 2] = BPF_MOV64_IMM(BPF_REG_0, 2); + insn[len - 1] = BPF_EXIT_INSN(); + + return len; +} + +static void bpf_fill_torturous_jumps(struct bpf_test *self) +{ + struct bpf_insn *insn = self->fill_insns; + int i = 0; + + switch (self->retval) { + case 1: + self->prog_len = bpf_fill_torturous_jumps_insn_1(insn); + return; + case 2: + self->prog_len = bpf_fill_torturous_jumps_insn_2(insn); + return; + case 3: + /* main */ + insn[i++] = BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 4); + insn[i++] = BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 262); + insn[i++] = BPF_ST_MEM(BPF_B, BPF_REG_10, -32, 0); + insn[i++] = BPF_MOV64_IMM(BPF_REG_0, 3); + insn[i++] = BPF_EXIT_INSN(); + + /* subprog 1 */ + i += bpf_fill_torturous_jumps_insn_1(insn + i); + + /* subprog 2 */ + i += bpf_fill_torturous_jumps_insn_2(insn + i); + + self->prog_len = i; + return; + default: + self->prog_len = 0; + break; + } +} + /* BPF_SK_LOOKUP contains 13 instructions, if you need to fix up maps */ #define BPF_SK_LOOKUP(func) \ /* struct bpf_sock_tuple tuple = {} */ \ @@ -923,13 +999,19 @@ static int do_prog_test_run(int fd_prog, bool unpriv, uint32_t expected_val, return 0; } +/* Returns true if every part of exp (tab-separated) appears in log, in order. + * + * If exp is an empty string, returns true. + */ static bool cmp_str_seq(const char *log, const char *exp) { - char needle[80]; + char needle[200]; const char *p, *q; int len; do { + if (!strlen(exp)) + break; p = strchr(exp, '\t'); if (!p) p = exp + strlen(exp); @@ -943,7 +1025,7 @@ static bool cmp_str_seq(const char *log, const char *exp) needle[len] = 0; q = strstr(log, needle); if (!q) { - printf("FAIL\nUnexpected verifier log in successful load!\n" + printf("FAIL\nUnexpected verifier log!\n" "EXP: %s\nRES:\n", needle); return false; } @@ -1058,7 +1140,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv, printf("FAIL\nUnexpected success to load!\n"); goto fail_log; } - if (!expected_err || !strstr(bpf_vlog, expected_err)) { + if (!expected_err || !cmp_str_seq(bpf_vlog, expected_err)) { printf("FAIL\nUnexpected error message!\n\tEXP: %s\n\tRES: %s\n", expected_err, bpf_vlog); goto fail_log; diff --git a/tools/testing/selftests/bpf/verifier/atomic_and.c b/tools/testing/selftests/bpf/verifier/atomic_and.c index 600bc5e0f143..1bdc8e6684f7 100644 --- a/tools/testing/selftests/bpf/verifier/atomic_and.c +++ b/tools/testing/selftests/bpf/verifier/atomic_and.c @@ -7,7 +7,7 @@ BPF_MOV64_IMM(BPF_REG_1, 0x011), BPF_ATOMIC_OP(BPF_DW, BPF_AND, BPF_REG_10, BPF_REG_1, -8), /* if (val != 0x010) exit(2); */ - BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_10, -8), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8), BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0x010, 2), BPF_MOV64_IMM(BPF_REG_0, 2), BPF_EXIT_INSN(), diff --git a/tools/testing/selftests/bpf/verifier/atomic_bounds.c b/tools/testing/selftests/bpf/verifier/atomic_bounds.c new file mode 100644 index 000000000000..e82183e4914f --- /dev/null +++ b/tools/testing/selftests/bpf/verifier/atomic_bounds.c @@ -0,0 +1,27 @@ +{ + "BPF_ATOMIC bounds propagation, mem->reg", + .insns = { + /* a = 0; */ + /* + * Note this is implemented with two separate instructions, + * where you might think one would suffice: + * + * BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + * + * This is because BPF_ST_MEM doesn't seem to set the stack slot + * type to 0 when storing an immediate. + */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8), + /* b = atomic_fetch_add(&a, 1); */ + BPF_MOV64_IMM(BPF_REG_1, 1), + BPF_ATOMIC_OP(BPF_DW, BPF_ADD | BPF_FETCH, BPF_REG_10, BPF_REG_1, -8), + /* Verifier should be able to tell that this infinite loop isn't reachable. */ + /* if (b) while (true) continue; */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, -1), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = REJECT, + .errstr_unpriv = "back-edge", +}, diff --git a/tools/testing/selftests/bpf/verifier/atomic_or.c b/tools/testing/selftests/bpf/verifier/atomic_or.c index ebe6e51455ba..70f982e1f9f0 100644 --- a/tools/testing/selftests/bpf/verifier/atomic_or.c +++ b/tools/testing/selftests/bpf/verifier/atomic_or.c @@ -7,7 +7,7 @@ BPF_MOV64_IMM(BPF_REG_1, 0x011), BPF_ATOMIC_OP(BPF_DW, BPF_OR, BPF_REG_10, BPF_REG_1, -8), /* if (val != 0x111) exit(2); */ - BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_10, -8), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8), BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0x111, 2), BPF_MOV64_IMM(BPF_REG_0, 2), BPF_EXIT_INSN(), diff --git a/tools/testing/selftests/bpf/verifier/atomic_xor.c b/tools/testing/selftests/bpf/verifier/atomic_xor.c index eb791e547b47..74e8fb46694b 100644 --- a/tools/testing/selftests/bpf/verifier/atomic_xor.c +++ b/tools/testing/selftests/bpf/verifier/atomic_xor.c @@ -7,7 +7,7 @@ BPF_MOV64_IMM(BPF_REG_1, 0x011), BPF_ATOMIC_OP(BPF_DW, BPF_XOR, BPF_REG_10, BPF_REG_1, -8), /* if (val != 0x101) exit(2); */ - BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_10, -8), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8), BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0x101, 2), BPF_MOV64_IMM(BPF_REG_0, 2), BPF_EXIT_INSN(), diff --git a/tools/testing/selftests/bpf/verifier/basic_stack.c b/tools/testing/selftests/bpf/verifier/basic_stack.c index b56f8117c09d..f995777dddb3 100644 --- a/tools/testing/selftests/bpf/verifier/basic_stack.c +++ b/tools/testing/selftests/bpf/verifier/basic_stack.c @@ -4,7 +4,7 @@ BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), BPF_EXIT_INSN(), }, - .errstr = "invalid stack", + .errstr = "invalid write to stack", .result = REJECT, }, { diff --git a/tools/testing/selftests/bpf/verifier/calls.c b/tools/testing/selftests/bpf/verifier/calls.c index c4f5d909e58a..eb888c8479c3 100644 --- a/tools/testing/selftests/bpf/verifier/calls.c +++ b/tools/testing/selftests/bpf/verifier/calls.c @@ -1228,7 +1228,7 @@ .prog_type = BPF_PROG_TYPE_XDP, .fixup_map_hash_8b = { 23 }, .result = REJECT, - .errstr = "invalid read from stack off -16+0 size 8", + .errstr = "invalid read from stack R7 off=-16 size=8", }, { "calls: two calls that receive map_value via arg=ptr_stack_of_caller. test1", @@ -1958,7 +1958,7 @@ BPF_EXIT_INSN(), }, .fixup_map_hash_48b = { 6 }, - .errstr = "invalid indirect read from stack off -8+0 size 8", + .errstr = "invalid indirect read from stack R2 off -8+0 size 8", .result = REJECT, .prog_type = BPF_PROG_TYPE_XDP, }, diff --git a/tools/testing/selftests/bpf/verifier/const_or.c b/tools/testing/selftests/bpf/verifier/const_or.c index 6c214c58e8d4..0719b0ddec04 100644 --- a/tools/testing/selftests/bpf/verifier/const_or.c +++ b/tools/testing/selftests/bpf/verifier/const_or.c @@ -23,7 +23,7 @@ BPF_EMIT_CALL(BPF_FUNC_probe_read_kernel), BPF_EXIT_INSN(), }, - .errstr = "invalid stack type R1 off=-48 access_size=58", + .errstr = "invalid indirect access to stack R1 off=-48 size=58", .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, @@ -54,7 +54,7 @@ BPF_EMIT_CALL(BPF_FUNC_probe_read_kernel), BPF_EXIT_INSN(), }, - .errstr = "invalid stack type R1 off=-48 access_size=58", + .errstr = "invalid indirect access to stack R1 off=-48 size=58", .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, diff --git a/tools/testing/selftests/bpf/verifier/helper_access_var_len.c b/tools/testing/selftests/bpf/verifier/helper_access_var_len.c index 87c4e7900083..0ab7f1dfc97a 100644 --- a/tools/testing/selftests/bpf/verifier/helper_access_var_len.c +++ b/tools/testing/selftests/bpf/verifier/helper_access_var_len.c @@ -39,7 +39,7 @@ BPF_EMIT_CALL(BPF_FUNC_probe_read_kernel), BPF_EXIT_INSN(), }, - .errstr = "invalid indirect read from stack off -64+0 size 64", + .errstr = "invalid indirect read from stack R1 off -64+0 size 64", .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, @@ -59,7 +59,7 @@ BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "invalid stack type R1 off=-64 access_size=65", + .errstr = "invalid indirect access to stack R1 off=-64 size=65", .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, @@ -136,7 +136,7 @@ BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "invalid stack type R1 off=-64 access_size=65", + .errstr = "invalid indirect access to stack R1 off=-64 size=65", .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, @@ -156,7 +156,7 @@ BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "invalid stack type R1 off=-64 access_size=65", + .errstr = "invalid indirect access to stack R1 off=-64 size=65", .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, @@ -194,7 +194,7 @@ BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "invalid indirect read from stack off -64+0 size 64", + .errstr = "invalid indirect read from stack R1 off -64+0 size 64", .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, @@ -584,7 +584,7 @@ BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_10, -16), BPF_EXIT_INSN(), }, - .errstr = "invalid indirect read from stack off -64+32 size 64", + .errstr = "invalid indirect read from stack R1 off -64+32 size 64", .result = REJECT, .prog_type = BPF_PROG_TYPE_TRACEPOINT, }, diff --git a/tools/testing/selftests/bpf/verifier/int_ptr.c b/tools/testing/selftests/bpf/verifier/int_ptr.c index ca3b4729df66..070893fb2900 100644 --- a/tools/testing/selftests/bpf/verifier/int_ptr.c +++ b/tools/testing/selftests/bpf/verifier/int_ptr.c @@ -27,7 +27,7 @@ }, .result = REJECT, .prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL, - .errstr = "invalid indirect read from stack off -16+0 size 8", + .errstr = "invalid indirect read from stack R4 off -16+0 size 8", }, { "ARG_PTR_TO_LONG half-uninitialized", @@ -59,7 +59,7 @@ }, .result = REJECT, .prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL, - .errstr = "invalid indirect read from stack off -16+4 size 8", + .errstr = "invalid indirect read from stack R4 off -16+4 size 8", }, { "ARG_PTR_TO_LONG misaligned", @@ -125,7 +125,7 @@ }, .result = REJECT, .prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL, - .errstr = "invalid stack type R4 off=-4 access_size=8", + .errstr = "invalid indirect access to stack R4 off=-4 size=8", }, { "ARG_PTR_TO_LONG initialized", diff --git a/tools/testing/selftests/bpf/verifier/jit.c b/tools/testing/selftests/bpf/verifier/jit.c index c33adf344fae..df215e004566 100644 --- a/tools/testing/selftests/bpf/verifier/jit.c +++ b/tools/testing/selftests/bpf/verifier/jit.c @@ -105,3 +105,27 @@ .result = ACCEPT, .retval = 2, }, +{ + "jit: torturous jumps, imm8 nop jmp and pure jump padding", + .insns = { }, + .fill_helper = bpf_fill_torturous_jumps, + .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .result = ACCEPT, + .retval = 1, +}, +{ + "jit: torturous jumps, imm32 nop jmp and jmp_cond padding", + .insns = { }, + .fill_helper = bpf_fill_torturous_jumps, + .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .result = ACCEPT, + .retval = 2, +}, +{ + "jit: torturous jumps in subprog", + .insns = { }, + .fill_helper = bpf_fill_torturous_jumps, + .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .result = ACCEPT, + .retval = 3, +}, diff --git a/tools/testing/selftests/bpf/verifier/raw_stack.c b/tools/testing/selftests/bpf/verifier/raw_stack.c index 193d9e87d5a9..cc8e8c3cdc03 100644 --- a/tools/testing/selftests/bpf/verifier/raw_stack.c +++ b/tools/testing/selftests/bpf/verifier/raw_stack.c @@ -11,7 +11,7 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid read from stack off -8+0 size 8", + .errstr = "invalid read from stack R6 off=-8 size=8", .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, { @@ -59,7 +59,7 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack type R3", + .errstr = "invalid zero-sized read", .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, { @@ -205,7 +205,7 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack type R3 off=-513 access_size=8", + .errstr = "invalid indirect access to stack R3 off=-513 size=8", .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, { @@ -221,7 +221,7 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack type R3 off=-1 access_size=8", + .errstr = "invalid indirect access to stack R3 off=-1 size=8", .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, { @@ -285,7 +285,7 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack type R3 off=-512 access_size=0", + .errstr = "invalid zero-sized read", .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, { diff --git a/tools/testing/selftests/bpf/verifier/stack_ptr.c b/tools/testing/selftests/bpf/verifier/stack_ptr.c index 8bfeb77c60bd..07eaa04412ae 100644 --- a/tools/testing/selftests/bpf/verifier/stack_ptr.c +++ b/tools/testing/selftests/bpf/verifier/stack_ptr.c @@ -44,7 +44,7 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack off=-79992 size=8", + .errstr = "invalid write to stack R1 off=-79992 size=8", .errstr_unpriv = "R1 stack pointer arithmetic goes out of range", }, { @@ -57,7 +57,7 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack off=0 size=8", + .errstr = "invalid write to stack R1 off=0 size=8", }, { "PTR_TO_STACK check high 1", @@ -106,7 +106,7 @@ BPF_EXIT_INSN(), }, .errstr_unpriv = "R1 stack pointer arithmetic goes out of range", - .errstr = "invalid stack off=0 size=1", + .errstr = "invalid write to stack R1 off=0 size=1", .result = REJECT, }, { @@ -119,7 +119,8 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack off", + .errstr_unpriv = "R1 stack pointer arithmetic goes out of range", + .errstr = "invalid write to stack R1", }, { "PTR_TO_STACK check high 6", @@ -131,7 +132,8 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack off", + .errstr_unpriv = "R1 stack pointer arithmetic goes out of range", + .errstr = "invalid write to stack", }, { "PTR_TO_STACK check high 7", @@ -183,7 +185,7 @@ BPF_EXIT_INSN(), }, .errstr_unpriv = "R1 stack pointer arithmetic goes out of range", - .errstr = "invalid stack off=-513 size=1", + .errstr = "invalid write to stack R1 off=-513 size=1", .result = REJECT, }, { @@ -208,7 +210,8 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack off", + .errstr_unpriv = "R1 stack pointer arithmetic goes out of range", + .errstr = "invalid write to stack", }, { "PTR_TO_STACK check low 6", @@ -220,7 +223,8 @@ BPF_EXIT_INSN(), }, .result = REJECT, - .errstr = "invalid stack off", + .errstr = "invalid write to stack", + .errstr_unpriv = "R1 stack pointer arithmetic goes out of range", }, { "PTR_TO_STACK check low 7", @@ -292,7 +296,7 @@ BPF_EXIT_INSN(), }, .result_unpriv = REJECT, - .errstr_unpriv = "invalid stack off=0 size=1", + .errstr_unpriv = "invalid write to stack R1 off=0 size=1", .result = ACCEPT, .retval = 42, }, diff --git a/tools/testing/selftests/bpf/verifier/unpriv.c b/tools/testing/selftests/bpf/verifier/unpriv.c index ee298627abae..b018ad71e0a8 100644 --- a/tools/testing/selftests/bpf/verifier/unpriv.c +++ b/tools/testing/selftests/bpf/verifier/unpriv.c @@ -108,7 +108,7 @@ BPF_EXIT_INSN(), }, .fixup_map_hash_8b = { 3 }, - .errstr_unpriv = "invalid indirect read from stack off -8+0 size 8", + .errstr_unpriv = "invalid indirect read from stack R2 off -8+0 size 8", .result_unpriv = REJECT, .result = ACCEPT, }, diff --git a/tools/testing/selftests/bpf/verifier/var_off.c b/tools/testing/selftests/bpf/verifier/var_off.c index 8504ac937809..eab1f7f56e2f 100644 --- a/tools/testing/selftests/bpf/verifier/var_off.c +++ b/tools/testing/selftests/bpf/verifier/var_off.c @@ -18,7 +18,7 @@ .prog_type = BPF_PROG_TYPE_LWT_IN, }, { - "variable-offset stack access", + "variable-offset stack read, priv vs unpriv", .insns = { /* Fill the top 8 bytes of the stack */ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), @@ -31,15 +31,110 @@ * we don't know which */ BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_10), - /* dereference it */ + /* dereference it for a stack read */ + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = REJECT, + .errstr_unpriv = "R2 variable stack access prohibited for !root", + .prog_type = BPF_PROG_TYPE_CGROUP_SKB, +}, +{ + "variable-offset stack read, uninitialized", + .insns = { + /* Get an unknown value */ + BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0), + /* Make it small and 4-byte aligned */ + BPF_ALU64_IMM(BPF_AND, BPF_REG_2, 4), + BPF_ALU64_IMM(BPF_SUB, BPF_REG_2, 8), + /* add it to fp. We now have either fp-4 or fp-8, but + * we don't know which + */ + BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_10), + /* dereference it for a stack read */ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "variable stack access var_off=(0xfffffffffffffff8; 0x4)", .result = REJECT, + .errstr = "invalid variable-offset read from stack R2", .prog_type = BPF_PROG_TYPE_LWT_IN, }, { + "variable-offset stack write, priv vs unpriv", + .insns = { + /* Get an unknown value */ + BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0), + /* Make it small and 8-byte aligned */ + BPF_ALU64_IMM(BPF_AND, BPF_REG_2, 8), + BPF_ALU64_IMM(BPF_SUB, BPF_REG_2, 16), + /* Add it to fp. We now have either fp-8 or fp-16, but + * we don't know which + */ + BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_10), + /* Dereference it for a stack write */ + BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0), + /* Now read from the address we just wrote. This shows + * that, after a variable-offset write, a priviledged + * program can read the slots that were in the range of + * that write (even if the verifier doesn't actually know + * if the slot being read was really written to or not. + */ + BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_2, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + /* Variable stack access is rejected for unprivileged. + */ + .errstr_unpriv = "R2 variable stack access prohibited for !root", + .result_unpriv = REJECT, + .result = ACCEPT, +}, +{ + "variable-offset stack write clobbers spilled regs", + .insns = { + /* Dummy instruction; needed because we need to patch the next one + * and we can't patch the first instruction. + */ + BPF_MOV64_IMM(BPF_REG_6, 0), + /* Make R0 a map ptr */ + BPF_LD_MAP_FD(BPF_REG_0, 0), + /* Get an unknown value */ + BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 0), + /* Make it small and 8-byte aligned */ + BPF_ALU64_IMM(BPF_AND, BPF_REG_2, 8), + BPF_ALU64_IMM(BPF_SUB, BPF_REG_2, 16), + /* Add it to fp. We now have either fp-8 or fp-16, but + * we don't know which. + */ + BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_10), + /* Spill R0(map ptr) into stack */ + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8), + /* Dereference the unknown value for a stack write */ + BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0), + /* Fill the register back into R2 */ + BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_10, -8), + /* Try to dereference R2 for a memory load */ + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_2, 8), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 1 }, + /* The unpriviledged case is not too interesting; variable + * stack access is rejected. + */ + .errstr_unpriv = "R2 variable stack access prohibited for !root", + .result_unpriv = REJECT, + /* In the priviledged case, dereferencing a spilled-and-then-filled + * register is rejected because the previous variable offset stack + * write might have overwritten the spilled pointer (i.e. we lose track + * of the spilled register when we analyze the write). + */ + .errstr = "R2 invalid mem access 'inv'", + .result = REJECT, +}, +{ "indirect variable-offset stack access, unbounded", .insns = { BPF_MOV64_IMM(BPF_REG_2, 6), @@ -63,7 +158,7 @@ BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "R4 unbounded indirect variable offset stack access", + .errstr = "invalid unbounded variable-offset indirect access to stack R4", .result = REJECT, .prog_type = BPF_PROG_TYPE_SOCK_OPS, }, @@ -88,7 +183,7 @@ BPF_EXIT_INSN(), }, .fixup_map_hash_8b = { 5 }, - .errstr = "R2 max value is outside of stack bound", + .errstr = "invalid variable-offset indirect access to stack R2", .result = REJECT, .prog_type = BPF_PROG_TYPE_LWT_IN, }, @@ -113,7 +208,7 @@ BPF_EXIT_INSN(), }, .fixup_map_hash_8b = { 5 }, - .errstr = "R2 min value is outside of stack bound", + .errstr = "invalid variable-offset indirect access to stack R2", .result = REJECT, .prog_type = BPF_PROG_TYPE_LWT_IN, }, @@ -138,7 +233,7 @@ BPF_EXIT_INSN(), }, .fixup_map_hash_8b = { 5 }, - .errstr = "invalid indirect read from stack var_off", + .errstr = "invalid indirect read from stack R2 var_off", .result = REJECT, .prog_type = BPF_PROG_TYPE_LWT_IN, }, @@ -163,7 +258,7 @@ BPF_EXIT_INSN(), }, .fixup_map_hash_8b = { 5 }, - .errstr = "invalid indirect read from stack var_off", + .errstr = "invalid indirect read from stack R2 var_off", .result = REJECT, .prog_type = BPF_PROG_TYPE_LWT_IN, }, @@ -189,7 +284,7 @@ BPF_EXIT_INSN(), }, .fixup_map_hash_8b = { 6 }, - .errstr_unpriv = "R2 stack pointer arithmetic goes out of range, prohibited for !root", + .errstr_unpriv = "R2 variable stack access prohibited for !root", .result_unpriv = REJECT, .result = ACCEPT, .prog_type = BPF_PROG_TYPE_CGROUP_SKB, @@ -217,7 +312,7 @@ BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "invalid indirect read from stack var_off", + .errstr = "invalid indirect read from stack R4 var_off", .result = REJECT, .prog_type = BPF_PROG_TYPE_SOCK_OPS, }, diff --git a/tools/testing/selftests/bpf/vmtest.sh b/tools/testing/selftests/bpf/vmtest.sh new file mode 100755 index 000000000000..26ae8d0b6ce3 --- /dev/null +++ b/tools/testing/selftests/bpf/vmtest.sh @@ -0,0 +1,368 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +set -u +set -e + +# This script currently only works for x86_64, as +# it is based on the VM image used by the BPF CI which is +# x86_64. +QEMU_BINARY="${QEMU_BINARY:="qemu-system-x86_64"}" +X86_BZIMAGE="arch/x86/boot/bzImage" +DEFAULT_COMMAND="./test_progs" +MOUNT_DIR="mnt" +ROOTFS_IMAGE="root.img" +OUTPUT_DIR="$HOME/.bpf_selftests" +KCONFIG_URL="https://raw.githubusercontent.com/libbpf/libbpf/master/travis-ci/vmtest/configs/latest.config" +KCONFIG_API_URL="https://api.github.com/repos/libbpf/libbpf/contents/travis-ci/vmtest/configs/latest.config" +INDEX_URL="https://raw.githubusercontent.com/libbpf/libbpf/master/travis-ci/vmtest/configs/INDEX" +NUM_COMPILE_JOBS="$(nproc)" + +usage() +{ + cat <<EOF +Usage: $0 [-i] [-d <output_dir>] -- [<command>] + +<command> is the command you would normally run when you are in +tools/testing/selftests/bpf. e.g: + + $0 -- ./test_progs -t test_lsm + +If no command is specified, "${DEFAULT_COMMAND}" will be run by +default. + +If you build your kernel using KBUILD_OUTPUT= or O= options, these +can be passed as environment variables to the script: + + O=<kernel_build_path> $0 -- ./test_progs -t test_lsm + +or + + KBUILD_OUTPUT=<kernel_build_path> $0 -- ./test_progs -t test_lsm + +Options: + + -i) Update the rootfs image with a newer version. + -d) Update the output directory (default: ${OUTPUT_DIR}) + -j) Number of jobs for compilation, similar to -j in make + (default: ${NUM_COMPILE_JOBS}) +EOF +} + +unset URLS +populate_url_map() +{ + if ! declare -p URLS &> /dev/null; then + # URLS contain the mapping from file names to URLs where + # those files can be downloaded from. + declare -gA URLS + while IFS=$'\t' read -r name url; do + URLS["$name"]="$url" + done < <(curl -Lsf ${INDEX_URL}) + fi +} + +download() +{ + local file="$1" + + if [[ ! -v URLS[$file] ]]; then + echo "$file not found" >&2 + return 1 + fi + + echo "Downloading $file..." >&2 + curl -Lsf "${URLS[$file]}" "${@:2}" +} + +newest_rootfs_version() +{ + { + for file in "${!URLS[@]}"; do + if [[ $file =~ ^libbpf-vmtest-rootfs-(.*)\.tar\.zst$ ]]; then + echo "${BASH_REMATCH[1]}" + fi + done + } | sort -rV | head -1 +} + +download_rootfs() +{ + local rootfsversion="$1" + local dir="$2" + + if ! which zstd &> /dev/null; then + echo 'Could not find "zstd" on the system, please install zstd' + exit 1 + fi + + download "libbpf-vmtest-rootfs-$rootfsversion.tar.zst" | + zstd -d | sudo tar -C "$dir" -x +} + +recompile_kernel() +{ + local kernel_checkout="$1" + local make_command="$2" + + cd "${kernel_checkout}" + + ${make_command} olddefconfig + ${make_command} +} + +mount_image() +{ + local rootfs_img="${OUTPUT_DIR}/${ROOTFS_IMAGE}" + local mount_dir="${OUTPUT_DIR}/${MOUNT_DIR}" + + sudo mount -o loop "${rootfs_img}" "${mount_dir}" +} + +unmount_image() +{ + local mount_dir="${OUTPUT_DIR}/${MOUNT_DIR}" + + sudo umount "${mount_dir}" &> /dev/null +} + +update_selftests() +{ + local kernel_checkout="$1" + local selftests_dir="${kernel_checkout}/tools/testing/selftests/bpf" + + cd "${selftests_dir}" + ${make_command} + + # Mount the image and copy the selftests to the image. + mount_image + sudo rm -rf "${mount_dir}/root/bpf" + sudo cp -r "${selftests_dir}" "${mount_dir}/root" + unmount_image +} + +update_init_script() +{ + local init_script_dir="${OUTPUT_DIR}/${MOUNT_DIR}/etc/rcS.d" + local init_script="${init_script_dir}/S50-startup" + local command="$1" + local log_file="$2" + + mount_image + + if [[ ! -d "${init_script_dir}" ]]; then + cat <<EOF +Could not find ${init_script_dir} in the mounted image. +This likely indicates a bad rootfs image, Please download +a new image by passing "-i" to the script +EOF + exit 1 + + fi + + sudo bash -c "cat >${init_script}" <<EOF +#!/bin/bash + +{ + cd /root/bpf + echo ${command} + stdbuf -oL -eL ${command} +} 2>&1 | tee /root/${log_file} +poweroff -f +EOF + + sudo chmod a+x "${init_script}" + unmount_image +} + +create_vm_image() +{ + local rootfs_img="${OUTPUT_DIR}/${ROOTFS_IMAGE}" + local mount_dir="${OUTPUT_DIR}/${MOUNT_DIR}" + + rm -rf "${rootfs_img}" + touch "${rootfs_img}" + chattr +C "${rootfs_img}" >/dev/null 2>&1 || true + + truncate -s 2G "${rootfs_img}" + mkfs.ext4 -q "${rootfs_img}" + + mount_image + download_rootfs "$(newest_rootfs_version)" "${mount_dir}" + unmount_image +} + +run_vm() +{ + local kernel_bzimage="$1" + local rootfs_img="${OUTPUT_DIR}/${ROOTFS_IMAGE}" + + if ! which "${QEMU_BINARY}" &> /dev/null; then + cat <<EOF +Could not find ${QEMU_BINARY} +Please install qemu or set the QEMU_BINARY environment variable. +EOF + exit 1 + fi + + ${QEMU_BINARY} \ + -nodefaults \ + -display none \ + -serial mon:stdio \ + -cpu kvm64 \ + -enable-kvm \ + -smp 4 \ + -m 2G \ + -drive file="${rootfs_img}",format=raw,index=1,media=disk,if=virtio,cache=none \ + -kernel "${kernel_bzimage}" \ + -append "root=/dev/vda rw console=ttyS0,115200" +} + +copy_logs() +{ + local mount_dir="${OUTPUT_DIR}/${MOUNT_DIR}" + local log_file="${mount_dir}/root/$1" + + mount_image + sudo cp ${log_file} "${OUTPUT_DIR}" + sudo rm -f ${log_file} + unmount_image +} + +is_rel_path() +{ + local path="$1" + + [[ ${path:0:1} != "/" ]] +} + +update_kconfig() +{ + local kconfig_file="$1" + local update_command="curl -sLf ${KCONFIG_URL} -o ${kconfig_file}" + # Github does not return the "last-modified" header when retrieving the + # raw contents of the file. Use the API call to get the last-modified + # time of the kernel config and only update the config if it has been + # updated after the previously cached config was created. This avoids + # unnecessarily compiling the kernel and selftests. + if [[ -f "${kconfig_file}" ]]; then + local last_modified_date="$(curl -sL -D - "${KCONFIG_API_URL}" -o /dev/null | \ + grep "last-modified" | awk -F ': ' '{print $2}')" + local remote_modified_timestamp="$(date -d "${last_modified_date}" +"%s")" + local local_creation_timestamp="$(stat -c %Y "${kconfig_file}")" + + if [[ "${remote_modified_timestamp}" -gt "${local_creation_timestamp}" ]]; then + ${update_command} + fi + else + ${update_command} + fi +} + +main() +{ + local script_dir="$(cd -P -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)" + local kernel_checkout=$(realpath "${script_dir}"/../../../../) + local log_file="$(date +"bpf_selftests.%Y-%m-%d_%H-%M-%S.log")" + # By default the script searches for the kernel in the checkout directory but + # it also obeys environment variables O= and KBUILD_OUTPUT= + local kernel_bzimage="${kernel_checkout}/${X86_BZIMAGE}" + local command="${DEFAULT_COMMAND}" + local update_image="no" + + while getopts 'hkid:j:' opt; do + case ${opt} in + i) + update_image="yes" + ;; + d) + OUTPUT_DIR="$OPTARG" + ;; + j) + NUM_COMPILE_JOBS="$OPTARG" + ;; + h) + usage + exit 0 + ;; + \? ) + echo "Invalid Option: -$OPTARG" + usage + exit 1 + ;; + : ) + echo "Invalid Option: -$OPTARG requires an argument" + usage + exit 1 + ;; + esac + done + shift $((OPTIND -1)) + + if [[ $# -eq 0 ]]; then + echo "No command specified, will run ${DEFAULT_COMMAND} in the vm" + else + command="$@" + fi + + local kconfig_file="${OUTPUT_DIR}/latest.config" + local make_command="make -j ${NUM_COMPILE_JOBS} KCONFIG_CONFIG=${kconfig_file}" + + # Figure out where the kernel is being built. + # O takes precedence over KBUILD_OUTPUT. + if [[ "${O:=""}" != "" ]]; then + if is_rel_path "${O}"; then + O="$(realpath "${PWD}/${O}")" + fi + kernel_bzimage="${O}/${X86_BZIMAGE}" + make_command="${make_command} O=${O}" + elif [[ "${KBUILD_OUTPUT:=""}" != "" ]]; then + if is_rel_path "${KBUILD_OUTPUT}"; then + KBUILD_OUTPUT="$(realpath "${PWD}/${KBUILD_OUTPUT}")" + fi + kernel_bzimage="${KBUILD_OUTPUT}/${X86_BZIMAGE}" + make_command="${make_command} KBUILD_OUTPUT=${KBUILD_OUTPUT}" + fi + + populate_url_map + + local rootfs_img="${OUTPUT_DIR}/${ROOTFS_IMAGE}" + local mount_dir="${OUTPUT_DIR}/${MOUNT_DIR}" + + echo "Output directory: ${OUTPUT_DIR}" + + mkdir -p "${OUTPUT_DIR}" + mkdir -p "${mount_dir}" + update_kconfig "${kconfig_file}" + + recompile_kernel "${kernel_checkout}" "${make_command}" + + if [[ "${update_image}" == "no" && ! -f "${rootfs_img}" ]]; then + echo "rootfs image not found in ${rootfs_img}" + update_image="yes" + fi + + if [[ "${update_image}" == "yes" ]]; then + create_vm_image + fi + + update_selftests "${kernel_checkout}" "${make_command}" + update_init_script "${command}" "${log_file}" + run_vm "${kernel_bzimage}" + copy_logs "${log_file}" + echo "Logs saved in ${OUTPUT_DIR}/${log_file}" +} + +catch() +{ + local exit_code=$1 + # This is just a cleanup and the directory may + # have already been unmounted. So, don't let this + # clobber the error code we intend to return. + unmount_image || true + exit ${exit_code} +} + +trap 'catch "$?"' EXIT + +main "$@" diff --git a/tools/testing/selftests/bpf/xdpxceiver.c b/tools/testing/selftests/bpf/xdpxceiver.c index 1e722ee76b1f..f4a96d5ff524 100644 --- a/tools/testing/selftests/bpf/xdpxceiver.c +++ b/tools/testing/selftests/bpf/xdpxceiver.c @@ -224,14 +224,14 @@ static inline u16 udp_csum(u32 saddr, u32 daddr, u32 len, u8 proto, u16 *udp_pkt return csum_tcpudp_magic(saddr, daddr, len, proto, csum); } -static void gen_eth_hdr(void *data, struct ethhdr *eth_hdr) +static void gen_eth_hdr(struct ifobject *ifobject, struct ethhdr *eth_hdr) { - memcpy(eth_hdr->h_dest, ((struct ifobject *)data)->dst_mac, ETH_ALEN); - memcpy(eth_hdr->h_source, ((struct ifobject *)data)->src_mac, ETH_ALEN); + memcpy(eth_hdr->h_dest, ifobject->dst_mac, ETH_ALEN); + memcpy(eth_hdr->h_source, ifobject->src_mac, ETH_ALEN); eth_hdr->h_proto = htons(ETH_P_IP); } -static void gen_ip_hdr(void *data, struct iphdr *ip_hdr) +static void gen_ip_hdr(struct ifobject *ifobject, struct iphdr *ip_hdr) { ip_hdr->version = IP_PKT_VER; ip_hdr->ihl = 0x5; @@ -241,18 +241,18 @@ static void gen_ip_hdr(void *data, struct iphdr *ip_hdr) ip_hdr->frag_off = 0; ip_hdr->ttl = IPDEFTTL; ip_hdr->protocol = IPPROTO_UDP; - ip_hdr->saddr = ((struct ifobject *)data)->src_ip; - ip_hdr->daddr = ((struct ifobject *)data)->dst_ip; + ip_hdr->saddr = ifobject->src_ip; + ip_hdr->daddr = ifobject->dst_ip; ip_hdr->check = 0; } -static void gen_udp_hdr(void *data, void *arg, struct udphdr *udp_hdr) +static void gen_udp_hdr(struct generic_data *data, struct ifobject *ifobject, + struct udphdr *udp_hdr) { - udp_hdr->source = htons(((struct ifobject *)arg)->src_port); - udp_hdr->dest = htons(((struct ifobject *)arg)->dst_port); + udp_hdr->source = htons(ifobject->src_port); + udp_hdr->dest = htons(ifobject->dst_port); udp_hdr->len = htons(UDP_PKT_SIZE); - memset32_htonl(pkt_data + PKT_HDR_SIZE, - htonl(((struct generic_data *)data)->seqnum), UDP_PKT_DATA_SIZE); + memset32_htonl(pkt_data + PKT_HDR_SIZE, htonl(data->seqnum), UDP_PKT_DATA_SIZE); } static void gen_udp_csum(struct udphdr *udp_hdr, struct iphdr *ip_hdr) @@ -382,21 +382,19 @@ static bool switch_namespace(int idx) static void *nsswitchthread(void *args) { - if (switch_namespace(((struct targs *)args)->idx)) { - ifdict[((struct targs *)args)->idx]->ifindex = - if_nametoindex(ifdict[((struct targs *)args)->idx]->ifname); - if (!ifdict[((struct targs *)args)->idx]->ifindex) { - ksft_test_result_fail - ("ERROR: [%s] interface \"%s\" does not exist\n", - __func__, ifdict[((struct targs *)args)->idx]->ifname); - ((struct targs *)args)->retptr = false; + struct targs *targs = args; + + targs->retptr = false; + + if (switch_namespace(targs->idx)) { + ifdict[targs->idx]->ifindex = if_nametoindex(ifdict[targs->idx]->ifname); + if (!ifdict[targs->idx]->ifindex) { + ksft_test_result_fail("ERROR: [%s] interface \"%s\" does not exist\n", + __func__, ifdict[targs->idx]->ifname); } else { - ksft_print_msg("Interface found: %s\n", - ifdict[((struct targs *)args)->idx]->ifname); - ((struct targs *)args)->retptr = true; + ksft_print_msg("Interface found: %s\n", ifdict[targs->idx]->ifname); + targs->retptr = true; } - } else { - ((struct targs *)args)->retptr = false; } pthread_exit(NULL); } @@ -413,12 +411,12 @@ static int validate_interfaces(void) if (strcmp(ifdict[i]->nsname, "")) { struct targs *targs; - targs = (struct targs *)malloc(sizeof(struct targs)); + targs = malloc(sizeof(*targs)); if (!targs) exit_with_error(errno); targs->idx = i; - if (pthread_create(&ns_thread, NULL, nsswitchthread, (void *)targs)) + if (pthread_create(&ns_thread, NULL, nsswitchthread, targs)) exit_with_error(errno); pthread_join(ns_thread, NULL); @@ -569,16 +567,18 @@ static void rx_pkt(struct xsk_socket_info *xsk, struct pollfd *fds) } for (i = 0; i < rcvd; i++) { - u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr; - (void)xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len; - u64 orig = xsk_umem__extract_addr(addr); + u64 addr, orig; + + addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr; + xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++); + orig = xsk_umem__extract_addr(addr); addr = xsk_umem__add_offset_to_addr(addr); pkt_node_rx = malloc(sizeof(struct pkt) + PKT_SIZE); if (!pkt_node_rx) exit_with_error(errno); - pkt_node_rx->pkt_frame = (char *)malloc(PKT_SIZE); + pkt_node_rx->pkt_frame = malloc(PKT_SIZE); if (!pkt_node_rx->pkt_frame) exit_with_error(errno); @@ -628,28 +628,27 @@ static inline int get_batch_size(int pkt_cnt) return opt_pkt_count - pkt_cnt; } -static void complete_tx_only_all(void *arg) +static void complete_tx_only_all(struct ifobject *ifobject) { bool pending; do { pending = false; - if (((struct ifobject *)arg)->xsk->outstanding_tx) { - complete_tx_only(((struct ifobject *) - arg)->xsk, BATCH_SIZE); - pending = !!((struct ifobject *)arg)->xsk->outstanding_tx; + if (ifobject->xsk->outstanding_tx) { + complete_tx_only(ifobject->xsk, BATCH_SIZE); + pending = !!ifobject->xsk->outstanding_tx; } } while (pending); } -static void tx_only_all(void *arg) +static void tx_only_all(struct ifobject *ifobject) { struct pollfd fds[MAX_SOCKS] = { }; u32 frame_nb = 0; int pkt_cnt = 0; int ret; - fds[0].fd = xsk_socket__fd(((struct ifobject *)arg)->xsk->xsk); + fds[0].fd = xsk_socket__fd(ifobject->xsk->xsk); fds[0].events = POLLOUT; while ((opt_pkt_count && pkt_cnt < opt_pkt_count) || !opt_pkt_count) { @@ -664,12 +663,12 @@ static void tx_only_all(void *arg) continue; } - tx_only(((struct ifobject *)arg)->xsk, &frame_nb, batch_size); + tx_only(ifobject->xsk, &frame_nb, batch_size); pkt_cnt += batch_size; } if (opt_pkt_count) - complete_tx_only_all(arg); + complete_tx_only_all(ifobject); } static void worker_pkt_dump(void) @@ -727,21 +726,21 @@ static void worker_pkt_dump(void) static void worker_pkt_validate(void) { u32 payloadseqnum = -2; + struct iphdr *iphdr; while (1) { - pkt_node_rx_q = malloc(sizeof(struct pkt)); pkt_node_rx_q = TAILQ_LAST(&head, head_s); if (!pkt_node_rx_q) break; + + iphdr = (struct iphdr *)(pkt_node_rx_q->pkt_frame + sizeof(struct ethhdr)); + /*do not increment pktcounter if !(tos=0x9 and ipv4) */ - if ((((struct iphdr *)(pkt_node_rx_q->pkt_frame + - sizeof(struct ethhdr)))->version == IP_PKT_VER) - && (((struct iphdr *)(pkt_node_rx_q->pkt_frame + sizeof(struct ethhdr)))->tos == - IP_PKT_TOS)) { - payloadseqnum = *((uint32_t *) (pkt_node_rx_q->pkt_frame + PKT_HDR_SIZE)); + if (iphdr->version == IP_PKT_VER && iphdr->tos == IP_PKT_TOS) { + payloadseqnum = *((uint32_t *)(pkt_node_rx_q->pkt_frame + PKT_HDR_SIZE)); if (debug_pkt_dump && payloadseqnum != EOT) { - pkt_obj = (struct pkt_frame *)malloc(sizeof(struct pkt_frame)); - pkt_obj->payload = (char *)malloc(PKT_SIZE); + pkt_obj = malloc(sizeof(*pkt_obj)); + pkt_obj->payload = malloc(PKT_SIZE); memcpy(pkt_obj->payload, pkt_node_rx_q->pkt_frame, PKT_SIZE); pkt_buf[payloadseqnum] = pkt_obj; } @@ -759,35 +758,29 @@ static void worker_pkt_validate(void) ksft_exit_xfail(); } - TAILQ_REMOVE(&head, pkt_node_rx_q, pkt_nodes); - free(pkt_node_rx_q->pkt_frame); - free(pkt_node_rx_q); - pkt_node_rx_q = NULL; prev_pkt = payloadseqnum; pkt_counter++; } else { ksft_print_msg("Invalid frame received: "); - ksft_print_msg("[IP_PKT_VER: %02X], [IP_PKT_TOS: %02X]\n", - ((struct iphdr *)(pkt_node_rx_q->pkt_frame + - sizeof(struct ethhdr)))->version, - ((struct iphdr *)(pkt_node_rx_q->pkt_frame + - sizeof(struct ethhdr)))->tos); - TAILQ_REMOVE(&head, pkt_node_rx_q, pkt_nodes); - free(pkt_node_rx_q->pkt_frame); - free(pkt_node_rx_q); - pkt_node_rx_q = NULL; + ksft_print_msg("[IP_PKT_VER: %02X], [IP_PKT_TOS: %02X]\n", iphdr->version, + iphdr->tos); } + + TAILQ_REMOVE(&head, pkt_node_rx_q, pkt_nodes); + free(pkt_node_rx_q->pkt_frame); + free(pkt_node_rx_q); + pkt_node_rx_q = NULL; } } -static void thread_common_ops(void *arg, void *bufs, pthread_mutex_t *mutexptr, +static void thread_common_ops(struct ifobject *ifobject, void *bufs, pthread_mutex_t *mutexptr, atomic_int *spinningptr) { int ctr = 0; int ret; - xsk_configure_umem((struct ifobject *)arg, bufs, num_frames * XSK_UMEM__DEFAULT_FRAME_SIZE); - ret = xsk_configure_socket((struct ifobject *)arg); + xsk_configure_umem(ifobject, bufs, num_frames * XSK_UMEM__DEFAULT_FRAME_SIZE); + ret = xsk_configure_socket(ifobject); /* Retry Create Socket if it fails as xsk_socket__create() * is asynchronous @@ -798,9 +791,8 @@ static void thread_common_ops(void *arg, void *bufs, pthread_mutex_t *mutexptr, pthread_mutex_lock(mutexptr); while (ret && ctr < SOCK_RECONF_CTR) { atomic_store(spinningptr, 1); - xsk_configure_umem((struct ifobject *)arg, - bufs, num_frames * XSK_UMEM__DEFAULT_FRAME_SIZE); - ret = xsk_configure_socket((struct ifobject *)arg); + xsk_configure_umem(ifobject, bufs, num_frames * XSK_UMEM__DEFAULT_FRAME_SIZE); + ret = xsk_configure_socket(ifobject); usleep(USLEEP_MAX); ctr++; } @@ -815,9 +807,10 @@ static void *worker_testapp_validate(void *arg) { struct udphdr *udp_hdr = (struct udphdr *)(pkt_data + sizeof(struct ethhdr) + sizeof(struct iphdr)); - struct generic_data *data = (struct generic_data *)malloc(sizeof(struct generic_data)); struct iphdr *ip_hdr = (struct iphdr *)(pkt_data + sizeof(struct ethhdr)); struct ethhdr *eth_hdr = (struct ethhdr *)pkt_data; + struct ifobject *ifobject = (struct ifobject *)arg; + struct generic_data data; void *bufs = NULL; pthread_attr_setstacksize(&attr, THREAD_STACK); @@ -828,58 +821,56 @@ static void *worker_testapp_validate(void *arg) if (bufs == MAP_FAILED) exit_with_error(errno); - if (strcmp(((struct ifobject *)arg)->nsname, "")) - switch_namespace(((struct ifobject *)arg)->ifdict_index); + if (strcmp(ifobject->nsname, "")) + switch_namespace(ifobject->ifdict_index); } - if (((struct ifobject *)arg)->fv.vector == tx) { + if (ifobject->fv.vector == tx) { int spinningrxctr = 0; if (!bidi_pass) - thread_common_ops(arg, bufs, &sync_mutex_tx, &spinning_tx); + thread_common_ops(ifobject, bufs, &sync_mutex_tx, &spinning_tx); while (atomic_load(&spinning_rx) && spinningrxctr < SOCK_RECONF_CTR) { spinningrxctr++; usleep(USLEEP_MAX); } - ksft_print_msg("Interface [%s] vector [Tx]\n", ((struct ifobject *)arg)->ifname); + ksft_print_msg("Interface [%s] vector [Tx]\n", ifobject->ifname); for (int i = 0; i < num_frames; i++) { /*send EOT frame */ if (i == (num_frames - 1)) - data->seqnum = -1; + data.seqnum = -1; else - data->seqnum = i; - gen_udp_hdr((void *)data, (void *)arg, udp_hdr); - gen_ip_hdr((void *)arg, ip_hdr); + data.seqnum = i; + gen_udp_hdr(&data, ifobject, udp_hdr); + gen_ip_hdr(ifobject, ip_hdr); gen_udp_csum(udp_hdr, ip_hdr); - gen_eth_hdr((void *)arg, eth_hdr); - gen_eth_frame(((struct ifobject *)arg)->umem, - i * XSK_UMEM__DEFAULT_FRAME_SIZE); + gen_eth_hdr(ifobject, eth_hdr); + gen_eth_frame(ifobject->umem, i * XSK_UMEM__DEFAULT_FRAME_SIZE); } - free(data); ksft_print_msg("Sending %d packets on interface %s\n", - (opt_pkt_count - 1), ((struct ifobject *)arg)->ifname); - tx_only_all(arg); - } else if (((struct ifobject *)arg)->fv.vector == rx) { + (opt_pkt_count - 1), ifobject->ifname); + tx_only_all(ifobject); + } else if (ifobject->fv.vector == rx) { struct pollfd fds[MAX_SOCKS] = { }; int ret; if (!bidi_pass) - thread_common_ops(arg, bufs, &sync_mutex_tx, &spinning_rx); + thread_common_ops(ifobject, bufs, &sync_mutex_tx, &spinning_rx); - ksft_print_msg("Interface [%s] vector [Rx]\n", ((struct ifobject *)arg)->ifname); - xsk_populate_fill_ring(((struct ifobject *)arg)->umem); + ksft_print_msg("Interface [%s] vector [Rx]\n", ifobject->ifname); + xsk_populate_fill_ring(ifobject->umem); TAILQ_INIT(&head); if (debug_pkt_dump) { - pkt_buf = malloc(sizeof(struct pkt_frame **) * num_frames); + pkt_buf = calloc(num_frames, sizeof(*pkt_buf)); if (!pkt_buf) exit_with_error(errno); } - fds[0].fd = xsk_socket__fd(((struct ifobject *)arg)->xsk->xsk); + fds[0].fd = xsk_socket__fd(ifobject->xsk->xsk); fds[0].events = POLLIN; pthread_mutex_lock(&sync_mutex); @@ -892,7 +883,7 @@ static void *worker_testapp_validate(void *arg) if (ret <= 0) continue; } - rx_pkt(((struct ifobject *)arg)->xsk, fds); + rx_pkt(ifobject->xsk, fds); worker_pkt_validate(); if (sigvar) @@ -900,21 +891,23 @@ static void *worker_testapp_validate(void *arg) } ksft_print_msg("Received %d packets on interface %s\n", - pkt_counter, ((struct ifobject *)arg)->ifname); + pkt_counter, ifobject->ifname); if (opt_teardown) ksft_print_msg("Destroying socket\n"); } - if (!opt_bidi || (opt_bidi && bidi_pass)) { - xsk_socket__delete(((struct ifobject *)arg)->xsk->xsk); - (void)xsk_umem__delete(((struct ifobject *)arg)->umem->umem); + if (!opt_bidi || bidi_pass) { + xsk_socket__delete(ifobject->xsk->xsk); + (void)xsk_umem__delete(ifobject->umem->umem); } pthread_exit(NULL); } static void testapp_validate(void) { + struct timespec max_wait = { 0, 0 }; + pthread_attr_init(&attr); pthread_attr_setstacksize(&attr, THREAD_STACK); @@ -929,18 +922,16 @@ static void testapp_validate(void) pthread_mutex_lock(&sync_mutex); /*Spawn RX thread */ - if (!opt_bidi || (opt_bidi && !bidi_pass)) { - if (pthread_create(&t0, &attr, worker_testapp_validate, (void *)ifdict[1])) + if (!opt_bidi || !bidi_pass) { + if (pthread_create(&t0, &attr, worker_testapp_validate, ifdict[1])) exit_with_error(errno); } else if (opt_bidi && bidi_pass) { /*switch Tx/Rx vectors */ ifdict[0]->fv.vector = rx; - if (pthread_create(&t0, &attr, worker_testapp_validate, (void *)ifdict[0])) + if (pthread_create(&t0, &attr, worker_testapp_validate, ifdict[0])) exit_with_error(errno); } - struct timespec max_wait = { 0, 0 }; - if (clock_gettime(CLOCK_REALTIME, &max_wait)) exit_with_error(errno); max_wait.tv_sec += TMOUT_SEC; @@ -951,13 +942,13 @@ static void testapp_validate(void) pthread_mutex_unlock(&sync_mutex); /*Spawn TX thread */ - if (!opt_bidi || (opt_bidi && !bidi_pass)) { - if (pthread_create(&t1, &attr, worker_testapp_validate, (void *)ifdict[0])) + if (!opt_bidi || !bidi_pass) { + if (pthread_create(&t1, &attr, worker_testapp_validate, ifdict[0])) exit_with_error(errno); } else if (opt_bidi && bidi_pass) { /*switch Tx/Rx vectors */ ifdict[1]->fv.vector = tx; - if (pthread_create(&t1, &attr, worker_testapp_validate, (void *)ifdict[1])) + if (pthread_create(&t1, &attr, worker_testapp_validate, ifdict[1])) exit_with_error(errno); } @@ -991,25 +982,25 @@ static void testapp_sockets(void) print_ksft_result(); } -static void init_iface_config(void *ifaceconfig) +static void init_iface_config(struct ifaceconfigobj *ifaceconfig) { /*Init interface0 */ ifdict[0]->fv.vector = tx; - memcpy(ifdict[0]->dst_mac, ((struct ifaceconfigobj *)ifaceconfig)->dst_mac, ETH_ALEN); - memcpy(ifdict[0]->src_mac, ((struct ifaceconfigobj *)ifaceconfig)->src_mac, ETH_ALEN); - ifdict[0]->dst_ip = ((struct ifaceconfigobj *)ifaceconfig)->dst_ip.s_addr; - ifdict[0]->src_ip = ((struct ifaceconfigobj *)ifaceconfig)->src_ip.s_addr; - ifdict[0]->dst_port = ((struct ifaceconfigobj *)ifaceconfig)->dst_port; - ifdict[0]->src_port = ((struct ifaceconfigobj *)ifaceconfig)->src_port; + memcpy(ifdict[0]->dst_mac, ifaceconfig->dst_mac, ETH_ALEN); + memcpy(ifdict[0]->src_mac, ifaceconfig->src_mac, ETH_ALEN); + ifdict[0]->dst_ip = ifaceconfig->dst_ip.s_addr; + ifdict[0]->src_ip = ifaceconfig->src_ip.s_addr; + ifdict[0]->dst_port = ifaceconfig->dst_port; + ifdict[0]->src_port = ifaceconfig->src_port; /*Init interface1 */ ifdict[1]->fv.vector = rx; - memcpy(ifdict[1]->dst_mac, ((struct ifaceconfigobj *)ifaceconfig)->src_mac, ETH_ALEN); - memcpy(ifdict[1]->src_mac, ((struct ifaceconfigobj *)ifaceconfig)->dst_mac, ETH_ALEN); - ifdict[1]->dst_ip = ((struct ifaceconfigobj *)ifaceconfig)->src_ip.s_addr; - ifdict[1]->src_ip = ((struct ifaceconfigobj *)ifaceconfig)->dst_ip.s_addr; - ifdict[1]->dst_port = ((struct ifaceconfigobj *)ifaceconfig)->src_port; - ifdict[1]->src_port = ((struct ifaceconfigobj *)ifaceconfig)->dst_port; + memcpy(ifdict[1]->dst_mac, ifaceconfig->src_mac, ETH_ALEN); + memcpy(ifdict[1]->src_mac, ifaceconfig->dst_mac, ETH_ALEN); + ifdict[1]->dst_ip = ifaceconfig->src_ip.s_addr; + ifdict[1]->src_ip = ifaceconfig->dst_ip.s_addr; + ifdict[1]->dst_port = ifaceconfig->src_port; + ifdict[1]->src_port = ifaceconfig->dst_port; } int main(int argc, char **argv) @@ -1026,7 +1017,7 @@ int main(int argc, char **argv) u16 UDP_DST_PORT = 2020; u16 UDP_SRC_PORT = 2121; - ifaceconfig = (struct ifaceconfigobj *)malloc(sizeof(struct ifaceconfigobj)); + ifaceconfig = malloc(sizeof(struct ifaceconfigobj)); memcpy(ifaceconfig->dst_mac, MAC1, ETH_ALEN); memcpy(ifaceconfig->src_mac, MAC2, ETH_ALEN); inet_aton(IP1, &ifaceconfig->dst_ip); @@ -1035,7 +1026,7 @@ int main(int argc, char **argv) ifaceconfig->src_port = UDP_SRC_PORT; for (int i = 0; i < MAX_INTERFACES; i++) { - ifdict[i] = (struct ifobject *)malloc(sizeof(struct ifobject)); + ifdict[i] = malloc(sizeof(struct ifobject)); if (!ifdict[i]) exit_with_error(errno); @@ -1048,7 +1039,7 @@ int main(int argc, char **argv) num_frames = ++opt_pkt_count; - init_iface_config((void *)ifaceconfig); + init_iface_config(ifaceconfig); pthread_init_mutex(); diff --git a/tools/testing/selftests/bpf/xdpxceiver.h b/tools/testing/selftests/bpf/xdpxceiver.h index 61f595b6f200..0e9f9b7e61c2 100644 --- a/tools/testing/selftests/bpf/xdpxceiver.h +++ b/tools/testing/selftests/bpf/xdpxceiver.h @@ -92,8 +92,6 @@ struct flow_vector { enum fvector { tx, rx, - bidi, - undef, } vector; }; diff --git a/tools/testing/selftests/tc-testing/Makefile b/tools/testing/selftests/tc-testing/Makefile index 91fee5c43274..4d639279f41e 100644 --- a/tools/testing/selftests/tc-testing/Makefile +++ b/tools/testing/selftests/tc-testing/Makefile @@ -1,4 +1,5 @@ # SPDX-License-Identifier: GPL-2.0 +include ../../../scripts/Makefile.include top_srcdir = $(abspath ../../../..) APIDIR := $(top_scrdir)/include/uapi @@ -7,8 +8,6 @@ TEST_GEN_FILES = action.o KSFT_KHDR_INSTALL := 1 include ../lib.mk -CLANG ?= clang -LLC ?= llc PROBE := $(shell $(LLC) -march=bpf -mcpu=probe -filetype=null /dev/null 2>&1) ifeq ($(PROBE),) |