summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2022-07-01bpf: Fix insufficient bounds propagation from adjust_scalar_min_max_valsDaniel Borkmann
Kuee reported a corner case where the tnum becomes constant after the call to __reg_bound_offset(), but the register's bounds are not, that is, its min bounds are still not equal to the register's max bounds. This in turn allows to leak pointers through turning a pointer register as is into an unknown scalar via adjust_ptr_min_max_vals(). Before: func#0 @0 0: R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) 0: (b7) r0 = 1 ; R0_w=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) 1: (b7) r3 = 0 ; R3_w=scalar(imm=0,umax=0,var_off=(0x0; 0x0)) 2: (87) r3 = -r3 ; R3_w=scalar() 3: (87) r3 = -r3 ; R3_w=scalar() 4: (47) r3 |= 32767 ; R3_w=scalar(smin=-9223372036854743041,umin=32767,var_off=(0x7fff; 0xffffffffffff8000),s32_min=-2147450881) 5: (75) if r3 s>= 0x0 goto pc+1 ; R3_w=scalar(umin=9223372036854808575,var_off=(0x8000000000007fff; 0x7fffffffffff8000),s32_min=-2147450881,u32_min=32767) 6: (95) exit from 5 to 7: R0=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R3=scalar(umin=32767,umax=9223372036854775807,var_off=(0x7fff; 0x7fffffffffff8000),s32_min=-2147450881) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) 7: (d5) if r3 s<= 0x8000 goto pc+1 ; R3=scalar(umin=32769,umax=9223372036854775807,var_off=(0x7fff; 0x7fffffffffff8000),s32_min=-2147450881,u32_min=32767) 8: (95) exit from 7 to 9: R0=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R3=scalar(umin=32767,umax=32768,var_off=(0x7fff; 0x8000)) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) 9: (07) r3 += -32767 ; R3_w=scalar(imm=0,umax=1,var_off=(0x0; 0x0)) <--- [*] 10: (95) exit What can be seen here is that R3=scalar(umin=32767,umax=32768,var_off=(0x7fff; 0x8000)) after the operation R3 += -32767 results in a 'malformed' constant, that is, R3_w=scalar(imm=0,umax=1,var_off=(0x0; 0x0)). Intersecting with var_off has not been done at that point via __update_reg_bounds(), which would have improved the umax to be equal to umin. Refactor the tnum <> min/max bounds information flow into a reg_bounds_sync() helper and use it consistently everywhere. After the fix, bounds have been corrected to R3_w=scalar(imm=0,umax=0,var_off=(0x0; 0x0)) and thus the register is regarded as a 'proper' constant scalar of 0. After: func#0 @0 0: R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) 0: (b7) r0 = 1 ; R0_w=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) 1: (b7) r3 = 0 ; R3_w=scalar(imm=0,umax=0,var_off=(0x0; 0x0)) 2: (87) r3 = -r3 ; R3_w=scalar() 3: (87) r3 = -r3 ; R3_w=scalar() 4: (47) r3 |= 32767 ; R3_w=scalar(smin=-9223372036854743041,umin=32767,var_off=(0x7fff; 0xffffffffffff8000),s32_min=-2147450881) 5: (75) if r3 s>= 0x0 goto pc+1 ; R3_w=scalar(umin=9223372036854808575,var_off=(0x8000000000007fff; 0x7fffffffffff8000),s32_min=-2147450881,u32_min=32767) 6: (95) exit from 5 to 7: R0=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R3=scalar(umin=32767,umax=9223372036854775807,var_off=(0x7fff; 0x7fffffffffff8000),s32_min=-2147450881) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) 7: (d5) if r3 s<= 0x8000 goto pc+1 ; R3=scalar(umin=32769,umax=9223372036854775807,var_off=(0x7fff; 0x7fffffffffff8000),s32_min=-2147450881,u32_min=32767) 8: (95) exit from 7 to 9: R0=scalar(imm=1,umin=1,umax=1,var_off=(0x1; 0x0)) R1=ctx(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) R3=scalar(umin=32767,umax=32768,var_off=(0x7fff; 0x8000)) R10=fp(off=0,imm=0,umax=0,var_off=(0x0; 0x0)) 9: (07) r3 += -32767 ; R3_w=scalar(imm=0,umax=0,var_off=(0x0; 0x0)) <--- [*] 10: (95) exit Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values") Reported-by: Kuee K1r0a <liulin063@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220701124727.11153-2-daniel@iogearbox.net
2022-07-01bpf: Fix incorrect verifier simulation around jmp32's jeq/jneDaniel Borkmann
Kuee reported a quirk in the jmp32's jeq/jne simulation, namely that the register value does not match expectations for the fall-through path. For example: Before fix: 0: R1=ctx(off=0,imm=0) R10=fp0 0: (b7) r2 = 0 ; R2_w=P0 1: (b7) r6 = 563 ; R6_w=P563 2: (87) r2 = -r2 ; R2_w=Pscalar() 3: (87) r2 = -r2 ; R2_w=Pscalar() 4: (4c) w2 |= w6 ; R2_w=Pscalar(umin=563,umax=4294967295,var_off=(0x233; 0xfffffdcc),s32_min=-2147483085) R6_w=P563 5: (56) if w2 != 0x8 goto pc+1 ; R2_w=P571 <--- [*] 6: (95) exit R0 !read_ok After fix: 0: R1=ctx(off=0,imm=0) R10=fp0 0: (b7) r2 = 0 ; R2_w=P0 1: (b7) r6 = 563 ; R6_w=P563 2: (87) r2 = -r2 ; R2_w=Pscalar() 3: (87) r2 = -r2 ; R2_w=Pscalar() 4: (4c) w2 |= w6 ; R2_w=Pscalar(umin=563,umax=4294967295,var_off=(0x233; 0xfffffdcc),s32_min=-2147483085) R6_w=P563 5: (56) if w2 != 0x8 goto pc+1 ; R2_w=P8 <--- [*] 6: (95) exit R0 !read_ok As can be seen on line 5 for the branch fall-through path in R2 [*] is that given condition w2 != 0x8 is false, verifier should conclude that r2 = 8 as upper 32 bit are known to be zero. However, verifier incorrectly concludes that r2 = 571 which is far off. The problem is it only marks false{true}_reg as known in the switch for JE/NE case, but at the end of the function, it uses {false,true}_{64,32}off to update {false,true}_reg->var_off and they still hold the prior value of {false,true}_reg->var_off before it got marked as known. The subsequent __reg_combine_32_into_64() then propagates this old var_off and derives new bounds. The information between min/max bounds on {false,true}_reg from setting the register to known const combined with the {false,true}_reg->var_off based on the old information then derives wrong register data. Fix it by detangling the BPF_JEQ/BPF_JNE cases and updating relevant {false,true}_{64,32}off tnums along with the register marking to known constant. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Kuee K1r0a <liulin063@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220701124727.11153-1-daniel@iogearbox.net
2022-06-15bpf: Limit maximum modifier chain length in btf_check_type_tagsKumar Kartikeya Dwivedi
On processing a module BTF of module built for an older kernel, we might sometimes find that some type points to itself forming a loop. If such a type is a modifier, btf_check_type_tags's while loop following modifier chain will be caught in an infinite loop. Fix this by defining a maximum chain length and bailing out if we spin any longer than that. Fixes: eb596b090558 ("bpf: Ensure type tags precede modifiers in BTF") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220615042151.2266537-1-memxor@gmail.com
2022-06-07bpf: Fix calling global functions from BPF_PROG_TYPE_EXT programsToke Høiland-Jørgensen
The verifier allows programs to call global functions as long as their argument types match, using BTF to check the function arguments. One of the allowed argument types to such global functions is PTR_TO_CTX; however the check for this fails on BPF_PROG_TYPE_EXT functions because the verifier uses the wrong type to fetch the vmlinux BTF ID for the program context type. This failure is seen when an XDP program is loaded using libxdp (which loads it as BPF_PROG_TYPE_EXT and attaches it to a global XDP type program). Fix the issue by passing in the target program type instead of the BPF_PROG_TYPE_EXT type to bpf_prog_get_ctx() when checking function argument compatibility. The first Fixes tag refers to the latest commit that touched the code in question, while the second one points to the code that first introduced the global function call verification. v2: - Use resolve_prog_type() Fixes: 3363bd0cfbb8 ("bpf: Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support") Fixes: 51c39bb1d5d1 ("bpf: Introduce function-by-function verification") Reported-by: Simon Sundberg <simon.sundberg@kau.se> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20220606075253.28422-1-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-02Merge tag 'net-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and netfilter. Current release - new code bugs: - af_packet: make sure to pull the MAC header, avoid skb panic in GSO - ptp_clockmatrix: fix inverted logic in is_single_shot() - netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flag - dt-bindings: net: adin: fix adi,phy-output-clock description syntax - wifi: iwlwifi: pcie: rename CAUSE macro, avoid MIPS build warning Previous releases - regressions: - Revert "net: af_key: add check for pfkey_broadcast in function pfkey_process" - tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd - nf_tables: disallow non-stateful expression in sets earlier - nft_limit: clone packet limits' cost value - nf_tables: double hook unregistration in netns path - ping6: fix ping -6 with interface name Previous releases - always broken: - sched: fix memory barriers to prevent skbs from getting stuck in lockless qdiscs - neigh: set lower cap for neigh_managed_work rearming, avoid constantly scheduling the probe work - bpf: fix probe read error on big endian in ___bpf_prog_run() - amt: memory leak and error handling fixes Misc: - ipv6: expand & rename accept_unsolicited_na to accept_untracked_na" * tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (80 commits) net/af_packet: make sure to pull mac header net: add debug info to __skb_pull() net: CONFIG_DEBUG_NET depends on CONFIG_NET stmmac: intel: Add RPL-P PCI ID net: stmmac: use dev_err_probe() for reporting mdio bus registration failure tipc: check attribute length for bearer name ice: fix access-beyond-end in the switch code nfp: remove padding in nfp_nfdk_tx_desc ax25: Fix ax25 session cleanup problems net: usb: qmi_wwan: Add support for Cinterion MV31 with new baseline sfc/siena: fix wrong tx channel offset with efx_separate_tx_channels sfc/siena: fix considering that all channels have TX queues socket: Don't use u8 type in uapi socket.h net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6() net: ping6: Fix ping -6 with interface name macsec: fix UAF bug for real_dev octeontx2-af: fix error code in is_valid_offset() wifi: mac80211: fix use-after-free in chanctx code bonding: guard ns_targets by CONFIG_IPV6 tcp: tcp_rtx_synack() can be called from process context ...
2022-05-28bpf: Fix probe read error in ___bpf_prog_run()Menglong Dong
I think there is something wrong with BPF_PROBE_MEM in ___bpf_prog_run() in big-endian machine. Let's make a test and see what will happen if we want to load a 'u16' with BPF_PROBE_MEM. Let's make the src value '0x0001', the value of dest register will become 0x0001000000000000, as the value will be loaded to the first 2 byte of DST with following code: bpf_probe_read_kernel(&DST, SIZE, (const void *)(long) (SRC + insn->off)); Obviously, the value in DST is not correct. In fact, we can compare BPF_PROBE_MEM with LDX_MEM_H: DST = *(SIZE *)(unsigned long) (SRC + insn->off); If the memory load is done by LDX_MEM_H, the value in DST will be 0x1 now. And I think this error results in the test case 'test_bpf_sk_storage_map' failing: test_bpf_sk_storage_map:PASS:bpf_iter_bpf_sk_storage_map__open_and_load 0 nsec test_bpf_sk_storage_map:PASS:socket 0 nsec test_bpf_sk_storage_map:PASS:map_update 0 nsec test_bpf_sk_storage_map:PASS:socket 0 nsec test_bpf_sk_storage_map:PASS:map_update 0 nsec test_bpf_sk_storage_map:PASS:socket 0 nsec test_bpf_sk_storage_map:PASS:map_update 0 nsec test_bpf_sk_storage_map:PASS:attach_iter 0 nsec test_bpf_sk_storage_map:PASS:create_iter 0 nsec test_bpf_sk_storage_map:PASS:read 0 nsec test_bpf_sk_storage_map:FAIL:ipv6_sk_count got 0 expected 3 $10/26 bpf_iter/bpf_sk_storage_map:FAIL The code of the test case is simply, it will load sk->sk_family to the register with BPF_PROBE_MEM and check if it is AF_INET6. With this patch, now the test case 'bpf_iter' can pass: $10 bpf_iter:OK Fixes: 2a02759ef5f8 ("bpf: Add support for BTF pointers to interpreter") Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/bpf/20220524021228.533216-1-imagedong@tencent.com
2022-05-26Merge tag 'mm-stable-2022-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Almost all of MM here. A few things are still getting finished off, reviewed, etc. - Yang Shi has improved the behaviour of khugepaged collapsing of readonly file-backed transparent hugepages. - Johannes Weiner has arranged for zswap memory use to be tracked and managed on a per-cgroup basis. - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime enablement of the recent huge page vmemmap optimization feature. - Baolin Wang contributes a series to fix some issues around hugetlb pagetable invalidation. - Zhenwei Pi has fixed some interactions between hwpoisoned pages and virtualization. - Tong Tiangen has enabled the use of the presently x86-only page_table_check debugging feature on arm64 and riscv. - David Vernet has done some fixup work on the memcg selftests. - Peter Xu has taught userfaultfd to handle write protection faults against shmem- and hugetlbfs-backed files. - More DAMON development from SeongJae Park - adding online tuning of the feature and support for monitoring of fixed virtual address ranges. Also easier discovery of which monitoring operations are available. - Nadav Amit has done some optimization of TLB flushing during mprotect(). - Neil Brown continues to labor away at improving our swap-over-NFS support. - David Hildenbrand has some fixes to anon page COWing versus get_user_pages(). - Peng Liu fixed some errors in the core hugetlb code. - Joao Martins has reduced the amount of memory consumed by device-dax's compound devmaps. - Some cleanups of the arch-specific pagemap code from Anshuman Khandual. - Muchun Song has found and fixed some errors in the TLB flushing of transparent hugepages. - Roman Gushchin has done more work on the memcg selftests. ... and, of course, many smaller fixes and cleanups. Notably, the customary million cleanup serieses from Miaohe Lin" * tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits) mm: kfence: use PAGE_ALIGNED helper selftests: vm: add the "settings" file with timeout variable selftests: vm: add "test_hmm.sh" to TEST_FILES selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests selftests: vm: add migration to the .gitignore selftests/vm/pkeys: fix typo in comment ksm: fix typo in comment selftests: vm: add process_mrelease tests Revert "mm/vmscan: never demote for memcg reclaim" mm/kfence: print disabling or re-enabling message include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace" include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion" mm: fix a potential infinite loop in start_isolate_page_range() MAINTAINERS: add Muchun as co-maintainer for HugeTLB zram: fix Kconfig dependency warning mm/shmem: fix shmem folio swapoff hang cgroup: fix an error handling path in alloc_pagecache_max_30M() mm: damon: use HPAGE_PMD_SIZE tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate nodemask.h: fix compilation error with GCC12 ...
2022-05-25Merge tag 'net-next-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Support TCPv6 segmentation offload with super-segments larger than 64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP). - Generalize skb freeing deferral to per-cpu lists, instead of per-socket lists. - Add a netdev statistic for packets dropped due to L2 address mismatch (rx_otherhost_dropped). - Continue work annotating skb drop reasons. - Accept alternative netdev names (ALT_IFNAME) in more netlink requests. - Add VLAN support for AF_PACKET SOCK_RAW GSO. - Allow receiving skb mark from the socket as a cmsg. - Enable memcg accounting for veth queues, sysctl tables and IPv6. BPF --- - Add libbpf support for User Statically-Defined Tracing (USDTs). - Speed up symbol resolution for kprobes multi-link attachments. - Support storing typed pointers to referenced and unreferenced objects in BPF maps. - Add support for BPF link iterator. - Introduce access to remote CPU map elements in BPF per-cpu map. - Allow middle-of-the-road settings for the kernel.unprivileged_bpf_disabled sysctl. - Implement basic types of dynamic pointers e.g. to allow for dynamically sized ringbuf reservations without extra memory copies. Protocols --------- - Retire port only listening_hash table, add a second bind table hashed by port and address. Avoid linear list walk when binding to very popular ports (e.g. 443). - Add bridge FDB bulk flush filtering support allowing user space to remove all FDB entries matching a condition. - Introduce accept_unsolicited_na sysctl for IPv6 to implement router-side changes for RFC9131. - Support for MPTCP path manager in user space. - Add MPTCP support for fallback to regular TCP for connections that have never connected additional subflows or transmitted out-of-sequence data (partial support for RFC8684 fallback). - Avoid races in MPTCP-level window tracking, stabilize and improve throughput. - Support lockless operation of GRE tunnels with seq numbers enabled. - WiFi support for host based BSS color collision detection. - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets. - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2). - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile). - Allow matching on the number of VLAN tags via tc-flower. - Add tracepoint for tcp_set_ca_state(). Driver API ---------- - Improve error reporting from classifier and action offload. - Add support for listing line cards in switches (devlink). - Add helpers for reporting page pool statistics with ethtool -S. - Add support for reading clock cycles when using PTP virtual clocks, instead of having the driver convert to time before reporting. This makes it possible to report time from different vclocks. - Support configuring low-latency Tx descriptor push via ethtool. - Separate Clause 22 and Clause 45 MDIO accesses more explicitly. New hardware / drivers ---------------------- - Ethernet: - Marvell's Octeon NIC PCI Endpoint support (octeon_ep) - Sunplus SP7021 SoC (sp7021_emac) - Add support for Renesas RZ/V2M (in ravb) - Add support for MediaTek mt7986 switches (in mtk_eth_soc) - Ethernet PHYs: - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting) - TI DP83TD510 PHY - Microchip LAN8742/LAN88xx PHYs - WiFi: - Driver for pureLiFi X, XL, XC devices (plfxlc) - Driver for Silicon Labs devices (wfx) - Support for WCN6750 (in ath11k) - Support Realtek 8852ce devices (in rtw89) - Mobile: - MediaTek T700 modems (Intel 5G 5000 M.2 cards) - CAN: - ctucanfd: add support for CTU CAN FD open-source IP core from Czech Technical University in Prague Drivers ------- - Delete a number of old drivers still using virt_to_bus(). - Ethernet NICs: - intel: support TSO on tunnels MPLS - broadcom: support multi-buffer XDP - nfp: support VF rate limiting - sfc: use hardware tx timestamps for more than PTP - mlx5: multi-port eswitch support - hyper-v: add support for XDP_REDIRECT - atlantic: XDP support (including multi-buffer) - macb: improve real-time perf by deferring Tx processing to NAPI - High-speed Ethernet switches: - mlxsw: implement basic line card information querying - prestera: add support for traffic policing on ingress and egress - Embedded Ethernet switches: - lan966x: add support for packet DMA (FDMA) - lan966x: add support for PTP programmable pins - ti: cpsw_new: enable bc/mc storm prevention - Qualcomm 802.11ax WiFi (ath11k): - Wake-on-WLAN support for QCA6390 and WCN6855 - device recovery (firmware restart) support - support setting Specific Absorption Rate (SAR) for WCN6855 - read country code from SMBIOS for WCN6855/QCA6390 - enable keep-alive during WoWLAN suspend - implement remain-on-channel support - MediaTek WiFi (mt76): - support Wireless Ethernet Dispatch offloading packet movement between the Ethernet switch and WiFi interfaces - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 IPv6 NS offload support - Ethernet PHYs: - micrel: ksz9031/ksz9131: cabletest support - lan87xx: SQI support for T1 PHYs - lan937x: add interrupt support for link detection" * tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits) ptp: ocp: Add firmware header checks ptp: ocp: fix PPS source selector debugfs reporting ptp: ocp: add .init function for sma_op vector ptp: ocp: vectorize the sma accessor functions ptp: ocp: constify selectors ptp: ocp: parameterize input/output sma selectors ptp: ocp: revise firmware display ptp: ocp: add Celestica timecard PCI ids ptp: ocp: Remove #ifdefs around PCI IDs ptp: ocp: 32-bit fixups for pci start address Revert "net/smc: fix listen processing for SMC-Rv2" ath6kl: Use cc-disable-warning to disable -Wdangling-pointer selftests/bpf: Dynptr tests bpf: Add dynptr data slices bpf: Add bpf_dynptr_read and bpf_dynptr_write bpf: Dynptr support for ring buffers bpf: Add bpf_dynptr_from_mem for local dynptrs bpf: Add verifier support for dynptrs bpf: Suppress 'passing zero to PTR_ERR' warning bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack ...
2022-05-23bpf: Add dynptr data slicesJoanne Koong
This patch adds a new helper function void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len); which returns a pointer to the underlying data of a dynptr. *len* must be a statically known value. The bpf program may access the returned data slice as a normal buffer (eg can do direct reads and writes), since the verifier associates the length with the returned pointer, and enforces that no out of bounds accesses occur. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220523210712.3641569-6-joannelkoong@gmail.com
2022-05-23bpf: Add bpf_dynptr_read and bpf_dynptr_writeJoanne Koong
This patch adds two helper functions, bpf_dynptr_read and bpf_dynptr_write: long bpf_dynptr_read(void *dst, u32 len, struct bpf_dynptr *src, u32 offset); long bpf_dynptr_write(struct bpf_dynptr *dst, u32 offset, void *src, u32 len); The dynptr passed into these functions must be valid dynptrs that have been initialized. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220523210712.3641569-5-joannelkoong@gmail.com
2022-05-23bpf: Dynptr support for ring buffersJoanne Koong
Currently, our only way of writing dynamically-sized data into a ring buffer is through bpf_ringbuf_output but this incurs an extra memcpy cost. bpf_ringbuf_reserve + bpf_ringbuf_commit avoids this extra memcpy, but it can only safely support reservation sizes that are statically known since the verifier cannot guarantee that the bpf program won’t access memory outside the reserved space. The bpf_dynptr abstraction allows for dynamically-sized ring buffer reservations without the extra memcpy. There are 3 new APIs: long bpf_ringbuf_reserve_dynptr(void *ringbuf, u32 size, u64 flags, struct bpf_dynptr *ptr); void bpf_ringbuf_submit_dynptr(struct bpf_dynptr *ptr, u64 flags); void bpf_ringbuf_discard_dynptr(struct bpf_dynptr *ptr, u64 flags); These closely follow the functionalities of the original ringbuf APIs. For example, all ringbuffer dynptrs that have been reserved must be either submitted or discarded before the program exits. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20220523210712.3641569-4-joannelkoong@gmail.com
2022-05-23bpf: Add bpf_dynptr_from_mem for local dynptrsJoanne Koong
This patch adds a new api bpf_dynptr_from_mem: long bpf_dynptr_from_mem(void *data, u32 size, u64 flags, struct bpf_dynptr *ptr); which initializes a dynptr to point to a bpf program's local memory. For now only local memory that is of reg type PTR_TO_MAP_VALUE is supported. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220523210712.3641569-3-joannelkoong@gmail.com
2022-05-23bpf: Add verifier support for dynptrsJoanne Koong
This patch adds the bulk of the verifier work for supporting dynamic pointers (dynptrs) in bpf. A bpf_dynptr is opaque to the bpf program. It is a 16-byte structure defined internally as: struct bpf_dynptr_kern { void *data; u32 size; u32 offset; } __aligned(8); The upper 8 bits of *size* is reserved (it contains extra metadata about read-only status and dynptr type). Consequently, a dynptr only supports memory less than 16 MB. There are different types of dynptrs (eg malloc, ringbuf, ...). In this patchset, the most basic one, dynptrs to a bpf program's local memory, is added. For now only local memory that is of reg type PTR_TO_MAP_VALUE is supported. In the verifier, dynptr state information will be tracked in stack slots. When the program passes in an uninitialized dynptr (ARG_PTR_TO_DYNPTR | MEM_UNINIT), the stack slots corresponding to the frame pointer where the dynptr resides at are marked STACK_DYNPTR. For helper functions that take in initialized dynptrs (eg bpf_dynptr_read + bpf_dynptr_write which are added later in this patchset), the verifier enforces that the dynptr has been initialized properly by checking that their corresponding stack slots have been marked as STACK_DYNPTR. The 6th patch in this patchset adds test cases that the verifier should successfully reject, such as for example attempting to use a dynptr after doing a direct write into it inside the bpf program. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20220523210712.3641569-2-joannelkoong@gmail.com
2022-05-23bpf: Suppress 'passing zero to PTR_ERR' warningKumar Kartikeya Dwivedi
Kernel Test Robot complains about passing zero to PTR_ERR for the said line, suppress it by using PTR_ERR_OR_ZERO. Fixes: c0a5a21c25f3 ("bpf: Allow storing referenced kptr in map") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220521132620.1976921-1-memxor@gmail.com
2022-05-23bpf: Introduce bpf_arch_text_invalidate for bpf_prog_packSong Liu
Introduce bpf_arch_text_invalidate and use it to fill unused part of the bpf_prog_pack with illegal instructions when a BPF program is freed. Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator") Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220520235758.1858153-4-song@kernel.org
2022-05-23bpf: Fill new bpf_prog_pack with illegal instructionsSong Liu
bpf_prog_pack enables sharing huge pages among multiple BPF programs. These pages are marked as executable before the JIT engine fill it with BPF programs. To make these pages safe, fill the hole bpf_prog_pack with illegal instructions before making it executable. Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator") Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220520235758.1858153-2-song@kernel.org
2022-05-20bpf: refine kernel.unprivileged_bpf_disabled behaviourAlan Maguire
With unprivileged BPF disabled, all cmds associated with the BPF syscall are blocked to users without CAP_BPF/CAP_SYS_ADMIN. However there are use cases where we may wish to allow interactions with BPF programs without being able to load and attach them. So for example, a process with required capabilities loads/attaches a BPF program, and a process with less capabilities interacts with it; retrieving perf/ring buffer events, modifying map-specified config etc. With all BPF syscall commands blocked as a result of unprivileged BPF being disabled, this mode of interaction becomes impossible for processes without CAP_BPF. As Alexei notes "The bpf ACL model is the same as traditional file's ACL. The creds and ACLs are checked at open(). Then during file's write/read additional checks might be performed. BPF has such functionality already. Different map_creates have capability checks while map_lookup has: map_get_sys_perms(map, f) & FMODE_CAN_READ. In other words it's enough to gate FD-receiving parts of bpf with unprivileged_bpf_disabled sysctl. The rest is handled by availability of FD and access to files in bpffs." So key fd creation syscall commands BPF_PROG_LOAD and BPF_MAP_CREATE are blocked with unprivileged BPF disabled and no CAP_BPF. And as Alexei notes, map creation with unprivileged BPF disabled off blocks creation of maps aside from array, hash and ringbuf maps. Programs responsible for loading and attaching the BPF program can still control access to its pinned representation by restricting permissions on the pin path, as with normal files. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/1652970334-30510-2-git-send-email-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-05-20bpf: Allow kfunc in tracing and syscall programs.Benjamin Tissoires
Tracing and syscall BPF program types are very convenient to add BPF capabilities to subsystem otherwise not BPF capable. When we add kfuncs capabilities to those program types, we can add BPF features to subsystems without having to touch BPF core. Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220518205924.399291-2-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-05-20bpf: Add bpf_skc_to_mptcp_sock_protoGeliang Tang
This patch implements a new struct bpf_func_proto, named bpf_skc_to_mptcp_sock_proto. Define a new bpf_id BTF_SOCK_TYPE_MPTCP, and a new helper bpf_skc_to_mptcp_sock(), which invokes another new helper bpf_mptcp_sock_from_subflow() in net/mptcp/bpf.c to get struct mptcp_sock from a given subflow socket. v2: Emit BTF type, add func_id checks in verifier.c and bpf_trace.c, remove build check for CONFIG_BPF_JIT v5: Drop EXPORT_SYMBOL (Martin) Co-developed-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220519233016.105670-2-mathew.j.martineau@linux.intel.com
2022-05-13bpf: Add MEM_UNINIT as a bpf_type_flagJoanne Koong
Instead of having uninitialized versions of arguments as separate bpf_arg_types (eg ARG_PTR_TO_UNINIT_MEM as the uninitialized version of ARG_PTR_TO_MEM), we can instead use MEM_UNINIT as a bpf_type_flag modifier to denote that the argument is uninitialized. Doing so cleans up some of the logic in the verifier. We no longer need to do two checks against an argument type (eg "if (base_type(arg_type) == ARG_PTR_TO_MEM || base_type(arg_type) == ARG_PTR_TO_UNINIT_MEM)"), since uninitialized and initialized versions of the same argument type will now share the same base type. In the near future, MEM_UNINIT will be used by dynptr helper functions as well. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20220509224257.3222614-2-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-05-13printk: stop including cache.h from printk.hPeter Collingbourne
An inclusion of cache.h in printk.h was added in 2014 in commit c28aa1f0a847 ("printk/cache: mark printk_once test variable __read_mostly") in order to bring in the definition of __read_mostly. The usage of __read_mostly was later removed in commit 3ec25826ae33 ("printk: Tie printk_once / printk_deferred_once into .data.once for reset") which made the inclusion of cache.h unnecessary, so remove it. We have a small amount of code that depended on the inclusion of cache.h from printk.h; fix that code to include the appropriate header. This fixes a circular inclusion on arm64 (linux/printk.h -> linux/cache.h -> asm/cache.h -> linux/kasan-enabled.h -> linux/static_key.h -> linux/jump_label.h -> linux/bug.h -> asm/bug.h -> linux/printk.h) that would otherwise be introduced by the next patch. Build tested using {allyesconfig,defconfig} x {arm64,x86_64}. Link: https://linux-review.googlesource.com/id/I8fd51f72c9ef1f2d6afd3b2cbc875aa4792c1fba Link: https://lkml.kernel.org/r/20220427195820.1716975-1-pcc@google.com Signed-off-by: Peter Collingbourne <pcc@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13bpf: Fix combination of jit blinding and pointers to bpf subprogs.Alexei Starovoitov
The combination of jit blinding and pointers to bpf subprogs causes: [ 36.989548] BUG: unable to handle page fault for address: 0000000100000001 [ 36.990342] #PF: supervisor instruction fetch in kernel mode [ 36.990968] #PF: error_code(0x0010) - not-present page [ 36.994859] RIP: 0010:0x100000001 [ 36.995209] Code: Unable to access opcode bytes at RIP 0xffffffd7. [ 37.004091] Call Trace: [ 37.004351] <TASK> [ 37.004576] ? bpf_loop+0x4d/0x70 [ 37.004932] ? bpf_prog_3899083f75e4c5de_F+0xe3/0x13b The jit blinding logic didn't recognize that ld_imm64 with an address of bpf subprogram is a special instruction and proceeded to randomize it. By itself it wouldn't have been an issue, but jit_subprogs() logic relies on two step process to JIT all subprogs and then JIT them again when addresses of all subprogs are known. Blinding process in the first JIT phase caused second JIT to miss adjustment of special ld_imm64. Fix this issue by ignoring special ld_imm64 instructions that don't have user controlled constants and shouldn't be blinded. Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220513011025.13344-1-alexei.starovoitov@gmail.com
2022-05-11bpf: Fix potential array overflow in bpf_trampoline_get_progs()Yuntao Wang
The cnt value in the 'cnt >= BPF_MAX_TRAMP_PROGS' check does not include BPF_TRAMP_MODIFY_RETURN bpf programs, so the number of the attached BPF_TRAMP_MODIFY_RETURN bpf programs in a trampoline can exceed BPF_MAX_TRAMP_PROGS. When this happens, the assignment '*progs++ = aux->prog' in bpf_trampoline_get_progs() will cause progs array overflow as the progs field in the bpf_tramp_progs struct can only hold at most BPF_MAX_TRAMP_PROGS bpf programs. Fixes: 88fd9e5352fe ("bpf: Refactor trampoline update code") Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Link: https://lore.kernel.org/r/20220430130803.210624-1-ytcoode@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-05-11bpf: add bpf_map_lookup_percpu_elem for percpu mapFeng Zhou
Add new ebpf helpers bpf_map_lookup_percpu_elem. The implementation method is relatively simple, refer to the implementation method of map_lookup_elem of percpu map, increase the parameters of cpu, and obtain it according to the specified cpu. Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Link: https://lore.kernel.org/r/20220511093854.411-2-zhoufeng.zf@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-05-10bpf, x86: Attach a cookie to fentry/fexit/fmod_ret/lsm.Kui-Feng Lee
Pass a cookie along with BPF_LINK_CREATE requests. Add a bpf_cookie field to struct bpf_tracing_link to attach a cookie. The cookie of a bpf_tracing_link is available by calling bpf_get_attach_cookie when running the BPF program of the attached link. The value of a cookie will be set at bpf_tramp_run_ctx by the trampoline of the link. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220510205923.3206889-4-kuifeng@fb.com
2022-05-10bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stackKui-Feng Lee
BPF trampolines will create a bpf_tramp_run_ctx, a bpf_run_ctx, on stacks and set/reset the current bpf_run_ctx before/after calling a bpf_prog. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220510205923.3206889-3-kuifeng@fb.com
2022-05-10bpf, x86: Generate trampolines from bpf_tramp_linksKui-Feng Lee
Replace struct bpf_tramp_progs with struct bpf_tramp_links to collect struct bpf_tramp_link(s) for a trampoline. struct bpf_tramp_link extends bpf_link to act as a linked list node. arch_prepare_bpf_trampoline() accepts a struct bpf_tramp_links to collects all bpf_tramp_link(s) that a trampoline should call. Change BPF trampoline and bpf_struct_ops to pass bpf_tramp_links instead of bpf_tramp_progs. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220510205923.3206889-2-kuifeng@fb.com
2022-05-10bpf: Add bpf_link iteratorDmitrii Dolgov
Implement bpf_link iterator to traverse links via bpf_seq_file operations. The changeset is mostly shamelessly copied from commit a228a64fc1e4 ("bpf: Add bpf_prog iterator") Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220510155233.9815-2-9erthalion6@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-05-10bpf: Extend batch operations for map-in-map bpf-mapsTakshak Chahande
This patch extends batch operations support for map-in-map map-types: BPF_MAP_TYPE_HASH_OF_MAPS and BPF_MAP_TYPE_ARRAY_OF_MAPS A usecase where outer HASH map holds hundred of VIP entries and its associated reuse-ports per VIP stored in REUSEPORT_SOCKARRAY type inner map, needs to do batch operation for performance gain. This patch leverages the exiting generic functions for most of the batch operations. As map-in-map's value contains the actual reference of the inner map, for BPF_MAP_TYPE_HASH_OF_MAPS type, it needed an extra step to fetch the map_id from the reference value. selftests are added in next patch 2/2. Signed-off-by: Takshak Chahande <ctakshak@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220510082221.2390540-1-ctakshak@fb.com
2022-05-09bpf: Remove unused parameter from find_kfunc_desc_btf()Yuntao Wang
The func_id parameter in find_kfunc_desc_btf() is not used, get rid of it. Fixes: 2357672c54c3 ("bpf: Introduce BPF support for kernel module function calls") Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20220505070114.3522522-1-ytcoode@gmail.com
2022-04-26bpf: Compute map_btf_id during build timeMenglong Dong
For now, the field 'map_btf_id' in 'struct bpf_map_ops' for all map types are computed during vmlinux-btf init: btf_parse_vmlinux() -> btf_vmlinux_map_ids_init() It will lookup the btf_type according to the 'map_btf_name' field in 'struct bpf_map_ops'. This process can be done during build time, thanks to Jiri's resolve_btfids. selftest of map_ptr has passed: $96 map_ptr:OK Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-04-25bpf: Make BTF type match stricter for release argumentsKumar Kartikeya Dwivedi
The current of behavior of btf_struct_ids_match for release arguments is that when type match fails, it retries with first member type again (recursively). Since the offset is already 0, this is akin to just casting the pointer in normal C, since if type matches it was just embedded inside parent sturct as an object. However, we want to reject cases for release function type matching, be it kfunc or BPF helpers. An example is the following: struct foo { struct bar b; }; struct foo *v = acq_foo(); rel_bar(&v->b); // btf_struct_ids_match fails btf_types_are_same, then // retries with first member type and succeeds, while // it should fail. Hence, don't walk the struct and only rely on btf_types_are_same for strict mode. All users of strict mode must be dealing with zero offset anyway, since otherwise they would want the struct to be walked. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-10-memxor@gmail.com
2022-04-25bpf: Teach verifier about kptr_get kfunc helpersKumar Kartikeya Dwivedi
We introduce a new style of kfunc helpers, namely *_kptr_get, where they take pointer to the map value which points to a referenced kernel pointer contained in the map. Since this is referenced, only bpf_kptr_xchg from BPF side and xchg from kernel side is allowed to change the current value, and each pointer that resides in that location would be referenced, and RCU protected (this must be kept in mind while adding kernel types embeddable as reference kptr in BPF maps). This means that if do the load of the pointer value in an RCU read section, and find a live pointer, then as long as we hold RCU read lock, it won't be freed by a parallel xchg + release operation. This allows us to implement a safe refcount increment scheme. Hence, enforce that first argument of all such kfunc is a proper PTR_TO_MAP_VALUE pointing at the right offset to referenced pointer. For the rest of the arguments, they are subjected to typical kfunc argument checks, hence allowing some flexibility in passing more intent into how the reference should be taken. For instance, in case of struct nf_conn, it is not freed until RCU grace period ends, but can still be reused for another tuple once refcount has dropped to zero. Hence, a bpf_ct_kptr_get helper not only needs to call refcount_inc_not_zero, but also do a tuple match after incrementing the reference, and when it fails to match it, put the reference again and return NULL. This can be implemented easily if we allow passing additional parameters to the bpf_ct_kptr_get kfunc, like a struct bpf_sock_tuple * and a tuple__sz pair. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-9-memxor@gmail.com
2022-04-25bpf: Wire up freeing of referenced kptrKumar Kartikeya Dwivedi
A destructor kfunc can be defined as void func(type *), where type may be void or any other pointer type as per convenience. In this patch, we ensure that the type is sane and capture the function pointer into off_desc of ptr_off_tab for the specific pointer offset, with the invariant that the dtor pointer is always set when 'kptr_ref' tag is applied to the pointer's pointee type, which is indicated by the flag BPF_MAP_VALUE_OFF_F_REF. Note that only BTF IDs whose destructor kfunc is registered, thus become the allowed BTF IDs for embedding as referenced kptr. Hence it serves the purpose of finding dtor kfunc BTF ID, as well acting as a check against the whitelist of allowed BTF IDs for this purpose. Finally, wire up the actual freeing of the referenced pointer if any at all available offsets, so that no references are leaked after the BPF map goes away and the BPF program previously moved the ownership a referenced pointer into it. The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem will free any existing referenced kptr. The same case is with LRU map's bpf_lru_push_free/htab_lru_push_free functions, which are extended to reset unreferenced and free referenced kptr. Note that unlike BPF timers, kptr is not reset or freed when map uref drops to zero. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
2022-04-25bpf: Populate pairs of btf_id and destructor kfunc in btfKumar Kartikeya Dwivedi
To support storing referenced PTR_TO_BTF_ID in maps, we require associating a specific BTF ID with a 'destructor' kfunc. This is because we need to release a live referenced pointer at a certain offset in map value from the map destruction path, otherwise we end up leaking resources. Hence, introduce support for passing an array of btf_id, kfunc_btf_id pairs that denote a BTF ID and its associated release function. Then, add an accessor 'btf_find_dtor_kfunc' which can be used to look up the destructor kfunc of a certain BTF ID. If found, we can use it to free the object from the map free path. The registration of these pairs also serve as a whitelist of structures which are allowed as referenced PTR_TO_BTF_ID in a BPF map, because without finding the destructor kfunc, we will bail and return an error. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-7-memxor@gmail.com
2022-04-25bpf: Adapt copy_map_value for multiple offset caseKumar Kartikeya Dwivedi
Since now there might be at most 10 offsets that need handling in copy_map_value, the manual shuffling and special case is no longer going to work. Hence, let's generalise the copy_map_value function by using a sorted array of offsets to skip regions that must be avoided while copying into and out of a map value. When the map is created, we populate the offset array in struct map, Then, copy_map_value uses this sorted offset array is used to memcpy while skipping timer, spin lock, and kptr. The array is allocated as in most cases none of these special fields would be present in map value, hence we can save on space for the common case by not embedding the entire object inside bpf_map struct. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-6-memxor@gmail.com
2022-04-25bpf: Prevent escaping of kptr loaded from mapsKumar Kartikeya Dwivedi
While we can guarantee that even for unreferenced kptr, the object pointer points to being freed etc. can be handled by the verifier's exception handling (normal load patching to PROBE_MEM loads), we still cannot allow the user to pass these pointers to BPF helpers and kfunc, because the same exception handling won't be done for accesses inside the kernel. The same is true if a referenced pointer is loaded using normal load instruction. Since the reference is not guaranteed to be held while the pointer is used, it must be marked as untrusted. Hence introduce a new type flag, PTR_UNTRUSTED, which is used to mark all registers loading unreferenced and referenced kptr from BPF maps, and ensure they can never escape the BPF program and into the kernel by way of calling stable/unstable helpers. In check_ptr_to_btf_access, the !type_may_be_null check to reject type flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER, MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first two are checked inside the function and rejected using a proper error message, but we still want to allow dereference of untrusted case. Also, we make sure to inherit PTR_UNTRUSTED when chain of pointers are walked, so that this flag is never dropped once it has been set on a PTR_TO_BTF_ID (i.e. trusted to untrusted transition can only be in one direction). In convert_ctx_accesses, extend the switch case to consider untrusted PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM conversion for BPF_LDX. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-5-memxor@gmail.com
2022-04-25bpf: Allow storing referenced kptr in mapKumar Kartikeya Dwivedi
Extending the code in previous commits, introduce referenced kptr support, which needs to be tagged using 'kptr_ref' tag instead. Unlike unreferenced kptr, referenced kptr have a lot more restrictions. In addition to the type matching, only a newly introduced bpf_kptr_xchg helper is allowed to modify the map value at that offset. This transfers the referenced pointer being stored into the map, releasing the references state for the program, and returning the old value and creating new reference state for the returned pointer. Similar to unreferenced pointer case, return value for this case will also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer must either be eventually released by calling the corresponding release function, otherwise it must be transferred into another map. It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear the value, and obtain the old value if any. BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future commit will permit using BPF_LDX for such pointers, but attempt at making it safe, since the lifetime of object won't be guaranteed. There are valid reasons to enforce the restriction of permitting only bpf_kptr_xchg to operate on referenced kptr. The pointer value must be consistent in face of concurrent modification, and any prior values contained in the map must also be released before a new one is moved into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg returns the old value, which the verifier would require the user to either free or move into another map, and releases the reference held for the pointer being moved in. In the future, direct BPF_XCHG instruction may also be permitted to work like bpf_kptr_xchg helper. Note that process_kptr_func doesn't have to call check_helper_mem_access, since we already disallow rdonly/wronly flags for map, which is what check_map_access_type checks, and we already ensure the PTR_TO_MAP_VALUE refers to kptr by obtaining its off_desc, so check_map_access is also not required. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-4-memxor@gmail.com
2022-04-25bpf: Tag argument to be released in bpf_func_protoKumar Kartikeya Dwivedi
Add a new type flag for bpf_arg_type that when set tells verifier that for a release function, that argument's register will be the one for which meta.ref_obj_id will be set, and which will then be released using release_reference. To capture the regno, introduce a new field release_regno in bpf_call_arg_meta. This would be required in the next patch, where we may either pass NULL or a refcounted pointer as an argument to the release function bpf_kptr_xchg. Just releasing only when meta.ref_obj_id is set is not enough, as there is a case where the type of argument needed matches, but the ref_obj_id is set to 0. Hence, we must enforce that whenever meta.ref_obj_id is zero, the register that is to be released can only be NULL for a release function. Since we now indicate whether an argument is to be released in bpf_func_proto itself, is_release_function helper has lost its utitlity, hence refactor code to work without it, and just rely on meta.release_regno to know when to release state for a ref_obj_id. Still, the restriction of one release argument and only one ref_obj_id passed to BPF helper or kfunc remains. This may be lifted in the future. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-3-memxor@gmail.com
2022-04-25bpf: Allow storing unreferenced kptr in mapKumar Kartikeya Dwivedi
This commit introduces a new pointer type 'kptr' which can be embedded in a map value to hold a PTR_TO_BTF_ID stored by a BPF program during its invocation. When storing such a kptr, BPF program's PTR_TO_BTF_ID register must have the same type as in the map value's BTF, and loading a kptr marks the destination register as PTR_TO_BTF_ID with the correct kernel BTF and BTF ID. Such kptr are unreferenced, i.e. by the time another invocation of the BPF program loads this pointer, the object which the pointer points to may not longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are patched to PROBE_MEM loads by the verifier, it would safe to allow user to still access such invalid pointer, but passing such pointers into BPF helpers and kfuncs should not be permitted. A future patch in this series will close this gap. The flexibility offered by allowing programs to dereference such invalid pointers while being safe at runtime frees the verifier from doing complex lifetime tracking. As long as the user may ensure that the object remains valid, it can ensure data read by it from the kernel object is valid. The user indicates that a certain pointer must be treated as kptr capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this information is recorded in the object BTF which will be passed into the kernel by way of map's BTF information. The name and kind from the map value BTF is used to look up the in-kernel type, and the actual BTF and BTF ID is recorded in the map struct in a new kptr_off_tab member. For now, only storing pointers to structs is permitted. An example of this specification is shown below: #define __kptr __attribute__((btf_type_tag("kptr"))) struct map_value { ... struct task_struct __kptr *task; ... }; Then, in a BPF program, user may store PTR_TO_BTF_ID with the type task_struct into the map, and then load it later. Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as the verifier cannot know whether the value is NULL or not statically, it must treat all potential loads at that map value offset as loading a possibly NULL pointer. Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL) are allowed instructions that can access such a pointer. On BPF_LDX, the destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX, it is checked whether the source register type is a PTR_TO_BTF_ID with same BTF type as specified in the map BTF. The access size must always be BPF_DW. For the map in map support, the kptr_off_tab for outer map is copied from the inner map's kptr_off_tab. It was chosen to do a deep copy instead of introducing a refcount to kptr_off_tab, because the copy only needs to be done when paramterizing using inner_map_fd in the map in map case, hence would be unnecessary for all other users. It is not permitted to use MAP_FREEZE command and mmap for BPF map having kptrs, similar to the bpf_timer case. A kptr also requires that BPF program has both read and write access to the map (hence both BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG are disallowed). Note that check_map_access must be called from both check_helper_mem_access and for the BPF instructions, hence the kptr check must distinguish between ACCESS_DIRECT and ACCESS_HELPER, and reject ACCESS_HELPER cases. We rename stack_access_src to bpf_access_src and reuse it for this purpose. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-2-memxor@gmail.com
2022-04-25bpf: Use bpf_prog_run_array_cg_flags everywhereStanislav Fomichev
Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and use it everywhere. check_return_code already enforces sane return ranges for all cgroup types. (only egress and bind hooks have uncanonical return ranges, the rest is using [0, 1]) No functional changes. v2: - 'func_ret & 1' under explicit test (Andrii & Martin) Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220425220448.3669032-1-sdf@google.com
2022-04-23bpf: Allow attach TRACING programs through LINK_CREATE commandAndrii Nakryiko
Allow attaching BTF-aware TRACING programs, previously attachable only through BPF_RAW_TRACEPOINT_OPEN command, through LINK_CREATE command: - BTF-aware raw tracepoints (tp_btf in libbpf lingo); - fentry/fexit/fmod_ret programs; - BPF LSM programs. This change converges all bpf_link-based attachments under LINK_CREATE command allowing to further extend the API with features like BPF cookie under "multiplexed" link_create section of bpf_attr. Non-BTF-aware raw tracepoints are left under BPF_RAW_TRACEPOINT_OPEN, but there is nothing preventing opening them up to LINK_CREATE as well. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Kuifeng Lee <kuifeng@fb.com> Link: https://lore.kernel.org/bpf/20220421033945.3602803-2-andrii@kernel.org
2022-04-21bpf: Move check_ptr_off_reg before check_map_accessKumar Kartikeya Dwivedi
Some functions in next patch want to use this function, and those functions will be called by check_map_access, hence move it before check_map_access. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/bpf/20220415160354.1050687-3-memxor@gmail.com
2022-04-21bpf: Make btf_find_field more genericKumar Kartikeya Dwivedi
Next commit introduces field type 'kptr' whose kind will not be struct, but pointer, and it will not be limited to one offset, but multiple ones. Make existing btf_find_struct_field and btf_find_datasec_var functions amenable to use for finding kptrs in map value, by moving spin_lock and timer specific checks into their own function. The alignment, and name are checked before the function is called, so it is the last point where we can skip field or return an error before the next loop iteration happens. Size of the field and type is meant to be checked inside the function. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220415160354.1050687-2-memxor@gmail.com
2022-04-20rcu: Make the TASKS_RCU Kconfig option be selectedPaul E. McKenney
Currently, any kernel built with CONFIG_PREEMPTION=y also gets CONFIG_TASKS_RCU=y, which is not helpful to people trying to build preemptible kernels of minimal size. Because CONFIG_TASKS_RCU=y is needed only in kernels doing tracing of one form or another, this commit moves from TASKS_RCU deciding when it should be enabled to the tracing Kconfig options explicitly selecting it. This allows building preemptible kernels without TASKS_RCU, if desired. This commit also updates the SRCU-N and TREE09 rcutorture scenarios in order to avoid Kconfig errors that would otherwise result from CONFIG_TASKS_RCU being selected without its CONFIG_RCU_EXPERT dependency being met. [ paulmck: Apply BPF_SYSCALL feedback from Andrii Nakryiko. ] Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-19bpf: Fix usage of trace RCU in local storage.KP Singh
bpf_{sk,task,inode}_storage_free() do not need to use call_rcu_tasks_trace as no BPF program should be accessing the owner as it's being destroyed. The only other reader at this point is bpf_local_storage_map_free() which uses normal RCU. The only path that needs trace RCU are: * bpf_local_storage_{delete,update} helpers * map_{delete,update}_elem() syscalls Fixes: 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs") Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220418155158.2865678-1-kpsingh@kernel.org
2022-04-19bpf: Ensure type tags precede modifiers in BTFKumar Kartikeya Dwivedi
It is guaranteed that for modifiers, clang always places type tags before other modifiers, and then the base type. We would like to rely on this guarantee inside the kernel to make it simple to parse type tags from BTF. However, a user would be allowed to construct a BTF without such guarantees. Hence, add a pass to check that in modifier chains, type tags only occur at the head of the chain, and then don't occur later in the chain. If we see a type tag, we can have one or more type tags preceding other modifiers that then never have another type tag. If we see other modifiers, all modifiers following them should never be a type tag. Instead of having to walk chains we verified previously, we can remember the last good modifier type ID which headed a good chain. At that point, we must have verified all other chains headed by type IDs less than it. This makes the verification process less costly, and it becomes a simple O(n) pass. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220419164608.1990559-2-memxor@gmail.com
2022-04-19bpf: Move rcu lock management out of BPF_PROG_RUN routinesStanislav Fomichev
Commit 7d08c2c91171 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions") switched a bunch of BPF_PROG_RUN macros to inline routines. This changed the semantic a bit. Due to arguments expansion of macros, it used to be: rcu_read_lock(); array = rcu_dereference(cgrp->bpf.effective[atype]); ... Now, with with inline routines, we have: array_rcu = rcu_dereference(cgrp->bpf.effective[atype]); /* array_rcu can be kfree'd here */ rcu_read_lock(); array = rcu_dereference(array_rcu); I'm assuming in practice rcu subsystem isn't fast enough to trigger this but let's use rcu API properly. Also, rename to lower caps to not confuse with macros. Additionally, drop and expand BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY. See [1] for more context. [1] https://lore.kernel.org/bpf/CAKH8qBs60fOinFdxiiQikK_q0EcVxGvNTQoWvHLEUGbgcj1UYg@mail.gmail.com/T/#u v2 - keep rcu locks inside by passing cgroup_bpf Fixes: 7d08c2c91171 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220414161233.170780-1-sdf@google.com
2022-04-14bpf: Remove unnecessary type castingsYu Zhe
Remove/clean up unnecessary void * type castings. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220413015048.12319-1-yuzhe@nfschina.com
2022-04-13Merge branch 'pr/bpf-sysctl' into bpf-nextDaniel Borkmann
Pull the migration of the BPF sysctl handling into BPF core. This work is needed in both sysctl-next and bpf-next tree. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/bpf/b615bd44-6bd1-a958-7e3f-dd2ff58931a1@iogearbox.net