OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Eduard Zingerman	0a372c9c08	selftests/bpf: verifier/direct_packet_access converted to inline assembly Test verifier/direct_packet_access automatically converted to use inline assembly. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230421174234.2391278-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 12:18:44 -07:00
Eduard Zingerman	6080280243	selftests/bpf: verifier/d_path converted to inline assembly Test verifier/d_path automatically converted to use inline assembly. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230421174234.2391278-7-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 12:18:16 -07:00
Eduard Zingerman	fcd36964f2	selftests/bpf: verifier/ctx converted to inline assembly Test verifier/ctx automatically converted to use inline assembly. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230421174234.2391278-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 12:18:03 -07:00
Eduard Zingerman	37467c79e1	selftests/bpf: verifier/btf_ctx_access converted to inline assembly Test verifier/btf_ctx_access automatically converted to use inline assembly. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230421174234.2391278-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 12:17:51 -07:00
Eduard Zingerman	965a3f913e	selftests/bpf: verifier/bpf_get_stack converted to inline assembly Test verifier/bpf_get_stack automatically converted to use inline assembly. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230421174234.2391278-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 12:17:39 -07:00
Eduard Zingerman	c92336559a	selftests/bpf: verifier/bounds converted to inline assembly Test verifier/bounds automatically converted to use inline assembly. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230421174234.2391278-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 12:17:14 -07:00
Eduard Zingerman	63bb645b9d	selftests/bpf: Add notion of auxiliary programs for test_loader In order to express test cases that use bpf_tail_call() intrinsic it is necessary to have several programs to be loaded at a time. This commit adds __auxiliary annotation to the set of annotations supported by test_loader.c. Programs marked as auxiliary are always loaded but are not treated as a separate test. For example: void dummy_prog1(void); struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 4); __uint(key_size, sizeof(int)); __array(values, void (void)); } prog_map SEC(".maps") = { .values = { [0] = (void *) &dummy_prog1, }, }; SEC("tc") __auxiliary __naked void dummy_prog1(void) { asm volatile ("r0 = 42; exit;"); } SEC("tc") __description("reference tracking: check reference or tail call") __success __retval(0) __naked void check_reference_or_tail_call(void) { asm volatile ( "r2 = %[prog_map] ll;" "r3 = 0;" "call %[bpf_tail_call];" "r0 = 0;" "exit;" :: __imm(bpf_tail_call), : __clobber_all); } Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230421174234.2391278-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 12:16:56 -07:00
Alexei Starovoitov	d7a799ec78	Merge branch 'bpf: add netfilter program type' Florian Westphal says: ==================== Changes since last version: - rework test case in last patch wrt. ctx->skb dereference etc (Alexei) - pacify bpf ci tests, netfilter program type missed string translation in libbpf helper. This still uses runtime btf walk rather than extending the btf trace array as Alexei suggested, I would do this later (or someone else can). v1 cover letter: Add minimal support to hook bpf programs to netfilter hooks, e.g. PREROUTING or FORWARD. For this the most relevant parts for registering a netfilter hook via the in-kernel api are exposed to userspace via bpf_link. The new program type is 'tracing style', i.e. there is no context access rewrite done by verifier, the function argument (struct bpf_nf_ctx) isn't stable. There is no support for direct packet access, dynptr api should be used instead. With this its possible to build a small test program such as: #include "vmlinux.h" extern int bpf_dynptr_from_skb(struct __sk_buff skb, __u64 flags, struct bpf_dynptr ptr__uninit) __ksym; extern void bpf_dynptr_slice(const struct bpf_dynptr ptr, uint32_t offset, void buffer, uint32_t buffer__sz) __ksym; SEC("netfilter") int nf_test(struct bpf_nf_ctx ctx) { struct nf_hook_state state = ctx->state; struct sk_buff skb = ctx->skb; const struct iphdr iph, _iph; const struct tcphdr th, _th; struct bpf_dynptr ptr; if (bpf_dynptr_from_skb(skb, 0, &ptr)) return NF_DROP; iph = bpf_dynptr_slice(&ptr, 0, &_iph, sizeof(_iph)); if (!iph) return NF_DROP; th = bpf_dynptr_slice(&ptr, iph->ihl << 2, &_th, sizeof(_th)); if (!th) return NF_DROP; bpf_printk("accept %x:%d->%x:%d, hook %d ifin %d\n", iph->saddr, bpf_ntohs(th->source), iph->daddr, bpf_ntohs(th->dest), state->hook, state->in->ifindex); return NF_ACCEPT; } Then, tail /sys/kernel/tracing/trace_pipe. Changes since v3: - uapi: remove 'reserved' struct member, s/prio/priority (Alexei) - add ctx access test cases (Alexei, see last patch) - some arm32 can only handle cmpxchg on u32 (build bot) - Fix kdoc annotations (Simon Horman) - bpftool: prefer p_err, not fprintf (Quentin) - add test cases in separate patch Changes since v2: 1. don't WARN when user calls 'bpftool loink detach' twice restrict attachment to ip+ip6 families, lets relax this later in case arp/bridge/netdev are needed too. 2. show netfilter links in 'bpftool net' output as well. Changes since v1: 1. Don't fail to link when CONFIG_NETFILTER=n (build bot) 2. Use test_progs instead of test_verifier (Alexei) Changes since last RFC version: 1. extend 'bpftool link show' to print prio/hooknum etc 2. extend 'nft list hooks' so it can print the bpf program id 3. Add an extra patch to artificially restrict bpf progs with same priority. Its fine from a technical pov but it will cause ordering issues (most recent one comes first). Can be removed later. 4. Add test_run support for netfilter prog type and a small extension to verifier tests to make sure we can't return verdicts like NF_STOLEN. 5. Alter the netfilter part of the bpf_link uapi struct: - add flags/reserved members. Not used here except returning errors when they are nonzero. Plan is to allow the bpf_link users to enable netfilter defrag or conntrack engine by setting feature flags at link create time in the future. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:37:41 -07:00
Florian Westphal	006c0e44ed	selftests/bpf: add missing netfilter return value and ctx access tests Extend prog_tests with two test cases: # ./test_progs --allow=verifier_netfilter_retcode #278/1 verifier_netfilter_retcode/bpf_exit with invalid return code. test1:OK #278/2 verifier_netfilter_retcode/bpf_exit with valid return code. test2:OK #278/3 verifier_netfilter_retcode/bpf_exit with valid return code. test3:OK #278/4 verifier_netfilter_retcode/bpf_exit with invalid return code. test4:OK #278 verifier_netfilter_retcode:OK This checks that only accept and drop (0,1) are permitted. NF_QUEUE could be implemented later if we can guarantee that attachment of such programs can be rejected if they get attached to a pf/hook that doesn't support async reinjection. NF_STOLEN could be implemented via trusted helpers that can guarantee that the skb will eventually be free'd. v4: test case for bpf_nf_ctx access checks, requested by Alexei Starovoitov. v5: also check ctx->{state,skb} can be dereferenced (Alexei). # ./test_progs --allow=verifier_netfilter_ctx #281/1 verifier_netfilter_ctx/netfilter invalid context access, size too short:OK #281/2 verifier_netfilter_ctx/netfilter invalid context access, size too short:OK #281/3 verifier_netfilter_ctx/netfilter invalid context access, past end of ctx:OK #281/4 verifier_netfilter_ctx/netfilter invalid context, write:OK #281/5 verifier_netfilter_ctx/netfilter valid context read and invalid write:OK #281/6 verifier_netfilter_ctx/netfilter test prog with skb and state read access:OK #281/7 verifier_netfilter_ctx/netfilter test prog with skb and state read access @unpriv:OK #281 verifier_netfilter_ctx:OK Summary: 1/7 PASSED, 0 SKIPPED, 0 FAILED This checks: 1/2: partial reads of ctx->{skb,state} are rejected 3. read access past sizeof(ctx) is rejected 4. write to ctx content, e.g. 'ctx->skb = NULL;' is rejected 5. ctx->state content cannot be altered 6. ctx->state and ctx->skb can be dereferenced 7. ... same program fails for unpriv (CAP_NET_ADMIN needed). Link: https://lore.kernel.org/bpf/20230419021152.sjq4gttphzzy6b5f@dhcp-172-26-102-232.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20230420201655.77kkgi3dh7fesoll@MacBook-Pro-6.local/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-8-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:34:50 -07:00
Florian Westphal	2b99ef22e0	bpf: add test_run support for netfilter program type add glue code so a bpf program can be run using userspace-provided netfilter state and packet/skb. Default is to use ipv4:output hook point, but this can be overridden by userspace. Userspace provided netfilter state is restricted, only hook and protocol families can be overridden and only to ipv4/ipv6. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-7-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:34:50 -07:00
Florian Westphal	d0fe92fb5e	tools: bpftool: print netfilter link info Dump protocol family, hook and priority value: $ bpftool link 2: netfilter prog 14 ip input prio -128 pids install(3264) 5: netfilter prog 14 ip6 forward prio 21 pids a.out(3387) 9: netfilter prog 14 ip prerouting prio 123 pids a.out(5700) 10: netfilter prog 14 ip input prio 21 pids test2(5701) v2: Quentin Monnet suggested to also add 'bpftool net' support: $ bpftool net xdp: tc: flow_dissector: netfilter: ip prerouting prio 21 prog_id 14 ip input prio -128 prog_id 14 ip input prio 21 prog_id 14 ip forward prio 21 prog_id 14 ip output prio 21 prog_id 14 ip postrouting prio 21 prog_id 14 'bpftool net' only dumps netfilter link type, links are sorted by protocol family, hook and priority. v5: fix bpf ci failure: libbpf needs small update to prog_type_name[] and probe_prog_load helper. v4: don't fail with -EOPNOTSUPP in libbpf probe_prog_load, update prog_type_name[] with "netfilter" entry (bpf ci) v3: fix bpf.h copy, 'reserved' member was removed (Alexei) use p_err, not fprintf (Quentin) Suggested-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/eeeaac99-9053-90c2-aa33-cc1ecb1ae9ca@isovalent.com/ Reviewed-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-6-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:34:49 -07:00
Florian Westphal	0bdc6da88f	netfilter: disallow bpf hook attachment at same priority This is just to avoid ordering issues between multiple bpf programs, this could be removed later in case it turns out to be too cautious. bpf prog could still be shared with non-bpf hook, otherwise we'd have to make conntrack hook registration fail just because a bpf program has same priority. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-5-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:34:14 -07:00
Florian Westphal	506a74db7e	netfilter: nfnetlink hook: dump bpf prog id This allows userspace ("nft list hooks") to show which bpf program is attached to which hook. Without this, user only knows bpf prog is attached at prio x, y, z at INPUT and FORWARD, but can't tell which program is where. v4: kdoc fixups (Simon Horman) Link: https://lore.kernel.org/bpf/ZEELzpNCnYJuZyod@corigine.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-4-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:34:14 -07:00
Florian Westphal	fd9c663b9a	bpf: minimal support for programs hooked into netfilter framework This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs that will be invoked via the NF_HOOK() points in the ip stack. Invocation incurs an indirect call. This is not a necessity: Its possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the program invocation with the same method already done for xdp progs. This isn't done here to keep the size of this chunk down. Verifier restricts verdicts to either DROP or ACCEPT. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:34:14 -07:00
Florian Westphal	84601d6ee6	bpf: add bpf_link support for BPF_NETFILTER programs Add bpf_link support skeleton. To keep this reviewable, no bpf program can be invoked yet, if a program is attached only a c-stub is called and not the actual bpf program. Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig. Uapi example usage: union bpf_attr attr = { }; attr.link_create.prog_fd = progfd; attr.link_create.attach_type = 0; /* unused */ attr.link_create.netfilter.pf = PF_INET; attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN; attr.link_create.netfilter.priority = -128; err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr)); ... this would attach progfd to ipv4:input hook. Such hook gets removed automatically if the calling program exits. BPF_NETFILTER program invocation is added in followup change. NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it allows to tell userspace which program is attached at the given hook when user runs 'nft hook list' command rather than just the priority and not-very-helpful 'this hook runs a bpf prog but I can't tell which one'. Will also be used to disallow registration of two bpf programs with same priority in a followup patch. v4: arm32 cmpxchg only supports 32bit operand s/prio/priority/ v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if more use cases pop up (arptables, ebtables, netdev ingress/egress etc). Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:34:14 -07:00
Kui-Feng Lee	45cea721ea	bpftool: Update doc to explain struct_ops register subcommand. The "struct_ops register" subcommand now allows for an optional LINK_DIR to be included. This specifies the directory path where bpftool will pin struct_ops links with the same name as their corresponding map names. Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/r/20230420002822.345222-2-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:10:10 -07:00
Kui-Feng Lee	0232b78897	bpftool: Register struct_ops with a link. You can include an optional path after specifying the object name for the 'struct_ops register' subcommand. Since the commit `226bc6ae64` ("Merge branch 'Transit between BPF TCP congestion controls.'") has been accepted, it is now possible to create a link for a struct_ops. This can be done by defining a struct_ops in SEC(".struct_ops.link") to make libbpf returns a real link. If we don't pin the links before leaving bpftool, they will disappear. To instruct bpftool to pin the links in a directory with the names of the maps, we need to provide the path of that directory. Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/r/20230420002822.345222-1-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-21 11:10:10 -07:00
Stanislav Fomichev	833d67ecdc	selftests/bpf: Verify optval=NULL case Make sure we get optlen exported instead of getting EFAULT. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230418225343.553806-3-sdf@google.com	2023-04-21 17:10:34 +02:00
Stanislav Fomichev	00e74ae086	bpf: Don't EFAULT for getsockopt with optval=NULL Some socket options do getsockopt with optval=NULL to estimate the size of the final buffer (which is returned via optlen). This breaks BPF getsockopt assumptions about permitted optval buffer size. Let's enforce these assumptions only when non-NULL optval is provided. Fixes: `0d01da6afc` ("bpf: implement getsockopt and setsockopt hooks") Reported-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7 Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com	2023-04-21 17:09:53 +02:00
Magnus Karlsson	02e93e0475	selftests/xsk: Put MAP_HUGE_2MB in correct argument Put the flag MAP_HUGE_2MB in the correct flags argument instead of the wrong offset argument. Fixes: `2ddade3229` ("selftests/xsk: Fix munmap for hugepage allocated umem") Reported-by: Kal Cutter Conley <kal.conley@dectris.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230421062208.3772-1-magnus.karlsson@gmail.com	2023-04-21 16:35:10 +02:00
Dave Marchevsky	4ab07209d5	bpf: Fix bpf_refcount_acquire's refcount_t address calculation When calculating the address of the refcount_t struct within a local kptr, bpf_refcount_acquire_impl should add refcount_off bytes to the address of the local kptr. Due to some missing parens, the function is incorrectly adding sizeof(refcount_t) * refcount_off bytes. This patch fixes the calculation. Due to the incorrect calculation, bpf_refcount_acquire_impl was trying to refcount_inc some memory well past the end of local kptrs, resulting in kasan and refcount complaints, as reported in [0]. In that thread, Florian and Eduard discovered that bpf selftests written in the new style - with __success and an expected __retval, specifically - were not actually being run. As a result, selftests added in bpf_refcount series weren't really exercising this behavior, and thus didn't unearth the bug. With this fixed behavior it's safe to revert commit `7c4b96c000` ("selftests/bpf: disable program test run for progs/refcounted_kptr.c"), this patch does so. [0] https://lore.kernel.org/bpf/ZEEp+j22imoN6rn9@strlen.de/ Fixes: `7c50b1cb76` ("bpf: Add bpf_refcount_acquire kfunc") Reported-by: Florian Westphal <fw@strlen.de> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20230421074431.3548349-1-davemarchevsky@fb.com	2023-04-21 16:31:37 +02:00
Alexei Starovoitov	acf1c3d68e	bpf: Fix race between btf_put and btf_idr walk. Florian and Eduard reported hard dead lock: [ 58.433327] _raw_spin_lock_irqsave+0x40/0x50 [ 58.433334] btf_put+0x43/0x90 [ 58.433338] bpf_find_btf_id+0x157/0x240 [ 58.433353] btf_parse_fields+0x921/0x11c0 This happens since btf->refcount can be 1 at the time of btf_put() and btf_put() will call btf_free_id() which will try to grab btf_idr_lock and will dead lock. Avoid the issue by doing btf_put() without locking. Fixes: `3d78417b60` ("bpf: Add bpf_btf_find_by_name_kind() helper.") Fixes: `1e89106da2` ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().") Reported-by: Florian Westphal <fw@strlen.de> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com	2023-04-21 16:27:56 +02:00
Alexei Starovoitov	267a6e4e78	Merge branch 'fix __retval() being always ignored' Eduard Zingerman says: ==================== Florian Westphal found a bug in test_loader.c processing of __retval tag. Because of this bug the function test_loader.c:do_prog_test_run() never executed and all __retval test tags were ignored. See [1]. Fix for this bug uncovers two additional bugs: - During test_verifier tests migration to inline assembly (see [2]) I missed the fact that some tests require maps to contain mock values; - Some issue with a new refcounted_kptr test, which causes kernel to produce dead lock and refcount saturation warnings when subject to libbpf's bpf_test_run_opts(). This series fixes the bug in __retval() processing, and address the issue with test maps not being populated. The issue in refcounted_kptr is not addressed, __retval tags in those tests are commented out. I found that the following tests depend on test maps being populated: - progs/verifier_array_access.c - verifier/value_ptr_arith.c (planned for migration to inline assembly) Given the small amount of these tests I decided to opt for simple non-generic solution (see patch #4). [1] https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/ [2] https://lore.kernel.org/bpf/20230325025524.144043-1-eddyz87@gmail.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-20 16:49:17 -07:00
Eduard Zingerman	cbb110bc66	selftests/bpf: populate map_array_ro map for verifier_array_access test Two test cases: - "valid read map access into a read-only array 1" and - "valid read map access into a read-only array 2" Expect that map_array_ro map is filled with mock data. This logic was not taken into acount during initial test conversion. This commit modifies prog_tests/verifier.c entry point for this test to fill the map. Fixes: `a3c830ae02` ("selftests/bpf: verifier/array_access.c converted to inline assembly") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230420232317.2181776-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-20 16:49:16 -07:00
Eduard Zingerman	5b22f4d143	selftests/bpf: add pre bpf_prog_test_run_opts() callback for test_loader When a test case is annotated with __retval tag the test_loader engine would use libbpf's bpf_prog_test_run_opts() to do a test run of the program and compare retvals. This commit allows to perform arbitrary actions on bpf object right before test loader invokes bpf_prog_test_run_opts(). This could be used to setup some state for program execution, e.g. fill some maps. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230420232317.2181776-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-20 16:49:16 -07:00
Eduard Zingerman	7cdddb99e4	selftests/bpf: fix __retval() being always ignored Florian Westphal found a bug in and suggested a fix for test_loader.c processing of __retval tag. Because of this bug the function test_loader.c:do_prog_test_run() never executed and all __retval test tags were ignored. If this bug is fixed a number of test cases from progs/verifier_array_access.c fail with retval not matching the expected value. This test was recently converted to use test_loader.c and inline assembly in [1]. When doing the conversion I missed the important detail of test_verifier.c operation: when it creates fixup_map_array_ro, fixup_map_array_wo and fixup_map_array_small it populates these maps with a dummy record. Disabling the __retval checks for the affected verifier_array_access in this commit to avoid false-postivies in any potential bisects. The issue is addressed in the next patch. I verified that the __retval tags are now respected by changing expected return values for all tests annotated with __retval, and checking that these tests started to fail. [1] https://lore.kernel.org/bpf/20230325025524.144043-1-eddyz87@gmail.com/ Fixes: `19a8e06f5f` ("selftests/bpf: Tests execution support for test_loader.c") Reported-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230420232317.2181776-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-20 16:49:16 -07:00
Eduard Zingerman	7c4b96c000	selftests/bpf: disable program test run for progs/refcounted_kptr.c Florian Westphal found a bug in test_loader.c processing of __retval tag. Because of this bug the function test_loader.c:do_prog_test_run() never executed and all __retval test tags were ignored. This hid an issue with progs/refcounted_kptr.c tests. When __retval tag bug is fixed and refcounted_kptr.c tests are run kernel reports various issues and eventually hangs. Shortest reproducer is the following command run a few times: $ for i in $(seq 1 4); do (./test_progs --allow=refcounted_kptr &); done Commenting out __retval tags for these tests until this issue is resolved. Reported-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230420232317.2181776-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-20 16:49:16 -07:00
Quentin Monnet	4b7ef71ac9	bpftool: Replace "__fallthrough" by a comment to address merge conflict The recent support for inline annotations in control flow graphs generated by bpftool introduced the usage of the "__fallthrough" macro in a switch/case block in btf_dumper.c. This change went through the bpf-next tree, but resulted in a merge conflict in linux-next, because this macro has been renamed "fallthrough" (no underscores) in the meantime. To address the conflict, we temporarily switch to a simple comment instead of a macro. Related: commit `f7a858bffc` ("tools: Rename __fallthrough to fallthrough") Fixes: `9fd496848b` ("bpftool: Support inline annotations when dumping the CFG of a program") Reported-by: Sven Schnelle <svens@linux.ibm.com> Reported-by: Thomas Richter <tmricht@linux.ibm.com> Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/all/yt9dttxlwal7.fsf@linux.ibm.com/ Link: https://lore.kernel.org/bpf/20230412123636.2358949-1-tmricht@linux.ibm.com/ Link: https://lore.kernel.org/bpf/20230420003333.90901-1-quentin@isovalent.com	2023-04-20 16:38:10 -07:00
Alexei Starovoitov	780c69830e	Merge branch 'Access variable length array relaxed for integer type' Feng zhou says: ==================== From: Feng Zhou <zhoufeng.zf@bytedance.com> Add support for integer type of accessing variable length array. Add a selftest to check it. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-19 21:29:39 -07:00
Feng Zhou	5ff54dedf3	selftests/bpf: Add test to access integer type of variable array Add prog test for accessing integer type of variable array in tracing program. In addition, hook load_balance function to access sd->span[0], only to confirm whether the load is successful. Because there is no direct way to trigger load_balance call. Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Link: https://lore.kernel.org/r/20230420032735.27760-3-zhoufeng.zf@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-19 21:29:39 -07:00
Feng Zhou	2569c7b872	bpf: support access variable length array of integer type After this commit: bpf: Support variable length array in tracing programs (`9c5f8a1008`) Trace programs can access variable length array, but for structure type. This patch adds support for integer type. Example: Hook load_balance struct sched_domain { ... unsigned long span[]; } The access: sd->span[0]. Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Link: https://lore.kernel.org/r/20230420032735.27760-2-zhoufeng.zf@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-19 21:29:39 -07:00
Magnus Karlsson	2ddade3229	selftests/xsk: Fix munmap for hugepage allocated umem Fix the unmapping of hugepage allocated umems so that they are properly unmapped. The new test referred to in the fixes label, introduced a test that allocated a umem that is not a multiple of a 2M hugepage size. This is fine for mmap() that rounds the size up the nearest multiple of 2M. But munmap() requires the size to be a multiple of the hugepage size in order for it to unmap the region. The current behaviour of not properly unmapping the umem, was discovered when further additions of tests that require hugepages (unaligned mode tests only) started failing as the system was running out of hugepages. Fixes: `c0801598e5` ("selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230418143617.27762-1-magnus.karlsson@gmail.com	2023-04-19 16:33:53 +02:00
Alexei Starovoitov	276dcdd1a8	Merge branch 'Provide bpf_for() and bpf_for_each() by libbpf' Andrii Nakryiko says: ==================== This patch set moves bpf_for(), bpf_for_each(), and bpf_repeat() macros from selftests-internal bpf_misc.h header to libbpf-provided bpf_helpers.h header. To do this in a way to allow users to feature-detect and guard such bpf_for()/bpf_for_each() uses on old kernels we also extend libbpf to improve unresolved kfunc calls handling and reporting. This lets us mark bpf_iter_num_{new,next,destroy}() declarations as __weak, and thus not fail program loading outright if such kfuncs are missing on the host kernel. Patches #1 and #2 do some simple clean ups and logging improvements. Patch #3 adds kfunc call poisoning and log fixup logic and is the hear of this patch set, effectively. Patch #4 adds selftest for this logic. Patches #4 and #5 move bpf_for()/bpf_for_each()/bpf_repeat() into bpf_helpers.h header and mark kfuncs as __weak to allow users to feature-detect and guard their uses. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-18 12:45:11 -07:00
Andrii Nakryiko	94dccba795	libbpf: mark bpf_iter_num_{new,next,destroy} as __weak Mark bpf_iter_num_{new,next,destroy}() kfuncs declared for bpf_for()/bpf_repeat() macros as __weak to allow users to feature-detect their presence and guard bpf_for()/bpf_repeat() loops accordingly for backwards compatibility with old kernels. Now that libbpf supports kfunc calls poisoning and better reporting of unresolved (but called) kfuncs, declaring number iterator kfuncs in bpf_helpers.h won't degrade user experience and won't cause unnecessary kernel feature dependencies. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-18 12:45:11 -07:00
Andrii Nakryiko	c5e6474167	libbpf: move bpf_for(), bpf_for_each(), and bpf_repeat() into bpf_helpers.h To make it easier for bleeding-edge BPF applications, such as sched_ext, to utilize open-coded iterators, move bpf_for(), bpf_for_each(), and bpf_repeat() macros from selftests/bpf-internal bpf_misc.h helper, to libbpf-provided bpf_helpers.h header. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-18 12:45:11 -07:00
Andrii Nakryiko	30bbfe3236	selftests/bpf: add missing __weak kfunc log fixup test Add test validating that libbpf correctly poisons and reports __weak unresolved kfuncs in post-processed verifier log. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-18 12:45:10 -07:00
Andrii Nakryiko	05b6f766b2	libbpf: improve handling of unresolved kfuncs Currently, libbpf leaves `call #0` instruction for __weak unresolved kfuncs, which might lead to a confusing verifier log situations, where invalid `call #0` will be treated as successfully validated. We can do better. Libbpf already has an established mechanism of poisoning instructions that failed some form of resolution (e.g., CO-RE relocation and BPF map set to not be auto-created). Libbpf doesn't fail them outright to allow users to guard them through other means, and as long as BPF verifier can prove that such poisoned instructions cannot be ever reached, this doesn't consistute an invalid BPF program. If user didn't guard such code, libbpf will extract few pieces of information to tie such poisoned instructions back to additional information about what entitity wasn't resolved (e.g., BPF map name, or CO-RE relocation information). __weak unresolved kfuncs fit this model well, so this patch extends libbpf with poisioning and log fixup logic for kfunc calls. Note, this poisoning is done only for kfunc calls, not kfunc address resolution (ldimm64 instructions). The former cannot be ever valid, if reached, so it's safe to poison them. The latter is a valid mechanism to check if __weak kfunc ksym was resolved, and do necessary guarding and work arounds based on this result, supported in most recent kernels. As such, libbpf keeps such ldimm64 instructions as loading zero, never poisoning them. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-18 12:45:10 -07:00
Andrii Nakryiko	f709160d17	libbpf: report vmlinux vs module name when dealing with ksyms Currently libbpf always reports "kernel" as a source of ksym BTF type, which is ambiguous given ksym's BTF can come from either vmlinux or kernel module BTFs. Make this explicit and log module name, if used BTF is from kernel module. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-18 12:45:10 -07:00
Andrii Nakryiko	3055ddd654	libbpf: misc internal libbpf clean ups around log fixup Normalize internal constants, field names, and comments related to log fixup. Also add explicit `ext_idx` alias for relocation where relocation is pointing to extern description for additional information. No functional changes, just a clean up before subsequent additions. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-18 12:45:10 -07:00
Yonghong Song	49859de997	selftests/bpf: Add a selftest for checking subreg equality Add a selftest to ensure subreg equality if source register upper 32bit is 0. Without previous patch, the test will fail verification. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230417222139.360607-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-17 15:50:02 -07:00
Yonghong Song	3be49f7955	bpf: Improve verifier u32 scalar equality checking In [1], I tried to remove bpf-specific codes to prevent certain llvm optimizations, and add llvm TTI (target transform info) hooks to prevent those optimizations. During this process, I found if I enable llvm SimplifyCFG:shouldFoldTwoEntryPHINode transformation, I will hit the following verification failure with selftests: ... 8: (18) r1 = 0xffffc900001b2230 ; R1_w=map_value(off=560,ks=4,vs=564,imm=0) 10: (61) r1 = (u32 )(r1 +0) ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 11: (79) r2 = (u64 )(r6 +152) ; R2_w=scalar() R6=ctx(off=0,imm=0) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 12: (55) if r2 != 0xb9fbeef goto pc+10 ; R2_w=195018479 13: (bc) w2 = w1 ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (test < __NR_TESTS) 14: (a6) if w1 < 0x9 goto pc+1 16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0 ; 16: (27) r2 = 28 ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) 17: (18) r3 = 0xffffc900001b2118 ; R3_w=map_value(off=280,ks=4,vs=564,imm=0) 19: (0f) r3 += r2 ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) 20: (61) r2 = (u32 )(r3 +0) R3 unbounded memory access, make sure to bounds check any such access processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6 -- END PROG LOAD LOG -- libbpf: prog 'ingress_fwdns_prio100': failed to load: -13 libbpf: failed to load object 'test_tc_dtime' libbpf: failed to load BPF skeleton 'test_tc_dtime': -13 ... At insn 14, with condition 'w1 < 9', register r1 is changed from an arbitrary u32 value to `scalar(umax=8,var_off=(0x0; 0xf))`. Register r2, however, remains as an arbitrary u32 value. Current verifier won't claim r1/r2 equality if the previous mov is alu32 ('w2 = w1'). If r1 upper 32bit value is not 0, we indeed cannot clamin r1/r2 equality after 'w2 = w1'. But in this particular case, we know r1 upper 32bit value is 0, so it is safe to claim r1/r2 equality. This patch exactly did this. For a 32bit subreg mov, if the src register upper 32bit is 0, it is okay to claim equality between src and dst registers. With this patch, the above verification sequence becomes ... 8: (18) r1 = 0xffffc9000048e230 ; R1_w=map_value(off=560,ks=4,vs=564,imm=0) 10: (61) r1 = (u32 )(r1 +0) ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 11: (79) r2 = (u64 )(r6 +152) ; R2_w=scalar() R6=ctx(off=0,imm=0) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 12: (55) if r2 != 0xb9fbeef goto pc+10 ; R2_w=195018479 13: (bc) w2 = w1 ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (test < __NR_TESTS) 14: (a6) if w1 < 0x9 goto pc+1 ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff)) ... from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0 16: (27) r2 = 28 ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc)) 17: (18) r3 = 0xffffc9000048e118 ; R3_w=map_value(off=280,ks=4,vs=564,imm=0) 19: (0f) r3 += r2 20: (61) r2 = (u32 )(r3 +0) ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252) ... and eventually the bpf program can be verified successfully. [1] https://reviews.llvm.org/D147968 Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-17 15:50:02 -07:00
Sean Young	69a8c792cd	bpf: lirc program type should not require SYS_CAP_ADMIN Make it possible to load lirc program type with just CAP_BPF. There is nothing exceptional about lirc programs that means they require SYS_CAP_ADMIN. In order to attach or detach a lirc program type you need permission to open /dev/lirc0; if you have permission to do that, you can alter all sorts of lirc receiving options. Changing the IR protocol decoder is no different. Right now on a typical distribution /dev/lirc devices are only read/write by root. Ideally we would make them group read/write like other devices so that local users can use them without becoming root. Signed-off-by: Sean Young <sean@mess.org> Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-17 13:21:14 -07:00
Daniel Borkmann	59e498a328	bpf: Set skb redirect and from_ingress info in __bpf_tx_skb There are some use-cases where it is desirable to use bpf_redirect() in combination with ifb device, which currently is not supported, for example, around filtering inbound traffic with BPF to then push it to ifb which holds the qdisc for shaping in contrast to doing that on the egress device. Toke mentions the following case related to OpenWrt: Because there's not always a single egress on the other side. These are mainly home routers, which tend to have one or more WiFi devices bridged to one or more ethernet ports on the LAN side, and a single upstream WAN port. And the objective is to control the total amount of traffic going over the WAN link (in both directions), to deal with bufferbloat in the ISP network (which is sadly still all too prevalent). In this setup, the traffic can be split arbitrarily between the links on the LAN side, and the only "single bottleneck" is the WAN link. So we install both egress and ingress shapers on this, configured to something like 95-98% of the true link bandwidth, thus moving the queues into the qdisc layer in the router. It's usually necessary to set the ingress bandwidth shaper a bit lower than the egress due to being "downstream" of the bottleneck link, but it does work surprisingly well. We usually use something like a matchall filter to put all ingress traffic on the ifb, so doing the redirect from BPF has not been an immediate requirement thus far. However, it does seem a bit odd that this is not possible, and we do have a BPF-based filter that layers on top of this kind of setup, which currently uses u32 as the ingress filter and so it could presumably be improved to use BPF instead if that was available. Reported-by: Toke Høiland-Jørgensen <toke@redhat.com> Reported-by: Yafang Shao <laoar.shao@gmail.com> Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-17 13:17:41 -07:00
Alexei Starovoitov	d40f4f6813	Merge branch 'Remove KF_KPTR_GET kfunc flag' David Vernet says: ==================== We've managed to improve the UX for kptrs significantly over the last 9 months. All of the existing use cases which previously had KF_KPTR_GET kfuncs (struct bpf_cpumask , struct task_struct , and struct cgroup ) have all been updated to be synchronized using RCU. In other words, their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU \| KF_ACQUIRE kfuncs, with the pointers themselves also being readable from maps in an RCU read region thanks to the types being RCU safe. While KF_KPTR_GET was a logical starting point for kptrs, it's become clear that they're not the correct abstraction. KF_KPTR_GET is a flag that essentially does nothing other than enforcing that the argument to a function is a pointer to a referenced kptr map value. At first glance, that's a useful thing to guarantee to a kfunc. It gives kfuncs the ability to try and acquire a reference on that kptr without requiring the BPF prog to do something like this: struct kptr_type in_map, new = NULL; in_map = bpf_kptr_xchg(&map->value, NULL); if (in_map) { new = bpf_kptr_type_acquire(in_map); in_map = bpf_kptr_xchg(&map->value, in_map); if (in_map) bpf_kptr_type_release(in_map); } That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is the only alternative, it's better than nothing. However, the problem with any KF_KPTR_GET kfunc lies in the fact that it always requires some kind of synchronization in order to safely do an opportunistic acquire of the kptr in the map. This is because a BPF program running on another CPU could do a bpf_kptr_xchg() on that map value, and free the kptr after it's been read by the KF_KPTR_GET kfunc. For example, the now-removed bpf_task_kptr_get() kfunc did the following: struct task_struct bpf_task_kptr_get(struct task_struct *pp) { struct task_struct p; rcu_read_lock(); p = READ_ONCE(pp); / If p is non-NULL, it could still be freed by another CPU, * so we have to do an opportunistic refcount_inc_not_zero() * and return NULL if the task will be freed after the * current RCU read region. / \|f (p && !refcount_inc_not_zero(&p->rcu_users)) p = NULL; rcu_read_unlock(); return p; } In other words, the kfunc uses RCU to ensure that the task remains valid after it's been peeked from the map. However, this is completely redundant with just defining a KF_RCU kfunc that itself does a refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now does. So, the question of whether KF_KPTR_GET is useful is actually, "Are there any synchronization mechanisms / safety flags that are required by certain kptrs, but which are not provided by the verifier to kfuncs?" The answer to that question today is "No", because every kptr we currently care about is RCU protected. Even if the answer ever became "yes", the proper way to support that referenced kptr type would be to add support for whatever synchronization mechanism it requires in the verifier, rather than giving kfuncs a flag that says, "Here's a pointer to a referenced kptr in a map, do whatever you need to do." With all that said -- so as to allow us to consolidate the kfunc API, and simplify the verifier, this patchset removes the KF_KPTR_GET kfunc flag. --- This is v2 of this patchset v1: https://lore.kernel.org/all/20230415103231.236063-1-void@manifault.com/ Changelog: ---------- v1 -> v2: - Fix KF_RU -> KF_RCU typo in commit summary for patch 2/3, and in cover letter (Alexei) - In order to reduce churn, don't shift all KF_ flags down by 1. We'll just fill the now-empty slot the next time we add a flag (Alexei) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-16 08:51:24 -07:00
David Vernet	530474e6d0	bpf,docs: Remove KF_KPTR_GET from documentation A prior patch removed KF_KPTR_GET from the kernel. Now that it's no longer accessible to kfunc authors, this patch removes it from the BPF kfunc documentation. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-4-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-16 08:51:24 -07:00
David Vernet	7b4ddf3920	bpf: Remove KF_KPTR_GET kfunc flag We've managed to improve the UX for kptrs significantly over the last 9 months. All of the existing use cases which previously had KF_KPTR_GET kfuncs (struct bpf_cpumask , struct task_struct , and struct cgroup ) have all been updated to be synchronized using RCU. In other words, their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU \| KF_ACQUIRE kfuncs, with the pointers themselves also being readable from maps in an RCU read region thanks to the types being RCU safe. While KF_KPTR_GET was a logical starting point for kptrs, it's become clear that they're not the correct abstraction. KF_KPTR_GET is a flag that essentially does nothing other than enforcing that the argument to a function is a pointer to a referenced kptr map value. At first glance, that's a useful thing to guarantee to a kfunc. It gives kfuncs the ability to try and acquire a reference on that kptr without requiring the BPF prog to do something like this: struct kptr_type in_map, new = NULL; in_map = bpf_kptr_xchg(&map->value, NULL); if (in_map) { new = bpf_kptr_type_acquire(in_map); in_map = bpf_kptr_xchg(&map->value, in_map); if (in_map) bpf_kptr_type_release(in_map); } That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is the only alternative, it's better than nothing. However, the problem with any KF_KPTR_GET kfunc lies in the fact that it always requires some kind of synchronization in order to safely do an opportunistic acquire of the kptr in the map. This is because a BPF program running on another CPU could do a bpf_kptr_xchg() on that map value, and free the kptr after it's been read by the KF_KPTR_GET kfunc. For example, the now-removed bpf_task_kptr_get() kfunc did the following: struct task_struct bpf_task_kptr_get(struct task_struct *pp) { struct task_struct p; rcu_read_lock(); p = READ_ONCE(pp); / If p is non-NULL, it could still be freed by another CPU, * so we have to do an opportunistic refcount_inc_not_zero() * and return NULL if the task will be freed after the * current RCU read region. */ \|f (p && !refcount_inc_not_zero(&p->rcu_users)) p = NULL; rcu_read_unlock(); return p; } In other words, the kfunc uses RCU to ensure that the task remains valid after it's been peeked from the map. However, this is completely redundant with just defining a KF_RCU kfunc that itself does a refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now does. So, the question of whether KF_KPTR_GET is useful is actually, "Are there any synchronization mechanisms / safety flags that are required by certain kptrs, but which are not provided by the verifier to kfuncs?" The answer to that question today is "No", because every kptr we currently care about is RCU protected. Even if the answer ever became "yes", the proper way to support that referenced kptr type would be to add support for whatever synchronization mechanism it requires in the verifier, rather than giving kfuncs a flag that says, "Here's a pointer to a referenced kptr in a map, do whatever you need to do." With all that said -- so as to allow us to consolidate the kfunc API, and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all relevant logic from the verifier. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-16 08:51:24 -07:00
David Vernet	09b501d905	bpf: Remove bpf_kfunc_call_test_kptr_get() test kfunc We've managed to improve the UX for kptrs significantly over the last 9 months. All of the prior main use cases, struct bpf_cpumask , struct task_struct , and struct cgroup *, have all been updated to be synchronized mainly using RCU. In other words, their KF_ACQUIRE kfunc calls are all KF_RCU, and the pointers themselves are MEM_RCU and can be accessed in an RCU read region in BPF. In a follow-on change, we'll be removing the KF_KPTR_GET kfunc flag. This patch prepares for that by removing the bpf_kfunc_call_test_kptr_get() kfunc, and all associated selftests. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-16 08:51:24 -07:00
Alexei Starovoitov	7a0788fe83	Merge branch 'Shared ownership for local kptrs' Dave Marchevsky says: ==================== This series adds support for refcounted local kptrs to the verifier. A local kptr is 'refcounted' if its type contains a struct bpf_refcount field: struct refcounted_node { long data; struct bpf_list_node ll; struct bpf_refcount ref; }; bpf_refcount is used to implement shared ownership for local kptrs. Motivating usecase ================== If a struct has two collection node fields, e.g.: struct node { long key; long val; struct bpf_rb_node rb; struct bpf_list_node ll; }; It's not currently possible to add a node to both the list and rbtree: long bpf_prog(void ctx) { struct node n = bpf_obj_new(typeof(n)); if (!n) { / ... / } bpf_spin_lock(&lock); bpf_list_push_back(&head, &n->ll); bpf_rbtree_add(&root, &n->rb, less); / Assume a resonable less() / bpf_spin_unlock(&lock); } The above program will fail verification due to current owning / non-owning ref logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be passed to bpf_rbtree_add. The only way to get an owning reference for the node that was added is to bpf_list_pop_{front,back} it. More generally, verifier ownership semantics expect that a node has one owner (program, collection, or stashed in map) with exclusive ownership of the node's lifetime. The owner free's the node's underlying memory when it itself goes away. Without a shared ownership concept it's impossible to express many real-world usecases such that they pass verification. Semantic Changes ================ Before this series, the verifier could make this statement: "whoever has the owning reference has exclusive ownership of the referent's lifetime". As demonstrated in the previous section, this implies that a BPF program can't have an owning reference to some node if that node is in a collection. If such a state were possible, the node would have multiple owners, each thinking they have exclusive ownership. In order to support shared ownership it's necessary to modify the exclusive ownership semantic. After this series' changes, an owning reference has ownership of the referent's lifetime, but it's not necessarily exclusive. The referent's underlying memory is guaranteed to be valid (i.e. not free'd) until the reference is dropped or used for collection insert. This change doesn't affect UX of owning or non-owning references much: insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require an owning reference arg, as ownership still must be passed to the collection in a shared-ownership world. * non-owning references still refer to valid memory without claiming any ownership. One important conclusion that followed from "exclusive ownership" statement is no longer valid, though. In exclusive-ownership world, if a BPF prog has an owning reference to a node, the verifier can conclude that no collection has ownership of it. This conclusion was used to avoid runtime checking in the implementations of insert and remove operations (""has the node already been {inserted, removed}?"). In a shared-ownership world the aforementioned conclusion is no longer valid, which necessitates doing runtime checking in insert and remove operation kfuncs, and those functions possibly failing to insert or remove anything. Luckily the verifier changes necessary to go from exclusive to shared ownership were fairly minimal. Patches in this series which do change verifier semantics generally have some summary dedicated to explaining why certain usecases Just Work for shared ownership without verifier changes. Implementation ============== The changes in this series can be categorized as follows: * struct bpf_refcount opaque field + plumbing * support for refcounted kptrs in bpf_obj_new and bpf_obj_drop * bpf_refcount_acquire kfunc * enables shared ownershp by bumping refcount + acquiring owning ref * support for possibly-failing collection insertion and removal * insertion changes are more complex If a patch's changes have some nuance to their effect - or lack of effect - on verifier behavior, the patch summary talks about it at length. Patch contents: * Patch 1 removes btf_field_offs struct * Patch 2 adds struct bpf_refcount and associated plumbing * Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle refcounted kptrs * Patch 4 adds bpf_refcount_acquire * Patches 5-7 add support for possibly-failing collection insert and remove * Patch 8 centralizes constructor-like functionality for local kptr types * Patch 9 adds tests for new functionality base-commit: `4a1e885c6d` Changelog: v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com Patch #s used below refer to the patch's position in v1 unless otherwise specified. * General * Rebase onto latest bpf-next (base-commit updated above) * Patch 4 - "bpf: Add bpf_refcount_acquire kfunc" * Fix typo in summary (Alexei) * Patch 7 - "Migrate bpf_rbtree_remove to possibly fail" * Modify a paragraph in patch summary to more clearly state that only bpf_rbtree_remove's non-owning ref clobbering behavior is changed by the patch (Alexei) * refcount_off == -1 -> refcount_off < 0 in "node type w/ both list and rb_node fields" check, since any negative value means "no bpf_refcount field found", and furthermore refcount_off is never explicitly set to -1, but rather -EINVAL. (Alexei) * Instead of just changing "btf: list_node and rb_node in same struct" test expectation to pass instead of fail, do some refactoring to test both "list_node, rb_node, and bpf_refcount" (success) and "list_node, rb_node, _no_ bpf_refcount" (failure) cases. This ensures that logic change in previous bullet point is correct. * v1's "btf: list_node and rb_node in same struct" test changes didn't add bpf_refcount, so the fact that btf load succeeded w/ list and rb_nodes but no bpf_refcount field is further proof that this logic was incorrect in v1. * Patch 8 - "bpf: Centralize btf_field-specific initialization logic" * Instead of doing __init_field_infer_size in kfuncs when taking bpf_list_head type input which might've been 0-initialized in map, go back to simple oneliner initialization. Add short comment explaining why this is necessary. (Alexei) * Patch 9 - "selftests/bpf: Add refcounted_kptr tests" * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-15 17:36:51 -07:00
Dave Marchevsky	6147f15131	selftests/bpf: Add refcounted_kptr tests Test refcounted local kptr functionality added in previous patches in the series. Usecases which pass verification: * Add refcounted local kptr to both tree and list. Then, read and - possibly, depending on test variant - delete from tree, then list. * Also test doing read-and-maybe-delete in opposite order * Stash a refcounted local kptr in a map_value, then add it to a rbtree. Read from both, possibly deleting after tree read. * Add refcounted local kptr to both tree and list. Then, try reading and deleting twice from one of the collections. * bpf_refcount_acquire of just-added non-owning ref should work, as should bpf_refcount_acquire of owning ref just out of bpf_obj_new Usecases which fail verification: * The simple successful bpf_refcount_acquire cases from above should both fail to verify if the newly-acquired owning ref is not dropped Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-10-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-15 17:36:50 -07:00
Dave Marchevsky	3e81740a90	bpf: Centralize btf_field-specific initialization logic All btf_fields in an object are 0-initialized by memset in bpf_obj_init. This might not be a valid initial state for some field types, in which case kfuncs that use the type will properly initialize their input if it's been 0-initialized. Some BPF graph collection types and kfuncs do this: bpf_list_{head,node} and bpf_rb_node. An earlier patch in this series added the bpf_refcount field, for which the 0 state indicates that the refcounted object should be free'd. bpf_obj_init treats this field specially, setting refcount to 1 instead of relying on scattered "refcount is 0? Must have just been initialized, let's set to 1" logic in kfuncs. This patch extends this treatment to list and rbtree field types, allowing most scattered initialization logic in kfuncs to be removed. Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case they'll be 0-initialized without passing through the newly-added logic, so scattered initialization logic must remain for these collection root types. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-15 17:36:50 -07:00

1 2 3 4 5 ...

1172521 Commits All Branches Search

1172521 Commits

All Branches