This patch allows the bpf-tcp-cc to call bpf_setsockopt. One use
case is to allow a bpf-tcp-cc switching to another cc during init().
For example, when the tcp flow is not ecn ready, the bpf_dctcp
can switch to another cc by calling setsockopt(TCP_CONGESTION).
During setsockopt(TCP_CONGESTION), the new tcp-cc's init() will be
called and this could cause a recursion but it is stopped by the
current trampoline's logic (in the prog->active counter).
While retiring a bpf-tcp-cc (e.g. in tcp_v[46]_destroy_sock()),
the tcp stack calls bpf-tcp-cc's release(). To avoid the retiring
bpf-tcp-cc making further changes to the sk, bpf_setsockopt is not
available to the bpf-tcp-cc's release(). This will avoid release()
making setsockopt() call that will potentially allocate new resources.
Although the bpf-tcp-cc already has a more powerful way to read tcp_sock
from the PTR_TO_BTF_ID, it is usually expected that bpf_getsockopt and
bpf_setsockopt are available together. Thus, bpf_getsockopt() is also
added to all tcp_congestion_ops except release().
When the old bpf-tcp-cc is calling setsockopt(TCP_CONGESTION)
to switch to a new cc, the old bpf-tcp-cc will be released by
bpf_struct_ops_put(). Thus, this patch also puts the bpf_struct_ops_map
after a rcu grace period because the trampoline's image cannot be freed
while the old bpf-tcp-cc is still running.
bpf-tcp-cc can only access icsk_ca_priv as SCALAR. All kernel's
tcp-cc is also accessing the icsk_ca_priv as SCALAR. The size
of icsk_ca_priv has already been raised a few times to avoid
extra kmalloc and memory referencing. The only exception is the
kernel's tcp_cdg.c that stores a kmalloc()-ed pointer in icsk_ca_priv.
To avoid the old bpf-tcp-cc accidentally overriding this tcp_cdg's pointer
value stored in icsk_ca_priv after switching and without over-complicating
the bpf's verifier for this one exception in tcp_cdg, this patch does not
allow switching to tcp_cdg. If there is a need, bpf_tcp_cdg can be
implemented and then use the bpf_sk_storage as the extended storage.
bpf_sk_setsockopt proto has only been recently added and used
in bpf-sockopt and bpf-iter-tcp, so impose the tcp_cdg limitation in the
same proto instead of adding a new proto specifically for bpf-tcp-cc.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210824173007.3976921-1-kafai@fb.com
The motivation behind this helper is to access userspace pt_regs in a
kprobe handler.
uprobe's ctx is the userspace pt_regs. kprobe's ctx is the kernelspace
pt_regs. bpf_task_pt_regs() allows accessing userspace pt_regs in a
kprobe handler. The final case (kernelspace pt_regs in uprobe) is
pretty rare (usermode helper) so I think that can be solved later if
necessary.
More concretely, this helper is useful in doing BPF-based DWARF stack
unwinding. Currently the kernel can only do framepointer based stack
unwinds for userspace code. This is because the DWARF state machines are
too fragile to be computed in kernelspace [0]. The idea behind
DWARF-based stack unwinds w/ BPF is to copy a chunk of the userspace
stack (while in prog context) and send it up to userspace for unwinding
(probably with libunwind) [1]. This would effectively enable profiling
applications with -fomit-frame-pointer using kprobes and uprobes.
[0]: https://lkml.org/lkml/2012/2/10/356
[1]: https://github.com/danobi/bpf-dwarf-walk
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/e2718ced2d51ef4268590ab8562962438ab82815.1629772842.git.dxu@dxuuu.xyz
bpf_get_current_task() is already supported so it's natural to also
include the _btf() variant for btf-powered helpers.
This is required for non-tracing progs to use bpf_task_pt_regs() in the
next commit.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/f99870ed5f834c9803d73b3476f8272b1bb987c0.1629772842.git.dxu@dxuuu.xyz
Fix a verifier bug found by smatch static checker in [0].
This problem has never been seen in prod to my best knowledge. Fixing it
still seems to be a good idea since it's hard to say for sure whether
it's possible or not to have a scenario where a combination of
convert_ctx_access() and a narrow load would lead to an out of bound
write.
When narrow load is handled, one or two new instructions are added to
insn_buf array, but before it was only checked that
cnt >= ARRAY_SIZE(insn_buf)
And it's safe to add a new instruction to insn_buf[cnt++] only once. The
second try will lead to out of bound write. And this is what can happen
if `shift` is set.
Fix it by making sure that if the BPF_RSH instruction has to be added in
addition to BPF_AND then there is enough space for two more instructions
in insn_buf.
The full report [0] is below:
kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
kernel/bpf/verifier.c
12282
12283 insn->off = off & ~(size_default - 1);
12284 insn->code = BPF_LDX | BPF_MEM | size_code;
12285 }
12286
12287 target_size = 0;
12288 cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
12289 &target_size);
12290 if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Bounds check.
12291 (ctx_field_size && !target_size)) {
12292 verbose(env, "bpf verifier is misconfigured\n");
12293 return -EINVAL;
12294 }
12295
12296 if (is_narrower_load && size < target_size) {
12297 u8 shift = bpf_ctx_narrow_access_offset(
12298 off, size, size_default) * 8;
12299 if (ctx_field_size <= 4) {
12300 if (shift)
12301 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
^^^^^
increment beyond end of array
12302 insn->dst_reg,
12303 shift);
--> 12304 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
^^^^^
out of bounds write
12305 (1 << size * 8) - 1);
12306 } else {
12307 if (shift)
12308 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
12309 insn->dst_reg,
12310 shift);
12311 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
^^^^^^^^^^^^^^^
Same.
12312 (1ULL << size * 8) - 1);
12313 }
12314 }
12315
12316 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
12317 if (!new_prog)
12318 return -ENOMEM;
12319
12320 delta += cnt - 1;
12321
12322 /* keep walking new program and skip insns we just inserted */
12323 env->prog = new_prog;
12324 insn = new_prog->insnsi + i + delta;
12325 }
12326
12327 return 0;
12328 }
[0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/
v1->v2:
- clarify that problem was only seen by static checker but not in prod;
Fixes: 46f53a65d2 ("bpf: Allow narrow loads with offset > 0")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820163935.1902398-1-rdna@fb.com
Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
attach types and a function to map bpf_attach_type values to the new
enum. Inspired by netns_bpf_attach_type.
Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
possible. Functionality is unchanged as attach_type_to_prog_type
switches in bpf/syscall.c were preventing non-cgroup programs from
making use of the invalid cgroup_bpf array slots.
As a result struct cgroup_bpf uses 504 fewer bytes relative to when its
arrays were sized using MAX_BPF_ATTACH_TYPE.
bpf_cgroup_storage is notably not migrated as struct
bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
member which is not meant to be opaque. Similarly, bpf_cgroup_link
continues to report its bpf_attach_type member to userspace via fdinfo
and bpf_link_info.
To ease disambiguation, bpf_attach_type variables are renamed from
'type' to 'atype' when changed to cgroup_bpf_attach_type.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com
Add logic to call bpf_setsockopt() and bpf_getsockopt() from setsockopt BPF
programs. An example use case is when the user sets the IPV6_TCLASS socket
option, we would also like to change the tcp-cc for that socket.
We don't have any use case for calling bpf_setsockopt() from supposedly read-
only sys_getsockopt(), so it is made available to BPF_CGROUP_SETSOCKOPT only
at this point.
Signed-off-by: Prankur Gupta <prankgup@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210817224221.3257826-2-prankgup@fb.com
Same as previous patch but for the keys. memdup_bpfptr is renamed
to kvmemdup_bpfptr (and converted to kvmalloc).
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-2-sdf@google.com
Use kvmalloc/kvfree for temporary value when manipulating a map via
syscall. kmalloc might not be sufficient for percpu maps where the value
is big (and further multiplied by hundreds of CPUs).
Can be reproduced with netcnt test on qemu with "-smp 255".
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-1-sdf@google.com
The BPF interpreter as well as x86-64 BPF JIT were both in line by allowing
up to 33 tail calls (however odd that number may be!). Recently, this was
changed for the interpreter to reduce it down to 32 with the assumption that
this should have been the actual limit "which is in line with the behavior of
the x86 JITs" according to b61a28cf11 ("bpf: Fix off-by-one in tail call
count limiting").
Paul recently reported:
I'm a bit surprised by this because I had previously tested the tail call
limit of several JIT compilers and found it to be 33 (i.e., allowing chains
of up to 34 programs). I've just extended a test program I had to validate
this again on the x86-64 JIT, and found a limit of 33 tail calls again [1].
Also note we had previously changed the RISC-V and MIPS JITs to allow up to
33 tail calls [2, 3], for consistency with other JITs and with the interpreter.
We had decided to increase these two to 33 rather than decrease the other
JITs to 32 for backward compatibility, though that probably doesn't matter
much as I'd expect few people to actually use 33 tail calls.
[1] ae78874829
[2] 96bc4432f5 ("bpf, riscv: Limit to 33 tail calls")
[3] e49e6f6db0 ("bpf, mips: Limit to 33 tail calls")
Therefore, revert b61a28cf11 to re-align interpreter to limit a maximum of
33 tail calls. While it is unlikely to hit the limit for the vast majority,
programs in the wild could one way or another depend on this, so lets rather
be a bit more conservative, and lets align the small remainder of JITs to 33.
If needed in future, this limit could be slightly increased, but not decreased.
Fixes: b61a28cf11 ("bpf: Fix off-by-one in tail call count limiting")
Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/CAO5pjwTWrC0_dzTbTHFPSqDwA56aVH+4KFGVqdq8=ASs0MqZGQ@mail.gmail.com
The variable allow is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817170842.495440-1-colin.king@canonical.com
Add new BPF helper, bpf_get_attach_cookie(), which can be used by BPF programs
to get access to a user-provided bpf_cookie value, specified during BPF
program attachment (BPF link creation) time.
Naming is hard, though. With the concept being named "BPF cookie", I've
considered calling the helper:
- bpf_get_cookie() -- seems too unspecific and easily mistaken with socket
cookie;
- bpf_get_bpf_cookie() -- too much tautology;
- bpf_get_link_cookie() -- would be ok, but while we create a BPF link to
attach BPF program to BPF hook, it's still an "attachment" and the
bpf_cookie is associated with BPF program attachment to a hook, not a BPF
link itself. Technically, we could support bpf_cookie with old-style
cgroup programs.So I ultimately rejected it in favor of
bpf_get_attach_cookie().
Currently all perf_event-backed BPF program types support
bpf_get_attach_cookie() helper. Follow-up patches will add support for
fentry/fexit programs as well.
While at it, mark bpf_tracing_func_proto() as static to make it obvious that
it's only used from within the kernel/trace/bpf_trace.c.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-7-andrii@kernel.org
Add ability for users to specify custom u64 value (bpf_cookie) when creating
BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event,
tracepoints).
This is useful for cases when the same BPF program is used for attaching and
processing invocation of different tracepoints/kprobes/uprobes in a generic
fashion, but such that each invocation is distinguished from each other (e.g.,
BPF program can look up additional information associated with a specific
kernel function without having to rely on function IP lookups). This enables
new use cases to be implemented simply and efficiently that previously were
possible only through code generation (and thus multiple instances of almost
identical BPF program) or compilation at runtime (BCC-style) on target hosts
(even more expensive resource-wise). For uprobes it is not even possible in
some cases to know function IP before hand (e.g., when attaching to shared
library without PID filtering, in which case base load address is not known
for a library).
This is done by storing u64 bpf_cookie in struct bpf_prog_array_item,
corresponding to each attached and run BPF program. Given cgroup BPF programs
already use two 8-byte pointers for their needs and cgroup BPF programs don't
have (yet?) support for bpf_cookie, reuse that space through union of
cgroup_storage and new bpf_cookie field.
Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx.
This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF
program execution code, which luckily is now also split from
BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper
giving access to this user-provided cookie value from inside a BPF program.
Generic perf_event BPF programs will access this value from perf_event itself
through passed in BPF program context.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20210815070609.987780-6-andrii@kernel.org
Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
the common BPF link infrastructure, allowing to list all active perf_event
based attachments, auto-detaching BPF program from perf_event when link's FD
is closed, get generic BPF link fdinfo/get_info functionality.
BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
are currently supported.
Force-detaching and atomic BPF program updates are not yet implemented, but
with perf_event-based BPF links we now have common framework for this without
the need to extend ioctl()-based perf_event interface.
One interesting consideration is a new value for bpf_attach_type, which
BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
define a single BPF_PERF_EVENT attach type for all of them and adjust
link_create()'s logic for checking correspondence between attach type and
program type.
The alternative would be to define three new attach types (e.g., BPF_KPROBE,
BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
libbpf. I chose to not do this to avoid unnecessary proliferation of
bpf_attach_type enum values and not have to deal with naming conflicts.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20210815070609.987780-5-andrii@kernel.org
Make internal perf_event_set_bpf_prog() use struct bpf_prog pointer as an
input argument, which makes it easier to re-use for other internal uses
(coming up for BPF link in the next patch). BPF program FD is not as
convenient and in some cases it's not available. So switch to struct bpf_prog,
move out refcounting outside and let caller do bpf_prog_put() in case of an
error. This follows the approach of most of the other BPF internal functions.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-4-andrii@kernel.org
Similar to BPF_PROG_RUN, turn BPF_PROG_RUN_ARRAY macros into proper functions
with all the same readability and maintainability benefits. Making them into
functions required shuffling around bpf_set_run_ctx/bpf_reset_run_ctx
functions. Also, explicitly specifying the type of the BPF prog run callback
required adjusting __bpf_prog_run_save_cb() to accept const void *, casted
internally to const struct sk_buff.
Further, split out a cgroup-specific BPF_PROG_RUN_ARRAY_CG and
BPF_PROG_RUN_ARRAY_CG_FLAGS from the more generic BPF_PROG_RUN_ARRAY due to
the differences in bpf_run_ctx used for those two different use cases.
I think BPF_PROG_RUN_ARRAY_CG would benefit from further refactoring to accept
struct cgroup and enum bpf_attach_type instead of bpf_prog_array, fetching
cgrp->bpf.effective[type] and RCU-dereferencing it internally. But that
required including include/linux/cgroup-defs.h, which I wasn't sure is ok with
everyone.
The remaining generic BPF_PROG_RUN_ARRAY function will be extended to
pass-through user-provided context value in the next patch.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-3-andrii@kernel.org
Turn BPF_PROG_RUN into a proper always inlined function. No functional and
performance changes are intended, but it makes it much easier to understand
what's going on with how BPF programs are actually get executed. It's more
obvious what types and callbacks are expected. Also extra () around input
parameters can be dropped, as well as `__` variable prefixes intended to avoid
naming collisions, which makes the code simpler to read and write.
This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
BPF_PROG_RUN into a function causes naming conflict compilation error. So
rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org
/proc/net/unix uses "%c" to print a single-byte character to escape '\0' in
the name of the abstract UNIX domain socket. The following selftest uses
it, so this patch adds support for "%c". Note that it does not support
wide character ("%lc" and "%llc") for simplicity.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210814015718.42704-3-kuniyu@amazon.co.jp
This is similar to existing BPF_PROG_TYPE_CGROUP_SOCK
and BPF_PROG_TYPE_CGROUP_SOCK_ADDR.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210813230530.333779-2-sdf@google.com
can and ieee802154.
Current release - regressions:
- r8169: fix ASPM-related link-up regressions
- bridge: fix flags interpretation for extern learn fdb entries
- phy: micrel: fix link detection on ksz87xx switch
- Revert "tipc: Return the correct errno code"
- ptp: fix possible memory leak caused by invalid cast
Current release - new code bugs:
- bpf: add missing bpf_read_[un]lock_trace() for syscall program
- bpf: fix potentially incorrect results with bpf_get_local_storage()
- page_pool: mask the page->signature before the checking, avoid
dma mapping leaks
- netfilter: nfnetlink_hook: 5 fixes to information in netlink dumps
- bnxt_en: fix firmware interface issues with PTP
- mlx5: Bridge, fix ageing time
Previous releases - regressions:
- linkwatch: fix failure to restore device state across suspend/resume
- bareudp: fix invalid read beyond skb's linear data
Previous releases - always broken:
- bpf: fix integer overflow involving bucket_size
- ppp: fix issues when desired interface name is specified via netlink
- wwan: mhi_wwan_ctrl: fix possible deadlock
- dsa: microchip: ksz8795: fix number of VLAN related bugs
- dsa: drivers: fix broken backpressure in .port_fdb_dump
- dsa: qca: ar9331: make proper initial port defaults
Misc:
- bpf: add lockdown check for probe_write_user helper
- netfilter: conntrack: remove offload_pickup sysctl before 5.14 is out
- netfilter: conntrack: collect all entries in one cycle,
heuristically slow down garbage collection scans
on idle systems to prevent frequent wake ups
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEVb/AACgkQMUZtbf5S
Irvlzw//XGDHNNPPOueHVhYK50+WiqPMxezQ5nbnG6uR6JtPyirMNTgzST8rQRsu
HmQy8/Oi6bK5rbPC9iDtKK28ba6Ldvu1ic8lTkuWyNNthG/pZGJJQ+Pg7dmkd7te
soJGZKnTbNWwbgGOFbfw9rLRuzWsjQjQ43vxTMjjNnpOwNxANuNR1GN0S/t8e9di
9BBT8jtgcHhtW5jRMHMNWHk+k8aeyIZPxjl9fjzzsMt7meX50DFrCJgf8bKkZ5dA
W2b/fzUyMqVQJpgmIY4ktFmR4mV382pWOOs6rl+ppSu+mU/gpTuYCofF7FqAUU5S
71mzukW6KdOrqymVuwiTXBlGnZB370aT7aUU5PHL/ZkDJ9shSyVRcg/iQa40myzn
5wxunZX936z5f84bxZPW1J5bBZklba8deKPXHUkl5RoIXsN2qWFPJpZ1M0eHyfPm
ZdqvRZ1IkSSFZFr6FF374bEqa88NK1wbVKUbGQ+yn8abE+HQfXQR9ZWZa1DR1wkb
rF8XWOHjQLp/zlTRnj3gj3T4pEwc5L1QOt7RUrYfI36Mh7iUz5EdzowaiEaDQT6/
neThilci1F6Mz4Uf65pK4TaDTDvj1tqqAdg3g8uneHBTFARS+htGXqkaKxP6kSi+
T/W4woOqCRT6c0+BhZ2jPRhKsMZ5kR1vKLUVBHShChq32mDpn6g=
=hzDl
-----END PGP SIGNATURE-----
Merge tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, bpf, can and
ieee802154.
The size of this is pretty normal, but we got more fixes for 5.14
changes this week than last week. Nothing major but the trend is the
opposite of what we like. We'll see how the next week goes..
Current release - regressions:
- r8169: fix ASPM-related link-up regressions
- bridge: fix flags interpretation for extern learn fdb entries
- phy: micrel: fix link detection on ksz87xx switch
- Revert "tipc: Return the correct errno code"
- ptp: fix possible memory leak caused by invalid cast
Current release - new code bugs:
- bpf: add missing bpf_read_[un]lock_trace() for syscall program
- bpf: fix potentially incorrect results with bpf_get_local_storage()
- page_pool: mask the page->signature before the checking, avoid dma
mapping leaks
- netfilter: nfnetlink_hook: 5 fixes to information in netlink dumps
- bnxt_en: fix firmware interface issues with PTP
- mlx5: Bridge, fix ageing time
Previous releases - regressions:
- linkwatch: fix failure to restore device state across
suspend/resume
- bareudp: fix invalid read beyond skb's linear data
Previous releases - always broken:
- bpf: fix integer overflow involving bucket_size
- ppp: fix issues when desired interface name is specified via
netlink
- wwan: mhi_wwan_ctrl: fix possible deadlock
- dsa: microchip: ksz8795: fix number of VLAN related bugs
- dsa: drivers: fix broken backpressure in .port_fdb_dump
- dsa: qca: ar9331: make proper initial port defaults
Misc:
- bpf: add lockdown check for probe_write_user helper
- netfilter: conntrack: remove offload_pickup sysctl before 5.14 is
out
- netfilter: conntrack: collect all entries in one cycle,
heuristically slow down garbage collection scans on idle systems to
prevent frequent wake ups"
* tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
vsock/virtio: avoid potential deadlock when vsock device remove
wwan: core: Avoid returning NULL from wwan_create_dev()
net: dsa: sja1105: unregister the MDIO buses during teardown
Revert "tipc: Return the correct errno code"
net: mscc: Fix non-GPL export of regmap APIs
net: igmp: increase size of mr_ifc_count
MAINTAINERS: switch to my OMP email for Renesas Ethernet drivers
tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
net: pcs: xpcs: fix error handling on failed to allocate memory
net: linkwatch: fix failure to restore device state across suspend/resume
net: bridge: fix memleak in br_add_if()
net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge
net: bridge: fix flags interpretation for extern learn fdb entries
net: dsa: sja1105: fix broken backpressure in .port_fdb_dump
net: dsa: lantiq: fix broken backpressure in .port_fdb_dump
net: dsa: lan9303: fix broken backpressure in .port_fdb_dump
net: dsa: hellcreek: fix broken backpressure in .port_fdb_dump
bpf, core: Fix kernel-doc notation
net: igmp: fix data-race in igmp_ifc_timer_expire()
net: Fix memory leak in ieee802154_raw_deliver
...
Pull ucounts fix from Eric Biederman:
"This fixes the ucount sysctls on big endian architectures.
The counts were expanded to be longs instead of ints, and the sysctl
code was overlooked, so only the low 32bit were being processed. On
litte endian just processing the low 32bits is fine, but on 64bit big
endian processing just the low 32bits results in the high order bits
instead of the low order bits being processed and nothing works
proper.
This change took a little bit to mature as we have the SYSCTL_ZERO,
and SYSCTL_INT_MAX macros that are only usable for sysctls operating
on ints, but unfortunately are not obviously broken. Which resulted in
the versions of this change working on big endian and not on little
endian, because the int SYSCTL_ZERO when extended 64bit wound up being
0x100000000. So we only allowed values greater than 0x100000000 and
less than 0faff. Which unfortunately broken everything that tried to
set the sysctls. (First reported with the windows subsystem for
linux).
I have tested this on x86_64 64bit after first reproducing the
problems with the earlier version of this change, and then verifying
the problems do not exist when we use appropriate long min and max
values for extra1 and extra2"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: add missing data type changes
Daniel Borkmann says:
====================
bpf-next 2021-08-10
We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).
1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.
2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
32-bit MIPS JIT development, from Johan Almbladh.
3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.
4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.
5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.
6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.
7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
and Muhammad Falak R Wani.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
bpf, tests: Add tail call test suite
bpf, tests: Add tests for BPF_CMPXCHG
bpf, tests: Add tests for atomic operations
bpf, tests: Add test for 32-bit context pointer argument passing
bpf, tests: Add branch conversion JIT test
bpf, tests: Add word-order tests for load/store of double words
bpf, tests: Add tests for ALU operations implemented with function calls
bpf, tests: Add more ALU64 BPF_MUL tests
bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
bpf, tests: Fix typos in test case descriptions
bpf, tests: Add BPF_MOV tests for zero and sign extension
bpf, tests: Add BPF_JMP32 test cases
samples, bpf: Add an explict comment to handle nested vlan tagging.
selftests/bpf: Add tests for XDP bonding
selftests/bpf: Fix xdp_tx.c prog section name
net, core: Allow netdev_lower_get_next_private_rcu in bh context
bpf, devmap: Exclude XDP broadcast to master device
net, bonding: Add XDP support to the bonding driver
...
====================
Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix kernel-doc warnings in kernel/bpf/core.c (found by scripts/kernel-doc
and W=1 builds). That is, correct a function name in a comment and add
return descriptions for 2 functions.
Fixes these kernel-doc warnings:
kernel/bpf/core.c:1372: warning: expecting prototype for __bpf_prog_run(). Prototype was for ___bpf_prog_run() instead
kernel/bpf/core.c:1372: warning: No description found for return value of '___bpf_prog_run'
kernel/bpf/core.c:1883: warning: No description found for return value of 'bpf_prog_select_runtime'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809215229.7556-1-rdunlap@infradead.org
Commit b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed a bug for bpf_get_local_storage() helper so different tasks
won't mess up with each other's percpu local storage.
The percpu data contains 8 slots so it can hold up to 8 contexts (same or
different tasks), for 8 different program runs, at the same time. This in
general is sufficient. But our internal testing showed the following warning
multiple times:
[...]
warning: WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193
__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
<IRQ>
tcp_call_bpf.constprop.99+0x93/0xc0
tcp_conn_request+0x41e/0xa50
? tcp_rcv_state_process+0x203/0xe00
tcp_rcv_state_process+0x203/0xe00
? sk_filter_trim_cap+0xbc/0x210
? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
tcp_v6_do_rcv+0x181/0x3e0
tcp_v6_rcv+0xc65/0xcb0
ip6_protocol_deliver_rcu+0xbd/0x450
ip6_input_finish+0x11/0x20
ip6_input+0xb5/0xc0
ip6_sublist_rcv_finish+0x37/0x50
ip6_sublist_rcv+0x1dc/0x270
ipv6_list_rcv+0x113/0x140
__netif_receive_skb_list_core+0x1a0/0x210
netif_receive_skb_list_internal+0x186/0x2a0
gro_normal_list.part.170+0x19/0x40
napi_complete_done+0x65/0x150
mlx5e_napi_poll+0x1ae/0x680
__napi_poll+0x25/0x120
net_rx_action+0x11e/0x280
__do_softirq+0xbb/0x271
irq_exit_rcu+0x97/0xa0
common_interrupt+0x7f/0xa0
</IRQ>
asm_common_interrupt+0x1e/0x40
RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
? __cgroup_bpf_run_filter_skb+0x378/0x4e0
? do_softirq+0x34/0x70
? ip6_finish_output2+0x266/0x590
? ip6_finish_output+0x66/0xa0
? ip6_output+0x6c/0x130
? ip6_xmit+0x279/0x550
? ip6_dst_check+0x61/0xd0
[...]
Using drgn [0] to dump the percpu buffer contents showed that on this CPU
slot 0 is still available, but slots 1-7 are occupied and those tasks in
slots 1-7 mostly don't exist any more. So we might have issues in
bpf_cgroup_storage_unset().
Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset().
Currently, it tries to unset "current" slot with searching from the start.
So the following sequence is possible:
1. A task is running and claims slot 0
2. Running BPF program is done, and it checked slot 0 has the "task"
and ready to reset it to NULL (not yet).
3. An interrupt happens, another BPF program runs and it claims slot 1
with the *same* task.
4. The unset() in interrupt context releases slot 0 since it matches "task".
5. Interrupt is done, the task in process context reset slot 0.
At the end, slot 1 is not reset and the same process can continue to occupy
slots 2-7 and finally, when the above step 1-5 is repeated again, step 3 BPF
program won't be able to claim an empty slot and a warning will be issued.
To fix the issue, for unset() function, we should traverse from the last slot
to the first. This way, the above issue can be avoided.
The same reverse traversal should also be done in bpf_get_local_storage() helper
itself. Otherwise, incorrect local storage may be returned to BPF program.
[0] https://github.com/osandov/drgn
Fixes: b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810010413.1976277-1-yhs@fb.com
Back then, commit 96ae522795 ("bpf: Add bpf_probe_write_user BPF helper
to be called in tracers") added the bpf_probe_write_user() helper in order
to allow to override user space memory. Its original goal was to have a
facility to "debug, divert, and manipulate execution of semi-cooperative
processes" under CAP_SYS_ADMIN. Write to kernel was explicitly disallowed
since it would otherwise tamper with its integrity.
One use case was shown in cf9b1199de ("samples/bpf: Add test/example of
using bpf_probe_write_user bpf helper") where the program DNATs traffic
at the time of connect(2) syscall, meaning, it rewrites the arguments to
a syscall while they're still in userspace, and before the syscall has a
chance to copy the argument into kernel space. These days we have better
mechanisms in BPF for achieving the same (e.g. for load-balancers), but
without having to write to userspace memory.
Of course the bpf_probe_write_user() helper can also be used to abuse
many other things for both good or bad purpose. Outside of BPF, there is
a similar mechanism for ptrace(2) such as PTRACE_PEEK{TEXT,DATA} and
PTRACE_POKE{TEXT,DATA}, but would likely require some more effort.
Commit 96ae522795 explicitly dedicated the helper for experimentation
purpose only. Thus, move the helper's availability behind a newly added
LOCKDOWN_BPF_WRITE_USER lockdown knob so that the helper is disabled under
the "integrity" mode. More fine-grained control can be implemented also
from LSM side with this change.
Fixes: 96ae522795 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Pull cgroup fix from Tejun Heo:
"One commit to fix a possible A-A deadlock around u64_stats_sync on
32bit machines caused by updating it without disabling IRQ when it may
be read from IRQ context"
* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync
Rename LOCKDOWN_BPF_READ into LOCKDOWN_BPF_READ_KERNEL so we have naming
more consistent with a LOCKDOWN_BPF_WRITE_USER option that we are adding.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
- Prevent a memory ordering issue in the timer expiry code which makes it
possible to observe falsely that the callback has been executed already
while that's not the case, which violates the guarantee of del_timer_sync().
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEPwQgTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoV/CD/0YmL4fjwNOoDk/sZSuW6nh7DjZ2714
sLxP18nzq9NhykF1tSfJhgWSokjNLWZ3cr4/UJ+i1XyDbC69uIi9dLbWiQKrir6X
5lHlxy1bzemz59Lcx9ENcCXRO1R/7FnVR2h37dMwAEKQVkeXxqIcmwSJGokW2AQW
3LNMKbY6UPT9SNU399s8BdLHxKaQ7TBDZ/jxN+1xlt/BRj2+TpnL/hE5rGvrfYC7
gnNOwxuIacuS5XBrc8s1hD//VrqJPhgASLLmaoI6vXfl9q3OwjSpNCGzqORmMWqk
N8M1A7P9538ym72BWG71evoGWrbEwoxNo1OiK5RtgjH31hrsGwSD6EtOhGmBmqIB
urdC17R/sm+OFXzNyQgg9dmq7GdwbSD4HSYXJ7DnGh2us6JilFwxSkIJ1Ce0yYOw
qSBpDutas3Xc3RiejgFVBNKEsSGhOtSy3Tc7QqvRs1OJbb6qm8twU27UEzFXy6zX
LRnhv/A7rZRaeEc5WcbWu+xBDzIqWRSgecOwM3SBsQyUkVV73R7wyuNo80o0TEb2
13jVC9dnoDUDnqUwnNLJoqtfU/I/DBs49mZRJUyqev73buvBlDZqhjRthIMwSGDb
DORRsfOYCmHa+fySkO1GZbgHG4Pym51tyjpC8jD4KxNU0dOW/d5TYlRh8nsBt8PG
p+/vOBXMHBFbCg==
=JQWW
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single timer fix:
- Prevent a memory ordering issue in the timer expiry code which
makes it possible to observe falsely that the callback has been
executed already while that's not the case, which violates the
guarantee of del_timer_sync()"
* tag 'timers-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Move clearing of base::timer_running under base:: Lock
- Prevent a double enqueue caused by rt_effective_prio() being invoked
twice in __sched_setscheduler().
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEPwDATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobyFD/46yd3xi1cfI9WQRuOQPNBa4/uzg7ir
33AKOk3MmHICt8M5fhBrLsC/qwCjONB3N+0tmkj+uVgZPfeW4cd8LB5rYW/byIS+
ib6wMyvOpr91oL1Hb1b7SHlodbdZFL6gInMrDb/gMABiojml+aZt1kwsA9FFFVdE
DEWOue/xIf22Tw8egCxsjZBAfMvyBSuTvdGPTKiUXKm96RO2Sr7PQIbnc6gBjbkn
SvLwW8gIcyUe6u+8pN9rhAqnlOO5E/tSkF7BWNLAnrp3xnubty/XBulWRUCaeOQy
8+/O3/5cqmQ6kSNA7aPVSPPZY3zADB+KW5EHxWBYCiZuXnDj1WJqc3r1sYiNtfXL
Tl59DRggEktlAUh8QDt7rkFxe0waWTxyeAIEa/79IebnrZkdrMi87XO8hZoB7K4P
GRqg0AyiQB7B/trcZLb7rNPa9rFAMOMoPX5qyvwEoqKZ8rwzUrv+xmW5cqWsLpIO
3TatEgnK3pWPV+hhRhz2dqFQ6NuwnNFDTIPvSOS0EgY1lTUu+HkYwU2xqqwKHswF
aqyyw6SEXnOUeXJhj/6gzhDk/qGFCLfww+1+hiInBDNj6xlEbrSXANmEG8eH8DqU
XXQpgehCQwsgtxyzVMRvJJJ0dqulDxlv+xt+RtfXZHDjQeHYE1yXlWWm2r2opWse
feOUyXbKt4Tczg==
=EZjT
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single scheduler fix:
- Prevent a double enqueue caused by rt_effective_prio() being
invoked twice in __sched_setscheduler()"
* tag 'sched-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Fix double enqueue caused by rt_effective_prio
- Correct the permission checks for perf event which send SIGTRAP to a
different process and clean up that code to be more readable.
- Prevent an out of bound MSR access in the x86 perf code which happened
due to an incomplete limiting to the actually available hardware
counters.
- Prevent access to the AMD64_EVENTSEL_HOSTONLY bit when running inside a
guest.
- Handle small core counter re-enabling correctly by issuing an ACK right
before reenabling it to prevent a stale PEBS record being kept around.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEPv6UTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYob8hD/wMmRLAoc/uvJIIICJ+IQVnnU8WToIS
Qy1dAPpQMz6pQpRQor1AGpcP89IMnLVhZn84lsd+kw0/Lv630JbWsXvQ8jB2GPHn
17XewPp4l4PDUgKaGEKIjPSjsmnZmzOLTYIy5gWOfA/h5EG/1D+ozvcRGDMaXWUw
+65Pinaf2QKfjYZV11SVJMLF5zLYUxMc6vRag00WrcPxd+JO4eVeV36g0LTmhABW
fOSDcBOSVrT2w9MYDpNmPvMh3dN2vlfhrEk10NBKslx8uk4t8sV/Jbs+48WhydKa
zmdqthtjIekRUSxhiHJve70D9ngveCBSKQDp0Us2BWWxdnM0+HV6ozjuxO0julCH
5tW4413fz2AoZJhWkTn3PE4nPG3apRCnL2B+jTFHHqCjKSkkrNDRJDOEUwasXjV5
jn25DLhOq5ltkMrLFDTV/h2RZqU0fAMV2iwNSkjD3lVLgKt6B3/uSnvE9SXmaJjs
njk/1LzeWwY+sk7YYXouPQ2STEDCKvOJGYZSS5pFA03mVaQgfuJxpyHKH+7nj9tV
k0FLDLMmSucYIWBq0iapa8cR69e0ZIE48hSNR3AOIIOVh3LusmA4HkogOAQG7kdZ
P2nKQUdN+SR8rL9KQRauP63J508fg0kkXNgSAm1lFWBDnFKt6shkkHGcL+5PzxJW
1Bjx2wc52Ww84A==
=hhv+
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A set of perf fixes:
- Correct the permission checks for perf event which send SIGTRAP to
a different process and clean up that code to be more readable.
- Prevent an out of bound MSR access in the x86 perf code which
happened due to an incomplete limiting to the actually available
hardware counters.
- Prevent access to the AMD64_EVENTSEL_HOSTONLY bit when running
inside a guest.
- Handle small core counter re-enabling correctly by issuing an ACK
right before reenabling it to prevent a stale PEBS record being
kept around"
* tag 'perf-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Apply mid ACK for small core
perf/x86/amd: Don't touch the AMD64_EVENTSEL_HOSTONLY bit inside the guest
perf/x86: Fix out of bound MSR access
perf: Refactor permissions check into perf_check_permission()
perf: Fix required permissions if sigtrap is requested
Daniel Borkmann says:
====================
pull-request: bpf 2021-08-07
The following pull-request contains BPF updates for your *net* tree.
We've added 4 non-merge commits during the last 9 day(s) which contain
a total of 4 files changed, 8 insertions(+), 7 deletions(-).
The main changes are:
1) Fix integer overflow in htab's lookup + delete batch op, from Tatsuhiko Yasumatsu.
2) Fix invalid fd 0 close in libbpf if BTF parsing failed, from Daniel Xu.
3) Fix libbpf feature probe for BPF_PROG_TYPE_CGROUP_SOCKOPT, from Robin Gögge.
4) Fix minor libbpf doc warning regarding code-block language, from Randy Dunlap.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In __htab_map_lookup_and_delete_batch(), hash buckets are iterated
over to count the number of elements in each bucket (bucket_size).
If bucket_size is large enough, the multiplication to calculate
kvmalloc() size could overflow, resulting in out-of-bounds write
as reported by KASAN:
[...]
[ 104.986052] BUG: KASAN: vmalloc-out-of-bounds in __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.986489] Write of size 4194224 at addr ffffc9010503be70 by task crash/112
[ 104.986889]
[ 104.987193] CPU: 0 PID: 112 Comm: crash Not tainted 5.14.0-rc4 #13
[ 104.987552] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 104.988104] Call Trace:
[ 104.988410] dump_stack_lvl+0x34/0x44
[ 104.988706] print_address_description.constprop.0+0x21/0x140
[ 104.988991] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.989327] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.989622] kasan_report.cold+0x7f/0x11b
[ 104.989881] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.990239] kasan_check_range+0x17c/0x1e0
[ 104.990467] memcpy+0x39/0x60
[ 104.990670] __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.990982] ? __wake_up_common+0x4d/0x230
[ 104.991256] ? htab_of_map_free+0x130/0x130
[ 104.991541] bpf_map_do_batch+0x1fb/0x220
[...]
In hashtable, if the elements' keys have the same jhash() value, the
elements will be put into the same bucket. By putting a lot of elements
into a single bucket, the value of bucket_size can be increased to
trigger the integer overflow.
Triggering the overflow is possible for both callers with CAP_SYS_ADMIN
and callers without CAP_SYS_ADMIN.
It will be trivial for a caller with CAP_SYS_ADMIN to intentionally
reach this overflow by enabling BPF_F_ZERO_SEED. As this flag will set
the random seed passed to jhash() to 0, it will be easy for the caller
to prepare keys which will be hashed into the same value, and thus put
all the elements into the same bucket.
If the caller does not have CAP_SYS_ADMIN, BPF_F_ZERO_SEED cannot be
used. However, it will be still technically possible to trigger the
overflow, by guessing the random seed value passed to jhash() (32bit)
and repeating the attempt to trigger the overflow. In this case,
the probability to trigger the overflow will be low and will take
a very long time.
Fix the integer overflow by calling kvmalloc_array() instead of
kvmalloc() to allocate memory.
Fixes: 057996380a ("bpf: Add batch ops to all htab bpf map")
Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210806150419.109658-1-th.yasumatsu@gmail.com
As callbacks to a tracepoint are paired with the data that is passed in when
the callback is registered to the tracepoint, it must have that data passed
to the callback when the tracepoint is triggered, else bad things will
happen. To keep the two together, they are both assigned to a tracepoint
structure and added to an array. The tracepoint call site will dereference
the structure (via RCU) and call the callback in that structure along with
the data in that structure. This keeps the callback and data tightly
coupled.
Because of the overhead that retpolines have on tracepoint callbacks, if
there's only one callback attached to a tracepoint (a common case), then it
is called via a static call (code modified to do a direct call instead of an
indirect call). But to implement this, the data had to be decoupled from the
callback, as now the callback is implemented via a direct call from the
static call and not an indirect call from the dereferenced structure.
Note, the static call only calls a callback used when there's a single
callback attached to the tracepoint. If more than one callback is attached
to the same tracepoint, then the static call will call an iterator
function that goes back to dereferencing the structure keeping the callback
and its data tightly coupled again.
Issues can arise when going from 0 callbacks to one, as the static call is
assigned to the callback, and it must take care that the data passed to it
is loaded before the static call calls the callback. Going from 1 to 2
callbacks is not an issue, as long as the static call is updated to the
iterator before the tracepoint structure array is updated via RCU. Going
from 2 to more or back down to 2 is not an issue as the iterator can handle
all theses cases. But going from 2 to 1, care must be taken as the static
call is now calling a callback and the data that is loaded must be the data
for that callback.
Care was taken to ensure the callback and data would be in-sync, but after
a bug was reported, it became clear that not enough was done to make sure
that was the case. These changes address this.
The first change is to compare the old and new data instead of the old and
new callback, as it's the data that can corrupt the callback, even if the
callback is the same (something getting freed).
The next change is to convert these transitions into states, to make it
easier to know when a synchronization is needed, and to perform those
synchronizations. The problem with this patch is that it slows down
disabling all events from under a second, to making it take over 10 seconds
to do the same work. But that is addressed in the final patch.
The final patch uses the RCU state functions to keep track of the RCU state
between the transitions, and only needs to perform the synchronization if an
RCU synchronization hasn't been done already. This brings the performance of
disabling all events back to its original value. That's because no
synchronization is required between disabling tracepoints but is required
when enabling a tracepoint after its been disabled. If an RCU
synchronization happens after the tracepoint is disabled, and before it is
re-enabled, there's no need to do the synchronization again.
Both the second and third patch have subtle complexities that they are
separated into two patches. But because the second patch causes such a
regression in performance, the third patch adds a "Fixes" tag to the second
patch, such that the two must be backported together and not just the second
patch.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYQ15TBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnmmAP4hoA34CDr5hrd8mYLeKptW63f5Nd1w
fVZjprfa1wJhZAEAq39OeRCT4Fb2hIeZNBNUnLU90f+J6NH5QFDEhW+CkAI=
=JcZS
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Fix tracepoint race between static_call and callback data
As callbacks to a tracepoint are paired with the data that is passed
in when the callback is registered to the tracepoint, it must have
that data passed to the callback when the tracepoint is triggered,
else bad things will happen. To keep the two together, they are both
assigned to a tracepoint structure and added to an array. The
tracepoint call site will dereference the structure (via RCU) and call
the callback in that structure along with the data in that structure.
This keeps the callback and data tightly coupled.
Because of the overhead that retpolines have on tracepoint callbacks,
if there's only one callback attached to a tracepoint (a common case),
then it is called via a static call (code modified to do a direct call
instead of an indirect call). But to implement this, the data had to
be decoupled from the callback, as now the callback is implemented via
a direct call from the static call and not an indirect call from the
dereferenced structure.
Note, the static call only calls a callback used when there's a single
callback attached to the tracepoint. If more than one callback is
attached to the same tracepoint, then the static call will call an
iterator function that goes back to dereferencing the structure
keeping the callback and its data tightly coupled again.
Issues can arise when going from 0 callbacks to one, as the static
call is assigned to the callback, and it must take care that the data
passed to it is loaded before the static call calls the callback.
Going from 1 to 2 callbacks is not an issue, as long as the static
call is updated to the iterator before the tracepoint structure array
is updated via RCU. Going from 2 to more or back down to 2 is not an
issue as the iterator can handle all theses cases. But going from 2 to
1, care must be taken as the static call is now calling a callback and
the data that is loaded must be the data for that callback.
Care was taken to ensure the callback and data would be in-sync, but
after a bug was reported, it became clear that not enough was done to
make sure that was the case. These changes address this.
The first change is to compare the old and new data instead of the old
and new callback, as it's the data that can corrupt the callback, even
if the callback is the same (something getting freed).
The next change is to convert these transitions into states, to make
it easier to know when a synchronization is needed, and to perform
those synchronizations. The problem with this patch is that it slows
down disabling all events from under a second, to making it take over
10 seconds to do the same work. But that is addressed in the final
patch.
The final patch uses the RCU state functions to keep track of the RCU
state between the transitions, and only needs to perform the
synchronization if an RCU synchronization hasn't been done already.
This brings the performance of disabling all events back to its
original value. That's because no synchronization is required between
disabling tracepoints but is required when enabling a tracepoint after
its been disabled. If an RCU synchronization happens after the
tracepoint is disabled, and before it is re-enabled, there's no need
to do the synchronization again.
Both the second and third patch have subtle complexities that they are
separated into two patches. But because the second patch causes such a
regression in performance, the third patch adds a "Fixes" tag to the
second patch, such that the two must be backported together and not
just the second patch"
* tag 'trace-v5.14-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoint: Use rcu get state and cond sync for static call updates
tracepoint: Fix static call function vs data state mismatch
tracepoint: static call: Compare data on transition from 2->1 callees
State transitions from 1->0->1 and N->2->1 callbacks require RCU
synchronization. Rather than performing the RCU synchronization every
time the state change occurs, which is quite slow when many tracepoints
are registered in batch, instead keep a snapshot of the RCU state on the
most recent transitions which belong to a chain, and conditionally wait
for a grace period on the last transition of the chain if one g.p. has
not elapsed since the last snapshot.
This applies to both RCU and SRCU.
This brings the performance regression caused by commit 231264d692
("Fix: tracepoint: static call function vs data state mismatch") back to
what it was originally.
Before this commit:
# trace-cmd start -e all
# time trace-cmd start -p nop
real 0m10.593s
user 0m0.017s
sys 0m0.259s
After this commit:
# trace-cmd start -e all
# time trace-cmd start -p nop
real 0m0.878s
user 0m0.000s
sys 0m0.103s
Link: https://lkml.kernel.org/r/20210805192954.30688-1-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: 231264d692 ("Fix: tracepoint: static call function vs data state mismatch")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Build failure in drivers/net/wwan/mhi_wwan_mbim.c:
add missing parameter (0, assuming we don't want buffer pre-alloc).
Conflict in drivers/net/dsa/sja1105/sja1105_main.c between:
589918df93 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
0fac6aa098 ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode")
Follow the instructions from the commit message of the former commit
- removed the if conditions. When looking at commit 589918df93 ("net:
dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
note that the mask_iotag fields get removed by the following patch.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On a 1->0->1 callbacks transition, there is an issue with the new
callback using the old callback's data.
Considering __DO_TRACE_CALL:
do { \
struct tracepoint_func *it_func_ptr; \
void *__data; \
it_func_ptr = \
rcu_dereference_raw((&__tracepoint_##name)->funcs); \
if (it_func_ptr) { \
__data = (it_func_ptr)->data; \
----> [ delayed here on one CPU (e.g. vcpu preempted by the host) ]
static_call(tp_func_##name)(__data, args); \
} \
} while (0)
It has loaded the tp->funcs of the old callback, so it will try to use the old
data. This can be fixed by adding a RCU sync anywhere in the 1->0->1
transition chain.
On a N->2->1 transition, we need an rcu-sync because you may have a
sequence of 3->2->1 (or 1->2->1) where the element 0 data is unchanged
between 2->1, but was changed from 3->2 (or from 1->2), which may be
observed by the static call. This can be fixed by adding an
unconditional RCU sync in transition 2->1.
Note, this fixes a correctness issue at the cost of adding a tremendous
performance regression to the disabling of tracepoints.
Before this commit:
# trace-cmd start -e all
# time trace-cmd start -p nop
real 0m0.778s
user 0m0.000s
sys 0m0.061s
After this commit:
# trace-cmd start -e all
# time trace-cmd start -p nop
real 0m10.593s
user 0m0.017s
sys 0m0.259s
A follow up fix will introduce a more lightweight scheme based on RCU
get_state and cond_sync, that will return the performance back to what it
was. As both this change and the lightweight versions are complex on their
own, for bisecting any issues that this may cause, they are kept as two
separate changes.
Link: https://lkml.kernel.org/r/20210805132717.23813-3-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: d25e37d89d ("tracepoint: Optimize using static_call()")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
On transition from 2->1 callees, we should be comparing .data rather
than .func, because the same callback can be registered twice with
different data, and what we care about here is that the data of array
element 0 is unchanged to skip rcu sync.
Link: https://lkml.kernel.org/r/20210805132717.23813-2-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: 547305a646 ("tracepoint: Fix out of sync data passing by static caller")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull ucounts fix from Eric Biederman:
"Fix a subtle locking versus reference counting bug in the ucount
changes, found by syzbot"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: Fix race condition between alloc_ucounts and put_ucounts
- Fix NULL pointer dereference caused by an error path
- Give histogram calculation fields a size, otherwise it breaks synthetic
creation based on them.
- Reject strings being used for number calculations.
- Fix recordmcount.pl warning on llvm building RISC-V allmodconfig
- Fix the draw_functrace.py script to handle the new trace output
- Fix warning of smp_processor_id() in preemptible code
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYQwR+xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtHOAQD7gBn1cRK0T3Eolf5HRd14PLDVUZ1B
iMZuTJZzJUWLSAD/ec3ezcOafNlPKmG1ta8UxrWP5VzHOC5qTIAJYc1d5AA=
=7FNB
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various tracing fixes:
- Fix NULL pointer dereference caused by an error path
- Give histogram calculation fields a size, otherwise it breaks
synthetic creation based on them.
- Reject strings being used for number calculations.
- Fix recordmcount.pl warning on llvm building RISC-V allmodconfig
- Fix the draw_functrace.py script to handle the new trace output
- Fix warning of smp_processor_id() in preemptible code"
* tag 'trace-v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Quiet smp_processor_id() use in preemptable warning in hwlat
scripts/tracing: fix the bug that can't parse raw_trace_func
scripts/recordmcount.pl: Remove check_objcopy() and $can_use_local
tracing: Reject string operand in the histogram expression
tracing / histogram: Give calculation hist_fields a size
tracing: Fix NULL pointer dereference in start_creating
The hardware latency detector (hwlat) has a mode that it runs one thread
across CPUs. The logic to move from the currently running CPU to the next
one in the list does a smp_processor_id() to find where it currently is.
Unfortunately, it's done with preemption enabled, and this triggers a
warning for using smp_processor_id() in a preempt enabled section.
As it is only using smp_processor_id() to get information on where it
currently is in order to simply move it to the next CPU, it doesn't really
care if it got moved in the mean time. It will simply balance out later if
such a case arises.
Switch smp_processor_id() to raw_smp_processor_id() to quiet that warning.
Link: https://lkml.kernel.org/r/20210804141848.79edadc0@oasis.local.home
Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Fixes: 8fa826b734 ("trace/hwlat: Implement the mode config option")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since the string type can not be the target of the addition / subtraction
operation, it must be rejected. Without this fix, the string type silently
converted to digits.
Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 100719dcef ("tracing: Add simple expression support to hist triggers")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When working on my user space applications, I found a bug in the synthetic
event code where the automated synthetic event field was not matching the
event field calculation it was attached to. Looking deeper into it, it was
because the calculation hist_field was not given a size.
The synthetic event fields are matched to their hist_fields either by
having the field have an identical string type, or if that does not match,
then the size and signed values are used to match the fields.
The problem arose when I tried to match a calculation where the fields
were "unsigned int". My tool created a synthetic event of type "u32". But
it failed to match. The string was:
diff=field1-field2:onmatch(event).trace(synth,$diff)
Adding debugging into the kernel, I found that the size of "diff" was 0.
And since it was given "unsigned int" as a type, the histogram fallback
code used size and signed. The signed matched, but the size of u32 (4) did
not match zero, and the event failed to be created.
This can be worse if the field you want to match is not one of the
acceptable fields for a synthetic event. As event fields can have any type
that is supported in Linux, this can cause an issue. For example, if a
type is an enum. Then there's no way to use that with any calculations.
Have the calculation field simply take on the size of what it is
calculating.
Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 100719dcef ("tracing: Add simple expression support to hist triggers")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Double enqueues in rt runqueues (list) have been reported while running
a simple test that spawns a number of threads doing a short sleep/run
pattern while being concurrently setscheduled between rt and fair class.
WARNING: CPU: 3 PID: 2825 at kernel/sched/rt.c:1294 enqueue_task_rt+0x355/0x360
CPU: 3 PID: 2825 Comm: setsched__13
RIP: 0010:enqueue_task_rt+0x355/0x360
Call Trace:
__sched_setscheduler+0x581/0x9d0
_sched_setscheduler+0x63/0xa0
do_sched_setscheduler+0xa0/0x150
__x64_sys_sched_setscheduler+0x1a/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
list_add double add: new=ffff9867cb629b40, prev=ffff9867cb629b40,
next=ffff98679fc67ca0.
kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
CPU: 3 PID: 2825 Comm: setsched__13
RIP: 0010:__list_add_valid+0x41/0x50
Call Trace:
enqueue_task_rt+0x291/0x360
__sched_setscheduler+0x581/0x9d0
_sched_setscheduler+0x63/0xa0
do_sched_setscheduler+0xa0/0x150
__x64_sys_sched_setscheduler+0x1a/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
__sched_setscheduler() uses rt_effective_prio() to handle proper queuing
of priority boosted tasks that are setscheduled while being boosted.
rt_effective_prio() is however called twice per each
__sched_setscheduler() call: first directly by __sched_setscheduler()
before dequeuing the task and then by __setscheduler() to actually do
the priority change. If the priority of the pi_top_task is concurrently
being changed however, it might happen that the two calls return
different results. If, for example, the first call returned the same rt
priority the task was running at and the second one a fair priority, the
task won't be removed by the rt list (on_list still set) and then
enqueued in the fair runqueue. When eventually setscheduled back to rt
it will be seen as enqueued already and the WARNING/BUG be issued.
Fix this by calling rt_effective_prio() only once and then reusing the
return value. While at it refactor code as well for clarity. Concurrent
priority inheritance handling is still safe and will eventually converge
to a new state by following the inheritance chain(s).
Fixes: 0782e63bc6 ("sched: Handle priority boosted tasks proper in setscheduler()")
[squashed Peterz changes; added changelog]
Reported-by: Mark Simmons <msimmons@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210803104501.38333-1-juri.lelli@redhat.com
Before, the interpreter allowed up to MAX_TAIL_CALL_CNT + 1 tail calls.
Now precisely MAX_TAIL_CALL_CNT is allowed, which is in line with the
behavior of the x86 JITs.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210728164741.350370-1-johan.almbladh@anyfinetworks.com