Allow to save stacktraces into a histogram variable. This will be used by
synthetic events to allow a stacktrace from one event to be passed and
displayed by another event.
The special keyword "stacktrace" is to be used to trigger a stack
trace for the event that the histogram trigger is attached to.
echo 'hist:keys=pid:st=stacktrace" > events/sched/sched_waking/trigger
Currently nothing can get access to the "$st" variable above that contains
the stack trace, but that will soon change.
Link: https://lkml.kernel.org/r/20230117152235.856323729@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When tracing a dynamic string field for a synthetic event, the offset
calculation for where to write the next event can use struct_size() to
find what the current size of the structure is.
This simplifies the code and makes it less error prone.
Link: https://lkml.kernel.org/r/20230117152235.698632147@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In a previous commit 7433632c9f, buffer, buffer->buffers and
buffer->buffers[cpu] in ring_buffer_wake_waiters() can be NULL,
and thus the related checks are added.
However, in the same call stack, these variables are also used in
ring_buffer_free_read_page():
tracing_buffers_release()
ring_buffer_wake_waiters(iter->array_buffer->buffer)
cpu_buffer = buffer->buffers[cpu] -> Add checks by previous commit
ring_buffer_free_read_page(iter->array_buffer->buffer)
cpu_buffer = buffer->buffers[cpu] -> No check
Thus, to avod possible null-pointer derefernces, the related checks
should be added.
These results are reported by a static tool designed by myself.
Link: https://lkml.kernel.org/r/20230113125501.760324-1-baijiaju1990@gmail.com
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There's been several times where an event records a function address in
its field and I needed to filter on that address for a specific function
name. It required looking up the function in kallsyms, finding its size,
and doing a compare of "field >= function_start && field < function_end".
But this would change from boot to boot and is unreliable in scripts.
Also, it is useful to have this at boot up, where the addresses will not
be known. For example, on the boot command line:
trace_trigger="initcall_finish.traceoff if func.function == acpi_init"
To implement this, add a ".function" prefix, that will check that the
field is of size long, and the only operations allowed (so far) are "=="
and "!=".
Link: https://lkml.kernel.org/r/20221219183213.916833763@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The pointer ptr is being initialized with a value that is never read,
it is being updated later on a call to strim. Remove the extraneous
initialization.
Link: https://lkml.kernel.org/r/20230116161612.77192-1-colin.i.king@gmail.com
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use the 'struct' keyword for a struct's kernel-doc notation and
use the correct function parameter name to eliminate kernel-doc
warnings:
kernel/trace/trace_events_filter.c:136: warning: cannot understand function prototype: 'struct prog_entry '
kerne/trace/trace_events_filter.c:155: warning: Excess function parameter 'when_to_branch' description in 'update_preds'
Also correct some trivial punctuation problems.
Link: https://lkml.kernel.org/r/20230108021238.16398-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Function 'create_hist_field' is called recursively at
trace_events_hist.c:1954 and can return NULL-value that's why we have
to check it to avoid null pointer dereference.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Link: https://lkml.kernel.org/r/20230111120409.4111-1-n.petrova@fintech.ru
Cc: stable@vger.kernel.org
Fixes: 30350d65ac ("tracing: Add variable support to hist triggers")
Signed-off-by: Natalia Petrova <n.petrova@fintech.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence false lockdep
warning when CONFIG_PROVE_RCU_LIST is enabled.
Execute as follow:
[tracing]# echo osnoise > current_tracer
[tracing]# echo 1 > tracing_on
[tracing]# echo 0 > tracing_on
The trace_types_lock is held when osnoise_tracer_stop() or
timerlat_tracer_stop() are called in the non-RCU read side section.
So, pass lockdep_is_held(&trace_types_lock) to silence false lockdep
warning.
Link: https://lkml.kernel.org/r/20221227023036.784337-1-nashuiliang@gmail.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: dae181349f ("tracing/osnoise: Support a list of trace_array *tr")
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently trace_printk() can be used as soon as early_trace_init() is
called from start_kernel(). But if a crash happens, and
"ftrace_dump_on_oops" is set on the kernel command line, all you get will
be:
[ 0.456075] <idle>-0 0dN.2. 347519us : Unknown type 6
[ 0.456075] <idle>-0 0dN.2. 353141us : Unknown type 6
[ 0.456075] <idle>-0 0dN.2. 358684us : Unknown type 6
This is because the trace_printk() event (type 6) hasn't been registered
yet. That gets done via an early_initcall(), which may be early, but not
early enough.
Instead of registering the trace_printk() event (and other ftrace events,
which are not trace events) via an early_initcall(), have them registered at
the same time that trace_printk() can be used. This way, if there is a
crash before early_initcall(), then the trace_printk()s will actually be
useful.
Link: https://lkml.kernel.org/r/20230104161412.019f6c55@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: e725c731e3 ("tracing: Split tracing initialization into two for early initialization")
Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Setting filters on an ftrace ops results in some memory being allocated
for the filter hashes, which must be freed before the ops can be freed.
This can be done by removing every individual element of the hash by
calling ftrace_set_filter_ip() or ftrace_set_filter_ips() with `remove`
set, but this is somewhat error prone as it's easy to forget to remove
an element.
Make it easier to clean this up by exporting ftrace_free_filter(), which
can be used to clean up all of the filter hashes after an ftrace_ops has
been unregistered.
Using this, fix the ftrace-direct* samples to free hashes prior to being
unloaded. All other code either removes individual filters explicitly or
is built-in and already calls ftrace_free_filter().
Link: https://lkml.kernel.org/r/20230103124912.2948963-3-mark.rutland@arm.com
Cc: stable@vger.kernel.org
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: e1067a07cf ("ftrace/samples: Add module to test multi direct modify interface")
Fixes: 5fae941b9a ("ftrace/samples: Add multi direct interface test module")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Architectures without dynamic ftrace trampolines incur an overhead when
multiple ftrace_ops are enabled with distinct filters. in these cases,
each call site calls a common trampoline which uses
ftrace_ops_list_func() to iterate over all enabled ftrace functions, and
so incurs an overhead relative to the size of this list (including RCU
protection overhead).
Architectures with dynamic ftrace trampolines avoid this overhead for
call sites which have a single associated ftrace_ops. In these cases,
the dynamic trampoline is customized to branch directly to the relevant
ftrace function, avoiding the list overhead.
On some architectures it's impractical and/or undesirable to implement
dynamic ftrace trampolines. For example, arm64 has limited branch ranges
and cannot always directly branch from a call site to an arbitrary
address (e.g. from a kernel text address to an arbitrary module
address). Calls from modules to core kernel text can be indirected via
PLTs (allocated at module load time) to address this, but the same is
not possible from calls from core kernel text.
Using an indirect branch from a call site to an arbitrary trampoline is
possible, but requires several more instructions in the function
prologue (or immediately before it), and/or comes with far more complex
requirements for patching.
Instead, this patch adds a new option, where an architecture can
associate each call site with a pointer to an ftrace_ops, placed at a
fixed offset from the call site. A shared trampoline can recover this
pointer and call ftrace_ops::func() without needing to go via
ftrace_ops_list_func(), avoiding the associated overhead.
This avoids issues with branch range limitations, and avoids the need to
allocate and manipulate dynamic trampolines, making it far simpler to
implement and maintain, while having similar performance
characteristics.
Note that this allows for dynamic ftrace_ops to be invoked directly from
an architecture's ftrace_caller trampoline, whereas existing code forces
the use of ftrace_ops_get_list_func(), which is in part necessary to
permit the ftrace_ops to be freed once unregistered *and* to avoid
branch/address-generation range limitation on some architectures (e.g.
where ops->func is a module address, and may be outside of the direct
branch range for callsites within the main kernel image).
The CALL_OPS approach avoids this problems and is safe as:
* The existing synchronization in ftrace_shutdown() using
ftrace_shutdown() using synchronize_rcu_tasks_rude() (and
synchronize_rcu_tasks()) ensures that no tasks hold a stale reference
to an ftrace_ops (e.g. in the middle of the ftrace_caller trampoline,
or while invoking ftrace_ops::func), when that ftrace_ops is
unregistered.
Arguably this could also be relied upon for the existing scheme,
permitting dynamic ftrace_ops to be invoked directly when ops->func is
in range, but this will require additional logic to handle branch
range limitations, and is not handled by this patch.
* Each callsite's ftrace_ops pointer literal can hold any valid kernel
address, and is updated atomically. As an architecture's ftrace_caller
trampoline will atomically load the ops pointer then dereference
ops->func, there is no risk of invoking ops->func with a mismatches
ops pointer, and updates to the ops pointer do not require special
care.
A subsequent patch will implement architectures support for arm64. There
should be no functional change as a result of this patch alone.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230123134603.1064407-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
We currently use module_kallsyms_on_each_symbol that iterates all
modules/symbols and we try to lookup each such address in user
provided symbols/addresses to get list of used modules.
This fix instead only iterates provided kprobe addresses and calls
__module_address on each to get list of used modules. This turned
out to be simpler and also bit faster.
On my setup with workload (executed 10 times):
# test_progs -t kprobe_multi_bench_attach/modules
Current code:
Performance counter stats for './test.sh' (5 runs):
76,081,161,596 cycles:k ( +- 0.47% )
18.3867 +- 0.0992 seconds time elapsed ( +- 0.54% )
With the fix:
Performance counter stats for './test.sh' (5 runs):
74,079,889,063 cycles:k ( +- 0.04% )
17.8514 +- 0.0218 seconds time elapsed ( +- 0.12% )
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230116101009.23694-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently we traverse all symbols of all modules to find the specified
function for the specified module. But in reality, we just need to find
the given module and then traverse all the symbols in it.
Let's add a new parameter 'const char *modname' to function
module_kallsyms_on_each_symbol(), then we can compare the module names
directly in this function and call hook 'fn' after matching. If 'modname'
is NULL, the symbols of all modules are still traversed for compatibility
with other usage cases.
Phase1: mod1-->mod2..(subsequent modules do not need to be compared)
|
Phase2: -->f1-->f2-->f3
Assuming that there are m modules, each module has n symbols on average,
then the time complexity is reduced from O(m * n) to O(m) + O(n).
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230116101009.23694-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In current bpf_send_signal() and bpf_send_signal_thread() helper
implementation, irq_work is used to handle nmi context. Hao Sun
reported in [1] that the current task at the entry of the helper
might be gone during irq_work callback processing. To fix the issue,
a reference is acquired for the current task before enqueuing into
the irq_work so that the queued task is still available during
irq_work callback processing.
[1] https://lore.kernel.org/bpf/20230109074425.12556-1-sunhao.th@gmail.com/
Fixes: 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
Tested-by: Hao Sun <sunhao.th@gmail.com>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230118204815.3331855-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When we save the raw_data to the perf sample data, we need to update
the sample flags and the dynamic size. To make sure this is done
consistently, add the perf_sample_save_raw_data() helper and convert
all call sites.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-4-namhyung@kernel.org
Robot reported that trace_hardirqs_{on,off}() tickle the forbidden
_rcuidle() tracepoint through local_irq_{en,dis}able().
For 'sane' configs, these calls will only happen with RCU enabled and
as such can use the regular tracepoint. This also means it's possible
to trace them from NMI context again.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230112195541.477416709@infradead.org
ARCH_WANTS_NO_INSTR (a superset of CONFIG_GENERIC_ENTRY) disallows any
and all tracing when RCU isn't enabled.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195541.416110581@infradead.org
Per commit 56e62a7370 ("s390: convert to generic entry") the last
and only callers of trace_hardirqs_{on,off}_caller() went away, clean
up.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230112195541.355283994@infradead.org
The following kernel panic can be triggered when a task with pid=1 attaches
a prog that attempts to send killing signal to itself, also see [1] for more
details:
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
CPU: 3 PID: 1 Comm: systemd Not tainted 6.1.0-09652-g59fe41b5255f #148
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x100/0x178 lib/dump_stack.c:106
panic+0x2c4/0x60f kernel/panic.c:275
do_exit.cold+0x63/0xe4 kernel/exit.c:789
do_group_exit+0xd4/0x2a0 kernel/exit.c:950
get_signal+0x2460/0x2600 kernel/signal.c:2858
arch_do_signal_or_restart+0x78/0x5d0 arch/x86/kernel/signal.c:306
exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
exit_to_user_mode_prepare+0x15f/0x250 kernel/entry/common.c:203
__syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:296
do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
So skip task with pid=1 in bpf_send_signal_common() to avoid the panic.
[1] https://lore.kernel.org/bpf/20221222043507.33037-1-sunhao.th@gmail.com
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230106084838.12690-1-sunhao.th@gmail.com
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY7X/4wAKCRDbK58LschI
g7gzAQCjKsLtAWg1OplW+B7pvEPwkQ8g3O1+PYWlToCUACTlzQD+PEMrqGnxB573
oQAk6I2yOTwLgvlHkrm+TIdKSouI4gs=
=2hUY
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next 2023-01-04
We've added 45 non-merge commits during the last 21 day(s) which contain
a total of 50 files changed, 1454 insertions(+), 375 deletions(-).
The main changes are:
1) Fixes, improvements and refactoring of parts of BPF verifier's
state equivalence checks, from Andrii Nakryiko.
2) Fix a few corner cases in libbpf's BTF-to-C converter in particular
around padding handling and enums, also from Andrii Nakryiko.
3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better
support decap on GRE tunnel devices not operating in collect metadata,
from Christian Ehrig.
4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks,
from Dave Marchevsky.
5) Remove the need for trace_printk_lock for bpf_trace_printk
and bpf_trace_vprintk helpers, from Jiri Olsa.
6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps,
from Maryam Tahhan.
7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du.
8) Bigger batch of improvements to BPF tracing code samples,
from Daniel T. Lee.
9) Add LoongArch support to libbpf's bpf_tracing helper header,
from Hengqi Chen.
10) Fix a libbpf compiler warning in perf_event_open_probe on arm32,
from Khem Raj.
11) Optimize bpf_local_storage_elem by removing 56 bytes of padding,
from Martin KaFai Lau.
12) Use pkg-config to locate libelf for resolve_btfids build,
from Shen Jiamin.
13) Various libbpf improvements around API documentation and errno
handling, from Xin Liu.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
libbpf: Return -ENODATA for missing btf section
libbpf: Add LoongArch support to bpf_tracing.h
libbpf: Restore errno after pr_warn.
libbpf: Added the description of some API functions
libbpf: Fix invalid return address register in s390
samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs
samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro
samples/bpf: Change _kern suffix to .bpf with syscall tracing program
samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program
samples/bpf: Use kyscall instead of kprobe in syscall tracing program
bpf: rename list_head -> graph_root in field info types
libbpf: fix errno is overwritten after being closed.
bpf: fix regs_exact() logic in regsafe() to remap IDs correctly
bpf: perform byte-by-byte comparison only when necessary in regsafe()
bpf: reject non-exact register type matches in regsafe()
bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule
bpf: reorganize struct bpf_reg_state fields
bpf: teach refsafe() to take into account ID remapping
bpf: Remove unused field initialization in bpf's ctl_table
selftests/bpf: Add jit probe_mem corner case tests to s390x denylist
...
====================
Link: https://lore.kernel.org/r/20230105000926.31350-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Make monitor structures read only
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY6J+vxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qohJAP9Yx3A4xmopkMjpfK1HBzuB7j4U7blN
2NhqKM626unbeQEAi3FhPRc5N/sGBdsUClYZIKau0p3ip1TVfYbhk8vSgwg=
=VcGm
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
"I missed this minor hardening of the kernel in the first pull.
- Make monitor structures read only"
* tag 'trace-v6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv/monitors: Move monitor structure in rodata
- New "symstr" type for dynamic events that writes the name of the
function+offset into the ring buffer and not just the address
- Prevent kernel symbol processing on addresses in user space probes
(uprobes).
- And minor fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY5yAHxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoWoAP9ZLmqgIqlH3Zcms31SR250kLXxsxT3
JHe82hiuI1I3fAD/Z93QLHw9wngLqIMx/wXsdFjTNOGGWdxfclSWI2qI6Q0=
=KaJg
-----END PGP SIGNATURE-----
Merge tag 'trace-probes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull trace probes updates from Steven Rostedt:
- New "symstr" type for dynamic events that writes the name of the
function+offset into the ring buffer and not just the address
- Prevent kernel symbol processing on addresses in user space probes
(uprobes).
- And minor fixes and clean ups
* tag 'trace-probes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Reject symbol/symstr type for uprobe
tracing/probes: Add symstr type for dynamic events
kprobes: kretprobe events missing on 2-core KVM guest
kprobes: Fix check for probe enabled in kill_kprobe()
test_kprobes: Fix implicit declaration error of test_kprobes
tracing: Fix race where eprobes can be called before the event
It makes sense to move the important monitor structure into rodata to
prevent accidental structure modification.
Link: https://lkml.kernel.org/r/20221122173648.4732-1-acarmina@redhat.com
Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Both bpf_trace_printk and bpf_trace_vprintk helpers use static buffer guarded
with trace_printk_lock spin lock.
The spin lock contention causes issues with bpf programs attached to
contention_begin tracepoint [1][2].
Andrii suggested we could get rid of the contention by using trylock, but we
could actually get rid of the spinlock completely by using percpu buffers the
same way as for bin_args in bpf_bprintf_prepare function.
Adding new return 'buf' argument to struct bpf_bprintf_data and making
bpf_bprintf_prepare to return also the buffer for printk helpers.
[1] https://lore.kernel.org/bpf/CACkBjsakT_yWxnSWr4r-0TpPvbKm9-OBmVUhJb7hV3hY8fdCkw@mail.gmail.com/
[2] https://lore.kernel.org/bpf/CACkBjsaCsTovQHFfkqJKto6S4Z8d02ud1D7MPESrHa1cVNNTrw@mail.gmail.com/
Reported-by: Hao Sun <sunhao.th@gmail.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221215214430.1336195-4-jolsa@kernel.org
Currently we always cleanup/decrement bpf_bprintf_nest_level variable
in bpf_bprintf_cleanup if it's > 0.
There's possible scenario where this could cause a problem, when
bpf_bprintf_prepare does not get bin_args buffer (because num_args is 0)
and following bpf_bprintf_cleanup call decrements bpf_bprintf_nest_level
variable, like:
in task context:
bpf_bprintf_prepare(num_args != 0) increments 'bpf_bprintf_nest_level = 1'
-> first irq :
bpf_bprintf_prepare(num_args == 0)
bpf_bprintf_cleanup decrements 'bpf_bprintf_nest_level = 0'
-> second irq:
bpf_bprintf_prepare(num_args != 0) bpf_bprintf_nest_level = 1
gets same buffer as task context above
Adding check to bpf_bprintf_cleanup and doing the real cleanup only if we
got bin_args data in the first place.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221215214430.1336195-3-jolsa@kernel.org
Adding struct bpf_bprintf_data to hold bin_args argument for
bpf_bprintf_prepare function.
We will add another return argument to bpf_bprintf_prepare and
pass the struct to bpf_bprintf_cleanup for proper cleanup in
following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221215214430.1336195-2-jolsa@kernel.org
- Add powerpc qspinlock implementation optimised for large system scalability and
paravirt. See the merge message for more details.
- Enable objtool to be built on powerpc to generate mcount locations.
- Use a temporary mm for code patching with the Radix MMU, so the writable mapping is
restricted to the patching CPU.
- Add an option to build the 64-bit big-endian kernel with the ELFv2 ABI.
- Sanitise user registers on interrupt entry on 64-bit Book3S.
- Many other small features and fixes.
Thanks to: Aboorva Devarajan, Angel Iglesias, Benjamin Gray, Bjorn Helgaas, Bo Liu, Chen
Lifu, Christoph Hellwig, Christophe JAILLET, Christophe Leroy, Christopher M. Riedl, Colin
Ian King, Deming Wang, Disha Goel, Dmitry Torokhov, Finn Thain, Geert Uytterhoeven,
Gustavo A. R. Silva, Haowen Bai, Joel Stanley, Jordan Niethe, Julia Lawall, Kajol Jain,
Laurent Dufour, Li zeming, Miaoqian Lin, Michael Jeanson, Nathan Lynch, Naveen N. Rao,
Nayna Jain, Nicholas Miehlbradt, Nicholas Piggin, Pali Rohár, Randy Dunlap, Rohan McLure,
Russell Currey, Sathvika Vasireddy, Shaomin Deng, Stephen Kitt, Stephen Rothwell, Thomas
Weißschuh, Tiezhu Yang, Uwe Kleine-König, Xie Shaowen, Xiu Jianfeng, XueBing Chen, Yang
Yingliang, Zhang Jiaming, ruanjinjie, Jessica Yu, Wolfram Sang.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmOfrj8THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgIWtD/9mGF/ze2k+qFTo+30fb7bO8WJIDgsR
dIASnZjXV7q/45elvymhUdkQv4R7xL3pzC40P1+ZKtWzGTNe+zWUQLoALNwRK85j
8CsxZbqefGNKE5Z6ZHo9s37wsu3+jJu9yEQpGFo1LINyzeclCn5St5oqfRam+Hd/
cPF+VfvREwZ0+YOKGBhJ2EgC+Gc9xsFY7DLQsoYlu71iZZr6Z6rgZW/EY5h3RMGS
YKBoVwDsWaU0FpFWrr/rYTI6DqSr3AHr1+ftDg7ncCZMD6vQva6aMCCt94aLB1aE
vC+DNdhZlA558bXGa5yA7Wr//7aUBUIwyC60DogOeZ6vw3kD9tdEd1fbH5hmqNKY
K5bfqm28XU2959CTE8RDgsYYZvwDcfrjBIML14WZGdCQOTcGKpgOGp22o6yNb1Pq
JKpHHnVpvu2PZ/p2XdKSm9+etr2yI6lXZAEVTS7ehdtMukButjSHEVbSCEZ8tlWz
KokQt2J23BMHuSrXK6+67wWQBtdsLEk+LBOQmweiwarMocqvL/Zjz/5J7DR2DtH8
wlY3wOtB1+E5j7xZ+RgK3c3jNg5dH39ZwvFsSATWTI3P+iq6OK/bbk4q4LmZt2l9
ZIfH/CXPf9BvGCHzHa3AAd3UBbJLFwj17btMEv1wFVPS0T4LPUzkgTNTNUYeP6zL
h1e5QfgUxvKPuQ==
=7k3p
-----END PGP SIGNATURE-----
Merge tag 'powerpc-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Add powerpc qspinlock implementation optimised for large system
scalability and paravirt. See the merge message for more details
- Enable objtool to be built on powerpc to generate mcount locations
- Use a temporary mm for code patching with the Radix MMU, so the
writable mapping is restricted to the patching CPU
- Add an option to build the 64-bit big-endian kernel with the ELFv2
ABI
- Sanitise user registers on interrupt entry on 64-bit Book3S
- Many other small features and fixes
Thanks to Aboorva Devarajan, Angel Iglesias, Benjamin Gray, Bjorn
Helgaas, Bo Liu, Chen Lifu, Christoph Hellwig, Christophe JAILLET,
Christophe Leroy, Christopher M. Riedl, Colin Ian King, Deming Wang,
Disha Goel, Dmitry Torokhov, Finn Thain, Geert Uytterhoeven, Gustavo A.
R. Silva, Haowen Bai, Joel Stanley, Jordan Niethe, Julia Lawall, Kajol
Jain, Laurent Dufour, Li zeming, Miaoqian Lin, Michael Jeanson, Nathan
Lynch, Naveen N. Rao, Nayna Jain, Nicholas Miehlbradt, Nicholas Piggin,
Pali Rohár, Randy Dunlap, Rohan McLure, Russell Currey, Sathvika
Vasireddy, Shaomin Deng, Stephen Kitt, Stephen Rothwell, Thomas
Weißschuh, Tiezhu Yang, Uwe Kleine-König, Xie Shaowen, Xiu Jianfeng,
XueBing Chen, Yang Yingliang, Zhang Jiaming, ruanjinjie, Jessica Yu,
and Wolfram Sang.
* tag 'powerpc-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (181 commits)
powerpc/code-patching: Fix oops with DEBUG_VM enabled
powerpc/qspinlock: Fix 32-bit build
powerpc/prom: Fix 32-bit build
powerpc/rtas: mandate RTAS syscall filtering
powerpc/rtas: define pr_fmt and convert printk call sites
powerpc/rtas: clean up includes
powerpc/rtas: clean up rtas_error_log_max initialization
powerpc/pseries/eeh: use correct API for error log size
powerpc/rtas: avoid scheduling in rtas_os_term()
powerpc/rtas: avoid device tree lookups in rtas_os_term()
powerpc/rtasd: use correct OF API for event scan rate
powerpc/rtas: document rtas_call()
powerpc/pseries: unregister VPA when hot unplugging a CPU
powerpc/pseries: reset the RCU watchdogs after a LPM
powerpc: Take in account addition CPU node when building kexec FDT
powerpc: export the CPU node count
powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state
powerpc/dts/fsl: Fix pca954x i2c-mux node names
cxl: Remove unnecessary cxl_pci_window_alignment()
selftests/powerpc: Fix resource leaks
...
- Add options to the osnoise tracer
o panic_on_stop option that panics the kernel if osnoise is greater than some
user defined threshold.
o preempt option, to test noise while preemption is disabled
o irq option, to test noise when interrupts are disabled
- Add .percent and .graph suffix to histograms to give different outputs
- Add nohitcount to disable showing hitcount in histogram output
- Add new __cpumask() to trace event fields to annotate that a unsigned long
array is a cpumask to user space and should be treated as one.
- Add trace_trigger kernel command line parameter to enable trace event
triggers at boot up. Useful to trace stack traces, disable tracing and take
snapshots.
- Fix x86/kmmio mmio tracer to work with the updates to lockdep
- Unify the panic and die notifiers
- Add back ftrace_expect reference that is used to extract more information in
the ftrace_bug() code.
- Have trigger filter parsing errors show up in the tracing error log.
- Updated MAINTAINERS file to add kernel tracing mailing list and patchwork
info
- Use IDA to keep track of event type numbers.
- And minor fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY5vIcxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlO7AQCmtZbriadAR6N7Llj092YXmYfzrxyi
1WS35vhpZsBJ8gEA8j68l+LrgNt51N2gXlTXEHgXzdBgL/TKAPSX4D99GQY=
=z1pe
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Add options to the osnoise tracer:
- 'panic_on_stop' option that panics the kernel if osnoise is
greater than some user defined threshold.
- 'preempt' option, to test noise while preemption is disabled
- 'irq' option, to test noise when interrupts are disabled
- Add .percent and .graph suffix to histograms to give different
outputs
- Add nohitcount to disable showing hitcount in histogram output
- Add new __cpumask() to trace event fields to annotate that a unsigned
long array is a cpumask to user space and should be treated as one.
- Add trace_trigger kernel command line parameter to enable trace event
triggers at boot up. Useful to trace stack traces, disable tracing
and take snapshots.
- Fix x86/kmmio mmio tracer to work with the updates to lockdep
- Unify the panic and die notifiers
- Add back ftrace_expect reference that is used to extract more
information in the ftrace_bug() code.
- Have trigger filter parsing errors show up in the tracing error log.
- Updated MAINTAINERS file to add kernel tracing mailing list and
patchwork info
- Use IDA to keep track of event type numbers.
- And minor fixes and clean ups
* tag 'trace-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (44 commits)
tracing: Fix cpumask() example typo
tracing: Improve panic/die notifiers
ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY kernels
tracing: Do not synchronize freeing of trigger filter on boot up
tracing: Remove pointer (asterisk) and brackets from cpumask_t field
tracing: Have trigger filter parsing errors show up in error_log
x86/mm/kmmio: Remove redundant preempt_disable()
tracing: Fix infinite loop in tracing_read_pipe on overflowed print_trace_line
Documentation/osnoise: Add osnoise/options documentation
tracing/osnoise: Add preempt and/or irq disabled options
tracing/osnoise: Add PANIC_ON_STOP option
Documentation/osnoise: Escape underscore of NO_ prefix
tracing: Fix some checker warnings
tracing/osnoise: Make osnoise_options static
tracing: remove unnecessary trace_trigger ifdef
ring-buffer: Handle resize in early boot up
tracing/hist: Fix issue of losting command info in error_log
tracing: Fix issue of missing one synthetic field
tracing/hist: Fix out-of-bound write on 'action_data.var_ref_idx'
tracing/hist: Fix wrong return value in parse_action_params()
...
Since uprobe's argument must contain the user-space data, that
should not be converted to kernel symbols. Reject if user
specifies these types on uprobe events. e.g.
/sys/kernel/debug/tracing # echo 'p /bin/sh:10 %ax:symbol' >> uprobe_events
sh: write error: Invalid argument
/sys/kernel/debug/tracing # echo 'p /bin/sh:10 %ax:symstr' >> uprobe_events
sh: write error: Invalid argument
/sys/kernel/debug/tracing # cat error_log
[ 1783.134883] trace_uprobe: error: Unknown type is specified
Command: p /bin/sh:10 %ax:symbol
^
[ 1792.201120] trace_uprobe: error: Unknown type is specified
Command: p /bin/sh:10 %ax:symstr
^
Link: https://lore.kernel.org/all/166679931679.1528100.15540755370726009882.stgit@devnote3/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add 'symstr' type for storing the kernel symbol as a string data
instead of the symbol address. This allows us to filter the
events by wildcard symbol name.
e.g.
# echo 'e:wqfunc workqueue.workqueue_execute_start symname=$function:symstr' >> dynamic_events
# cat events/eprobes/wqfunc/format
name: wqfunc
ID: 2110
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:__data_loc char[] symname; offset:8; size:4; signed:1;
print fmt: " symname=\"%s\"", __get_str(symname)
Note that there is already 'symbol' type which just change the
print format (so it still stores the symbol address in the tracing
ring buffer.) On the other hand, 'symstr' type stores the actual
"symbol+offset/size" data as a string.
Link: https://lore.kernel.org/all/166679930847.1528100.4124308529180235965.stgit@devnote3/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
been long in the making. It is a lighterweight software-only fix for
Skylake-based cores where enabling IBRS is a big hammer and causes a
significant performance impact.
What it basically does is, it aligns all kernel functions to 16 bytes
boundary and adds a 16-byte padding before the function, objtool
collects all functions' locations and when the mitigation gets applied,
it patches a call accounting thunk which is used to track the call depth
of the stack at any time.
When that call depth reaches a magical, microarchitecture-specific value
for the Return Stack Buffer, the code stuffs that RSB and avoids its
underflow which could otherwise lead to the Intel variant of Retbleed.
This software-only solution brings a lot of the lost performance back,
as benchmarks suggest:
https://lore.kernel.org/all/20220915111039.092790446@infradead.org/
That page above also contains a lot more detailed explanation of the
whole mechanism
- Implement a new control flow integrity scheme called FineIBT which is
based on the software kCFI implementation and uses hardware IBT support
where present to annotate and track indirect branches using a hash to
validate them
- Other misc fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOZp5EACgkQEsHwGGHe
VUrZFxAAvi/+8L0IYSK4mKJvixGbTFjxN/Swo2JVOfs34LqGUT6JaBc+VUMwZxdb
VMTFIZ3ttkKEodjhxGI7oGev6V8UfhI37SmO2lYKXpQVjXXnMlv/M+Vw3teE38CN
gopi+xtGnT1IeWQ3tc/Tv18pleJ0mh5HKWiW+9KoqgXj0wgF9x4eRYDz1TDCDA/A
iaBzs56j8m/FSykZHnrWZ/MvjKNPdGlfJASUCPeTM2dcrXQGJ93+X2hJctzDte0y
Nuiw6Y0htfFBE7xoJn+sqm5Okr+McoUM18/CCprbgSKYk18iMYm3ZtAi6FUQZS1A
ua4wQCf49loGp15PO61AS5d3OBf5D3q/WihQRbCaJvTVgPp9sWYnWwtcVUuhMllh
ZQtBU9REcVJ/22bH09Q9CjBW0VpKpXHveqQdqRDViLJ6v/iI6EFGmD24SW/VxyRd
73k9MBGrL/dOf1SbEzdsnvcSB3LGzp0Om8o/KzJWOomrVKjBCJy16bwTEsCZEJmP
i406m92GPXeaN1GhTko7vmF0GnkEdJs1GVCZPluCAxxbhHukyxHnrjlQjI4vC80n
Ylc0B3Kvitw7LGJsPqu+/jfNHADC/zhx1qz/30wb5cFmFbN1aRdp3pm8JYUkn+l/
zri2Y6+O89gvE/9/xUhMohzHsWUO7xITiBavewKeTP9GSWybWUs=
=cRy1
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Borislav Petkov:
- Add the call depth tracking mitigation for Retbleed which has been
long in the making. It is a lighterweight software-only fix for
Skylake-based cores where enabling IBRS is a big hammer and causes a
significant performance impact.
What it basically does is, it aligns all kernel functions to 16 bytes
boundary and adds a 16-byte padding before the function, objtool
collects all functions' locations and when the mitigation gets
applied, it patches a call accounting thunk which is used to track
the call depth of the stack at any time.
When that call depth reaches a magical, microarchitecture-specific
value for the Return Stack Buffer, the code stuffs that RSB and
avoids its underflow which could otherwise lead to the Intel variant
of Retbleed.
This software-only solution brings a lot of the lost performance
back, as benchmarks suggest:
https://lore.kernel.org/all/20220915111039.092790446@infradead.org/
That page above also contains a lot more detailed explanation of the
whole mechanism
- Implement a new control flow integrity scheme called FineIBT which is
based on the software kCFI implementation and uses hardware IBT
support where present to annotate and track indirect branches using a
hash to validate them
- Other misc fixes and cleanups
* tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
x86/paravirt: Use common macro for creating simple asm paravirt functions
x86/paravirt: Remove clobber bitmask from .parainstructions
x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al
x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit
x86/Kconfig: Enable kernel IBT by default
x86,pm: Force out-of-line memcpy()
objtool: Fix weak hole vs prefix symbol
objtool: Optimize elf_dirty_reloc_sym()
x86/cfi: Add boot time hash randomization
x86/cfi: Boot time selection of CFI scheme
x86/ibt: Implement FineIBT
objtool: Add --cfi to generate the .cfi_sites section
x86: Add prefix symbols for function padding
objtool: Add option to generate prefix symbols
objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf
objtool: Slice up elf_create_section_symbol()
kallsyms: Revert "Take callthunks into account"
x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces
x86/retpoline: Fix crash printing warning
x86/paravirt: Fix a !PARAVIRT build warning
...
Currently the tracing dump_on_oops feature is implemented through
separate notifiers, one for die/oops and the other for panic;
given they have the same functionality, let's unify them.
Also improve the function comment and change the priority of the
notifier to make it execute earlier, avoiding showing useless trace
data (like the callback names for the other notifiers); finally,
we also removed an unnecessary header inclusion.
Link: https://lkml.kernel.org/r/20220819221731.480795-7-gpiccoli@igalia.com
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergei Shtylyov <sergei.shtylyov@gmail.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function match_records() may take a while due to a large
number of string comparisons, so when in PREEMPT_VOLUNTARY
kernels we could face RCU stalls due to that.
Add a cond_resched() to prevent that.
Link: https://lkml.kernel.org/r/20221115204847.593616-1-gpiccoli@igalia.com
Cc: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org> # from RCU CPU stall warning perspective
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If a trigger filter on the kernel command line fails to apply (due to
syntax error), it will be freed. The freeing will call
tracepoint_synchronize_unregister(), but this is not needed during early
boot up, and will even trigger a lockdep splat.
Avoid calling the synchronization function when system_state is
SYSTEM_BOOTING.
Link: https://lore.kernel.org/linux-trace-kernel/20221213172429.7774f4ba@gandalf.local.home
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Core
----
- Allow live renaming when an interface is up
- Add retpoline wrappers for tc, improving considerably the
performances of complex queue discipline configurations.
- Add inet drop monitor support.
- A few GRO performance improvements.
- Add infrastructure for atomic dev stats, addressing long standing
data races.
- De-duplicate common code between OVS and conntrack offloading
infrastructure.
- A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
- Netfilter: introduce packet parser for tunneled packets
- Replace IPVS timer-based estimators with kthreads to scale up
the workload with the number of available CPUs.
- Add the helper support for connection-tracking OVS offload.
BPF
---
- Support for user defined BPF objects: the use case is to allocate
own objects, build own object hierarchies and use the building
blocks to build own data structures flexibly, for example, linked
lists in BPF.
- Make cgroup local storage available to non-cgroup attached BPF
programs.
- Avoid unnecessary deadlock detection and failures wrt BPF task
storage helpers.
- A relevant bunch of BPF verifier fixes and improvements.
- Veristat tool improvements to support custom filtering, sorting,
and replay of results.
- Add LLVM disassembler as default library for dumping JITed code.
- Lots of new BPF documentation for various BPF maps.
- Add bpf_rcu_read_{,un}lock() support for sleepable programs.
- Add RCU grace period chaining to BPF to wait for the completion
of access from both sleepable and non-sleepable BPF programs.
- Add support storing struct task_struct objects as kptrs in maps.
- Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
values.
- Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
Protocols
---------
- TCP: implement Protective Load Balancing across switch links.
- TCP: allow dynamically disabling TCP-MD5 static key, reverting
back to fast[er]-path.
- UDP: Introduce optional per-netns hash lookup table.
- IPv6: simplify and cleanup sockets disposal.
- Netlink: support different type policies for each generic
netlink operation.
- MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
- MPTCP: add netlink notification support for listener sockets
events.
- SCTP: add VRF support, allowing sctp sockets binding to VRF
devices.
- Add bridging MAC Authentication Bypass (MAB) support.
- Extensions for Ethernet VPN bridging implementation to better
support multicast scenarios.
- More work for Wi-Fi 7 support, comprising conversion of all
the existing drivers to internal TX queue usage.
- IPSec: introduce a new offload type (packet offload) allowing
complete header processing and crypto offloading.
- IPSec: extended ack support for more descriptive XFRM error
reporting.
- RXRPC: increase SACK table size and move processing into a
per-local endpoint kernel thread, reducing considerably the
required locking.
- IEEE 802154: synchronous send frame and extended filtering
support, initial support for scanning available 15.4 networks.
- Tun: bump the link speed from 10Mbps to 10Gbps.
- Tun/VirtioNet: implement UDP segmentation offload support.
Driver API
----------
- PHY/SFP: improve power level switching between standard
level 1 and the higher power levels.
- New API for netdev <-> devlink_port linkage.
- PTP: convert existing drivers to new frequency adjustment
implementation.
- DSA: add support for rx offloading.
- Autoload DSA tagging driver when dynamically changing protocol.
- Add new PCP and APPTRUST attributes to Data Center Bridging.
- Add configuration support for 800Gbps link speed.
- Add devlink port function attribute to enable/disable RoCE and
migratable.
- Extend devlink-rate to support strict prioriry and weighted fair
queuing.
- Add devlink support to directly reading from region memory.
- New device tree helper to fetch MAC address from nvmem.
- New big TCP helper to simplify temporary header stripping.
New hardware / drivers
----------------------
- Ethernet:
- Marvel Octeon CNF95N and CN10KB Ethernet Switches.
- Marvel Prestera AC5X Ethernet Switch.
- WangXun 10 Gigabit NIC.
- Motorcomm yt8521 Gigabit Ethernet.
- Microchip ksz9563 Gigabit Ethernet Switch.
- Microsoft Azure Network Adapter.
- Linux Automation 10Base-T1L adapter.
- PHY:
- Aquantia AQR112 and AQR412.
- Motorcomm YT8531S.
- PTP:
- Orolia ART-CARD.
- WiFi:
- MediaTek Wi-Fi 7 (802.11be) devices.
- RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
devices.
- Bluetooth:
- Broadcom BCM4377/4378/4387 Bluetooth chipsets.
- Realtek RTL8852BE and RTL8723DS.
- Cypress.CYW4373A0 WiFi + Bluetooth combo device.
Drivers
-------
- CAN:
- gs_usb: bus error reporting support.
- kvaser_usb: listen only and bus error reporting support.
- Ethernet NICs:
- Intel (100G):
- extend action skbedit to RX queue mapping.
- implement devlink-rate support.
- support direct read from memory.
- nVidia/Mellanox (mlx5):
- SW steering improvements, increasing rules update rate.
- Support for enhanced events compression.
- extend H/W offload packet manipulation capabilities.
- implement IPSec packet offload mode.
- nVidia/Mellanox (mlx4):
- better big TCP support.
- Netronome Ethernet NICs (nfp):
- IPsec offload support.
- add support for multicast filter.
- Broadcom:
- RSS and PTP support improvements.
- AMD/SolarFlare:
- netlink extened ack improvements.
- add basic flower matches to offload, and related stats.
- Virtual NICs:
- ibmvnic: introduce affinity hint support.
- small / embedded:
- FreeScale fec: add initial XDP support.
- Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
- TI am65-cpsw: add suspend/resume support.
- Mediatek MT7986: add RX wireless wthernet dispatch support.
- Realtek 8169: enable GRO software interrupt coalescing per
default.
- Ethernet high-speed switches:
- Microchip (sparx5):
- add support for Sparx5 TC/flower H/W offload via VCAP.
- Mellanox mlxsw:
- add 802.1X and MAC Authentication Bypass offload support.
- add ip6gre support.
- Embedded Ethernet switches:
- Mediatek (mtk_eth_soc):
- improve PCS implementation, add DSA untag support.
- enable flow offload support.
- Renesas:
- add rswitch R-Car Gen4 gPTP support.
- Microchip (lan966x):
- add full XDP support.
- add TC H/W offload via VCAP.
- enable PTP on bridge interfaces.
- Microchip (ksz8):
- add MTU support for KSZ8 series.
- Qualcomm 802.11ax WiFi (ath11k):
- support configuring channel dwell time during scan.
- MediaTek WiFi (mt76):
- enable Wireless Ethernet Dispatch (WED) offload support.
- add ack signal support.
- enable coredump support.
- remain_on_channel support.
- Intel WiFi (iwlwifi):
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
- 320 MHz channels support.
- RealTek WiFi (rtw89):
- new dynamic header firmware format support.
- wake-over-WLAN support.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
V8gnt1EL+0cc
=CbJC
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Allow live renaming when an interface is up
- Add retpoline wrappers for tc, improving considerably the
performances of complex queue discipline configurations
- Add inet drop monitor support
- A few GRO performance improvements
- Add infrastructure for atomic dev stats, addressing long standing
data races
- De-duplicate common code between OVS and conntrack offloading
infrastructure
- A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements
- Netfilter: introduce packet parser for tunneled packets
- Replace IPVS timer-based estimators with kthreads to scale up the
workload with the number of available CPUs
- Add the helper support for connection-tracking OVS offload
BPF:
- Support for user defined BPF objects: the use case is to allocate
own objects, build own object hierarchies and use the building
blocks to build own data structures flexibly, for example, linked
lists in BPF
- Make cgroup local storage available to non-cgroup attached BPF
programs
- Avoid unnecessary deadlock detection and failures wrt BPF task
storage helpers
- A relevant bunch of BPF verifier fixes and improvements
- Veristat tool improvements to support custom filtering, sorting,
and replay of results
- Add LLVM disassembler as default library for dumping JITed code
- Lots of new BPF documentation for various BPF maps
- Add bpf_rcu_read_{,un}lock() support for sleepable programs
- Add RCU grace period chaining to BPF to wait for the completion of
access from both sleepable and non-sleepable BPF programs
- Add support storing struct task_struct objects as kptrs in maps
- Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
values
- Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions
Protocols:
- TCP: implement Protective Load Balancing across switch links
- TCP: allow dynamically disabling TCP-MD5 static key, reverting back
to fast[er]-path
- UDP: Introduce optional per-netns hash lookup table
- IPv6: simplify and cleanup sockets disposal
- Netlink: support different type policies for each generic netlink
operation
- MPTCP: add MSG_FASTOPEN and FastOpen listener side support
- MPTCP: add netlink notification support for listener sockets events
- SCTP: add VRF support, allowing sctp sockets binding to VRF devices
- Add bridging MAC Authentication Bypass (MAB) support
- Extensions for Ethernet VPN bridging implementation to better
support multicast scenarios
- More work for Wi-Fi 7 support, comprising conversion of all the
existing drivers to internal TX queue usage
- IPSec: introduce a new offload type (packet offload) allowing
complete header processing and crypto offloading
- IPSec: extended ack support for more descriptive XFRM error
reporting
- RXRPC: increase SACK table size and move processing into a
per-local endpoint kernel thread, reducing considerably the
required locking
- IEEE 802154: synchronous send frame and extended filtering support,
initial support for scanning available 15.4 networks
- Tun: bump the link speed from 10Mbps to 10Gbps
- Tun/VirtioNet: implement UDP segmentation offload support
Driver API:
- PHY/SFP: improve power level switching between standard level 1 and
the higher power levels
- New API for netdev <-> devlink_port linkage
- PTP: convert existing drivers to new frequency adjustment
implementation
- DSA: add support for rx offloading
- Autoload DSA tagging driver when dynamically changing protocol
- Add new PCP and APPTRUST attributes to Data Center Bridging
- Add configuration support for 800Gbps link speed
- Add devlink port function attribute to enable/disable RoCE and
migratable
- Extend devlink-rate to support strict prioriry and weighted fair
queuing
- Add devlink support to directly reading from region memory
- New device tree helper to fetch MAC address from nvmem
- New big TCP helper to simplify temporary header stripping
New hardware / drivers:
- Ethernet:
- Marvel Octeon CNF95N and CN10KB Ethernet Switches
- Marvel Prestera AC5X Ethernet Switch
- WangXun 10 Gigabit NIC
- Motorcomm yt8521 Gigabit Ethernet
- Microchip ksz9563 Gigabit Ethernet Switch
- Microsoft Azure Network Adapter
- Linux Automation 10Base-T1L adapter
- PHY:
- Aquantia AQR112 and AQR412
- Motorcomm YT8531S
- PTP:
- Orolia ART-CARD
- WiFi:
- MediaTek Wi-Fi 7 (802.11be) devices
- RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
devices
- Bluetooth:
- Broadcom BCM4377/4378/4387 Bluetooth chipsets
- Realtek RTL8852BE and RTL8723DS
- Cypress.CYW4373A0 WiFi + Bluetooth combo device
Drivers:
- CAN:
- gs_usb: bus error reporting support
- kvaser_usb: listen only and bus error reporting support
- Ethernet NICs:
- Intel (100G):
- extend action skbedit to RX queue mapping
- implement devlink-rate support
- support direct read from memory
- nVidia/Mellanox (mlx5):
- SW steering improvements, increasing rules update rate
- Support for enhanced events compression
- extend H/W offload packet manipulation capabilities
- implement IPSec packet offload mode
- nVidia/Mellanox (mlx4):
- better big TCP support
- Netronome Ethernet NICs (nfp):
- IPsec offload support
- add support for multicast filter
- Broadcom:
- RSS and PTP support improvements
- AMD/SolarFlare:
- netlink extened ack improvements
- add basic flower matches to offload, and related stats
- Virtual NICs:
- ibmvnic: introduce affinity hint support
- small / embedded:
- FreeScale fec: add initial XDP support
- Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
- TI am65-cpsw: add suspend/resume support
- Mediatek MT7986: add RX wireless wthernet dispatch support
- Realtek 8169: enable GRO software interrupt coalescing per
default
- Ethernet high-speed switches:
- Microchip (sparx5):
- add support for Sparx5 TC/flower H/W offload via VCAP
- Mellanox mlxsw:
- add 802.1X and MAC Authentication Bypass offload support
- add ip6gre support
- Embedded Ethernet switches:
- Mediatek (mtk_eth_soc):
- improve PCS implementation, add DSA untag support
- enable flow offload support
- Renesas:
- add rswitch R-Car Gen4 gPTP support
- Microchip (lan966x):
- add full XDP support
- add TC H/W offload via VCAP
- enable PTP on bridge interfaces
- Microchip (ksz8):
- add MTU support for KSZ8 series
- Qualcomm 802.11ax WiFi (ath11k):
- support configuring channel dwell time during scan
- MediaTek WiFi (mt76):
- enable Wireless Ethernet Dispatch (WED) offload support
- add ack signal support
- enable coredump support
- remain_on_channel support
- Intel WiFi (iwlwifi):
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
- 320 MHz channels support
- RealTek WiFi (rtw89):
- new dynamic header firmware format support
- wake-over-WLAN support"
* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
ipvs: fix type warning in do_div() on 32 bit
net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
net: ipa: add IPA v4.7 support
dt-bindings: net: qcom,ipa: Add SM6350 compatible
bnxt: Use generic HBH removal helper in tx path
IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
selftests: forwarding: Add bridge MDB test
selftests: forwarding: Rename bridge_mdb test
bridge: mcast: Support replacement of MDB port group entries
bridge: mcast: Allow user space to specify MDB entry routing protocol
bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
bridge: mcast: Add support for (*, G) with a source list and filter mode
bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
bridge: mcast: Add a flag for user installed source entries
bridge: mcast: Expose __br_multicast_del_group_src()
bridge: mcast: Expose br_multicast_new_group_src()
bridge: mcast: Add a centralized error path
bridge: mcast: Place netlink policy before validation functions
bridge: mcast: Split (*, G) and (S, G) addition into different functions
bridge: mcast: Do not derive entry type from its filter mode
...
It is annoying that the filter parsing of triggers do not show up in the
error_log. Trying to figure out what is incorrect in the input is
difficult when it fails for a typo.
Have the errors of filter parsing show up in error_log as well.
Link: https://lore.kernel.org/linux-trace-kernel/20221213095602.083fa9fd@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk
WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9
2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH
YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK
aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII
LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI
0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd
Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7
Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI
XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n
D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr
wxzFGfvHfw==
=k/vv
-----END PGP SIGNATURE-----
Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Christoph:
- Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
Joshi)
- Refactor PCIe probing and reset (Christoph Hellwig)
- Various fabrics authentication fixes and improvements (Sagi
Grimberg)
- Avoid fallback to sequential scan due to transient issues (Uday
Shankar)
- Implement support for the DEAC bit in Write Zeroes (Christoph
Hellwig)
- Allow overriding the IEEE OUI and firmware revision in configfs
for nvmet (Aleksandr Miloserdov)
- Force reconnect when number of queue changes in nvmet (Daniel
Wagner)
- Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
Grimberg, Christoph Hellwig, Christophe JAILLET)
- Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
- Use the common tagset helpers in nvme-pci driver (Christoph
Hellwig)
- Cleanup the nvme-pci removal path (Christoph Hellwig)
- Use kstrtobool() instead of strtobool (Christophe JAILLET)
- Allow unprivileged passthrough of Identify Controller (Joel
Granados)
- Support io stats on the mpath device (Sagi Grimberg)
- Minor nvmet cleanup (Sagi Grimberg)
- MD pull requests via Song:
- Code cleanups (Christoph)
- Various fixes
- Floppy pull request from Denis:
- Fix a memory leak in the init error path (Yuan)
- Series fixing some batch wakeup issues with sbitmap (Gabriel)
- Removal of the pktcdvd driver that was deprecated more than 5 years
ago, and subsequent removal of the devnode callback in struct
block_device_operations as no users are now left (Greg)
- Fix for partition read on an exclusively opened bdev (Jan)
- Series of elevator API cleanups (Jinlong, Christoph)
- Series of fixes and cleanups for blk-iocost (Kemeng)
- Series of fixes and cleanups for blk-throttle (Kemeng)
- Series adding concurrent support for sync queues in BFQ (Yu)
- Series bringing drbd a bit closer to the out-of-tree maintained
version (Christian, Joel, Lars, Philipp)
- Misc drbd fixes (Wang)
- blk-wbt fixes and tweaks for enable/disable (Yu)
- Fixes for mq-deadline for zoned devices (Damien)
- Add support for read-only and offline zones for null_blk
(Shin'ichiro)
- Series fixing the delayed holder tracking, as used by DM (Yu,
Christoph)
- Series enabling bio alloc caching for IRQ based IO (Pavel)
- Series enabling userspace peer-to-peer DMA (Logan)
- BFQ waker fixes (Khazhismel)
- Series fixing elevator refcount issues (Christoph, Jinlong)
- Series cleaning up references around queue destruction (Christoph)
- Series doing quiesce by tagset, enabling cleanups in drivers
(Christoph, Chao)
- Series untangling the queue kobject and queue references (Christoph)
- Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)
* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
blktrace: Fix output non-blktrace event when blk_classic option enabled
block: sed-opal: Don't include <linux/kernel.h>
sed-opal: allow using IOC_OPAL_SAVE for locking too
blk-cgroup: Fix typo in comment
block: remove bio_set_op_attrs
nvmet: don't open-code NVME_NS_ATTR_RO enumeration
nvme-pci: use the tagset alloc/free helpers
nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
nvme: consolidate setting the tagset flags
nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
block: bio_copy_data_iter
nvme-pci: split out a nvme_pci_ctrl_is_dead helper
nvme-pci: return early on ctrl state mismatch in nvme_reset_work
nvme-pci: rename nvme_disable_io_queues
nvme-pci: cleanup nvme_suspend_queue
nvme-pci: remove nvme_pci_disable
nvme-pci: remove nvme_disable_admin_queue
nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
nvme: use nvme_wait_ready in nvme_shutdown_ctrl
...
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
ACPI:
* Enable FPDT support for boot-time profiling
* Fix CPU PMU probing to work better with PREEMPT_RT
* Update SMMUv3 MSI DeviceID parsing to latest IORT spec
* APMT support for probing Arm CoreSight PMU devices
CPU features:
* Advertise new SVE instructions (v2.1)
* Advertise range prefetch instruction
* Advertise CSSC ("Common Short Sequence Compression") scalar
instructions, adding things like min, max, abs, popcount
* Enable DIT (Data Independent Timing) when running in the kernel
* More conversion of system register fields over to the generated
header
CPU misfeatures:
* Workaround for Cortex-A715 erratum #2645198
Dynamic SCS:
* Support for dynamic shadow call stacks to allow switching at
runtime between Clang's SCS implementation and the CPU's
pointer authentication feature when it is supported (complete
with scary DWARF parser!)
Tracing and debug:
* Remove static ftrace in favour of, err, dynamic ftrace!
* Seperate 'struct ftrace_regs' from 'struct pt_regs' in core
ftrace and existing arch code
* Introduce and implement FTRACE_WITH_ARGS on arm64 to replace
the old FTRACE_WITH_REGS
* Extend 'crashkernel=' parameter with default value and fallback
to placement above 4G physical if initial (low) allocation
fails
SVE:
* Optimisation to avoid disabling SVE unconditionally on syscall
entry and just zeroing the non-shared state on return instead
Exceptions:
* Rework of undefined instruction handling to avoid serialisation
on global lock (this includes emulation of user accesses to the
ID registers)
Perf and PMU:
* Support for TLP filters in Hisilicon's PCIe PMU device
* Support for the DDR PMU present in Amlogic Meson G12 SoCs
* Support for the terribly-named "CoreSight PMU" architecture
from Arm (and Nvidia's implementation of said architecture)
Misc:
* Tighten up our boot protocol for systems with memory above
52 bits physical
* Const-ify static keys to satisty jump label asm constraints
* Trivial FFA driver cleanups in preparation for v1.1 support
* Export the kernel_neon_* APIs as GPL symbols
* Harden our instruction generation routines against
instrumentation
* A bunch of robustness improvements to our arch-specific selftests
* Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmOPLFAQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNPRcCACLyDTvkimiqfoPxzzgdkx/6QOvw9s3/mXg
UcTORSZBR1VnYkiMYEKVz/tTfG99dnWtD8/0k/rz48NbhBfsF2sN4ukyBBXVf0zR
fjnaVyVC11LUgBgZKPo6maV+jf/JWf9hJtpPl06KTiPb2Hw2JX4DXg+PeF8t2hGx
NLH4ekQOrlDM8mlsN5mc0YsHbiuO7Xe/NRuet8TsgU4bEvLAwO6bzOLVUMqDQZNq
bQe2ENcGVAzAf7iRJb38lj9qB/5hrQTHRXqLXMSnJyyVjQEwYca0PeJMa7x30bXF
ZZ+xQ8Wq0mxiffZraf6SE34yD4gaYS4Fziw7rqvydC15vYhzJBH1
=hV+2
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"The highlights this time are support for dynamically enabling and
disabling Clang's Shadow Call Stack at boot and a long-awaited
optimisation to the way in which we handle the SVE register state on
system call entry to avoid taking unnecessary traps from userspace.
Summary:
ACPI:
- Enable FPDT support for boot-time profiling
- Fix CPU PMU probing to work better with PREEMPT_RT
- Update SMMUv3 MSI DeviceID parsing to latest IORT spec
- APMT support for probing Arm CoreSight PMU devices
CPU features:
- Advertise new SVE instructions (v2.1)
- Advertise range prefetch instruction
- Advertise CSSC ("Common Short Sequence Compression") scalar
instructions, adding things like min, max, abs, popcount
- Enable DIT (Data Independent Timing) when running in the kernel
- More conversion of system register fields over to the generated
header
CPU misfeatures:
- Workaround for Cortex-A715 erratum #2645198
Dynamic SCS:
- Support for dynamic shadow call stacks to allow switching at
runtime between Clang's SCS implementation and the CPU's pointer
authentication feature when it is supported (complete with scary
DWARF parser!)
Tracing and debug:
- Remove static ftrace in favour of, err, dynamic ftrace!
- Seperate 'struct ftrace_regs' from 'struct pt_regs' in core ftrace
and existing arch code
- Introduce and implement FTRACE_WITH_ARGS on arm64 to replace the
old FTRACE_WITH_REGS
- Extend 'crashkernel=' parameter with default value and fallback to
placement above 4G physical if initial (low) allocation fails
SVE:
- Optimisation to avoid disabling SVE unconditionally on syscall
entry and just zeroing the non-shared state on return instead
Exceptions:
- Rework of undefined instruction handling to avoid serialisation on
global lock (this includes emulation of user accesses to the ID
registers)
Perf and PMU:
- Support for TLP filters in Hisilicon's PCIe PMU device
- Support for the DDR PMU present in Amlogic Meson G12 SoCs
- Support for the terribly-named "CoreSight PMU" architecture from
Arm (and Nvidia's implementation of said architecture)
Misc:
- Tighten up our boot protocol for systems with memory above 52 bits
physical
- Const-ify static keys to satisty jump label asm constraints
- Trivial FFA driver cleanups in preparation for v1.1 support
- Export the kernel_neon_* APIs as GPL symbols
- Harden our instruction generation routines against instrumentation
- A bunch of robustness improvements to our arch-specific selftests
- Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (151 commits)
arm64: kprobes: Return DBG_HOOK_ERROR if kprobes can not handle a BRK
arm64: kprobes: Let arch do_page_fault() fix up page fault in user handler
arm64: Prohibit instrumentation on arch_stack_walk()
arm64:uprobe fix the uprobe SWBP_INSN in big-endian
arm64: alternatives: add __init/__initconst to some functions/variables
arm_pmu: Drop redundant armpmu->map_event() in armpmu_event_init()
kselftest/arm64: Allow epoll_wait() to return more than one result
kselftest/arm64: Don't drain output while spawning children
kselftest/arm64: Hold fp-stress children until they're all spawned
arm64/sysreg: Remove duplicate definitions from asm/sysreg.h
arm64/sysreg: Convert ID_DFR1_EL1 to automatic generation
arm64/sysreg: Convert ID_DFR0_EL1 to automatic generation
arm64/sysreg: Convert ID_AFR0_EL1 to automatic generation
arm64/sysreg: Convert ID_MMFR5_EL1 to automatic generation
arm64/sysreg: Convert MVFR2_EL1 to automatic generation
arm64/sysreg: Convert MVFR1_EL1 to automatic generation
arm64/sysreg: Convert MVFR0_EL1 to automatic generation
arm64/sysreg: Convert ID_PFR2_EL1 to automatic generation
arm64/sysreg: Convert ID_PFR1_EL1 to automatic generation
arm64/sysreg: Convert ID_PFR0_EL1 to automatic generation
...
print_trace_line may overflow seq_file buffer. If the event is not
consumed, the while loop keeps peeking this event, causing a infinite loop.
Link: https://lkml.kernel.org/r/20221129113009.182425-1-yangjihong1@huawei.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 088b1e427d ("ftrace: pipe fixes")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The osnoise workload runs with preemption and IRQs enabled in such
a way as to allow all sorts of noise to disturb osnoise's execution.
hwlat tracer has a similar workload but works with irq disabled,
allowing only NMIs and the hardware to generate noise.
While thinking about adding an options file to hwlat tracer to
allow the system to panic, and other features I was thinking
to add, like having a tracepoint at each noise detection, it
came to my mind that is easier to make osnoise and also do
hardware latency detection than making hwlat "feature compatible"
with osnoise.
Other points are:
- osnoise already has an independent cpu file.
- osnoise has a more intuitive interface, e.g., runtime/period vs.
window/width (and people often need help remembering what it is).
- osnoise: tracepoints
- osnoise stop options
- osnoise options file itself
Moreover, the user-space side (in rtla) is simplified by reusing the
existing osnoise code.
Finally, people have been asking me about using osnoise for hw latency
detection, and I have to explain that it was sufficient but not
necessary. These options make it sufficient and necessary.
Adding a Suggested-by Clark, as he often asked me about this
possibility.
Link: https://lkml.kernel.org/r/d9c6c19135497054986900f94c8e47410b15316a.1670623111.git.bristot@kernel.org
Cc: Suggested-by: Clark Williams <williams@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Often the latency observed in a CPU is not caused by the work being done
in the CPU itself, but by work done on another CPU that causes the
hardware to stall all CPUs. In this case, it is interesting to know
what is happening on ALL CPUs, and the best way to do this is via
crash dump analysis.
Add the PANIC_ON_STOP option to osnoise/timerlat tracers. The default
behavior is having this option off. When enabled by the user, the system
will panic after hitting a stop tracing condition.
This option was motivated by a real scenario that Juri Lelli and I
were debugging.
Link: https://lkml.kernel.org/r/249ce4287c6725543e6db845a6e0df621dc67db5.1670623111.git.bristot@kernel.org
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Make osnoise_options static, as reported by the kernel test robot.
Link: https://lkml.kernel.org/r/63255826485400d7a2270e9c5e66111079671e7a.1670228712.git.bristot@kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace_trigger command line option introduced by
commit a01fdc897f ("tracing: Add trace_trigger kernel command line option")
doesn't need to depend on the CONFIG_HIST_TRIGGERS kernel config option.
This code doesn't depend on the histogram code, and the run-time
selection of triggers is usable without CONFIG_HIST_TRIGGERS.
Link: https://lore.kernel.org/linux-trace-kernel/20221209003310.1737039-1-zwisler@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: a01fdc897f ("tracing: Add trace_trigger kernel command line option")
Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
With the new command line option that allows trace event triggers to be
added at boot, the "snapshot" trigger will allocate the snapshot buffer
very early, when interrupts can not be enabled. Allocating the ring buffer
is not the problem, but it also resizes it, which is, as the resize code
does synchronization that can not be preformed at early boot.
To handle this, first change the raw_spin_lock_irq() in rb_insert_pages()
to raw_spin_lock_irqsave(), such that the unlocking of that spin lock will
not enable interrupts.
Next, where it calls schedule_work_on(), disable migration and check if
the CPU to update is the current CPU, and if so, perform the work
directly, otherwise re-enable migration and call the schedule_work_on() to
the CPU that is being updated. The rb_insert_pages() just needs to be run
on the CPU that it is updating, and does not need preemption nor
interrupts disabled when calling it.
Link: https://lore.kernel.org/lkml/Y5J%2FCajlNh1gexvo@google.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221209101151.1fec1167@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: a01fdc897f ("tracing: Add trace_trigger kernel command line option")
Reported-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When input some constructed invalid 'trigger' command, command info
in 'error_log' are lost [1].
The root cause is that there is a path that event_hist_trigger_parse()
is recursely called once and 'last_cmd' which save origin command is
cleared, then later calling of hist_err() will no longer record origin
command info:
event_hist_trigger_parse() {
last_cmd_set() // <1> 'last_cmd' save origin command here at first
create_actions() {
onmatch_create() {
action_create() {
trace_action_create() {
trace_action_create_field_var() {
create_field_var_hist() {
event_hist_trigger_parse() { // <2> recursely called once
hist_err_clear() // <3> 'last_cmd' is cleared here
}
hist_err() // <4> No longer find origin command!!!
Since 'glob' is empty string while running into the recurse call, we
can trickly check it and bypass the call of hist_err_clear() to solve it.
[1]
# cd /sys/kernel/tracing
# echo "my_synth_event int v1; int v2; int v3;" >> synthetic_events
# echo 'hist:keys=pid' >> events/sched/sched_waking/trigger
# echo "hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(\
pid,pid1)" >> events/sched/sched_switch/trigger
# cat error_log
[ 8.405018] hist:sched:sched_switch: error: Couldn't find synthetic event
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.816902] hist:sched:sched_switch: error: Couldn't find field
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.816902] hist:sched:sched_switch: error: Couldn't parse field variable
Command:
hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(pid,pid1)
^
[ 8.999880] : error: Couldn't find field
Command:
^
[ 8.999880] : error: Couldn't parse field variable
Command:
^
[ 8.999880] : error: Couldn't find field
Command:
^
[ 8.999880] : error: Couldn't create histogram for field
Command:
^
Link: https://lore.kernel.org/linux-trace-kernel/20221207135326.3483216-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Fixes: f404da6e1d ("tracing: Add 'last error' error facility for hist triggers")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The maximum number of synthetic fields supported is defined as
SYNTH_FIELDS_MAX which value currently is 64, but it actually fails
when try to generate a synthetic event with 64 fields by executing like:
# echo "my_synth_event int v1; int v2; int v3; int v4; int v5; int v6;\
int v7; int v8; int v9; int v10; int v11; int v12; int v13; int v14;\
int v15; int v16; int v17; int v18; int v19; int v20; int v21; int v22;\
int v23; int v24; int v25; int v26; int v27; int v28; int v29; int v30;\
int v31; int v32; int v33; int v34; int v35; int v36; int v37; int v38;\
int v39; int v40; int v41; int v42; int v43; int v44; int v45; int v46;\
int v47; int v48; int v49; int v50; int v51; int v52; int v53; int v54;\
int v55; int v56; int v57; int v58; int v59; int v60; int v61; int v62;\
int v63; int v64" >> /sys/kernel/tracing/synthetic_events
Correct the field counting to fix it.
Link: https://lore.kernel.org/linux-trace-kernel/20221207091557.3137904-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: c9e759b1e8 ("tracing: Rework synthetic event command parsing")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When generate a synthetic event with many params and then create a trace
action for it [1], kernel panic happened [2].
It is because that in trace_action_create() 'data->n_params' is up to
SYNTH_FIELDS_MAX (current value is 64), and array 'data->var_ref_idx'
keeps indices into array 'hist_data->var_refs' for each synthetic event
param, but the length of 'data->var_ref_idx' is TRACING_MAP_VARS_MAX
(current value is 16), so out-of-bound write happened when 'data->n_params'
more than 16. In this case, 'data->match_data.event' is overwritten and
eventually cause the panic.
To solve the issue, adjust the length of 'data->var_ref_idx' to be
SYNTH_FIELDS_MAX and add sanity checks to avoid out-of-bound write.
[1]
# cd /sys/kernel/tracing/
# echo "my_synth_event int v1; int v2; int v3; int v4; int v5; int v6;\
int v7; int v8; int v9; int v10; int v11; int v12; int v13; int v14;\
int v15; int v16; int v17; int v18; int v19; int v20; int v21; int v22;\
int v23; int v24; int v25; int v26; int v27; int v28; int v29; int v30;\
int v31; int v32; int v33; int v34; int v35; int v36; int v37; int v38;\
int v39; int v40; int v41; int v42; int v43; int v44; int v45; int v46;\
int v47; int v48; int v49; int v50; int v51; int v52; int v53; int v54;\
int v55; int v56; int v57; int v58; int v59; int v60; int v61; int v62;\
int v63" >> synthetic_events
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="bash"' >> \
events/sched/sched_waking/trigger
# echo "hist:keys=next_pid:onmatch(sched.sched_waking).my_synth_event(\
pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\
pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\
pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,pid,\
pid,pid,pid,pid,pid,pid,pid,pid,pid)" >> events/sched/sched_switch/trigger
[2]
BUG: unable to handle page fault for address: ffff91c900000000
PGD 61001067 P4D 61001067 PUD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 322 Comm: bash Tainted: G W 6.1.0-rc8+ #229
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:strcmp+0xc/0x30
Code: 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee
c3 cc cc cc cc 0f 1f 00 31 c0 eb 08 48 83 c0 01 84 d2 74 13 <0f> b6 14
07 3a 14 06 74 ef 19 c0 83 c8 01 c3 cc cc cc cc 31 c3
RSP: 0018:ffff9b3b00f53c48 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffffba958a68 RCX: 0000000000000000
RDX: 0000000000000010 RSI: ffff91c943d33a90 RDI: ffff91c900000000
RBP: ffff91c900000000 R08: 00000018d604b529 R09: 0000000000000000
R10: ffff91c9483eddb1 R11: ffff91ca483eddab R12: ffff91c946171580
R13: ffff91c9479f0538 R14: ffff91c9457c2848 R15: ffff91c9479f0538
FS: 00007f1d1cfbe740(0000) GS:ffff91c9bdc80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff91c900000000 CR3: 0000000006316000 CR4: 00000000000006e0
Call Trace:
<TASK>
__find_event_file+0x55/0x90
action_create+0x76c/0x1060
event_hist_trigger_parse+0x146d/0x2060
? event_trigger_write+0x31/0xd0
trigger_process_regex+0xbb/0x110
event_trigger_write+0x6b/0xd0
vfs_write+0xc8/0x3e0
? alloc_fd+0xc0/0x160
? preempt_count_add+0x4d/0xa0
? preempt_count_add+0x70/0xa0
ksys_write+0x5f/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f1d1d0cf077
Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e
fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74
RSP: 002b:00007ffcebb0e568 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000143 RCX: 00007f1d1d0cf077
RDX: 0000000000000143 RSI: 00005639265aa7e0 RDI: 0000000000000001
RBP: 00005639265aa7e0 R08: 000000000000000a R09: 0000000000000142
R10: 000056392639c017 R11: 0000000000000246 R12: 0000000000000143
R13: 00007f1d1d1ae6a0 R14: 00007f1d1d1aa4a0 R15: 00007f1d1d1a98a0
</TASK>
Modules linked in:
CR2: ffff91c900000000
---[ end trace 0000000000000000 ]---
RIP: 0010:strcmp+0xc/0x30
Code: 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee
c3 cc cc cc cc 0f 1f 00 31 c0 eb 08 48 83 c0 01 84 d2 74 13 <0f> b6 14
07 3a 14 06 74 ef 19 c0 83 c8 01 c3 cc cc cc cc 31 c3
RSP: 0018:ffff9b3b00f53c48 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffffba958a68 RCX: 0000000000000000
RDX: 0000000000000010 RSI: ffff91c943d33a90 RDI: ffff91c900000000
RBP: ffff91c900000000 R08: 00000018d604b529 R09: 0000000000000000
R10: ffff91c9483eddb1 R11: ffff91ca483eddab R12: ffff91c946171580
R13: ffff91c9479f0538 R14: ffff91c9457c2848 R15: ffff91c9479f0538
FS: 00007f1d1cfbe740(0000) GS:ffff91c9bdc80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff91c900000000 CR3: 0000000006316000 CR4: 00000000000006e0
Link: https://lore.kernel.org/linux-trace-kernel/20221207035143.2278781-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: d380dcde9a ("tracing: Fix now invalid var_ref_vals assumption in trace action")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Both CONFIG_OSNOISE_TRACER and CONFIG_HWLAT_TRACER partially enables the
CONFIG_TRACER_MAX_TRACE code, but that is complicated and has
introduced a bug; It declares tracing_max_lat_fops data structure outside
of #ifdefs, but since it is defined only when CONFIG_TRACER_MAX_TRACE=y
or CONFIG_HWLAT_TRACER=y, if only CONFIG_OSNOISE_TRACER=y, that
declaration comes to a definition(!).
To fix this issue, and do not repeat the similar problem, makes
CONFIG_OSNOISE_TRACER and CONFIG_HWLAT_TRACER enables the
CONFIG_TRACER_MAX_TRACE always. It has there benefits;
- Fix the tracing_max_lat_fops bug
- Simplify the #ifdefs
- CONFIG_TRACER_MAX_TRACE code is fully enabled, or not.
Link: https://lore.kernel.org/linux-trace-kernel/167033628155.4111793.12185405690820208159.stgit@devnote3
Fixes: 424b650f35 ("tracing: Fix missing osnoise tracer on max_latency")
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: stable@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/all/166992525941.1716618.13740663757583361463.stgit@warthog.procyon.org.uk/ (original thread and v1)
Link: https://lore.kernel.org/all/202212052253.VuhZ2ulJ-lkp@intel.com/T/#u (v1 error report)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When creating probe names, a check is done to make sure it matches basic C
standard variable naming standards. Basically, starts with alphabetic or
underline, and then the rest of the characters have alpha-numeric or
underline in them.
But system names do not have any true naming conventions, as they are
created by the TRACE_SYSTEM macro and nothing tests to see what they are.
The "xhci-hcd" trace events has a '-' in the system name. When trying to
attach a eprobe to one of these trace points, it fails because the system
name does not follow the variable naming convention because of the
hyphen, and the eprobe checks fail on this.
Allow hyphens in the system name so that eprobes can attach to the
"xhci-hcd" trace events.
Link: https://lore.kernel.org/all/Y3eJ8GiGnEvVd8%2FN@macondo/
Link: https://lore.kernel.org/linux-trace-kernel/20221122122345.160f5077@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 5b7a962209 ("tracing/probe: Check event/group naming rule at parsing")
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Function __kprobe_trace_func calls ring_buffer_event_data to
get a ring buffer, however, it has been done in above call
trace_event_buffer_reserve. So does __kretprobe_trace_func.
This patch removes those duplicated calls.
Link: https://lore.kernel.org/all/1666145478-4706-1-git-send-email-chensong_2000@189.cn/
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Song Chen <chensong_2000@189.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The hitcount is treated specially in the histograms - since it's
always expected to be there regardless of whether the user specified
anything or not, it's always added as the first histogram value.
Currently the code doesn't allow it to be added more than once as a
value, which is inconsistent with all the other possible values. It
would seem to be a pointless thing to want to do, but other features
being added such as percent and graph modifiers don't work properly
with the current hitcount restrictions.
Fix this by allowing multiple hitcounts to be added.
Link: https://lore.kernel.org/linux-trace-kernel/166610812248.56030.16754785928712505251.stgit@devnote2
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
When the blk_classic option is enabled, non-blktrace events must be
filtered out. Otherwise, events of other types are output in the blktrace
classic format, which is unexpected.
The problem can be triggered in the following ways:
# echo 1 > /sys/kernel/debug/tracing/options/blk_classic
# echo 1 > /sys/kernel/debug/tracing/events/enable
# echo blk > /sys/kernel/debug/tracing/current_tracer
# cat /sys/kernel/debug/tracing/trace_pipe
Fixes: c71a896154 ("blktrace: add ftrace plugin")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221122040410.85113-1-yangjihong1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use only one hyphen in kernel-doc notation between the function name
and its short description.
The is the documented kerenl-doc format. It also fixes the HTML
presentation to be consistent with other functions.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20221201070331.25685-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If user_event_trace_register() fails within user_event_parse() the
call's print_fmt member is not freed. Add kfree call to fix this.
Link: https://lkml.kernel.org/r/20221123183248.554-1-beaub@linux.microsoft.com
Fixes: aa3b2b4c66 ("user_events: Add print_fmt generation support for basic types")
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Introduce bpf_func_proto->might_sleep to indicate a particular helper
might sleep. This will make later check whether a helper might be
sleepable or not easier.
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221124053211.2373553-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Song Shuai reported:
The list func (ftrace_ops_list_func) will be patched first
before the transition between old and new calls are set,
which fixed the race described in this commit `59338f75`.
While ftrace_trace_function changes from the list func to a
ftrace_ops func, like unregistering the klp_ops to leave the only
global_ops in ftrace_ops_list, the ftrace_[regs]_call will be
replaced with the list func although it already exists. So there
should be a condition to avoid this.
And suggested using another variable to keep track of what the ftrace
function is set to. But this could be simplified by using a helper
function that does the same with a static variable.
Link: https://lore.kernel.org/lkml/20221026132039.2236233-1-suagrfillet@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221122180905.737b6f52@gandalf.local.home
Reported-by: Song Shuai <suagrfillet@gmail.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After commit 060fa5c83e ("tracing/events: reuse trace event ids after
overflow"), trace events with dynamic type are linked up in list
'ftrace_event_list' through field 'trace_event.list'. Then when max
event type number used up, it's possible to reuse type number of some
freed one by traversing 'ftrace_event_list'.
As instead, using IDA to manage available type numbers can make codes
simpler and then the field 'trace_event.list' can be dropped.
Since 'struct trace_event' is used in static tracepoints, drop
'trace_event.list' can make vmlinux smaller. Local test with about 2000
tracepoints, vmlinux reduced about 64KB:
before:-rwxrwxr-x 1 root root 76669448 Nov 8 17:14 vmlinux
after: -rwxrwxr-x 1 root root 76604176 Nov 8 17:15 vmlinux
Link: https://lkml.kernel.org/r/20221110020319.1259291-1-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After change in commit 4239174570 ("tracing: Make tracepoint_printk a
static_key"), this symbol is not used outside of the file, so mark it
static.
Link: https://lkml.kernel.org/r/20221122091456.72055-1-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This patch uses strndup_user instead of kzalloc + strncpy_from_user,
which makes the code more concise.
Link: https://lkml.kernel.org/r/20221121080831.707409-1-nashuiliang@gmail.com
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add the tracing/osnoise/options file to control
osnoise/timerlat tracer features. It is a single
file to contain multiple features, similar to
the sched/features file.
Reading the file displays a list of options. Writing
the OPTION_NAME enables it, writing NO_OPTION_NAME disables
it.
The DEAFULTS is a particular option that resets the options
to the default ones.
It uses a bitmask to keep track of the status of the option. When
needed, we can add a list of static keys, but for now
it does not justify the memory increase.
Link: https://lkml.kernel.org/r/f8d34aefdb225d2603fcb4c02a120832a0cd3339.1668692096.git.bristot@kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After commit a389d86f7f ("ring-buffer: Have nested events still record
running time stamp"), the "event" parameter is no longer used in either
ring_buffer_unlock_commit() or rb_commit(). Best to remove it.
Link: https://lkml.kernel.org/r/1666274811-24138-1-git-send-email-chensong_2000@189.cn
Signed-off-by: Song Chen <chensong_2000@189.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Allow triggers to be enabled at kernel boot up. For example:
trace_trigger="sched_switch.stacktrace if prev_state == 2"
The above will enable the stacktrace trigger on top of the sched_switch
event and only trigger if its prev_state is 2 (TASK_UNINTERRUPTIBLE). Then
at boot up, a stacktrace will trigger and be recorded in the tracing ring
buffer every time the sched_switch happens where the previous state is
TASK_INTERRUPTIBLE.
Another useful trigger would be "traceoff" which can stop tracing on an
event if a field of the event matches a certain value defined by the
filter ("if" statement).
Link: https://lore.kernel.org/linux-trace-kernel/20221020210056.0d8d0a5b@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After 65536 dynamic events have been added and removed, the "type" field
of the event then uses the first type number that is available (not
currently used by other events). A type number is the identifier of the
binary blobs in the tracing ring buffer (known as events) to map them to
logic that can parse the binary blob.
The issue is that if a dynamic event (like a kprobe event) is traced and
is in the ring buffer, and then that event is removed (because it is
dynamic, which means it can be created and destroyed), if another dynamic
event is created that has the same number that new event's logic on
parsing the binary blob will be used.
To show how this can be an issue, the following can crash the kernel:
# cd /sys/kernel/tracing
# for i in `seq 65536`; do
echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events
# done
For every iteration of the above, the writing to the kprobe_events will
remove the old event and create a new one (with the same format) and
increase the type number to the next available on until the type number
reaches over 65535 which is the max number for the 16 bit type. After it
reaches that number, the logic to allocate a new number simply looks for
the next available number. When an dynamic event is removed, that number
is then available to be reused by the next dynamic event created. That is,
once the above reaches the max number, the number assigned to the event in
that loop will remain the same.
Now that means deleting one dynamic event and created another will reuse
the previous events type number. This is where bad things can happen.
After the above loop finishes, the kprobes/foo event which reads the
do_sys_openat2 function call's first parameter as an integer.
# echo 1 > kprobes/foo/enable
# cat /etc/passwd > /dev/null
# cat trace
cat-2211 [005] .... 2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
cat-2211 [005] .... 2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
# echo 0 > kprobes/foo/enable
Now if we delete the kprobe and create a new one that reads a string:
# echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events
And now we can the trace:
# cat trace
sendmail-1942 [002] ..... 530.136320: foo: (do_sys_openat2+0x0/0x240) arg1= cat-2046 [004] ..... 530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
cat-2046 [004] ..... 530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
bash-1515 [007] ..... 534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U
And dmesg has:
==================================================================
BUG: KASAN: use-after-free in string+0xd4/0x1c0
Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049
CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x77
print_report+0x17f/0x47b
kasan_report+0xad/0x130
string+0xd4/0x1c0
vsnprintf+0x500/0x840
seq_buf_vprintf+0x62/0xc0
trace_seq_printf+0x10e/0x1e0
print_type_string+0x90/0xa0
print_kprobe_event+0x16b/0x290
print_trace_line+0x451/0x8e0
s_show+0x72/0x1f0
seq_read_iter+0x58e/0x750
seq_read+0x115/0x160
vfs_read+0x11d/0x460
ksys_read+0xa9/0x130
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fc2e972ade2
Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2
RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003
RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
</TASK>
The buggy address belongs to the physical page:
page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
This was found when Zheng Yejian sent a patch to convert the event type
number assignment to use IDA, which gives the next available number, and
this bug showed up in the fuzz testing by Yujie Liu and the kernel test
robot. But after further analysis, I found that this behavior is the same
as when the event type numbers go past the 16bit max (and the above shows
that).
As modules have a similar issue, but is dealt with by setting a
"WAS_ENABLED" flag when a module event is enabled, and when the module is
freed, if any of its events were enabled, the ring buffer that holds that
event is also cleared, to prevent reading stale events. The same can be
done for dynamic events.
If any dynamic event that is being removed was enabled, then make sure the
buffers they were enabled in are now cleared.
Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.com/
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Depends-on: e18eb8783e ("tracing: Add tracing_reset_all_online_cpus_unlocked() function")
Depends-on: 5448d44c38 ("tracing: Add unified dynamic event framework")
Depends-on: 6212dd2968 ("tracing/kprobes: Use dyn_event framework for kprobe events")
Depends-on: 065e63f951 ("tracing: Only have rmmod clear buffers that its events were active in")
Depends-on: 575380da8b ("tracing: Only clear trace buffer on module unload if event was traced")
Fixes: 77b44d1b7c ("tracing/kprobes: Rename Kprobe-tracer to kprobe-event")
Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
Reported-by: Yujie Liu <yujie.liu@intel.com>
Reported-by: kernel test robot <yujie.liu@intel.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently the tracing_reset_all_online_cpus() requires the
trace_types_lock held. But only one caller of this function actually has
that lock held before calling it, and the other just takes the lock so
that it can call it. More users of this function is needed where the lock
is not held.
Add a tracing_reset_all_online_cpus_unlocked() function for the one use
case that calls it without being held, and also add a lockdep_assert to
make sure it is held when called.
Then have tracing_reset_all_online_cpus() take the lock internally, such
that callers do not need to worry about taking it.
Link: https://lkml.kernel.org/r/20221123192741.658273220@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
commit 94eedf3dde ("tracing: Fix race where eprobes can be called before
the event") fixed an issue where if an event is soft disabled, and the
trigger is being added, there's a small window where the event sees that
there's a trigger but does not see that it requires reading the event yet,
and then calls the trigger with the record == NULL.
This could be solved with adding memory barriers in the hot path, or to
make sure that all the triggers requiring a record check for NULL. The
latter was chosen.
Commit 94eedf3dde set the eprobe trigger handle to check for NULL, but
the same needs to be done with histograms.
Link: https://lore.kernel.org/linux-trace-kernel/20221118211809.701d40c0f8a757b0df3c025a@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20221123164323.03450c3a@gandalf.local.home
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Reported-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The flag that tells the event to call its triggers after reading the event
is set for eprobes after the eprobe is enabled. This leads to a race where
the eprobe may be triggered at the beginning of the event where the record
information is NULL. The eprobe then dereferences the NULL record causing
a NULL kernel pointer bug.
Test for a NULL record to keep this from happening.
Link: https://lore.kernel.org/linux-trace-kernel/20221116192552.1066630-1-rafaelmendsr@gmail.com/
Link: https://lore.kernel.org/all/20221117214249.2addbe10@gandalf.local.home/
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The duration type is a 64 long value, not an int. This was
causing some long noise to report wrong values.
Change the duration to a 64 bits value.
Link: https://lkml.kernel.org/r/a93d8a8378c7973e9c609de05826533c9e977939.1668692096.git.bristot@kernel.org
Cc: stable@vger.kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Before current_user_event_group(), it has allocated memory and save it
in @name, this should freed before return error.
Link: https://lkml.kernel.org/r/20221115014445.158419-1-xiujianfeng@huawei.com
Fixes: e5d271812e ("tracing/user_events: Move pages/locks into groups to prepare for namespaces")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There are a couple of missing * in comment blocks. Fix these.
Cleans up two clang warnings:
kernel/trace/trace_events_hist.c:986: warning: bad line:
kernel/trace/trace_events_hist.c:3229: warning: bad line:
Link: https://lkml.kernel.org/r/20221020133019.1547587-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmN6wAgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG0EYH/3/RO90NbrFItraN
Lzr+d3VdbGjTu8xd1M+PRTmwh3zxLpB+Jwqr0T0A2gzL9B/D+AUPUJdrCVbv9DqS
FLJAVqoeV20dNBAHSffOOLPsgCZ+Eu+LzlNN7Iqde0e8cyZICFMNktitui84Xm/i
1NgFVgz9OZ6+aieYvUj3FrFq0p8GTIaC/oybDZrxYKcO8ZzKVMJ11swRw10wwq0g
qOOECvV3w7wlQ8upQZkzFxItKFc7EexZI6R4elXeGSJJ9Hlc092dv/zsKB9dwV+k
WcwkJrZRoezYXzgGBFxUcQtzi+ethjrPjuJuM1rYLUSIcfIW/0lkaSLgRoBu8D+I
1GfXkXs=
=gt6P
-----END PGP SIGNATURE-----
Merge tag 'v6.1-rc6' into x86/core, to resolve conflicts
Resolve conflicts between these commits in arch/x86/kernel/asm-offsets.c:
# upstream:
debc5a1ec0 ("KVM: x86: use a separate asm-offsets.c file")
# retbleed work in x86/core:
5d8213864a ("x86/retbleed: Add SKL return thunk")
... and these commits in include/linux/bpf.h:
# upstram:
18acb7fac2 ("bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")")
# x86/core commits:
931ab63664 ("x86/ibt: Implement FineIBT")
bea75b3389 ("x86/Kconfig: Introduce function padding")
The latter two modify BPF_DISPATCHER_ATTRIBUTES(), which was removed upstream.
Conflicts:
arch/x86/kernel/asm-offsets.c
include/linux/bpf.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Fix possible NULL pointer dereference on trace_event_file in kprobe_event_gen_test_exit()
- Fix NULL pointer dereference for trace_array in kprobe_event_gen_test_exit()
- Fix memory leak of filter string for eprobes
- Fix a possible memory leak in rethook_alloc()
- Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
which can cause a possible use-after-free
- Fix warning in eprobe filter creation
- Fix eprobe filter creation as it picked the wrong event for the fields
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY3qRDhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgCdAP0cB1ZjmMM7O8OFdnrTV0jfavZTnNNC
ut9sczpYQ0upcwEArmVvB+H3sTM6Y7PrsQEUn8gsc7WmieUoDAOr0hIe4AI=
=DGVC
-----END PGP SIGNATURE-----
Merge tag 'trace-probes-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing/probes fixes from Steven Rostedt:
- Fix possible NULL pointer dereference on trace_event_file in
kprobe_event_gen_test_exit()
- Fix NULL pointer dereference for trace_array in
kprobe_event_gen_test_exit()
- Fix memory leak of filter string for eprobes
- Fix a possible memory leak in rethook_alloc()
- Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case which
can cause a possible use-after-free
- Fix warning in eprobe filter creation
- Fix eprobe filter creation as it picked the wrong event for the
fields
* tag 'trace-probes-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/eprobe: Fix eprobe filter to make a filter correctly
tracing/eprobe: Fix warning in filter creation
kprobes: Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
rethook: fix a potential memleak in rethook_alloc()
tracing/eprobe: Fix memory leak of filter string
tracing: kprobe: Fix potential null-ptr-deref on trace_array in kprobe_event_gen_test_exit()
tracing: kprobe: Fix potential null-ptr-deref on trace_event_file in kprobe_event_gen_test_exit()
The flag that tells the event to call its triggers after reading the event
is set for eprobes after the eprobe is enabled. This leads to a race where
the eprobe may be triggered at the beginning of the event where the record
information is NULL. The eprobe then dereferences the NULL record causing
a NULL kernel pointer bug.
Test for a NULL record to keep this from happening.
Link: https://lore.kernel.org/linux-trace-kernel/20221116192552.1066630-1-rafaelmendsr@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221117214249.2addbe10@gandalf.local.home
Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Kfuncs currently support specifying the KF_TRUSTED_ARGS flag to signal
to the verifier that it should enforce that a BPF program passes it a
"safe", trusted pointer. Currently, "safe" means that the pointer is
either PTR_TO_CTX, or is refcounted. There may be cases, however, where
the kernel passes a BPF program a safe / trusted pointer to an object
that the BPF program wishes to use as a kptr, but because the object
does not yet have a ref_obj_id from the perspective of the verifier, the
program would be unable to pass it to a KF_ACQUIRE | KF_TRUSTED_ARGS
kfunc.
The solution is to expand the set of pointers that are considered
trusted according to KF_TRUSTED_ARGS, so that programs can invoke kfuncs
with these pointers without getting rejected by the verifier.
There is already a PTR_UNTRUSTED flag that is set in some scenarios,
such as when a BPF program reads a kptr directly from a map
without performing a bpf_kptr_xchg() call. These pointers of course can
and should be rejected by the verifier. Unfortunately, however,
PTR_UNTRUSTED does not cover all the cases for safety that need to
be addressed to adequately protect kfuncs. Specifically, pointers
obtained by a BPF program "walking" a struct are _not_ considered
PTR_UNTRUSTED according to BPF. For example, say that we were to add a
kfunc called bpf_task_acquire(), with KF_ACQUIRE | KF_TRUSTED_ARGS, to
acquire a struct task_struct *. If we only used PTR_UNTRUSTED to signal
that a task was unsafe to pass to a kfunc, the verifier would mistakenly
allow the following unsafe BPF program to be loaded:
SEC("tp_btf/task_newtask")
int BPF_PROG(unsafe_acquire_task,
struct task_struct *task,
u64 clone_flags)
{
struct task_struct *acquired, *nested;
nested = task->last_wakee;
/* Would not be rejected by the verifier. */
acquired = bpf_task_acquire(nested);
if (!acquired)
return 0;
bpf_task_release(acquired);
return 0;
}
To address this, this patch defines a new type flag called PTR_TRUSTED
which tracks whether a PTR_TO_BTF_ID pointer is safe to pass to a
KF_TRUSTED_ARGS kfunc or a BPF helper function. PTR_TRUSTED pointers are
passed directly from the kernel as a tracepoint or struct_ops callback
argument. Any nested pointer that is obtained from walking a PTR_TRUSTED
pointer is no longer PTR_TRUSTED. From the example above, the struct
task_struct *task argument is PTR_TRUSTED, but the 'nested' pointer
obtained from 'task->last_wakee' is not PTR_TRUSTED.
A subsequent patch will add kfuncs for storing a task kfunc as a kptr,
and then another patch will add selftests to validate.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In subsequent patches we'll arrange for architectures to have an
ftrace_regs which is entirely distinct from pt_regs. In preparation for
this, we need to minimize the use of pt_regs to where strictly necessary
in the core ftrace code.
This patch adds new ftrace_regs_{get,set}_*() helpers which can be used
to manipulate ftrace_regs. When CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y,
these can always be used on any ftrace_regs, and when
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=n these can be used when regs are
available. A new ftrace_regs_has_args(fregs) helper is added which code
can use to check when these are usable.
Co-developed-by: Florent Revest <revest@chromium.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221103170520.931305-4-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
In subsequent patches we'll arrange for architectures to have an
ftrace_regs which is entirely distinct from pt_regs. In preparation for
this, we need to minimize the use of pt_regs to where strictly
necessary in the core ftrace code.
This patch changes the prototype of arch_ftrace_set_direct_caller() to
take ftrace_regs rather than pt_regs, and moves the extraction of the
pt_regs into arch_ftrace_set_direct_caller().
On x86, arch_ftrace_set_direct_caller() can be used even when
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=n, and <linux/ftrace.h> defines
struct ftrace_regs. Due to this, it's necessary to define
arch_ftrace_set_direct_caller() as a macro to avoid using an incomplete
type. I've also moved the body of arch_ftrace_set_direct_caller() after
the CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y defineidion of struct
ftrace_regs.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20221103170520.931305-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Some architectures (powerpc) may not support ftrace locations being nop'ed
out at build time. Introduce CONFIG_HAVE_OBJTOOL_NOP_MCOUNT for objtool, as
a means for architectures to enable nop'ing of ftrace locations. Add --mnop
as an option to objtool --mcount, to indicate support for the same.
Also, make sure that --mnop can be passed as an option to objtool only when
--mcount is passed.
Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Sathvika Vasireddy <sv@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221114175754.1131267-12-sv@linux.ibm.com
Entries in list 'tr->err_log' will be reused after entry number
exceed TRACING_LOG_ERRS_MAX.
The cmd string of the to be reused entry will be freed first then
allocated a new one. If the allocation failed, then the entry will
still be in list 'tr->err_log' but its 'cmd' field is set to be NULL,
later access of 'cmd' is risky.
Currently above problem can cause the loss of 'cmd' information of first
entry in 'tr->err_log'. When execute `cat /sys/kernel/tracing/error_log`,
reproduce logs like:
[ 37.495100] trace_kprobe: error: Maxactive is not for kprobe(null) ^
[ 38.412517] trace_kprobe: error: Maxactive is not for kprobe
Command: p4:myprobe2 do_sys_openat2
^
Link: https://lore.kernel.org/linux-trace-kernel/20221114104632.3547266-1-zhengyejian1@huawei.com
Fixes: 1581a884b7 ("tracing: Remove size restriction on tracing_log_err cmd strings")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since the eprobe filter was defined based on the eprobe's trace event
itself, it doesn't work correctly. Use the original trace event of
the eprobe when making the filter so that the filter works correctly.
Without this fix:
# echo 'e syscalls/sys_enter_openat \
flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
# echo 1 > events/eprobes/sys_enter_openat/enable
[ 114.551550] event trace: Could not enable event sys_enter_openat
-bash: echo: write error: Invalid argument
With this fix:
# echo 'e syscalls/sys_enter_openat \
flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
# echo 1 > events/eprobes/sys_enter_openat/enable
# tail trace
cat-241 [000] ...1. 266.498449: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
cat-242 [000] ...1. 266.977640: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
Link: https://lore.kernel.org/all/166823166395.1385292.8931770640212414483.stgit@devnote3/
Fixes: 752be5c5c9 ("tracing/eprobe: Add eprobe filter support")
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Tested-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
In rethook_alloc(), the variable rh is not freed or passed out
if handler is NULL, which could lead to a memleak, fix it.
Link: https://lore.kernel.org/all/20221110104438.88099-1-yiyang13@huawei.com/
[Masami: Add "rethook:" tag to the title.]
Fixes: 54ecbe6f1e ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Acke-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The @ftrace_mod is allocated by kzalloc(), so both the members {prev,next}
of @ftrace_mode->list are NULL, it's not a valid state to call list_del().
If kstrdup() for @ftrace_mod->{func|module} fails, it goes to @out_free
tag and calls free_ftrace_mod() to destroy @ftrace_mod, then list_del()
will write prev->next and next->prev, where null pointer dereference
happens.
BUG: kernel NULL pointer dereference, address: 0000000000000008
Oops: 0002 [#1] PREEMPT SMP NOPTI
Call Trace:
<TASK>
ftrace_mod_callback+0x20d/0x220
? do_filp_open+0xd9/0x140
ftrace_process_regex.isra.51+0xbf/0x130
ftrace_regex_write.isra.52.part.53+0x6e/0x90
vfs_write+0xee/0x3a0
? __audit_filter_op+0xb1/0x100
? auditd_test_task+0x38/0x50
ksys_write+0xa5/0xe0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Kernel panic - not syncing: Fatal exception
So call INIT_LIST_HEAD() to initialize the list member to fix this issue.
Link: https://lkml.kernel.org/r/20221116015207.30858-1-xiujianfeng@huawei.com
Cc: stable@vger.kernel.org
Fixes: 673feb9d76 ("ftrace: Add :mod: caching infrastructure to trace_array")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
rb_head_page_deactivate() expects cpu_buffer to contain a valid list of
->pages, so verify that the list is actually present before calling it.
Found by Linux Verification Center (linuxtesting.org) with the SVACE
static analysis tool.
Link: https://lkml.kernel.org/r/20221114143129.3534443-1-d-tatianin@yandex-team.ru
Cc: stable@vger.kernel.org
Fixes: 77ae365eca ("ring-buffer: make lockless")
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If we can't allocate this size, try something smaller with half of the
size. Its order should be decreased by one instead of divided by two.
Link: https://lkml.kernel.org/r/20221109094434.84046-3-wangwensheng4@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Fixes: a790087554 ("ftrace: Allocate the mcount record pages as groups")
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If the number of mcount entries is an integer multiple of
ENTRIES_PER_PAGE, the page count showing on the console would be wrong.
Link: https://lkml.kernel.org/r/20221109094434.84046-2-wangwensheng4@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Fixes: 5821e1b74f ("function tracing: fix wrong pos computing when read buffer has been fulfilled")
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function ring_buffer_nr_dirty_pages() was created to find out how many
pages are filled in the ring buffer. There's two running counters. One is
incremented whenever a new page is touched (pages_touched) and the other
is whenever a page is read (pages_read). The dirty count is the number
touched minus the number read. This is used to determine if a blocked task
should be woken up if the percentage of the ring buffer it is waiting for
is hit.
The problem is that it does not take into account dropped pages (when the
new writes overwrite pages that were not read). And then the dirty pages
will always be greater than the percentage.
This makes the "buffer_percent" file inaccurate, as the number of dirty
pages end up always being larger than the percentage, event when it's not
and this causes user space to be woken up more than it wants to be.
Add a new counter to keep track of lost pages, and include that in the
accounting of dirty pages so that it is actually accurate.
Link: https://lkml.kernel.org/r/20221021123013.55fb6055@gandalf.local.home
Fixes: 2c2b0a78b3 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently the way polling works on the ring buffer is broken. It will
return immediately if there's any data in the ring buffer whereas a read
will block until the watermark (defined by the tracefs buffer_percent file)
is hit.
That is, a select() or poll() will return as if there's data available,
but then the following read will block. This is broken for the way
select()s and poll()s are supposed to work.
Have the polling on the ring buffer also block the same way reads and
splice does on the ring buffer.
Link: https://lkml.kernel.org/r/20221020231427.41be3f26@gandalf.local.home
Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Primiano Tucci <primiano@google.com>
Cc: stable@vger.kernel.org
Fixes: 1e0d6714ac ("ring-buffer: Do not wake up a splice waiter when page is not full")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since commit ab51e15d53 ("fprobe: Introduce FPROBE_FL_KPROBE_SHARED flag
for fprobe") introduced fprobe_kprobe_handler() for fprobe::ops::func,
unregister_fprobe() fails to unregister the registered if user specifies
FPROBE_FL_KPROBE_SHARED flag.
Moreover, __register_ftrace_function() is possible to change the
ftrace_ops::func, thus we have to check fprobe::ops::saved_func instead.
To check it correctly, it should confirm the fprobe::ops::saved_func is
either fprobe_handler() or fprobe_kprobe_handler().
Link: https://lore.kernel.org/all/166677683946.1459107.15997653945538644683.stgit@devnote3/
Fixes: cad9931f64 ("fprobe: Add ftrace based probe APIs")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Check if fp->rethook succeeded to be allocated. Otherwise, if
rethook_alloc() fails, then we end up dereferencing a NULL pointer in
rethook_add_node().
Link: https://lore.kernel.org/all/20221025031209.954836-1-rafaelmendsr@gmail.com/
Fixes: 5b0ab78998 ("fprobe: Add exit_handler support")
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
KASAN reported a use-after-free with ftrace ops [1]. It was found from
vmcore that perf had registered two ops with the same content
successively, both dynamic. After unregistering the second ops, a
use-after-free occurred.
In ftrace_shutdown(), when the second ops is unregistered, the
FTRACE_UPDATE_CALLS command is not set because there is another enabled
ops with the same content. Also, both ops are dynamic and the ftrace
callback function is ftrace_ops_list_func, so the
FTRACE_UPDATE_TRACE_FUNC command will not be set. Eventually the value
of 'command' will be 0 and ftrace_shutdown() will skip the rcu
synchronization.
However, ftrace may be activated. When the ops is released, another CPU
may be accessing the ops. Add the missing synchronization to fix this
problem.
[1]
BUG: KASAN: use-after-free in __ftrace_ops_list_func kernel/trace/ftrace.c:7020 [inline]
BUG: KASAN: use-after-free in ftrace_ops_list_func+0x2b0/0x31c kernel/trace/ftrace.c:7049
Read of size 8 at addr ffff56551965bbc8 by task syz-executor.2/14468
CPU: 1 PID: 14468 Comm: syz-executor.2 Not tainted 5.10.0 #7
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x40c arch/arm64/kernel/stacktrace.c:132
show_stack+0x30/0x40 arch/arm64/kernel/stacktrace.c:196
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b4/0x248 lib/dump_stack.c:118
print_address_description.constprop.0+0x28/0x48c mm/kasan/report.c:387
__kasan_report mm/kasan/report.c:547 [inline]
kasan_report+0x118/0x210 mm/kasan/report.c:564
check_memory_region_inline mm/kasan/generic.c:187 [inline]
__asan_load8+0x98/0xc0 mm/kasan/generic.c:253
__ftrace_ops_list_func kernel/trace/ftrace.c:7020 [inline]
ftrace_ops_list_func+0x2b0/0x31c kernel/trace/ftrace.c:7049
ftrace_graph_call+0x0/0x4
__might_sleep+0x8/0x100 include/linux/perf_event.h:1170
__might_fault mm/memory.c:5183 [inline]
__might_fault+0x58/0x70 mm/memory.c:5171
do_strncpy_from_user lib/strncpy_from_user.c:41 [inline]
strncpy_from_user+0x1f4/0x4b0 lib/strncpy_from_user.c:139
getname_flags+0xb0/0x31c fs/namei.c:149
getname+0x2c/0x40 fs/namei.c:209
[...]
Allocated by task 14445:
kasan_save_stack+0x24/0x50 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc mm/kasan/common.c:479 [inline]
__kasan_kmalloc.constprop.0+0x110/0x13c mm/kasan/common.c:449
kasan_kmalloc+0xc/0x14 mm/kasan/common.c:493
kmem_cache_alloc_trace+0x440/0x924 mm/slub.c:2950
kmalloc include/linux/slab.h:563 [inline]
kzalloc include/linux/slab.h:675 [inline]
perf_event_alloc.part.0+0xb4/0x1350 kernel/events/core.c:11230
perf_event_alloc kernel/events/core.c:11733 [inline]
__do_sys_perf_event_open kernel/events/core.c:11831 [inline]
__se_sys_perf_event_open+0x550/0x15f4 kernel/events/core.c:11723
__arm64_sys_perf_event_open+0x6c/0x80 kernel/events/core.c:11723
[...]
Freed by task 14445:
kasan_save_stack+0x24/0x50 mm/kasan/common.c:48
kasan_set_track+0x24/0x34 mm/kasan/common.c:56
kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:358
__kasan_slab_free.part.0+0x11c/0x1b0 mm/kasan/common.c:437
__kasan_slab_free mm/kasan/common.c:445 [inline]
kasan_slab_free+0x2c/0x40 mm/kasan/common.c:446
slab_free_hook mm/slub.c:1569 [inline]
slab_free_freelist_hook mm/slub.c:1608 [inline]
slab_free mm/slub.c:3179 [inline]
kfree+0x12c/0xc10 mm/slub.c:4176
perf_event_alloc.part.0+0xa0c/0x1350 kernel/events/core.c:11434
perf_event_alloc kernel/events/core.c:11733 [inline]
__do_sys_perf_event_open kernel/events/core.c:11831 [inline]
__se_sys_perf_event_open+0x550/0x15f4 kernel/events/core.c:11723
[...]
Link: https://lore.kernel.org/linux-trace-kernel/20221103031010.166498-1-lihuafei1@huawei.com
Fixes: edb096e007 ("ftrace: Fix memleak when unregistering dynamic ops when tracing disabled")
Cc: stable@vger.kernel.org
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
On some machines the number of listed CPUs may be bigger than the actual
CPUs that exist. The tracing subsystem allocates a per_cpu directory with
access to the per CPU ring buffer via a cpuX file. But to save space, the
ring buffer will only allocate buffers for online CPUs, even though the
CPU array will be as big as the nr_cpu_ids.
With the addition of waking waiters on the ring buffer when closing the
file, the ring_buffer_wake_waiters() now needs to make sure that the
buffer is allocated (with the irq_work allocated with it) before trying to
wake waiters, as it will cause a NULL pointer dereference.
While debugging this, I added a NULL check for the buffer itself (which is
OK to do), and also NULL pointer checks against buffer->buffers (which is
not fine, and will WARN) as well as making sure the CPU number passed in
is within the nr_cpu_ids (which is also not fine if it isn't).
Link: https://lore.kernel.org/all/87h6zklb6n.wl-tiwai@suse.de/
Link: https://lore.kernel.org/all/CAM6Wdxc0KRJMXVAA0Y=u6Jh2V=uWB-_Fn6M4xRuNppfXzL1mUg@mail.gmail.com/
Link: https://lkml.kernel.org/linux-trace-kernel/20221101191009.1e7378c8@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Steven Noonan <steven.noonan@gmail.com>
Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1204705
Reported-by: Takashi Iwai <tiwai@suse.de>
Reported-by: Roland Ruckerbauer <roland.rucky@gmail.com>
Fixes: f3ddb74ad0 ("tracing: Wake up ring buffer waiters on closing of the file")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY2GuKgAKCRDbK58LschI
gy32AP9PI0e/bUGDExKJ8g97PeeEtnpj4TTI6g+XKILtYnyXlgD/Rk4j2D/f3IBF
Ha9TmqYvAUim+U/g50vUrNuoNLNJ5w8=
=OKC1
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next 2022-11-02
We've added 70 non-merge commits during the last 14 day(s) which contain
a total of 96 files changed, 3203 insertions(+), 640 deletions(-).
The main changes are:
1) Make cgroup local storage available to non-cgroup attached BPF programs
such as tc BPF ones, from Yonghong Song.
2) Avoid unnecessary deadlock detection and failures wrt BPF task storage
helpers, from Martin KaFai Lau.
3) Add LLVM disassembler as default library for dumping JITed code
in bpftool, from Quentin Monnet.
4) Various kprobe_multi_link fixes related to kernel modules,
from Jiri Olsa.
5) Optimize x86-64 JIT with emitting BMI2-based shift instructions,
from Jie Meng.
6) Improve BPF verifier's memory type compatibility for map key/value
arguments, from Dave Marchevsky.
7) Only create mmap-able data section maps in libbpf when data is exposed
via skeletons, from Andrii Nakryiko.
8) Add an autoattach option for bpftool to load all object assets,
from Wang Yufen.
9) Various memory handling fixes for libbpf and BPF selftests,
from Xu Kuohai.
10) Initial support for BPF selftest's vmtest.sh on arm64,
from Manu Bretelle.
11) Improve libbpf's BTF handling to dedup identical structs,
from Alan Maguire.
12) Add BPF CI and denylist documentation for BPF selftests,
from Daniel Müller.
13) Check BPF cpumap max_entries before doing allocation work,
from Florian Lehner.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits)
samples/bpf: Fix typo in README
bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
bpf: check max_entries before allocating memory
bpf: Fix a typo in comment for DFS algorithm
bpftool: Fix spelling mistake "disasembler" -> "disassembler"
selftests/bpf: Fix bpftool synctypes checking failure
selftests/bpf: Panic on hard/soft lockup
docs/bpf: Add documentation for new cgroup local storage
selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x
selftests/bpf: Add selftests for new cgroup local storage
selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str
bpftool: Support new cgroup local storage
libbpf: Support new cgroup local storage
bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
bpf: Refactor some inode/task/sk storage functions for reuse
bpf: Make struct cgroup btf id global
selftests/bpf: Tracing prog can still do lookup under busy lock
selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection
bpf: Add new bpf_task_storage_delete proto with no deadlock detection
bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check
...
====================
Link: https://lore.kernel.org/r/20221102062120.5724-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to sk/inode/task storage, implement similar cgroup local storage.
There already exists a local storage implementation for cgroup-attached
bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases such that non-cgroup
attached bpf progs wants to access cgroup local storage data. For example,
tc egress prog has access to sk and cgroup. It is possible to use
sk local storage to emulate cgroup local storage by storing data in socket.
But this is a waste as it could be lots of sockets belonging to a particular
cgroup. Alternatively, a separate map can be created with cgroup id as the key.
But this will introduce additional overhead to manipulate the new map.
A cgroup local storage, similar to existing sk/inode/task storage,
should help for this use case.
The life-cycle of storage is managed with the life-cycle of the
cgroup struct. i.e. the storage is destroyed along with the owning cgroup
with a call to bpf_cgrp_storage_free() when cgroup itself
is deleted.
The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.
Typically, the following code is used to get the current cgroup:
struct task_struct *task = bpf_get_current_task_btf();
... task->cgroups->dfl_cgrp ...
and in structure task_struct definition:
struct task_struct {
....
struct css_set __rcu *cgroups;
....
}
With sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
So the current implementation only supports non-sleepable program and supporting
sleepable program will be the next step together with adding rcu_read_lock
protection for rcu tagged structures.
Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
for cgroup storage available to non-cgroup-attached bpf programs. The old
cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
functionality. While old cgroup storage pre-allocates storage memory, the new
mechanism can also pre-allocate with a user space bpf_map_update_elem() call
to avoid potential run-time memory allocation failure.
Therefore, the new cgroup storage can provide all functionality w.r.t.
the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to
BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
be deprecated since the new one can provide the same functionality.
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The bpf_lsm and bpf_iter do not recur that will cause a deadlock.
The situation is similar to the bpf_pid_task_storage_delete_elem()
which is called from the syscall map_delete_elem. It does not need
deadlock detection. Otherwise, it will cause unnecessary failure
when calling the bpf_task_storage_delete() helper.
This patch adds bpf_task_storage_delete proto that does not do deadlock
detection. It will be used by bpf_lsm and bpf_iter program.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-8-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The bpf_lsm and bpf_iter do not recur that will cause a deadlock.
The situation is similar to the bpf_pid_task_storage_lookup_elem()
which is called from the syscall map_lookup_elem. It does not need
deadlock detection. Otherwise, it will cause unnecessary failure
when calling the bpf_task_storage_get() helper.
This patch adds bpf_task_storage_get proto that does not do deadlock
detection. It will be used by bpf_lsm and bpf_iter programs.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-6-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds the "_recur" naming to the bpf_task_storage_{get,delete}
proto. In a latter patch, they will only be used by the tracing
programs that requires a deadlock detection because a tracing
prog may use bpf_task_storage_{get,delete} recursively and cause a
deadlock.
Another following patch will add a different helper proto for the non
tracing programs because they do not need the deadlock prevention.
This patch does this rename to prepare for this future proto
additions.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently we allow to create kprobe multi link on function from kernel
module, but we don't take the module reference to ensure it's not
unloaded while we are tracing it.
The multi kprobe link is based on fprobe/ftrace layer which takes
different approach and releases ftrace hooks when module is unloaded
even if there's tracer registered on top of it.
Adding code that gathers all the related modules for the link and takes
their references before it's attached. All kernel module references are
released after link is unregistered.
Note that we do it the same way already for trampoline probes
(but for single address).
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20221025134148.3300700-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Renaming __bpf_kprobe_multi_cookie_cmp to bpf_kprobe_multi_addrs_cmp,
because it's more suitable to current and upcoming code.
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20221025134148.3300700-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently ftrace_lookup_symbols iterates only over core symbols,
adding module_kallsyms_on_each_symbol call to check on modules
symbols as well.
Also removing 'args.found == args.cnt' condition, because it's
already checked in kallsyms_callback function.
Also removing 'err < 0' check, because both *kallsyms_on_each_symbol
functions do not return error.
Reported-by: Martynas Pumputis <m@lambda.lt>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20221025134148.3300700-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Rework how SIGTRAPs get delivered to events to address a bunch of
problems with it. Add a selftest for that too
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmNVE9UACgkQEsHwGGHe
VUpodRAAsn3s+xX02qfxRG0mYX/1nnOmnFnDhmUEPAZpDXjB6g3PFXAh26F9tw92
dLm8fbfp8W3tK7TQrGyOGWMnb7oj4gEoXAcQBXvsq2KOJMVwRHHKwCeSSJ89DLAW
zZEl2nzgea2cGyZn2RZJvhmbm8YdFON3v++Vv34yovs1MMoCcD7FOPLLGWkD2gk5
6PK6lIlWEI/+vYKWhZdxgk6/PanBInXRUnbaBVj42US1XwfDLzPBEi9yyUUJQrht
CQhfTpHn4Z5MC/hJTlFat4Jlaajql4JBcQ1SS5LW59M+6gdlBK4tL6zFt10zvU2m
+kywOOIWiYLRRgFf6idGO45P5BuWOdmsXEaEg5KW7b6nJfGvgkd7WUYgCVOgtEY1
r4Mf4hAQUunDHGQ4e8eHk7XemFJsoBSweYCTQ2O0yr/QzO2M6QBi/BR9PzUajyH+
yShKEfrxXt4595BMH0nonSMpTKcE4Zxdj06LZSnGecEN8UUlx/n49uYwhFNdUkqM
s6Wz6kSR76YRlKUmYnNzP1gkY6nJZ1nR6z7SjmMkioxf3VxhT9SY8K587r6hRUlr
/NVA69iUhJy75VdttxZEmZzB03A7AjdudmZEisF0ImEmB1hxzYLHcDKJMTIj/r4/
f8OXCg5ACKhFlnx1SdBVRtA+6+5ab368Fs2rItJQ4dzdxRi6VVY=
=DXG7
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Fix raw data handling when perf events are used in bpf
- Rework how SIGTRAPs get delivered to events to address a bunch of
problems with it. Add a selftest for that too
* tag 'perf_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bpf: Fix sample_flags for bpf_perf_event_output
selftests/perf_events: Add a SIGTRAP stress test with disables
perf: Fix missing SIGTRAPs
As previous commit, 'blk_trace_cleanup' will stop block trace if
block trace's state is 'Blktrace_running'.
So remove unnessary stop block trace in 'blk_trace_shutdown'.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019033602.752383-4-yebin@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When test as follows:
step1: ioctl(sda, BLKTRACESETUP, &arg)
step2: ioctl(sda, BLKTRACESTART, NULL)
step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
step4: ioctl(sda, BLKTRACESETUP, &arg)
Got issue as follows:
debugfs: File 'dropped' in directory 'sda' already present!
debugfs: File 'msg' in directory 'sda' already present!
debugfs: File 'trace0' in directory 'sda' already present!
And also find syzkaller report issue like "KASAN: use-after-free Read in relay_switch_subbuf"
"https://syzkaller.appspot.com/bug?id=13849f0d9b1b818b087341691be6cc3ac6a6bfb7"
If remove block trace without stop(BLKTRACESTOP) block trace, '__blk_trace_remove'
will just set 'q->blk_trace' with NULL. However, debugfs file isn't removed, so
will report file already present when call BLKTRACESETUP.
static int __blk_trace_remove(struct request_queue *q)
{
struct blk_trace *bt;
bt = rcu_replace_pointer(q->blk_trace, NULL,
lockdep_is_held(&q->debugfs_mutex));
if (!bt)
return -EINVAL;
if (bt->trace_state != Blktrace_running)
blk_trace_cleanup(q, bt);
return 0;
}
If do test as follows:
step1: ioctl(sda, BLKTRACESETUP, &arg)
step2: ioctl(sda, BLKTRACESTART, NULL)
step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
step4: remove sda
There will remove debugfs directory which will remove recursively all file
under directory.
>> blk_release_queue
>> debugfs_remove_recursive(q->debugfs_dir)
So all files which created in 'do_blk_trace_setup' are removed, and
'dentry->d_inode' is NULL. But 'q->blk_trace' is still in 'running_trace_lock',
'trace_note_tsk' will traverse 'running_trace_lock' all nodes.
>>trace_note_tsk
>> trace_note
>> relay_reserve
>> relay_switch_subbuf
>> d_inode(buf->dentry)->i_size
To solve above issues, reference commit '5afedf670caf', call 'blk_trace_cleanup'
unconditionally in '__blk_trace_remove' and first stop block trace in
'blk_trace_cleanup'.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019033602.752383-3-yebin@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since ftrace has trampolines, don't use thunks for the __fentry__ site
but instead require that every function called from there includes
accounting. This very much includes all the direct-call functions.
Additionally, ftrace uses ROP tricks in two places:
- return_to_handler(), and
- ftrace_regs_caller() when pt_regs->orig_ax is set by a direct-call.
return_to_handler() already uses a retpoline to replace an
indirect-jump to defeat IBT, since this is a jump-type retpoline, make
sure there is no accounting done and ALTERNATIVE the RET into a ret.
ftrace_regs_caller() does much the same and gets the same treatment.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.927545073@infradead.org
* Raw data is also filled by bpf_perf_event_output.
* Add sample_flags to indicate raw data.
* This eliminates the segfaults as shown below:
Run ./samples/bpf/trace_output
BUG pid 9 cookie 1001000000004 sized 4
BUG pid 9 cookie 1001000000004 sized 4
BUG pid 9 cookie 1001000000004 sized 4
Segmentation fault (core dumped)
Fixes: 838d9bb62d ("perf: Use sample_flags for raw_data")
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/20221007081327.1047552-1-sumanthk@linux.ibm.com
- Found that the synthetic events were using strlen/strscpy() on values
that could have come from userspace, and that is bad.
Consolidate the string logic of kprobe and eprobe and extend it to
the synthetic events to safely process string addresses.
- Clean up content of text dump in ftrace_bug() where the output does not
make char reads into signed and sign extending the byte output.
- Fix some kernel docs in the ring buffer code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY0c6GBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qpDNAQCuw9YTeNMU4zxFqBg4/JCbfpnWQGj4
Qdl2u3WtEvTzrgEA85Q01swCYRKdrGPCrFemZ3lm6PGzpGruh+BfD4qRMwk=
=F5kK
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Found that the synthetic events were using strlen/strscpy() on values
that could have come from userspace, and that is bad.
Consolidate the string logic of kprobe and eprobe and extend it to
the synthetic events to safely process string addresses.
- Clean up content of text dump in ftrace_bug() where the output does
not make char reads into signed and sign extending the byte output.
- Fix some kernel docs in the ring buffer code.
* tag 'trace-v6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix reading strings from synthetic events
tracing: Add "(fault)" name injection to kernel probes
tracing: Move duplicate code of trace_kprobe/eprobe.c into header
ring-buffer: Fix kernel-doc
ftrace: Fix char print issue in print_ip_ins()
The follow commands caused a crash:
# cd /sys/kernel/tracing
# echo 's:open char file[]' > dynamic_events
# echo 'hist:keys=common_pid:file=filename:onchange($file).trace(open,$file)' > events/syscalls/sys_enter_openat/trigger'
# echo 1 > events/synthetic/open/enable
BOOM!
The problem is that the synthetic event field "char file[]" will read
the value given to it as a string without any memory checks to make sure
the address is valid. The above example will pass in the user space
address and the sythetic event code will happily call strlen() on it
and then strscpy() where either one will cause an oops when accessing
user space addresses.
Use the helper functions from trace_kprobe and trace_eprobe that can
read strings safely (and actually succeed when the address is from user
space and the memory is mapped in).
Now the above can show:
packagekitd-1721 [000] ...2. 104.597170: open: file=/usr/lib/rpm/fileattrs/cmake.attr
in:imjournal-978 [006] ...2. 104.599642: open: file=/var/lib/rsyslog/imjournal.state.tmp
packagekitd-1721 [000] ...2. 104.626308: open: file=/usr/lib/rpm/fileattrs/debuginfo.attr
Link: https://lkml.kernel.org/r/20221012104534.826549315@goodmis.org
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Fixes: bd82631d7c ("tracing: Add support for dynamic strings to synthetic events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Have the specific functions for kernel probes that read strings to inject
the "(fault)" name directly. trace_probes.c does this too (for uprobes)
but as the code to read strings are going to be used by synthetic events
(and perhaps other utilities), it simplifies the code by making sure those
other uses do not need to implement the "(fault)" name injection as well.
Link: https://lkml.kernel.org/r/20221012104534.644803645@goodmis.org
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Fixes: bd82631d7c ("tracing: Add support for dynamic strings to synthetic events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The functions:
fetch_store_strlen_user()
fetch_store_strlen()
fetch_store_string_user()
fetch_store_string()
are identical in both trace_kprobe.c and trace_eprobe.c. Move them into
a new header file trace_probe_kernel.h to share it. This code will later
be used by the synthetic events as well.
Marked for stable as a fix for a crash in synthetic events requires it.
Link: https://lkml.kernel.org/r/20221012104534.467668078@goodmis.org
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Fixes: bd82631d7c ("tracing: Add support for dynamic strings to synthetic events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
kernel/trace/ring_buffer.c:895: warning: expecting prototype for ring_buffer_nr_pages_dirty(). Prototype was for ring_buffer_nr_dirty_pages() instead.
kernel/trace/ring_buffer.c:5313: warning: expecting prototype for ring_buffer_reset_cpu(). Prototype was for ring_buffer_reset_online_cpus() instead.
kernel/trace/ring_buffer.c:5382: warning: expecting prototype for rind_buffer_empty(). Prototype was for ring_buffer_empty() instead.
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2340
Link: https://lkml.kernel.org/r/20221009020642.12506-1-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When ftrace bug happened, following log shows every hex data in
problematic ip address:
actual: ffffffe8:6b:ffffffd9:01:21
But so many 'f's seem a little confusing, and that is because format
'%x' being used to print signed chars in array 'ins'. As suggested
by Joe, change to use format "%*phC" to print array 'ins'.
After this patch, the log is like:
actual: e8:6b:d9:01:21
Link: https://lkml.kernel.org/r/20221011120352.1878494-1-zhengyejian1@huawei.com
Fixes: 6c14133d2d ("ftrace: Do not blindly read the ip address in ftrace_bug()")
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Major changes:
- Changed location of tracing repo from personal git repo to:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
- Added Masami Hiramatsu as co-maintainer
- Updated MAINTAINERS file to separate out FTRACE as it is
more than just TRACING.
Minor changes:
- Added Mark Rutland as FTRACE reviewer
- Updated user_events to make it on its way to remove the BROKEN tag.
The changes should now be acceptable but will run it through
a cycle and hopefully we can remove the BROKEN tag next release.
- Added filtering to eprobes
- Added a delta time to the benchmark trace event
- Have the histogram and filter callbacks called via a switch
statement instead of indirect functions. This speeds it up to
avoid retpolines.
- Add a way to wake up ring buffer waiters waiting for the
ring buffer to fill up to its watermark.
- New ioctl() on the trace_pipe_raw file to wake up ring buffer
waiters.
- Wake up waiters when the ring buffer is disabled.
A reader may block when the ring buffer is disabled,
but if it was blocked when the ring buffer is disabled
it should then wake up.
Fixes:
- Allow splice to read partially read ring buffer pages
Fixes splice never moving forward.
- Fix inverted compare that made the "shortest" ring buffer
wait queue actually the longest.
- Fix a race in the ring buffer between resetting a page when
a writer goes to another page, and the reader.
- Fix ftrace accounting bug when function hooks are added at
boot up before the weak functions are set to "disabled".
- Fix bug that freed a user allocated snapshot buffer when
enabling a tracer.
- Fix possible recursive locks in osnoise tracer
- Fix recursive locking direct functions
- And other minor clean ups and fixes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYz70cxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qpLKAP4+yOje7ZY/b3R4tTx0EIWiKdhqPx6t
Nvam2+WR2PN3QQEAqiK2A+oIbh3Zjp1MyhQWuulssWKtSTXhIQkbs7ioYAc=
=MsQw
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"Major changes:
- Changed location of tracing repo from personal git repo to:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
- Added Masami Hiramatsu as co-maintainer
- Updated MAINTAINERS file to separate out FTRACE as it is more than
just TRACING.
Minor changes:
- Added Mark Rutland as FTRACE reviewer
- Updated user_events to make it on its way to remove the BROKEN tag.
The changes should now be acceptable but will run it through a
cycle and hopefully we can remove the BROKEN tag next release.
- Added filtering to eprobes
- Added a delta time to the benchmark trace event
- Have the histogram and filter callbacks called via a switch
statement instead of indirect functions. This speeds it up to avoid
retpolines.
- Add a way to wake up ring buffer waiters waiting for the ring
buffer to fill up to its watermark.
- New ioctl() on the trace_pipe_raw file to wake up ring buffer
waiters.
- Wake up waiters when the ring buffer is disabled. A reader may
block when the ring buffer is disabled, but if it was blocked when
the ring buffer is disabled it should then wake up.
Fixes:
- Allow splice to read partially read ring buffer pages. This fixes
splice never moving forward.
- Fix inverted compare that made the "shortest" ring buffer wait
queue actually the longest.
- Fix a race in the ring buffer between resetting a page when a
writer goes to another page, and the reader.
- Fix ftrace accounting bug when function hooks are added at boot up
before the weak functions are set to "disabled".
- Fix bug that freed a user allocated snapshot buffer when enabling a
tracer.
- Fix possible recursive locks in osnoise tracer
- Fix recursive locking direct functions
- Other minor clean ups and fixes"
* tag 'trace-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (44 commits)
ftrace: Create separate entry in MAINTAINERS for function hooks
tracing: Update MAINTAINERS to reflect new tracing git repo
tracing: Do not free snapshot if tracer is on cmdline
ftrace: Still disable enabled records marked as disabled
tracing/user_events: Move pages/locks into groups to prepare for namespaces
tracing: Add Masami Hiramatsu as co-maintainer
tracing: Remove unused variable 'dups'
MAINTAINERS: add myself as a tracing reviewer
ring-buffer: Fix race between reset page and reading page
tracing/user_events: Update ABI documentation to align to bits vs bytes
tracing/user_events: Use bits vs bytes for enabled status page data
tracing/user_events: Use refcount instead of atomic for ref tracking
tracing/user_events: Ensure user provided strings are safely formatted
tracing/user_events: Use WRITE instead of READ for io vector import
tracing/user_events: Use NULL for strstr checks
tracing: Fix spelling mistake "preapre" -> "prepare"
tracing: Wake up waiters when tracing is disabled
tracing: Add ioctl() to force ring buffer waiters to wake up
tracing: Wake up ring buffer waiters on closing of the file
ring-buffer: Add ring_buffer_wake_waiters()
...
- PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2)
feature support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information,
if available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on
AMD CPUs by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
- HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs and
thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
per-CPU rwsem that is read-locked during most of the key operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot()
and fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/2pMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIMA/+J+MCEVTt9kwZeBtHoPX7iZ5gnq1+McoQ
f6ALX19AO/ZSuA7EBA3cS3Ny5eyGy3ofYUnRW+POezu9CpflLW/5N27R2qkZFrWC
A09B86WH676ZrmXt+oI05rpZ2y/NGw4gJxLLa4/bWF3g9xLfo21i+YGKwdOnNFpl
DEdCVHtjlMcOAU3+on6fOYuhXDcYd7PKGcCfLE7muOMOAtwyj0bUDBt7m+hneZgy
qbZHzDU2DZ5L/LXiMyuZj5rC7V4xUbfZZfXglG38YCW1WTieS3IjefaU2tREhu7I
rRkCK48ULDNNJR3dZK8IzXJRxusq1ICPG68I+nm/K37oZyTZWtfYZWehW/d/TnPa
tUiTwimabz7UUqaGq9ZptxwINcAigax0nl6dZ3EseeGhkDE6j71/3kqrkKPz4jth
+fCwHLOrI3c4Gq5qWgPvqcUlUneKf3DlOMtzPKYg7sMhla2zQmFpYCPzKfm77U/Z
BclGOH3FiwaK6MIjPJRUXTePXqnUseqCR8PCH/UPQUeBEVHFcMvqCaa15nALed8x
dFi76VywR9mahouuLNq6sUNePlvDd2B124PygNwegLlBfY9QmKONg9qRKOnQpuJ6
UprRJjLOOucZ/N/jn6+ShHkqmXsnY2MhfUoHUoMQ0QAI+n++e+2AuePo251kKWr8
QlqKxd9PMQU=
=LcGg
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
"PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2) feature
support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information, if
available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on AMD CPUs
by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs
and thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
per-CPU rwsem that is read-locked during most of the key
operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot() and
fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements"
* tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
perf/hw_breakpoint: Annotate tsk->perf_event_mutex vs ctx->mutex
perf: Fix pmu_filter_match()
perf: Fix lockdep_assert_event_ctx()
perf/x86/amd/lbr: Adjust LBR regardless of filtering
perf/x86/utils: Fix uninitialized var in get_branch_type()
perf/uapi: Define PERF_MEM_SNOOPX_PEER in kernel header file
perf/x86/amd: Support PERF_SAMPLE_PHY_ADDR
perf/x86/amd: Support PERF_SAMPLE_ADDR
perf/x86/amd: Support PERF_SAMPLE_{WEIGHT|WEIGHT_STRUCT}
perf/x86/amd: Support PERF_SAMPLE_DATA_SRC
perf/x86/amd: Add IBS OP_DATA2 DataSrc bit definitions
perf/mem: Introduce PERF_MEM_LVLNUM_{EXTN_MEM|IO}
perf/x86/uncore: Add new Raptor Lake S support
perf/x86/cstate: Add new Raptor Lake S support
perf/x86/msr: Add new Raptor Lake S support
perf/x86: Add new Raptor Lake S support
bpf: Check flags for branch stack in bpf_read_branch_records helper
perf, hw_breakpoint: Fix use-after-free if perf_event_open() fails
perf: Use sample_flags for raw_data
perf: Use sample_flags for addr
...
The ftrace_boot_snapshot and alloc_snapshot cmdline options allocate the
snapshot buffer at boot up for use later. The ftrace_boot_snapshot in
particular requires the snapshot to be allocated because it will take a
snapshot at the end of boot up allowing to see the traces that happened
during boot so that it's not lost when user space takes over.
When a tracer is registered (started) there's a path that checks if it
requires the snapshot buffer or not, and if it does not and it was
allocated it will do a synchronization and free the snapshot buffer.
This is only required if the previous tracer was using it for "max
latency" snapshots, as it needs to make sure all max snapshots are
complete before freeing. But this is only needed if the previous tracer
was using the snapshot buffer for latency (like irqoff tracer and
friends). But it does not make sense to free it, if the previous tracer
was not using it, and the snapshot was allocated by the cmdline
parameters. This basically takes away the point of allocating it in the
first place!
Note, the allocated snapshot worked fine for just trace events, but fails
when a tracer is enabled on the cmdline.
Further investigation, this goes back even further and it does not require
a tracer on the cmdline to fail. Simply enable snapshots and then enable a
tracer, and it will remove the snapshot.
Link: https://lkml.kernel.org/r/20221005113757.041df7fe@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 45ad21ca55 ("tracing: Have trace_array keep track if snapshot buffer is allocated")
Reported-by: Ross Zwisler <zwisler@kernel.org>
Tested-by: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Weak functions started causing havoc as they showed up in the
"available_filter_functions" and this confused people as to why some
functions marked as "notrace" were listed, but when enabled they did
nothing. This was because weak functions can still have fentry calls, and
these addresses get added to the "available_filter_functions" file.
kallsyms is what converts those addresses to names, and since the weak
functions are not listed in kallsyms, it would just pick the function
before that.
To solve this, there was a trick to detect weak functions listed, and
these records would be marked as DISABLED so that they do not get enabled
and are mostly ignored. As the processing of the list of all functions to
figure out what is weak or not can take a long time, this process is put
off into a kernel thread and run in parallel with the rest of start up.
Now the issue happens whet function tracing is enabled via the kernel
command line. As it starts very early in boot up, it can be enabled before
the records that are weak are marked to be disabled. This causes an issue
in the accounting, as the weak records are enabled by the command line
function tracing, but after boot up, they are not disabled.
The ftrace records have several accounting flags and a ref count. The
DISABLED flag is just one. If the record is enabled before it is marked
DISABLED it will get an ENABLED flag and also have its ref counter
incremented. After it is marked for DISABLED, neither the ENABLED flag nor
the ref counter is cleared. There's sanity checks on the records that are
performed after an ftrace function is registered or unregistered, and this
detected that there were records marked as ENABLED with ref counter that
should not have been.
Note, the module loading code uses the DISABLED flag as well to keep its
functions from being modified while its being loaded and some of these
flags may get set in this process. So changing the verification code to
ignore DISABLED records is a no go, as it still needs to verify that the
module records are working too.
Also, the weak functions still are calling a trampoline. Even though they
should never be called, it is dangerous to leave these weak functions
calling a trampoline that is freed, so they should still be set back to
nops.
There's two places that need to not skip records that have the ENABLED
and the DISABLED flags set. That is where the ftrace_ops is processed and
sets the records ref counts, and then later when the function itself is to
be updated, and the ENABLED flag gets removed. Add a helper function
"skip_record()" that returns true if the record has the DISABLED flag set
but not the ENABLED flag.
Link: https://lkml.kernel.org/r/20221005003809.27d2b97b@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: b39181f7c6 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-10-03
We've added 143 non-merge commits during the last 27 day(s) which contain
a total of 151 files changed, 8321 insertions(+), 1402 deletions(-).
The main changes are:
1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu.
2) Add support for struct-based arguments for trampoline based BPF programs,
from Yonghong Song.
3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa.
4) Batch of improvements to veristat selftest tool in particular to add CSV output,
a comparison mode for CSV outputs and filtering, from Andrii Nakryiko.
5) Add preparatory changes needed for the BPF core for upcoming BPF HID support,
from Benjamin Tissoires.
6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program
types, from Daniel Xu.
7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler.
8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer /
single-kernel-consumer semantics for BPF ring buffer, from David Vernet.
9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF
hashtab's bucket lock, from Hou Tao.
10) Allow creating an iterator that loops through only the resources of one
task/thread instead of all, from Kui-Feng Lee.
11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi.
12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT
entry which is not yet inserted, from Lorenzo Bianconi.
13) Remove invalid recursion check for struct_ops for TCP congestion control BPF
programs, from Martin KaFai Lau.
14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu.
15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa.
16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen.
17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu.
18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta.
19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits)
net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c
Documentation: bpf: Add implementation notes documentations to table of contents
bpf, docs: Delete misformatted table.
selftests/xsk: Fix double free
bpftool: Fix error message of strerror
libbpf: Fix overrun in netlink attribute iteration
selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged"
samples/bpf: Fix typo in xdp_router_ipv4 sample
bpftool: Remove unused struct event_ring_info
bpftool: Remove unused struct btf_attach_point
bpf, docs: Add TOC and fix formatting.
bpf, docs: Add Clang note about BPF_ALU
bpf, docs: Move Clang notes to a separate file
bpf, docs: Linux byteswap note
bpf, docs: Move legacy packet instructions to a separate file
selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION)
bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function
bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
...
====================
Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to enable namespaces or any sort of isolation within
user_events the register lock and pages need to be broken up into
groups. Each event and file now has a group pointer which stores the
actual pages to map, lookup data and synchronization objects.
This only enables a single group that maps to init_user_ns, as IMA
namespace has done. This enables user_events to start the work of
supporting namespaces by walking the namespaces up to the init_user_ns.
Future patches will address other user namespaces and will align to the
approaches the IMA namespace uses.
Link: https://lore.kernel.org/linux-kernel/20220915193221.1728029-15-stefanb@linux.ibm.com/#t
Link: https://lkml.kernel.org/r/20221001001016.2832-2-beaub@linux.microsoft.com
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reported by Clang [-Wunused-but-set-variable]
'commit c193707dde ("tracing: Remove code which merges duplicates")'
This commit removed the code which merges duplicates in detect_dups(),
but forgot to delete the variable 'dups' which used to merge
duplicates in the loop.
Now only 'total_dups' is needed, remove 'dups' for clean code.
Link: https://lkml.kernel.org/r/20220930103236.253985-1-chenzhongjin@huawei.com
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The ring buffer is broken up into sub buffers (currently of page size).
Each sub buffer has a pointer to its "tail" (the last event written to the
sub buffer). When a new event is requested, the tail is locally
incremented to cover the size of the new event. This is done in a way that
there is no need for locking.
If the tail goes past the end of the sub buffer, the process of moving to
the next sub buffer takes place. After setting the current sub buffer to
the next one, the previous one that had the tail go passed the end of the
sub buffer needs to be reset back to the original tail location (before
the new event was requested) and the rest of the sub buffer needs to be
"padded".
The race happens when a reader takes control of the sub buffer. As readers
do a "swap" of sub buffers from the ring buffer to get exclusive access to
the sub buffer, it replaces the "head" sub buffer with an empty sub buffer
that goes back into the writable portion of the ring buffer. This swap can
happen as soon as the writer moves to the next sub buffer and before it
updates the last sub buffer with padding.
Because the sub buffer can be released to the reader while the writer is
still updating the padding, it is possible for the reader to see the event
that goes past the end of the sub buffer. This can cause obvious issues.
To fix this, add a few memory barriers so that the reader definitely sees
the updates to the sub buffer, and also waits until the writer has put
back the "tail" of the sub buffer back to the last event that was written
on it.
To be paranoid, it will only spin for 1 second, otherwise it will
warn and shutdown the ring buffer code. 1 second should be enough as
the writer does have preemption disabled. If the writer doesn't move
within 1 second (with preemption disabled) something is horribly
wrong. No interrupt should last 1 second!
Link: https://lore.kernel.org/all/20220830120854.7545-1-jiazi.li@transsion.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216369
Link: https://lkml.kernel.org/r/20220929104909.0650a36c@gandalf.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: c7b0930857 ("ring-buffer: prevent adding write in discarded area")
Reported-by: Jiazi.Li <jiazi.li@transsion.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
User processes may require many events and when they do the cache
performance of a byte index status check is less ideal than a bit index.
The previous event limit per-page was 4096, the new limit is 32,768.
This change adds a bitwise index to the user_reg struct. Programs check
that the bit at status_bit has a bit set within the status page(s).
Link: https://lkml.kernel.org/r/20220728233309.1896-6-beaub@linux.microsoft.com
Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
User processes could open up enough event references to cause rollovers.
These could cause use after free scenarios, which we do not want.
Switching to refcount APIs prevent this, but will leak memory once
saturated.
Once saturated, user processes can still use the events. This prevents
a bad user process from stopping existing telemetry from being emitted.
Link: https://lkml.kernel.org/r/20220728233309.1896-5-beaub@linux.microsoft.com
Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When tracing is disabled, there's no reason that waiters should stay
waiting, wake them up, otherwise tasks get stuck when they should be
flushing the buffers.
Cc: stable@vger.kernel.org
Fixes: e30f53aad2 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If a process is waiting on the ring buffer for data, there currently isn't
a clean way to force it to wake up. Add an ioctl call that will force any
tasks that are waiting on the trace_pipe_raw file to wake up.
Link: https://lkml.kernel.org/r/20220929095029.117f913f@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: e30f53aad2 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the file that represents the ring buffer is closed, there may be
waiters waiting on more input from the ring buffer. Call
ring_buffer_wake_waiters() to wake up any waiters when the file is
closed.
Link: https://lkml.kernel.org/r/20220927231825.182416969@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: e30f53aad2 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
On closing of a file that represents a ring buffer or flushing the file,
there may be waiters on the ring buffer that needs to be woken up and exit
the ring_buffer_wait() function.
Add ring_buffer_wake_waiters() to wake up the waiters on the ring buffer
and allow them to exit the wait loop.
Link: https://lkml.kernel.org/r/20220928133938.28dc2c27@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 15693458c4 ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The wake up waiters only checks the "wakeup_full" variable and not the
"full_waiters_pending". The full_waiters_pending is set when a waiter is
added to the wait queue. The wakeup_full is only set when an event is
triggered, and it clears the full_waiters_pending to avoid multiple calls
to irq_work_queue().
The irq_work callback really needs to check both wakeup_full as well as
full_waiters_pending such that this code can be used to wake up waiters
when a file is closed that represents the ring buffer and the waiters need
to be woken up.
Link: https://lkml.kernel.org/r/20220927231824.209460321@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 15693458c4 ("tracing/ring-buffer: Move poll wake ups into ring buffer code")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The logic to know when the shortest waiters on the ring buffer should be
woken up or not has uses a less than instead of a greater than compare,
which causes the shortest_full to actually be the longest.
Link: https://lkml.kernel.org/r/20220927231823.718039222@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 2c2b0a78b3 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Recent commit [1] changed branch stack data indication from
br_stack pointer to sample_flags in perf_sample_data struct.
We need to check sample_flags for PERF_SAMPLE_BRANCH_STACK
bit for valid branch stack data.
[1] a9a931e266 ("perf: Use sample_flags for branch stack")
Fixes: a9a931e266 ("perf: Use sample_flags for branch stack")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20220927203259.590950-1-jolsa@kernel.org
If a page is partially read, and then the splice system call is run
against the ring buffer, it will always fail to read, no matter how much
is in the ring buffer. That's because the code path for a partial read of
the page does will fail if the "full" flag is set.
The splice system call wants full pages, so if the read of the ring buffer
is not yet full, it should return zero, and the splice will block. But if
a previous read was done, where the beginning has been consumed, it should
still be given to the splice caller if the rest of the page has been
written to.
This caused the splice command to never consume data in this scenario, and
let the ring buffer just fill up and lose events.
Link: https://lkml.kernel.org/r/20220927144317.46be6b80@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 8789a9e7df ("ring-buffer: read page interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When executing following commands like what document said, but the log
"#### all functions enabled ####" was not shown as expect:
1. Set a 'mod' filter:
$ echo 'write*:mod:ext3' > /sys/kernel/tracing/set_ftrace_filter
2. Invert above filter:
$ echo '!write*:mod:ext3' >> /sys/kernel/tracing/set_ftrace_filter
3. Read the file:
$ cat /sys/kernel/tracing/set_ftrace_filter
By some debugging, I found that flag FTRACE_HASH_FL_MOD was not unset
after inversion like above step 2 and then result of ftrace_hash_empty()
is incorrect.
Link: https://lkml.kernel.org/r/20220926152008.2239274-1-zhengyejian1@huawei.com
Cc: <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 8c08f0d5c6 ("ftrace: Have cached module filters be an active filter")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The event dir will alloc failed when event name no set, using the
command:
"echo "e:esys/ syscalls/sys_enter_openat file=\$filename:string"
>> dynamic_events"
It seems that dir name="syscalls/sys_enter_openat" is not allowed
in debugfs. So just use the "sys_enter_openat" as the event name.
Link: https://lkml.kernel.org/r/1664028814-45923-1-git-send-email-chentao.kernel@linux.alibaba.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Linyu Yuan <quic_linyyuan@quicinc.com>
Cc: Tao Chen <chentao.kernel@linux.alibaba.com
Cc: stable@vger.kernel.org
Fixes: 95c104c378 ("tracing: Auto generate event name when creating a group of events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Tao Chen <chentao.kernel@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
It was found that some tracing functions in kernel/trace/trace.c acquire
an arch_spinlock_t with preemption and irqs enabled. An example is the
tracing_saved_cmdlines_size_read() function which intermittently causes
a "BUG: using smp_processor_id() in preemptible" warning when the LTP
read_all_proc test is run.
That can be problematic in case preemption happens after acquiring the
lock. Add the necessary preemption or interrupt disabling code in the
appropriate places before acquiring an arch_spinlock_t.
The convention here is to disable preemption for trace_cmdline_lock and
interupt for max_lock.
Link: https://lkml.kernel.org/r/20220922145622.1744826-1-longman@redhat.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org
Fixes: a35873a099 ("tracing: Add conditional snapshot")
Fixes: 939c7a4f04 ("tracing: Introduce saved_cmdlines_size file")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Changing return value of kprobe's version of bpf_get_func_ip
to return zero if the attach address is not on the function's
entry point.
For kprobes attached in the middle of the function we can't easily
get to the function address especially now with the CONFIG_X86_KERNEL_IBT
support.
If user cares about current IP for kprobes attached within the
function body, they can get it with PT_REGS_IP(ctx).
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220926153340.1621984-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Martynas reported bpf_get_func_ip returning +4 address when
CONFIG_X86_KERNEL_IBT option is enabled.
When CONFIG_X86_KERNEL_IBT is enabled we'll have endbr instruction
at the function entry, which screws return value of bpf_get_func_ip()
helper that should return the function address.
There's short term workaround for kprobe_multi bpf program made by
Alexei [1], but we need this fixup also for bpf_get_attach_cookie,
that returns cookie based on the entry_ip value.
Moving the fixup in the fprobe handler, so both bpf_get_func_ip
and bpf_get_attach_cookie get expected function address when
CONFIG_X86_KERNEL_IBT option is enabled.
Also renaming kprobe_multi_link_handler entry_ip argument to fentry_ip
so it's clearer this is an ftrace __fentry__ ip.
[1] commit 7f0059b58f ("selftests/bpf: Fix kprobe_multi test.")
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Martynas Pumputis <m@lambda.lt>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220926153340.1621984-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Keeping the resolved 'addr' in kallsyms_callback, instead of taking
ftrace_location value, because we depend on symbol address in the
cookie related code.
With CONFIG_X86_KERNEL_IBT option the ftrace_location value differs
from symbol address, which screwes the symbol address cookies matching.
There are 2 users of this function:
- bpf_kprobe_multi_link_attach
for which this fix is for
- get_ftrace_locations
which is used by register_fprobe_syms
this function needs to get symbols resolved to addresses,
but does not need 'ftrace location addresses' at this point
there's another ftrace location translation in the path done
by ftrace_set_filter_ips call:
register_fprobe_syms
addrs = get_ftrace_locations
register_fprobe_ips(addrs)
...
ftrace_set_filter_ips
...
__ftrace_match_addr
ip = ftrace_location(ip);
...
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220926153340.1621984-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is a recursive lock on the cpu_hotplug_lock.
In kernel/trace/trace_osnoise.c:<start/stop>_per_cpu_kthreads:
- start_per_cpu_kthreads calls cpus_read_lock() and if
start_kthreads returns a error it will call stop_per_cpu_kthreads.
- stop_per_cpu_kthreads then calls cpus_read_lock() again causing
deadlock.
Fix this by calling cpus_read_unlock() before calling
stop_per_cpu_kthreads. This behavior can also be seen in commit
f46b16520a ("trace/hwlat: Implement the per-cpu mode").
This error was noticed during the LTP ftrace-stress-test:
WARNING: possible recursive locking detected
--------------------------------------------
sh/275006 is trying to acquire lock:
ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: stop_per_cpu_kthreads
but task is already holding lock:
ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: start_per_cpu_kthreads
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(cpu_hotplug_lock);
lock(cpu_hotplug_lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by sh/275006:
#0: ffff8881023f0470 (sb_writers#24){.+.+}-{0:0}, at: ksys_write
#1: ffffffffb084f430 (trace_types_lock){+.+.}-{3:3}, at: rb_simple_write
#2: ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: start_per_cpu_kthreads
Link: https://lkml.kernel.org/r/20220919144932.3064014-1-npache@redhat.com
Fixes: c8895e271f ("trace/osnoise: Support hotplug operations")
Signed-off-by: Nico Pache <npache@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For now, this selftest module can only work in x86 because of the
kprobe cmd was fixed use of x86 registers.
This patch adapted to register names under arm and riscv, So that
this module can be worked on those platform.
Link: https://lkml.kernel.org/r/20220919125629.238242-3-zouyipeng@huawei.com
Cc: <linux-riscv@lists.infradead.org>
Cc: <mingo@redhat.com>
Cc: <paul.walmsley@sifive.com>
Cc: <palmer@dabbelt.com>
Cc: <aou@eecs.berkeley.edu>
Cc: <zanussi@kernel.org>
Cc: <liaochang1@huawei.com>
Cc: <chris.zjh@huawei.com>
Fixes: 64836248dd ("tracing: Add kprobe event command generation test module")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The structure filter_pred and the typedef of the function used are only
referenced by trace_events_filter.c. There's no reason to have it in an
external header file. Move them into the only file they are used in.
Link: https://lkml.kernel.org/r/20220906225529.598047132@goodmis.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In order to testing filtering and histograms via the trace event
benchmark, record the delta time of the last event as a numeric value
(currently, it just saves it within the string) so that filters and
histograms can use it.
Link: https://lkml.kernel.org/r/20220906225529.213677569@goodmis.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The sparse tool complains as follows:
kernel/trace/rv/monitors/wwnr/wwnr.c:18:19:
warning: symbol 'rv_wwnr' was not declared. Should it be static?
The `rv_wwnr` symbol is not dereferenced by other extern files,
so add static qualifier for it.
So does wip module.
Link: https://lkml.kernel.org/r/20220824034357.2014202-2-zengheng4@huawei.com
Cc: <mingo@redhat.com>
Fixes: ccc319dcb4 ("rv/monitor: Add the wwnr monitor")
Fixes: 8812d21219 ("rv/monitor: Add the wip monitor skeleton created by dot2k")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add the filter option to the event probe. This is useful if user wants
to derive a new event based on the condition of the original event.
E.g.
echo 'e:egroup/stat_runtime_4core sched/sched_stat_runtime \
runtime=$runtime:u32 if cpu < 4' >> ../dynamic_events
Then it can filter the events only on first 4 cores.
Note that the fields used for 'if' must be the fields in the original
events, not eprobe events.
Link: https://lkml.kernel.org/r/165932114513.2850673.2592206685744598080.stgit@devnote2
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
We got report from sysbot [1] about warnings that were caused by
bpf program attached to contention_begin raw tracepoint triggering
the same tracepoint by using bpf_trace_printk helper that takes
trace_printk_lock lock.
Call Trace:
<TASK>
? trace_event_raw_event_bpf_trace_printk+0x5f/0x90
bpf_trace_printk+0x2b/0xe0
bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
bpf_trace_run2+0x26/0x90
native_queued_spin_lock_slowpath+0x1c6/0x2b0
_raw_spin_lock_irqsave+0x44/0x50
bpf_trace_printk+0x3f/0xe0
bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
bpf_trace_run2+0x26/0x90
native_queued_spin_lock_slowpath+0x1c6/0x2b0
_raw_spin_lock_irqsave+0x44/0x50
bpf_trace_printk+0x3f/0xe0
bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
bpf_trace_run2+0x26/0x90
native_queued_spin_lock_slowpath+0x1c6/0x2b0
_raw_spin_lock_irqsave+0x44/0x50
bpf_trace_printk+0x3f/0xe0
bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
bpf_trace_run2+0x26/0x90
native_queued_spin_lock_slowpath+0x1c6/0x2b0
_raw_spin_lock_irqsave+0x44/0x50
__unfreeze_partials+0x5b/0x160
...
The can be reproduced by attaching bpf program as raw tracepoint on
contention_begin tracepoint. The bpf prog calls bpf_trace_printk
helper. Then by running perf bench the spin lock code is forced to
take slow path and call contention_begin tracepoint.
Fixing this by skipping execution of the bpf program if it's
already running, Using bpf prog 'active' field, which is being
currently used by trampoline programs for the same reason.
Moving bpf_prog_inc_misses_counter to syscall.c because
trampoline.c is compiled in just for CONFIG_BPF_JIT option.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Reported-by: syzbot+2251879aa068ad9c960d@syzkaller.appspotmail.com
[1] https://lore.kernel.org/bpf/YxhFe3EwqchC%2FfYf@krava/T/#t
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220916071914.7156-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the bpf_verify_pkcs7_signature() kfunc, to give eBPF security modules
the ability to check the validity of a signature against supplied data, by
using user-provided or system-provided keys as trust anchor.
The new kfunc makes it possible to enforce mandatory policies, as eBPF
programs might be allowed to make security decisions only based on data
sources the system administrator approves.
The caller should provide the data to be verified and the signature as eBPF
dynamic pointers (to minimize the number of parameters) and a bpf_key
structure containing a reference to the keyring with keys trusted for
signature verification, obtained from bpf_lookup_user_key() or
bpf_lookup_system_key().
For bpf_key structures obtained from the former lookup function,
bpf_verify_pkcs7_signature() completes the permission check deferred by
that function by calling key_validate(). key_task_permission() is already
called by the PKCS#7 code.
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20220920075951.929132-9-roberto.sassu@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the bpf_lookup_user_key(), bpf_lookup_system_key() and bpf_key_put()
kfuncs, to respectively search a key with a given key handle serial number
and flags, obtain a key from a pre-determined ID defined in
include/linux/verification.h, and cleanup.
Introduce system_keyring_id_check() to validate the keyring ID parameter of
bpf_lookup_system_key().
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20220920075951.929132-8-roberto.sassu@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
x86 will shortly start using -fpatchable-function-entry for purposes
other than ftrace, make sure the __patchable_function_entry section
isn't merged in the mcount_loc section.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220903131154.420467-2-jolsa@kernel.org
Since the check_user_trigger() is called outside of RCU
read lock, this list_for_each_entry_rcu() caused a suspicious
RCU usage warning.
# echo hist:keys=pid > events/sched/sched_stat_runtime/trigger
# cat events/sched/sched_stat_runtime/trigger
[ 43.167032]
[ 43.167418] =============================
[ 43.167992] WARNING: suspicious RCU usage
[ 43.168567] 5.19.0-rc5-00029-g19ebe4651abf #59 Not tainted
[ 43.169283] -----------------------------
[ 43.169863] kernel/trace/trace_events_trigger.c:145 RCU-list traversed in non-reader section!!
...
However, this file->triggers list is safe when it is accessed
under event_mutex is held.
To fix this warning, adds a lockdep_is_held check to the
list_for_each_entry_rcu().
Link: https://lkml.kernel.org/r/166226474977.223837.1992182913048377113.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, The arguments passing to lockdep_hardirqs_{on,off} was fixed
in CALLER_ADDR0.
The function trace_hardirqs_on_caller should have been intended to use
caller_addr to represent the address that caller wants to be traced.
For example, lockdep log in riscv showing the last {enabled,disabled} at
__trace_hardirqs_{on,off} all the time(if called by):
[ 57.853175] hardirqs last enabled at (2519): __trace_hardirqs_on+0xc/0x14
[ 57.853848] hardirqs last disabled at (2520): __trace_hardirqs_off+0xc/0x14
After use trace_hardirqs_xx_caller, we can get more effective information:
[ 53.781428] hardirqs last enabled at (2595): restore_all+0xe/0x66
[ 53.782185] hardirqs last disabled at (2596): ret_from_exception+0xa/0x10
Link: https://lkml.kernel.org/r/20220901104515.135162-2-zouyipeng@huawei.com
Cc: stable@vger.kernel.org
Fixes: c3bc8fd637 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The change that made IPMODIFY and DIRECT ops work together needed access
to the ops_references_ip() function, which it pulled out of the module
only code. But now if both CONFIG_MODULES and
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is not set, we get the below
warning:
‘ops_references_rec’ defined but not used.
Since ops_references_rec() only calls ops_references_ip() replace the
usage of ops_references_rec() with ops_references_ip() and encompass the
function with an #ifdef of DIRECT_CALLS || MODULES being defined.
Link: https://lkml.kernel.org/r/20220801084745.1187987-1-wangjingjin1@huawei.com
Fixes: 53cd885bc5 ("ftrace: Allow IPMODIFY and DIRECT ops on the same function")
Signed-off-by: Wang Jingjin <wangjingjin1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Fix a return value of traceprobe_parse_event_name()
- Fix NULL pointer dereference from failed ftrace enabling
- Fix NULL pointer dereference when asking for registers from eprobes
- Make eprobes consistent with kprobes/uprobes, filters and histograms
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYwKRrhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qosDAP9WySmPxjoMfR0hbjmnepLy2zJtBbIq
ABWR3LDrjvLlYwD9H/wrD+6ctOZtXh5XJc0Vn5z6XEyNtqrVGSse7Lm+sg4=
=qb/R
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fixes for tracing:
- Fix a return value of traceprobe_parse_event_name()
- Fix NULL pointer dereference from failed ftrace enabling
- Fix NULL pointer dereference when asking for registers from eprobes
- Make eprobes consistent with kprobes/uprobes, filters and
histograms"
* tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have filter accept "common_cpu" to be consistent
tracing/probes: Have kprobes and uprobes use $COMM too
tracing/eprobes: Have event probes be consistent with kprobes and uprobes
tracing/eprobes: Fix reading of string fields
tracing/eprobes: Do not hardcode $comm as a string
tracing/eprobes: Do not allow eprobes to use $stack, or % for regs
ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
tracing/perf: Fix double put of trace event when init fails
tracing: React to error return from traceprobe_parse_event_name()
Make filtering consistent with histograms. As "cpu" can be a field of an
event, allow for "common_cpu" to keep it from being confused with the
"cpu" field of the event.
Link: https://lkml.kernel.org/r/20220820134401.513062765@goodmis.org
Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 1e3bac71c5 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Both $comm and $COMM can be used to get current->comm in eprobes and the
filtering and histogram logic. Make kprobes and uprobes consistent in this
regard and allow both $comm and $COMM as well. Currently kprobes and
uprobes only handle $comm, which is inconsistent with the other utilities,
and can be confusing to users.
Link: https://lkml.kernel.org/r/20220820134401.317014913@goodmis.org
Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 533059281e ("tracing: probeevent: Introduce new argument fetching code")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, if a symbol "@" is attempted to be used with an event probe
(eprobes), it will cause a NULL pointer dereference crash.
Both kprobes and uprobes can reference data other than the main registers.
Such as immediate address, symbols and the current task name. Have eprobes
do the same thing.
For "comm", if "comm" is used and the event being attached to does not
have the "comm" field, then make it the "$comm" that kprobes has. This is
consistent to the way histograms and filters work.
Link: https://lkml.kernel.org/r/20220820134401.136924220@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently when an event probe (eprobe) hooks to a string field, it does
not display it as a string, but instead as a number. This makes the field
rather useless. Handle the different kinds of strings, dynamic, static,
relational/dynamic etc.
Now when a string field is used, the ":string" type can be used to display
it:
echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events
Link: https://lkml.kernel.org/r/20220820134400.959640191@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The variable $comm is hard coded as a string, which is true for both
kprobes and uprobes, but for event probes (eprobes) it is a field name. In
most cases the "comm" field would be a string, but there's no guarantee of
that fact.
Do not assume that comm is a string. Not to mention, it currently forces
comm fields to fault, as string processing for event probes is currently
broken.
Link: https://lkml.kernel.org/r/20220820134400.756152112@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If in perf_trace_event_init(), the perf_trace_event_open() fails, then it
will call perf_trace_event_unreg() which will not only unregister the perf
trace event, but will also call the put() function of the tp_event.
The problem here is that the trace_event_try_get_ref() is called by the
caller of perf_trace_event_init() and if perf_trace_event_init() returns a
failure, it will then call trace_event_put(). But since the
perf_trace_event_unreg() already called the trace_event_put() function, it
triggers a WARN_ON().
WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20
If perf_trace_event_reg() does not call the trace_event_try_get_ref() then
the perf_trace_event_unreg() should not be calling trace_event_put(). This
breaks symmetry and causes bugs like these.
Pull out the trace_event_put() from perf_trace_event_unreg() and call it
in the locations that perf_trace_event_unreg() is called. This not only
fixes this bug, but also brings back the proper symmetry of the reg/unreg
vs get/put logic.
Link: https://lore.kernel.org/all/cover.1660347763.git.kjlx@templeofstupid.com/
Link: https://lkml.kernel.org/r/20220816192817.43d5e17f@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 1d18538e6a ("tracing: Have dynamic events have a ref counter")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Reviewed-by: Krister Johansen <kjlx@templeofstupid.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function traceprobe_parse_event_name() may set the first two function
arguments to a non-null value and still return -EINVAL to indicate an
unsuccessful completion of the function. Hence, it is not sufficient to
just check the result of the two function arguments for being not null,
but the return value also needs to be checked.
Commit 95c104c378 ("tracing: Auto generate event name when creating a
group of events") changed the error-return-value checking of the second
traceprobe_parse_event_name() invocation in __trace_eprobe_create() and
removed checking the return value to jump to the error handling case.
Reinstate using the return value in the error-return-value checking.
Link: https://lkml.kernel.org/r/20220811071734.20700-1-lukas.bulwahn@gmail.com
Fixes: 95c104c378 ("tracing: Auto generate event name when creating a group of events")
Acked-by: Linyu Yuan <quic_linyyuan@quicinc.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Runtime verification infrastructure
This is the biggest change for this pull request. It introduces the
runtime verification that is necessary for running Linux on safety
critical systems. It allows for deterministic automata models to be
inserted into the kernel that will attach to tracepoints, where the
information on these tracepoints will move the model from state to state.
If a state is encountered that does not belong to the model, it will then
activate a given reactor, that could just inform the user or even panic
the kernel (for which safety critical systems will detect and can recover
from).
- Two monitor models are also added: Wakeup In Preemptive (WIP - not to be
confused with "work in progress"), and Wakeup While Not Running (WWNR).
- Added __vstring() helper to the TRACE_EVENT() macro to replace several
vsnprintf() usages that were all doing it wrong.
- eprobes now can have their event autogenerated when the event name is left
off.
- The rest is various cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYu0yzRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qj4HAP4tQtV55rjj4DQ5XIXmtI3/64PmyRSJ
+y4DEXi1UvEUCQD/QAuQfWoT/7gh35ltkfeS4t3ockzy14rrkP5drZigiQA=
=kEtM
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Runtime verification infrastructure
This is the biggest change here. It introduces the runtime
verification that is necessary for running Linux on safety critical
systems.
It allows for deterministic automata models to be inserted into the
kernel that will attach to tracepoints, where the information on
these tracepoints will move the model from state to state.
If a state is encountered that does not belong to the model, it will
then activate a given reactor, that could just inform the user or
even panic the kernel (for which safety critical systems will detect
and can recover from).
- Two monitor models are also added: Wakeup In Preemptive (WIP - not to
be confused with "work in progress"), and Wakeup While Not Running
(WWNR).
- Added __vstring() helper to the TRACE_EVENT() macro to replace
several vsnprintf() usages that were all doing it wrong.
- eprobes now can have their event autogenerated when the event name is
left off.
- The rest is various cleanups and fixes.
* tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
rv: Unlock on error path in rv_unregister_reactor()
tracing: Use alignof__(struct {type b;}) instead of offsetof()
tracing/eprobe: Show syntax error logs in error_log file
scripts/tracing: Fix typo 'the the' in comment
tracepoints: It is CONFIG_TRACEPOINTS not CONFIG_TRACEPOINT
tracing: Use free_trace_buffer() in allocate_trace_buffers()
tracing: Use a struct alignof to determine trace event field alignment
rv/reactor: Add the panic reactor
rv/reactor: Add the printk reactor
rv/monitor: Add the wwnr monitor
rv/monitor: Add the wip monitor
rv/monitor: Add the wip monitor skeleton created by dot2k
Documentation/rv: Add deterministic automata instrumentation documentation
Documentation/rv: Add deterministic automata monitor synthesis documentation
tools/rv: Add dot2k
Documentation/rv: Add deterministic automaton documentation
tools/rv: Add dot2c
Documentation/rv: Add a basic documentation
rv/include: Add instrumentation helper functions
rv/include: Add deterministic automata monitor definition via C macros
...
Core
----
- Refactor the forward memory allocation to better cope with memory
pressure with many open sockets, moving from a per socket cache to
a per-CPU one
- Replace rwlocks with RCU for better fairness in ping, raw sockets
and IP multicast router.
- Network-side support for IO uring zero-copy send.
- A few skb drop reason improvements, including codegen the source file
with string mapping instead of using macro magic.
- Rename reference tracking helpers to a more consistent
netdev_* schema.
- Adapt u64_stats_t type to address load/store tearing issues.
- Refine debug helper usage to reduce the log noise caused by bots.
BPF
---
- Improve socket map performance, avoiding skb cloning on read
operation.
- Add support for 64 bits enum, to match types exposed by kernel.
- Introduce support for sleepable uprobes program.
- Introduce support for enum textual representation in libbpf.
- New helpers to implement synproxy with eBPF/XDP.
- Improve loop performances, inlining indirect calls when
possible.
- Removed all the deprecated libbpf APIs.
- Implement new eBPF-based LSM flavor.
- Add type match support, which allow accurate queries to the
eBPF used types.
- A few TCP congetsion control framework usability improvements.
- Add new infrastructure to manipulate CT entries via eBPF programs.
- Allow for livepatch (KLP) and BPF trampolines to attach to the same
kernel function.
Protocols
---------
- Introduce per network namespace lookup tables for unix sockets,
increasing scalability and reducing contention.
- Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support.
- Add support to forciby close TIME_WAIT TCP sockets via user-space
tools.
- Significant performance improvement for the TLS 1.3 receive path,
both for zero-copy and not-zero-copy.
- Support for changing the initial MTPCP subflow priority/backup
status
- Introduce virtually contingus buffers for sockets over RDMA,
to cope better with memory pressure.
- Extend CAN ethtool support with timestamping capabilities
- Refactor CAN build infrastructure to allow building only the needed
features.
Driver API
----------
- Remove devlink mutex to allow parallel commands on multiple links.
- Add support for pause stats in distributed switch.
- Implement devlink helpers to query and flash line cards.
- New helper for phy mode to register conversion.
New hardware / drivers
----------------------
- Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro.
- Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch.
- Ethernet DSA driver for the Microchip LAN937x switch.
- Ethernet PHY driver for the Aquantia AQR113C EPHY.
- CAN driver for the OBD-II ELM327 interface.
- CAN driver for RZ/N1 SJA1000 CAN controller.
- Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device.
Drivers
-------
- Intel Ethernet NICs:
- i40e: add support for vlan pruning
- i40e: add support for XDP framented packets
- ice: improved vlan offload support
- ice: add support for PPPoE offload
- Mellanox Ethernet (mlx5)
- refactor packet steering offload for performance and scalability
- extend support for TC offload
- refactor devlink code to clean-up the locking schema
- support stacked vlans for bridge offloads
- use TLS objects pool to improve connection rate
- Netronome Ethernet NICs (nfp):
- extend support for IPv6 fields mangling offload
- add support for vepa mode in HW bridge
- better support for virtio data path acceleration (VDPA)
- enable TSO by default
- Microsoft vNIC driver (mana)
- add support for XDP redirect
- Others Ethernet drivers:
- bonding: add per-port priority support
- microchip lan743x: extend phy support
- Fungible funeth: support UDP segmentation offload and XDP xmit
- Solarflare EF100: add support for virtual function representors
- MediaTek SoC: add XDP support
- Mellanox Ethernet/IB switch (mlxsw):
- dropped support for unreleased H/W (XM router).
- improved stats accuracy
- unified bridge model coversion improving scalability
(parts 1-6)
- support for PTP in Spectrum-2 asics
- Broadcom PHYs
- add PTP support for BCM54210E
- add support for the BCM53128 internal PHY
- Marvell Ethernet switches (prestera):
- implement support for multicast forwarding offload
- Embedded Ethernet switches:
- refactor OcteonTx MAC filter for better scalability
- improve TC H/W offload for the Felix driver
- refactor the Microchip ksz8 and ksz9477 drivers to share
the probe code (parts 1, 2), add support for phylink
mac configuration
- Other WiFi:
- Microchip wilc1000: diable WEP support and enable WPA3
- Atheros ath10k: encapsulation offload support
Old code removal:
- Neterion vxge ethernet driver: this is untouched since more than
10 years.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmLqN+oSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkB9kQAI9VqW0c3SfiTJnkVBEIovZ6Tnh5stD2
UYFkh1BdchLsYxi7W4XMpVPSzRztiTP87mIx5c/KvIzj+QNeWL1XWRJSPdI9HhTD
pTAA/tM2OG7bqrbyQiKDNfpQdNl7+kk1RwnYd+f9RFl1QVuIJaYhmjVwrsN5xF/+
jUsotpROarM2dGFWiFwJbKhP2zMDT+6qEEahM8pEPggKhv8wRLYjany2cZVEe4e0
WGUpbINAS8gEKm0Ob922WaDfDrcK/N1Z0jNz/kMaENkK18Vvc7F6bCO0DzAawKX9
QZMMwm6mHp3EThflJAMAzCGIYiIcwLhykgdyj8rrjPhFrWbMD2Sdsbo21HOXU/8j
u4aAhVl+d+h7emmbgBoJ8sycVJ7BQlXz7lX20sTgADv9xI4/dPhQ17CMRuwX6fXX
JSrn6P6e1LTV5CEg6vrlSPnKPY6uhFn/cPw47FxCjRwJ9phVnp+8uZWQmf9Pz3yf
Ok/tcj+juFbsmuOshHy2cbRkuNZNS0oRWlSTBo5795ZwOLSakMonR3L+ev2aOvzz
DVrFp2Y/iIVwMSFdCbouYdYnhArPRhOAtCmZc2afY8aBN7aaMgrdTy3+mzUoHy3I
FG3K+VuKpfi0vY4zn6ZoLZDIpyXIoJJ93RcSGltD32t3Dp1RaQMVEI4s45k05PVm
1nYpXKHA8qML
=hxEG
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Paolo Abeni:
"Core:
- Refactor the forward memory allocation to better cope with memory
pressure with many open sockets, moving from a per socket cache to
a per-CPU one
- Replace rwlocks with RCU for better fairness in ping, raw sockets
and IP multicast router.
- Network-side support for IO uring zero-copy send.
- A few skb drop reason improvements, including codegen the source
file with string mapping instead of using macro magic.
- Rename reference tracking helpers to a more consistent netdev_*
schema.
- Adapt u64_stats_t type to address load/store tearing issues.
- Refine debug helper usage to reduce the log noise caused by bots.
BPF:
- Improve socket map performance, avoiding skb cloning on read
operation.
- Add support for 64 bits enum, to match types exposed by kernel.
- Introduce support for sleepable uprobes program.
- Introduce support for enum textual representation in libbpf.
- New helpers to implement synproxy with eBPF/XDP.
- Improve loop performances, inlining indirect calls when possible.
- Removed all the deprecated libbpf APIs.
- Implement new eBPF-based LSM flavor.
- Add type match support, which allow accurate queries to the eBPF
used types.
- A few TCP congetsion control framework usability improvements.
- Add new infrastructure to manipulate CT entries via eBPF programs.
- Allow for livepatch (KLP) and BPF trampolines to attach to the same
kernel function.
Protocols:
- Introduce per network namespace lookup tables for unix sockets,
increasing scalability and reducing contention.
- Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support.
- Add support to forciby close TIME_WAIT TCP sockets via user-space
tools.
- Significant performance improvement for the TLS 1.3 receive path,
both for zero-copy and not-zero-copy.
- Support for changing the initial MTPCP subflow priority/backup
status
- Introduce virtually contingus buffers for sockets over RDMA, to
cope better with memory pressure.
- Extend CAN ethtool support with timestamping capabilities
- Refactor CAN build infrastructure to allow building only the needed
features.
Driver API:
- Remove devlink mutex to allow parallel commands on multiple links.
- Add support for pause stats in distributed switch.
- Implement devlink helpers to query and flash line cards.
- New helper for phy mode to register conversion.
New hardware / drivers:
- Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro.
- Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch.
- Ethernet DSA driver for the Microchip LAN937x switch.
- Ethernet PHY driver for the Aquantia AQR113C EPHY.
- CAN driver for the OBD-II ELM327 interface.
- CAN driver for RZ/N1 SJA1000 CAN controller.
- Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device.
Drivers:
- Intel Ethernet NICs:
- i40e: add support for vlan pruning
- i40e: add support for XDP framented packets
- ice: improved vlan offload support
- ice: add support for PPPoE offload
- Mellanox Ethernet (mlx5)
- refactor packet steering offload for performance and scalability
- extend support for TC offload
- refactor devlink code to clean-up the locking schema
- support stacked vlans for bridge offloads
- use TLS objects pool to improve connection rate
- Netronome Ethernet NICs (nfp):
- extend support for IPv6 fields mangling offload
- add support for vepa mode in HW bridge
- better support for virtio data path acceleration (VDPA)
- enable TSO by default
- Microsoft vNIC driver (mana)
- add support for XDP redirect
- Others Ethernet drivers:
- bonding: add per-port priority support
- microchip lan743x: extend phy support
- Fungible funeth: support UDP segmentation offload and XDP xmit
- Solarflare EF100: add support for virtual function representors
- MediaTek SoC: add XDP support
- Mellanox Ethernet/IB switch (mlxsw):
- dropped support for unreleased H/W (XM router).
- improved stats accuracy
- unified bridge model coversion improving scalability (parts 1-6)
- support for PTP in Spectrum-2 asics
- Broadcom PHYs
- add PTP support for BCM54210E
- add support for the BCM53128 internal PHY
- Marvell Ethernet switches (prestera):
- implement support for multicast forwarding offload
- Embedded Ethernet switches:
- refactor OcteonTx MAC filter for better scalability
- improve TC H/W offload for the Felix driver
- refactor the Microchip ksz8 and ksz9477 drivers to share the
probe code (parts 1, 2), add support for phylink mac
configuration
- Other WiFi:
- Microchip wilc1000: diable WEP support and enable WPA3
- Atheros ath10k: encapsulation offload support
Old code removal:
- Neterion vxge ethernet driver: this is untouched since more than 10 years"
* tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1890 commits)
doc: sfp-phylink: Fix a broken reference
wireguard: selftests: support UML
wireguard: allowedips: don't corrupt stack when detecting overflow
wireguard: selftests: update config fragments
wireguard: ratelimiter: use hrtimer in selftest
net/mlx5e: xsk: Discard unaligned XSK frames on striding RQ
net: usb: ax88179_178a: Bind only to vendor-specific interface
selftests: net: fix IOAM test skip return code
net: usb: make USB_RTL8153_ECM non user configurable
net: marvell: prestera: remove reduntant code
octeontx2-pf: Reduce minimum mtu size to 60
net: devlink: Fix missing mutex_unlock() call
net/tls: Remove redundant workqueue flush before destroy
net: txgbe: Fix an error handling path in txgbe_probe()
net: dsa: Fix spelling mistakes and cleanup code
Documentation: devlink: add add devlink-selftests to the table of contents
dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock
net: ionic: fix error check for vlan flags in ionic_set_nic_features()
net: ice: fix error NETIF_F_HW_VLAN_CTAG_FILTER check in ice_vsi_sync_fltr()
nfp: flower: add support for tunnel offload without key ID
...
This pull request contains the following branches:
doc.2022.06.21a: Documentation updates.
fixes.2022.07.19a: Miscellaneous fixes.
nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new
RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to
be offloaded at boot time, regardless of kernel boot parameters.
This is useful to battery-powered systems such as ChromeOS
and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel
boot parameter prevents offloaded callbacks from interfering
with real-time workloads and with energy-efficiency mechanisms.
poll.2022.07.21a: Polled grace-period updates, perhaps most notably
making these APIs account for both normal and expedited grace
periods.
rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing
the CPU overhead of RCU tasks trace grace periods by more than
a factor of two on a system with 15,000 tasks. The reduction
is expected to increase with the number of tasks, so it seems
reasonable to hypothesize that a system with 150,000 tasks might
see a 20-fold reduction in CPU overhead.
torture.2022.06.21a: Torture-test updates.
ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into
context tracking, thus reducing the overhead of transitioning to
kernel mode from either idle or nohz_full userspace execution
for kernels that track context independently of RCU. This is
expected to be helpful primarily for kernels built with
CONFIG_NO_HZ_FULL=y.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m
g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq
k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt
0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL
kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5
7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0
Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc
JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL
PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc
egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y
ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r
vX60+QNxvUBLwA==
=vUNm
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates
- Miscellaneous fixes
- Callback-offload updates, perhaps most notably a new
RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be
offloaded at boot time, regardless of kernel boot parameters.
This is useful to battery-powered systems such as ChromeOS and
Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot
parameter prevents offloaded callbacks from interfering with
real-time workloads and with energy-efficiency mechanisms
- Polled grace-period updates, perhaps most notably making these APIs
account for both normal and expedited grace periods
- Tasks RCU updates, perhaps most notably reducing the CPU overhead of
RCU tasks trace grace periods by more than a factor of two on a
system with 15,000 tasks.
The reduction is expected to increase with the number of tasks, so it
seems reasonable to hypothesize that a system with 150,000 tasks
might see a 20-fold reduction in CPU overhead
- Torture-test updates
- Updates that merge RCU's dyntick-idle tracking into context tracking,
thus reducing the overhead of transitioning to kernel mode from
either idle or nohz_full userspace execution for kernels that track
context independently of RCU.
This is expected to be helpful primarily for kernels built with
CONFIG_NO_HZ_FULL=y
* tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits)
rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings
rcu: Diagnose extended sync_rcu_do_polled_gp() loops
rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings
rcutorture: Test polled expedited grace-period primitives
rcu: Add polled expedited grace-period primitives
rcutorture: Verify that polled GP API sees synchronous grace periods
rcu: Make Tiny RCU grace periods visible to polled APIs
rcu: Make polled grace-period API account for expedited grace periods
rcu: Switch polled grace-period APIs to ->gp_seq_polled
rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty
rcu/nocb: Add option to opt rcuo kthreads out of RT priority
rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
rcu/nocb: Add an option to offload all CPUs on boot
rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call
rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order
rcu/nocb: Add/del rdp to iterate from rcuog itself
rcu/tree: Add comment to describe GP-done condition in fqs loop
rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
rcu/kvfree: Remove useless monitor_todo flag
rcu: Cleanup RCU urgency state for offline CPU
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
4N1HEozvIw==
=celk
-----END PGP SIGNATURE-----
Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Improve the type checking of request flags (Bart)
- Ensure queue mapping for a single queues always picks the right queue
(Bart)
- Sanitize the io priority handling (Jan)
- rq-qos race fix (Jinke)
- Reserved tags handling improvements (John)
- Separate memory alignment from file/disk offset aligment for O_DIRECT
(Keith)
- Add new ublk driver, userspace block driver using io_uring for
communication with the userspace backend (Ming)
- Use try_cmpxchg() to cleanup the code in various spots (Uros)
- Finally remove bdevname() (Christoph)
- Clean up the zoned device handling (Christoph)
- Clean up independent access range support (Christoph)
- Clean up and improve block sysfs handling (Christoph)
- Clean up and improve teardown of block devices.
This turns the usual two step process into something that is simpler
to implement and handle in block drivers (Christoph)
- Clean up chunk size handling (Christoph)
- Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
Ming, Sebastian, Yang, Ying)
* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
ublk_drv: fix double shift bug
ublk_drv: make sure that correct flags(features) returned to userspace
ublk_drv: fix error handling of ublk_add_dev
ublk_drv: fix lockdep warning
block: remove __blk_get_queue
block: call blk_mq_exit_queue from disk_release for never added disks
blk-mq: fix error handling in __blk_mq_alloc_disk
ublk: defer disk allocation
ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
ublk: cleanup ublk_ctrl_uring_cmd
ublk: simplify ublk_ch_open and ublk_ch_release
ublk: remove the empty open and release block device operations
ublk: remove UBLK_IO_F_PREFLUSH
ublk: add a MAINTAINERS entry
block: don't allow the same type rq_qos add more than once
mmc: fix disk/queue leak in case of adding disk failure
ublk_drv: fix an IS_ERR() vs NULL check
ublk: remove UBLK_IO_F_INTEGRITY
ublk_drv: remove unneeded semicolon
...
Show the syntax errors for event probes in error_log file as same as
other dynamic events, so that user can understand what is the problem.
Link: https://lkml.kernel.org/r/165932113556.2850673.3483079297896607612.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In allocate_trace_buffers(), if allocating tr->max_buffer
fails, we can directly call free_trace_buffer to free
tr->array_buffer.
Link: https://lkml.kernel.org/r/65f0702d-07f6-08de-2a07-4c50af56a67b@huawei.com
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Sample reactor that panics the system when an exception is found. This
is useful both to capture a vmcore, or to fail-safe a critical system.
Link: https://lkml.kernel.org/r/729aae3aba95f35738b8f8180e626d747d1d9da2.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>