The name of the func() callback on event_command is too generic and is
easily confused with other callbacks with that name, so change it to
something that reflects its actual purpose.
In this case, the main purpose of the callback is to parse an event
command, so call it parse() instead.
Link: https://lkml.kernel.org/r/7784e321840752ed88aac0b349c0c685fc9247b1.1641823001.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull clocksource watchdog updates from Paul McKenney:
- Avoid accidental unstable marking of clocksources by rejecting
clocksource measurements where the source of the skew is the delay
reading reference clocksource itself. This change avoids many of the
current false positives caused by epic cache-thrashing workloads.
- Reduce the default clocksource_watchdog() retries to 2, thus offsetting
the increased overhead due to #1 above rereading the reference
clocksource.
Link: https://lore.kernel.org/lkml/20220105001723.GA536708@paulmck-ThinkPad-P17-Gen-1
In the function bacct_add_task the code reading task->exit_code was
introduced in commit f3cef7a994 ("[PATCH] csa: basic accounting over
taskstats"), and it is not entirely clear what the taskstats interface
is trying to return as only returning the exit_code of the first task
in a process doesn't make a lot of sense.
As best as I can figure the intent is to return task->exit_code after
a task exits. The field is returned with per task fields, so the
exit_code of the entire process is not wanted. Only the value of the
first task is returned so this is not a useful way to get the per task
ptrace stop code. The ordinary case of returning this value is
returning after a task exits, which also precludes use for getting
a ptrace value.
It is common to for the first task of a process to also be the last
task of a process so this field may have done something reasonable by
accident in testing.
Make ac_exitcode a reliable per task value by always returning it for
every exited task.
Setting ac_exitcode in a sensible mannter makes it possible to continue
to provide this value going forward.
Cc: Balbir Singh <bsingharora@gmail.com>
Fixes: f3cef7a994 ("[PATCH] csa: basic accounting over taskstats")
Link: https://lkml.kernel.org/r/20220103213312.9144-5-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function wait_task_zombie is defined to always returns the process not
thread exit status. Unfortunately when process group exit support
was added to wait_task_zombie the WNOWAIT case was overlooked.
Usually tsk->exit_code and tsk->signal->group_exit_code will be in sync
so fixing this is bug probably has no effect in practice. But fix
it anyway so that people aren't scratching their heads about why
the two code paths are different.
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 2c66151cbc2c ("[PATCH] sys_exit() threading improvements, BK-curr")
Link: https://lkml.kernel.org/r/20220103213312.9144-3-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The comment about coredumps not reaching do_group_exit and the
corresponding BUG_ON are bogus.
What happens and has happened for years is that get_signal calls
do_coredump (which sets SIGNAL_GROUP_EXIT and group_exit_code) and
then do_group_exit passing the signal number. Then do_group_exit
ignores the exit_code it is passed and uses signal->group_exit_code
from the coredump.
The comment and BUG_ON were correct when they were added during the
2.5 development cycle, but became obsolete and incorrect when
get_signal was changed to fall through to do_group_exit after
do_coredump in 2.6.10-rc2.
So remove the stale comment and BUG_ON
Fixes: 63bd6144f191 ("[PATCH] Invalid BUG_ONs in signal.c")
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Link: https://lkml.kernel.org/r/20220103213312.9144-2-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
All profile_handoff_task does is notify the task_free_notifier chain.
The helpers task_handoff_register and task_handoff_unregister are used
to add and delete entries from that chain and are never called.
So remove the dead code and make it much easier to read and reason
about __put_task_struct.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/87fspyw6m0.fsf@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When I say remove I mean remove. All profile_task_exit and
profile_munmap do is call a blocking notifier chain. The helpers
profile_task_register and profile_task_unregister are not called
anywhere in the tree. Which means this is all dead code.
So remove the dead code and make it easier to read do_exit.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20220103213312.9144-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Fix kernel-doc warnings in kernel/signal.c:
kernel/signal.c:1830: warning: Function parameter or member 'force_coredump' not described in 'force_sig_seccomp'
kernel/signal.c:2873: warning: missing initial short description on line:
* signal_delivered -
Also add a closing parenthesis to the comments in signal_delivered().
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Richard Weinberger <richard@nod.at>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Marco Elver <elver@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20211222031027.29694-1-rdunlap@infradead.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This helper is misleading. It tests for an ongoing exec as well as
the process having received a fatal signal.
Sometimes it is appropriate to treat an on-going exec differently than
a process that is shutting down due to a fatal signal. In particular
taking the fast path out of exit_signals instead of retargeting
signals is not appropriate during exec, and not changing the the exit
code in do_group_exit during exec.
Removing the helper makes it more obvious what is going on as both
cases must be coded for explicitly.
While removing the helper fix the two cases where I have observed
using signal_group_exit resulted in the wrong result.
In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are
retargetted during an exec.
In do_group_exit use 0 as the exit code during an exec as de_thread
does not set group_exit_code. As best as I can determine
group_exit_code has been is set to 0 most of the time during
de_thread. During a thread group stop group_exit_code is set to the
stop signal and when the thread group receives SIGCONT group_exit_code
is reset to 0.
Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The only remaining user of group_exit_task is exec. Rename the field
so that it is clear which part of the code uses it.
Update the comment above the definition of group_exec_task to document
how it is currently used.
Link: https://lkml.kernel.org/r/20211213225350.27481-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
After the previous cleanups "signal->core_state" is set whenever
SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested
whenver the code wants to know if a coredump is in progress. The
remaining tests of SIGNAL_GROUP_COREDUMP also test to see if
SIGNAL_GROUP_EXIT is set. Similarly the only place that sets
SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT.
Which makes SIGNAL_GROUP_COREDUMP unecessary and redundant. So stop
setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP, and
remove it's definition.
With the setting of SIGNAL_GROUP_COREDUMP gone, coredump_finish no
longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags
by setting SIGNAL_GROUP_EXIT.
Link: https://lkml.kernel.org/r/20211213225350.27481-5-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Ever since commit 6cd8f0acae ("coredump: ensure that SIGKILL always
kills the dumping thread") it has been possible for a SIGKILL received
during a coredump to set SIGNAL_GROUP_EXIT and trigger a process
shutdown (for a second time).
Update the logic to explicitly allow coredumps so that coredumps can
set SIGNAL_GROUP_EXIT and shutdown like an ordinary process.
Link: https://lkml.kernel.org/r/87zgo6ytyf.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With kernel threads on architectures that still have set_fs/get_fs
running as KERNEL_DS moving force_uaccess_begin does not appear safe.
Calling force_uaccess_begin is a noop on anything people care about.
Update the comment to explain why this code while looking like an
obvious candidate for moving to make_task_dead probably needs to
remain in do_exit until set_fs/get_fs are entirely removed from the
kernel.
Fixes: 05ea0424f0 ("exit: Move oops specific logic from do_exit into make_task_dead")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/YdUxGKRcSiDy8jGg@zeniv-ca.linux.org.uk
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Change the task state to EXIT_DEAD and take an extra rcu_refernce
to guarantee the task will not be reaped and that it will not be
freed.
Link: https://lkml.kernel.org/r/YdUzjrLAlRiNLQp2@zeniv-ca.linux.org.uk
Pointed-out-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 7f80a2fd7d ("exit: Stop poorly open coding do_task_dead in make_task_dead")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The point of using set_child_tid to hold the kthread pointer was that
it already did what is necessary. There are now restrictions on when
set_child_tid can be initialized and when set_child_tid can be used in
schedule_tail. Which indicates that continuing to use set_child_tid
to hold the kthread pointer is a bad idea.
Instead of continuing to use the set_child_tid field of task_struct
generalize the pf_io_worker field of task_struct and use it to hold
the kthread pointer.
Rename pf_io_worker (which is a void * pointer) to worker_private so
it can be used to store kthreads struct kthread pointer. Update the
kthread code to store the kthread pointer in the worker_private field.
Remove the places where set_child_tid had to be dealt with carefully
because kthreads also used it.
Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com
Link: https://lkml.kernel.org/r/87a6grvqy8.fsf_-_@email.froward.int.ebiederm.org
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull cgroup fixes from Tejun Heo:
"This contains the cgroup.procs permission check fixes so that they use
the credentials at the time of open rather than write, which also
fixes the cgroup namespace lifetime bug"
* 'for-5.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
selftests: cgroup: Test open-time cgroup namespace usage for migration checks
selftests: cgroup: Test open-time credential usage for migration checks
selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644
cgroup: Use open-time cgroup namespace for process migration perm checks
cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv
cgroup: Use open-time credentials for process migraton perm checks
Convert 'allowed' in __cpuset_node_allowed() to be boolean since the
return types of node_isset() and __cpuset_node_allowed() are both
boolean.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When initializing a 'struct klp_object' in klp_init_object_loaded(), and
performing relocations in klp_resolve_symbols(), klp_find_object_symbol()
is invoked to look up the address of a symbol in an already-loaded module
(or vmlinux). This, in turn, calls kallsyms_on_each_symbol() or
module_kallsyms_on_each_symbol() to find the address of the symbol that is
being patched.
It turns out that symbol lookups often take up the most CPU time when
enabling and disabling a patch, and may hog the CPU and cause other tasks
on that CPU's runqueue to starve -- even in paths where interrupts are
enabled. For example, under certain workloads, enabling a KLP patch with
many objects or functions may cause ksoftirqd to be starved, and thus for
interrupts to be backlogged and delayed. This may end up causing TCP
retransmits on the host where the KLP patch is being applied, and in
general, may cause any interrupts serviced by softirqd to be delayed while
the patch is being applied.
So as to ensure that kallsyms_on_each_symbol() does not end up hogging the
CPU, this patch adds a call to cond_resched() in kallsyms_on_each_symbol()
and module_kallsyms_on_each_symbol(), which are invoked when doing a symbol
lookup in vmlinux and a module respectively. Without this patch, if a
live-patch is applied on a 36-core Intel host with heavy TCP traffic, a
~10x spike is observed in TCP retransmits while the patch is being applied.
Additionally, collecting sched events with perf indicates that ksoftirqd is
awakened ~1.3 seconds before it's eventually scheduled. With the patch, no
increase in TCP retransmit events is observed, and ksoftirqd is scheduled
shortly after it's awakened.
Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211229215646.830451-1-void@manifault.com
The __IRQF_TIMER bit from the flags argument was used in
add_interrupt_randomness() to distinguish the timer interrupt from other
interrupts. This is no longer the case.
Remove the flags argument from __handle_irq_event_percpu().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Since commit
ee3e00e9e7 ("random: use registers from interrupted code for CPU's w/o a cycle counter")
the irq_flags argument is no longer used.
Remove unused irq_flags.
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: linux-hyperv@vger.kernel.org
Cc: x86@kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
- Fix missing prototypes in sample module for direct functions
- Fix check of valid buffer in get_trace_buf()
- Fix annotations of percpu pointers.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYddVnBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qg2PAQDVhSODIERza+YwP4AkMYBLWukngdi4
2fvFOJa1qdGQ1AD/YMSsJzbqfUk5YL9LNElL37TFH0fyWzU85tXRHVwf4As=
=KKJx
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Three minor tracing fixes:
- Fix missing prototypes in sample module for direct functions
- Fix check of valid buffer in get_trace_buf()
- Fix annotations of percpu pointers"
* tag 'trace-v5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Tag trace_percpu_buffer as a percpu pointer
tracing: Fix check for trace_percpu_buffer validity in get_trace_buf()
ftrace/samples: Add missing prototypes direct functions
After commit dc26532aed ("cgroup: rstat: punt root-level optimization to
individual controllers"), each rstat on updated_children list has its
->updated_next not NULL.
This means we can remove the check on ->updated_next, if we make sure
the subtree from @root is on list, which could be done by checking
updated_next for root.
tj: Coding style fixes.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Instead of do while unconditionally, let's put the loop variant in
while.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's cgroup namespace which is
a potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.
This patch makes cgroup remember the cgroup namespace at the time of open
and uses it for migration permission checks instad of current's. Note that
this only applies to cgroup2 as cgroup1 doesn't have namespace support.
This also fixes a use-after-free bug on cgroupns reported in
https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
Note that backporting this fix also requires the preceding patch.
Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
Fixes: 5136f6365c ("cgroup: implement "nsdelegate" mount option")
Signed-off-by: Tejun Heo <tj@kernel.org>
of->priv is currently used by each interface file implementation to store
private information. This patch collects the current two private data usages
into struct cgroup_file_ctx which is allocated and freed by the common path.
This allows generic private data which applies to multiple files, which will
be used to in the following patch.
Note that cgroup_procs iterator is now embedded as procs.iter in the new
cgroup_file_ctx so that it doesn't need to be allocated and freed
separately.
v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in
cgroup_file_ctx as suggested by Linus.
v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too.
Converted. Didn't change to embedded allocation as cgroup1 pidlists get
stored for caching.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's credentials which is a
potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.
This patch makes both cgroup2 and cgroup1 process migration interfaces to
use the credentials saved at the time of open (file->f_cred) instead of
current's.
Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes: 187fe84067 ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy")
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
All map redirect functions except XSK maps convert xdp_buff to xdp_frame
before enqueueing it. So move this conversion of out the map functions
and into xdp_do_redirect(). This removes a bit of duplicated code, but more
importantly it makes it possible to support caller-allocated xdp_frame
structures, which will be added in a subsequent commit.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103150812.87914-5-toke@redhat.com
With the new osnoise tracer, we are seeing the below splat:
Kernel attempted to read user page (c7d880000) - exploit attempt? (uid: 0)
BUG: Unable to handle kernel data access on read at 0xc7d880000
Faulting instruction address: 0xc0000000002ffa10
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
...
NIP [c0000000002ffa10] __trace_array_vprintk.part.0+0x70/0x2f0
LR [c0000000002ff9fc] __trace_array_vprintk.part.0+0x5c/0x2f0
Call Trace:
[c0000008bdd73b80] [c0000000001c49cc] put_prev_task_fair+0x3c/0x60 (unreliable)
[c0000008bdd73be0] [c000000000301430] trace_array_printk_buf+0x70/0x90
[c0000008bdd73c00] [c0000000003178b0] trace_sched_switch_callback+0x250/0x290
[c0000008bdd73c90] [c000000000e70d60] __schedule+0x410/0x710
[c0000008bdd73d40] [c000000000e710c0] schedule+0x60/0x130
[c0000008bdd73d70] [c000000000030614] interrupt_exit_user_prepare_main+0x264/0x270
[c0000008bdd73de0] [c000000000030a70] syscall_exit_prepare+0x150/0x180
[c0000008bdd73e10] [c00000000000c174] system_call_vectored_common+0xf4/0x278
osnoise tracer on ppc64le is triggering osnoise_taint() for negative
duration in get_int_safe_duration() called from
trace_sched_switch_callback()->thread_exit().
The problem though is that the check for a valid trace_percpu_buffer is
incorrect in get_trace_buf(). The check is being done after calculating
the pointer for the current cpu, rather than on the main percpu pointer.
Fix the check to be against trace_percpu_buffer.
Link: https://lkml.kernel.org/r/a920e4272e0b0635cf20c444707cbce1b2c8973d.1640255304.git.naveen.n.rao@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Fixes: e2ace00117 ("tracing: Choose static tp_printk buffer by explicit nesting count")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit bfc6bb74e4 ("bpf: Implement verifier support for validation of async callbacks.")
added support for BPF_FUNC_timer_set_callback to
the __check_func_call() function. The test in __check_func_call() is
flaweed because it can mis-interpret a regular BPF-to-BPF pseudo-call
as a BPF_FUNC_timer_set_callback callback call.
Consider the conditional in the code:
if (insn->code == (BPF_JMP | BPF_CALL) &&
insn->imm == BPF_FUNC_timer_set_callback) {
The BPF_FUNC_timer_set_callback has value 170. This means that if you
have a BPF program that contains a pseudo-call with an instruction delta
of 170, this conditional will be found to be true by the verifier, and
it will interpret the pseudo-call as a callback. This leads to a mess
with the verification of the program because it makes the wrong
assumptions about the nature of this call.
Solution: include an explicit check to ensure that insn->src_reg == 0.
This ensures that calls cannot be mis-interpreted as an async callback
call.
Fixes: bfc6bb74e4 ("bpf: Implement verifier support for validation of async callbacks.")
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220105210150.GH1559@oracle.com
If we ever get to a point again where we convert a bogus looking <ptr>_or_null
typed register containing a non-zero fixed or variable offset, then lets not
reset these bounds to zero since they are not and also don't promote the register
to a <ptr> type, but instead leave it as <ptr>_or_null. Converting to a unknown
register could be an avenue as well, but then if we run into this case it would
allow to leak a kernel pointer this way.
Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* arm64/for-next/perf: (32 commits)
arm64: perf: Don't register user access sysctl handler multiple times
drivers: perf: marvell_cn10k: fix an IS_ERR() vs NULL check
perf/smmuv3: Fix unused variable warning when CONFIG_OF=n
arm64: perf: Support new DT compatibles
arm64: perf: Simplify registration boilerplate
arm64: perf: Support Denver and Carmel PMUs
drivers/perf: hisi: Add driver for HiSilicon PCIe PMU
docs: perf: Add description for HiSilicon PCIe PMU driver
dt-bindings: perf: Add YAML schemas for Marvell CN10K LLC-TAD pmu bindings
drivers: perf: Add LLC-TAD perf counter support
perf/smmuv3: Synthesize IIDR from CoreSight ID registers
perf/smmuv3: Add devicetree support
dt-bindings: Add Arm SMMUv3 PMCG binding
perf/arm-cmn: Add debugfs topology info
perf/arm-cmn: Add CI-700 Support
dt-bindings: perf: arm-cmn: Add CI-700
perf/arm-cmn: Support new IP features
perf/arm-cmn: Demarcate CMN-600 specifics
perf/arm-cmn: Move group validation data off-stack
perf/arm-cmn: Optimise DTC counter accesses
...
* for-next/misc:
: Miscellaneous patches
arm64: Use correct method to calculate nomap region boundaries
arm64: Drop outdated links in comments
arm64: errata: Fix exec handling in erratum 1418040 workaround
arm64: Unhash early pointer print plus improve comment
asm-generic: introduce io_stop_wc() and add implementation for ARM64
arm64: remove __dma_*_area() aliases
docs/arm64: delete a space from tagged-address-abi
arm64/fp: Add comments documenting the usage of state restore functions
arm64: mm: Use asid feature macro for cheanup
arm64: mm: Rename asid2idx() to ctxid2asid()
arm64: kexec: reduce calls to page_address()
arm64: extable: remove unused ex_handler_t definition
arm64: entry: Use SDEI event constants
arm64: Simplify checking for populated DT
arm64/kvm: Fix bitrotted comment for SVE handling in handle_exit.c
* for-next/cache-ops-dzp:
: Avoid DC instructions when DCZID_EL0.DZP == 1
arm64: mte: DC {GVA,GZVA} shouldn't be used when DCZID_EL0.DZP == 1
arm64: clear_page() shouldn't use DC ZVA when DCZID_EL0.DZP == 1
* for-next/stacktrace:
: Unify the arm64 unwind code
arm64: Make some stacktrace functions private
arm64: Make dump_backtrace() use arch_stack_walk()
arm64: Make profile_pc() use arch_stack_walk()
arm64: Make return_address() use arch_stack_walk()
arm64: Make __get_wchan() use arch_stack_walk()
arm64: Make perf_callchain_kernel() use arch_stack_walk()
arm64: Mark __switch_to() as __sched
arm64: Add comment for stack_info::kr_cur
arch: Make ARCH_STACKWALK independent of STACKTRACE
* for-next/xor-neon:
: Use SHA3 instructions to speed up XOR
arm64/xor: use EOR3 instructions when available
* for-next/kasan:
: Log potential KASAN shadow aliases
arm64: mm: log potential KASAN shadow alias
arm64: mm: use die_kernel_fault() in do_mem_abort()
* for-next/armv8_7-fp:
: Add HWCAPS for ARMv8.7 FEAT_AFP amd FEAT_RPRES
arm64: cpufeature: add HWCAP for FEAT_RPRES
arm64: add ID_AA64ISAR2_EL1 sys register
arm64: cpufeature: add HWCAP for FEAT_AFP
* for-next/atomics:
: arm64 atomics clean-ups and codegen improvements
arm64: atomics: lse: define RETURN ops in terms of FETCH ops
arm64: atomics: lse: improve constraints for simple ops
arm64: atomics: lse: define ANDs in terms of ANDNOTs
arm64: atomics lse: define SUBs in terms of ADDs
arm64: atomics: format whitespace consistently
* for-next/bti:
: BTI clean-ups
arm64: Ensure that the 'bti' macro is defined where linkage.h is included
arm64: Use BTI C directly and unconditionally
arm64: Unconditionally override SYM_FUNC macros
arm64: Add macro version of the BTI instruction
arm64: ftrace: add missing BTIs
arm64: kexec: use __pa_symbol(empty_zero_page)
arm64: update PAC description for kernel
* for-next/sve:
: SVE code clean-ups and refactoring in prepararation of Scalable Matrix Extensions
arm64/sve: Minor clarification of ABI documentation
arm64/sve: Generalise vector length configuration prctl() for SME
arm64/sve: Make sysctl interface for SVE reusable by SME
* for-next/kselftest:
: arm64 kselftest additions
kselftest/arm64: Add pidbench for floating point syscall cases
kselftest/arm64: Add a test program to exercise the syscall ABI
kselftest/arm64: Allow signal tests to trigger from a function
kselftest/arm64: Parameterise ptrace vector length information
* for-next/kcsan:
: Enable KCSAN for arm64
arm64: Enable KCSAN
HAS_IOMEM option may not be selected on some platforms (e.g, s390) and
this will cause compilation failure due to missing memremap()
implementation.
Fix it by stubbing out swiotlb_mem_remap when CONFIG_HAS_IOMEM is not
set.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
When enabling a klp patch with klp_enable_patch(), klp_init_patch_early()
is invoked to initialize the kobjects for the patch itself, as well as the
'struct klp_object' and 'struct klp_func' objects that comprise it.
However, there are some error paths in klp_enable_patch() where some
kobjects may have been initialized with kobject_init(), but an error code
is still returned due to e.g. a 'struct klp_object' having a NULL funcs
pointer.
In these paths, the initial reference of the kobject of the 'struct
klp_patch' may never be released, along with one or more of its objects and
their functions, as kobject_put() is not invoked on the cleanup path if
klp_init_patch_early() returns an error code.
For example, if an object entry such as the following were added to the
sample livepatch module's klp patch, it would cause the vmlinux klp_object,
and its klp_func which updates 'cmdline_proc_show', to never be released:
static struct klp_object objs[] = {
{
/* name being NULL means vmlinux */
.funcs = funcs,
},
{
/* NULL funcs -- would cause reference leak */
.name = "kvm",
}, { }
};
Without this change, if CONFIG_DEBUG_KOBJECT is enabled, and the sample klp
patch is loaded, the kobjects (the patch, the vmlinux 'struct klp_object',
and its func) are observed as initialized, but never released, in the dmesg
log output. With the change, these kobject references no longer fail to be
released as the error case is properly handled before they are initialized.
Signed-off-by: David Vernet <void@manifault.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-12-30
The following pull-request contains BPF updates for your *net-next* tree.
We've added 72 non-merge commits during the last 20 day(s) which contain
a total of 223 files changed, 3510 insertions(+), 1591 deletions(-).
The main changes are:
1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii.
2) Beautify and de-verbose verifier logs, from Christy.
3) Composable verifier types, from Hao.
4) bpf_strncmp helper, from Hou.
5) bpf.h header dependency cleanup, from Jakub.
6) get_func_[arg|ret|arg_cnt] helpers, from Jiri.
7) Sleepable local storage, from KP.
8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
commit 077cdda764 ("net/mlx5e: TC, Fix memory leak with rules with internal port")
commit 31108d142f ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
commit 4390c6edc0 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/
net/smc/smc_wr.c
commit 49dc9013e3 ("net/smc: Use the bitmap API when applicable")
commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
bitmap_zero()/memset() is removed by the fix
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add missing includes unmasked by the subsequent change.
Mostly network drivers missing an include for XDP_PACKET_HEADROOM.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230012742.770642-2-kuba@kernel.org
Other maps like hashmaps are already available to sleepable programs.
Sleepable BPF programs run under trace RCU. Allow task, sk and inode
storage to be used from sleepable programs. This allows sleepable and
non-sleepable programs to provide shareable annotations on kernel
objects.
Sleepable programs run in trace RCU where as non-sleepable programs run
in a normal RCU critical section i.e. __bpf_prog_enter{_sleepable}
and __bpf_prog_exit{_sleepable}) (rcu_read_lock or rcu_read_lock_trace).
In order to make the local storage maps accessible to both sleepable
and non-sleepable programs, one needs to call both
call_rcu_tasks_trace and call_rcu to wait for both trace and classical
RCU grace periods to expire before freeing memory.
Paul's work on call_rcu_tasks_trace allows us to have per CPU queueing
for call_rcu_tasks_trace. This behaviour can be achieved by setting
rcupdate.rcu_task_enqueue_lim=<num_cpus> boot parameter.
In light of these new performance changes and to keep the local storage
code simple, avoid adding a new flag for sleepable maps / local storage
to select the RCU synchronization (trace / classical).
Also, update the dereferencing of the pointers to use
rcu_derference_check (with either the trace or normal RCU locks held)
with a common bpf_rcu_lock_held helper method.
Signed-off-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211224152916.1550677-2-kpsingh@kernel.org
sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.
There's a lot of missing includes this was masking. Primarily
in networking tho, this time.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
There is no need to pass the pointer to the kset in the struct
kset_uevent_ops callbacks as no one uses it, so just remove that pointer
entirely.
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Wedson Almeida Filho <wedsonaf@google.com>
Link: https://lore.kernel.org/r/20211227163924.3970661-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This way instances of kobj_type (which contain function pointers) can be
stored in .rodata, which means that they cannot be [easily/accidentally]
modified at runtime.
Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com>
Link: https://lore.kernel.org/r/20211224231345.777370-1-wedsonaf@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge misc fixes from Andrew Morton:
"9 patches.
Subsystems affected by this patch series: mm (kfence, mempolicy,
memory-failure, pagemap, pagealloc, damon, and memory-failure),
core-kernel, and MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()
mm/damon/dbgfs: protect targets destructions with kdamond_lock
mm/page_alloc: fix __alloc_size attribute for alloc_pages_exact_nid
mm: delete unsafe BUG from page_cache_add_speculative()
mm, hwpoison: fix condition in free hugetlb page path
MAINTAINERS: mark more list instances as moderated
kernel/crash_core: suppress unknown crashkernel parameter warning
mm: mempolicy: fix THP allocations escaping mempolicy restrictions
kfence: fix memory leak when cat kfence objects
When booting with crashkernel= on the kernel command line a warning
similar to
Kernel command line: ro console=ttyS0 crashkernel=256M
Unknown kernel command line parameters "crashkernel=256M", will be passed to user space.
is printed.
This comes from crashkernel= being parsed independent from the kernel
parameter handling mechanism. So the code in init/main.c doesn't know
that crashkernel= is a valid kernel parameter and prints this incorrect
warning.
Suppress the warning by adding a dummy early_param handler for
crashkernel=.
Link: https://lkml.kernel.org/r/20211208133443.6867-1-prudo@redhat.com
Fixes: 86d1919a4f ("init: print out unknown kernel parameters")
Signed-off-by: Philipp Rudo <prudo@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull ucount fix from Eric Biederman:
"This fixes a silly logic bug in the ucount rlimits code, where it was
comparing against the wrong limit"
* 'ucount-rlimit-fixes-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: Fix rlimit max values check
The livepatch subsystem has several exported functions and objects with
kerneldoc comments. Though the livepatch documentation contains handwritten
descriptions of all of these exported functions, they are currently not
pulled into the docs build using the kernel-doc directive.
In order to allow readers of the documentation to see the full kerneldoc
comments in the generated documentation files, this change adds a new
Documentation/livepatch/api.rst page which contains kernel-doc directives
to link the kerneldoc comments directly in the documentation. With this,
all of the hand-written descriptions of the APIs now cross-reference the
kerneldoc comments on the new Livepatching APIs page, and running
./scripts/find-unused-docs.sh on kernel/livepatch no longer shows any files
as missing documentation.
Note that all of the handwritten API descriptions were left alone with the
exception of Documentation/livepatch/system-state.rst, which was updated to
allow the cross-referencing to work correctly. The file now follows the
cross-referencing formatting guidance specified in
Documentation/doc-guide/kernel-doc.rst. Furthermore, some comments around
klp_shadow_free_all() were updated to say <_, id> rather than <*, id> to
match the rest of the file, and to prevent the docs build from emitting an
"Inline emphasis start-string without end string" error.
Signed-off-by: David Vernet <void@manifault.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211221145743.4098360-1-void@manifault.com
Kernel threads abuse set_child_tid. Historically that has been fine
as set_child_tid was initialized after the kernel thread had been
forked. Unfortunately storing struct kthread in set_child_tid after
the thread is running makes struct kthread being unusable for storing
result codes of the thread.
When set_child_tid is set to struct kthread during fork that results
in schedule_tail writing the thread id to the beggining of struct
kthread (if put_user does not realize it is a kernel address).
Solve this by skipping the put_user for all kthreads.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lkml.kernel.org/r/YcNsG0Lp94V13whH@archlinux-ax161
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The running_trace_lock protects running_trace_list and is acquired
within the tracepoint which implies disabled preemption. The spinlock_t
typed lock can not be acquired with disabled preemption on PREEMPT_RT
because it becomes a sleeping lock.
The runtime of the tracepoint depends on the number of entries in
running_trace_list and has no limit. The blk-tracer is considered debug
code and higher latencies here are okay.
Make running_trace_lock a raw_spinlock_t.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Link: https://lore.kernel.org/r/20211220192827.38297-1-wander@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make use of struct_size() helper instead of an open-coded calculation.
Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In Isolation VM with AMD SEV, bounce buffer needs to be accessed via
extra address space which is above shared_gpa_boundary (E.G 39 bit
address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access
physical address will be original physical address + shared_gpa_boundary.
The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top of
memory(vTOM). Memory addresses below vTOM are automatically treated as
private while memory above vTOM is treated as shared.
Expose swiotlb_unencrypted_base for platforms to set unencrypted
memory base offset and platform calls swiotlb_update_mem_attributes()
to remap swiotlb mem to unencrypted address space. memremap() can
not be called in the early stage and so put remapping code into
swiotlb_update_mem_attributes(). Store remap address and use it to copy
data from/to swiotlb bounce buffer.
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20211213071407.314309-2-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
I just fixed a bug in copy_process when using the label
bad_fork_cleanup_threadgroup_lock. While fixing the bug I looked
closer at the label and realized it has been misnamed since
568ac88821 ("cgroup: reduce read locked section of
cgroup_threadgroup_rwsem during fork").
Fix the name so that fork is easier to understand.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Mark Brown <broonie@kernel.org> reported:
> This is also causing further build errors including but not limited to:
>
> /tmp/next/build/kernel/fork.c: In function 'copy_process':
> /tmp/next/build/kernel/fork.c:2106:4: error: label 'bad_fork_cleanup_threadgroup_lock' used but not defined
> 2106 | goto bad_fork_cleanup_threadgroup_lock;
> | ^~~~
It turns out that I messed up and was depending upon a label protected
by an ifdef. Move the label out of the ifdef as the ifdef around the label
no longer makes sense (if it ever did).
Link: https://lkml.kernel.org/r/YbugCP144uxXvRsk@sirena.org.uk
Fixes: 40966e316f ("kthread: Ensure struct kthread is present for all kthreads")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmG/OTsACgkQEsHwGGHe
VUrBWxAAhCQ5rFc5WkVxN3Lr2JLtY2bNUAOrdWNVXXmuKIZhbCgnXZ6a7NH9Ins/
zkLS7YL1gaZtcK+sYnPbO7Z6oTVEqV5UZnxuUH8DF8Q2U7cVdGvQSeHx5ghx4O35
13P0RSrj0++Q03dc5mf7+OA7RTuH00JpFCvRavpHNJDYFIN+gl1pPDjM/0g+j90W
PwFa/Hr8vOH7vpPRwygZ+yWfMunb7nTpY7Pa7toSQtE4NR6L2+A49+0/scjD5i9n
wQCFI4Md49DRV8qvC04YmN4XC72PBKo59z0ptw1LP1yYuD3n0IjjxhRmkaEGLS/x
abSs3DfwDDD3Bkl/CprJ6ZfoNez5jOsgdPgPH+c5QdHYk837JAgiLZL0M5YK+Gqf
azuYSv0XfSA6Jg4ioaqsw5gq2QhJS0/ej3VN9qLIspDLncx0BHHr99inrmuvONbl
cgtm24xQx8ezG8iEK4Ij05bg/sflwP8czTx4La8tnK2p1VK+xHeezKRLjEFqmXCr
NV8nZEPO7QVbNinViHnEcvz4fur1lYHpCJnG2UbNPipYT2XHsAkaVEZ8uvmg+Ovy
alcAasSVq9YdQbgWyYFmwWXVoPeG87z53MDA7kPk2TihJaOz2jaY0me5J6fOgSqh
QFETA8Hcd3Do9hf0MRY9HTX18/uKinW8HclVw2yZxdztGfOfAxA=
=8tah
-----END PGP SIGNATURE-----
Merge tag 'timers_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Make sure the CLOCK_REALTIME to CLOCK_MONOTONIC offset is never
positive
* tag 'timers_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Really make sure wall_to_monotonic isn't positive
to be terminated
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmG/NxMACgkQEsHwGGHe
VUpzQw/+OQ6cDj41E+482w3iQDdnQWTyWV29ukyBbR+QRDmi7IyPIR6YQ3mEz0Wu
qiG76aO3R7+y0mc84ISaZPhbZ1pTCvOPaBiE91rachc1w9bLH1J/HIy2veKvPw29
8Vhn6sB2lUoh8y8Cy8AHgD0D6u/imBuBrVyO+qT22r1ZUlnZj02fT1U/XD2e3WNO
Id9JXhzu6S2leRqg5hSS6WodXbtGBsM4k5jDscu3s4Akv0JS7dxaeVaEGLw5oqyJ
+sIL6V6BwbfLEe4UOgvVzVgwzXnyhqtVF8ldaqj3PpdjhqUtzqGEmirUq4WVjZ+R
A1mHZ3bgPQNqmdhhWNtz1IFSJcuVXGEgXSS98LStyLyxVPiAByo5wHWJxF3jx/UW
ag2boT/MyoKP3iRclUKOgRqeDFsDH4HCNF9YEyqu5uSrvJhMNwhhCttCDFKu3cAl
vSEXmgNr1gcL1IAUlm3w4ZQIU8x/eznfhZiVpoWqtGhSxQPmTShV/YT4S7SY7mtf
0kxhK/Y1nS4nQqDTyuyVzJDFVX1ZoS0SJXe1L9TnMiD7VLO9wEblgdaDfp8DxCrY
YPCpnpmnV9tOyGVmbAJU+Xz9Pbuoahr0h7JoslPDKMJQTO30vc0reF2Z5gV05FCM
SgFUExL9a3TGLplMPmz96MhtTnN/a884txQOCCpvDygubXVLnTs=
=85I3
-----END PGP SIGNATURE-----
Merge tag 'locking_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Fix the rtmutex condition checking when the optimistic spinning of a
waiter needs to be terminated
* tag 'locking_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Fix incorrect condition in rtmutex_spin_on_owner()
path, when no changes have been made to the alternative signal stack.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmG/NZ0ACgkQEsHwGGHe
VUqwEQ/8CCQ2KKBbDZOYIr1wnl+FDIycgq7tnz+q9SomzxQODdDWLREBoTPsOtoE
NZgXZEQxX4Wh/+4rvvdSMCVT3nz2GvSSasVKGrPZyLpDDyL3coRO0Ngx9iRUd1kF
j67e9oMuNboPC5jJfP9cC4T+GgDQDnXAjjT3jX7aiIXnNjnOCTZ5Z7W8GKw7d2qH
4L2SJwAPOkuRicdQiRMJhVLsowsDIZtC8q8OZHhwu0dqM3/JVJCIxKKGKV69j5uk
TUP6M0ZdyR30VrDfKYlm3m5fY0YFsBY/algphP41Hz5sUe9Xsw6F5+8sL3nCqLz1
BBUFr/00qVruM3jWmIag/OQ8/4cAFZjrx+8ewdF61OEOWya9Mq7VxINjT8R77B0i
AuA6Bkv1LArJyfvywbbD6JzAj7TQFPuhFPc0BUFwZfn+B1rvxm88JK2mjR9aO/wZ
ZHgDJ5hOSIKKNJ2W9g2fhW0MTMUELxKqxHZqOmQU/8ydVxYHZtD2GLHLDAU3XBoe
9PTntBvv7+qxqNQyY70k4jzIRfOFB8XuYxeWCbg10LqkbFFm2otYN2orsjVVBY7u
9wPQhFvJo6pHBx+dNIV6be56SnIeTCdIWBqlUcAto5mCVbmIxQoIMoNLo6rGBrhA
7UdhVCFJJki/Bs92aEQxl09volI9Ec7yXvmpU74LfKD+Gc8TxQo=
=if9T
-----END PGP SIGNATURE-----
Merge tag 'core_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull signal handlign fix from Borislav Petkov:
- Prevent lock contention on the new sigaltstack lock on the
common-case path, when no changes have been made to the alternative
signal stack.
* tag 'core_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
signal: Skip the altstack update when not needed
Allow passing PTR_TO_CTX, if the kfunc expects a matching struct type,
and punt to PTR_TO_MEM block if reg->type does not fall in one of
PTR_TO_BTF_ID or PTR_TO_SOCK* types. This will be used by future commits
to get access to XDP and TC PTR_TO_CTX, and pass various data (flags,
l4proto, netns_id, etc.) encoded in opts struct passed as pointer to
kfunc.
For PTR_TO_MEM support, arguments are currently limited to pointer to
scalar, or pointer to struct composed of scalars. This is done so that
unsafe scenarios (like passing PTR_TO_MEM where PTR_TO_BTF_ID of
in-kernel valid structure is expected, which may have pointers) are
avoided. Since the argument checking happens basd on argument register
type, it is not easy to ascertain what the expected type is. In the
future, support for PTR_TO_MEM for kfunc can be extended to serve other
usecases. The struct type whose pointer is passed in may have maximum
nesting depth of 4, all recursively composed of scalars or struct with
scalars.
Future commits will add negative tests that check whether these
restrictions imposed for kfunc arguments are duly rejected by BPF
verifier or not.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211217015031.1278167-4-memxor@gmail.com
Some helper functions may modify its arguments, for example,
bpf_d_path, bpf_get_stack etc. Previously, their argument types
were marked as ARG_PTR_TO_MEM, which is compatible with read-only
mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate,
but technically incorrect, to modify a read-only memory by passing
it into one of such helper functions.
This patch tags the bpf_args compatible with immutable memory with
MEM_RDONLY flag. The arguments that don't have this flag will be
only compatible with mutable memory types, preventing the helper
from modifying a read-only memory. The bpf_args that have
MEM_RDONLY are compatible with both mutable memory and immutable
memory.
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211217003152.48334-9-haoluo@google.com
Tag the return type of {per, this}_cpu_ptr with RDONLY_MEM. The
returned value of this pair of helpers is kernel object, which
can not be updated by bpf programs. Previously these two helpers
return PTR_OT_MEM for kernel objects of scalar type, which allows
one to directly modify the memory. Now with RDONLY_MEM tagging,
the verifier will reject programs that write into RDONLY_MEM.
Fixes: 63d9b80dcf ("bpf: Introducte bpf_this_cpu_ptr()")
Fixes: eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
Fixes: 4976b718c3 ("bpf: Introduce pseudo_btf_id")
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211217003152.48334-8-haoluo@google.com
This patch introduce a flag MEM_RDONLY to tag a reg value
pointing to read-only memory. It makes the following changes:
1. PTR_TO_RDWR_BUF -> PTR_TO_BUF
2. PTR_TO_RDONLY_BUF -> PTR_TO_BUF | MEM_RDONLY
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211217003152.48334-6-haoluo@google.com
We have introduced a new type to make bpf_reg composable, by
allocating bits in the type to represent flags.
One of the flags is PTR_MAYBE_NULL which indicates a pointer
may be NULL. This patch switches the qualified reg_types to
use this flag. The reg_types changed in this patch include:
1. PTR_TO_MAP_VALUE_OR_NULL
2. PTR_TO_SOCKET_OR_NULL
3. PTR_TO_SOCK_COMMON_OR_NULL
4. PTR_TO_TCP_SOCK_OR_NULL
5. PTR_TO_BTF_ID_OR_NULL
6. PTR_TO_MEM_OR_NULL
7. PTR_TO_RDONLY_BUF_OR_NULL
8. PTR_TO_RDWR_BUF_OR_NULL
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211217003152.48334-5-haoluo@google.com
We have introduced a new type to make bpf_ret composable, by
reserving high bits to represent flags.
One of the flag is PTR_MAYBE_NULL, which indicates a pointer
may be NULL. When applying this flag to ret_types, it means
the returned value could be a NULL pointer. This patch
switches the qualified arg_types to use this flag.
The ret_types changed in this patch include:
1. RET_PTR_TO_MAP_VALUE_OR_NULL
2. RET_PTR_TO_SOCKET_OR_NULL
3. RET_PTR_TO_TCP_SOCK_OR_NULL
4. RET_PTR_TO_SOCK_COMMON_OR_NULL
5. RET_PTR_TO_ALLOC_MEM_OR_NULL
6. RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL
7. RET_PTR_TO_BTF_ID_OR_NULL
This patch doesn't eliminate the use of these names, instead
it makes them aliases to 'RET_PTR_TO_XXX | PTR_MAYBE_NULL'.
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211217003152.48334-4-haoluo@google.com
We have introduced a new type to make bpf_arg composable, by
reserving high bits of bpf_arg to represent flags of a type.
One of the flags is PTR_MAYBE_NULL which indicates a pointer
may be NULL. When applying this flag to an arg_type, it means
the arg can take NULL pointer. This patch switches the
qualified arg_types to use this flag. The arg_types changed
in this patch include:
1. ARG_PTR_TO_MAP_VALUE_OR_NULL
2. ARG_PTR_TO_MEM_OR_NULL
3. ARG_PTR_TO_CTX_OR_NULL
4. ARG_PTR_TO_SOCKET_OR_NULL
5. ARG_PTR_TO_ALLOC_MEM_OR_NULL
6. ARG_PTR_TO_STACK_OR_NULL
This patch does not eliminate the use of these arg_types, instead
it makes them an alias to the 'ARG_XXX | PTR_MAYBE_NULL'.
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211217003152.48334-3-haoluo@google.com
Optimistic spinning needs to be terminated when the spinning waiter is not
longer the top waiter on the lock, but the condition is negated. It
terminates if the waiter is the top waiter, which is defeating the whole
purpose.
Fixes: c3123c4314 ("locking/rtmutex: Dont dereference waiter lockless")
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211217074207.77425-1-qiang1.zhang@intel.com
Even after commit e1d7ba8735 ("time: Always make sure wall_to_monotonic
isn't positive") it is still possible to make wall_to_monotonic positive
by running the following code:
int main(void)
{
struct timespec time;
clock_gettime(CLOCK_MONOTONIC, &time);
time.tv_nsec = 0;
clock_settime(CLOCK_REALTIME, &time);
return 0;
}
The reason is that the second parameter of timespec64_compare(), ts_delta,
may be unnormalized because the delta is calculated with an open coded
substraction which causes the comparison of tv_sec to yield the wrong
result:
wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 }
ts_delta = { .tv_sec = -9, .tv_nsec = -900000000 }
That makes timespec64_compare() claim that wall_to_monotonic < ts_delta,
but actually the result should be wall_to_monotonic > ts_delta.
After normalization, the result of timespec64_compare() is correct because
the tv_sec comparison is not longer misleading:
wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 }
ts_delta = { .tv_sec = -10, .tv_nsec = 100000000 }
Use timespec64_sub() to ensure that ts_delta is normalized, which fixes the
issue.
Fixes: e1d7ba8735 ("time: Always make sure wall_to_monotonic isn't positive")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211213135727.1656662-1-liaoyu15@huawei.com
Backtracking information is very verbose, don't print it in log
level 1 to improve readability.
Signed-off-by: Christy Lee <christylee@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211216213358.3374427-4-christylee@fb.com
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmG7vm8UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOCYw//Z7N53pFP1Ci1ToZWTgjdwBAV1lM/
52uG1aEg/TxAVHt/3STNXEmsUc3BaxpYQxBIevjkGYbxe3MRvE9ZJlSQdFpyjXOs
DrXxCC38TrcJ2wJpOPUidbokMSoyyJSX3dfSOwD566q1RCK1z9O7G544eh1DW651
ewYLVClOFuoyxiQiBQwSPPjaOV8vTmFWl+omsoZS74CcshPglAngqqZcLRNJ14RV
6TpnKZ1q4az7GQY1lqad1YmEwmMEgH32qfz/pFUvQ3s8omi3JhC1+IBggW2iE76G
Ssdw62sqrn3dEoSG5TADc8NxDH+MFLauF2XgRP9ct3eKFG3X3Z605eWEpDFJ1i8S
1FhOyherjQ1uSc6EOMMKfoyo7thrhoQ92wyCQBt4EkZxW8hULVuhqSX8KDs2p1+l
0epQmlpCrzAzbPSMHlC5LATga8zzaUbyoVj03AcDAb+I+29v5fNRmzAbJrKZruwM
dJosdAsJ9tlVE6GqyCIBLeC3PQxJ5Xjw3jpsrutD/aoFYkgKASve+Y927OWIj24r
KpFqjdLOS3dTKmxEQr97iF5w1IaW80lGykaQAjW2JZVp2CWOCUxQOtqTaUQYzQAp
H4D2aYzy9RJVHxvK0HYceT+FhrB+yIPKBMOaLz+UjDWopIkYzuJZ3AbaxLGVdGIh
pEMYpVR3XXm87z0=
=jWtt
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20211216' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"A single patch to fix a problem where the audit queue could grow
unbounded when the audit daemon is forcibly stopped"
* tag 'audit-pr-20211216' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: improve robustness of the audit queue handling
Current release - regressions:
- dpaa2-eth: fix buffer overrun when reporting ethtool statistics
Current release - new code bugs:
- bpf: fix incorrect state pruning for <8B spill/fill
- iavf:
- add missing unlocks in iavf_watchdog_task()
- do not override the adapter state in the watchdog task (again)
- mlxsw: spectrum_router: consolidate MAC profiles when possible
Previous releases - regressions:
- mac80211, fix:
- rate control, avoid driver crash for retransmitted frames
- regression in SSN handling of addba tx
- a memory leak where sta_info is not freed
- marking TX-during-stop for TX in in_reconfig, prevent stall
- cfg80211: acquire wiphy mutex on regulatory work
- wifi drivers: fix build regressions and LED config dependency
- virtio_net: fix rx_drops stat for small pkts
- dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()
Previous releases - always broken:
- bpf, fix:
- kernel address leakage in atomic fetch
- kernel address leakage in atomic cmpxchg's r0 aux reg
- signed bounds propagation after mov32
- extable fixup offset
- extable address check
- mac80211:
- fix the size used for building probe request
- send ADDBA requests using the tid/queue of the aggregation
session
- agg-tx: don't schedule_and_wake_txq() under sta->lock,
avoid deadlocks
- validate extended element ID is present
- mptcp:
- never allow the PM to close a listener subflow (null-defer)
- clear 'kern' flag from fallback sockets, prevent crash
- fix deadlock in __mptcp_push_pending()
- inet_diag: fix kernel-infoleak for UDP sockets
- xsk: do not sleep in poll() when need_wakeup set
- smc: avoid very long waits in smc_release()
- sch_ets: don't remove idle classes from the round-robin list
- netdevsim:
- zero-initialize memory for bpf map's value, prevent info leak
- don't let user space overwrite read only (max) ethtool parms
- ixgbe: set X550 MDIO speed before talking to PHY
- stmmac:
- fix null-deref in flower deletion w/ VLAN prio Rx steering
- dwmac-rk: fix oob read in rk_gmac_setup
- ice: time stamping fixes
- systemport: add global locking for descriptor life cycle
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmG7rdUACgkQMUZtbf5S
IrtRvw//etsgeg2+zxe+fBSbe7ZihcCB4yzWUoRDdNzPrLNLsnWxKT1wYblDcZft
b1f/SpTy9ycfg+fspn2qET8gzydn4m9xHkjmlQPzmXB9tdIDF6mECFTAXYlar1hQ
RQIijpfZYyrZeGdgHpsyq72YC4dpNdbZrxmQFVdpMr3cK8P2N0Dn32bBVa//+jb+
LCv3Uw9C0yNbqhtRIiukkWIE20+/pXtKm0uErDVmvonqFMWPo6mYD0C2PwC20PwR
Kv5ok6jH+44fCSwDoLChbB+Wes0AtrIQdUvUwXGXaF3MDfZl+24oLkX5xJl3EHWT
90Mh0k0NhRORgBZ3NItwK7OliohrRHCYxlAXPjg1Dicxl+kxl0wPlva8v64eAA+u
ZhwXwaQpCrZNdKoxHJw9kQ/CmbggtxcWkVolbZp3TzDjYY1E7qxuwg51YMhGmGT1
FPjradYGvHKi+thizJiEdiZaMKRc8bpaL0hbpROxFQvfjNwFOwREQhtnXYP3W5Kd
lK88fWaH86dxqL+ABvbrMnSZKuNlSL8R/CROWpZuF+vyLRXaxhAvYRrL79bgmkKq
zvImnh1mFovdyKGJhibFMdy92X14z8FzoyX3VQuFcl9EB+2NQXnNZ6abDLJlufZX
A0jQ5r46Ce/yyaXXmS61PrP7Pf5sxhs/69fqAIDQfSSzpyUKHd4=
=VIbd
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from mac80211, wifi, bpf.
Relatively large batches of fixes from BPF and the WiFi stack, calm in
general networking.
Current release - regressions:
- dpaa2-eth: fix buffer overrun when reporting ethtool statistics
Current release - new code bugs:
- bpf: fix incorrect state pruning for <8B spill/fill
- iavf:
- add missing unlocks in iavf_watchdog_task()
- do not override the adapter state in the watchdog task (again)
- mlxsw: spectrum_router: consolidate MAC profiles when possible
Previous releases - regressions:
- mac80211 fixes:
- rate control, avoid driver crash for retransmitted frames
- regression in SSN handling of addba tx
- a memory leak where sta_info is not freed
- marking TX-during-stop for TX in in_reconfig, prevent stall
- cfg80211: acquire wiphy mutex on regulatory work
- wifi drivers: fix build regressions and LED config dependency
- virtio_net: fix rx_drops stat for small pkts
- dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()
Previous releases - always broken:
- bpf fixes:
- kernel address leakage in atomic fetch
- kernel address leakage in atomic cmpxchg's r0 aux reg
- signed bounds propagation after mov32
- extable fixup offset
- extable address check
- mac80211:
- fix the size used for building probe request
- send ADDBA requests using the tid/queue of the aggregation
session
- agg-tx: don't schedule_and_wake_txq() under sta->lock, avoid
deadlocks
- validate extended element ID is present
- mptcp:
- never allow the PM to close a listener subflow (null-defer)
- clear 'kern' flag from fallback sockets, prevent crash
- fix deadlock in __mptcp_push_pending()
- inet_diag: fix kernel-infoleak for UDP sockets
- xsk: do not sleep in poll() when need_wakeup set
- smc: avoid very long waits in smc_release()
- sch_ets: don't remove idle classes from the round-robin list
- netdevsim:
- zero-initialize memory for bpf map's value, prevent info leak
- don't let user space overwrite read only (max) ethtool parms
- ixgbe: set X550 MDIO speed before talking to PHY
- stmmac:
- fix null-deref in flower deletion w/ VLAN prio Rx steering
- dwmac-rk: fix oob read in rk_gmac_setup
- ice: time stamping fixes
- systemport: add global locking for descriptor life cycle"
* tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (89 commits)
bpf, selftests: Fix racing issue in btf_skc_cls_ingress test
selftest/bpf: Add a test that reads various addresses.
bpf: Fix extable address check.
bpf: Fix extable fixup offset.
bpf, selftests: Add test case trying to taint map value pointer
bpf: Make 32->64 bounds propagation slightly more robust
bpf: Fix signed bounds propagation after mov32
sit: do not call ipip6_dev_free() from sit_init_net()
net: systemport: Add global locking for descriptor lifecycle
net/smc: Prevent smc_release() from long blocking
net: Fix double 0x prefix print in SKB dump
virtio_net: fix rx_drops stat for small pkts
dsa: mv88e6xxx: fix debug print for SPEED_UNFORCED
sfc_ef100: potential dereference of null pointer
net: stmmac: dwmac-rk: fix oob read in rk_gmac_setup
net: usb: lan78xx: add Allied Telesis AT29M2-AF
net/packet: rx_owner_map depends on pg_vec
netdevsim: Zero-initialize memory for new map's value in function nsim_bpf_map_alloc
dpaa2-eth: fix ethtool statistics
ixgbe: set X550 MDIO speed before talking to PHY
...
We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency,
make sure those who actually need more than the definition of
struct cgroup_bpf include bpf-cgroup.h explicitly.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org
The current linked list storage for MSI descriptors is suboptimal in
several ways:
1) Looking up a MSI desciptor requires a O(n) list walk in the worst case
2) The upcoming support of runtime expansion of MSI-X vectors would need
to do a full list walk to figure out whether a particular index is
already associated.
3) Runtime expansion of sparse allocations is even more complex as the
current implementation assumes an ordered list (increasing MSI index).
Use an xarray which solves all of the above problems nicely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210749.280627070@linutronix.de
The sysfs handling for MSI is a convoluted maze and it is in the way of
supporting dynamic expansion of the MSI-X vectors because it only supports
a one off bulk population/free of the sysfs entries.
Change it to do:
1) Creating an empty sysfs attribute group when msi_device_data is
allocated
2) Populate the entries when the MSI descriptor is initialized
3) Free the entries when a MSI descriptor is detached from a Linux
interrupt.
4) Provide functions for the legacy non-irqdomain fallback code to
do a bulk population/free. This code won't support dynamic
expansion.
This makes the code simpler and reduces the number of allocations as the
empty attribute group can be shared.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210749.224917330@linutronix.de
Get rid of the old iterators, alloc/free functions and adjust the core code
accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210749.117395027@linutronix.de
Use the new iterator functions and add locking where required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210749.063705667@linutronix.de
There is no real reason to do several loops over the MSI descriptors
instead of just doing one loop. In case of an error everything is undone
anyway so it does not matter whether it's a partial or a full rollback.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210749.010234767@linutronix.de
The allocation code is overly complex. It tries to have the MSI index space
packed, which is not working when an interrupt is freed. There is no
requirement for this. The only requirement is that the MSI index is unique.
Move the MSI descriptor allocation into msi_domain_populate_irqs() and use
the Linux interrupt number as MSI index which fulfils the unique
requirement.
This requires to lock the MSI descriptors which makes the lock order
reverse to the regular MSI alloc/free functions vs. the domain
mutex. Assign a seperate lockdep class for these MSI device domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210748.956731741@linutronix.de
Provide domain info flags which tell the core to allocate simple
descriptors or to free descriptors when the interrupts are freed and
implement the required functionality.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210747.928198636@linutronix.de
Provide msi_alloc_msi_desc() which takes a template MSI descriptor for
initializing a newly allocated descriptor. This allows to simplify various
usage sites of alloc_msi_entry() and moves the storage handling into the
core code.
For simple cases where only a linear vector space is required provide
msi_add_simple_msi_descs() which just allocates a linear range of MSI
descriptors and fills msi_desc::msi_index accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210747.873833567@linutronix.de
In preparation for dynamic handling of MSI-X interrupts provide a new set
of MSI descriptor accessor functions and iterators. They are benefitial per
se as they allow to cleanup quite some code in various MSI domain
implementations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210747.818635078@linutronix.de
Usage sites which do allocations of the MSI descriptors before invoking
msi_domain_alloc_irqs() require to lock the MSI decriptors accross the
operation.
Provide entry points which can be called with the MSI mutex held and lock
the mutex in the existing entry points.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210747.765371053@linutronix.de
For upcoming runtime extensions of MSI-X interrupts it's required to
protect the MSI descriptor list. Add a mutex to struct msi_device_data and
provide lock/unlock functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210747.708877269@linutronix.de
It's only required when MSI is in use.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211206210747.650487479@linutronix.de
This allows drivers to retrieve the Linux interrupt number instead of
fiddling with MSI descriptors.
msi_get_virq() returns the Linux interrupt number or 0 in case that there
is no entry for the given MSI index.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211210221814.780824745@linutronix.de
No more users. Refactor the core code accordingly and move the global
interface under CONFIG_PCI_MSI_ARCH_FALLBACKS.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211210221814.168362229@linutronix.de
Add new allocation functions which can be activated by domain info
flags. They store the groups pointer in struct msi_device_data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211210221813.988659194@linutronix.de
Create struct msi_device_data and add a pointer of that type to struct
dev_msi_info, which is part of struct device. Provide an allocator function
which can be invoked from the MSI interrupt allocation code pathes.
Add a properties field to the data structure as a first member so the
allocation size is not zero bytes. The field will be uses later on.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211210221813.676660809@linutronix.de
to determine whether this is MSI or MSIX instead of consulting MSI
descriptors.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nishanth Menon <nm@ti.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20211210221813.434156196@linutronix.de
Make the bounds propagation in __reg_assign_32_into_64() slightly more
robust and readable by aligning it similarly as we did back in the
__reg_combine_64_into_32() counterpart. Meaning, only propagate or
pessimize them as a smin/smax pair.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
For the case where both s32_{min,max}_value bounds are positive, the
__reg_assign_32_into_64() directly propagates them to their 64 bit
counterparts, otherwise it pessimises them into [0,u32_max] universe and
tries to refine them later on by learning through the tnum as per comment
in mentioned function. However, that does not always happen, for example,
in mov32 operation we call zext_32_to_64(dst_reg) which invokes the
__reg_assign_32_into_64() as is without subsequent bounds update as
elsewhere thus no refinement based on tnum takes place.
Thus, not calling into the __update_reg_bounds() / __reg_deduce_bounds() /
__reg_bound_offset() triplet as we do, for example, in case of ALU ops via
adjust_scalar_min_max_vals(), will lead to more pessimistic bounds when
dumping the full register state:
Before fix:
0: (b4) w0 = -1
1: R0_w=invP4294967295
(id=0,imm=ffffffff,
smin_value=4294967295,smax_value=4294967295,
umin_value=4294967295,umax_value=4294967295,
var_off=(0xffffffff; 0x0),
s32_min_value=-1,s32_max_value=-1,
u32_min_value=-1,u32_max_value=-1)
1: (bc) w0 = w0
2: R0_w=invP4294967295
(id=0,imm=ffffffff,
smin_value=0,smax_value=4294967295,
umin_value=4294967295,umax_value=4294967295,
var_off=(0xffffffff; 0x0),
s32_min_value=-1,s32_max_value=-1,
u32_min_value=-1,u32_max_value=-1)
Technically, the smin_value=0 and smax_value=4294967295 bounds are not
incorrect, but given the register is still a constant, they break assumptions
about const scalars that smin_value == smax_value and umin_value == umax_value.
After fix:
0: (b4) w0 = -1
1: R0_w=invP4294967295
(id=0,imm=ffffffff,
smin_value=4294967295,smax_value=4294967295,
umin_value=4294967295,umax_value=4294967295,
var_off=(0xffffffff; 0x0),
s32_min_value=-1,s32_max_value=-1,
u32_min_value=-1,u32_max_value=-1)
1: (bc) w0 = w0
2: R0_w=invP4294967295
(id=0,imm=ffffffff,
smin_value=4294967295,smax_value=4294967295,
umin_value=4294967295,umax_value=4294967295,
var_off=(0xffffffff; 0x0),
s32_min_value=-1,s32_max_value=-1,
u32_min_value=-1,u32_max_value=-1)
Without the smin_value == smax_value and umin_value == umax_value invariant
being intact for const scalars, it is possible to leak out kernel pointers
from unprivileged user space if the latter is enabled. For example, when such
registers are involved in pointer arithmtics, then adjust_ptr_min_max_vals()
will taint the destination register into an unknown scalar, and the latter
can be exported and stored e.g. into a BPF map value.
Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Kuee K1r0a <liulin063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
If the audit daemon were ever to get stuck in a stopped state the
kernel's kauditd_thread() could get blocked attempting to send audit
records to the userspace audit daemon. With the kernel thread
blocked it is possible that the audit queue could grow unbounded as
certain audit record generating events must be exempt from the queue
limits else the system enter a deadlock state.
This patch resolves this problem by lowering the kernel thread's
socket sending timeout from MAX_SCHEDULE_TIMEOUT to HZ/10 and tweaks
the kauditd_send_queue() function to better manage the various audit
queues when connection problems occur between the kernel and the
audit daemon. With this patch, the backlog may temporarily grow
beyond the defined limits when the audit daemon is stopped and the
system is under heavy audit pressure, but kauditd_thread() will
continue to make progress and drain the queues as it would for other
connection problems. For example, with the audit daemon put into a
stopped state and the system configured to audit every syscall it
was still possible to shutdown the system without a kernel panic,
deadlock, etc.; granted, the system was slow to shutdown but that is
to be expected given the extreme pressure of recording every syscall.
The timeout value of HZ/10 was chosen primarily through
experimentation and this developer's "gut feeling". There is likely
no one perfect value, but as this scenario is limited in scope (root
privileges would be needed to send SIGSTOP to the audit daemon), it
is likely not worth exposing this as a tunable at present. This can
always be done at a later date if it proves necessary.
Cc: stable@vger.kernel.org
Fixes: 5b52330bbf ("audit: fix auditd/kernel connection state tracking")
Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Tested-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Due to the audit control mutex necessary for serializing audit
userspace messages we haven't been able to block/penalize userspace
processes that attempt to send audit records while the system is
under audit pressure. The result is that privileged userspace
applications have a priority boost with respect to audit as they are
not bound by the same audit queue throttling as the other tasks on
the system.
This patch attempts to restore some balance to the system when under
audit pressure by blocking these privileged userspace tasks after
they have finished their audit processing, and dropped the audit
control mutex, but before they return to userspace.
Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Tested-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The implementation of BPF_CMPXCHG on a high level has the following parameters:
.-[old-val] .-[new-val]
BPF_R0 = cmpxchg{32,64}(DST_REG + insn->off, BPF_R0, SRC_REG)
`-[mem-loc] `-[old-val]
Given a BPF insn can only have two registers (dst, src), the R0 is fixed and
used as an auxilliary register for input (old value) as well as output (returning
old value from memory location). While the verifier performs a number of safety
checks, it misses to reject unprivileged programs where R0 contains a pointer as
old value.
Through brute-forcing it takes about ~16sec on my machine to leak a kernel pointer
with BPF_CMPXCHG. The PoC is basically probing for kernel addresses by storing the
guessed address into the map slot as a scalar, and using the map value pointer as
R0 while SRC_REG has a canary value to detect a matching address.
Fix it by checking R0 for pointers, and reject if that's the case for unprivileged
programs.
Fixes: 5ffa25502b ("bpf: Add instructions for atomic_[cmp]xchg")
Reported-by: Ryota Shiga (Flatt Security)
Acked-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The change in commit 37086bfdc7 ("bpf: Propagate stack bounds to registers
in atomics w/ BPF_FETCH") around check_mem_access() handling is buggy since
this would allow for unprivileged users to leak kernel pointers. For example,
an atomic fetch/and with -1 on a stack destination which holds a spilled
pointer will migrate the spilled register type into a scalar, which can then
be exported out of the program (since scalar != pointer) by dumping it into
a map value.
The original implementation of XADD was preventing this situation by using
a double call to check_mem_access() one with BPF_READ and a subsequent one
with BPF_WRITE, in both cases passing -1 as a placeholder value instead of
register as per XADD semantics since it didn't contain a value fetch. The
BPF_READ also included a check in check_stack_read_fixed_off() which rejects
the program if the stack slot is of __is_pointer_value() if dst_regno < 0.
The latter is to distinguish whether we're dealing with a regular stack spill/
fill or some arithmetical operation which is disallowed on non-scalars, see
also 6e7e63cbb0 ("bpf: Forbid XADD on spilled pointers for unprivileged
users") for more context on check_mem_access() and its handling of placeholder
value -1.
One minimally intrusive option to fix the leak is for the BPF_FETCH case to
initially check the BPF_READ case via check_mem_access() with -1 as register,
followed by the actual load case with non-negative load_reg to propagate
stack bounds to registers.
Fixes: 37086bfdc7 ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH")
Reported-by: <n4ke4mry@gmail.com>
Acked-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make use of struct_size() helper instead of an open-coded calucation.
Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
== Background ==
Support for large, "dynamic" fpstates was recently merged. This
included code to ensure that sigaltstacks are sufficiently sized for
these large states. A new lock was added to remove races between
enabling large features and setting up sigaltstacks.
== Problem ==
The new lock (sigaltstack_lock()) is acquired in the sigreturn path
before restoring the old sigaltstack. Unfortunately, contention on the
new lock causes a measurable signal handling performance regression [1].
However, the common case is that no *changes* are made to the
sigaltstack state at sigreturn.
== Solution ==
do_sigaltstack() acquires sigaltstack_lock() and is used for both
sys_sigaltstack() and restoring the sigaltstack in sys_sigreturn().
Check for changes to the sigaltstack before taking the lock. If no
changes were made, return before acquiring the lock.
This removes lock contention from the common-case sigreturn path.
[1] https://lore.kernel.org/lkml/20211207012128.GA16074@xsang-OptiPlex-9020/
Fixes: 3aac3ebea0 ("x86/signal: Implement sigaltstack size validation")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20211210225503.12734-1-chang.seok.bae@intel.com
If a cset is already on preloaded list, this means we have already setup
this cset properly for migration.
This patch just relocates the root cgrp lookup which isn't used anyway
when the cset is already on the preloaded list.
[tj@kernel.org: rephrase the commit log]
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch enables KCSAN for arm64, with updates to build rules
to not use KCSAN for several incompatible compilation units.
Recent GCC version(at least GCC10) made outline-atomics as the
default option(unlike Clang), which will cause linker errors
for kernel/kcsan/core.o. Disables the out-of-line atomics by
no-outline-atomics to fix the linker errors.
Meanwhile, as Mark said[1], some latent issues are needed to be
fixed which isn't just a KCSAN problem, we make the KCSAN depends
on EXPERT for now.
Tested selftest and kcsan_test(built with GCC11 and Clang 13),
and all passed.
[1] https://lkml.kernel.org/r/YadiUPpJ0gADbiHQ@FVFF77S0Q05N
Acked-by: Marco Elver <elver@google.com> # kernel/kcsan
Tested-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20211211131734.126874-1-wangkefeng.wang@huawei.com
[catalin.marinas@arm.com: added comment to justify EXPERT]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
On arm64, user space counter access will be controlled differently
compared to x86. On x86, access in the strictest mode is enabled for all
tasks in an MM when any event is mmap'ed. For arm64, access is
explicitly requested for an event and only enabled when the event's
context is active. This avoids hooks into the arch context switch code
and gives better control of when access is enabled.
In order to configure user space access when the PMU is enabled, it is
necessary to know if any event (currently active or not) in the current
context has user space accessed enabled. Add a counter similar to other
counters in the context to avoid walking the event list every time.
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20211208201124.310740-3-robh@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
In non trivial scenarios, the action id alone is not sufficient to
identify the program causing the warning. Before the previous patch,
the generated stack-trace pointed out at least the involved device
driver.
Let's additionally include the program name and id, and the relevant
device name.
If the user needs additional infos, he can fetch them via a kernel
probe, leveraging the arguments added here.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com
In validate_change(), there is a check since v2.6.12 to make sure that
each of the child cpusets must be a subset of a parent cpuset. IOW, it
allows child cpusets to restrict what changes can be made to a parent's
"cpuset.cpus". This actually violates one of the core principles of the
default hierarchy where a cgroup higher up in the hierarchy should be
able to change configuration however it sees fit as deligation breaks
down otherwise.
To address this issue, the check is now removed for the default hierarchy
to free parent cpusets from being restricted by child cpusets. The
check will still apply for legacy hierarchy.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The exit code of kernel threads has different semantics than the
exit_code of userspace tasks. To avoid confusion and allow
the userspace implementation to change as needed move
the kernel thread exit code into struct kthread.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today the rules are a bit iffy and arbitrary about which kernel
threads have struct kthread present. Both idle threads and thread
started with create_kthread want struct kthread present so that is
effectively all kernel threads. Make the rule that if PF_KTHREAD
and the task is running then struct kthread is present.
This will allow the kernel thread code to using tsk->exit_code
with different semantics from ordinary processes.
To make ensure that struct kthread is present for all
kernel threads move it's allocation into copy_process.
Add a deallocation of struct kthread in exec for processes
that were kernel threads.
Move the allocation of struct kthread for the initial thread
earlier so that it is not repeated for each additional idle
thread.
Move the initialization of struct kthread into set_kthread_struct
so that the structure is always and reliably initailized.
Clear set_child_tid in free_kthread_struct to ensure the kthread
struct is reliably freed during exec. The function
free_kthread_struct does not need to clear vfork_done during exec as
exec_mm_release called from exec_mmap has already cleared vfork_done.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update complete_and_exit to call kthread_exit instead of do_exit.
Change the name to reflect this change in functionality. All of the
users of complete_and_exit are causing the current kthread to exit so
this change makes it clear what is happening.
Move the implementation of kthread_complete_and_exit from
kernel/exit.c to to kernel/kthread.c. As this function is kthread
specific it makes most sense to live with the kthread functions.
There are no functional change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update module_put_and_exit to call kthread_exit instead of do_exit.
Change the name to reflect this change in functionality. All of the
users of module_put_and_exit are causing the current kthread to exit
so this change makes it clear what is happening. There is no
functional change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The way the per task_struct exit_code is used by kernel threads is not
quite compatible how it is used by userspace applications. The low
byte of the userspace exit_code value encodes the exit signal. While
kthreads just use the value as an int holding ordinary kernel function
exit status like -EPERM.
Add kthread_exit to clearly separate the two kinds of uses.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that there are no more modular uses of do_exit remove the EXPORT_SYMBOL.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When the kernel detects it is oops or otherwise force killing a task
while it exits the code poorly attempts to permanently stop the task
from scheduling.
I say poorly because it is possible for a task in TASK_UINTERRUPTIBLE
to be woken up.
As it makes no sense for the task to continue call do_task_dead
instead which actually does the work and permanently removes the task
from the scheduler. Guaranteeing the task will never be woken
up again.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The beginning of do_exit has become cluttered and difficult to read as
it is filled with checks to handle things that can only happen when
the kernel is operating improperly.
Now that we have a dedicated function for cleaning up a task when the
kernel is operating improperly move the checks there.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are two big uses of do_exit. The first is it's design use to be
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.
Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.
Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.
As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Adding following helpers for tracing programs:
Get n-th argument of the traced function:
long bpf_get_func_arg(void *ctx, u32 n, u64 *value)
Get return value of the traced function:
long bpf_get_func_ret(void *ctx, u64 *value)
Get arguments count of the traced function:
long bpf_get_func_arg_cnt(void *ctx)
The trampoline now stores number of arguments on ctx-8
address, so it's easy to verify argument index and find
return value argument's position.
Moving function ip address on the trampoline stack behind
the number of functions arguments, so it's now stored on
ctx-16 address if it's needed.
All helpers above are inlined by verifier.
Also bit unrelated small change - using newly added function
bpf_prog_has_trampoline in check_get_func_ip.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211208193245.172141-5-jolsa@kernel.org
Adding support to access arguments with int pointer arguments
in tracing programs.
Currently we allow tracing programs to access only pointers to
string (char pointer), void pointers and pointers to structs.
If we try to access argument which is pointer to int, verifier
will fail to load the program with;
R1 type=ctx expected=fp
; int BPF_PROG(fmod_ret_test, int _a, __u64 _b, int _ret)
0: (bf) r6 = r1
; int BPF_PROG(fmod_ret_test, int _a, __u64 _b, int _ret)
1: (79) r9 = *(u64 *)(r6 +8)
func 'bpf_modify_return_test' arg1 type INT is not a struct
There is no harm for the program to access int pointer argument.
We are already doing that for string pointer, which is pointer
to int with 1 byte size.
Changing the is_string_ptr to generic integer check and renaming
it to btf_type_is_int.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211208193245.172141-2-jolsa@kernel.org
Coverity issued the following warning:
6685 cands = bpf_core_add_cands(cands, main_btf, 1);
6686 if (IS_ERR(cands))
>>> CID 1510300: (RETURN_LOCAL)
>>> Returning pointer "cands" which points to local variable "local_cand".
6687 return cands;
It's a false positive.
Add ERR_CAST() to silence it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The helper compares two strings: one string is a null-terminated
read-only string, and another string has const max storage size
but doesn't need to be null-terminated. It can be used to compare
file name in tracing or LSM program.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211210141652.877186-2-houtao1@huawei.com
Merge misc fixes from Andrew Morton:
"21 patches.
Subsystems affected by this patch series: MAINTAINERS, mailmap, and mm
(mlock, pagecache, damon, slub, memcg, hugetlb, and pagecache)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
mm: bdi: initialize bdi_min_ratio when bdi is unregistered
hugetlbfs: fix issue of preallocation of gigantic pages can't work
mm/memcg: relocate mod_objcg_mlstate(), get_obj_stock() and put_obj_stock()
mm/slub: fix endianness bug for alloc/free_traces attributes
selftests/damon: split test cases
selftests/damon: test debugfs file reads/writes with huge count
selftests/damon: test wrong DAMOS condition ranges input
selftests/damon: test DAMON enabling with empty target_ids case
selftests/damon: skip test if DAMON is running
mm/damon/vaddr-test: remove unnecessary variables
mm/damon/vaddr-test: split a test function having >1024 bytes frame size
mm/damon/vaddr: remove an unnecessary warning message
mm/damon/core: remove unnecessary error messages
mm/damon/dbgfs: remove an unnecessary error message
mm/damon/core: use better timer mechanisms selection threshold
mm/damon/core: fix fake load reports due to uninterruptible sleeps
timers: implement usleep_idle_range()
filemap: remove PageHWPoison check from next_uptodate_page()
mailmap: update email address for Guo Ren
MAINTAINERS: update kdump maintainers
...
Make use of memset_startat helper to simplify the code, there should be
no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20211210012245.207489-1-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
synth_events is returning -EINVAL if the dyn_event create command does
not contain ' \t'. This prevents other systems from getting called back.
synth_events needs to return -ECANCELED in these cases when the command
is not targeting the synth_event system.
Link: https://lore.kernel.org/linux-trace-devel/20210930223821.11025-1-beaub@linux.microsoft.com
Fixes: c9e759b1e8 ("tracing: Rework synthetic event command parsing")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
cpu_util_cfs() was created by commit d4edd662ac ("sched/cpufreq: Use
the DEADLINE utilization signal") to enable the access to CPU
utilization from the Schedutil CPUfreq governor.
Commit a07630b8b2 ("sched/cpufreq/schedutil: Use util_est for OPP
selection") added util_est support later.
The only thing cpu_util() is doing on top of what cpu_util_cfs() already
does is to clamp the return value to the [0..capacity_orig] capacity
range of the CPU. Integrating this into cpu_util_cfs() is not harming
the existing users (Schedutil and CPUfreq cooling (latter via
sched_cpu_util() wrapper)).
For straightforwardness, prefer to keep using `int cpu` as the function
parameter over using `struct rq *rq` which might avoid some calls to
cpu_rq(cpu) -> per_cpu(runqueues, cpu) -> RELOC_HIDE().
Update cfs_util()'s documentation and reuse it for cpu_util_cfs().
Remove cpu_util().
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20211118164240.623551-1-dietmar.eggemann@arm.com
Patch series "mm/damon: Fix fake /proc/loadavg reports", v3.
This patchset fixes DAMON's fake load report issue. The first patch
makes yet another variant of usleep_range() for this fix, and the second
patch fixes the issue of DAMON by making it using the newly introduced
function.
This patch (of 2):
Some kernel threads such as DAMON could need to repeatedly sleep in
micro seconds level. Because usleep_range() sleeps in uninterruptible
state, however, such threads would make /proc/loadavg reports fake load.
To help such cases, this commit implements a variant of usleep_range()
called usleep_idle_range(). It is same to usleep_range() but sets the
state of the current task as TASK_IDLE while sleeping.
Link: https://lkml.kernel.org/r/20211126145015.15862-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211126145015.15862-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrii Nakryiko says:
====================
bpf-next 2021-12-10 v2
We've added 115 non-merge commits during the last 26 day(s) which contain
a total of 182 files changed, 5747 insertions(+), 2564 deletions(-).
The main changes are:
1) Various samples fixes, from Alexander Lobakin.
2) BPF CO-RE support in kernel and light skeleton, from Alexei Starovoitov.
3) A batch of new unified APIs for libbpf, logging improvements, version
querying, etc. Also a batch of old deprecations for old APIs and various
bug fixes, in preparation for libbpf 1.0, from Andrii Nakryiko.
4) BPF documentation reorganization and improvements, from Christoph Hellwig
and Dave Tucker.
5) Support for declarative initialization of BPF_MAP_TYPE_PROG_ARRAY in
libbpf, from Hengqi Chen.
6) Verifier log fixes, from Hou Tao.
7) Runtime-bounded loops support with bpf_loop() helper, from Joanne Koong.
8) Extend branch record capturing to all platforms that support it,
from Kajol Jain.
9) Light skeleton codegen improvements, from Kumar Kartikeya Dwivedi.
10) bpftool doc-generating script improvements, from Quentin Monnet.
11) Two libbpf v0.6 bug fixes, from Shuyi Cheng and Vincent Minet.
12) Deprecation warning fix for perf/bpf_counter, from Song Liu.
13) MAX_TAIL_CALL_CNT unification and MIPS build fix for libbpf,
from Tiezhu Yang.
14) BTF_KING_TYPE_TAG follow-up fixes, from Yonghong Song.
15) Selftests fixes and improvements, from Ilya Leoshkevich, Jean-Philippe
Brucker, Jiri Olsa, Maxim Mikityanskiy, Tirthendu Sarkar, Yucong Sun,
and others.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (115 commits)
libbpf: Add "bool skipped" to struct bpf_map
libbpf: Fix typo in btf__dedup@LIBBPF_0.0.2 definition
bpftool: Switch bpf_object__load_xattr() to bpf_object__load()
selftests/bpf: Remove the only use of deprecated bpf_object__load_xattr()
selftests/bpf: Add test for libbpf's custom log_buf behavior
selftests/bpf: Replace all uses of bpf_load_btf() with bpf_btf_load()
libbpf: Deprecate bpf_object__load_xattr()
libbpf: Add per-program log buffer setter and getter
libbpf: Preserve kernel error code and remove kprobe prog type guessing
libbpf: Improve logging around BPF program loading
libbpf: Allow passing user log setting through bpf_object_open_opts
libbpf: Allow passing preallocated log_buf when loading BTF into kernel
libbpf: Add OPTS-based bpf_btf_load() API
libbpf: Fix bpf_prog_load() log_buf logic for log_level 0
samples/bpf: Remove unneeded variable
bpf: Remove redundant assignment to pointer t
selftests/bpf: Fix a compilation warning
perf/bpf_counter: Use bpf_map_create instead of bpf_create_map
samples: bpf: Fix 'unknown warning group' build warning on Clang
samples: bpf: Fix xdp_sample_user.o linking with Clang
...
====================
Link: https://lore.kernel.org/r/20211210234746.2100561-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Have tracefs honor the gid mount option
- Have new files in tracefs inherit the parent ownership
- Have direct_ops unregister when it has no more functions
- Properly clean up the ops when unregistering multi direct ops
- Add a sample module to test the multiple direct ops
- Fix memory leak in error path of __create_synth_event()
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYbOgPBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgOtAP0YD+cRLxnRKA376oQVB8zmuZ3mZ/4x
6M1hqruSDlno3AEA19PyHpxl7flFwnBb6Gnfo9VeefcMS5ENDH9p1aHX4wU=
=Tr6t
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Tracing, ftrace and tracefs fixes:
- Have tracefs honor the gid mount option
- Have new files in tracefs inherit the parent ownership
- Have direct_ops unregister when it has no more functions
- Properly clean up the ops when unregistering multi direct ops
- Add a sample module to test the multiple direct ops
- Fix memory leak in error path of __create_synth_event()"
* tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix possible memory leak in __create_synth_event() error path
ftrace/samples: Add module to test multi direct modify interface
ftrace: Add cleanup to unregister_ftrace_direct_multi
ftrace: Use direct_ops hash in unregister_ftrace_direct
tracefs: Set all files to the same group ownership as the mount option
tracefs: Have new files inherit the ownership of their parent
Fix three bugs in aio poll, and one issue with POLLFREE more broadly:
- aio poll didn't handle POLLFREE, causing a use-after-free.
- aio poll could block while the file is ready.
- aio poll called eventfd_signal() when it isn't allowed.
- POLLFREE didn't handle multiple exclusive waiters correctly.
This has been tested with the libaio test suite, as well as with test
programs I wrote that reproduce the first two bugs. I am sending this
pull request myself as no one seems to be maintaining this code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYbObthQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+3mAQC9W8ApzBleEPI6FXzIIo5AiQT/2jGl
7FbO1MtkdUBU4QEAzf+VWl4Z4BJTgxl44avRdVDpXGAMnbWkd7heY+e3HwA=
=mp+r
-----END PGP SIGNATURE-----
Merge tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull aio poll fixes from Eric Biggers:
"Fix three bugs in aio poll, and one issue with POLLFREE more broadly:
- aio poll didn't handle POLLFREE, causing a use-after-free.
- aio poll could block while the file is ready.
- aio poll called eventfd_signal() when it isn't allowed.
- POLLFREE didn't handle multiple exclusive waiters correctly.
This has been tested with the libaio test suite, as well as with test
programs I wrote that reproduce the first two bugs. I am sending this
pull request myself as no one seems to be maintaining this code"
* tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
aio: Fix incorrect usage of eventfd_signal_allowed()
aio: fix use-after-free due to missing POLLFREE handling
aio: keep poll requests on waitqueue until completed
signalfd: use wake_up_pollfree()
binder: use wake_up_pollfree()
wait: add wake_up_pollfree()
The discussion about removing the side effect of irq_set_affinity_hint() of
actually applying the cpumask (if not NULL) as affinity to the interrupt,
unearthed a few unpleasantries:
1) The modular perf drivers rely on the current behaviour for the very
wrong reasons.
2) While none of the other drivers prevents user space from changing
the affinity, a cursorily inspection shows that there are at least
expectations in some drivers.
#1 needs to be cleaned up anyway, so that's not a problem
#2 might result in subtle regressions especially when irqbalanced (which
nowadays ignores the affinity hint) is disabled.
Provide new interfaces:
irq_update_affinity_hint() - Only sets the affinity hint pointer
irq_set_affinity_and_hint() - Set the pointer and apply the affinity to
the interrupt
Make irq_set_affinity_hint() a wrapper around irq_apply_affinity_hint() and
document it to be phased out.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210501021832.743094-1-jesse.brandeburg@intel.com
Link: https://lore.kernel.org/r/20210903152430.244937-2-nitesh@redhat.com
Commit 354e8f1970 ("bpf: Support <8-byte scalar spill and refill")
introduced support in the verifier to track <8B spill/fills of scalars.
The backtracking logic for the precision bit was however skipping
spill/fills of less than 8B. That could cause state pruning to consider
two states equivalent when they shouldn't be.
As an example, consider the following bytecode snippet:
0: r7 = r1
1: call bpf_get_prandom_u32
2: r6 = 2
3: if r0 == 0 goto pc+1
4: r6 = 3
...
8: [state pruning point]
...
/* u32 spill/fill */
10: *(u32 *)(r10 - 8) = r6
11: r8 = *(u32 *)(r10 - 8)
12: r0 = 0
13: if r8 == 3 goto pc+1
14: r0 = 1
15: exit
The verifier first walks the path with R6=3. Given the support for <8B
spill/fills, at instruction 13, it knows the condition is true and skips
instruction 14. At that point, the backtracking logic kicks in but stops
at the fill instruction since it only propagates the precision bit for
8B spill/fill. When the verifier then walks the path with R6=2, it will
consider it safe at instruction 8 because R6 is not marked as needing
precision. Instruction 14 is thus never walked and is then incorrectly
removed as 'dead code'.
It's also possible to lead the verifier to accept e.g. an out-of-bound
memory access instead of causing an incorrect dead code elimination.
This regression was found via Cilium's bpf-next CI where it was causing
a conntrack map update to be silently skipped because the code had been
removed by the verifier.
This commit fixes it by enabling support for <8B spill/fills in the
bactracking logic. In case of a <8B spill/fill, the full 8B stack slot
will be marked as needing precision. Then, in __mark_chain_precision,
any tracked register spilled in a marked slot will itself be marked as
needing precision, regardless of the spill size. This logic makes two
assumptions: (1) only 8B-aligned spill/fill are tracked and (2) spilled
registers are only tracked if the spill and fill sizes are equal. Commit
ef979017b8 ("bpf: selftest: Add verifier tests for <8-byte scalar
spill and refill") covers the first assumption and the next commit in
this patchset covers the second.
Fixes: 354e8f1970 ("bpf: Support <8-byte scalar spill and refill")
Signed-off-by: Paul Chaignon <paul@isovalent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some architectures do not define clear_bit_unlock_is_negative_byte().
Only test it when it is actually defined (similar to other usage, such
as in lib/test_kasan.c).
Link: https://lkml.kernel.org/r/202112050757.x67rHnFU-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Nested contexts, such as nested interrupts or scheduler code, share the
same kcsan_ctx. When such a nested context reads an inconsistent
reorder_access due to an interrupt during set_reorder_access(), we can
observe the following warning:
| ------------[ cut here ]------------
| Cannot find frame for torture_random kernel/torture.c:456 in stack trace
| WARNING: CPU: 13 PID: 147 at kernel/kcsan/report.c:343 replace_stack_entry kernel/kcsan/report.c:343
| ...
| Call Trace:
| <TASK>
| sanitize_stack_entries kernel/kcsan/report.c:351 [inline]
| print_report kernel/kcsan/report.c:409
| kcsan_report_known_origin kernel/kcsan/report.c:693
| kcsan_setup_watchpoint kernel/kcsan/core.c:658
| rcutorture_one_extend kernel/rcu/rcutorture.c:1475
| rcutorture_loop_extend kernel/rcu/rcutorture.c:1558 [inline]
| ...
| </TASK>
| ---[ end trace ee5299cb933115f5 ]---
| ==================================================================
| BUG: KCSAN: data-race in _raw_spin_lock_irqsave / rcutorture_one_extend
|
| write (reordered) to 0xffffffff8c93b300 of 8 bytes by task 154 on cpu 12:
| queued_spin_lock include/asm-generic/qspinlock.h:80 [inline]
| do_raw_spin_lock include/linux/spinlock.h:185 [inline]
| __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:111 [inline]
| _raw_spin_lock_irqsave kernel/locking/spinlock.c:162
| try_to_wake_up kernel/sched/core.c:4003
| sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1097
| asm_sysvec_apic_timer_interrupt arch/x86/include/asm/idtentry.h:638
| set_reorder_access kernel/kcsan/core.c:416 [inline] <-- inconsistent reorder_access
| kcsan_setup_watchpoint kernel/kcsan/core.c:693
| rcutorture_one_extend kernel/rcu/rcutorture.c:1475
| rcutorture_loop_extend kernel/rcu/rcutorture.c:1558 [inline]
| rcu_torture_one_read kernel/rcu/rcutorture.c:1600
| rcu_torture_reader kernel/rcu/rcutorture.c:1692
| kthread kernel/kthread.c:327
| ret_from_fork arch/x86/entry/entry_64.S:295
|
| read to 0xffffffff8c93b300 of 8 bytes by task 147 on cpu 13:
| rcutorture_one_extend kernel/rcu/rcutorture.c:1475
| rcutorture_loop_extend kernel/rcu/rcutorture.c:1558 [inline]
| ...
The warning is telling us that there was a data race which KCSAN wants
to report, but the function where the original access (that is now
reordered) happened cannot be found in the stack trace, which prevents
KCSAN from generating the right stack trace. The stack trace of "write
(reordered)" now only shows where the access was reordered to, but
should instead show the stack trace of the original write, with a final
line saying "reordered to".
At the point where set_reorder_access() is interrupted, it just set
reorder_access->ptr and size, at which point size is non-zero. This is
sufficient (if ctx->disable_scoped is zero) for further accesses from
nested contexts to perform checking of this reorder_access.
That then happened in _raw_spin_lock_irqsave(), which is called by
scheduler code. However, since reorder_access->ip is still stale (ptr
and size belong to a different ip not yet set) this finally leads to
replace_stack_entry() not finding the frame in reorder_access->ip and
generating the above warning.
Fix it by ensuring that a nested context cannot access reorder_access
while we update it in set_reorder_access(): set ctx->disable_scoped for
the duration that reorder_access is updated, which effectively locks
reorder_access and prevents concurrent use by nested contexts. Note,
set_reorder_access() can do the update only if disabled_scoped is zero
on entry, and must therefore set disable_scoped back to non-zero after
the initial check in set_reorder_access().
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The barrier tests in selftest and the kcsan_test module only need the
spinlock and mutex to test correct barrier instrumentation. Therefore,
these were initially placed on the stack.
However, lockdep asserts that locks are in static storage, and will
generate this warning:
| INFO: trying to register non-static key.
| The code is fine but needs lockdep annotation, or maybe
| you didn't initialize this object before use?
| turning off the locking correctness validator.
| CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.16.0-rc1+ #3208
| Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
| Call Trace:
| <TASK>
| dump_stack_lvl+0x88/0xd8
| dump_stack+0x15/0x1b
| register_lock_class+0x6b3/0x840
| ...
| test_barrier+0x490/0x14c7
| kcsan_selftest+0x47/0xa0
| ...
To fix, move the test locks into static storage.
Fixing the above also revealed that lock operations are strengthened on
first use with lockdep enabled, due to lockdep calling out into
non-instrumented files (recall that kernel/locking/lockdep.c is not
instrumented with KCSAN).
Only kcsan_test checks for over-instrumentation of *_lock() operations,
where we can simply "warm up" the test locks to avoid the test case
failing with lockdep.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There's no fundamental reason to disable KCSAN for scheduler code,
except for excessive noise and performance concerns (instrumenting
scheduler code is usually a good way to stress test KCSAN itself).
However, several core sched functions imply memory barriers that are
invisible to KCSAN without instrumentation, but are required to avoid
false positives. Therefore, unconditionally enable instrumentation of
memory barriers in scheduler code. Also update the comment to reflect
this and be a bit more brief.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Memory barrier instrumentation is crucial to avoid false positives. To
avoid surprises, run a simple test case in the boot-time selftest to
ensure memory barriers are still instrumented correctly.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Adds test cases to check that memory barriers are instrumented
correctly, and detection of missing memory barriers is working as
intended if CONFIG_KCSAN_STRICT=y.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Due to reordering accesses with weak memory modeling, any access can now
appear as "(reordered)".
Match any permutation of accesses if CONFIG_KCSAN_WEAK_MEMORY=y, so that
we effectively match an access if it is denoted "(reordered)" or not.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Also show the location the access was reordered to. An example report:
| ==================================================================
| BUG: KCSAN: data-race in test_kernel_wrong_memorder / test_kernel_wrong_memorder
|
| read-write to 0xffffffffc01e61a8 of 8 bytes by task 2311 on cpu 5:
| test_kernel_wrong_memorder+0x57/0x90
| access_thread+0x99/0xe0
| kthread+0x2ba/0x2f0
| ret_from_fork+0x22/0x30
|
| read-write (reordered) to 0xffffffffc01e61a8 of 8 bytes by task 2310 on cpu 7:
| test_kernel_wrong_memorder+0x57/0x90
| access_thread+0x99/0xe0
| kthread+0x2ba/0x2f0
| ret_from_fork+0x22/0x30
| |
| +-> reordered to: test_kernel_wrong_memorder+0x80/0x90
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 7 PID: 2310 Comm: access_thread Not tainted 5.14.0-rc1+ #18
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
| ==================================================================
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The scoping of an access simply denotes the scope in which it may be
reordered. However, in reports, it'll be less confusing to say the
access is "reordered". This is more accurate when the race occurred.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Add the core memory barrier instrumentation functions. These invalidate
the current in-flight reordered access based on the rules for the
respective barrier types and in-flight access type.
To obtain barrier instrumentation that can be disabled via __no_kcsan
with appropriate compiler-support (and not just with objtool help),
barrier instrumentation repurposes __atomic_signal_fence(), instead of
inserting explicit calls. Crucially, __atomic_signal_fence() normally
does not map to any real instructions, but is still intercepted by
fsanitize=thread. As a result, like any other instrumentation done by
the compiler, barrier instrumentation can be disabled with __no_kcsan.
Unfortunately Clang and GCC currently differ in their __no_kcsan aka
__no_sanitize_thread behaviour with respect to builtin atomics (and
__tsan_func_{entry,exit}) instrumentation. This is already reflected in
Kconfig.kcsan's dependencies for KCSAN_WEAK_MEMORY. A later change will
introduce support for newer versions of Clang that can implement
__no_kcsan to also remove the additional instrumentation introduced by
KCSAN_WEAK_MEMORY.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Add support for modeling a subset of weak memory, which will enable
detection of a subset of data races due to missing memory barriers.
KCSAN's approach to detecting missing memory barriers is based on
modeling access reordering, and enabled if `CONFIG_KCSAN_WEAK_MEMORY=y`,
which depends on `CONFIG_KCSAN_STRICT=y`. The feature can be enabled or
disabled at boot and runtime via the `kcsan.weak_memory` boot parameter.
Each memory access for which a watchpoint is set up, is also selected
for simulated reordering within the scope of its function (at most 1
in-flight access).
We are limited to modeling the effects of "buffering" (delaying the
access), since the runtime cannot "prefetch" accesses (therefore no
acquire modeling). Once an access has been selected for reordering, it
is checked along every other access until the end of the function scope.
If an appropriate memory barrier is encountered, the access will no
longer be considered for reordering.
When the result of a memory operation should be ordered by a barrier,
KCSAN can then detect data races where the conflict only occurs as a
result of a missing barrier due to reordering accesses.
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Avoid checking scoped accesses from nested contexts (such as nested
interrupts or in scheduler code) which share the same kcsan_ctx.
This is to avoid detecting false positive races of accesses in the same
thread with currently scoped accesses: consider setting up a watchpoint
for a non-scoped (normal) access that also "conflicts" with a current
scoped access. In a nested interrupt (or in the scheduler), which shares
the same kcsan_ctx, we cannot check scoped accesses set up in the parent
context -- simply ignore them in this case.
With the introduction of kcsan_ctx::disable_scoped, we can also clean up
kcsan_check_scoped_accesses()'s recursion guard, and do not need to
modify the list's prev pointer.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
They are implicitly zero-initialized, remove explicit initialization.
It keeps the upcoming additions to kcsan_ctx consistent with the rest.
No functional change intended.
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Factor out the switch statement reading instrumented memory into a
helper read_instrumented_memory().
No functional change.
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
nr_running is never modified remotely after the schedule callback in
wakeup path is removed.
Rather nr_running is often accessed with other fields in the pool
together, so the cacheline_aligned for nr_running isn't needed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In unbind_workers(), there are two pool->lock held sections separated
by the code of zapping nr_running. wake_up_worker() needs to be in
pool->lock held section and after zapping nr_running. And zapping
nr_running had to be after schedule() when the local wake up
functionality was in use. Now, the call to schedule() has been removed
along with the local wake up functionality, so the code can be merged
into the same pool->lock held section.
The diffstat shows that it is other code moved down because the diff
tools can not know the meaning of merging lock sections by swapping
two code blocks.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The commit 6d25be5782 ("sched/core, workqueues: Distangle worker
accounting from rq lock") changed the schedule callbacks for workqueue
and moved the schedule callback from the wakeup code to at end of
schedule() in the worker's process context.
It means that the callback wq_worker_running() is guaranteed that
it sees the %WORKER_UNBOUND flag after scheduled since unbind_workers()
is running on the same CPU that all the pool's workers bound to.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Long time before, workers are not ALL bound after CPU_ONLINE, they can
still be running in other CPUs before self rebinding.
But the commit a9ab775bca ("workqueue: directly restore CPU affinity
of workers from CPU_ONLINE") makes rebind_workers() bind them all.
So all workers are on the CPU before the CPU is down.
And the comment in unbind_workers() refers to the workers "which are
still executing works from before the last CPU down" is outdated.
Just removed it.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The commit 6d25be5782 ("sched/core, workqueues: Distangle worker
accounting from rq lock") changed the schedule callbacks for workqueue
and removed the local-wake-up functionality.
Now the wakingup of workers is done by normal fashion and workers not
yet migrated to the specific CPU in concurrency managed pool can also
be woken up by workers that already bound to the specific cpu now.
So this advanced kicking of the idle workers to migrate them to the
associated CPU is unneeded now.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The semantics of the rlimit max values differs from ucounts itself. When
creating a new userns, we store the current rlimit of the process in
ucount_max. Thus, the value of the limit in the parent userns is saved
in the created one.
The problem is that now we are taking the maximum value for counter from
the same userns. So for init_user_ns it will always be RLIM_INFINITY.
To fix the problem we need to check the counter value with the max value
stored in userns.
Reproducer:
su - test -c "ulimit -u 3; sleep 5 & sleep 6 & unshare -U --map-root-user sh -c 'sleep 7 & sleep 8 & date; wait'"
Before:
[1] 175
[2] 176
Fri Nov 26 13:48:20 UTC 2021
[1]- Done sleep 5
[2]+ Done sleep 6
After:
[1] 167
[2] 168
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: Interrupted system call
[1]- Done sleep 5
[2]+ Done sleep 6
Fixes: c54b245d01 ("Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
Reported-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/024ec805f6e16896f0b23e094773790d171d2c1c.1638218242.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The rcu_spawn_one_nocb_kthread() function is called only from
rcu_spawn_cpu_nocb_kthread(). Therefore, inline the former into
the latter, saving a few lines of code.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Allow the rcu_nocbs kernel parameter to be specified just by itself,
without specifying any CPUs. This allows systems administrators to use
"rcu_nocbs" to specify that none of the CPUs are to be offloaded at boot
time, but than any of them may be offloaded at runtime via cpusets.
In contrast, if the "rcu_nocbs" or "nohz_full" kernel parameters are not
specified at all, then not only are none of the CPUs offloaded at boot,
none of them can be offloaded at runtime, either.
While in the area, modernize the description of the "rcuo" kthreads'
naming scheme.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In order to be able to (de-)offload any CPU using cpusets in the future,
create the NOCB data structures for all possible CPUs. For now this is
done only as long as the "rcu_nocbs=" or "nohz_full=" kernel parameters
are passed to avoid the unnecessary overhead for most users.
Note that the rcuog and rcuoc kthreads are not created until at least
one of the corresponding CPUs comes online. This approach avoids the
creation of excess kthreads when firmware lies about the number of CPUs
present on the system.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently cpumask_available() is used to prevent from unwanted NOCB
initialization. However if neither "rcu_nocbs=" nor "nohz_full="
parameters are passed to a kernel built with CONFIG_CPUMASK_OFFSTACK=n,
the initialization path is still taken, running through all sorts of
needless operations and iterations on an empty cpumask.
Fix this by relying on a real initialization state instead. This also
optimizes kthread creation, preventing needless iteration over all online
CPUs when the kernel is booted without any offloaded CPUs.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In order to be able to toggle the offloaded state from cpusets, a nocb
kthread will need to be created for all possible CPUs whenever either
of the "rcu_nocbs=" or "nohz_full=" parameters are specified.
Therefore, the nocb_cb_wait() kthread must be prepared to start running
on a de-offloaded rdp. To accomplish this, simply move the sleeping
condition to the beginning of the nocb_cb_wait() function, which prevents
this kthread from attempting to invoke callbacks before the corresponding
CPU is offloaded.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The nocb_gp_wait() function iterates over all CPUs in its group,
including even those CPUs that have been de-offloaded. This is of
course suboptimal, especially if none of the CPUs within the group are
currently offloaded. This will become even more of a problem once a
nocb kthread is created for all possible CPUs.
Therefore use a standard double linked list to link all the offloaded
rcu_data structures and safely add or delete these structure as we
offload or de-offload them, respectively.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Current release - regressions:
- bpf, sockmap: re-evaluate proto ops when psock is removed from sockmap
Current release - new code bugs:
- bpf: fix bpf_check_mod_kfunc_call for built-in modules
- ice: fixes for TC classifier offloads
- vrf: don't run conntrack on vrf with !dflt qdisc
Previous releases - regressions:
- bpf: fix the off-by-two error in range markings
- seg6: fix the iif in the IPv6 socket control block
- devlink: fix netns refcount leak in devlink_nl_cmd_reload()
- dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
- dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
Previous releases - always broken:
- ethtool: do not perform operations on net devices being unregistered
- udp: use datalen to cap max gso segments
- ice: fix races in stats collection
- fec: only clear interrupt of handling queue in fec_enet_rx_queue()
- m_can: pci: fix incorrect reference clock rate
- m_can: disable and ignore ELO interrupt
- mvpp2: fix XDP rx queues registering
Misc:
- treewide: add missing includes masked by cgroup -> bpf.h dependency
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGyN1AACgkQMUZtbf5S
IrtgMA/8D0qk3c75ts0hCzGXwdNdEBs+e7u1bJVPqdyU8x/ZLAp2c0EKB/7IWuxA
CtsnbanPcmibqvQJDI1hZEBdafi43BmF5VuFSIxYC4EM/1vLoRprurXlIwL2YWki
aWi//tyOIGBl6/ClzJ9Vm51HTJQwDmdv8GRnKAbsC1eOTM3pmmcg+6TLbDhycFEQ
F9kkDCvyB9kWIH645QyJRH+Y5qQOvneCyQCPkkyjTgEADzV5i7YgtRol6J3QIbw3
umPHSckCBTjMacYcCLsbhQaF2gTMgPV1basNLPMjCquJVrItE0ZaeX3MiD6nBFae
yY5+Wt5KAZDzjERhneX8AINHoRPA/tNIahC1+ytTmsTA8Hj230FHE5hH1ajWiJ9+
GSTBCBqjtZXce3r2Efxfzy0Kb9JwL3vDi7LS2eKQLv0zBLfYp2ry9Sp9qe4NhPkb
OYrxws9kl5GOPvrFB5BWI9XBINciC9yC3PjIsz1noi0vD8/Hi9dPwXeAYh36fXU3
rwRg9uAt6tvFCpwbuQ9T2rsMST0miur2cDYd8qkJtuJ7zFvc+suMXwBZyI29nF2D
uyymIC2XStHJfAjUkFsGVUSXF5FhML9OQsqmisdQ8KdH26jMnDeMjIWJM7UWK+zY
E/fqWT8UyS3mXWqaggid4ZbotipCwA0gxiDHuqqUGTM+dbKrzmk=
=F6rS
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and netfilter.
Current release - regressions:
- bpf, sockmap: re-evaluate proto ops when psock is removed from
sockmap
Current release - new code bugs:
- bpf: fix bpf_check_mod_kfunc_call for built-in modules
- ice: fixes for TC classifier offloads
- vrf: don't run conntrack on vrf with !dflt qdisc
Previous releases - regressions:
- bpf: fix the off-by-two error in range markings
- seg6: fix the iif in the IPv6 socket control block
- devlink: fix netns refcount leak in devlink_nl_cmd_reload()
- dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
- dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
Previous releases - always broken:
- ethtool: do not perform operations on net devices being
unregistered
- udp: use datalen to cap max gso segments
- ice: fix races in stats collection
- fec: only clear interrupt of handling queue in fec_enet_rx_queue()
- m_can: pci: fix incorrect reference clock rate
- m_can: disable and ignore ELO interrupt
- mvpp2: fix XDP rx queues registering
Misc:
- treewide: add missing includes masked by cgroup -> bpf.h
dependency"
* tag 'net-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (82 commits)
net: dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
net: wwan: iosm: fixes unable to send AT command during mbim tx
net: wwan: iosm: fixes net interface nonfunctional after fw flash
net: wwan: iosm: fixes unnecessary doorbell send
net: dsa: felix: Fix memory leak in felix_setup_mmio_filtering
MAINTAINERS: s390/net: remove myself as maintainer
net/sched: fq_pie: prevent dismantle issue
net: mana: Fix memory leak in mana_hwc_create_wq
seg6: fix the iif in the IPv6 socket control block
nfp: Fix memory leak in nfp_cpp_area_cache_add()
nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done
nfc: fix segfault in nfc_genl_dump_devices_done
udp: using datalen to cap max gso segments
net: dsa: mv88e6xxx: error handling for serdes_power functions
can: kvaser_usb: get CAN clock frequency from device
can: kvaser_pciefd: kvaser_pciefd_rx_error_frame(): increase correct stats->{rx,tx}_errors counter
net: mvpp2: fix XDP rx queues registering
vmxnet3: fix minimum vectors alloc issue
net, neigh: clear whole pneigh_entry at alloc time
net: dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
...
By default, when lock contention is encountered, the RCU Tasks flavors
of RCU switch to using per-CPU queueing. However, if the callback
flood ends, per-CPU queueing continues to be used, which introduces
significant additional overhead, especially for callback invocation,
which fans out a series of workqueue handlers.
This commit therefore switches back to single-queue operation if at the
beginning of a grace period there are very few callbacks. The definition
of "very few" is set by the rcupdate.rcu_task_collapse_lim module
parameter, which defaults to 10. This switch happens in two phases,
with the first phase causing future callbacks to be enqueued on CPU 0's
queue, but with all queues continuing to be checked for grace periods
and callback invocation. The second phase checks to see if an RCU grace
period has elapsed and if all remaining RCU-Tasks callbacks are queued
on CPU 0. If so, only CPU 0 is checked for future grace periods and
callback operation.
Of course, the return of contention anywhere during this process will
result in returning to per-CPU callback queueing.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Decreasing the number of callback queues is a bit tricky because it
is necessary to handle callbacks that were queued before the number of
queues decreased, but which were not ready to invoke until afterwards.
This commit takes a first step in this direction by maintaining a separate
->percpu_dequeue_lim to control callback dequeueing, in addition to the
existing ->percpu_enqueue_lim which now controls only enqueueing.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcupdate.rcu_task_enqueue_lim module parameter allows system
administrators to tune the number of callback queues used by the RCU
Tasks flavors. However if callback storms are infrequent, it would
be better to operate with a single queue on a given system unless and
until that system actually needed more queues. Systems not needing
more queues can then avoid the overhead of checking the extra queues
and especially avoid the overhead of fanning workqueue handlers out to
all CPUs to invoke callbacks.
This commit therefore switches to using all the CPUs' callback queues if
call_rcu_tasks_generic() encounters too much lock contention. The amount
of lock contention to tolerate defaults to 100 contended lock acquisitions
per jiffy, and can be adjusted using the new rcupdate.rcu_task_contend_lim
module parameter.
Such switching is undertaken only if the rcupdate.rcu_task_enqueue_lim
module parameter is negative, which is its default value (-1).
This allows savvy systems administrators to set the number of queues
to some known good value and to not have to worry about the kernel doing
any second guessing.
[ paulmck: Apply feedback from Guillaume Tucker and kernelci. ]
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the caller of of call_rcu_tasks(), call_rcu_tasks_rude(),
or call_rcu_tasks_trace() holds a raw spinlock, and then if
call_rcu_tasks_generic() determines that the grace-period kthread must
be awakened, then the wakeup might acquire a normal spinlock while a
raw spinlock is held. This results in lockdep splats when the
kernel is built with CONFIG_PROVE_RAW_LOCK_NESTING=y.
This commit therefore defers the wakeup using irq_work_queue().
It would be nice to directly invoke wakeup when a raw spinlock is not
held, but there is currently no way to check for this in all kernels.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit converts the unconditional raw_spin_lock_rcu_node() lock
acquisition in call_rcu_tasks_generic() to a trylock followed by an
unconditional acquisition if the trylock fails. If the trylock fails,
the failure is counted, but the count is reset to zero on each new jiffy.
This statistic will be used to determine when to move from a single
callback queue to per-CPU callback queues.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a rcupdate.rcu_task_enqueue_lim module parameter that
sets the initial number of callback queues to use for the RCU Tasks
family of RCU implementations. This parameter allows testing of various
fanout values.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_barrier_tasks(), rcu_barrier_tasks_rude(),
and rcu_barrier_tasks_trace() simply invoke the corresponding
synchronize_rcu_tasks*() function. This works because there is only
one callback queue.
However, there will soon be multiple callback queues. This commit
therefore scans the queues currently in use, entraining a callback on
each non-empty queue. Sequence numbers and reference counts are used
to synchronize this process in a manner similar to the approach taken
by rcu_barrier().
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If there is a flood of callbacks, it is necessary to put multiple
CPUs to work invoking those callbacks. This commit therefore uses a
workqueue-flooding approach to parallelize RCU Tasks callback execution.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a rcu_tasks_invoke_cbs() function that invokes all
ready callbacks on all of the per-CPU lists that are currently in use.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a rcu_tasks_need_gpcb() function that returns an
indication of whether another grace period is required, and if no grace
period is required, whether there are callbacks that need to be invoked.
The function scans all per-CPU lists currently in use.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Several ->poll() implementations are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case. This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution. This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
However, that has a bug: wake_up_poll() calls __wake_up() with
nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters,
and the wakeup function for the first one returns a positive value, only
that one will be called. That's *not* what's needed for POLLFREE;
POLLFREE is special in that it really needs to wake up everyone.
Considering the three non-blocking poll systems:
- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.
- aio poll is unaffected, since it doesn't support exclusive waits.
However, that's fragile, as someone could add this feature later.
- epoll doesn't appear to be broken by this, since its wakeup function
returns 0 when it sees POLLFREE. But this is fragile.
Although there is a workaround (see epoll), it's better to define a
function which always sends POLLFREE to all waiters. Add such a
function. Also make it verify that the queue really becomes empty after
all waiters have been woken up.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
This commit adds a ->percpu_enqueue_lim field to the rcu_tasks structure.
This field contains two to the power of the ->percpu_enqueue_shift
field, easing construction of iterators over the per-CPU queues that
might contain RCU Tasks callbacks. Such iterators will be introduced
in later commits.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit renames the rcu_tasks_percpu structure's ->cbs_pcpu_lock
to ->lock and then uses spin_lock_rcu_node() and friends to acquire and
release this lock, preparing for upcoming commits that will spread the
grace-period process across multiple CPUs and kthreads.
[ paulmck: Apply feedback from kernel test robot. ]
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There's error paths in __create_synth_event() after the argv is allocated
that fail to free it. Add a jump to free it when necessary.
Link: https://lkml.kernel.org/r/20211209024317.11783-1-linmq006@gmail.com
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
[ Fixed up the patch and change log ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Get rid of yet another irqdomain callback and let the core code return the
already available information of how many descriptors could be allocated.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # PCI
Link: https://lore.kernel.org/r/20211206210225.046615302@linutronix.de
The unnamed struct sucks and is in the way of further cleanups. Stick the
PCI related MSI data into a real data structure and cleanup all users.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20211206210224.374863119@linutronix.de
Remove the kobject.h include from msi.h as it's not required and add a
sysfs.h include to the core code instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20211206210224.103502021@linutronix.de
No point in building unused code when CONFIG_SYSFS=n.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20211206210223.985907940@linutronix.de
The pointer t is being initialized with a value that is never read. The
pointer is re-assigned a value a littler later on, hence the initialization
is redundant and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211207224718.59593-1-colin.i.king@gmail.com
Daniel Borkmann says:
====================
bpf 2021-12-08
We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 29 files changed, 659 insertions(+), 80 deletions(-).
The main changes are:
1) Fix an off-by-two error in packet range markings and also add a batch of
new tests for coverage of these corner cases, from Maxim Mikityanskiy.
2) Fix a compilation issue on MIPS JIT for R10000 CPUs, from Johan Almbladh.
3) Fix two functional regressions and a build warning related to BTF kfunc
for modules, from Kumar Kartikeya Dwivedi.
4) Fix outdated code and docs regarding BPF's migrate_disable() use on non-
PREEMPT_RT kernels, from Sebastian Andrzej Siewior.
5) Add missing includes in order to be able to detangle cgroup vs bpf header
dependencies, from Jakub Kicinski.
6) Fix regression in BPF sockmap tests caused by missing detachment of progs
from sockets when they are removed from the map, from John Fastabend.
7) Fix a missing "no previous prototype" warning in x86 JIT caused by BPF
dispatcher, from Björn Töpel.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add selftests to cover packet access corner cases
bpf: Fix the off-by-two error in range markings
treewide: Add missing includes masked by cgroup -> bpf dependency
tools/resolve_btfids: Skip unresolved symbol warning for empty BTF sets
bpf: Fix bpf_check_mod_kfunc_call for built-in modules
bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL
mips, bpf: Fix reference to non-existing Kconfig symbol
bpf: Make sure bpf_disable_instrumentation() is safe vs preemption.
Documentation/locking/locktypes: Update migrate_disable() bits.
bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap
bpf, sockmap: Attach map progs to psock early for feature probes
bpf, x86: Fix "no previous prototype" warning
====================
Link: https://lore.kernel.org/r/20211208155125.11826-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
task_util and capacity are comparable unsigned long values. There is no
need for an intermidiate implicit signed cast.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211207095755.859972-1-vincent.donnefort@arm.com
Adding ops cleanup to unregister_ftrace_direct_multi,
so it can be reused in another register call.
Link: https://lkml.kernel.org/r/20211206182032.87248-3-jolsa@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Fixes: f64dd4627e ("ftrace: Add multi direct register/unregister interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now when we have *direct_multi interface the direct_functions
hash is no longer owned just by direct_ops. It's also used by
any other ftrace_ops passed to *direct_multi interface.
Thus to find out that we are unregistering the last function
from direct_ops, we need to check directly direct_ops's hash.
Link: https://lkml.kernel.org/r/20211206182032.87248-2-jolsa@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Fixes: f64dd4627e ("ftrace: Add multi direct register/unregister interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a helper to check if a potentially blocking operation should
dip into the atomic pools.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Theoretically, when the hardware signature in FACS changes, the OS
is supposed to gracefully decline to attempt to resume from S4:
"If the signature has changed, OSPM will not restore the system
context and can boot from scratch"
In practice, Windows doesn't do this and many laptop vendors do allow
the signature to change especially when docking/undocking, so it would
be a bad idea to simply comply with the specification by default in the
general case.
However, there are use cases where we do want the compliant behaviour
and we know it's safe. Specifically, when resuming virtual machines where
we know the hypervisor has changed sufficiently that resume will fail.
We really want to be able to *tell* the guest kernel not to try, so it
boots cleanly and doesn't just crash. This patch provides a way to opt
in to the spec-compliant behaviour on the command line.
A follow-up patch may do this automatically for certain "known good"
machines based on a DMI match, or perhaps just for all hypervisor
guests since there's no good reason a hypervisor would change the
hardware_signature that it exposes to guests *unless* it wants them
to obey the ACPI specification.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The tracing marker files are write-only streams with no meaningful
concept of file position. Using stream_open() to mark them as
stream-link indicates this and has the added advantage that a single
file descriptor can now be used from multiple threads without contention
thanks to clearing FMODE_ATOMIC_POS.
Note that this has the potential to break existing userspace by since
both lseek(2) and pwrite(2) will now return ESPIPE when previously lseek
would have updated the stored offset and pwrite would have appended to
the trace. A survey of libtracefs and several other projects found to
use trace_marker(_raw) [1][2][3] suggests that everyone limits
themselves to calling write(2) and close(2) on these file descriptors so
there is a good chance this will go unnoticed and the benefits of
reduced overhead and lock contention seem worth the risk.
[1] https://github.com/google/perfetto
[2] https://github.com/intel/media-driver/
[3] https://w1.fi/cgit/hostap/
Link: https://lkml.kernel.org/r/20211207142558.347029-1-john@metanate.com
Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With the addition of multiple callback-flood kthreads, the maximum number
of callbacks from any one of those kthreads is reported in the rcutorture
run summary. This commit changes this to report the sum of each kthread's
maximum number of callbacks in a given callback-flooding episode.
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The RCU tasks flavors of RCU now need concurrent callback flooding to
test their ability to switch between single-queue mode and per-CPU queue
mode, but their lack of heavy-duty forward-progress features rules out
the use of rcutorture's current callback-flooding code. This commit
therefore provides the ability to limit the intensity of the callback
floods using a new ->cbflood_max field in the rcu_operations structure.
When this field is zero, there is no limit, otherwise, each callback-flood
kthread allocates at most ->cbflood_max callbacks.
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>