When list_for_each_entry() completes the iteration over the whole list
without breaking the loop, the iterator value will be a bogus pointer
computed based on the head element.
While it is safe to use the pointer to determine if it was computed
based on the head element, either with list_entry_is_head() or
&pos->member == head, using the iterator variable after the loop should
be avoided.
In preparation to limit the scope of a list iterator to the list
traversal loop, use a dedicated pointer to point to the found element [1].
Link: https://lkml.kernel.org/r/20220427170734.819891-5-jakobkoschel@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].
This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.
Link: https://lkml.kernel.org/r/20220427170734.819891-4-jakobkoschel@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In preparation to limit the scope of the list iterator variable to the
traversal loop, use a dedicated pointer to point to the found element
[1].
Before, the code implicitly used the head when no element was found
when using &pos->list. Since the new variable is only set if an
element was found, the head needs to be used explicitly if the
variable is NULL.
Link: https://lkml.kernel.org/r/20220427170734.819891-2-jakobkoschel@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A fast/NMI safe accessor for CLOCK_TAI has been introduced.
Use it for adding the additional trace clock "tai".
Link: https://lkml.kernel.org/r/20220414091805.89667-3-kurt@linutronix.de
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the new logic was made to handle deltas of events from interrupts
that interrupted other events, it required 64 bit local atomics.
Unfortunately, 64 bit local atomics are expensive on 32 bit architectures.
Thus, commit 10464b4aa6 ("ring-buffer: Add rb_time_t 64 bit operations
for speeding up 32 bit") created a type of seq lock timer for 32 bits.
It used two 32 bit local atomics, but required 2 bits from them each for
synchronization, making it only 60 bits.
Add a new "msb" field to hold the extra 4 bits that are cut off.
Link: https://lore.kernel.org/all/20220426175338.3807ca4f@gandalf.local.home/
Link: https://lkml.kernel.org/r/20220427170812.53cc7139@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There's an absolute timestamp event in the ring buffer, but this only
saves 59 bits of the timestamp, as the 5 MSB is used for meta data
(stating it is an absolute time stamp). This was never an issue as all the
clocks currently in use never used those 5 MSB. But now there's a new
clock (TAI) that does.
To handle this case, when reading an absolute timestamp, a previous full
timestamp is passed in, and the 5 MSB of that timestamp is OR'd to the
absolute timestamp (if any of the 5 MSB are set), and then to test for
overflow, if the new result is smaller than the passed in previous
timestamp, then 1 << 59 is added to it.
All the extra processing is done on the reader "slow" path, with the
exception of the "too big delta" check, and the reading of timestamps
for histograms.
Note, libtraceevent will need to be updated to handle this case as well.
But this is not a user space regression, as user space was never able to
handle any timestamps that used more than 59 bits.
Link: https://lore.kernel.org/all/20220426175338.3807ca4f@gandalf.local.home/
Link: https://lkml.kernel.org/r/20220427153339.16c33f75@gandalf.local.home
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In https://lore.kernel.org/all/87y22uujkm.ffs@tglx/ Thomas
said:
Its's simply wishful thinking that stuff gets fixed because of a
WARN_ONCE(). This has never worked. The only thing which works is to
make stuff fail hard or slow it down in a way which makes it annoying
enough to users to complain.
He was talking about WBINVD. But it made me think about how we use the
split lock detection feature in Linux.
Existing code has three options for applications:
1) Don't enable split lock detection (allow arbitrary split locks)
2) Warn once when a process uses split lock, but let the process
keep running with split lock detection disabled
3) Kill process that use split locks
Option 2 falls into the "wishful thinking" territory that Thomas warns does
nothing. But option 3 might not be viable in a situation with legacy
applications that need to run.
Hence make option 2 much stricter to "slow it down in a way which makes
it annoying".
Primary reason for this change is to provide better quality of service to
the rest of the applications running on the system. Internal testing shows
that even with many processes splitting locks, performance for the rest of
the system is much more responsive.
The new "warn" mode operates like this. When an application tries to
execute a bus lock the #AC handler.
1) Delays (interruptibly) 10 ms before moving to next step.
2) Blocks (interruptibly) until it can get the semaphore
If interrupted, just return. Assume the signal will either
kill the task, or direct execution away from the instruction
that is trying to get the bus lock.
3) Disables split lock detection for the current core
4) Schedules a work queue to re-enable split lock detect in 2 jiffies
5) Returns
The work queue that re-enables split lock detection also releases the
semaphore.
There is a corner case where a CPU may be taken offline while split lock
detection is disabled. A CPU hotplug handler handles this case.
Old behaviour was to only print the split lock warning on the first
occurrence of a split lock from a task. Preserve that by adding a flag to
the task structure that suppresses subsequent split lock messages from that
task.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220310204854.31752-2-tony.luck@intel.com
Move trace_eval_init() to subsys_initcall to make it start
earlier.
And to avoid tracer_init_tracefs being blocked by
trace_event_sem which trace_eval_init() hold [1],
queue tracer_init_tracefs() to eval_map_wq to let
the two works being executed sequentially.
It can speed up the initialization of kernel as result
of making tracer_init_tracefs asynchronous.
On my arm64 platform, it reduce ~20ms of 125ms which total
time do_initcalls spend.
Link: https://lkml.kernel.org/r/20220426122407.17042-3-mark-pk.tsai@mediatek.com
[1]: https://lore.kernel.org/r/68d7b3327052757d0cd6359a6c9015a85b437232.camel@pengutronix.de
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
To prepare for support asynchronous tracer_init_tracefs initcall,
avoid calling create_trace_option_files before __update_tracer_options.
Otherwise, create_trace_option_files will show warning because
some tracers in trace_types list are already in tr->topts.
For example, hwlat_tracer call register_tracer in late_initcall,
and global_trace.dir is already created in tracing_init_dentry,
hwlat_tracer will be put into tr->topts.
Then if the __update_tracer_options is executed after hwlat_tracer
registered, create_trace_option_files find that hwlat_tracer is
already in tr->topts.
Link: https://lkml.kernel.org/r/20220426122407.17042-2-mark-pk.tsai@mediatek.com
Link: https://lore.kernel.org/lkml/20220322133339.GA32582@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When setting bootparams="trace_event=initcall:initcall_start tp_printk=1" in the
cmdline, the output_printk() was called, and the spin_lock_irqsave() was called in the
atomic and irq disable interrupt context suitation. On the PREEMPT_RT kernel,
these locks are replaced with sleepable rt-spinlock, so the stack calltrace will
be triggered.
Fix it by raw_spin_lock_irqsave when PREEMPT_RT and "trace_event=initcall:initcall_start
tp_printk=1" enabled.
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 2, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
[<ffffffff8992303e>] try_to_wake_up+0x7e/0xba0
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.1-rt17+ #19 34c5812404187a875f32bee7977f7367f9679ea7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x60/0x8c
dump_stack+0x10/0x12
__might_resched.cold+0x11d/0x155
rt_spin_lock+0x40/0x70
trace_event_buffer_commit+0x2fa/0x4c0
? map_vsyscall+0x93/0x93
trace_event_raw_event_initcall_start+0xbe/0x110
? perf_trace_initcall_finish+0x210/0x210
? probe_sched_wakeup+0x34/0x40
? ttwu_do_wakeup+0xda/0x310
? trace_hardirqs_on+0x35/0x170
? map_vsyscall+0x93/0x93
do_one_initcall+0x217/0x3c0
? trace_event_raw_event_initcall_level+0x170/0x170
? push_cpu_stop+0x400/0x400
? cblist_init_generic+0x241/0x290
kernel_init_freeable+0x1ac/0x347
? _raw_spin_unlock_irq+0x65/0x80
? rest_init+0xf0/0xf0
kernel_init+0x1e/0x150
ret_from_fork+0x22/0x30
</TASK>
Link: https://lkml.kernel.org/r/20220419013910.894370-1-jun.miao@intel.com
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
No need to traverse to the end of string. If the first byte is not a NUL
char, it's guaranteed `if (strlen(glob))` is true.
Link: https://lkml.kernel.org/r/20220417185630.199062-3-ammarfaizi2@gnuweeb.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: GNU/Weeb Mailing List <gwml@vger.gnuweeb.org>
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If `WARN_ON(!glob)` is ever triggered, we will still continue executing
the next lines. This will trigger the more serious problem, a NULL
pointer dereference bug.
Just return -EINVAL if @glob is NULL.
Link: https://lkml.kernel.org/r/20220417185630.199062-2-ammarfaizi2@gnuweeb.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: GNU/Weeb Mailing List <gwml@vger.gnuweeb.org>
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently the tp_printk option has no effect on syscall tracepoint.
When adding the kernel option parameter tp_printk, then:
echo 1 > /sys/kernel/debug/tracing/events/syscalls/enable
When running any application, no trace information is printed on the
terminal.
Now added printk for syscall tracepoints.
Link: https://lkml.kernel.org/r/20220410145025.681144-1-xiehuan09@gmail.com
Signed-off-by: Jeff Xie <xiehuan09@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add the description of @n_sort_keys and make @sort_key ->
@sort_keys in tracing_map_sort_entries() kernel-doc comment
to remove warnings found by running scripts/kernel-doc, which
is caused by using 'make W=1'.
kernel/trace/tracing_map.c:1073: warning: Function parameter or member
'sort_keys' not described in 'tracing_map_sort_entries'
kernel/trace/tracing_map.c:1073: warning: Function parameter or member
'n_sort_keys' not described in 'tracing_map_sort_entries'
kernel/trace/tracing_map.c:1073: warning: Excess function parameter
'sort_key' description in 'tracing_map_sort_entries'
Link: https://lkml.kernel.org/r/20220402072015.45864-1-yang.lee@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
hist_register_trigger() handles both new hist registration as well as
existing hist registration through event_command.reg().
Adding a new function, existing_hist_update_only(), that checks and
updates existing histograms and exits after doing so allows the
confusing logic in event_hist_trigger_parse() to be simplified.
Link: https://lkml.kernel.org/r/211b2cd3e3d7e00f4f8ad45ef8b33063da6a7e05.1644010576.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since event_trigger_data contains the .ops trigger_ops field, there's
no reason to pass the trigger_ops separately. Remove it as a param
from functions whenever event_trigger_data is passed.
Link: https://lkml.kernel.org/r/9856c9bc81bde57077f5b8d6f8faa47156c6354a.1644010575.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Code for registering triggers assumes it's possible to register more
than one trigger at a time. In fact, it's unimplemented and there
doesn't seem to be a reason to do that.
Remove the n_registered param from event_trigger_register() and fix up
callers.
Doing so simplifies the logic in event_trigger_register to the point
that it just becomes a wrapper calling event_command.reg().
It also removes the problematic call to event_command.unreg() in case
of failure. A new function, event_trigger_unregister() is also added
for callers to call themselves.
The changes to trace_events_hist.c simply allow compilation; a
separate patch follows which updates the hist triggers to work
correctly with the new changes.
Link: https://lkml.kernel.org/r/6149fec7a139d93e84fa4535672fb5bef88006b0.1644010575.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For now, the field 'map_btf_id' in 'struct bpf_map_ops' for all map
types are computed during vmlinux-btf init:
btf_parse_vmlinux() -> btf_vmlinux_map_ids_init()
It will lookup the btf_type according to the 'map_btf_name' field in
'struct bpf_map_ops'. This process can be done during build time,
thanks to Jiri's resolve_btfids.
selftest of map_ptr has passed:
$96 map_ptr:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The recent kernel change in 73f9b911fa ("kprobes: Use rethook for kretprobe
if possible"), introduced a potential NULL pointer dereference bug in the
KRETPROBE mechanism. The official Kprobes documentation defines that "Any or
all handlers can be NULL". Unfortunately, there is a missing return handler
verification to fulfill these requirements and can result in a NULL pointer
dereference bug.
This patch adds such verification in kretprobe_rethook_handler() function.
Fixes: 73f9b911fa ("kprobes: Use rethook for kretprobe if possible")
Signed-off-by: Adam Zabrocki <pi3@pi3.com.pl>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Anil S. Keshavamurthy <anil.s.keshavamurthy@intel.com>
Link: https://lore.kernel.org/bpf/20220422164027.GA7862@pi3.com.pl
The static global variable @console_locked is used to help debug
VT code to make sure that certain code paths are running with
the console_lock held. However, this information is also available
with the static global variable @console_kthreads_blocked (for
locking via console_lock()), and the static global variable
@console_kthreads_active (for locking via console_trylock()).
Remove @console_locked and update is_console_locked() to use the
alternative variables.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-16-john.ogness@linutronix.de
Currently threaded console printers synchronize against each
other using console_lock(). However, different console drivers
are unrelated and do not require any synchronization between
each other. Removing the synchronization between the threaded
console printers will allow each console to print at its own
speed.
But the threaded consoles printers do still need to synchronize
against console_lock() callers. Introduce a per-console mutex
and a new console boolean field @blocked to provide this
synchronization.
console_lock() is modified so that it must acquire the mutex
of each console in order to set the @blocked field. Console
printing threads will acquire their mutex while printing a
record. If @blocked was set, the thread will go back to sleep
instead of printing.
The reason for the @blocked boolean field is so that
console_lock() callers do not need to acquire multiple console
mutexes simultaneously, which would introduce unnecessary
complexity due to nested mutex locking. Also, a new field
was chosen instead of adding a new @flags value so that the
blocked status could be checked without concern of reading
inconsistent values due to @flags updates from other contexts.
Threaded console printers also need to synchronize against
console_trylock() callers. Since console_trylock() may be
called from any context, the per-console mutex cannot be used
for this synchronization. (mutex_trylock() cannot be called
from atomic contexts.) Introduce a global atomic counter to
identify if any threaded printers are active. The threaded
printers will also check the atomic counter to identify if the
console has been locked by another task via console_trylock().
Note that @console_sem is still used to provide synchronization
between console_lock() and console_trylock() callers.
A locking overview for console_lock(), console_trylock(), and the
threaded printers is as follows (pseudo code):
console_lock()
{
down(&console_sem);
for_each_console(con) {
mutex_lock(&con->lock);
con->blocked = true;
mutex_unlock(&con->lock);
}
/* console_lock acquired */
}
console_trylock()
{
if (down_trylock(&console_sem) == 0) {
if (atomic_cmpxchg(&console_kthreads_active, 0, -1) == 0) {
/* console_lock acquired */
}
}
}
threaded_printer()
{
mutex_lock(&con->lock);
if (!con->blocked) {
/* console_lock() callers blocked */
if (atomic_inc_unless_negative(&console_kthreads_active)) {
/* console_trylock() callers blocked */
con->write();
atomic_dec(&console_lock_count);
}
}
mutex_unlock(&con->lock);
}
The console owner and waiter logic now only applies between contexts
that have taken the console_lock via console_trylock(). Threaded
printers never take the console_lock, so they do not have a
console_lock to handover. Tasks that have used console_lock() will
block the threaded printers using a mutex and if the console_lock
is handed over to an atomic context, it would be unable to unblock
the threaded printers. However, the console_trylock() case is
really the only scenario that is interesting for handovers anyway.
@panic_console_dropped must change to atomic_t since it is no longer
protected exclusively by the console_lock.
Since threaded printers remain asleep if they see that the console
is locked, they now must be explicitly woken in __console_unlock().
This means wake_up_klogd() calls following a console_unlock() are
no longer necessary and are removed.
Also note that threaded printers no longer need to check
@console_suspended. The check for the @blocked field implicitly
covers the suspended console case.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/878rrs6ft7.fsf@jogness.linutronix.de
The current of behavior of btf_struct_ids_match for release arguments is
that when type match fails, it retries with first member type again
(recursively). Since the offset is already 0, this is akin to just
casting the pointer in normal C, since if type matches it was just
embedded inside parent sturct as an object. However, we want to reject
cases for release function type matching, be it kfunc or BPF helpers.
An example is the following:
struct foo {
struct bar b;
};
struct foo *v = acq_foo();
rel_bar(&v->b); // btf_struct_ids_match fails btf_types_are_same, then
// retries with first member type and succeeds, while
// it should fail.
Hence, don't walk the struct and only rely on btf_types_are_same for
strict mode. All users of strict mode must be dealing with zero offset
anyway, since otherwise they would want the struct to be walked.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-10-memxor@gmail.com
We introduce a new style of kfunc helpers, namely *_kptr_get, where they
take pointer to the map value which points to a referenced kernel
pointer contained in the map. Since this is referenced, only
bpf_kptr_xchg from BPF side and xchg from kernel side is allowed to
change the current value, and each pointer that resides in that location
would be referenced, and RCU protected (this must be kept in mind while
adding kernel types embeddable as reference kptr in BPF maps).
This means that if do the load of the pointer value in an RCU read
section, and find a live pointer, then as long as we hold RCU read lock,
it won't be freed by a parallel xchg + release operation. This allows us
to implement a safe refcount increment scheme. Hence, enforce that first
argument of all such kfunc is a proper PTR_TO_MAP_VALUE pointing at the
right offset to referenced pointer.
For the rest of the arguments, they are subjected to typical kfunc
argument checks, hence allowing some flexibility in passing more intent
into how the reference should be taken.
For instance, in case of struct nf_conn, it is not freed until RCU grace
period ends, but can still be reused for another tuple once refcount has
dropped to zero. Hence, a bpf_ct_kptr_get helper not only needs to call
refcount_inc_not_zero, but also do a tuple match after incrementing the
reference, and when it fails to match it, put the reference again and
return NULL.
This can be implemented easily if we allow passing additional parameters
to the bpf_ct_kptr_get kfunc, like a struct bpf_sock_tuple * and a
tuple__sz pair.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-9-memxor@gmail.com
A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.
Note that only BTF IDs whose destructor kfunc is registered, thus become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding dtor kfunc BTF ID, as well acting as a check
against the whitelist of allowed BTF IDs for this purpose.
Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership a
referenced pointer into it.
The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.
Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
To support storing referenced PTR_TO_BTF_ID in maps, we require
associating a specific BTF ID with a 'destructor' kfunc. This is because
we need to release a live referenced pointer at a certain offset in map
value from the map destruction path, otherwise we end up leaking
resources.
Hence, introduce support for passing an array of btf_id, kfunc_btf_id
pairs that denote a BTF ID and its associated release function. Then,
add an accessor 'btf_find_dtor_kfunc' which can be used to look up the
destructor kfunc of a certain BTF ID. If found, we can use it to free
the object from the map free path.
The registration of these pairs also serve as a whitelist of structures
which are allowed as referenced PTR_TO_BTF_ID in a BPF map, because
without finding the destructor kfunc, we will bail and return an error.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-7-memxor@gmail.com
Since now there might be at most 10 offsets that need handling in
copy_map_value, the manual shuffling and special case is no longer going
to work. Hence, let's generalise the copy_map_value function by using
a sorted array of offsets to skip regions that must be avoided while
copying into and out of a map value.
When the map is created, we populate the offset array in struct map,
Then, copy_map_value uses this sorted offset array is used to memcpy
while skipping timer, spin lock, and kptr. The array is allocated as
in most cases none of these special fields would be present in map
value, hence we can save on space for the common case by not embedding
the entire object inside bpf_map struct.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-6-memxor@gmail.com
While we can guarantee that even for unreferenced kptr, the object
pointer points to being freed etc. can be handled by the verifier's
exception handling (normal load patching to PROBE_MEM loads), we still
cannot allow the user to pass these pointers to BPF helpers and kfunc,
because the same exception handling won't be done for accesses inside
the kernel. The same is true if a referenced pointer is loaded using
normal load instruction. Since the reference is not guaranteed to be
held while the pointer is used, it must be marked as untrusted.
Hence introduce a new type flag, PTR_UNTRUSTED, which is used to mark
all registers loading unreferenced and referenced kptr from BPF maps,
and ensure they can never escape the BPF program and into the kernel by
way of calling stable/unstable helpers.
In check_ptr_to_btf_access, the !type_may_be_null check to reject type
flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER,
MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first
two are checked inside the function and rejected using a proper error
message, but we still want to allow dereference of untrusted case.
Also, we make sure to inherit PTR_UNTRUSTED when chain of pointers are
walked, so that this flag is never dropped once it has been set on a
PTR_TO_BTF_ID (i.e. trusted to untrusted transition can only be in one
direction).
In convert_ctx_accesses, extend the switch case to consider untrusted
PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM
conversion for BPF_LDX.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-5-memxor@gmail.com
Extending the code in previous commits, introduce referenced kptr
support, which needs to be tagged using 'kptr_ref' tag instead. Unlike
unreferenced kptr, referenced kptr have a lot more restrictions. In
addition to the type matching, only a newly introduced bpf_kptr_xchg
helper is allowed to modify the map value at that offset. This transfers
the referenced pointer being stored into the map, releasing the
references state for the program, and returning the old value and
creating new reference state for the returned pointer.
Similar to unreferenced pointer case, return value for this case will
also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer
must either be eventually released by calling the corresponding release
function, otherwise it must be transferred into another map.
It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
the value, and obtain the old value if any.
BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future
commit will permit using BPF_LDX for such pointers, but attempt at
making it safe, since the lifetime of object won't be guaranteed.
There are valid reasons to enforce the restriction of permitting only
bpf_kptr_xchg to operate on referenced kptr. The pointer value must be
consistent in face of concurrent modification, and any prior values
contained in the map must also be released before a new one is moved
into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
returns the old value, which the verifier would require the user to
either free or move into another map, and releases the reference held
for the pointer being moved in.
In the future, direct BPF_XCHG instruction may also be permitted to work
like bpf_kptr_xchg helper.
Note that process_kptr_func doesn't have to call
check_helper_mem_access, since we already disallow rdonly/wronly flags
for map, which is what check_map_access_type checks, and we already
ensure the PTR_TO_MAP_VALUE refers to kptr by obtaining its off_desc,
so check_map_access is also not required.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-4-memxor@gmail.com
Add a new type flag for bpf_arg_type that when set tells verifier that
for a release function, that argument's register will be the one for
which meta.ref_obj_id will be set, and which will then be released
using release_reference. To capture the regno, introduce a new field
release_regno in bpf_call_arg_meta.
This would be required in the next patch, where we may either pass NULL
or a refcounted pointer as an argument to the release function
bpf_kptr_xchg. Just releasing only when meta.ref_obj_id is set is not
enough, as there is a case where the type of argument needed matches,
but the ref_obj_id is set to 0. Hence, we must enforce that whenever
meta.ref_obj_id is zero, the register that is to be released can only
be NULL for a release function.
Since we now indicate whether an argument is to be released in
bpf_func_proto itself, is_release_function helper has lost its utitlity,
hence refactor code to work without it, and just rely on
meta.release_regno to know when to release state for a ref_obj_id.
Still, the restriction of one release argument and only one ref_obj_id
passed to BPF helper or kfunc remains. This may be lifted in the future.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-3-memxor@gmail.com
This commit introduces a new pointer type 'kptr' which can be embedded
in a map value to hold a PTR_TO_BTF_ID stored by a BPF program during
its invocation. When storing such a kptr, BPF program's PTR_TO_BTF_ID
register must have the same type as in the map value's BTF, and loading
a kptr marks the destination register as PTR_TO_BTF_ID with the correct
kernel BTF and BTF ID.
Such kptr are unreferenced, i.e. by the time another invocation of the
BPF program loads this pointer, the object which the pointer points to
may not longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
patched to PROBE_MEM loads by the verifier, it would safe to allow user
to still access such invalid pointer, but passing such pointers into
BPF helpers and kfuncs should not be permitted. A future patch in this
series will close this gap.
The flexibility offered by allowing programs to dereference such invalid
pointers while being safe at runtime frees the verifier from doing
complex lifetime tracking. As long as the user may ensure that the
object remains valid, it can ensure data read by it from the kernel
object is valid.
The user indicates that a certain pointer must be treated as kptr
capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
information is recorded in the object BTF which will be passed into the
kernel by way of map's BTF information. The name and kind from the map
value BTF is used to look up the in-kernel type, and the actual BTF and
BTF ID is recorded in the map struct in a new kptr_off_tab member. For
now, only storing pointers to structs is permitted.
An example of this specification is shown below:
#define __kptr __attribute__((btf_type_tag("kptr")))
struct map_value {
...
struct task_struct __kptr *task;
...
};
Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
task_struct into the map, and then load it later.
Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
the verifier cannot know whether the value is NULL or not statically, it
must treat all potential loads at that map value offset as loading a
possibly NULL pointer.
Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL)
are allowed instructions that can access such a pointer. On BPF_LDX, the
destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
it is checked whether the source register type is a PTR_TO_BTF_ID with
same BTF type as specified in the map BTF. The access size must always
be BPF_DW.
For the map in map support, the kptr_off_tab for outer map is copied
from the inner map's kptr_off_tab. It was chosen to do a deep copy
instead of introducing a refcount to kptr_off_tab, because the copy only
needs to be done when paramterizing using inner_map_fd in the map in map
case, hence would be unnecessary for all other users.
It is not permitted to use MAP_FREEZE command and mmap for BPF map
having kptrs, similar to the bpf_timer case. A kptr also requires that
BPF program has both read and write access to the map (hence both
BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG are disallowed).
Note that check_map_access must be called from both
check_helper_mem_access and for the BPF instructions, hence the kptr
check must distinguish between ACCESS_DIRECT and ACCESS_HELPER, and
reject ACCESS_HELPER cases. We rename stack_access_src to bpf_access_src
and reuse it for this purpose.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-2-memxor@gmail.com
Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and
use it everywhere. check_return_code already enforces sane
return ranges for all cgroup types. (only egress and bind hooks have
uncanonical return ranges, the rest is using [0, 1])
No functional changes.
v2:
- 'func_ret & 1' under explicit test (Andrii & Martin)
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220425220448.3669032-1-sdf@google.com
This move the kernel/kexec_core.c respective sysctls to its own file.
kernel/sysctl.c has grown to an insane mess, We move sysctls to places
where features actually belong to improve the readability and reduce
merge conflicts. At the same time, the proc-sysctl maintainers can easily
care about the core logic other than the sysctl knobs added for some feature.
We already moved all filesystem sysctls out. This patch is part of the effort
to move kexec related sysctls out.
Signed-off-by: yingelin <yingelin@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Variable end is being initialized with a value that is never read, it
is being re-assigned later with the same value. The initialization is
redundant and can be removed.
Cleans up clang scan build warning:
kernel/irq/matrix.c:289:25: warning: Value stored to 'end' during its
initialization is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20220422110418.1264778-1-colin.i.king@gmail.com
When tick_nohz_stop_tick() stops the tick and high resolution timers are
disabled, then the clock event device is not put into ONESHOT_STOPPED
mode. This can lead to spurious timer interrupts with some clock event
device drivers that don't shut down entirely after firing.
Eliminate these by putting the device into ONESHOT_STOPPED mode at points
where it is not being reprogrammed. When there are no timers active, then
tick_program_event() with KTIME_MAX can be used to stop the device. When
there is a timer active, the device can be stopped at the next tick (any
new timer added by timers will reprogram the tick).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220422141446.915024-1-npiggin@gmail.com
Instead of passing the allow_dups argument to fsnotify_add_mark()
as an argument, define the group flag FSNOTIFY_GROUP_DUPS to express
the allow_dups behavior and set this behavior at group creation time
for all calls of fsnotify_add_mark().
Rename the allow_dups argument to generic add_flags argument for future
use.
Link: https://lore.kernel.org/r/20220422120327.3459282-6-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Add flags argument to fsnotify_alloc_group(), define and use the flag
FSNOTIFY_GROUP_USER in inotify and fanotify instead of the helper
fsnotify_alloc_user_group() to indicate user allocation.
Although the flag FSNOTIFY_GROUP_USER is currently not used after group
allocation, we store the flags argument in the group struct for future
use of other group flags.
Link: https://lore.kernel.org/r/20220422120327.3459282-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJlHhcACgkQEsHwGGHe
VUrPyQ/9FE7zLj8euC4HJ4HPJwf7vkwaeRHz1T2gS+izChBI+QSo5Ipe5zFOKz55
vYSfaYF0MIvVJtKSHMbnQf6f/2i+y5j0ozMjKEkHRZdYP26okPoj+M2effgbceiJ
pOIZUsdr8SdBQv313icuUfsXIGfMv/xIw20OtIhVpOFQPB4ZbLASn6AhusZL7U6Z
0BIcfGmmOwV6p4petOJVUXRcwkgfT812UOBLV71DEz9jzP8dXYGVvPV8ZnSYoVQW
tm6rcmnpzsOqb3xnp7hqFHevyoIzBT31KVo4xnB80CtCoWB/tbEIPIbNjUPaREp0
ezE8yXv6euob92+Uh5DH/+8oWuzlctKv1Pc98rFnrGGfW4ocDsr5ibsi9472Mkec
s+waTwemZMGN3bQHH5QvjWxPGdGuPsqrNvgHbZRFGYGJcMoC+2F9p+vKOXK00fMF
9ivhhuFqH8OVAFu3WUvvD8zO18tfnST7fQflQJNxZ/TqPumNc0+zLrpKDp+7ZE+r
qgdvxvXO3ZRnPttiEP1/J+uKxQGNMuDEU8NcfdA7nOzEv9yPyKLLcwo2qu3IYgP0
XuM3Gqt5/Cf38b+1ddR1LWai3KjxVTn7HV4G9YPdvYP296YcZlFjGOtzfNOfw905
djGEFTFyGQuS7BEHKhD3OoDbegT4FvB+69k2ddy4Dut99WkDk84=
=S7Ux
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Fix a corner case when calculating sched runqueue variables
That fix also removes a check for a zero divisor in the code, without
mentioning it. Vincent clarified that it's ok after I whined about it:
https://lore.kernel.org/all/CAKfTPtD2QEyZ6ADd5WrwETMOX0XOwJGnVddt7VHgfURdqgOS-Q@mail.gmail.com/
* tag 'sched_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/pelt: Fix attach_entity_load_avg() corner case
The CXL specification claims S3 support at a hardware level, but at a
system software level there are some missing pieces. Section 9.4 (CXL
2.0) rightly claims that "CXL mem adapters may need aux power to retain
memory context across S3", but there is no enumeration mechanism for the
OS to determine if a given adapter has that support. Moreover the save
state and resume image for the system may inadvertantly end up in a CXL
device that needs to be restored before the save state is recoverable.
I.e. a circular dependency that is not resolvable without a third party
save-area.
Arrange for the cxl_mem driver to fail S3 attempts. This still nominaly
allows for suspend, but requires unbinding all CXL memory devices before
the suspend to ensure the typical DRAM flow is taken. The cxl_mem unbind
flow is intended to also tear down all CXL memory regions associated
with a given cxl_memdev.
It is reasonable to assume that any device participating in a System RAM
range published in the EFI memory map is covered by aux power and
save-area outside the device itself. So this restriction can be
minimized in the future once pre-existing region enumeration support
arrives, and perhaps a spec update to clarify if the EFI memory map is
sufficent for determining the range of devices managed by
platform-firmware for S3 support.
Per Rafael, if the CXL configuration prevents suspend then it should
fail early before tasks are frozen, and mem_sleep should stop showing
'mem' as an option [1]. Effectively CXL augments the platform suspend
->valid() op since, for example, the ACPI ops are not aware of the CXL /
PCI dependencies. Given the split role of platform firmware vs OS
provisioned CXL memory it is up to the cxl_mem driver to determine if
the CXL configuration has elements that platform firmware may not be
prepared to restore.
Link: https://lore.kernel.org/r/CAJZ5v0hGVN_=3iU8OLpHY3Ak35T5+JcBM-qs8SbojKrpd0VXsA@mail.gmail.com [1]
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <len.brown@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/165066828317.3907920.5690432272182042556.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Allow attaching BTF-aware TRACING programs, previously attachable only
through BPF_RAW_TRACEPOINT_OPEN command, through LINK_CREATE command:
- BTF-aware raw tracepoints (tp_btf in libbpf lingo);
- fentry/fexit/fmod_ret programs;
- BPF LSM programs.
This change converges all bpf_link-based attachments under LINK_CREATE
command allowing to further extend the API with features like BPF cookie
under "multiplexed" link_create section of bpf_attr.
Non-BTF-aware raw tracepoints are left under BPF_RAW_TRACEPOINT_OPEN,
but there is nothing preventing opening them up to LINK_CREATE as well.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Kuifeng Lee <kuifeng@fb.com>
Link: https://lore.kernel.org/bpf/20220421033945.3602803-2-andrii@kernel.org
Create a kthread for each console to perform console printing. During
normal operation (@system_state == SYSTEM_RUNNING), the kthread
printers are responsible for all printing on their respective
consoles.
During non-normal operation, console printing is done as it has been:
within the context of the printk caller or within irqwork triggered
by the printk caller, referred to as direct printing.
Since threaded console printers are responsible for all printing
during normal operation, this also includes messages generated via
deferred printk calls. If direct printing is in effect during a
deferred printk call, the queued irqwork will perform the direct
printing. To make it clear that this is the only time that the
irqwork will perform direct printing, rename the flag
PRINTK_PENDING_OUTPUT to PRINTK_PENDING_DIRECT_OUTPUT.
Threaded console printers synchronize against each other and against
console lockers by taking the console lock for each message that is
printed.
Note that the kthread printers do not care about direct printing.
They will always try to print if new records are available. They can
be blocked by direct printing, but will be woken again once direct
printing is finished.
Console unregistration is a bit tricky because the associated
kthread printer cannot be stopped while the console lock is held.
A policy is implemented that states: whichever task clears
con->thread (under the console lock) is responsible for stopping
the kthread. unregister_console() will clear con->thread while
the console lock is held and then stop the kthread after releasing
the console lock.
For consoles that have implemented the exit() callback, the kthread
is stopped before exit() is called.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-14-john.ogness@linutronix.de
Once kthread printing is available, console printing will no longer
occur in the context of the printk caller. However, there are some
special contexts where it is desirable for the printk caller to
directly print out kernel messages. Using pr_flush() to wait for
threaded printers is only possible if the caller is in a sleepable
context and the kthreads are active. That is not always the case.
Introduce printk_prefer_direct_enter() and printk_prefer_direct_exit()
functions to explicitly (and globally) activate/deactivate preferred
direct console printing. The term "direct console printing" refers to
printing to all enabled consoles from the context of the printk
caller. The term "prefer" is used because this type of printing is
only best effort. If the console is currently locked or other
printers are already actively printing, the printk caller will need
to rely on the other contexts to handle the printing.
This preferred direct printing is how all printing has been handled
until now (unless it was explicitly deferred).
When kthread printing is introduced, there may be some unanticipated
problems due to kthreads being unable to flush important messages.
In order to minimize such risks, preferred direct printing is
activated for the primary important messages when the system
experiences general types of major errors. These are:
- emergency reboot/shutdown
- cpu and rcu stalls
- hard and soft lockups
- hung tasks
- warn
- sysrq
Note that since kthread printing does not yet exist, no behavior
changes result from this commit. This is only implementing the
counter and marking the various places where preferred direct
printing is active.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org> # for RCU
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-13-john.ogness@linutronix.de
Provide a might-sleep function to allow waiting for console printers
to catch up to the latest logged message.
Use pr_flush() whenever it is desirable to get buffered messages
printed before continuing: suspend_console(), resume_console(),
console_stop(), console_start(), console_unblank().
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-12-john.ogness@linutronix.de
Extended consoles print extended messages and do not print messages about
dropped records.
Non-extended consoles print "normal" messages as well as extra messages
about dropped records.
Currently the buffers for these various message types are defined within
the functions that might use them and their usage is based upon the
CON_EXTENDED flag. This will be a problem when moving to kthread printers
because each printer must be able to provide its own buffers.
Move all the message buffer definitions outside of
console_emit_next_record(). The caller knows if extended or dropped
messages should be printed and can specify the appropriate buffers to
use. The console_emit_next_record() and call_console_driver() functions
can know what to print based on whether specified buffers are non-NULL.
With this change, buffer definition/allocation/specification is separated
from the code that does the various types of string printing.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-11-john.ogness@linutronix.de
Refactor/rework printing logic in order to prepare for moving to
threaded console printing.
- Move @console_seq into struct console so that the current
"position" of each console can be tracked individually.
- Move @console_dropped into struct console so that the current drop
count of each console can be tracked individually.
- Modify printing logic so that each console independently loads,
prepares, and prints its next record.
- Remove exclusive_console logic. Since console positions are
handled independently, replaying past records occurs naturally.
- Update the comments explaining why preemption is disabled while
printing from printk() context.
With these changes, there is a change in behavior: the console
replaying the log (formerly exclusive console) will no longer block
other consoles. New messages appear on the other consoles while the
newly added console is still replaying.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-10-john.ogness@linutronix.de
It is useful to generate log messages that include details about
the related console. Rather than duplicate the code to assemble
the details, put that code into a macro con_printk().
Once console printers become threaded, this macro will find more
users.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-9-john.ogness@linutronix.de
boot_delay_msec() is always called immediately before printk_delay()
so just call it from within printk_delay().
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-8-john.ogness@linutronix.de
Currently the local CPU timestamp and caller_id for the record are
collected while migration is enabled. Since this information is
CPU-specific, it should be collected with migration disabled.
Migration is disabled immediately after collecting this information
anyway, so just move the information collection to after the
migration disabling.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-7-john.ogness@linutronix.de
When printk() is called from safe or NMI contexts, it will directly
store the record (vprintk_store()) and then defer the console output.
However, defer_console_output() only causes console printing and does
not wake any waiters of new records.
Wake waiters from defer_console_output() so that they also are aware
of the new records from safe and NMI contexts.
Fixes: 03fc7f9c99 ("printk/nmi: Prevent deadlock when accessing the main log buffer in NMI")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-6-john.ogness@linutronix.de
There can be multiple tasks waiting for new records. They should
all be woken. Use wake_up_interruptible_all() instead of
wake_up_interruptible().
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-5-john.ogness@linutronix.de
It is important that any new records are visible to preparing
waiters before the waker checks if the wait queue is empty.
Otherwise it is possible that:
- there are new records available
- the waker sees an empty wait queue and does not wake
- the preparing waiter sees no new records and begins to wait
This is exactly the problem that the function description of
waitqueue_active() warns about.
Use wq_has_sleeper() instead of waitqueue_active() because it
includes the necessary full memory barrier.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-4-john.ogness@linutronix.de
Since the printk cpulock is CPU-reentrant and since it is used
in all contexts, its usage must be carefully considered and
most likely will require programming locklessly. To avoid
mistaking the printk cpulock as a typical lock, rename it to
cpu_sync. The main functions then become:
printk_cpu_sync_get_irqsave(flags);
printk_cpu_sync_put_irqrestore(flags);
Add extra notes of caution in the function description to help
developers understand the requirements for correct usage.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220421212250.565456-2-john.ogness@linutronix.de
As for SVE provide a prctl() interface which allows processes to
configure their SME vector length.
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220419112247.711548-12-broonie@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Now that stack validation is an optional feature of objtool, add
CONFIG_OBJTOOL and replace most usages of CONFIG_STACK_VALIDATION with
it.
CONFIG_STACK_VALIDATION can now be considered to be frame-pointer
specific. CONFIG_UNWINDER_ORC is already inherently valid for live
patching, so no need to "validate" it.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lkml.kernel.org/r/939bf3d85604b2a126412bf11af6e3bd3b872bcb.1650300597.git.jpoimboe@redhat.com
If busiest group type is group_misfit_task, the local
group type must be group_has_spare according to below
code in update_sd_pick_busiest():
if (sgs->group_type == group_misfit_task &&
(!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
group type imbalanced and overloaded and fully_busy are filtered in here.
misfit and asym are filtered before in update_sg_lb_stats().
So, change the decision matrix to:
busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
has_spare nr_idle balanced N/A N/A balanced balanced
fully_busy nr_idle nr_idle N/A N/A balanced balanced
misfit_task force N/A N/A N/A *N/A* *N/A*
asym_packing force force N/A N/A force force
imbalanced force force N/A N/A force force
overloaded force force N/A N/A force avg_load
Fixes: 0b0695f2b3 ("sched/fair: Rework load_balance()")
Signed-off-by: Tao Zhou <tao.zhou@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20220415095505.7765-1-tao.zhou@linux.dev
Martin find it confusing when look at the /proc/pressure/cpu output,
and found no hint about that CPU "full" line in psi Documentation.
% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277
The PSI_CPU_FULL state is introduced by commit e7fcd76228
("psi: Add PSI_CPU_FULL state"), which mainly for cgroup level,
but also counted at the system level as a side effect.
Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
schedule latency. For example, t1 is the time when task wakeup
on an idle CPU, t2 is the time when CPU pick and switch to it.
The delta of (t2 - t1) will be in CPU_FULL state.
Another case all processes can be stalled is when all cgroups
have been throttled at the same time, which unlikely to happen.
Anyway, CPU_FULL metric is meaningless and confusing at the
system level. So this patch will report zeroes for CPU full
at the system level, and update psi Documentation accordingly.
Fixes: e7fcd76228 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
We have tested cfs_rq->load.weight in cfs_rq_is_decayed(),
the first condition "!cfs_rq_is_decayed(cfs_rq)" is enough
to cover the second condition "cfs_rq->nr_running".
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220408115309.81603-2-zhouchengming@bytedance.com
Since commit 2312729688 ("sched/fair: Update scale invariance of PELT")
change to use rq_clock_pelt() instead of rq_clock_task(), we should also
use rq_clock_pelt() for throttled_clock_task_time and throttled_clock_task
accounting to get correct cfs_rq_clock_pelt() of throttled cfs_rq. And
rename throttled_clock_task(_time) to be clock_pelt rather than clock_task.
Fixes: 2312729688 ("sched/fair: Update scale invariance of PELT")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220408115309.81603-1-zhouchengming@bytedance.com
In calculate_imbalance function, when the value of local->avg_load is
greater than or equal to busiest->avg_load, the calculated sds->avg_load is
not used. So this calculation can be placed in a more appropriate position.
Signed-off-by: zgpeng <zgpeng@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Samuel Liao <samuelliao@tencent.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/1649239025-10010-1-git-send-email-zgpeng@tencent.com
When a trigger being created, its win.start_value and win.start_time are
reset to zero. If group->total[PSI_POLL][t->state] has accumulated before,
this trigger will be fired unexpectedly in the next period, even if its
growth time does not reach its threshold.
So set the window of the new trigger to the current state value.
Signed-off-by: Hailong Liu <liuhailong@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/1648789811-3788971-1-git-send-email-liuhailong@linux.alibaba.com
With SIGTRAP on perf events, we have encountered termination of
processes due to user space attempting to block delivery of SIGTRAP.
Consider this case:
<set up SIGTRAP on a perf event>
...
sigset_t s;
sigemptyset(&s);
sigaddset(&s, SIGTRAP | <and others>);
sigprocmask(SIG_BLOCK, &s, ...);
...
<perf event triggers>
When the perf event triggers, while SIGTRAP is blocked, force_sig_perf()
will force the signal, but revert back to the default handler, thus
terminating the task.
This makes sense for error conditions, but not so much for explicitly
requested monitoring. However, the expectation is still that signals
generated by perf events are synchronous, which will no longer be the
case if the signal is blocked and delivered later.
To give user space the ability to clearly distinguish synchronous from
asynchronous signals, introduce siginfo_t::si_perf_flags and
TRAP_PERF_FLAG_ASYNC (opted for flags in case more binary information is
required in future).
The resolution to the problem is then to (a) no longer force the signal
(avoiding the terminations), but (b) tell user space via si_perf_flags
if the signal was synchronous or not, so that such signals can be
handled differently (e.g. let user space decide to ignore or consider
the data imprecise).
The alternative of making the kernel ignore SIGTRAP on perf events if
the signal is blocked may work for some usecases, but likely causes
issues in others that then have to revert back to interception of
sigprocmask() (which we want to avoid). [ A concrete example: when using
breakpoint perf events to track data-flow, in a region of code where
signals are blocked, data-flow can no longer be tracked accurately.
When a relevant asynchronous signal is received after unblocking the
signal, the data-flow tracking logic needs to know its state is
imprecise. ]
Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/20220404111204.935357-1-elver@google.com
vm_insert_page()'s failure is not an unexpected condition, so don't do
WARN_ONCE() in such a case.
Instead, print a kernel message and just return an error code.
This flaw has been reported under an OOM condition by sysbot [1].
The message is mainly for the benefit of the test log, in this case the
fuzzer's log so that humans inspecting the log can figure out what was
going on. KCOV is a testing tool, so I think being a little more chatty
when KCOV unexpectedly is about to fail will save someone debugging
time.
We don't want the WARN, because it's not a kernel bug that syzbot should
report, and failure can happen if the fuzzer tries hard enough (as
above).
Link: https://lkml.kernel.org/r/Ylkr2xrVbhQYwNLf@elver.google.com [1]
Link: https://lkml.kernel.org/r/20220401182512.249282-1-nogikh@google.com
Fixes: b3d7fe86fb ("kcov: properly handle subsequent mmap calls"),
Signed-off-by: Aleksandr Nogikh <nogikh@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a CPU is going offline, all workers on the CPU's pool will have their
cpus_allowed cleared to cpu_possible_mask and can run on any CPUs including
the isolated ones. Instead, set cpus_allowed to wq_unbound_cpumask so that
the can avoid isolated CPUs.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Ok so hopefully this is the last of it. 0day picked up a build
failure [0] when SYSCTL=y but DYNAMIC_FTRACE=n. This can be fixed
by just declaring an empty routine for the calls moved just
recently.
[0] https://lkml.kernel.org/r/202204161203.6dSlgKJX-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Fixes: f8b7d2b4c1 ("ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y")
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
One can enable dyanmic tracing but disable sysctls.
When this is doen we get the compile kernel warning:
CC kernel/trace/ftrace.o
kernel/trace/ftrace.c:3086:13: warning: ‘ftrace_shutdown_sysctl’ defined
but not used [-Wunused-function]
3086 | static void ftrace_shutdown_sysctl(void)
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/trace/ftrace.c:3068:13: warning: ‘ftrace_startup_sysctl’ defined
but not used [-Wunused-function]
3068 | static void ftrace_startup_sysctl(void)
When CONFIG_DYNAMIC_FTRACE=n the ftrace_startup_sysctl() and
routines ftrace_shutdown_sysctl() still compiles, so these
are actually more just used for when SYSCTL=y.
Fix this then by just moving these routines to when sysctls
are enabled.
Fixes: 7cde53da38a3 ("ftrace: move sysctl_ftrace_enabled to ftrace.c")
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Some functions in next patch want to use this function, and those
functions will be called by check_map_access, hence move it before
check_map_access.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/bpf/20220415160354.1050687-3-memxor@gmail.com
Next commit introduces field type 'kptr' whose kind will not be struct,
but pointer, and it will not be limited to one offset, but multiple
ones. Make existing btf_find_struct_field and btf_find_datasec_var
functions amenable to use for finding kptrs in map value, by moving
spin_lock and timer specific checks into their own function.
The alignment, and name are checked before the function is called, so it
is the last point where we can skip field or return an error before the
next loop iteration happens. Size of the field and type is meant to be
checked inside the function.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220415160354.1050687-2-memxor@gmail.com
Currently, a CONFIG_PREEMPT_NONE=y kernel substitutes normal RCU for
RCU Tasks Rude and RCU Tasks Trace. Unless that kernel builds rcuscale,
whether built-in or as a module, in which case these RCU Tasks flavors are
(unnecessarily) built in. This both increases kernel size and increases
the complexity of certain tracing operations. This commit therefore
decouples the presence of rcuscale from the presence of RCU Tasks Rude
and RCU Tasks Trace.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, a CONFIG_PREEMPT_NONE=y kernel substitutes normal RCU for
RCU Tasks. Unless that kernel builds rcuscale, whether built-in or as
a module, in which case RCU Tasks is (unnecessarily) built. This both
increases kernel size and increases the complexity of certain tracing
operations. This commit therefore decouples the presence of rcuscale
from the presence of RCU Tasks.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, a CONFIG_PREEMPT_NONE=y kernel substitutes normal RCU for
RCU Tasks Rude and RCU Tasks Trace. Unless that kernel builds refscale,
whether built-in or as a module, in which case these RCU Tasks flavors are
(unnecessarily) built in. This both increases kernel size and increases
the complexity of certain tracing operations. This commit therefore
decouples the presence of refscale from the presence of RCU Tasks Rude
and RCU Tasks Trace.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, a CONFIG_PREEMPT_NONE=y kernel substitutes normal RCU for
RCU Tasks. Unless that kernel builds refscale, whether built-in or as a
module, in which case RCU Tasks is (unnecessarily) built in. This both
increases kernel size and increases the complexity of certain tracing
operations. This commit therefore decouples the presence of refscale
from the presence of RCU Tasks.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Unless a kernel builds rcutorture, whether built-in or as a module, that
kernel is also built with CONFIG_TASKS_RUDE_RCU, whether anything else
needs Tasks Rude RCU or not. This unnecessarily increases kernel size.
This commit therefore decouples the presence of rcutorture from the
presence of RCU Tasks Rude.
However, there is a need to select CONFIG_TASKS_RUDE_RCU for testing
purposes. Except that casual users must not be bothered with
questions -- for them, this needs to be fully automated. There is
thus a CONFIG_FORCE_TASKS_RUDE_RCU that selects CONFIG_TASKS_RUDE_RCU,
is user-selectable, but which depends on CONFIG_RCU_EXPERT.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, a CONFIG_PREEMPT_NONE=y kernel substitutes normal RCU for
RCU Tasks. Unless that kernel builds rcutorture, whether built-in or as
a module, in which case RCU Tasks is (unnecessarily) used. This both
increases kernel size and increases the complexity of certain tracing
operations. This commit therefore decouples the presence of rcutorture
from the presence of RCU Tasks.
However, there is a need to select CONFIG_TASKS_RCU for testing purposes.
Except that casual users must not be bothered with questions -- for them,
this needs to be fully automated. There is thus a CONFIG_FORCE_TASKS_RCU
that selects CONFIG_TASKS_RCU, is user-selectable, but which depends
on CONFIG_RCU_EXPERT.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Unless a kernel builds rcutorture, whether built-in or as a module, that
kernel is also built with CONFIG_TASKS_TRACE_RCU, whether anything else
needs Tasks Trace RCU or not. This unnecessarily increases kernel size.
This commit therefore decouples the presence of rcutorture from the
presence of RCU Tasks Trace.
However, there is a need to select CONFIG_TASKS_TRACE_RCU for
testing purposes. Except that casual users must not be bothered with
questions -- for them, this needs to be fully automated. There is thus
a CONFIG_FORCE_TASKS_TRACE_RCU that selects CONFIG_TASKS_TRACE_RCU,
is user-selectable, but which depends on CONFIG_RCU_EXPERT.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, any kernel built with CONFIG_PREEMPTION=y also gets
CONFIG_TASKS_RCU=y, which is not helpful to people trying to build
preemptible kernels of minimal size.
Because CONFIG_TASKS_RCU=y is needed only in kernels doing tracing of
one form or another, this commit moves from TASKS_RCU deciding when it
should be enabled to the tracing Kconfig options explicitly selecting it.
This allows building preemptible kernels without TASKS_RCU, if desired.
This commit also updates the SRCU-N and TREE09 rcutorture scenarios
in order to avoid Kconfig errors that would otherwise result from
CONFIG_TASKS_RCU being selected without its CONFIG_RCU_EXPERT dependency
being met.
[ paulmck: Apply BPF_SYSCALL feedback from Andrii Nakryiko. ]
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When booting kernels built with both CONFIG_RCU_STRICT_GRACE_PERIOD=y
and CONFIG_PREEMPT_RT=y, the rcu_read_unlock_special() function's
invocation of irq_work_queue_on() the init_irq_work() causes the
rcu_preempt_deferred_qs_handler() function to work execute in SCHED_FIFO
irq_work kthreads. Because rcu_read_unlock_special() is invoked on each
rcu_read_unlock() in such kernels, the amount of work just keeps piling
up, resulting in a boot-time hang.
This commit therefore avoids this hang by using IRQ_WORK_INIT_HARD()
instead of init_irq_work(), but only in kernels built with both
CONFIG_PREEMPT_RT=y and CONFIG_RCU_STRICT_GRACE_PERIOD=y.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_sync_enter() function is used by updaters to force RCU readers
(e.g. percpu-rwsem) to use their slow paths during an update. This is
accomplished by setting the ->gp_state of the rcu_sync structure to
GP_ENTER. In the case of percpu-rwsem, the readers' slow path waits on
a semaphore instead of just incrementing a reader count. Each updater
invokes the rcu_sync_exit() function to signal to readers that they
may again take their fastpaths. The rcu_sync_exit() function sets the
->gp_state of the rcu_sync structure to GP_EXIT, and if all goes well,
after a grace period the ->gp_state reverts back to GP_IDLE.
Unfortunately, the rcu_sync_enter() function currently has a comment
incorrectly stating that rcu_sync_exit() (by an updater) will re-enable
reader "slowpaths". This patch changes the comment to state that this
function re-enables reader fastpaths.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
For the spawning of the priority-boost kthreads can fail, improbable
though this might seem. This commit therefore refrains from attemoting
to initiate RCU priority boosting when The ->boost_kthread_task pointer
is NULL.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
An early check on synchronize_rcu[_expedited]() tries to determine if
the current CPU is in UP mode on an SMP no-preempt kernel, in which case
there is no need to start a grace period since the current assumed
quiescent state is all we need.
However the preemption mode doesn't take into account the boot selected
preemption mode under CONFIG_PREEMPT_DYNAMIC=y, missing a possible
early return if the running flavour is "none" or "voluntary".
Use the shiny new preempt mode accessors to fix this. However,
avoid invoking them during early boot because doing so triggers a
WARN_ON_ONCE().
[ paulmck: Update for mainlined API. ]
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU's synchronous grace periods act quite differently when there is
only one online CPU, especially in the no-op case in kernels built with
CONFIG_PREEMPTION=n. This change in behavior can be important debugging
information, so this commit adds the number of online CPUs to the RCU
CPU stall warning messages.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The final "if" statement in rcu_gp_cleanup() has proven to be rather
confusing, straightforward though it might have seemed when initially
written. This commit therefore adds comments to its "then" and "else"
clauses to at least provide a more elevated form of confusion.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Debugging of problems involving insanely long-running SMI handlers
proceeds better if the CSD-lock timeout can be adjusted. This commit
therefore provides a new smp.csd_lock_timeout kernel boot parameter
that specifies the timeout in milliseconds. The default remains at the
previously hard-coded value of five seconds.
[ paulmck: Apply feedback from Juergen Gross. ]
Cc: Rik van Riel <riel@surriel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
bpf_{sk,task,inode}_storage_free() do not need to use
call_rcu_tasks_trace as no BPF program should be accessing the owner
as it's being destroyed. The only other reader at this point is
bpf_local_storage_map_free() which uses normal RCU.
The only path that needs trace RCU are:
* bpf_local_storage_{delete,update} helpers
* map_{delete,update}_elem() syscalls
Fixes: 0fe4b381a5 ("bpf: Allow bpf_local_storage to be used by sleepable programs")
Signed-off-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220418155158.2865678-1-kpsingh@kernel.org
It is guaranteed that for modifiers, clang always places type tags
before other modifiers, and then the base type. We would like to rely on
this guarantee inside the kernel to make it simple to parse type tags
from BTF.
However, a user would be allowed to construct a BTF without such
guarantees. Hence, add a pass to check that in modifier chains, type
tags only occur at the head of the chain, and then don't occur later in
the chain.
If we see a type tag, we can have one or more type tags preceding other
modifiers that then never have another type tag. If we see other
modifiers, all modifiers following them should never be a type tag.
Instead of having to walk chains we verified previously, we can remember
the last good modifier type ID which headed a good chain. At that point,
we must have verified all other chains headed by type IDs less than it.
This makes the verification process less costly, and it becomes a simple
O(n) pass.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220419164608.1990559-2-memxor@gmail.com
This problem can be reproduced with CONFIG_PERF_USE_VMALLOC enabled on
both x86_64 and aarch64 arch when using sysdig -B(using ebpf)[1].
sysdig -B works fine after rebuilding the kernel with
CONFIG_PERF_USE_VMALLOC disabled.
I tracked it down to the if condition event->rb->nr_pages != nr_pages
in perf_mmap is true when CONFIG_PERF_USE_VMALLOC is enabled where
event->rb->nr_pages = 1 and nr_pages = 2048 resulting perf_mmap to
return -EINVAL. This is because when CONFIG_PERF_USE_VMALLOC is
enabled, rb->nr_pages is always equal to 1.
Arch with CONFIG_PERF_USE_VMALLOC enabled by default:
arc/arm/csky/mips/sh/sparc/xtensa
Arch with CONFIG_PERF_USE_VMALLOC disabled by default:
x86_64/aarch64/...
Fix this problem by using data_page_nr()
[1] https://github.com/draios/sysdig
Fixes: 906010b213 ("perf_event: Provide vmalloc() based mmap() backing")
Signed-off-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220209145417.6495-1-xiezhipeng1@huawei.com
The warning in cfs_rq_is_decayed() triggered:
SCHED_WARN_ON(cfs_rq->avg.load_avg ||
cfs_rq->avg.util_avg ||
cfs_rq->avg.runnable_avg)
There exists a corner case in attach_entity_load_avg() which will
cause load_sum to be zero while load_avg will not be.
Consider se_weight is 88761 as per the sched_prio_to_weight[] table.
Further assume the get_pelt_divider() is 47742, this gives:
se->avg.load_avg is 1.
However, calculating load_sum:
se->avg.load_sum = div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
se->avg.load_sum = 1*47742/88761 = 0.
Then enqueue_load_avg() adds this to the cfs_rq totals:
cfs_rq->avg.load_avg += se->avg.load_avg;
cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
Resulting in load_avg being 1 with load_sum is 0, which will trigger
the WARN.
Fixes: f207934fb7 ("sched/fair: Align PELT windows between cfs_rq and its se")
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
[peterz: massage changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20220414090229.342-1-kuyo.chang@mediatek.com
Commit 7d08c2c911 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros
into functions") switched a bunch of BPF_PROG_RUN macros to inline
routines. This changed the semantic a bit. Due to arguments expansion
of macros, it used to be:
rcu_read_lock();
array = rcu_dereference(cgrp->bpf.effective[atype]);
...
Now, with with inline routines, we have:
array_rcu = rcu_dereference(cgrp->bpf.effective[atype]);
/* array_rcu can be kfree'd here */
rcu_read_lock();
array = rcu_dereference(array_rcu);
I'm assuming in practice rcu subsystem isn't fast enough to trigger
this but let's use rcu API properly.
Also, rename to lower caps to not confuse with macros. Additionally,
drop and expand BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY.
See [1] for more context.
[1] https://lore.kernel.org/bpf/CAKH8qBs60fOinFdxiiQikK_q0EcVxGvNTQoWvHLEUGbgcj1UYg@mail.gmail.com/T/#u
v2
- keep rcu locks inside by passing cgroup_bpf
Fixes: 7d08c2c911 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220414161233.170780-1-sdf@google.com
* irq/gpio-immutable:
: .
: First try at preventing the GPIO subsystem from abusing irq_chip
: data structures. The general idea is to have an irq_chip flag
: to tell the GPIO subsystem that these structures are immutable,
: and to convert drivers one by one.
: .
Documentation: Update the recommended pattern for GPIO irqchips
gpio: Update TODO to mention immutable irq_chip structures
pinctrl: amd: Make the irqchip immutable
pinctrl: msmgpio: Make the irqchip immutable
pinctrl: apple-gpio: Make the irqchip immutable
gpio: pl061: Make the irqchip immutable
gpio: tegra186: Make the irqchip immutable
gpio: Add helpers to ease the transition towards immutable irq_chip
gpio: Expose the gpiochip_irq_re[ql]res helpers
gpio: Don't fiddle with irqchips marked as immutable
Signed-off-by: Marc Zyngier <maz@kernel.org>
In order to move away from gpiolib messing with the internals of
unsuspecting irqchips, add a flag by which irqchips advertise
that they are not to be messed with, and do solemnly swear that
they correctly call into the gpiolib helpers when required.
Also nudge the users into converting their drivers to the
new model.
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220419141846.598305-2-maz@kernel.org
drm/drm-next has a build fix for the NewVision NV3052C panel
(drivers/gpu/drm/panel/panel-newvision-nv3052c.c), which needs to be
merged back to drm-misc-next, as it was failing to build there.
Signed-off-by: Paul Cercueil <paul@crapouillou.net>
No users left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To shared more code between swiotlb and xen-swiotlb, offer a
swiotlb_init_remap interface and add a remap callback to
swiotlb_init_late that will allow Xen to remap the buffer without
duplicating much of the logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Let the caller chose a zone to allocate from. This will be used
later on by the xen-swiotlb initialization on arm.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Power SVM wants to allocate a swiotlb buffer that is not restricted to
low memory for the trusted hypervisor scheme. Consolidate the support
for this into the swiotlb_init interface by adding a new flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Pass a boolean flag to indicate if swiotlb needs to be enabled based on
the addressing needs, and replace the verbose argument with a set of
flags, including one to force enable bounce buffering.
Note that this patch removes the possibility to force xen-swiotlb use
with the swiotlb=force parameter on the command line on x86 (arm and
arm64 never supported that), but this interface will be restored shortly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
swiotlb_late_init_with_default_size is an overly verbose name that
doesn't even catch what the function is doing, given that the size is
not just a default but the actual requested size.
Rename it to swiotlb_init_late.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Remove the bogus Xen override that was usually larger than the actual
size and just calculate the value on demand. Note that
swiotlb_max_segment still doesn't make sense as an interface and should
eventually be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
If force bouncing is enabled we can't release the buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Use the more specific is_swiotlb_active check instead of checking the
global swiotlb_force variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
- Fix the warning condition in __run_timers() which does not take into
account, that a CPU base (especially the deferrable base) has never a
timer armed on it and therefore the next_expiry value can become stale.
- Replace a WARN_ON() in the NOHZ code with a WARN_ON_ONCE() to prevent
endless spam in dmesg.
- Remove the double star from a comment which is not meant to be in
kernel-doc format.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb44gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXptD/4tKI9W4q0CAZnXw/G9cJdUK2u2VQGb
QfB0DMABuDLMlHUSO92T49IcfjQUR7hz36kHXY0OxwALaEB8ftap6yZLfI1864mZ
RFatWcTl71ZYY0SSio0rMnbDawTeN9N69WLPR+A6fDH0ahBqTwpeJoa1zj5w99LF
kFoEtI5vEBtf+m94PljBTclxOKTXmRRCzEZRpJ6L4Ze9SHR7aMKtBgp84M7a3IAw
IzXq0Kw0MRIm4P26uOpsVzi2KOq6+F1O1zy4L8cHVeU+Y6HC1qWLPXsyhipb0liE
6STFs1L7ANX+dpMW8kmorXc3RcN4xCuCY4Sb17W6Hqiq3syQ+Ho2G0CTheL0yOpM
GVqf7SVes3EGAJwWyKaFmhpgWGKcS93p18XGUrf5V+rVIU+ggQcc2vM9E8FiM03d
TA6T9+1u0TvtPlmd/KQ3OPLMEtu6/3R0fOLEXnv+Cb/PYIYMvPpb9cu6LjJkNqGv
6R+3dWORI7mYLZD+vS3Y5mrmB5OECGXTPNJ+aBr46PfdvSVLxvmOhFlNR7eQKHZ8
wFwOC6n4wkD5Y1VOVGEn0GVGWVkZXdlQzZOQGqBhJPjsfPfOaGacyLqCcsZH4V6w
kitKI8z7A7evLozZMQqKZskwLG1BTQ13YeElb5jJMpG3aTG7lSmPk+iGgLFMDWkc
Z2k4YD0Uw+92Tw==
=SffK
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A small set of fixes for the timers core:
- Fix the warning condition in __run_timers() which does not take
into account that a CPU base (especially the deferrable base) never
has a timer armed on it and therefore the next_expiry value can
become stale.
- Replace a WARN_ON() in the NOHZ code with a WARN_ON_ONCE() to
prevent endless spam in dmesg.
- Remove the double star from a comment which is not meant to be in
kernel-doc format"
* tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/sched: Fix non-kernel-doc comment
tick/nohz: Use WARN_ON_ONCE() to prevent console saturation
timers: Fix warning condition in __run_timers()
- Make the warning condition in flush_smp_call_function_queue() correct,
which checks a just emptied list head for being empty instead of
validating that there was no pending entry on the offlined CPU at all.
- The @cpu member of struct cpuhp_cpu_state is initialized when the CPU
hotplug thread for the upcoming CPU is created. That's too late because
the creation of the thread can fail and then the following rollback
operates on CPU0. Get rid of the CPU member and hand the CPU number to
the involved functions directly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb4skTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUYaD/9SWzwEqwKICAhWWSMtRg6cvtVTAKGL
xWXNb5ZxLA8MdG9CYu7WaUF7mU3E8o7kF/dItvkdj0/9mg/IqcVnP1CEOfp0ioHc
1cbAZShORfFlOowO1WR9Xe3MSZB2mXPHsjh3FWPeGTDxnSByMDXROMFyotxji+7q
g1ZsyVbq3+hSVTOhnhpxrh8MS3qcCAsgYtHKRnPjOE/tqbNXSmbhmeuN7QBKLVp6
AD6DaWj8jGtZ8yzbpm0Ve3ZsjH+1SdZAzm4yhh3FHMKsSOwB8yuNwq1oNbbEjxWu
mTg+LCBBMGYybSTa9sUpDG5Xfc3/dyWnzkGnmp7M8bfElrI3x4d+sMdVhfFRjEXB
xlXmNqRoExgZwifyoEJbmSbG2SgLiWqyLqgpUvLS/yHud6IurFbu36TrByYRMMsG
CyaiLMdWIlBibo+xGkVgnLyc2io98KeoLJoc35HM7VYyn9oNI4hUU9OJDIkG5D2i
lyS+qMTFInFSOm7L5u/SUTYe4I0sXYJ6uEXxhR/Gi3ITXb7aLOJyxhNtq//u1dmg
6vZHgs8HZWylG3LxVn1QERWotVlPuNPRilsUGCVMsCHjOFnoPy+eM3ANtR2x/uaE
XpQ58jnhkhPXZfmnsjDEILkWB67J0/9FujbcPvixH8SRvUgBB7xRRbuo9zXPchhw
VDKy1BCIcZ93gA==
=XNLP
-----END PGP SIGNATURE-----
Merge tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP fixes from Thomas Gleixner:
"Two fixes for the SMP core:
- Make the warning condition in flush_smp_call_function_queue()
correct, which checked a just emptied list head for being empty
instead of validating that there was no pending entry on the
offlined CPU at all.
- The @cpu member of struct cpuhp_cpu_state is initialized when the
CPU hotplug thread for the upcoming CPU is created. That's too late
because the creation of the thread can fail and then the following
rollback operates on CPU0. Get rid of the CPU member and hand the
CPU number to the involved functions directly"
* tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_state
smp: Fix offline cpu check in flush_smp_call_function_queue()
account that there can be an imbalance between present and possible CPUs,
which causes already assigned bits to be overwritten.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb4cETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWrDD/wOd9X4zQz6cZci2Xl091IhMlS2/sBm
afY98mTK56rX/BEgybMhzXZtBOSdDL2UDnBYznuYiPWof09+mqJGp4cLJAtVvK7E
cDoLdpqmFuQzO8Ka1lUgdgwR+d32pFmaBASbl/0COdl8qgNskQp+jD6BaztNxYbZ
HjkN2chMXqRlCOkKCthNajPEB3thmW2H0gDiZT/VEGYkQfp+HdoH7QhpIXt8HkWd
lb1g1bpDBUwsGrVJ2PlI+XvwmYdW/WOyebIq2D1XBhSS1RXVTt5xfXh/ZfhULzgn
2bnhvXU7XwDZRJNte1r8EFHSufXZUwhZIa9DRKHoy1fxNaxufRy0VJLEqcRHbcBO
xGIvXFfjmn+vmO6hwC1/DeP7Dz6WiTYoxTwJVpvgURqcEdr1HvnSy1CdMEOsAMUa
ZQDJzudV47a2qj80MRO1TSovW2gtDPdSWEUp5oOx2nYUZbqME146ubglZEcXLzq7
Y7BeMaO1VzXkfyk6ImtrBYY+qrXaT1ZQvCOGPrOltnlU0cBLkQMSUpjTstIcaOAp
iIvPLjVR1fDdgEBRck+ZAsM9snefz160wEbx23PzGuh2BenLpQfKZHIYhDO8wo2l
4uUmnp7VY0FS5I38f+rWNvqhFfJV1oda0dQfprP+s3Usz5oSegl764as5hjGcu+y
0Z/mCvaPODGctg==
=wSdp
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for the interrupt affinity spreading logic to take into
account that there can be an imbalance between present and possible
CPUs, which causes already assigned bits to be overwritten"
* tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Consider that CPUs on nodes can be unbalanced
On PREEMPT_RT kernel and KASAN is enabled. the kasan_record_aux_stack()
may call alloc_pages(), and the rt-spinlock will be acquired, if currently
in atomic context, will trigger warning:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 239, name: bootlogd
Preemption disabled at:
[<ffffffffbab1a531>] rt_mutex_slowunlock+0xa1/0x4e0
CPU: 3 PID: 239 Comm: bootlogd Tainted: G W 5.17.1-rt17-yocto-preempt-rt+ #105
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
Call Trace:
__might_resched.cold+0x13b/0x173
rt_spin_lock+0x5b/0xf0
get_page_from_freelist+0x20c/0x1610
__alloc_pages+0x25e/0x5e0
__stack_depot_save+0x3c0/0x4a0
kasan_save_stack+0x3a/0x50
__kasan_record_aux_stack+0xb6/0xc0
kasan_record_aux_stack+0xe/0x10
irq_work_queue_on+0x6a/0x1c0
pull_rt_task+0x631/0x6b0
do_balance_callbacks+0x56/0x80
__balance_callbacks+0x63/0x90
rt_mutex_setprio+0x349/0x880
rt_mutex_slowunlock+0x22a/0x4e0
rt_spin_unlock+0x49/0x80
uart_write+0x186/0x2b0
do_output_char+0x2e9/0x3a0
n_tty_write+0x306/0x800
file_tty_write.isra.0+0x2af/0x450
tty_write+0x22/0x30
new_sync_write+0x27c/0x3a0
vfs_write+0x3f7/0x5d0
ksys_write+0xd9/0x180
__x64_sys_write+0x43/0x50
do_syscall_64+0x44/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix it by using kasan_record_aux_stack_noalloc() to avoid the call to
alloc_pages().
Link: https://lkml.kernel.org/r/20220402142555.2699582-1-qiang1.zhang@intel.com
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_SYSCTL and CONFIG_DYNAMIC_FTRACE is n, build warns:
kernel/trace/ftrace.c:7912:13: error: ‘is_permanent_ops_registered’ defined but not used [-Werror=unused-function]
static bool is_permanent_ops_registered(void)
^~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/trace/ftrace.c:89:12: error: ‘last_ftrace_enabled’ defined but not used [-Werror=unused-variable]
static int last_ftrace_enabled;
^~~~~~~~~~~~~~~~~~~
Move is_permanent_ops_registered() to ifdef block and mark last_ftrace_enabled as
__maybe_unused to fix this.
Fixes: 7cde53da38a3 ("ftrace: move sysctl_ftrace_enabled to ftrace.c")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJYLboTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoW94D/wPA3Pf3Dk5MBILSSd9RFwHQK9z+X2l
Np60Cuh0aetlwNILGafJ6VG34zwjR4Z5eutAM1zy14ehRXmQSEoE5OG5ixLOg83o
8IPZRKwdG7C3WnLn+s0OdTEpgGaDTxHPOrJOXsgG9G5NwVzUEad+srePSxhDSnLn
LCKWaLnGWGC2ymbz/0TTZkhzkVyOEElY7SubF6qn8J4T9XPjdYUsZ/r1cNGBSfFD
uViQYZpjUpNFmalwNldpgIZidDBHvTnlaE610jZHQKEczs9mg2EpmAPyb/e5MKcs
tMmFcFQoNX3o0MAdRmRQD46fDAD0RKb9mP2LF+52/QqI6V8Pvk5OV/RXl/WGR/1B
KxZ6qQFXasoXujfaELAnRSRve+k2xrsyEY6yikg3G/JmrMgCSVgvfSr0CTFbbdnG
pAXC4eO94SqyCYdVU9DJZO7fhwREGF8P7qI01PYGF1B4GseQ8gaNuXCTBrkYsio2
60Nv7GFiajRjUOdPNvhnWAMvnLRpnItgV3yB7nXAt15io9AkdCXsqPq3x69cxYYU
X+CQZPG0l+tG/BERYdjAjr7/Ij0NCqteKz7UqHIF0747DLQHagms7dmk/UWsKRfU
bFyCKvLttAFYRU5lPEqjTIDfPJcqoePreqB1xRRmeerV4id26If59a9yMkFcUuF/
ZQAkAZGCc82YJA==
=mALw
-----END PGP SIGNATURE-----
Merge tag 'tai-for-tracing' into timers/core
Pull in the NMI safe TAI accessor which was provided for the tracing tree
to prepare for further changes in this area.
Introduce fast/NMI safe accessor to clock tai for tracing. The Linux kernel
tracing infrastructure has support for using different clocks to generate
timestamps for trace events. Especially in TSN networks it's useful to have TAI
as trace clock, because the application scheduling is done in accordance to the
network time, which is based on TAI. With a tai trace_clock in place, it becomes
very convenient to correlate network activity with Linux kernel application
traces.
Use the same implementation as ktime_get_boot_fast_ns() does by reading the
monotonic time and adding the TAI offset. The same limitations as for the fast
boot implementation apply. The TAI offset may change at run time e.g., by
setting the time or using adjtimex() with an offset. However, these kind of
offset changes are rare events. Nevertheless, the user has to be aware and deal
with it in post processing.
An alternative approach would be to use the same implementation as
ktime_get_real_fast_ns() does. However, this requires to add an additional u64
member to the tk_read_base struct. This struct together with a seqcount is
designed to fit into a single cache line on 64 bit architectures. Adding a new
member would violate this constraint.
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220414091805.89667-2-kurt@linutronix.de
Although setting the affinity of an interrupt to a set of CPUs that doesn't
have any online CPU is generally frowned apon, there are a few limited
cases where such affinity is set from a CPUHP notifier, setting the
affinity to a CPU that isn't online yet.
The saving grace is that this is always done using the 'force' attribute,
which gives a hint that the affinity setting can be outside of the online
CPU mask and the callsite set this flag with the knowledge that the
underlying interrupt controller knows to handle it.
This restores the expected behaviour on Marek's system.
Fixes: 33de0aa4ba ("genirq: Always limit the affinity to online CPUs")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/4b7fc13c-887b-a664-26e8-45aed13f048a@samsung.com
Link: https://lore.kernel.org/r/20220414140011.541725-1-maz@kernel.org
x86-32 was the last architecture that implemented separate user and
kernel registers.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220325153953.162643-3-brgerst@gmail.com
When we looked into FIO performance with swiotlb enabled in VM, we found
swiotlb_bounce() is always called one more time than expected for each DMA
read request.
It turns out that the bounce buffer is copied to original DMA buffer twice
after the completion of a DMA request (one is done by in
dma_direct_sync_single_for_cpu(), the other by swiotlb_tbl_unmap_single()).
But the content in bounce buffer actually doesn't change between the two
rounds of copy. So, one round of copy is redundant.
Pass DMA_ATTR_SKIP_CPU_SYNC flag to swiotlb_tbl_unmap_single() to
skip the memory copy in it.
This fix increases FIO 64KB sequential read throughput in a guest with
swiotlb=force by 5.6%.
Fixes: 55897af630 ("dma-direct: merge swiotlb_dma_ops into the dma_direct code")
Reported-by: Wang Zhaoyang1 <zhaoyang1.wang@intel.com>
Reported-by: Gao Liang <liang.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We're moving sysctls out of kernel/sysctl.c as it is a mess. We
already moved all filesystem sysctls out. And with time the goal
is to move all sysctls out to their own subsystem/actual user.
kernel/sysctl.c has grown to an insane mess and its easy to run
into conflicts with it. The effort to move them out into various
subsystems is part of this.
Signed-off-by: Yan Zhu <zhuyan34@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/bpf/20220407070759.29506-1-zhuyan34@huawei.com
Rather than waiting until a CPU is first brought online, do the
initialisation of the cpuhp_cpu_state structure for each CPU during the
__init phase. This saves a (small) amount of non-__init memory and
avoids potential confusion about when the cpuhp_cpu_state struct is
valid.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220411152233.474129-3-steven.price@arm.com
Currently the setting of the 'cpu' member of struct cpuhp_cpu_state in
cpuhp_create() is too late as it is used earlier in _cpu_up().
If kzalloc_node() in __smpboot_create_thread() fails then the rollback will
be done with st->cpu==0 causing CPU0 to be erroneously set to be dying,
causing the scheduler to get mightily confused and throw its toys out of
the pram.
However the cpu number is actually available directly, so simply remove
the 'cpu' member and avoid the problem in the first place.
Fixes: 2ea46c6fc9 ("cpumask/hotplug: Fix cpu_dying() state tracking")
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220411152233.474129-2-steven.price@arm.com
The check in flush_smp_call_function_queue() for callbacks that are sent
to offline CPUs currently checks whether the queue is empty.
However, flush_smp_call_function_queue() has just deleted all the
callbacks from the queue and moved all the entries into a local list.
This checks would only be positive if some callbacks were added in the
short time after llist_del_all() was called. This does not seem to be
the intention of this check.
Change the check to look at the local list to which the entries were
moved instead of the queue from which all the callbacks were just
removed.
Fixes: 8d056c48e4 ("CPU hotplug, smp: flush any pending IPI callbacks before CPU offline")
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220319072015.1495036-1-namit@vmware.com
Change the comment to a normal (non-kernel-doc) comment to avoid
these kernel-doc warnings:
kernel/power/snapshot.c:335: warning: This comment starts with '/**', but
isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Data types related to memory bitmaps.
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add parameter description in alloc_rtree_node() kernel-doc
comment and fix several inconsistent function name descriptions.
Remove some warnings found by running scripts/kernel-doc,
which is caused by using 'make W=1'.
kernel/power/snapshot.c:438: warning: Function parameter or member
'gfp_mask' not described in 'alloc_rtree_node'
kernel/power/snapshot.c:438: warning: Function parameter or member
'safe_needed' not described in 'alloc_rtree_node'
kernel/power/snapshot.c:438: warning: Function parameter or member 'ca'
not described in 'alloc_rtree_node'
kernel/power/snapshot.c:438: warning: Function parameter or member
'list' not described in 'alloc_rtree_node'
kernel/power/snapshot.c:916: warning: expecting prototype for
memory_bm_rtree_next_pfn(). Prototype was for memory_bm_next_pfn()
instead
kernel/power/snapshot.c:1947: warning: expecting prototype for
alloc_highmem_image_pages(). Prototype was for alloc_highmem_pages()
instead
kernel/power/snapshot.c:2230: warning: expecting prototype for load
header(). Prototype was for load_header() instead
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
[ rjw: Comment adjustments to avoid line breaks ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently pm_pr_dbg() is used to filter kernel pm debug messages based
on pm_debug_messages_on flag. The problem is if we enable/disable this
flag it will affect all pm_pr_dbg() calls at once, so we can't
individually control them.
This patch changes pm_pr_dbg() implementation as such:
- If pm_debug_messages_on is enabled, print the message.
- If pm_debug_messages_on is disabled and CONFIG_DYNAMIC_DEBUG is
enabled, only print the messages explicitly enabled on
/sys/kernel/debug/dynamic_debug/control.
- If pm_debug_messages_on is disabled and CONFIG_DYNAMIC_DEBUG is
disabled, don't print the message.
Signed-off-by: David Cohen <dacohen@pm.me>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The macro -DDEBUG is broadly enabled on kernel/power/ directory if
CONFIG_DYNAMIC_DEBUG is enabled. As side effect all debug messages using
pr_debug() and dev_dbg() are enabled by default on dynamic debug.
We're reworking pm_pr_dbg() to support dynamic debug, where pm_pr_dbg()
will print message if either pm_debug_messages_on flag is set or if it's
explicitly enabled on dynamic debug's control. That means if we let
-DDEBUG broadly set, the pm_debug_messages_on flag will be bypassed by
default on pm_pr_dbg() if dynamic debug is also enabled.
The files that directly use pr_debug() and dev_dbg() on kernel/power/ are:
- swap.c
- snapshot.c
- energy_model.c
And those files do not use pm_pr_dbg(). So if we limit -DDEBUG to them,
we keep the same functional behavior while allowing the pm_pr_dbg()
refactor.
Signed-off-by: David Cohen <dacohen@pm.me>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Energy Model gets more bits used in 'flags'. Avoid adding another
debugfs file just to print what is the status of a new flag. Simply
remove old debugfs files and add one generic which prints all flags
as a hex value.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The .active_power() callback passes the device pointer when it's called.
Aligned with a convetion present in other subsystems and pass the 'dev'
as a first argument. It looks more cleaner.
Adjust all affected drivers which implement that API callback.
Suggested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Energy Model (EM) allows to provide the 'cost' values when the device
driver provides the .get_cost() optional callback. This removes
restriction which is in the EM calculation function of the 'cost'
for each performance state. Now, the driver is in charge of providing
the right values which are then used by Energy Aware Scheduler.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Energy Model (EM) can be used on platforms which are missing real
power information. Those platforms would implement .get_cost() which
populates needed values for the Energy Aware Scheduler (EAS). The EAS
doesn't use 'power' fields from EM, but other frameworks might use them.
Thus, to avoid miss-usage of this specific type of EM, introduce a new
flags which can be checked by other frameworks.
Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A CPU will not show up in virtualized environment which includes an
Enclave. The VM splits its resources into a primary VM and a Enclave
VM. While the Enclave is active, the hypervisor will ignore all requests to
bring up a CPU and this CPU will remain in CPU_UP_PREPARE state.
The kernel will wait up to ten seconds for CPU to show up (do_boot_cpu())
and then rollback the hotplug state back to CPUHP_OFFLINE leaving the CPU
state in CPU_UP_PREPARE. The CPU state is set back to CPUHP_TEARDOWN_CPU
during the CPU_POST_DEAD stage.
After the Enclave VM terminates, the primary VM can bring up the CPU
again.
Allow to bring up the CPU if it is in the CPU_UP_PREPARE state.
[bigeasy: Rewrite commit description.]
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Henry Wang <Henry.Wang@arm.com>
Link: https://lore.kernel.org/r/20220209080214.1439408-3-bigeasy@linutronix.de
A report of a 12-jiffy normal RCU CPU stall warning raises interesting
questions about the nature of time on the offending system. This commit
instruments rcu_sched_clock_irq(), which is RCU's hook into the
scheduling-clock interrupt, checking for the jiffies counter going
backwards.
Reported-by: Saravanan D <sarvanand@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Suppose we have a kernel built with both CONFIG_RCU_STRICT_GRACE_PERIOD=y
and CONFIG_PREEMPT=y. Suppose further that an RCU reader from which RCU
core needs a quiescent state ends in rcu_preempt_deferred_qs_irqrestore().
This function will then invoke rcu_report_qs_rdp() in order to immediately
report that quiescent state. Unfortunately, it will not have cleared
that reader's CPU's rcu_data structure's ->cpu_no_qs.b.norm field.
As a result, rcu_report_qs_rdp() will take an early exit because it
will believe that this CPU has not yet encountered a quiescent state,
and there will be no reporting of the current quiescent state.
This commit therefore causes rcu_preempt_deferred_qs_irqrestore() to
clear the ->cpu_no_qs.b.norm field before invoking rcu_report_qs_rdp().
Kudos to Boqun Feng and Neeraj Upadhyay for helping with analysis of
this issue!
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The TASKS_RUDE_RCU does not select IRQ_WORK, which can result in build
failures for kernels that do not otherwise select IRQ_WORK. This commit
therefore causes the TASKS_RUDE_RCU Kconfig option to select IRQ_WORK.
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcutorture module has an rcu_torture_writer task that repeatedly
performs writes, synchronizations, and deletes. There is a corner-case
check in rcu_torture_writer() wherein if nsynctypes is 0, a warning is
issued and the task waits to be stopped via a call to
torture_kthread_stopping() rather than performing any work.
There should be a return statement following this call to
torture_kthread_stopping(), as the intention with issuing the call to
torture_kthread_stopping() in the first place is to avoid the
rcu_torture_writer task from performing any work. Some of the work may even
be dangerous to perform, such as potentially causing a #DE due to
nsynctypes being used in a modulo operator when querying for sync updates
to issue.
This patch adds the missing return call. As a bonus, it also fixes a
checkpatch warning that was emitted due to the WARN_ONCE() call using the
name of the function rather than __func__.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcutorture module is used to run torture tests that validate RCU.
rcutorture takes a variety of module parameters that configure the
functionality of the test. Amongst these parameters are the types of
synchronization mechanisms that the rcu_torture_writer and
rcu_torture_fakewriter tasks may use, and the torture_type of the run which
determines what read and sync operations are used by the various writer and
reader tasks that run throughout the test.
When the module is configured to only use sync types for which the
specified torture_type does not implement the necessary operations, we can
end up in a state where nsynctypes is 0. This is not an erroneous state,
but it currently crashes the kernel with a #DE due to nsynctypes being used
with a modulo operator in rcu_torture_fakewriter().
Here is an example of such a #DE:
$ insmod ./rcutorture.ko gp_cond=1 gp_cond_exp=0 gp_exp=0 gp_poll_exp=0
gp_normal=0 gp_poll=0 gp_poll_exp=0 verbose=9999 torture_type=trivial
...
[ 8536.525096] divide error: 0000 [#1] PREEMPT SMP PTI
[ 8536.525101] CPU: 30 PID: 392138 Comm: rcu_torture_fak Kdump: loaded Tainted: G S 5.17.0-rc1-00179-gc8c42c80febd #24
[ 8536.525105] Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020
[ 8536.525106] RIP: 0010:rcu_torture_fakewriter+0xf1/0x2d0 [rcutorture]
[ 8536.525121] Code: 00 31 d2 8d 0c f5 00 00 00 00 48 63 c9 48 f7 f1 48 85 d2 0f 84 79 ff ff ff 48 89 e7 e8 78 78 01 00 48 63 0d 29 ca 00 00 31 d2 <48> f7 f1 8b 04 95 00 05 4e a0 83 f8 06 0f 84 ad 00 00 00 7f 1f 83
[ 8536.525124] RSP: 0018:ffffc9000777fef0 EFLAGS: 00010246
[ 8536.525127] RAX: 00000000223d006e RBX: cccccccccccccccd RCX: 0000000000000000
[ 8536.525130] RDX: 0000000000000000 RSI: ffffffff824315b9 RDI: ffffc9000777fef0
[ 8536.525132] RBP: ffffc9000487bb30 R08: 0000000000000002 R09: 000000000002a580
[ 8536.525134] R10: ffffffff82c5f920 R11: 0000000000000000 R12: ffff8881a2c35d00
[ 8536.525136] R13: ffff8881540c8d00 R14: ffffffffa04d39d0 R15: 0000000000000000
[ 8536.525137] FS: 0000000000000000(0000) GS:ffff88903ff80000(0000) knlGS:0000000000000000
[ 8536.525140] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8536.525142] CR2: 00007f839f022000 CR3: 0000000002c0a006 CR4: 00000000007706e0
[ 8536.525144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8536.525145] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8536.525147] PKRU: 55555554
[ 8536.525148] Call Trace:
[ 8536.525150] <TASK>
[ 8536.525153] kthread+0xe8/0x110
[ 8536.525161] ? kthread_complete_and_exit+0x20/0x20
[ 8536.525167] ret_from_fork+0x22/0x30
[ 8536.525174] </TASK>
The solution is to gracefully handle the case of nsynctypes being 0 in
rcu_torture_fakewriter() by not performing any work. This is already being
done in rcu_torture_writer(), though there is a missing return on that path
which will be fixed in a subsequent patch.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The scftorture test module's scf_handler() function is supposed to provide
three different distributions of short delays (including "no delay") and
one distribution of long delays, if specified by the scftorture.longwait
module parameter. However, the second of the two non-zero-wait short delays
is disabled due to the first such delay's "goto out" not being enclosed in
the "then" clause with the "udelay()".
This commit therefore adjusts the code to provide the intended set of
delays.
Fixes: e9d338a0b1 ("scftorture: Add smp_call_function() torture test")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tree RCU supports grace-period delays using the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot
parameters. These delays are strictly for debugging purposes, and have
proven quite effective at exposing bugs involving race with CPU-hotplug
operations. However, these delays can result in false positives when
used in conjunction with callback flooding, for example, those generated
by the rcutorture.fwd_progress kernel boot parameter.
This commit therefore suppresses grace-period delays while callback
flooding is in progress.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the cpu_possible_mask is sparse (for example, if bits are set only for
CPUs 0, 4, 8, ...), then rcu_tasks_invoke_cbs() will access per-CPU data
for a CPU not in cpu_possible_mask. It makes these accesses while doing
a workqueue-based binary search for non-empty callback lists. Although
this search must pass through CPUs not represented in cpu_possible_mask,
it has no need to check the callback list for such CPUs.
This commit therefore changes the rcu_tasks_invoke_cbs() function's
binary search so as to only check callback lists for CPUs present in
cpu_possible_mask.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the rcupdate.rcu_task_enqueue_lim kernel boot parameter is set to
something greater than 1 and less than nr_cpu_ids, the code attempts to
use a subset of the CPU's RCU Tasks callback lists. This works, but only
if the cpu_possible_mask is contiguous. If there are "holes" in this
mask, the callback-enqueue code might attempt to access a non-existent
per-CPU ->rtcpu variable for a non-existent CPU. For example, if only
CPUs 0, 4, 8, 12, 16 and so on are in cpu_possible_mask, specifying
rcupdate.rcu_task_enqueue_lim=4 would cause the code to attempt to
use callback queues for non-existent CPUs 1, 2, and 3. Because such
systems have existed in the past and might still exist, the code needs
to gracefully handle this situation.
This commit therefore checks to see whether the desired CPU is present
in cpu_possible_mask, and, if not, searches for the next CPU. This means
that the systems administrator of a system with a sparse cpu_possible_mask
will need to account for this sparsity when specifying the value of
the rcupdate.rcu_task_enqueue_lim kernel boot parameter. For example,
setting this parameter to the value 4 will use only CPUs 0 and 4, which
CPU 4 getting three times the callback load of CPU 0.
This commit assumes that bit (nr_cpu_ids - 1) is always set in
cpu_possible_mask.
Link: https://lore.kernel.org/lkml/CANn89iKaNEwyNZ=L_PQnkH0LP_XjLYrr_dpyRKNNoDJaWKdrmg@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the show_rcu_tasks_generic_gp_kthread() function only looks
at CPU 0's callback lists. Although this is not fatal, it can confuse
debugging efforts in cases where any of the Tasks RCU flavors are in
per-CPU queueing mode. This commit therefore causes this function to
scan all CPUs' callback queues.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The use of hrtimers for RCU-tasks grace-period delays works well in
general, but can result in excessive grace-period delays for some
corner-case workloads. This commit therefore reverts to the use of
timers for non-RT kernels to mitigate those grace-period delays.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The synchronous RCU-tasks grace-period-wait primitives invoke
schedule_timeout_idle() to give readers a chance to exit their
read-side critical sections. Unfortunately, this fails during early
boot on PREEMPT_RT because PREEMPT_RT relies solely on ksoftirqd to run
timer handlers. Because ksoftirqd cannot operate until its kthreads
are spawned, there is a brief period of time following scheduler
initialization where PREEMPT_RT cannot run the timer handlers that
schedule_timeout_idle() relies on, resulting in a hang.
To avoid this boot-time hang, this commit replaces schedule_timeout_idle()
with schedule_hrtimeout(), so that the timer expires in hardirq context.
This is ensures that the timer fires even on PREEMPT_RT throughout the
irqs-enabled portions of boot as well as during runtime.
The timer is set to expire between fract and fract + HZ / 2 jiffies in
order to align with any other timers that might expire during that time,
thus reducing the number of wakeups.
Note that RCU-tasks grace periods are infrequent, so the use of hrtimer
should be fine. In contrast, in common-case code, user of hrtimer
could result in performance issues.
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The main Tasks RCU quiescent state is voluntary context switch. However,
userspace execution is also a valid quiescent state, and is a valuable one
for userspace applications that spin repeatedly executing light-weight
non-sleeping system calls. Currently, such an application can delay a
Tasks RCU grace period for many tens of seconds.
This commit therefore enlists the aid of the scheduler-clock interrupt to
provide a Tasks RCU quiescent state when it interrupted a task executing
in userspace.
[ paulmck: Apply feedback from kernel test robot. ]
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Neil Spring <ntspring@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The waitqueue used by rcu_tasks_kthread() has always only one waiter.
With a guaranteed only one waiter, this can be replaced with rcuwait
which is smaller and simpler. With rcuwait based wake counterpart, the
irqwork function (call_rcu_tasks_iw_wakeup()) can be invoked hardirq
context because it is only a wake up and no sleeping locks are involved
(unlike the wait_queue_head).
As a side effect, this is also one piece of the puzzle to pass the RCU
selftest at early boot on PREEMPT_RT.
Replace wait_queue_head with rcuwait and let the irqwork run in hardirq
context on PREEMPT_RT.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU-tasks stall-warning messages are printed after the grace period is ten
minutes old. Unfortunately, most of us will have rebooted the system in
response to an apparently-hung command long before the ten minutes is up,
and will thus see what looks to be a silent hang.
This commit therefore adds pr_info() messages that are printed earlier.
These should avoid being classified as errors, but should give impatient
users a hint. These are controlled by new rcupdate.rcu_task_stall_info
and rcupdate.rcu_task_stall_info_mult kernel-boot parameters. The former
defines the initial delay in jiffies (defaulting to 10 seconds) and the
latter defines the multiplier (defaulting to 3). Thus, by default, the
first message will appear 10 seconds into the RCU-tasks grace period,
the second 40 seconds in, and the third 160 seconds in. There would be
a fourth at 640 seconds in, but the stall warning message appears 600
seconds in, and once a stall warning is printed for a given grace period,
no further informational messages are printed.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
While booting secondary CPUs, cpus_read_[lock/unlock] is not keeping
online cpumask stable. The transient online mask results in below
calltrace.
[ 0.324121] CPU1: Booted secondary processor 0x0000000001 [0x410fd083]
[ 0.346652] Detected PIPT I-cache on CPU2
[ 0.347212] CPU2: Booted secondary processor 0x0000000002 [0x410fd083]
[ 0.377255] Detected PIPT I-cache on CPU3
[ 0.377823] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]
[ 0.379040] ------------[ cut here ]------------
[ 0.383662] WARNING: CPU: 0 PID: 10 at kernel/workqueue.c:3084 __flush_work+0x12c/0x138
[ 0.384850] Modules linked in:
[ 0.385403] CPU: 0 PID: 10 Comm: rcu_tasks_rude_ Not tainted 5.17.0-rc3-v8+ #13
[ 0.386473] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[ 0.387289] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 0.388308] pc : __flush_work+0x12c/0x138
[ 0.388970] lr : __flush_work+0x80/0x138
[ 0.389620] sp : ffffffc00aaf3c60
[ 0.390139] x29: ffffffc00aaf3d20 x28: ffffffc009c16af0 x27: ffffff80f761df48
[ 0.391316] x26: 0000000000000004 x25: 0000000000000003 x24: 0000000000000100
[ 0.392493] x23: ffffffffffffffff x22: ffffffc009c16b10 x21: ffffffc009c16b28
[ 0.393668] x20: ffffffc009e53861 x19: ffffff80f77fbf40 x18: 00000000d744fcc9
[ 0.394842] x17: 000000000000000b x16: 00000000000001c2 x15: ffffffc009e57550
[ 0.396016] x14: 0000000000000000 x13: ffffffffffffffff x12: 0000000100000000
[ 0.397190] x11: 0000000000000462 x10: ffffff8040258008 x9 : 0000000100000000
[ 0.398364] x8 : 0000000000000000 x7 : ffffffc0093c8bf4 x6 : 0000000000000000
[ 0.399538] x5 : 0000000000000000 x4 : ffffffc00a976e40 x3 : ffffffc00810444c
[ 0.400711] x2 : 0000000000000004 x1 : 0000000000000000 x0 : 0000000000000000
[ 0.401886] Call trace:
[ 0.402309] __flush_work+0x12c/0x138
[ 0.402941] schedule_on_each_cpu+0x228/0x278
[ 0.403693] rcu_tasks_rude_wait_gp+0x130/0x144
[ 0.404502] rcu_tasks_kthread+0x220/0x254
[ 0.405264] kthread+0x174/0x1ac
[ 0.405837] ret_from_fork+0x10/0x20
[ 0.406456] irq event stamp: 102
[ 0.406966] hardirqs last enabled at (101): [<ffffffc0093c8468>] _raw_spin_unlock_irq+0x78/0xb4
[ 0.408304] hardirqs last disabled at (102): [<ffffffc0093b8270>] el1_dbg+0x24/0x5c
[ 0.409410] softirqs last enabled at (54): [<ffffffc0081b80c8>] local_bh_enable+0xc/0x2c
[ 0.410645] softirqs last disabled at (50): [<ffffffc0081b809c>] local_bh_disable+0xc/0x2c
[ 0.411890] ---[ end trace 0000000000000000 ]---
[ 0.413000] smp: Brought up 1 node, 4 CPUs
[ 0.413762] SMP: Total of 4 processors activated.
[ 0.414566] CPU features: detected: 32-bit EL0 Support
[ 0.415414] CPU features: detected: 32-bit EL1 Support
[ 0.416278] CPU features: detected: CRC32 instructions
[ 0.447021] Callback from call_rcu_tasks_rude() invoked.
[ 0.506693] Callback from call_rcu_tasks() invoked.
This commit therefore fixes this issue by applying a single-CPU
optimization to the RCU Tasks Rude grace-period process. The key point
here is that the purpose of this RCU flavor is to force a schedule on
each online CPU since some past event. But the rcu_tasks_rude_wait_gp()
function runs in the context of the RCU Tasks Rude's grace-period kthread,
so there must already have been a context switch on the current CPU since
the call to either synchronize_rcu_tasks_rude() or call_rcu_tasks_rude().
So if there is only a single CPU online, RCU Tasks Rude's grace-period
kthread does not need to anything at all.
It turns out that the rcu_tasks_rude_wait_gp() function's call to
schedule_on_each_cpu() causes problems during early boot. During that
time, there is only one online CPU, namely the boot CPU. Therefore,
applying this single-CPU optimization fixes early-boot instances of
this problem.
Link: https://lore.kernel.org/lkml/20220210184319.25009-1-treasure4paddy@gmail.com/T/
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Padmanabha Srinivasaiah <treasure4paddy@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_spawn_gp_kthread() function is called as an early initcall, which
means that SMP initialization hasn't happened yet and only the boot CPU is
online. Therefore, create only the NOCB kthreads related to the boot CPU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_spawn_gp_kthread() function is called as an early initcall,
which means that SMP initialization hasn't happened yet and only the
boot CPU is online. Therefore, create only the boost kthread for the
leaf node of the boot CPU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_init() function is called way before SMP is initialized and
therefore only the boot CPU should be online at this stage.
Simplify the boot per-cpu initialization accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit moves the RCU nocb initialization witness within rcu_state
to consolidate RCU's global state.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_is_nocb_cpu() function is no longer used, so this commmit
removes it.
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit instruments the acquisitions of the srcu_struct structure's
->lock, enabling the initiation of a transition from SRCU_SIZE_SMALL
to SRCU_SIZE_BIG when sufficient contention is experienced. The
instrumentation counts the number of trylock failures within the confines
of a single jiffy. If that number exceeds the value specified by the
srcutree.small_contention_lim kernel boot parameter (which defaults to
100), and if the value specified by the srcutree.convert_to_big kernel
boot parameter has the 0x10 bit set (defaults to 0), then a transition
will be automatically initiated.
By default, there will never be any transitions, so that none of the
srcu_struct structures ever gains an srcu_node array.
The useful values for srcutree.convert_to_big are:
0x00: Never convert.
0x01: Always convert at init_srcu_struct() time.
0x02: Convert when rcutorture prints its first round of statistics.
0x03: Decide conversion approach at boot given system size.
0x10: Convert if contention is encountered.
0x12: Convert if contention is encountered or when rcutorture prints
its first round of statistics, whichever comes first.
The value 0x11 acts the same as 0x01 because the conversion happens
before there is any chance of contention.
[ paulmck: Apply "static" feedback from kernel test robot. ]
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Once there are contention-initiated size transitions, it will be
possible for rcutorture to initiate a transition at the same time
as a contention-initiated transition. This commit therefore creates
a concurrency-safe helper function named srcu_transition_to_big() to
safely initiate size transitions.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a comment explaining why an unprotected call to
list_add() from srcu_funnel_gp_start() can be safe. TL;DR: It is only
called during very early boot when we don't have no steeking concurrency!
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When an srcu_struct structure is created (but not in a kernel module)
by DEFINE_SRCU() and friends, the per-CPU srcu_data structure is
statically allocated. In all other cases, that structure is obtained
from alloc_percpu(), in which case cleanup_srcu_struct() must invoke
free_percpu() on the resulting ->sda pointer in the srcu_struct pointer.
Which it does.
Except that it also invokes free_percpu() on the ->sda pointer
referencing the statically allocated per-CPU srcu_data structures.
Which free_percpu() is surprisingly OK with.
This commit nevertheless stops cleanup_srcu_struct() from freeing
statically allocated per-CPU srcu_data structures.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
You really shouldn't invoke srcu_torture_stats_print() after invoking
cleanup_srcu_struct(), but there is really no reason to get a
compiler-obfuscated per-CPU-variable NULL pointer dereference as the
diagnostic. This commit therefore checks for NULL ->sda and makes a
more polite console-message complaint in that case.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds an srcu_tree.convert_to_big kernel parameter that either
refuses to convert at all (0), converts immediately at init_srcu_struct()
time (1), or lets rcutorture convert it (2). An addition contention-based
dynamic conversion choice will be added, along with documentation.
[ paulmck: Apply callback-scanning feedback from Neeraj Upadhyay. ]
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
For configurations where snp node tree is not initialized at
init time (added in subsequent commits), srcu_funnel_gp_start()
and srcu_funnel_exp_start() can potential traverse and observe
the snp nodes' transient (uninitialized) states. This can potentially
happen, when init_srcu_struct_nodes() initialization of sdp->mynode
races with srcu_funnel_gp_start() and srcu_funnel_exp_start()
Consider the case below where srcu_funnel_gp_start() observes
sdp->mynode to be not NULL and uses an uninitialized sdp->grpmask
P1 P2
init_srcu_struct_nodes() void srcu_funnel_gp_start(...)
{
for_each_possible_cpu(cpu) {
...
sdp->mynode = &snp_first[...];
for (snp = sdp->mynode;...) struct srcu_node *snp_leaf =
smp_load_acquire(&sdp->mynode)
... if (snp_leaf) {
for (snp = snp_leaf; ...)
...
if (snp == snp_leaf)
snp->srcu_data_have_cbs[idx] |=
sdp->grpmask;
sdp->grpmask =
1 << (cpu - sdp->mynode->grplo);
}
}
Similarly, init_srcu_struct_nodes() and srcu_funnel_exp_start() can
race, where srcu_funnel_exp_start() could observe state of snp lock
before spin_lock_init().
P1 P2
init_srcu_struct_nodes() void srcu_funnel_exp_start(...)
{
srcu_for_each_node_breadth_first(ssp, snp) { for (; ...) {
spin_lock_...(snp, )
spin_lock_init(&ACCESS_PRIVATE(snp, lock));
...
}
for_each_possible_cpu(cpu) {
...
sdp->mynode = &snp_first[...];
To avoid these issues, ensure that snp node tree initialization is
complete i.e. after SRCU_SIZE_WAIT_BARRIER srcu_size_state is reached,
before traversing the tree. Given that srcu_funnel_gp_start() and
srcu_funnel_exp_start() are called within SRCU read side critical
sections, this check is safe, in the sense that all callbacks are
enqueued on CPU0 srcu_cblist until SRCU_SIZE_WAIT_CALL is entered,
and these read side critical sections (containing srcu_funnel_gp_start()
and srcu_funnel_exp_start()) need to complete, before SRCU_SIZE_WAIT_CALL
is reached.
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, tree SRCU relies on the srcu_node structures being initialized
at the same time that the srcu_struct itself is initialized, and thus
use the initial grace-period sequence number as the initial value for
the srcu_node structure's ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp
fields. Although this has a high probability of also working when the
srcu_node array is allocated and initialized at some random later time,
it would be better to avoid leaving such things to chance.
This commit therefore initializes these fields with 0x2, which is a
recognizable invalid value. It then adds the required checks for this
invalid value in order to avoid confusion on long-running kernels
(especially those on 32-bit systems) that allocate and initialize
srcu_node arrays late in life.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, srcu_funnel_gp_start() tests snp->srcu_have_cbs[idx] and then
separately assigns it to the snp_seq local variable. This commit does
the assignment earlier to simplify the code a bit. While in the area,
this commit also takes advantage of the 100-character line limit to put
the call to srcu_schedule_cbs_sdp() on a single line.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds the numeric and string version of ->srcu_size_state to
the Tree-SRCU-specific portion of the rcutorture output.
[ paulmck: Apply feedback from kernel test robot and Dan Carpenter. ]
[ quic_neeraju: Apply feedback from Jiapeng Chong. ]
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This is just dead code at the moment, and will be used once
the state-transition code is activated.
Because srcu_barrier() must be aware of transition before call_srcu(), the
state machine waits for an SRCU grace period before callbacks are queued
to the non-CPU-0 queues. This requres that portions of srcu_barrier()
be enclosed in an SRCU read-side critical section.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit shrinks the srcu_struct structure by converting its ->node
field from a fixed-size compile-time array to a pointer to a dynamically
allocated array. In kernels built with large values of NR_CPUS that boot
on systems with smaller numbers of CPUs, this can save significant memory.
[ paulmck: Apply kernel test robot feedback. ]
Reported-by: A cast of thousands
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit makes Tree SRCU able to operate without an snp_node
array, that is, when the srcu_data structures' ->mynode pointers
are NULL. This can result in high contention on the srcu_struct
structure's ->lock, but only when there are lots of call_srcu(),
synchronize_srcu(), and synchronize_srcu_expedited() calls.
Note that when there is no snp_node array, all SRCU callbacks use
CPU 0's callback queue. This is optimal in the common case of low
update-side load because it removes the need to search each CPU
for the single callback that made the grace period happen.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the srcu_funnel_gp_start() walks its local variable snp up the
tree and reloads sdp->mynode whenever it is necessary to check whether
it is still at the leaf srcu_node level. This works, but is a bit more
obtuse than absolutely necessary. In addition, upcoming commits will
dynamically size srcu_struct structures, in which case sdp->mynode will
no longer necessarily be a constant, and this commit helps prepare for
that dynamic sizing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, cleanup_srcu_struct() checks for a grace period in progress,
but it does not check for a grace period that has not yet started but
which might start at any time. Such a situation could result in a
use-after-free bug, so this commit adds a check for a grace period that
is needed but not yet started to cleanup_srcu_struct().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The seq argument is assigned to meta.seq twice, the second one is
redundant, remove it.
This patch also removes a redundant space in bpf_iter_link_attach().
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220410060020.307283-1-ytcoode@gmail.com
If CPUs on a node are offline at boot time, the number of nodes is
different when building affinity masks for present cpus and when building
affinity masks for possible cpus. This causes the following problem:
In the case that the number of vectors is less than the number of nodes
there are cases where bits of masks for present cpus are overwritten when
building masks for possible cpus.
Fix this by excluding CPUs, which are not part of the current build mask
(present/possible).
[ tglx: Massaged changelog and added comment ]
Fixes: b825921990 ("genirq/affinity: Spread IRQs to all available NUMA nodes")
Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220331003309.10891-1-yamamoto.rei@jp.fujitsu.com
clocksource_verify_percpu() calls cpumask_weight() to check if any bit of a
given cpumask is set.
This can be done more efficiently with cpumask_empty() because
cpumask_empty() stops traversing the cpumask as soon as it finds first set
bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220210224933.379149-24-yury.norov@gmail.com
__irq_build_affinity_masks() calls cpumask_weight() to check if any bit of
a given cpumask is set.
This can be done more efficiently with cpumask_empty() because
cpumask_empty() stops traversing the cpumask as soon as it finds first set
bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220210224933.379149-22-yury.norov@gmail.com
When booting with maxcpus=<small number> (or even loading a driver
while most CPUs are offline), it is pretty easy to observe managed
affinities containing a mix of online and offline CPUs being passed
to the irqchip driver.
This means that the irqchip cannot trust the affinity passed down
from the core code, which is a bit annoying and requires (at least
in theory) all drivers to implement some sort of affinity narrowing.
In order to address this, always limit the cpumask to the set of
online CPUs.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220405185040.206297-3-maz@kernel.org
When booting with maxcpus=<small number>, interrupt controllers
such as the GICv3 ITS may not be able to satisfy the affinity of
some managed interrupts, as some of the HW resources are simply
not available.
The same thing happens when loading a driver using managed interrupts
while CPUs are offline.
In order to deal with this, do not try to activate such interrupt
if there is no online CPU capable of handling it. Instead, place
it in shutdown state. Once a capable CPU shows up, it will be
activated.
Reported-by: John Garry <john.garry@huawei.com>
Reported-by: David Decotigny <ddecotig@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20220405185040.206297-2-maz@kernel.org
- A couple of fixes to event encoding on Sapphire Rapids
- Pass event caps of inherited events so that perf doesn't fail wrongly at fork()
- Add support for a new Raptor Lake CPU
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJSvt0ACgkQEsHwGGHe
VUpKBRAAtegwh4ilwoRM0LePH2TX752pREy+M1qfEUp/XyH3tF8VAixCAmIg7qlI
IyjRX0AKDC1F08sM7/JmTf0M+hnl/oH2YPG8Q6p3igtfARvn+5bPZdSBpTAC9P5L
QX3S2WVzv5X78IomIfENqbg5HyZP3IXeg7R7sqZhHbtoG54n5NEv/+aJl5HmHFTt
gLTrXetL46OSMnLzKfd3hlJqCWSnTz1aGKgGX2cZy9ipI63+XrYMuNmiwJ+CrA3G
pI98RmKnCPqV2rXij1GpVQNyG2aPR+VVZM3aaq6XBAmiNTaCfnvWbEBGhCkjaSgA
UU7Y6D1Qxc0OZ1plcjhKc4l/W1oj8jqmG9nS6J2Xy4szdpZIdxBhlWq89xCrb9AC
yIgKif2iVl7eMVKVG1Jq1u2wTwurBAamH73sCCNn8ndctBjicoM8pbtHMHxzceyZ
w4Cff0yUNzHgPiqSHQRARw/CaUceL9kDoGzPeEQOR0A+27MpNulchts4HCtIvwzI
yLIK1JFPHDrCACLTMuAhvov3EMTeoTIfc91eOZRjubRTPx7TxujaZHdP7N+R3nkk
Giehc/l6IhFPhT8QACk0bziTVJ9in+Jx8pCnocGKuj80Uqs7Sq7swjlasy1Zoy7r
x9Qzy1gZhPHnvPd6LWU4WyPa767D07DlG/zFdg+P3EeWa/3efdw=
=ba3V
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- A couple of fixes to cgroup-related handling of perf events
- A couple of fixes to event encoding on Sapphire Rapids
- Pass event caps of inherited events so that perf doesn't fail wrongly
at fork()
- Add support for a new Raptor Lake CPU
* tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Always set cpuctx cgrp when enable cgroup event
perf/core: Fix perf_cgroup_switch()
perf/core: Use perf_cgroup_info->active to check if cgroup is active
perf/core: Don't pass task around when ctx sched in
perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
perf/x86/intel: Don't extend the pseudo-encoding to GP counters
perf/core: Inherit event_caps
perf/x86/uncore: Add Raptor Lake uncore support
perf/x86/msr: Add Raptor Lake CPU support
perf/x86/cstate: Add Raptor Lake support
perf/x86: Add Intel Raptor Lake support
the local_lock_* macros back to inline functions
- A couple of fixes to static call insn patching
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJStZ4ACgkQEsHwGGHe
VUpUpA/8DHOMUQa7rM8z49ZWBV01HNVCLECTeeKshQBLyJfWc84MNOfdPbpgEGvY
XE/eIZDnTMB5UKD0bfRqD+AQ0fXjl3NiLnJrdDZJqEQAiP/wGBswKNXMire8xPT8
9MfaOKYWYPl0LY2uZBWVLcdC+lVe4kRGfhqAcl4LRx0ZSvMzgjcFy34NeXY8LlXD
kFQJEzHa97CTROje54mtmXEt7Y5bxjxWwVTSyfEt0hJPGo1bJtJP6FaY01Muj+Xu
h/OGNx3KLOYf9MqQC31caAwKgtUOptm8bTpvG3onaHg29qJgz2umKwONyOjYrUUn
2PE3NREfMuKI38nf88pX+lOCs6/I1uVIjJPvAVJijIcuI1ZBXrfm26IP0lZ3LqG1
h/9Y5gChiZPn1j90VnF4UCJUm4u3bYEAHqKIQgUdpcpUqX0NlxbDiXoYxJWfHnmB
PBJ0PE7Vdo4MPK0n3BGVrzXAFeOyHsohAsKFijT8afRCMAOF/ebmVs/tI5NygFrK
11e/U13/78iKkazZSxWew8vU3yXA39W5Rym7aPnhR2lWxvN+xQOjNTgZTxF9hUcZ
6AcsaYJgHR7nD8SM7Y9+cwHWOWaDEdZMg9XSkgvyd1p0tHb4u+Ve/SQK7sA3j9q7
ZmZyFSE1X3K+M1i+75rUSVmIEVM5cpfhodN89iRje/JIZ1KyRT8=
=hSOc
-----END PGP SIGNATURE-----
Merge tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Borislav Petkov:
- Allow the compiler to optimize away unused percpu accesses and change
the local_lock_* macros back to inline functions
- A couple of fixes to static call insn patching
* tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "mm/page_alloc: mark pagesets as __maybe_unused"
Revert "locking/local_lock: Make the empty local_lock_*() function a macro."
x86/percpu: Remove volatile from arch_raw_cpu_ptr().
static_call: Remove __DEFINE_STATIC_CALL macro
static_call: Properly initialise DEFINE_STATIC_CALL_RET0()
static_call: Don't make __static_call_return0 static
x86,static_call: Fix __static_call_return0 for i386
- Two fixes for the new forceidle balancer
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJSsskACgkQEsHwGGHe
VUqHRg//UvcM+Ygrx17yhWzHAKi5Mm5sVxSbsY08WU9cQsAQTZ4c9I8Rs2OHkyKC
8k0uQDxrrOeJRcuJoP87pRPqefvep+Gk1873KEvBRUe7+sQATGO6KMshoiPrnYO5
kQzd98rK9vrNu5ZwZFWADnCAYco+5nlsyFVXkRY0ZXhDUvfInus2OLkAm4ULXrt1
FVtO69QXDK3y42NzLSPDCoQyeM/bcCCts6wpR2WWHE2F9zD8tiM8A4DBNpu6Iker
wli2la27V4U0236ar7Md2HwD8AkQUuOSGYh9JD5RBNZjJpoHKPNNIv35dI7r9yXz
6/r52pM+idMmfV7MjDWg7cIyHgJKbfBJ54+ibvoqd3Gi5R9IZLJOXGEQaaPLaMT0
7movvEm/NDzHbQDGKmPoiRr4PYinRKFN85zTuzirscbHInMkmshciHmK2TG9Qt9m
2L5DG/LnA0EQkhFyrGoxXTgnZwGWpmpWu7tRZfFTUsjbri4CRGThmQYpl7tAEzF7
TC60WA2RfYXaJgtguZJZfiHXSYfzQriXLd4Mj1WRv6FU+IKedZmAPgjYO2dKu++y
We8ZOER8Ysy2lJR/DDQ0waDp5UrTarX/WCFzIWNLKcFLvgLEPKrfeO8AtFHDVUZd
58g9DCn1Jed8ZEYwxpPYpbLVcqQ790oShlU+/EA+FwN5pehuJNU=
=elMK
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Use the correct static key checking primitive on the IRQ exit path
- Two fixes for the new forceidle balancer
* tag 'sched_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Fix compile error in dynamic_irqentry_exit_cond_resched()
sched: Teach the forced-newidle balancer about CPU affinity limitation.
sched/core: Fix forceidle balancing
Fixes the following W=1 kernel build warning:
kernel/time/tick-sched.c:1563: warning: This comment starts with '/**',
but isn't a kernel-doc comment.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220214084739.63228-1-jiapeng.chong@linux.alibaba.com
While running some testing on code that happened to allow the variable
tick_nohz_full_running to get set but with no "possible" NOHZ cores to
back up that setting, this warning triggered:
if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE))
WARN_ON(tick_nohz_full_running);
The console was overwhemled with an endless stream of one WARN per tick
per core and there was no way to even see what was going on w/o using a
serial console to capture it and then trace it back to this.
Change it to WARN_ON_ONCE().
Fixes: 08ae95f4fd ("nohz_full: Allow the boot CPU to be nohz_full")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211206145950.10927-3-paul.gortmaker@windriver.com
The level granularity round up of calc_index() does:
(x + (1 << n)) >> n
which is obviously equivalent to
(x >> n) + 1
but compilers can't figure that out despite the fact that the input range
is known to not cause an overflow. It's neither intuitive to read.
Just write out the obvious.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87h778j46c.ffs@tglx
When base::next_expiry_recalc is not initialized to false during cpu
bringup in HOTPLUG_CPU and is accidently true and no timer is queued in the
meantime, the loop through the wheel to find __next_timer_interrupt() might
be done for nothing.
Therefore initialize base::next_expiry_recalc to false in
timers_prepare_cpu().
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220405191732.7438-2-anna-maria@linutronix.de
When the timer base is empty, base::next_expiry is set to base::clk +
NEXT_TIMER_MAX_DELTA and base::next_expiry_recalc is false. When no timer
is queued until jiffies reaches base::next_expiry value, the warning for
not finding any expired timer and base::next_expiry_recalc is false in
__run_timers() triggers.
To prevent triggering the warning in this valid scenario
base::timers_pending needs to be added to the warning condition.
Fixes: 31cd0e119d ("timers: Recalculate next timer interrupt only when necessary")
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220405191732.7438-3-anna-maria@linutronix.de
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-04-09
We've added 63 non-merge commits during the last 9 day(s) which contain
a total of 68 files changed, 4852 insertions(+), 619 deletions(-).
The main changes are:
1) Add libbpf support for USDT (User Statically-Defined Tracing) probes.
USDTs are an abstraction built on top of uprobes, critical for tracing
and BPF, and widely used in production applications, from Andrii Nakryiko.
2) While Andrii was adding support for x86{-64}-specific logic of parsing
USDT argument specification, Ilya followed-up with USDT support for s390
architecture, from Ilya Leoshkevich.
3) Support name-based attaching for uprobe BPF programs in libbpf. The format
supported is `u[ret]probe/binary_path:[raw_offset|function[+offset]]`, e.g.
attaching to libc malloc can be done in BPF via SEC("uprobe/libc.so.6:malloc")
now, from Alan Maguire.
4) Various load/store optimizations for the arm64 JIT to shrink the image
size by using arm64 str/ldr immediate instructions. Also enable pointer
authentication to verify return address for JITed code, from Xu Kuohai.
5) BPF verifier fixes for write access checks to helper functions, e.g.
rd-only memory from bpf_*_cpu_ptr() must not be passed to helpers that
write into passed buffers, from Kumar Kartikeya Dwivedi.
6) Fix overly excessive stack map allocation for its base map structure and
buckets which slipped-in from cleanups during the rlimit accounting removal
back then, from Yuntao Wang.
7) Extend the unstable CT lookup helpers for XDP and tc/BPF to report netfilter
connection tracking tuple direction, from Lorenzo Bianconi.
8) Improve bpftool dump to show BPF program/link type names, Milan Landaverde.
9) Minor cleanups all over the place from various others.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (63 commits)
bpf: Fix excessive memory allocation in stack_map_alloc()
selftests/bpf: Fix return value checks in perf_event_stackmap test
selftests/bpf: Add CO-RE relos into linked_funcs selftests
libbpf: Use weak hidden modifier for USDT BPF-side API functions
libbpf: Don't error out on CO-RE relos for overriden weak subprogs
samples, bpf: Move routes monitor in xdp_router_ipv4 in a dedicated thread
libbpf: Allow WEAK and GLOBAL bindings during BTF fixup
libbpf: Use strlcpy() in path resolution fallback logic
libbpf: Add s390-specific USDT arg spec parsing logic
libbpf: Make BPF-side of USDT support work on big-endian machines
libbpf: Minor style improvements in USDT code
libbpf: Fix use #ifdef instead of #if to avoid compiler warning
libbpf: Potential NULL dereference in usdt_manager_attach_usdt()
selftests/bpf: Uprobe tests should verify param/return values
libbpf: Improve string parsing for uprobe auto-attach
libbpf: Improve library identification for uprobe binary path resolution
selftests/bpf: Test for writes to map key from BPF helpers
selftests/bpf: Test passing rdonly mem to global func
bpf: Reject writes for PTR_TO_MAP_KEY in check_helper_mem_access
bpf: Check PTR_TO_MEM | MEM_RDONLY in check_helper_mem_access
...
====================
Link: https://lore.kernel.org/r/20220408231741.19116-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'n_buckets * (value_size + sizeof(struct stack_map_bucket))' part of the
allocated memory for 'smap' is never used after the memlock accounting was
removed, thus get rid of it.
[ Note, Daniel:
Commit b936ca643a ("bpf: rework memlock-based memory accounting for maps")
moved `cost += n_buckets * (value_size + sizeof(struct stack_map_bucket))`
up and therefore before the bpf_map_area_alloc() allocation, sigh. In a later
step commit c85d69135a ("bpf: move memory size checks to bpf_map_charge_init()"),
and the overflow checks of `cost >= U32_MAX - PAGE_SIZE` moved into
bpf_map_charge_init(). And then 370868107b ("bpf: Eliminate rlimit-based
memory accounting for stackmap maps") finally removed the bpf_map_charge_init().
Anyway, the original code did the allocation same way as /after/ this fix. ]
Fixes: b936ca643a ("bpf: rework memlock-based memory accounting for maps")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220407130423.798386-1-ytcoode@gmail.com
Unlike regular VMs, TDX guests use the firmware hand-off wakeup method
to wake up the APs during the boot process. This wakeup model uses a
mailbox to communicate with firmware to bring up the APs. As per the
design, this mailbox can only be used once for the given AP, which means
after the APs are booted, the same mailbox cannot be used to
offline/online the given AP. More details about this requirement can be
found in Intel TDX Virtual Firmware Design Guide, sec titled "AP
initialization in OS" and in sec titled "Hotplug Device".
Since the architecture does not support any method of offlining the
CPUs, disable CPU hotplug support in the kernel.
Since this hotplug disable feature can be re-used by other VM guests,
add a new CC attribute CC_ATTR_HOTPLUG_DISABLED and use it to disable
the hotplug support.
Attempt to offline CPU will fail with -EOPNOTSUPP.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-25-kirill.shutemov@linux.intel.com
Alexei Starovoitov says:
====================
pull-request: bpf 2022-04-06
We've added 8 non-merge commits during the last 8 day(s) which contain
a total of 9 files changed, 139 insertions(+), 36 deletions(-).
The main changes are:
1) rethook related fixes, from Jiri and Masami.
2) Fix the case when tracing bpf prog is attached to struct_ops, from Martin.
3) Support dual-stack sockets in bpf_tcp_check_syncookie, from Maxim.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Adjust bpf_tcp_check_syncookie selftest to test dual-stack sockets
bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
bpf: selftests: Test fentry tracing a struct_ops program
bpf: Resolve to prog->aux->dst_prog->type only for BPF_PROG_TYPE_EXT
rethook: Fix to use WRITE_ONCE() for rethook:: Handler
selftests/bpf: Fix warning comparing pointer to 0
bpf: Fix sparse warnings in kprobe_multi_resolve_syms
bpftool: Explicit errno handling in skeletons
====================
Link: https://lore.kernel.org/r/20220407031245.73026-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This moves ftrace_enabled to trace/ftrace.c.
We move sysctls to places where features actually belong to improve
the readability of the code and reduce the risk of code merge conflicts.
At the same time, the proc-sysctl maintainers do not want to know what
sysctl knobs you wish to add for your owner piece of code, we just care
about the core logic.
Signed-off-by: Wei Xiao <xiaowei66@huawei.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
All filesystem syctls now get reviewed by fs folks. This commit
follows the commit of fs, move the real_root_dev sysctl to its own file,
kernel/do_mount_initrd.c.
Signed-off-by: tangmeng <tangmeng@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
All filesystem syctls now get reviewed by fs folks. This commit
follows the commit of fs, move the delayacct sysctl to its own file,
kernel/delayacct.c.
Signed-off-by: tangmeng <tangmeng@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
All filesystem syctls now get reviewed by fs folks. This commit
follows the commit of fs, move the acct sysctl to its own file,
kernel/acct.c.
Signed-off-by: tangmeng <tangmeng@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
All filesystem syctls now get reviewed by fs folks. This commit
follows the commit of fs, move the oops_all_cpu_backtrace sysctl to
its own file, kernel/panic.c.
Signed-off-by: tangmeng <tangmeng@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
All filesystem syctls now get reviewed by fs folks. This commit
follows the commit of fs, move the prove_locking and lock_stat sysctls
to its own file, kernel/lockdep.c.
Signed-off-by: tangmeng <tangmeng@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we just
care about the core logic.
So move the page-writeback sysctls to its own file.
[akpm@linux-foundation.org: coding-style cleanups]
akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warnings]
Link: https://lkml.kernel.org/r/20220129012955.26594-1-zhanglianjie@uniontech.com
Signed-off-by: zhanglianjie <zhanglianjie@uniontech.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we just
care about the core logic.
So move the oom_kill sysctls to their own file, mm/oom_kill.c
[sfr@canb.auug.org.au: null-terminate the array]
Link: https://lkml.kernel.org/r/20220216193202.28838626@canb.auug.org.au
Link: https://lkml.kernel.org/r/20220215093203.31032-1-sujiaxun@uniontech.com
Signed-off-by: sujiaxun <sujiaxun@uniontech.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
All filesystem syctls now get reviewed by fs folks. This commit
follows the commit of fs, move the poweroff_cmd and ctrl-alt-del
sysctls to its own file, kernel/reboot.c.
Signed-off-by: tangmeng <tangmeng@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
move energy_aware sysctls to topology.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
move cfs_bandwidth_slice sysctls to fair.c and use the
new register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
move uclamp_util sysctls to core.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Avoid random build errors which do not select
CONFIG_SYSCTL by depending on it in Kconfig.
This fixes the following warning:
In file included from kernel/sched/build_policy.c:43:
At top level:
kernel/sched/rt.c:3017:12: error: ‘sched_rr_handler’ defined but not used [-Werror=unused-function]
3017 | static int sched_rr_handler(struct ctl_table *table, int write, void *buffer,
| ^~~~~~~~~~~~~~~~
kernel/sched/rt.c:2978:12: error: ‘sched_rt_handler’ defined but not used [-Werror=unused-function]
2978 | static int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
| ^~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:310: kernel/sched/build_policy.o] Error 1
make[1]: *** [scripts/Makefile.build:638: kernel/sched] Error 2
make[1]: *** Waiting for unfinished jobs....
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
[mcgrof: small build fix, we need sched_rt_can_attach() even
when CONFIG_SYSCTL is disabled]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
move rr_timeslice sysctls to rt.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
move deadline_period sysctls to deadline.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
move rt_period/runtime sysctls to rt.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
move schedstats sysctls to core.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
move child_runs_first sysctls to fair.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
The kernel has a wide variety of debugging options to help catch
and squash bugs. However, new debugging is added all the time and
the existing options can be hard to find.
Add a Kconfig fragment with the debugging options which tip
maintainers expect to be used to test contributions.
This should make it easier for contributors to test their code and
find issues before submission.
[ bp: Add to "make help" output, fix DEBUG_INFO selection as pointed
out by Nathan Chancellor <nathan@kernel.org>. ]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220331175728.299103A0@davehans-spike.ostc.intel.com
It is not permitted to write to PTR_TO_MAP_KEY, but the current code in
check_helper_mem_access would allow for it, reject this case as well, as
helpers taking ARG_PTR_TO_UNINIT_MEM also take PTR_TO_MAP_KEY.
Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220319080827.73251-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The commit being fixed was aiming to disallow users from incorrectly
obtaining writable pointer to memory that is only meant to be read. This
is enforced now using a MEM_RDONLY flag.
For instance, in case of global percpu variables, when the BTF type is
not struct (e.g. bpf_prog_active), the verifier marks register type as
PTR_TO_MEM | MEM_RDONLY from bpf_this_cpu_ptr or bpf_per_cpu_ptr
helpers. However, when passing such pointer to kfunc, global funcs, or
BPF helpers, in check_helper_mem_access, there is no expectation
MEM_RDONLY flag will be set, hence it is checked as pointer to writable
memory. Later, verifier sets up argument type of global func as
PTR_TO_MEM | PTR_MAYBE_NULL, so user can use a global func to get around
the limitations imposed by this flag.
This check will also cover global non-percpu variables that may be
introduced in kernel BTF in future.
Also, we update the log message for PTR_TO_BUF case to be similar to
PTR_TO_MEM case, so that the reason for error is clear to user.
Fixes: 34d3a78c68 ("bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.")
Reviewed-by: Hao Luo <haoluo@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220319080827.73251-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When passing pointer to some map value to kfunc or global func, in
verifier we are passing meta as NULL to various functions, which uses
meta->raw_mode to check whether memory is being written to. Since some
kfunc or global funcs may also write to memory pointers they receive as
arguments, we must check for write access to memory. E.g. in some case
map may be read only and this will be missed by current checks.
However meta->raw_mode allows for uninitialized memory (e.g. on stack),
since there is not enough info available through BTF, we must perform
one call for read access (raw_mode = false), and one for write access
(raw_mode = true).
Fixes: e5069b9c23 ("bpf: Support pointers in global func args")
Fixes: d583691c47 ("bpf: Introduce mem, size argument pair support for kfunc")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220319080827.73251-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Replace module_addr_min and module_addr_max by
mod_tree.addr_min and mod_tree.addr_max
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC to allow architectures
to request having modules data in vmalloc area instead of module area.
This is required on powerpc book3s/32 in order to set data non
executable, because it is not possible to set executability on page
basis, this is done per 256 Mbytes segments. The module area has exec
right, vmalloc area has noexec.
This can also be useful on other powerpc/32 in order to maximize the
chance of code being close enough to kernel core to avoid branch
trampolines.
Cc: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mcgrof: rebased in light of kernel/module/kdb.c move]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
In order to allow separation of data from text, add another layout,
called data_layout. For architectures requesting separation of text
and data, only text will go in core_layout and data will go in
data_layout.
For architectures which keep text and data together, make data_layout
an alias of core_layout, that way data_layout can be used for all
data manipulations, regardless of whether data is in core_layout or
data_layout.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
In order to separate text and data, we need to setup
two rb trees.
Modify functions to give the tree as a parameter.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
In order to separate text and data, we need to setup
two rb trees.
This means that struct mod_tree_root is required even without
MODULES_TREE_LOOKUP.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
debug_align() was added by commit 84e1c6bb38 ("x86: Add RO/NX
protection for loadable kernel modules")
At that time the config item was CONFIG_DEBUG_SET_MODULE_RONX.
But nowadays it has changed to CONFIG_STRICT_MODULE_RWX and
debug_align() is confusing because it has nothing to do with
DEBUG.
Rename it strict_align()
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Perform layout alignment verification up front and WARN_ON()
and fail module loading instead of crashing the machine.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Move module_enable_x() together with module_enable_nx() and
module_enable_ro().
Those three functions are going together, they are all used
to set up the correct page flags on the different sections.
As module_enable_x() is used independently of
CONFIG_STRICT_MODULE_RWX, build strict_rwx.c all the time and
use IS_ENABLED(CONFIG_STRICT_MODULE_RWX) when relevant.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
module_enable_x() has nothing to do with CONFIG_ARCH_HAS_STRICT_MODULE_RWX
allthough by coincidence architectures who need module_enable_x() are
selection CONFIG_ARCH_HAS_STRICT_MODULE_RWX.
Enable module_enable_x() for everyone everytime. If an architecture
already has module text set executable, it's a no-op.
Don't check text_size alignment. When CONFIG_STRICT_MODULE_RWX is set
the verification is already done in frob_rodata(). When
CONFIG_STRICT_MODULE_RWX is not set it is not a big deal to have the
start of data as executable. Just make sure we entirely get the last
page when the boundary is not aligned.
And don't BUG on misaligned base as some architectures like nios2
use kmalloc() for allocating modules. So just bail out in that case.
If that's a problem, a page fault will occur later anyway.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates module version support out of core code into
kernel/module/version.c. In addition simple code refactoring to
make this possible.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates the kdb 'lsmod' command support out of main
kdb code into its own file under kernel/module. In addition to
the above, a minor style warning i.e. missing a blank line after
declarations, was resolved too. The new file was added to
MAINTAINERS. Finally we remove linux/module.h as it is entirely
redundant.
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates module sysfs support out of core code into
kernel/module/sysfs.c. In addition simple code refactoring to
make this possible.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates code that allows one to generate a
list of loaded/or linked modules via /proc when procfs
support is enabled into kernel/module/procfs.c.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
The purpose of this patch is to address the various Sparse warnings
due to the incorrect dereference/or access of an __rcu pointer.
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates kallsyms code out of core module
code kernel/module/kallsyms.c
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates kmemleak code out of core module
code into kernel/module/debug_kmemleak.c
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates code that makes module text
and rodata memory read-only and non-text memory
non-executable from core module code into
kernel/module/strict_rwx.c.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates module latched RB-tree support
(e.g. see __module_address()) from core module code
into kernel/module/tree_lookup.c.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates livepatch support (i.e. used during module
add/or load and remove/or deletion) from core module code into
kernel/module/livepatch.c. At the moment it contains code to
persist Elf information about a given livepatch module, only.
The new file was added to MAINTAINERS.
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
This patch will address the following warning and style violations
generated by ./scripts/checkpatch.pl in strict mode:
WARNING: Use #include <linux/module.h> instead of <asm/module.h>
#10: FILE: kernel/module/internal.h:10:
+#include <asm/module.h>
CHECK: spaces preferred around that '-' (ctx:VxV)
#18: FILE: kernel/module/internal.h:18:
+#define INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-1))
CHECK: Please use a blank line after function/struct/union/enum declarations
#69: FILE: kernel/module/internal.h:69:
+}
+static inline void module_decompress_cleanup(struct load_info *info)
^
CHECK: extern prototypes should be avoided in .h files
#84: FILE: kernel/module/internal.h:84:
+extern int mod_verify_sig(const void *mod, struct load_info *info);
WARNING: Missing a blank line after declarations
#116: FILE: kernel/module/decompress.c:116:
+ struct page *page = module_get_next_page(info);
+ if (!page) {
WARNING: Missing a blank line after declarations
#174: FILE: kernel/module/decompress.c:174:
+ struct page *page = module_get_next_page(info);
+ if (!page) {
CHECK: Please use a blank line after function/struct/union/enum declarations
#258: FILE: kernel/module/decompress.c:258:
+}
+static struct kobj_attribute module_compression_attr = __ATTR_RO(compression);
Note: Fortunately, the multiple-include optimisation found in
include/linux/module.h will prevent duplication/or inclusion more than
once.
Fixes: f314dfea16 ("modsign: log module name in the event of an error")
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch makes it possible to move non-essential code
out of core module code.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Per PREEMPT_DYNAMIC, checking CONFIG_PREEMPT doesn't tell you the actual
preemption model of the live kernel. Use the newly-introduced accessors
instead.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20211112185203.280040-5-valentin.schneider@arm.com
CONFIG_PREEMPT{_NONE, _VOLUNTARY} designate either:
o The build-time preemption model when !PREEMPT_DYNAMIC
o The default boot-time preemption model when PREEMPT_DYNAMIC
IOW, using those on PREEMPT_DYNAMIC kernels is meaningless - the actual
model could have been set to something else by the "preempt=foo" cmdline
parameter. Same problem applies to CONFIG_PREEMPTION.
Introduce a set of helpers to determine the actual preemption model used by
the live kernel.
Suggested-by: Marco Elver <elver@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20211112185203.280040-3-valentin.schneider@arm.com
Per PREEMPT_DYNAMIC, checking CONFIG_PREEMPT doesn't tell you the actual
preemption model of the live kernel. Use the newly-introduced accessors
instead.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20211112185203.280040-4-valentin.schneider@arm.com
Have the trace_contention_*() tracepoints consistently include
adaptive spinning. In order to differentiate between the spinning and
non-spinning states add LCB_F_MUTEX and combine with LCB_F_SPIN.
The consequence is that a mutex contention can now triggler multiple
_begin() tracepoints before triggering an _end().
Additionally, this fixes one path where mutex would trigger _end()
without ever seeing a _begin().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Adding the lock contention tracepoints in various lock function slow
paths. Note that each arch can define spinlock differently, I only
added it only to the generic qspinlock for now.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Link: https://lkml.kernel.org/r/20220322185709.141236-3-namhyung@kernel.org
This adds two new lock contention tracepoints like below:
* lock:contention_begin
* lock:contention_end
The lock:contention_begin takes a flags argument to classify locks. I
found it useful to identify what kind of locks it's tracing like if
it's spinning or sleeping, reader-writer lock, real-time, and per-cpu.
Move tracepoint definitions into mutex.c so that we can use them
without lockdep.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Link: https://lkml.kernel.org/r/20220322185709.141236-2-namhyung@kernel.org
For writers, the out_nolock path will always attempt to wake up waiters.
This may not be really necessary if the waiter to be removed is not the
first one.
For readers, no attempt to wake up waiter is being made. However, if
the HANDOFF bit is set and the reader to be removed is the first waiter,
the waiter behind it will inherit the HANDOFF bit and for a write lock
waiter waking it up will allow it to spin on the lock to acquire it
faster. So it can be beneficial to do a wakeup in this case.
Add a new rwsem_del_wake_waiter() helper function to do that consistently
for both reader and writer out_nolock paths.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220322152059.2182333-4-longman@redhat.com
In an analysis of a recent vmcore, a reader-owned rwsem was found with
385 readers but no writer in the wait queue. That is kind of unusual
but it may be caused by some race conditions that we have not fully
understood yet. In such a case, all the readers in the wait queue should
join the other reader-owners and acquire the read lock.
In rwsem_down_write_slowpath(), an incoming writer will try to
wake up the front readers under such circumstance. That is not
the case for rwsem_down_read_slowpath(), add a new helper function
rwsem_cond_wake_waiter() to do wakeup and use it in both reader and
writer slowpaths to have a consistent and correct behavior.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220322152059.2182333-3-longman@redhat.com
Since commit d257cc8cb8 ("locking/rwsem: Make handoff bit handling
more consistent"), the handoff bit is always cleared if the wait queue
becomes empty. There is no need to check for RWSEM_FLAG_HANDOFF when
the wait list is known to be empty.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220322152059.2182333-2-longman@redhat.com
While looking into a bug related to the compiler's handling of addresses
of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep.
Drive by cleanup.
-Wunused-parameter:
kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip'
kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip'
kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip'
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com
When enable a cgroup event, cpuctx->cgrp setting is conditional
on the current task cgrp matching the event's cgroup, so have to
do it for every new event. It brings complexity but no advantage.
To keep it simple, this patch would always set cpuctx->cgrp
when enable the first cgroup event, and reset to NULL when disable
the last cgroup event.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-5-zhouchengming@bytedance.com
There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
in perf_cgroup_switch().
CPU1 CPU2
perf_cgroup_sched_out(prev, next)
cgrp1 = perf_cgroup_from_task(prev)
cgrp2 = perf_cgroup_from_task(next)
if (cgrp1 != cgrp2)
perf_cgroup_switch(prev, PERF_CGROUP_SWOUT)
cgroup_migrate_execute()
task->cgroups = ?
perf_cgroup_attach()
task_function_call(task, __perf_cgroup_move)
perf_cgroup_sched_in(prev, next)
cgrp1 = perf_cgroup_from_task(prev)
cgrp2 = perf_cgroup_from_task(next)
if (cgrp1 != cgrp2)
perf_cgroup_switch(next, PERF_CGROUP_SWIN)
__perf_cgroup_move()
perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN)
The commit a8d757ef07 ("perf events: Fix slow and broken cgroup
context switch code") want to skip perf_cgroup_switch() when the
perf_cgroup of "prev" and "next" are the same.
But task->cgroups can change in concurrent with context_switch()
in cgroup_migrate_execute(). If cgrp1 == cgrp2 in sched_out(),
cpuctx won't do sched_out. Then task->cgroups changed cause
cgrp1 != cgrp2 in sched_in(), cpuctx will do sched_in. So trigger
WARN_ON_ONCE(cpuctx->cgrp).
Even though __perf_cgroup_move() will be synchronized as the context
switch disables the interrupt, context_switch() still can see the
task->cgroups is changing in the middle, since task->cgroups changed
before sending IPI.
So we have to combine perf_cgroup_sched_in() into perf_cgroup_sched_out(),
unified into perf_cgroup_switch(), to fix the incosistency between
perf_cgroup_sched_out() and perf_cgroup_sched_in().
But we can't just compare prev->cgroups with next->cgroups to decide
whether to skip cpuctx sched_out/in since the prev->cgroups is changing
too. For example:
CPU1 CPU2
cgroup_migrate_execute()
prev->cgroups = ?
perf_cgroup_attach()
task_function_call(task, __perf_cgroup_move)
perf_cgroup_switch(task)
cgrp1 = perf_cgroup_from_task(prev)
cgrp2 = perf_cgroup_from_task(next)
if (cgrp1 != cgrp2)
cpuctx sched_out/in ...
task_function_call() will return -ESRCH
In the above example, prev->cgroups changing cause (cgrp1 == cgrp2)
to be true, so skip cpuctx sched_out/in. And later task_function_call()
would return -ESRCH since the prev task isn't running on cpu anymore.
So we would leave perf_events of the old prev->cgroups still sched on
the CPU, which is wrong.
The solution is that we should use cpuctx->cgrp to compare with
the next task's perf_cgroup. Since cpuctx->cgrp can only be changed
on local CPU, and we have irq disabled, we can read cpuctx->cgrp to
compare without holding ctx lock.
Fixes: a8d757ef07 ("perf events: Fix slow and broken cgroup context switch code")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-4-zhouchengming@bytedance.com
Since we use perf_cgroup_set_timestamp() to start cgroup time and
set active to 1, then use update_cgrp_time_from_cpuctx() to stop
cgroup time and set active to 0.
We can use info->active directly to check if cgroup is active.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-3-zhouchengming@bytedance.com
The current code pass task around for ctx_sched_in(), only
to get perf_cgroup of the task, then update the timestamp
of it and its ancestors and set them to active.
But we can use cpuctx->cgrp to get active perf_cgroup and
its ancestors since cpuctx->cgrp has been set before
ctx_sched_in().
This patch remove the task argument in ctx_sched_in()
and cleanup related code.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-2-zhouchengming@bytedance.com
It was reported that some perf event setup can make fork failed on
ARM64. It was the case of a group of mixed hw and sw events and it
failed in perf_event_init_task() due to armpmu_event_init().
The ARM PMU code checks if all the events in a group belong to the
same PMU except for software events. But it didn't set the event_caps
of inherited events and no longer identify them as software events.
Therefore the test failed in a child process.
A simple reproducer is:
$ perf stat -e '{cycles,cs,instructions}' perf bench sched messaging
# Running 'sched/messaging' benchmark:
perf: fork(): Invalid argument
The perf stat was fine but the perf bench failed in fork(). Let's
inherit the event caps from the parent.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20220328200112.457740-1-namhyung@kernel.org
System.map shows that vmlinux contains several instances of
__static_call_return0():
c0004fc0 t __static_call_return0
c0011518 t __static_call_return0
c00d8160 t __static_call_return0
arch_static_call_transform() uses the middle one to check whether we are
setting a call to __static_call_return0 or not:
c0011520 <arch_static_call_transform>:
c0011520: 3d 20 c0 01 lis r9,-16383 <== r9 = 0xc001 << 16
c0011524: 39 29 15 18 addi r9,r9,5400 <== r9 += 0x1518
c0011528: 7c 05 48 00 cmpw r5,r9 <== r9 has value 0xc0011518 here
So if static_call_update() is called with one of the other instances of
__static_call_return0(), arch_static_call_transform() won't recognise it.
In order to work properly, global single instance of __static_call_return0() is required.
Fixes: 3f2a8fc4b1 ("static_call/x86: Add __static_call_return0()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/30821468a0e7d28251954b578e5051dc09300d04.1647258493.git.christophe.leroy@csgroup.eu
kernel/entry/common.c: In function ‘dynamic_irqentry_exit_cond_resched’:
kernel/entry/common.c:409:14: error: implicit declaration of function ‘static_key_unlikely’; did you mean ‘static_key_enable’? [-Werror=implicit-function-declaration]
409 | if (!static_key_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
| ^~~~~~~~~~~~~~~~~~~
| static_key_enable
static_key_unlikely() should be static_branch_unlikely().
Fixes: 99cf983cc8 ("sched/preempt: Add PREEMPT_DYNAMIC using static keys")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20220330084328.1805665-1-svens@linux.ibm.com
try_steal_cookie() looks at task_struct::cpus_mask to decide if the
task could be moved to `this' CPU. It ignores that the task might be in
a migration disabled section while not on the CPU. In this case the task
must not be moved otherwise per-CPU assumption are broken.
Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if the a
task can be moved.
Fixes: d2dfa17bc7 ("sched: Trivial forced-newidle balancer")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de
Steve reported that ChromeOS encounters the forceidle balancer being
ran from rt_mutex_setprio()'s balance_callback() invocation and
explodes.
Now, the forceidle balancer gets queued every time the idle task gets
selected, set_next_task(), which is strictly too often.
rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
running = task_current(rq, p); /* rq->curr == p */
if (queued)
dequeue_task(...);
if (running)
put_prev_task(...);
/* change task properties */
if (queued)
enqueue_task(...);
if (running)
set_next_task(...);
However, rt_mutex_setprio() will explicitly not run this pattern on
the idle task (since priority boosting the idle task is quite insane).
Most other 'change' pattern users are pidhash based and would also not
apply to idle.
Also, the change pattern doesn't contain a __balance_callback()
invocation and hence we could have an out-of-band balance-callback,
which *should* trigger the WARN in rq_pin_lock() (which guards against
this exact anti-pattern).
So while none of that explains how this happens, it does indicate that
having it in set_next_task() might not be the most robust option.
Instead, explicitly queue the forceidle balancer from pick_next_task()
when it does indeed result in forceidle selection. Having it here,
ensures it can only be triggered under the __schedule() rq->lock
instance, and hence must be ran from that context.
This also happens to clean up the code a little, so win-win.
Fixes: d2dfa17bc7 ("sched: Trivial forced-newidle balancer")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: T.J. Alumbaugh <talumbau@chromium.org>
Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net
No functional changes.
This patch moves all module related code into a separate directory,
modifies each file name and creates a new Makefile. Note: this effort
is in preparation to refactor core module code.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].
This removes the need to use the found variable (existed & supported)
and simply checking if the variable was set, can determine if the
break/goto was hit.
[1] https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220331091929.647057-1-jakobkoschel@gmail.com
- Rename the staging files to give them some meaning.
Just stage1,stag2,etc, does not show what they are for
- Check for NULL from allocation in bootconfig
- Hold event mutex for dyn_event call in user events
- Mark user events to broken (to work on the API)
- Remove eBPF updates from user events
- Remove user events from uapi header to keep it from being installed.
- Move ftrace_graph_is_dead() into inline as it is called from hot paths
and also convert it into a static branch.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYkmyIBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qutfAQD90gbUgFMFe2akF5sKhonF5T6mm0+w
BsWqNlBEKBxmfwD+Krfpxql/PKp/gCufcIUUkYC4E6Wl9akf3eO1qQel1Ao=
=ZTn1
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
- Rename the staging files to give them some meaning. Just
stage1,stag2,etc, does not show what they are for
- Check for NULL from allocation in bootconfig
- Hold event mutex for dyn_event call in user events
- Mark user events to broken (to work on the API)
- Remove eBPF updates from user events
- Remove user events from uapi header to keep it from being installed.
- Move ftrace_graph_is_dead() into inline as it is called from hot
paths and also convert it into a static branch.
* tag 'trace-v5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Move user_events.h temporarily out of include/uapi
ftrace: Make ftrace_graph_is_dead() a static branch
tracing: Set user_events to BROKEN
tracing/user_events: Remove eBPF interfaces
tracing/user_events: Hold event_mutex during dyn_event_add
proc: bootconfig: Add null pointer check
tracing: Rename the staging files for trace_events
generalized.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJJV1gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYof8tD/0Xs4qpxlR81PgZSJ3QJ9vok5tKpe3j
O+ZLvQtyc2dnkduSOpJXiKe5YxDZ39Ihb7Fb9ETSUFS0ohJFDYiR6bKVXqKBjp6g
Z0u57B3j/ZrZt9W3oK2BxlKBgen3MTYmybPQja+oTZfuu+Vd+DKD6NEyGcOZe53G
+ZzEnBevar+f+/ble4PmJrnu5fP63jlUDPlY6h7HnsS2+MYTlx8JOMyhc4v4KxpR
od4/9NUMbcpV4q2hReC5D22TArhr/7woNaCFswnOuk+mb9d8sPvqv9U8iHC/YoTM
IeX3Bt1qHRT++Sjkkup2/k0xAy50H/7wMbQP+Jb993rWlLiWSd2WY0OHZ+gWSfgG
oM6a2yAZ029klyMBvV0AdiAYpvhlDs36UZBLyIIa8M4zRgH9h+//F9UZ5qnt+0kp
ACTd/B+bksbvO4A1npxZ1fUWPw6L5a8730GIy/csvAsoRlOaITfCFVA98ob+36TF
JUdyuzRAOrbt3H7pRUB+xz0pxxPkceoBBwrBTcSw1cyIyV3b8CaFT2oRWY3nt+er
THWuiXY4Jy2wtNcHMhKIZKBCtUZ7sDUBhcnplxL+qoRJ0V340B2Kh1J8/0mnjDD+
Aks4E7Q3ogpyuMXAKDEGebyTPcRe0bQXyyjJVR9cuPn5i8AM9/rv5Iqem4Ed1hLK
dQeXuWx6zLcGrw==
=mJKF
-----END PGP SIGNATURE-----
Merge tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RT signal fix from Thomas Gleixner:
"Revert the RT related signal changes. They need to be reworked and
generalized"
* tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"
- fix a regression in dma remap handling vs AMD memory encryption (me)
- finally kill off the legacy PCI DMA API (Christophe JAILLET)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmJJkTULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMywg/8C6akX8wmA/Q2oCLtXY4guYjLVQIAfh7QajEvPKhN
Cs4lquiL4VvMtwPHKvih+41RlPMC2G5Wd2lFBOiOi8n5hS9AN+DuIMm9GH43WOTv
9Fk/59xJAj+tPCkeEZzKgG/HK8YqcB4BirfP3Zc6jL9Rh56kK12PUKL9zNboLIcb
FDk6cnz2/N0Mhoo5w52I+Nc1+1dW7uQnV1BO5aHJUxb6uiSEuBbAPc9zUjTBOA4/
LZXenU1zBfmv/LHFKIqDHi7p3INH2Ze+9GuR02rb5654c/8SRU+eDgN2mm4kn+B3
WmYHILqDHjqpDIAed5FBY99yI8eHtrv52ihj7E6QdHRXU7QqLEUSgChTnzU+k0pk
bhQdnen2f2Z6WOVxf9051qYi9A/RK49PZJR2tl/FDN3Cjyd1xL9t035zLMoUWv2b
q8Qt1avoM0wwQ/2C3eFsxszCJ2Z1IcyQuSZ+VW0hqLSwi0aEfI/G1GF+8gT3wuTO
r7y7EvQAuhPRcyMVn/qd306gmdANwWVQ/W9QcR7Z28jzK9lvkLHZDS0Hr4gU54Fp
Rmv8dizLkQokr2IGwX/dRv3lDWKhVW9+GdyTpKtKGG+e0GEv8QKxMkhOSGgiH4mF
+/DAmgcKdUKilhH20MLxL7+a4M6j3+leZH8VgYBHS8BjnWr7Zw9BzTUrapHIueRi
2oo=
=q+vI
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.18-1' of git://git.infradead.org/users/hch/dma-mapping
Pull more dma-mapping updates from Christoph Hellwig:
- fix a regression in dma remap handling vs AMD memory encryption (me)
- finally kill off the legacy PCI DMA API (Christophe JAILLET)
* tag 'dma-mapping-5.18-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: move pgprot_decrypted out of dma_pgprot
PCI/doc: cleanup references to the legacy PCI DMA API
PCI: Remove the deprecated "pci-dma-compat.h" API
After being merged, user_events become more visible to a wider audience
that have concerns with the current API.
It is too late to fix this for this release, but instead of a full
revert, just mark it as BROKEN (which prevents it from being selected in
make config). Then we can work finding a better API. If that fails,
then it will need to be completely reverted.
To not have the code silently bitrot, still allow building it with
COMPILE_TEST.
And to prevent the uapi header from being installed, then later changed,
and then have an old distro user space see the old version, move the
header file out of the uapi directory.
Surround the include with CONFIG_COMPILE_TEST to the current location,
but when the BROKEN tag is taken off, it will use the uapi directory,
and fail to compile. This is a good way to remind us to move the header
back.
Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home
Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While user_events API is under development and has been marked for broken
to not let the API become fixed, move the header file out of the uapi
directory. This is to prevent it from being installed, then later changed,
and then have an old distro user space update with a new kernel, where
applications see the user_events being available, but the old header is in
place, and then they get compiled incorrectly.
Also, surround the include with CONFIG_COMPILE_TEST to the current
location, but when the BROKEN tag is taken off, it will use the uapi
directory, and fail to compile. This is a good way to remind us to move
the header back.
Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home
Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com
Link: https://lkml.kernel.org/r/20220401143903.188384f3@gandalf.local.home
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
ftrace_graph_is_dead() is used on hot paths, it just reads a variable
in memory and is not worth suffering function call constraints.
For instance, at entry of prepare_ftrace_return(), inlining it avoids
saving prepare_ftrace_return() parameters to stack and restoring them
after calling ftrace_graph_is_dead().
While at it using a static branch is even more performant and is
rather well adapted considering that the returned value will almost
never change.
Inline ftrace_graph_is_dead() and replace 'kill_ftrace_graph' bool
by a static branch.
The performance improvement is noticeable.
Link: https://lkml.kernel.org/r/e0411a6a0ed3eafff0ad2bc9cd4b0e202b4617df.1648623570.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Make sure the event_mutex is properly held during dyn_event_add call.
This is required when adding dynamic events.
Link: https://lkml.kernel.org/r/20220328223225.1992-1-beaub@linux.microsoft.com
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The attr->value_size is already assigned to smap->map.value_size
in bpf_map_init_from_attr(), there is no need to do it again in
stack_map_alloc().
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/bpf/20220323073626.958652-1-ytcoode@gmail.com
pgprot_decrypted is used by AMD SME systems to allow access to memory
that was set to not encrypted using set_memory_decrypted. That only
happens for dma-direct memory as the IOMMU solves the addressing
challenges for the encryption bit using its own remapping.
Move the pgprot_decrypted call out of dma_pgprot which is also used
by the IOMMU mappings and into dma-direct so that it is only used with
memory that was set decrypted.
Fixes: f5ff79fddf ("dma-mapping: remove CONFIG_DMA_REMAP")
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Features:
- kprobes: rethook: x86: replace kretprobe trampoline with rethook
Current release - regressions:
- sfc: avoid null-deref on systems without NUMA awareness
in the new queue sizing code
Current release - new code bugs:
- vxlan: do not feed vxlan_vnifilter_dump_dev with non-vxlan devices
- eth: lan966x: fix null-deref on PHY pointer in timestamp ioctl
when interface is down
Previous releases - always broken:
- openvswitch: correct neighbor discovery target mask field
in the flow dump
- wireguard: ignore v6 endpoints when ipv6 is disabled and fix a leak
- rxrpc: fix call timer start racing with call destruction
- rxrpc: fix null-deref when security type is rxrpc_no_security
- can: fix UAF bugs around echo skbs in multiple drivers
Misc:
- docs: move netdev-FAQ to the "process" section of the documentation
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmJF3S0ACgkQMUZtbf5S
IruIvA/+NZx+c+fBBrbjOh63avRL7kYIqIDREf+v6lh4ZXmbrp22xalcjIdxgWeK
vAiYfYmzZblWAGkilcvPG3blCBc+9b+YE+pPJXFe60Huv3eYpjKfgTKwQOg/lIeM
8MfPP7eBwcJ/ltSTRtySRl9LYgyVcouP9rAVJavFVYrvuDYunwhfChswVfGCYon8
42O4nRwrtkTE1MjHD8HS3YxvwGlo+iIyhsxgG/gWx8F2xeIG22H6adzjDXcCQph8
air/awrJ4enYkVMRokGNfNppK9Z3vjJDX5xha3CREpvXNPe0F24cAE/L8XqyH7+r
/bXP5y9VC9mmEO7x4Le3VmDhOJGbCOtR89gTlevftDRdSIrbNHffZhbPW48tR7o8
NJFlhiSJb4HEMN0q7BmxnWaKlbZUlvLEXLuU5ytZE/G7i+nETULlunfZrCD4eNYH
gBGYhiob2I/XotJA9QzG/RDyaFwDaC/VARsyv37PSeBAl/yrEGAeP7DsKkKX/ayg
LM9ItveqHXK30J0xr3QJA8s49EkIYejjYR3l0hQ9esf9QvGK99dE/fo44Apf3C3A
Lz6XpnRc5Xd7tZ9Aopwb3FqOH6WR9Hq9Qlbk0qifsL/2sRbatpuZbbDK6L3CR3Ir
WFNcOoNbbqv85kCKFXFjj0jdpoNa9Yej8XFkMkVSkM3sHImYmYQ=
=5Bvy
-----END PGP SIGNATURE-----
Merge tag 'net-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull more networking updates from Jakub Kicinski:
"Networking fixes and rethook patches.
Features:
- kprobes: rethook: x86: replace kretprobe trampoline with rethook
Current release - regressions:
- sfc: avoid null-deref on systems without NUMA awareness in the new
queue sizing code
Current release - new code bugs:
- vxlan: do not feed vxlan_vnifilter_dump_dev with non-vxlan devices
- eth: lan966x: fix null-deref on PHY pointer in timestamp ioctl when
interface is down
Previous releases - always broken:
- openvswitch: correct neighbor discovery target mask field in the
flow dump
- wireguard: ignore v6 endpoints when ipv6 is disabled and fix a leak
- rxrpc: fix call timer start racing with call destruction
- rxrpc: fix null-deref when security type is rxrpc_no_security
- can: fix UAF bugs around echo skbs in multiple drivers
Misc:
- docs: move netdev-FAQ to the 'process' section of the
documentation"
* tag 'net-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
vxlan: do not feed vxlan_vnifilter_dump_dev with non vxlan devices
openvswitch: Add recirc_id to recirc warning
rxrpc: fix some null-ptr-deref bugs in server_key.c
rxrpc: Fix call timer start racing with call destruction
net: hns3: fix software vlan talbe of vlan 0 inconsistent with hardware
net: hns3: fix the concurrency between functions reading debugfs
docs: netdev: move the netdev-FAQ to the process pages
docs: netdev: broaden the new vs old code formatting guidelines
docs: netdev: call out the merge window in tag checking
docs: netdev: add missing back ticks
docs: netdev: make the testing requirement more stringent
docs: netdev: add a question about re-posting frequency
docs: netdev: rephrase the 'should I update patchwork' question
docs: netdev: rephrase the 'Under review' question
docs: netdev: shorten the name and mention msgid for patch status
docs: netdev: note that RFC postings are allowed any time
docs: netdev: turn the net-next closed into a Warning
docs: netdev: move the patch marking section up
docs: netdev: minor reword
docs: netdev: replace references to old archives
...
Revert commit bf9ad37dc8. It needs to be better encapsulated and
generalized.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Since the function pointered by rethook::handler never be removed when
the rethook is alive, it doesn't need to use rcu_assign_pointer() to
update it. Just use WRITE_ONCE().
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164868907688.21983.1606862921419988152.stgit@devnote2
- do not zero buffer in set_memory_decrypted (Kirill A. Shutemov)
- fix return value of dma-debug __setup handlers (Randy Dunlap)
- swiotlb cleanups (Robin Murphy)
- remove most remaining users of the pci-dma-compat.h API
(Christophe JAILLET)
- share the ABI header for the DMA map_benchmark with userspace
(Tian Tao)
- update the maintainer for DMA MAPPING BENCHMARK (Xiang Chen)
- remove CONFIG_DMA_REMAP (me)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmJDDgsLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYM9oBAAxm93DZCXsqektM2qJ34o1KCyfAhvTvZ1r38ab+cl
wJwmMPF6/S9MCj6XZEnCzUnXL//TnhcuYVztNpPTWqhx6QaqWmmx9yJKjoYAnHce
svVMef7iipn35w7hAPpiVR/AVwWyxQCkSC+5sgp6XX8mp7l7I3ajfO0fZ52JCcxw
12d4k1E0yjC096Kw8wXQv+rzmCAoQcK9Jj20COUO3rkgOr68ZIXse2HXUJjn76Fy
wym2rJfqJ9mdKrDHqphe1ntIzkcQNWx9xR0UVh7/e4p7Si5H8Lp8QWwC7Zw6Y2Gb
paeotIMu1uTKkcZI4K54J8PXRLA7PLrDSDFdxnKOsWNZU/inIwt9b11kr9FOaYqR
BLJ+w6bF1/PmM6q2gkOwNuoiJD5YQfwF7y+wi84VyaauM0J8ssIHYnVrCWXn0m1E
4veAkWasAYb1oaoNlDhmZEbpI+kcN3xwDyK1WbtHuGvR00oSvxl0d1viGTVXYfDA
k5rBjb7CovK8JIrFIJoMiDM4TvdauxL66IlEL7ohLDh6l1f09Q0+gsdVcAM0ObX6
zOkoulyHCFqkePvoH/xpyIrZZ9cHA228fZYC7QiBcxdWlD3dFMWkKvhajiSDQJSW
SAz94CeEDWn64Q462N+ecivKlLwz7j/TqOig5xU+/6UoMC/2a7+HIim+p6bjh8Pc
5Gg=
=C+Es
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.18' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- do not zero buffer in set_memory_decrypted (Kirill A. Shutemov)
- fix return value of dma-debug __setup handlers (Randy Dunlap)
- swiotlb cleanups (Robin Murphy)
- remove most remaining users of the pci-dma-compat.h API
(Christophe JAILLET)
- share the ABI header for the DMA map_benchmark with userspace
(Tian Tao)
- update the maintainer for DMA MAPPING BENCHMARK (Xiang Chen)
- remove CONFIG_DMA_REMAP (me)
* tag 'dma-mapping-5.18' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: benchmark: extract a common header file for map_benchmark definition
dma-debug: fix return value of __setup handlers
dma-mapping: remove CONFIG_DMA_REMAP
media: v4l2-pci-skeleton: Remove usage of the deprecated "pci-dma-compat.h" API
rapidio/tsi721: Remove usage of the deprecated "pci-dma-compat.h" API
sparc: Remove usage of the deprecated "pci-dma-compat.h" API
agp/intel: Remove usage of the deprecated "pci-dma-compat.h" API
alpha: Remove usage of the deprecated "pci-dma-compat.h" API
MAINTAINERS: update maintainer list of DMA MAPPING BENCHMARK
swiotlb: simplify array allocation
swiotlb: tidy up includes
swiotlb: simplify debugfs setup
swiotlb: do not zero buffer in set_memory_decrypted()
Use rethook for kretprobe function return hooking if the arch sets
CONFIG_HAVE_RETHOOK=y. In this case, CONFIG_KRETPROBE_ON_RETHOOK is
set to 'y' automatically, and the kretprobe internal data fields
switches to use rethook. If not, it continues to use kretprobe
specific function return hooks.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164826162556.2455864.12255833167233452047.stgit@devnote2
Since the m->arg_size array can hold up to MAX_BPF_FUNC_ARGS argument
sizes, it's ok that nargs is equal to MAX_BPF_FUNC_ARGS.
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220324164238.1274915-1-ytcoode@gmail.com
Since ftrace_ops::local_hash::filter_hash field is an __rcu pointer,
we have to use rcu_access_pointer() to access it.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164802093635.1732982.4938094876018890866.stgit@devnote2
Fix the type mismatching warning of 'rethook_node vs fprobe_rethook_node'
found by Smatch.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164802092611.1732982.12268174743437084619.stgit@devnote2
This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was
around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
making the semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where now anything left in tracehook.h is
some weird strange thing that is difficult to understand.
Eric W. Biederman (15):
ptrace: Move ptrace_report_syscall into ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Remove tracehook_signal_handler
task_work: Remove unnecessary include from posix_timers.h
task_work: Introduce task_work_pending
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
resume_user_mode: Move to resume_user_mode.h
tracehook: Remove tracehook.h
ptrace: Move setting/clearing ptrace_message into ptrace_stop
ptrace: Return the signal to continue with from ptrace_stop
Jann Horn (1):
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
Yang Li (1):
ptrace: Remove duplicated include in ptrace.c
MAINTAINERS | 1 -
arch/Kconfig | 5 +-
arch/alpha/kernel/ptrace.c | 5 +-
arch/alpha/kernel/signal.c | 4 +-
arch/arc/kernel/ptrace.c | 5 +-
arch/arc/kernel/signal.c | 4 +-
arch/arm/kernel/ptrace.c | 12 +-
arch/arm/kernel/signal.c | 4 +-
arch/arm64/kernel/ptrace.c | 14 +--
arch/arm64/kernel/signal.c | 4 +-
arch/csky/kernel/ptrace.c | 5 +-
arch/csky/kernel/signal.c | 4 +-
arch/h8300/kernel/ptrace.c | 5 +-
arch/h8300/kernel/signal.c | 4 +-
arch/hexagon/kernel/process.c | 4 +-
arch/hexagon/kernel/signal.c | 1 -
arch/hexagon/kernel/traps.c | 6 +-
arch/ia64/kernel/process.c | 4 +-
arch/ia64/kernel/ptrace.c | 6 +-
arch/ia64/kernel/signal.c | 1 -
arch/m68k/kernel/ptrace.c | 5 +-
arch/m68k/kernel/signal.c | 4 +-
arch/microblaze/kernel/ptrace.c | 5 +-
arch/microblaze/kernel/signal.c | 4 +-
arch/mips/kernel/ptrace.c | 5 +-
arch/mips/kernel/signal.c | 4 +-
arch/nds32/include/asm/syscall.h | 2 +-
arch/nds32/kernel/ptrace.c | 5 +-
arch/nds32/kernel/signal.c | 4 +-
arch/nios2/kernel/ptrace.c | 5 +-
arch/nios2/kernel/signal.c | 4 +-
arch/openrisc/kernel/ptrace.c | 5 +-
arch/openrisc/kernel/signal.c | 4 +-
arch/parisc/kernel/ptrace.c | 7 +-
arch/parisc/kernel/signal.c | 4 +-
arch/powerpc/kernel/ptrace/ptrace.c | 8 +-
arch/powerpc/kernel/signal.c | 4 +-
arch/riscv/kernel/ptrace.c | 5 +-
arch/riscv/kernel/signal.c | 4 +-
arch/s390/include/asm/entry-common.h | 1 -
arch/s390/kernel/ptrace.c | 1 -
arch/s390/kernel/signal.c | 5 +-
arch/sh/kernel/ptrace_32.c | 5 +-
arch/sh/kernel/signal_32.c | 4 +-
arch/sparc/kernel/ptrace_32.c | 5 +-
arch/sparc/kernel/ptrace_64.c | 5 +-
arch/sparc/kernel/signal32.c | 1 -
arch/sparc/kernel/signal_32.c | 4 +-
arch/sparc/kernel/signal_64.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/um/kernel/ptrace.c | 5 +-
arch/x86/kernel/ptrace.c | 1 -
arch/x86/kernel/signal.c | 5 +-
arch/x86/mm/tlb.c | 1 +
arch/xtensa/kernel/ptrace.c | 5 +-
arch/xtensa/kernel/signal.c | 4 +-
block/blk-cgroup.c | 2 +-
fs/coredump.c | 1 -
fs/exec.c | 1 -
fs/io-wq.c | 6 +-
fs/io_uring.c | 11 +-
fs/proc/array.c | 1 -
fs/proc/base.c | 1 -
include/asm-generic/syscall.h | 2 +-
include/linux/entry-common.h | 47 +-------
include/linux/entry-kvm.h | 2 +-
include/linux/posix-timers.h | 1 -
include/linux/ptrace.h | 81 ++++++++++++-
include/linux/resume_user_mode.h | 64 ++++++++++
include/linux/sched/signal.h | 17 +++
include/linux/task_work.h | 5 +
include/linux/tracehook.h | 226 -----------------------------------
include/uapi/linux/ptrace.h | 2 +-
kernel/entry/common.c | 19 +--
kernel/entry/kvm.c | 9 +-
kernel/exit.c | 3 +-
kernel/livepatch/transition.c | 1 -
kernel/ptrace.c | 47 +++++---
kernel/seccomp.c | 1 -
kernel/signal.c | 62 +++++-----
kernel/task_work.c | 4 +-
kernel/time/posix-cpu-timers.c | 1 +
mm/memcontrol.c | 2 +-
security/apparmor/domain.c | 1 -
security/selinux/hooks.c | 1 -
85 files changed, 372 insertions(+), 495 deletions(-)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmJCQkoACgkQC/v6Eiaj
j0DCWQ/5AZVFU+hX32obUNCLackHTwgcCtSOs3JNBmNA/zL/htPiYYG0ghkvtlDR
Dw5J5DnxC6P7PVAdAqrpvx2uX2FebHYU0bRlyLx8LYUEP5dhyNicxX9jA882Z+vw
Ud0Ue9EojwGWS76dC9YoKUj3slThMATbhA2r4GVEoof8fSNJaBxQIqath44t0FwU
DinWa+tIOvZANGBZr6CUUINNIgqBIZCH/R4h6ArBhMlJpuQ5Ufk2kAaiWFwZCkX4
0LuuAwbKsCKkF8eap5I2KrIg/7zZVgxAg9O3cHOzzm8OPbKzRnNnQClcDe8perqp
S6e/f3MgpE+eavd1EiLxevZ660cJChnmikXVVh8ZYYoefaMKGqBaBSsB38bNcLjY
3+f2dB+TNBFRnZs1aCujK3tWBT9QyjZDKtCBfzxDNWBpXGLhHH6j6lA5Lj+Cef5K
/HNHFb+FuqedlFZh5m1Y+piFQ70hTgCa2u8b+FSOubI2hW9Zd+WzINV0ANaZ2LvZ
4YGtcyDNk1q1+c87lxP9xMRl/xi6rNg+B9T2MCo4IUnHgpSVP6VEB3osgUmrrrN0
eQlUI154G/AaDlqXLgmn1xhRmlPGfmenkxpok1AuzxvNJsfLKnpEwQSc13g3oiZr
disZQxNY0kBO2Nv3G323Z6PLinhbiIIFez6cJzK5v0YJ2WtO3pY=
=uEro
-----END PGP SIGNATURE-----
Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace cleanups from Eric Biederman:
"This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was around
task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where anything left in tracehook.h was
some weird strange thing that was difficult to understand"
* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ptrace: Remove duplicated include in ptrace.c
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Move setting/clearing ptrace_message into ptrace_stop
tracehook: Remove tracehook.h
resume_user_mode: Move to resume_user_mode.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Introduce task_work_pending
task_work: Remove unnecessary include from posix_timers.h
ptrace: Remove tracehook_signal_handler
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Move ptrace_report_syscall into ptrace.h
Only a single patch this cycle. Fix an obvious mistake with the
kdb memory accessors. It was a stupid mistake (to/from backwards)
but it has been there for a long time since many architectures
tolerated it with surprisingly good grace.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmJB4JIACgkQfOMlXTn3
iKGLdw//UbVxz0cptc+MJcrtOwpcikg2IdVpFUBHV3N+t/dNIYx7Z9suA49xAq16
xOGrAulOQl6m1Z5t1aRd0TLaknkfiw17IXfF2pcwPqWo3QvEzQUf3eHRnID6Dnz3
THfk7SeQt7DzTZDZdHMJBoQX2NJ6ojeg9J7TcQ4j/YjPtVsZOhuTo8UBS+Wultbw
72ZWa3iEFSNNoll4lU0ZLuq5yvCsFP8jkr9iP/j0ekVHzf46pdYIDFkJQIjF5O+4
j3VU6iX1gF/EH8fK5nCJHCCWOvj9BMQLMlr2YlnRC1kPOjjUf2rNlEUT8nItW9nk
bs+4OUZ49iXOejEqP6J7epyggcZFaUxoYcCZMhRbW9tv1UyOCpidqPKtwHzAKve6
m8OvQlslqHTy2h84ryxuzegX2nxOAOQKDi9d+S1Y4FBwSvKvWt9rO9ga9gg3HiIH
gVJDyplosbrXzEtkbOV3UwgeNKg4So7f8FgRzej4exxrYofdX4bJddeiwtgSXkGB
6F8Y5F5EBMvpftxTN0d+d1NJP+j6bUorp238xfDoFAUTrAcWYZz/rzg66e0fK72C
HX8csoushcaCHHrXFMxiKLNRzm4vxHkq6eVXxw4YPceg9Vq7/cyCLqcQocvAIbXM
hvoOu2tH3iRZJSPadCIWSlzPYuQG1KaTZouvEyCIYWg95olkxsA=
=r8V6
-----END PGP SIGNATURE-----
Merge tag 'kgdb-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb update from Daniel Thompson:
"Only a single patch this cycle. Fix an obvious mistake with the kdb
memory accessors.
It was a stupid mistake (to/from backwards) but it has been there for
a long time since many architectures tolerated it with surprisingly
good grace"
* tag 'kgdb-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Fix the putarea helper function
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmJBmMMACgkQUqAMR0iA
lPLeXBAAnAqK3rY+mberKcFKHLaNJ0O2Y7OMcCf5Xh8snnivgi9RYcqklSbxXQwm
hILa2oP6gUug16zhD2XVb5Mxic7MfgsN8mfy/eItMfEVs3KqUzHKSryTp6N1PA5x
DiQvC7Fg7NGYZs95prMCrFILwVrkLYiKlWGTmlWrz/MTfOOsbAjB9yv5bfalvlo+
A3+XpXxHfb/Wl2kXrUjTey61Rrk3gdgLhucrHVxttb9VPp1ODoLLLu4ePoN9CArA
fpGVUfeh1IDV3sUgwpGgXBwJFBsXxJ9ZYGnJzea0opNn8EgfwgIC97qTaa+GXX/j
bUJFPUNrGGEq99JbPgHmu+imXC1eFfCwxXK7zi6TR7mIOq6I/DfQxCLYUHZpFIMn
mt30wm21j2zVRsOt27frhjyXCSnts7HmOleBcd8NL+aIVKaOqamEOQrmPZPj8eH2
cx9gAphhFv6EDnr3Cj3SbpBrqf1pcxjVa9T2gfhJjtkLLyxR2ruvlRvnWNnaKJZZ
bC7OL74h6eAhJk1pwPcHW2BsABv3jWPzBrOYkjIhRWUY77UriWNKJ27Dd83cAVkw
7P6GbGfTbSCX7m2+0pEdKxc9hMshK2zyTLbu02PopD7yGBDkrcnkgpPGPVMDsj4c
44ANkVlLojBAE43fXXdRPfpSKKBa0pi6MO5WXORrWiY7PNZjAAw=
=PhGM
-----END PGP SIGNATURE-----
Merge tag 'livepatching-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Petr Mladek:
- Forced transitions block only to-be-removed livepatches [Chengming]
- Detect when ftrace handler could not be disabled in self-tests [David]
- Calm down warning from a static analyzer [Tom]
* tag 'livepatching-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
livepatch: Reorder to use before freeing a pointer
livepatch: Don't block removal of patches that are safe to unload
livepatch: Skip livepatch tests if ftrace cannot be configured
Here is the set of driver core changes for 5.18-rc1.
Not much here, primarily it was a bunch of cleanups and small updates:
- kobj_type cleanups for default_groups
- documentation updates
- firmware loader minor changes
- component common helper added and take advantage of it in many
drivers (the largest part of this pull request).
There will be a merge conflict in drivers/power/supply/ab8500_chargalg.c
with your tree, the merge conflict should be easy (take all the
changes).
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG6PA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylMFwCfSIyAU4oLEgj+/Rfmx4o45cAVIWMAnit3zbdU
wUUCGqKcOnTJEcW6dMPh
=1VVi
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core changes for 5.18-rc1.
Not much here, primarily it was a bunch of cleanups and small updates:
- kobj_type cleanups for default_groups
- documentation updates
- firmware loader minor changes
- component common helper added and take advantage of it in many
drivers (the largest part of this pull request).
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits)
Documentation: update stable review cycle documentation
drivers/base/dd.c : Remove the initial value of the global variable
Documentation: update stable tree link
Documentation: add link to stable release candidate tree
devres: fix typos in comments
Documentation: add note block surrounding security patch note
samples/kobject: Use sysfs_emit instead of sprintf
base: soc: Make soc_device_match() simpler and easier to read
driver core: dd: fix return value of __setup handler
driver core: Refactor sysfs and drv/bus remove hooks
driver core: Refactor multiple copies of device cleanup
scripts: get_abi.pl: Fix typo in help message
kernfs: fix typos in comments
kernfs: remove unneeded #if 0 guard
ALSA: hda/realtek: Make use of the helper component_compare_dev_name
video: omapfb: dss: Make use of the helper component_compare_dev
power: supply: ab8500: Make use of the helper component_compare_dev
ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of
iommu/mediatek: Make use of the helper component_compare/release_of
drm: of: Make use of the helper component_release_of
...
Here is the big set of char/misc and other small driver subsystem
updates for 5.18-rc1.
Included in here are merges from driver subsystems which contain:
- iio driver updates and new drivers
- fsi driver updates
- fpga driver updates
- habanalabs driver updates and support for new hardware
- soundwire driver updates and new drivers
- phy driver updates and new drivers
- coresight driver updates
- icc driver updates
Individual changes include:
- mei driver updates
- interconnect driver updates
- new PECI driver subsystem added
- vmci driver updates
- lots of tiny misc/char driver updates
There will be two merge conflicts with your tree, one in MAINTAINERS
which is obvious to fix up, and one in drivers/phy/freescale/Kconfig
which also should be easy to resolve.
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG3fQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykNEgCfaRG8CRxewDXOO4+GSeA3NGK+AIoAnR89donC
R4bgCjfg8BWIBcVVXg3/
=WWXC
-----END PGP SIGNATURE-----
Merge tag 'char-misc-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc and other driver updates from Greg KH:
"Here is the big set of char/misc and other small driver subsystem
updates for 5.18-rc1.
Included in here are merges from driver subsystems which contain:
- iio driver updates and new drivers
- fsi driver updates
- fpga driver updates
- habanalabs driver updates and support for new hardware
- soundwire driver updates and new drivers
- phy driver updates and new drivers
- coresight driver updates
- icc driver updates
Individual changes include:
- mei driver updates
- interconnect driver updates
- new PECI driver subsystem added
- vmci driver updates
- lots of tiny misc/char driver updates
All of these have been in linux-next for a while with no reported
problems"
* tag 'char-misc-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (556 commits)
firmware: google: Properly state IOMEM dependency
kgdbts: fix return value of __setup handler
firmware: sysfb: fix platform-device leak in error path
firmware: stratix10-svc: add missing callback parameter on RSU
arm64: dts: qcom: add non-secure domain property to fastrpc nodes
misc: fastrpc: Add dma handle implementation
misc: fastrpc: Add fdlist implementation
misc: fastrpc: Add helper function to get list and page
misc: fastrpc: Add support to secure memory map
dt-bindings: misc: add fastrpc domain vmid property
misc: fastrpc: check before loading process to the DSP
misc: fastrpc: add secure domain support
dt-bindings: misc: add property to support non-secure DSP
misc: fastrpc: Add support to get DSP capabilities
misc: fastrpc: add support for FASTRPC_IOCTL_MEM_MAP/UNMAP
misc: fastrpc: separate fastrpc device from channel context
dt-bindings: nvmem: brcm,nvram: add basic NVMEM cells
dt-bindings: nvmem: make "reg" property optional
nvmem: brcm_nvram: parse NVRAM content into NVMEM cells
nvmem: dt-bindings: Fix the error of dt-bindings check
...
Halil Pasic points out [1] that the full revert of that commit (revert
in bddac7c1e0), and that a partial revert that only reverts the
problematic case, but still keeps some of the cleanups is probably
better. 
And that partial revert [2] had already been verified by Oleksandr
Natalenko to also fix the issue, I had just missed that in the long
discussion.
So let's reinstate the cleanups from commit aa6f8dcbab ("swiotlb:
rework "fix info leak with DMA_FROM_DEVICE""), and effectively only
revert the part that caused problems.
Link: https://lore.kernel.org/all/20220328013731.017ae3e3.pasic@linux.ibm.com/ [1]
Link: https://lore.kernel.org/all/20220324055732.GB12078@lst.de/ [2]
Link: https://lore.kernel.org/all/4386660.LvFx2qVVIh@natalenko.name/ [3]
Suggested-by: Halil Pasic <pasic@linux.ibm.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Christoph Hellwig" <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coarse grained, hardware based, forward edge Control-Flow-Integrity mechanism
where any indirect CALL/JMP must target an ENDBR instruction or suffer #CP.
Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation is
limited to 2 instructions (and typically fewer) on branch targets not starting
with ENDBR. CET-IBT also limits speculation of the next sequential instruction
after the indirect CALL/JMP [1].
CET-IBT is fundamentally incompatible with retpolines, but provides, as
described above, speculation limits itself.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmI/LI8VHHBldGVyekBp
bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6ZnkP/2QCgQLTu6oRxv9O020CHwlaSEeD
1Hoy3loum5q5hAi1Ik3dR9p0H5u64c9qbrBVxaFoNKaLt5GKrtHaDSHNk2L/CFHX
urpH65uvTLxbyZzcahkAahoJ71XU+m7PcrHLWMunw9sy10rExYVsUOlFyoyG6XCF
BDCNZpdkC09ZM3vwlWGMZd5Pp+6HcZNPyoV9tpvWAS2l+WYFWAID7mflbpQ+tA8b
y/hM6b3Ud0rT2ubuG1iUpopgNdwqQZ+HisMPGprh+wKZkYwS2l8pUTrz0MaBkFde
go7fW16kFy2HQzGm6aIEBmfcg0palP/mFVaWP0zS62LwhJSWTn5G6xWBr3yxSsht
9gWCiI0oDZuTg698MedWmomdG2SK6yAuZuqmdKtLLoWfWgviPEi7TDFG/cKtZdAW
ag8GM8T4iyYZzpCEcWO9GWbjo6TTGq30JBQefCBG47GjD0csv2ubXXx0Iey+jOwT
x3E8wnv9dl8V9FSd/tMpTFmje8ges23yGrWtNpb5BRBuWTeuGiBPZED2BNyyIf+T
dmewi2ufNMONgyNp27bDKopY81CPAQq9cVxqNm9Cg3eWPFnpOq2KGYEvisZ/rpEL
EjMQeUBsy/C3AUFAleu1vwNnkwP/7JfKYpN00gnSyeQNZpqwxXBCKnHNgOMTXyJz
beB/7u2KIUbKEkSN
=jZfK
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CET-IBT (Control-Flow-Integrity) support from Peter Zijlstra:
"Add support for Intel CET-IBT, available since Tigerlake (11th gen),
which is a coarse grained, hardware based, forward edge
Control-Flow-Integrity mechanism where any indirect CALL/JMP must
target an ENDBR instruction or suffer #CP.
Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation
is limited to 2 instructions (and typically fewer) on branch targets
not starting with ENDBR. CET-IBT also limits speculation of the next
sequential instruction after the indirect CALL/JMP [1].
CET-IBT is fundamentally incompatible with retpolines, but provides,
as described above, speculation limits itself"
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
* tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
kvm/emulate: Fix SETcc emulation for ENDBR
x86/Kconfig: Only allow CONFIG_X86_KERNEL_IBT with ld.lld >= 14.0.0
x86/Kconfig: Only enable CONFIG_CC_HAS_IBT for clang >= 14.0.0
kbuild: Fixup the IBT kbuild changes
x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
x86: Remove toolchain check for X32 ABI capability
x86/alternative: Use .ibt_endbr_seal to seal indirect calls
objtool: Find unused ENDBR instructions
objtool: Validate IBT assumptions
objtool: Add IBT/ENDBR decoding
objtool: Read the NOENDBR annotation
x86: Annotate idtentry_df()
x86,objtool: Move the ASM_REACHABLE annotation to objtool.h
x86: Annotate call_on_stack()
objtool: Rework ASM_REACHABLE
x86: Mark __invalid_creds() __noreturn
exit: Mark do_group_exit() __noreturn
x86: Mark stop_this_cpu() __noreturn
objtool: Ignore extra-symbol code
objtool: Rename --duplicate to --lto
...
- The run time string verifier that checks all trace event formats
as they are read from the tracing file to make sure that the %s
pointers are not reading something that no longer exists, failed
to account for %*.s where the length given is zero, and the string
is NULL. It incorrectly flagged it as a null pointer dereference and
gave a WARN_ON().
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYjzyxBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qk+WAP4+ICutAiIGyPmHEqtjLGyDPC25nbB3
vg+qWWkWEOIi5gD+PpGsGSE7HYFdWJi1BCfshNOm8I92TyoE2nJkXh3LeA8=
=mZr5
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull trace event string verifier fix from Steven Rostedt:
"The run-time string verifier checks all trace event formats as
they are read from the tracing file to make sure that the %s pointers
are not reading something that no longer exists.
However, it failed to account for the valid case of '%*.s' where the
length given is zero, and the string is NULL. It incorrectly flagged
it as a null pointer dereference and gave a WARN_ON()"
* tag 'trace-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have trace event string test handle zero length strings
This reverts commit aa6f8dcbab.
It turns out this breaks at least the ath9k wireless driver, and
possibly others.
What the ath9k driver does on packet receive is to set up the DMA
transfer with:
int ath_rx_init(..)
..
bf->bf_buf_addr = dma_map_single(sc->dev, skb->data,
common->rx_bufsize,
DMA_FROM_DEVICE);
and then the receive logic (through ath_rx_tasklet()) will fetch
incoming packets
static bool ath_edma_get_buffers(..)
..
dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
common->rx_bufsize, DMA_FROM_DEVICE);
ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
if (ret == -EINPROGRESS) {
/*let device gain the buffer again*/
dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
common->rx_bufsize, DMA_FROM_DEVICE);
return false;
}
and it's worth noting how that first DMA sync:
dma_sync_single_for_cpu(..DMA_FROM_DEVICE);
is there to make sure the CPU can read the DMA buffer (possibly by
copying it from the bounce buffer area, or by doing some cache flush).
The iommu correctly turns that into a "copy from bounce bufer" so that
the driver can look at the state of the packets.
In the meantime, the device may continue to write to the DMA buffer, but
we at least have a snapshot of the state due to that first DMA sync.
But that _second_ DMA sync:
dma_sync_single_for_device(..DMA_FROM_DEVICE);
is telling the DMA mapping that the CPU wasn't interested in the area
because the packet wasn't there. In the case of a DMA bounce buffer,
that is a no-op.
Note how it's not a sync for the CPU (the "for_device()" part), and it's
not a sync for data written by the CPU (the "DMA_FROM_DEVICE" part).
Or rather, it _should_ be a no-op. That's what commit aa6f8dcbab
broke: it made the code bounce the buffer unconditionally, and changed
the DMA_FROM_DEVICE to just unconditionally and illogically be
DMA_TO_DEVICE.
[ Side note: purely within the confines of the swiotlb driver it wasn't
entirely illogical: The reason it did that odd DMA_FROM_DEVICE ->
DMA_TO_DEVICE conversion thing is because inside the swiotlb driver,
it uses just a swiotlb_bounce() helper that doesn't care about the
whole distinction of who the sync is for - only which direction to
bounce.
So it took the "sync for device" to mean that the CPU must have been
the one writing, and thought it meant DMA_TO_DEVICE. ]
Also note how the commentary in that commit was wrong, probably due to
that whole confusion, claiming that the commit makes the swiotlb code
"bounce unconditionally (that is, also
when dir == DMA_TO_DEVICE) in order do avoid synchronising back stale
data from the swiotlb buffer"
which is nonsensical for two reasons:
- that "also when dir == DMA_TO_DEVICE" is nonsensical, as that was
exactly when it always did - and should do - the bounce.
- since this is a sync for the device (not for the CPU), we're clearly
fundamentally not coping back stale data from the bounce buffers at
all, because we'd be copying *to* the bounce buffers.
So that commit was just very confused. It confused the direction of the
synchronization (to the device, not the cpu) with the direction of the
DMA (from the device).
Reported-and-bisected-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Olha Cherevyk <olha.cherevyk@gmail.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Maxime Bizon <mbizon@freebox.fr>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge yet more updates from Andrew Morton:
"This is the material which was staged after willystuff in linux-next.
Subsystems affected by this patch series: mm (debug, selftests,
pagecache, thp, rmap, migration, kasan, hugetlb, pagemap, madvise),
and selftests"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (113 commits)
selftests: kselftest framework: provide "finished" helper
mm: madvise: MADV_DONTNEED_LOCKED
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
mm: generalize ARCH_HAS_FILTER_PGPROT
mm: unmap_mapping_range_tree() with i_mmap_rwsem shared
mm: warn on deleting redirtied only if accounted
mm/huge_memory: remove stale locking logic from __split_huge_pmd()
mm/huge_memory: remove stale page_trans_huge_mapcount()
mm/swapfile: remove stale reuse_swap_page()
mm/khugepaged: remove reuse_swap_page() usage
mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
mm: streamline COW logic in do_swap_page()
mm: slightly clarify KSM logic in do_swap_page()
mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
mm: optimize do_wp_page() for exclusive pages in the swapcache
mm/huge_memory: make is_transparent_hugepage() static
userfaultfd/selftests: enable hugetlb remap and remove event testing
selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
mm: enable MADV_DONTNEED for hugetlb mappings
kasan: disable LOCKDEP when printing reports
...
- Enforce kernel RO, and implement STRICT_MODULE_RWX for 603.
- Add support for livepatch to 32-bit.
- Implement CONFIG_DYNAMIC_FTRACE_WITH_ARGS.
- Merge vdso64 and vdso32 into a single directory.
- Fix build errors with newer binutils.
- Add support for UADDR64 relocations, which are emitted by some toolchains. This allows
powerpc to build with the latest lld.
- Fix (another) potential userspace r13 corruption in transactional memory handling.
- Cleanups of function descriptor handling & related fixes to LKDTM.
Thanks to: Abdul Haleem, Alexey Kardashevskiy, Anders Roxell, Aneesh Kumar K.V, Anton
Blanchard, Arnd Bergmann, Athira Rajeev, Bhaskar Chowdhury, Cédric Le Goater, Chen
Jingwen, Christophe JAILLET, Christophe Leroy, Corentin Labbe, Daniel Axtens, Daniel
Henrique Barboza, David Dai, Fabiano Rosas, Ganesh Goudar, Guo Zhengkui, Hangyu Hua, Haren
Myneni, Hari Bathini, Igor Zhbanov, Jakob Koschel, Jason Wang, Jeremy Kerr, Joachim
Wiberg, Jordan Niethe, Julia Lawall, Kajol Jain, Kees Cook, Laurent Dufour, Madhavan
Srinivasan, Mamatha Inamdar, Maxime Bizon, Maxim Kiselev, Maxim Kochetkov, Michal
Suchanek, Nageswara R Sastry, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nour-eddine
Taleb, Paul Menzel, Ping Fang, Pratik R. Sampat, Randy Dunlap, Ritesh Harjani, Rohan
McLure, Russell Currey, Sachin Sant, Segher Boessenkool, Shivaprasad G Bhat, Sourabh Jain,
Thierry Reding, Tobias Waldekranz, Tyrel Datwyler, Vaibhav Jain, Vladimir Oltean, Wedson
Almeida Filho, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmI9TtQTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLp2D/0dwoliEJubRCfoawYUGhxTRZuo6ZYw
EQzprOiFA/MtrZyPfbrX/FwxeeetzQWysaw2r5JAuQwx5Jb7Od9dNIrVmueFEktC
hD4fkO8YT+QuOD3Xhp/rDQTImdw4fkeofIjnWIqEAtz0XGInmiRQKOnojVe/Po7f
72Yi1u80LxYBAnkN/Hhpmi/BsVmu0Nh3wELu+JZopQXjINj4RyD49ayCBSLbmiNc
uo7oYzJ0/WsZHNTpX9kAzzCq+XmI3dKZPyf2AOCvoRxJTmUPCRZF9QCwsmQFikiI
vZOdz4fI5e+C0aYJj8ODmWMrXiS+JUQdEShjGg9t9K6EN8idC8joKWpAuXjTA9KN
kRjzXX7AvjxaMEGbLe8gjU0PmEjY3eSzMOy15Oc/C0DRRswXRzrXdx2AF+/J6bQb
MWMM4aCKfcYs5/TENkEnV0xpbOCOy4ikHM1KZbxvVrShvjSlNIL9XTOnl/pNK5BJ
XSSI2mfnjKkbI1+l0KQ4NBXIRTo6HLpu5jwY3Xh97Tq7kaEfqDbO5p2P2HoOCiLa
ZjdzmpP99zM6wnqUSj+lyvjob7btyhoq6TKmPtxfKbR6OaSfRJ760BCJ5y15Y9Hc
rHey4Y/NL7LqsVYFZxi4/T6Ncq1hNeYr2Fiis4gH+/1zjr6Cd4othnvw3Slaxhst
AaHpN3pyx1QI6g==
=8r2c
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Livepatch support for 32-bit is probably the standout new feature,
otherwise mostly just lots of bits and pieces all over the board.
There's a series of commits cleaning up function descriptor handling,
which touches a few other arches as well as LKDTM. It has acks from
Arnd, Kees and Helge.
Summary:
- Enforce kernel RO, and implement STRICT_MODULE_RWX for 603.
- Add support for livepatch to 32-bit.
- Implement CONFIG_DYNAMIC_FTRACE_WITH_ARGS.
- Merge vdso64 and vdso32 into a single directory.
- Fix build errors with newer binutils.
- Add support for UADDR64 relocations, which are emitted by some
toolchains. This allows powerpc to build with the latest lld.
- Fix (another) potential userspace r13 corruption in transactional
memory handling.
- Cleanups of function descriptor handling & related fixes to LKDTM.
Thanks to Abdul Haleem, Alexey Kardashevskiy, Anders Roxell, Aneesh
Kumar K.V, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Bhaskar
Chowdhury, Cédric Le Goater, Chen Jingwen, Christophe JAILLET,
Christophe Leroy, Corentin Labbe, Daniel Axtens, Daniel Henrique
Barboza, David Dai, Fabiano Rosas, Ganesh Goudar, Guo Zhengkui, Hangyu
Hua, Haren Myneni, Hari Bathini, Igor Zhbanov, Jakob Koschel, Jason
Wang, Jeremy Kerr, Joachim Wiberg, Jordan Niethe, Julia Lawall, Kajol
Jain, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mamatha Inamdar,
Maxime Bizon, Maxim Kiselev, Maxim Kochetkov, Michal Suchanek,
Nageswara R Sastry, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
Nour-eddine Taleb, Paul Menzel, Ping Fang, Pratik R. Sampat, Randy
Dunlap, Ritesh Harjani, Rohan McLure, Russell Currey, Sachin Sant,
Segher Boessenkool, Shivaprasad G Bhat, Sourabh Jain, Thierry Reding,
Tobias Waldekranz, Tyrel Datwyler, Vaibhav Jain, Vladimir Oltean,
Wedson Almeida Filho, and YueHaibing"
* tag 'powerpc-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
powerpc/pseries: Fix use after free in remove_phb_dynamic()
powerpc/time: improve decrementer clockevent processing
powerpc/time: Fix KVM host re-arming a timer beyond decrementer range
powerpc/tm: Fix more userspace r13 corruption
powerpc/xive: fix return value of __setup handler
powerpc/64: Add UADDR64 relocation support
powerpc: 8xx: fix a return value error in mpc8xx_pic_init
powerpc/ps3: remove unneeded semicolons
powerpc/64: Force inlining of prevent_user_access() and set_kuap()
powerpc/bitops: Force inlining of fls()
powerpc: declare unmodified attribute_group usages const
powerpc/spufs: Fix build warning when CONFIG_PROC_FS=n
powerpc/secvar: fix refcount leak in format_show()
powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E
powerpc: Move C prototypes out of asm-prototypes.h
powerpc/kexec: Declare kexec_paca static
powerpc/smp: Declare current_set static
powerpc: Cleanup asm-prototypes.c
powerpc/ftrace: Use STK_GOT in ftrace_mprofile.S
powerpc/ftrace: Regroup PPC64 specific operations in ftrace_mprofile.S
...
This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes. The high blast radius core update is the removal of
write same, which affects block and several non-SCSI devices. The
other big change, which is more local, is the removal of the SCSI
pointer.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYjzDQyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishQMYAQDEWUGV
6U0+736AHVtOfiMNfiRN79B1HfXVoHvemnPcTwD/UlndwFfy/3GGOtoZmqEpc73J
Ec1HDuUCE18H1H2QAh0=
=/Ty9
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes.
The high blast radius core update is the removal of write same, which
affects block and several non-SCSI devices. The other big change,
which is more local, is the removal of the SCSI pointer"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits)
scsi: scsi_ioctl: Drop needless assignment in sg_io()
scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn()
scsi: lpfc: Copyright updates for 14.2.0.0 patches
scsi: lpfc: Update lpfc version to 14.2.0.0
scsi: lpfc: SLI path split: Refactor BSG paths
scsi: lpfc: SLI path split: Refactor Abort paths
scsi: lpfc: SLI path split: Refactor SCSI paths
scsi: lpfc: SLI path split: Refactor CT paths
scsi: lpfc: SLI path split: Refactor misc ELS paths
scsi: lpfc: SLI path split: Refactor VMID paths
scsi: lpfc: SLI path split: Refactor FDISC paths
scsi: lpfc: SLI path split: Refactor LS_RJT paths
scsi: lpfc: SLI path split: Refactor LS_ACC paths
scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths
scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths
scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path
scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe
scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4
scsi: lpfc: SLI path split: Refactor lpfc_iocbq
scsi: lpfc: Use kcalloc()
...
The kernel can use to allocate executable memory. The only supported
way to do that is via __vmalloc_node_range() with the executable bit set
in the prot argument. (vmap() resets the bit via pgprot_nx()).
Once tag-based KASAN modes start tagging vmalloc allocations, executing
code from such allocations will lead to the PC register getting a tag,
which is not tolerated by the kernel.
Only tag the allocations for normal kernel pages.
[andreyknvl@google.com: pass KASAN_VMALLOC_PROT_NORMAL to kasan_unpoison_vmalloc()]
Link: https://lkml.kernel.org/r/9230ca3d3e40ffca041c133a524191fd71969a8d.1646233925.git.andreyknvl@google.com
[andreyknvl@google.com: support tagged vmalloc mappings]
Link: https://lkml.kernel.org/r/2f6605e3a358cf64d73a05710cb3da356886ad29.1646233925.git.andreyknvl@google.com
[andreyknvl@google.com: don't unintentionally disabled poisoning]
Link: https://lkml.kernel.org/r/de4587d6a719232e83c760113e46ed2d4d8da61e.1646757322.git.andreyknvl@google.com
Link: https://lkml.kernel.org/r/fbfd9939a4dc375923c9a5c6b9e7ab05c26b8c6b.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add vmalloc tagging support to HW_TAGS KASAN.
The key difference between HW_TAGS and the other two KASAN modes when it
comes to vmalloc: HW_TAGS KASAN can only assign tags to physical memory.
The other two modes have shadow memory covering every mapped virtual
memory region.
Make __kasan_unpoison_vmalloc() for HW_TAGS KASAN:
- Skip non-VM_ALLOC mappings as HW_TAGS KASAN can only tag a single
mapping of normal physical memory; see the comment in the function.
- Generate a random tag, tag the returned pointer and the allocation,
and initialize the allocation at the same time.
- Propagate the tag into the page stucts to allow accesses through
page_address(vmalloc_to_page()).
The rest of vmalloc-related KASAN hooks are not needed:
- The shadow-related ones are fully skipped.
- __kasan_poison_vmalloc() is kept as a no-op with a comment.
Poisoning and zeroing of physical pages that are backing vmalloc()
allocations are skipped via __GFP_SKIP_KASAN_UNPOISON and
__GFP_SKIP_ZERO: __kasan_unpoison_vmalloc() does that instead.
Enabling CONFIG_KASAN_VMALLOC with HW_TAGS is not yet allowed.
Link: https://lkml.kernel.org/r/d19b2e9e59a9abc59d05b72dea8429dcaea739c6.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Once tag-based KASAN modes start tagging vmalloc() allocations, kernel
stacks start getting tagged if CONFIG_VMAP_STACK is enabled.
Reset the tag of kernel stack pointers after allocation in
arch_alloc_vmap_stack().
For SW_TAGS KASAN, when CONFIG_KASAN_STACK is enabled, the instrumentation
can't handle the SP register being tagged.
For HW_TAGS KASAN, there's no instrumentation-related issues. However,
the impact of having a tagged SP register needs to be properly evaluated,
so keep it non-tagged for now.
Note, that the memory for the stack allocation still gets tagged to catch
vmalloc-into-stack out-of-bounds accesses.
[andreyknvl@google.com: fix case when a stack is retrieved from cached_stacks]
Link: https://lkml.kernel.org/r/f50c5f96ef896d7936192c888b0c0a7674e33184.1644943792.git.andreyknvl@google.com
[dan.carpenter@oracle.com: remove unnecessary check in alloc_thread_stack_node()]
Link: https://lkml.kernel.org/r/20220301080706.GB17208@kili
Link: https://lkml.kernel.org/r/698c5ab21743c796d46c15d075b9481825973e34.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Once tag-based KASAN modes start tagging vmalloc() allocations, kernel
stacks start getting tagged if CONFIG_VMAP_STACK is enabled.
Reset the tag of kernel stack pointers after allocation in
alloc_thread_stack_node().
For SW_TAGS KASAN, when CONFIG_KASAN_STACK is enabled, the instrumentation
can't handle the SP register being tagged.
For HW_TAGS KASAN, there's no instrumentation-related issues. However,
the impact of having a tagged SP register needs to be properly evaluated,
so keep it non-tagged for now.
Note, that the memory for the stack allocation still gets tagged to catch
vmalloc-into-stack out-of-bounds accesses.
Link: https://lkml.kernel.org/r/c6c96f012371ecd80e1936509ebcd3b07a5956f7.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>