Currently, a CONFIG_PREEMPT_NONE=y kernel substitutes normal RCU for
RCU Tasks Rude and RCU Tasks Trace. Unless that kernel builds refscale,
whether built-in or as a module, in which case these RCU Tasks flavors are
(unnecessarily) built in. This both increases kernel size and increases
the complexity of certain tracing operations. This commit therefore
decouples the presence of refscale from the presence of RCU Tasks Rude
and RCU Tasks Trace.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, a CONFIG_PREEMPT_NONE=y kernel substitutes normal RCU for
RCU Tasks. Unless that kernel builds refscale, whether built-in or as a
module, in which case RCU Tasks is (unnecessarily) built in. This both
increases kernel size and increases the complexity of certain tracing
operations. This commit therefore decouples the presence of refscale
from the presence of RCU Tasks.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Unless a kernel builds rcutorture, whether built-in or as a module, that
kernel is also built with CONFIG_TASKS_RUDE_RCU, whether anything else
needs Tasks Rude RCU or not. This unnecessarily increases kernel size.
This commit therefore decouples the presence of rcutorture from the
presence of RCU Tasks Rude.
However, there is a need to select CONFIG_TASKS_RUDE_RCU for testing
purposes. Except that casual users must not be bothered with
questions -- for them, this needs to be fully automated. There is
thus a CONFIG_FORCE_TASKS_RUDE_RCU that selects CONFIG_TASKS_RUDE_RCU,
is user-selectable, but which depends on CONFIG_RCU_EXPERT.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, a CONFIG_PREEMPT_NONE=y kernel substitutes normal RCU for
RCU Tasks. Unless that kernel builds rcutorture, whether built-in or as
a module, in which case RCU Tasks is (unnecessarily) used. This both
increases kernel size and increases the complexity of certain tracing
operations. This commit therefore decouples the presence of rcutorture
from the presence of RCU Tasks.
However, there is a need to select CONFIG_TASKS_RCU for testing purposes.
Except that casual users must not be bothered with questions -- for them,
this needs to be fully automated. There is thus a CONFIG_FORCE_TASKS_RCU
that selects CONFIG_TASKS_RCU, is user-selectable, but which depends
on CONFIG_RCU_EXPERT.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Unless a kernel builds rcutorture, whether built-in or as a module, that
kernel is also built with CONFIG_TASKS_TRACE_RCU, whether anything else
needs Tasks Trace RCU or not. This unnecessarily increases kernel size.
This commit therefore decouples the presence of rcutorture from the
presence of RCU Tasks Trace.
However, there is a need to select CONFIG_TASKS_TRACE_RCU for
testing purposes. Except that casual users must not be bothered with
questions -- for them, this needs to be fully automated. There is thus
a CONFIG_FORCE_TASKS_TRACE_RCU that selects CONFIG_TASKS_TRACE_RCU,
is user-selectable, but which depends on CONFIG_RCU_EXPERT.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, any kernel built with CONFIG_PREEMPTION=y also gets
CONFIG_TASKS_RCU=y, which is not helpful to people trying to build
preemptible kernels of minimal size.
Because CONFIG_TASKS_RCU=y is needed only in kernels doing tracing of
one form or another, this commit moves from TASKS_RCU deciding when it
should be enabled to the tracing Kconfig options explicitly selecting it.
This allows building preemptible kernels without TASKS_RCU, if desired.
This commit also updates the SRCU-N and TREE09 rcutorture scenarios
in order to avoid Kconfig errors that would otherwise result from
CONFIG_TASKS_RCU being selected without its CONFIG_RCU_EXPERT dependency
being met.
[ paulmck: Apply BPF_SYSCALL feedback from Andrii Nakryiko. ]
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When booting kernels built with both CONFIG_RCU_STRICT_GRACE_PERIOD=y
and CONFIG_PREEMPT_RT=y, the rcu_read_unlock_special() function's
invocation of irq_work_queue_on() the init_irq_work() causes the
rcu_preempt_deferred_qs_handler() function to work execute in SCHED_FIFO
irq_work kthreads. Because rcu_read_unlock_special() is invoked on each
rcu_read_unlock() in such kernels, the amount of work just keeps piling
up, resulting in a boot-time hang.
This commit therefore avoids this hang by using IRQ_WORK_INIT_HARD()
instead of init_irq_work(), but only in kernels built with both
CONFIG_PREEMPT_RT=y and CONFIG_RCU_STRICT_GRACE_PERIOD=y.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_sync_enter() function is used by updaters to force RCU readers
(e.g. percpu-rwsem) to use their slow paths during an update. This is
accomplished by setting the ->gp_state of the rcu_sync structure to
GP_ENTER. In the case of percpu-rwsem, the readers' slow path waits on
a semaphore instead of just incrementing a reader count. Each updater
invokes the rcu_sync_exit() function to signal to readers that they
may again take their fastpaths. The rcu_sync_exit() function sets the
->gp_state of the rcu_sync structure to GP_EXIT, and if all goes well,
after a grace period the ->gp_state reverts back to GP_IDLE.
Unfortunately, the rcu_sync_enter() function currently has a comment
incorrectly stating that rcu_sync_exit() (by an updater) will re-enable
reader "slowpaths". This patch changes the comment to state that this
function re-enables reader fastpaths.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
For the spawning of the priority-boost kthreads can fail, improbable
though this might seem. This commit therefore refrains from attemoting
to initiate RCU priority boosting when The ->boost_kthread_task pointer
is NULL.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
An early check on synchronize_rcu[_expedited]() tries to determine if
the current CPU is in UP mode on an SMP no-preempt kernel, in which case
there is no need to start a grace period since the current assumed
quiescent state is all we need.
However the preemption mode doesn't take into account the boot selected
preemption mode under CONFIG_PREEMPT_DYNAMIC=y, missing a possible
early return if the running flavour is "none" or "voluntary".
Use the shiny new preempt mode accessors to fix this. However,
avoid invoking them during early boot because doing so triggers a
WARN_ON_ONCE().
[ paulmck: Update for mainlined API. ]
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU's synchronous grace periods act quite differently when there is
only one online CPU, especially in the no-op case in kernels built with
CONFIG_PREEMPTION=n. This change in behavior can be important debugging
information, so this commit adds the number of online CPUs to the RCU
CPU stall warning messages.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The final "if" statement in rcu_gp_cleanup() has proven to be rather
confusing, straightforward though it might have seemed when initially
written. This commit therefore adds comments to its "then" and "else"
clauses to at least provide a more elevated form of confusion.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
A report of a 12-jiffy normal RCU CPU stall warning raises interesting
questions about the nature of time on the offending system. This commit
instruments rcu_sched_clock_irq(), which is RCU's hook into the
scheduling-clock interrupt, checking for the jiffies counter going
backwards.
Reported-by: Saravanan D <sarvanand@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Suppose we have a kernel built with both CONFIG_RCU_STRICT_GRACE_PERIOD=y
and CONFIG_PREEMPT=y. Suppose further that an RCU reader from which RCU
core needs a quiescent state ends in rcu_preempt_deferred_qs_irqrestore().
This function will then invoke rcu_report_qs_rdp() in order to immediately
report that quiescent state. Unfortunately, it will not have cleared
that reader's CPU's rcu_data structure's ->cpu_no_qs.b.norm field.
As a result, rcu_report_qs_rdp() will take an early exit because it
will believe that this CPU has not yet encountered a quiescent state,
and there will be no reporting of the current quiescent state.
This commit therefore causes rcu_preempt_deferred_qs_irqrestore() to
clear the ->cpu_no_qs.b.norm field before invoking rcu_report_qs_rdp().
Kudos to Boqun Feng and Neeraj Upadhyay for helping with analysis of
this issue!
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The TASKS_RUDE_RCU does not select IRQ_WORK, which can result in build
failures for kernels that do not otherwise select IRQ_WORK. This commit
therefore causes the TASKS_RUDE_RCU Kconfig option to select IRQ_WORK.
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcutorture module has an rcu_torture_writer task that repeatedly
performs writes, synchronizations, and deletes. There is a corner-case
check in rcu_torture_writer() wherein if nsynctypes is 0, a warning is
issued and the task waits to be stopped via a call to
torture_kthread_stopping() rather than performing any work.
There should be a return statement following this call to
torture_kthread_stopping(), as the intention with issuing the call to
torture_kthread_stopping() in the first place is to avoid the
rcu_torture_writer task from performing any work. Some of the work may even
be dangerous to perform, such as potentially causing a #DE due to
nsynctypes being used in a modulo operator when querying for sync updates
to issue.
This patch adds the missing return call. As a bonus, it also fixes a
checkpatch warning that was emitted due to the WARN_ONCE() call using the
name of the function rather than __func__.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcutorture module is used to run torture tests that validate RCU.
rcutorture takes a variety of module parameters that configure the
functionality of the test. Amongst these parameters are the types of
synchronization mechanisms that the rcu_torture_writer and
rcu_torture_fakewriter tasks may use, and the torture_type of the run which
determines what read and sync operations are used by the various writer and
reader tasks that run throughout the test.
When the module is configured to only use sync types for which the
specified torture_type does not implement the necessary operations, we can
end up in a state where nsynctypes is 0. This is not an erroneous state,
but it currently crashes the kernel with a #DE due to nsynctypes being used
with a modulo operator in rcu_torture_fakewriter().
Here is an example of such a #DE:
$ insmod ./rcutorture.ko gp_cond=1 gp_cond_exp=0 gp_exp=0 gp_poll_exp=0
gp_normal=0 gp_poll=0 gp_poll_exp=0 verbose=9999 torture_type=trivial
...
[ 8536.525096] divide error: 0000 [#1] PREEMPT SMP PTI
[ 8536.525101] CPU: 30 PID: 392138 Comm: rcu_torture_fak Kdump: loaded Tainted: G S 5.17.0-rc1-00179-gc8c42c80febd #24
[ 8536.525105] Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020
[ 8536.525106] RIP: 0010:rcu_torture_fakewriter+0xf1/0x2d0 [rcutorture]
[ 8536.525121] Code: 00 31 d2 8d 0c f5 00 00 00 00 48 63 c9 48 f7 f1 48 85 d2 0f 84 79 ff ff ff 48 89 e7 e8 78 78 01 00 48 63 0d 29 ca 00 00 31 d2 <48> f7 f1 8b 04 95 00 05 4e a0 83 f8 06 0f 84 ad 00 00 00 7f 1f 83
[ 8536.525124] RSP: 0018:ffffc9000777fef0 EFLAGS: 00010246
[ 8536.525127] RAX: 00000000223d006e RBX: cccccccccccccccd RCX: 0000000000000000
[ 8536.525130] RDX: 0000000000000000 RSI: ffffffff824315b9 RDI: ffffc9000777fef0
[ 8536.525132] RBP: ffffc9000487bb30 R08: 0000000000000002 R09: 000000000002a580
[ 8536.525134] R10: ffffffff82c5f920 R11: 0000000000000000 R12: ffff8881a2c35d00
[ 8536.525136] R13: ffff8881540c8d00 R14: ffffffffa04d39d0 R15: 0000000000000000
[ 8536.525137] FS: 0000000000000000(0000) GS:ffff88903ff80000(0000) knlGS:0000000000000000
[ 8536.525140] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8536.525142] CR2: 00007f839f022000 CR3: 0000000002c0a006 CR4: 00000000007706e0
[ 8536.525144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8536.525145] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8536.525147] PKRU: 55555554
[ 8536.525148] Call Trace:
[ 8536.525150] <TASK>
[ 8536.525153] kthread+0xe8/0x110
[ 8536.525161] ? kthread_complete_and_exit+0x20/0x20
[ 8536.525167] ret_from_fork+0x22/0x30
[ 8536.525174] </TASK>
The solution is to gracefully handle the case of nsynctypes being 0 in
rcu_torture_fakewriter() by not performing any work. This is already being
done in rcu_torture_writer(), though there is a missing return on that path
which will be fixed in a subsequent patch.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tree RCU supports grace-period delays using the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot
parameters. These delays are strictly for debugging purposes, and have
proven quite effective at exposing bugs involving race with CPU-hotplug
operations. However, these delays can result in false positives when
used in conjunction with callback flooding, for example, those generated
by the rcutorture.fwd_progress kernel boot parameter.
This commit therefore suppresses grace-period delays while callback
flooding is in progress.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the cpu_possible_mask is sparse (for example, if bits are set only for
CPUs 0, 4, 8, ...), then rcu_tasks_invoke_cbs() will access per-CPU data
for a CPU not in cpu_possible_mask. It makes these accesses while doing
a workqueue-based binary search for non-empty callback lists. Although
this search must pass through CPUs not represented in cpu_possible_mask,
it has no need to check the callback list for such CPUs.
This commit therefore changes the rcu_tasks_invoke_cbs() function's
binary search so as to only check callback lists for CPUs present in
cpu_possible_mask.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the rcupdate.rcu_task_enqueue_lim kernel boot parameter is set to
something greater than 1 and less than nr_cpu_ids, the code attempts to
use a subset of the CPU's RCU Tasks callback lists. This works, but only
if the cpu_possible_mask is contiguous. If there are "holes" in this
mask, the callback-enqueue code might attempt to access a non-existent
per-CPU ->rtcpu variable for a non-existent CPU. For example, if only
CPUs 0, 4, 8, 12, 16 and so on are in cpu_possible_mask, specifying
rcupdate.rcu_task_enqueue_lim=4 would cause the code to attempt to
use callback queues for non-existent CPUs 1, 2, and 3. Because such
systems have existed in the past and might still exist, the code needs
to gracefully handle this situation.
This commit therefore checks to see whether the desired CPU is present
in cpu_possible_mask, and, if not, searches for the next CPU. This means
that the systems administrator of a system with a sparse cpu_possible_mask
will need to account for this sparsity when specifying the value of
the rcupdate.rcu_task_enqueue_lim kernel boot parameter. For example,
setting this parameter to the value 4 will use only CPUs 0 and 4, which
CPU 4 getting three times the callback load of CPU 0.
This commit assumes that bit (nr_cpu_ids - 1) is always set in
cpu_possible_mask.
Link: https://lore.kernel.org/lkml/CANn89iKaNEwyNZ=L_PQnkH0LP_XjLYrr_dpyRKNNoDJaWKdrmg@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the show_rcu_tasks_generic_gp_kthread() function only looks
at CPU 0's callback lists. Although this is not fatal, it can confuse
debugging efforts in cases where any of the Tasks RCU flavors are in
per-CPU queueing mode. This commit therefore causes this function to
scan all CPUs' callback queues.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The use of hrtimers for RCU-tasks grace-period delays works well in
general, but can result in excessive grace-period delays for some
corner-case workloads. This commit therefore reverts to the use of
timers for non-RT kernels to mitigate those grace-period delays.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The synchronous RCU-tasks grace-period-wait primitives invoke
schedule_timeout_idle() to give readers a chance to exit their
read-side critical sections. Unfortunately, this fails during early
boot on PREEMPT_RT because PREEMPT_RT relies solely on ksoftirqd to run
timer handlers. Because ksoftirqd cannot operate until its kthreads
are spawned, there is a brief period of time following scheduler
initialization where PREEMPT_RT cannot run the timer handlers that
schedule_timeout_idle() relies on, resulting in a hang.
To avoid this boot-time hang, this commit replaces schedule_timeout_idle()
with schedule_hrtimeout(), so that the timer expires in hardirq context.
This is ensures that the timer fires even on PREEMPT_RT throughout the
irqs-enabled portions of boot as well as during runtime.
The timer is set to expire between fract and fract + HZ / 2 jiffies in
order to align with any other timers that might expire during that time,
thus reducing the number of wakeups.
Note that RCU-tasks grace periods are infrequent, so the use of hrtimer
should be fine. In contrast, in common-case code, user of hrtimer
could result in performance issues.
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The main Tasks RCU quiescent state is voluntary context switch. However,
userspace execution is also a valid quiescent state, and is a valuable one
for userspace applications that spin repeatedly executing light-weight
non-sleeping system calls. Currently, such an application can delay a
Tasks RCU grace period for many tens of seconds.
This commit therefore enlists the aid of the scheduler-clock interrupt to
provide a Tasks RCU quiescent state when it interrupted a task executing
in userspace.
[ paulmck: Apply feedback from kernel test robot. ]
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Neil Spring <ntspring@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The waitqueue used by rcu_tasks_kthread() has always only one waiter.
With a guaranteed only one waiter, this can be replaced with rcuwait
which is smaller and simpler. With rcuwait based wake counterpart, the
irqwork function (call_rcu_tasks_iw_wakeup()) can be invoked hardirq
context because it is only a wake up and no sleeping locks are involved
(unlike the wait_queue_head).
As a side effect, this is also one piece of the puzzle to pass the RCU
selftest at early boot on PREEMPT_RT.
Replace wait_queue_head with rcuwait and let the irqwork run in hardirq
context on PREEMPT_RT.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU-tasks stall-warning messages are printed after the grace period is ten
minutes old. Unfortunately, most of us will have rebooted the system in
response to an apparently-hung command long before the ten minutes is up,
and will thus see what looks to be a silent hang.
This commit therefore adds pr_info() messages that are printed earlier.
These should avoid being classified as errors, but should give impatient
users a hint. These are controlled by new rcupdate.rcu_task_stall_info
and rcupdate.rcu_task_stall_info_mult kernel-boot parameters. The former
defines the initial delay in jiffies (defaulting to 10 seconds) and the
latter defines the multiplier (defaulting to 3). Thus, by default, the
first message will appear 10 seconds into the RCU-tasks grace period,
the second 40 seconds in, and the third 160 seconds in. There would be
a fourth at 640 seconds in, but the stall warning message appears 600
seconds in, and once a stall warning is printed for a given grace period,
no further informational messages are printed.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
While booting secondary CPUs, cpus_read_[lock/unlock] is not keeping
online cpumask stable. The transient online mask results in below
calltrace.
[ 0.324121] CPU1: Booted secondary processor 0x0000000001 [0x410fd083]
[ 0.346652] Detected PIPT I-cache on CPU2
[ 0.347212] CPU2: Booted secondary processor 0x0000000002 [0x410fd083]
[ 0.377255] Detected PIPT I-cache on CPU3
[ 0.377823] CPU3: Booted secondary processor 0x0000000003 [0x410fd083]
[ 0.379040] ------------[ cut here ]------------
[ 0.383662] WARNING: CPU: 0 PID: 10 at kernel/workqueue.c:3084 __flush_work+0x12c/0x138
[ 0.384850] Modules linked in:
[ 0.385403] CPU: 0 PID: 10 Comm: rcu_tasks_rude_ Not tainted 5.17.0-rc3-v8+ #13
[ 0.386473] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[ 0.387289] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 0.388308] pc : __flush_work+0x12c/0x138
[ 0.388970] lr : __flush_work+0x80/0x138
[ 0.389620] sp : ffffffc00aaf3c60
[ 0.390139] x29: ffffffc00aaf3d20 x28: ffffffc009c16af0 x27: ffffff80f761df48
[ 0.391316] x26: 0000000000000004 x25: 0000000000000003 x24: 0000000000000100
[ 0.392493] x23: ffffffffffffffff x22: ffffffc009c16b10 x21: ffffffc009c16b28
[ 0.393668] x20: ffffffc009e53861 x19: ffffff80f77fbf40 x18: 00000000d744fcc9
[ 0.394842] x17: 000000000000000b x16: 00000000000001c2 x15: ffffffc009e57550
[ 0.396016] x14: 0000000000000000 x13: ffffffffffffffff x12: 0000000100000000
[ 0.397190] x11: 0000000000000462 x10: ffffff8040258008 x9 : 0000000100000000
[ 0.398364] x8 : 0000000000000000 x7 : ffffffc0093c8bf4 x6 : 0000000000000000
[ 0.399538] x5 : 0000000000000000 x4 : ffffffc00a976e40 x3 : ffffffc00810444c
[ 0.400711] x2 : 0000000000000004 x1 : 0000000000000000 x0 : 0000000000000000
[ 0.401886] Call trace:
[ 0.402309] __flush_work+0x12c/0x138
[ 0.402941] schedule_on_each_cpu+0x228/0x278
[ 0.403693] rcu_tasks_rude_wait_gp+0x130/0x144
[ 0.404502] rcu_tasks_kthread+0x220/0x254
[ 0.405264] kthread+0x174/0x1ac
[ 0.405837] ret_from_fork+0x10/0x20
[ 0.406456] irq event stamp: 102
[ 0.406966] hardirqs last enabled at (101): [<ffffffc0093c8468>] _raw_spin_unlock_irq+0x78/0xb4
[ 0.408304] hardirqs last disabled at (102): [<ffffffc0093b8270>] el1_dbg+0x24/0x5c
[ 0.409410] softirqs last enabled at (54): [<ffffffc0081b80c8>] local_bh_enable+0xc/0x2c
[ 0.410645] softirqs last disabled at (50): [<ffffffc0081b809c>] local_bh_disable+0xc/0x2c
[ 0.411890] ---[ end trace 0000000000000000 ]---
[ 0.413000] smp: Brought up 1 node, 4 CPUs
[ 0.413762] SMP: Total of 4 processors activated.
[ 0.414566] CPU features: detected: 32-bit EL0 Support
[ 0.415414] CPU features: detected: 32-bit EL1 Support
[ 0.416278] CPU features: detected: CRC32 instructions
[ 0.447021] Callback from call_rcu_tasks_rude() invoked.
[ 0.506693] Callback from call_rcu_tasks() invoked.
This commit therefore fixes this issue by applying a single-CPU
optimization to the RCU Tasks Rude grace-period process. The key point
here is that the purpose of this RCU flavor is to force a schedule on
each online CPU since some past event. But the rcu_tasks_rude_wait_gp()
function runs in the context of the RCU Tasks Rude's grace-period kthread,
so there must already have been a context switch on the current CPU since
the call to either synchronize_rcu_tasks_rude() or call_rcu_tasks_rude().
So if there is only a single CPU online, RCU Tasks Rude's grace-period
kthread does not need to anything at all.
It turns out that the rcu_tasks_rude_wait_gp() function's call to
schedule_on_each_cpu() causes problems during early boot. During that
time, there is only one online CPU, namely the boot CPU. Therefore,
applying this single-CPU optimization fixes early-boot instances of
this problem.
Link: https://lore.kernel.org/lkml/20220210184319.25009-1-treasure4paddy@gmail.com/T/
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Padmanabha Srinivasaiah <treasure4paddy@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_spawn_gp_kthread() function is called as an early initcall, which
means that SMP initialization hasn't happened yet and only the boot CPU is
online. Therefore, create only the NOCB kthreads related to the boot CPU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_spawn_gp_kthread() function is called as an early initcall,
which means that SMP initialization hasn't happened yet and only the
boot CPU is online. Therefore, create only the boost kthread for the
leaf node of the boot CPU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_init() function is called way before SMP is initialized and
therefore only the boot CPU should be online at this stage.
Simplify the boot per-cpu initialization accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit moves the RCU nocb initialization witness within rcu_state
to consolidate RCU's global state.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_is_nocb_cpu() function is no longer used, so this commmit
removes it.
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit instruments the acquisitions of the srcu_struct structure's
->lock, enabling the initiation of a transition from SRCU_SIZE_SMALL
to SRCU_SIZE_BIG when sufficient contention is experienced. The
instrumentation counts the number of trylock failures within the confines
of a single jiffy. If that number exceeds the value specified by the
srcutree.small_contention_lim kernel boot parameter (which defaults to
100), and if the value specified by the srcutree.convert_to_big kernel
boot parameter has the 0x10 bit set (defaults to 0), then a transition
will be automatically initiated.
By default, there will never be any transitions, so that none of the
srcu_struct structures ever gains an srcu_node array.
The useful values for srcutree.convert_to_big are:
0x00: Never convert.
0x01: Always convert at init_srcu_struct() time.
0x02: Convert when rcutorture prints its first round of statistics.
0x03: Decide conversion approach at boot given system size.
0x10: Convert if contention is encountered.
0x12: Convert if contention is encountered or when rcutorture prints
its first round of statistics, whichever comes first.
The value 0x11 acts the same as 0x01 because the conversion happens
before there is any chance of contention.
[ paulmck: Apply "static" feedback from kernel test robot. ]
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Once there are contention-initiated size transitions, it will be
possible for rcutorture to initiate a transition at the same time
as a contention-initiated transition. This commit therefore creates
a concurrency-safe helper function named srcu_transition_to_big() to
safely initiate size transitions.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a comment explaining why an unprotected call to
list_add() from srcu_funnel_gp_start() can be safe. TL;DR: It is only
called during very early boot when we don't have no steeking concurrency!
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When an srcu_struct structure is created (but not in a kernel module)
by DEFINE_SRCU() and friends, the per-CPU srcu_data structure is
statically allocated. In all other cases, that structure is obtained
from alloc_percpu(), in which case cleanup_srcu_struct() must invoke
free_percpu() on the resulting ->sda pointer in the srcu_struct pointer.
Which it does.
Except that it also invokes free_percpu() on the ->sda pointer
referencing the statically allocated per-CPU srcu_data structures.
Which free_percpu() is surprisingly OK with.
This commit nevertheless stops cleanup_srcu_struct() from freeing
statically allocated per-CPU srcu_data structures.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
You really shouldn't invoke srcu_torture_stats_print() after invoking
cleanup_srcu_struct(), but there is really no reason to get a
compiler-obfuscated per-CPU-variable NULL pointer dereference as the
diagnostic. This commit therefore checks for NULL ->sda and makes a
more polite console-message complaint in that case.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds an srcu_tree.convert_to_big kernel parameter that either
refuses to convert at all (0), converts immediately at init_srcu_struct()
time (1), or lets rcutorture convert it (2). An addition contention-based
dynamic conversion choice will be added, along with documentation.
[ paulmck: Apply callback-scanning feedback from Neeraj Upadhyay. ]
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
For configurations where snp node tree is not initialized at
init time (added in subsequent commits), srcu_funnel_gp_start()
and srcu_funnel_exp_start() can potential traverse and observe
the snp nodes' transient (uninitialized) states. This can potentially
happen, when init_srcu_struct_nodes() initialization of sdp->mynode
races with srcu_funnel_gp_start() and srcu_funnel_exp_start()
Consider the case below where srcu_funnel_gp_start() observes
sdp->mynode to be not NULL and uses an uninitialized sdp->grpmask
P1 P2
init_srcu_struct_nodes() void srcu_funnel_gp_start(...)
{
for_each_possible_cpu(cpu) {
...
sdp->mynode = &snp_first[...];
for (snp = sdp->mynode;...) struct srcu_node *snp_leaf =
smp_load_acquire(&sdp->mynode)
... if (snp_leaf) {
for (snp = snp_leaf; ...)
...
if (snp == snp_leaf)
snp->srcu_data_have_cbs[idx] |=
sdp->grpmask;
sdp->grpmask =
1 << (cpu - sdp->mynode->grplo);
}
}
Similarly, init_srcu_struct_nodes() and srcu_funnel_exp_start() can
race, where srcu_funnel_exp_start() could observe state of snp lock
before spin_lock_init().
P1 P2
init_srcu_struct_nodes() void srcu_funnel_exp_start(...)
{
srcu_for_each_node_breadth_first(ssp, snp) { for (; ...) {
spin_lock_...(snp, )
spin_lock_init(&ACCESS_PRIVATE(snp, lock));
...
}
for_each_possible_cpu(cpu) {
...
sdp->mynode = &snp_first[...];
To avoid these issues, ensure that snp node tree initialization is
complete i.e. after SRCU_SIZE_WAIT_BARRIER srcu_size_state is reached,
before traversing the tree. Given that srcu_funnel_gp_start() and
srcu_funnel_exp_start() are called within SRCU read side critical
sections, this check is safe, in the sense that all callbacks are
enqueued on CPU0 srcu_cblist until SRCU_SIZE_WAIT_CALL is entered,
and these read side critical sections (containing srcu_funnel_gp_start()
and srcu_funnel_exp_start()) need to complete, before SRCU_SIZE_WAIT_CALL
is reached.
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, tree SRCU relies on the srcu_node structures being initialized
at the same time that the srcu_struct itself is initialized, and thus
use the initial grace-period sequence number as the initial value for
the srcu_node structure's ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp
fields. Although this has a high probability of also working when the
srcu_node array is allocated and initialized at some random later time,
it would be better to avoid leaving such things to chance.
This commit therefore initializes these fields with 0x2, which is a
recognizable invalid value. It then adds the required checks for this
invalid value in order to avoid confusion on long-running kernels
(especially those on 32-bit systems) that allocate and initialize
srcu_node arrays late in life.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, srcu_funnel_gp_start() tests snp->srcu_have_cbs[idx] and then
separately assigns it to the snp_seq local variable. This commit does
the assignment earlier to simplify the code a bit. While in the area,
this commit also takes advantage of the 100-character line limit to put
the call to srcu_schedule_cbs_sdp() on a single line.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds the numeric and string version of ->srcu_size_state to
the Tree-SRCU-specific portion of the rcutorture output.
[ paulmck: Apply feedback from kernel test robot and Dan Carpenter. ]
[ quic_neeraju: Apply feedback from Jiapeng Chong. ]
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This is just dead code at the moment, and will be used once
the state-transition code is activated.
Because srcu_barrier() must be aware of transition before call_srcu(), the
state machine waits for an SRCU grace period before callbacks are queued
to the non-CPU-0 queues. This requres that portions of srcu_barrier()
be enclosed in an SRCU read-side critical section.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit shrinks the srcu_struct structure by converting its ->node
field from a fixed-size compile-time array to a pointer to a dynamically
allocated array. In kernels built with large values of NR_CPUS that boot
on systems with smaller numbers of CPUs, this can save significant memory.
[ paulmck: Apply kernel test robot feedback. ]
Reported-by: A cast of thousands
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit makes Tree SRCU able to operate without an snp_node
array, that is, when the srcu_data structures' ->mynode pointers
are NULL. This can result in high contention on the srcu_struct
structure's ->lock, but only when there are lots of call_srcu(),
synchronize_srcu(), and synchronize_srcu_expedited() calls.
Note that when there is no snp_node array, all SRCU callbacks use
CPU 0's callback queue. This is optimal in the common case of low
update-side load because it removes the need to search each CPU
for the single callback that made the grace period happen.
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the srcu_funnel_gp_start() walks its local variable snp up the
tree and reloads sdp->mynode whenever it is necessary to check whether
it is still at the leaf srcu_node level. This works, but is a bit more
obtuse than absolutely necessary. In addition, upcoming commits will
dynamically size srcu_struct structures, in which case sdp->mynode will
no longer necessarily be a constant, and this commit helps prepare for
that dynamic sizing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, cleanup_srcu_struct() checks for a grace period in progress,
but it does not check for a grace period that has not yet started but
which might start at any time. Such a situation could result in a
use-after-free bug, so this commit adds a check for a grace period that
is needed but not yet started to cleanup_srcu_struct().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
move rt_period/runtime sysctls to rt.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
- Cleanups for SCHED_DEADLINE
- Tracing updates/fixes
- CPU Accounting fixes
- First wave of changes to optimize the overhead of the scheduler build,
from the fast-headers tree - including placeholder *_api.h headers for
later header split-ups.
- Preempt-dynamic using static_branch() for ARM64
- Isolation housekeeping mask rework; preperatory for further changes
- NUMA-balancing: deal with CPU-less nodes
- NUMA-balancing: tune systems that have multiple LLC cache domains per node (eg. AMD)
- Updates to RSEQ UAPI in preparation for glibc usage
- Lots of RSEQ/selftests, for same
- Add Suren as PSI co-maintainer
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI5rg8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hGrw/+M3QOk6fH7G48wjlNnBvcOife6ls+Ni4k
ixOAcF4JKoixO8HieU5vv0A7yf/83tAa6fpeXeMf1hkCGc0NSlmLtuIux+WOmoAL
LzCyDEYfiP8KnVh0A1Tui/lK0+AkGo21O6ADhQE2gh8o2LpslOHQMzvtyekSzeeb
mVxMYQN+QH0m518xdO2D8IQv9ctOYK0eGjmkqdNfntOlytypPZHeNel/tCzwklP/
dElJUjNiSKDlUgTBPtL3DfpoLOI/0mHF2p6NEXvNyULxSOqJTu8pv9Z2ADb2kKo1
0D56iXBDngMi9MHIJLgvzsA8gKzHLFSuPbpODDqkTZCa28vaMB9NYGhJ643NtEie
IXTJEvF1rmNkcLcZlZxo0yjL0fjvPkczjw4Vj27gbrUQeEBfb4mfuI4BRmij63Ep
qEkgQTJhduCqqrQP1rVyhwWZRk1JNcVug+F6N42qWW3fg1xhj0YSrLai2c9nPez6
3Zt98H8YGS1Z/JQomSw48iGXVqfTp/ETI7uU7jqHK8QcjzQ4lFK5H4GZpwuqGBZi
NJJ1l97XMEas+rPHiwMEN7Z1DVhzJLCp8omEj12QU+tGLofxxwAuuOVat3CQWLRk
f80Oya3TLEgd22hGIKDRmHa22vdWnNQyS0S15wJotawBzQf+n3auS9Q3/rh979+t
ES/qvlGxTIs=
=Z8uT
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Cleanups for SCHED_DEADLINE
- Tracing updates/fixes
- CPU Accounting fixes
- First wave of changes to optimize the overhead of the scheduler
build, from the fast-headers tree - including placeholder *_api.h
headers for later header split-ups.
- Preempt-dynamic using static_branch() for ARM64
- Isolation housekeeping mask rework; preperatory for further changes
- NUMA-balancing: deal with CPU-less nodes
- NUMA-balancing: tune systems that have multiple LLC cache domains per
node (eg. AMD)
- Updates to RSEQ UAPI in preparation for glibc usage
- Lots of RSEQ/selftests, for same
- Add Suren as PSI co-maintainer
* tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
sched/headers: ARM needs asm/paravirt_api_clock.h too
sched/numa: Fix boot crash on arm64 systems
headers/prep: Fix header to build standalone: <linux/psi.h>
sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
cgroup: Fix suspicious rcu_dereference_check() usage warning
sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers
sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
sched/deadline,rt: Remove unused functions for !CONFIG_SMP
sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently
sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file
sched/deadline: Remove unused def_dl_bandwidth
sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
sched/tracing: Don't re-read p->state when emitting sched_switch event
sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
sched/cpuacct: Remove redundant RCU read lock
sched/cpuacct: Optimize away RCU read lock
sched/cpuacct: Fix charge percpu cpuusage
sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
...
This pull request contains the following branches:
exp.2022.02.24a: Contains a fix for idle detection from Neeraj Upadhyay
and missing access marking detected by KCSAN.
fixes.2022.02.14a: Miscellaneous fixes.
rcu_barrier.2022.02.08a: Reduces coupling between rcu_barrier() and
CPU-hotplug operations, so that rcu_barrier() no longer needs
to do cpus_read_lock(). This may also someday allow system
boot to bring CPUs online concurrently.
rcu-tasks.2022.02.08a: Enable more aggressive movement to per-CPU
queueing when reacting to excessive lock contention due
to workloads placing heavy update-side stress on RCU tasks.
rt.2022.02.01b: Improvements to RCU priority boosting, including
changes from Neeraj Upadhyay, Zqiang, and Alison Chaiken.
torture.2022.02.01b: Various fixes improving test robustness and
debug information.
torturescript.2022.02.08a: Add tests for SRCU size transitions, further
compress torture.sh build products, and improve debug output.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmIusb0THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jAklD/9VXLK7crcg2YeRXUIg1IOdnancsVCV
MNtTfxNYqYIis+W2UfuHKuQu2yEXF5fihdY0J9TQv0byHsprp6FIZT+i1An4Ukgd
0vyHjd/DaIKgs2txsB1DjhlatWlJUfQuBwhtNUkpYFLFwKdCI1l813bPbNlL+GiL
p0ZejVMpBC5HgE6sDOtaaQSAB+AEUp+Lgr+yaG/On8hfzwWFKO8KldxhiKY9n07v
SNDfKDgXB+80hx4RBVGbkuogV3s9brFULoNRXJy7Uf79DtiY09uazhhA3G0TjO34
zGwmF91dqsXDF/Uz8g4aZO0xYRXUchOrsQ5lgO/GhTVbM9I0wWlMHEk/8WHyBJkU
vlXOMuwzBc9/5uwZE3rnkA4a3nkXhPQjLlCr+/I7A/7Vsv9IBW9WSlgMvUN0Qf4S
XAwTnIqfErnR60a+L0+HRr5kIV5VoXcxqI/Nv0/4/BMLRubS/c7cYjOTxXNJL9SU
50pv5vty9xk3HSpuz0JAOyLf+PUT773uUQhFr5xCBSCVqbAm5WFg6hWPAgrN/tUS
wstBc0wlA73rKVJxeLDQwHc/oT1zTUEzswVZITQ5zLHK0t0GbeR6QHccsdeaJyTe
DisX+66A6YQrEuJmx5xUZqjYHqtYLDOBTbHA3ZwQmvjKu8ibWZ8Fg9ioURLCS4bF
+FVkp/5KdcAN9w==
=ljVY
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Fix idle detection (Neeraj Upadhyay) and missing access marking
detected by KCSAN.
- Reduce coupling between rcu_barrier() and CPU-hotplug operations, so
that rcu_barrier() no longer needs to do cpus_read_lock(). This may
also someday allow system boot to bring CPUs online concurrently.
- Enable more aggressive movement to per-CPU queueing when reacting to
excessive lock contention due to workloads placing heavy update-side
stress on RCU tasks.
- Improvements to RCU priority boosting, including changes from Neeraj
Upadhyay, Zqiang, and Alison Chaiken.
- Various fixes improving test robustness and debug information.
- Add tests for SRCU size transitions, further compress torture.sh
build products, and improve debug output.
- Miscellaneous fixes.
* tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
rcu: Replace cpumask_weight with cpumask_empty where appropriate
rcu: Remove __read_mostly annotations from rcu_scheduler_active externs
rcu: Uninline multi-use function: finish_rcuwait()
rcu: Mark writes to the rcu_segcblist structure's ->flags field
kasan: Record work creation stack trace with interrupts enabled
rcu: Inline __call_rcu() into call_rcu()
rcu: Add mutex for rcu boost kthread spawning and affinity setting
rcu: Fix description of kvfree_rcu()
MAINTAINERS: Add Frederic and Neeraj to their RCU files
rcutorture: Provide non-power-of-two Tasks RCU scenarios
rcutorture: Test SRCU size transitions
torture: Make torture.sh help message match reality
rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
rcu-tasks: Use order_base_2() instead of ilog2()
rcu: Create and use an rcu_rdp_cpu_online()
rcu: Make rcu_barrier() no longer block CPU-hotplug operations
rcu: Rework rcu_barrier() and callback-migration logic
rcu: Refactor rcu_barrier() empty-list handling
rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion
torture: Change KVM environment variable to RCUTORTURE
...
With the removal of CONFIG_RCU_FAST_NO_HZ, the parameters in
rcu_needs_cpu() are not necessary anymore. Simply remove them.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmISrYgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg20IAKDZr7rfSHBopjQV
Cocw744tom0XuxpvSZpp2GGOOXF+tkswcNNaRIrbGOl1mkyxA7eBZCTMpDeDS9aQ
wB0D0Gxx8QBAJp4KgB1W7TB+hIGes/rs8Ve+6iO4ulLLdCVWX/q2boI0aZ7QX9O9
qNi8OsoZQtk6falRvciZFHwV5Av1p2Sy1AW57udQ7DvJ4H98AfKf1u8/z208WWW8
1ixC+qJxQcUcM9vI+7P9Tt7NbFSKv8SvAmqjFY7P+DxQAsVw6KXoqVXykDzeOv0t
fUNOE/t0oFZafwtn8h7KBQnwS9lH03+3KkslVZs+iMFyUj/Bar+NVVyKoDhWXtVg
/PuMhEg=
=eU1o
-----END PGP SIGNATURE-----
Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts
New conflicts in sched/core due to the following upstream fixes:
44585f7bc0 ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
a06247c680 ("psi: Fix uaf issue when psi trigger is destroyed while being polled")
Conflicts:
include/linux/psi_types.h
kernel/sched/psi.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Refer to housekeeping APIs using single feature types instead of flags.
This prevents from passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation features will have their own cpumask.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
In some places, RCU code calls cpumask_weight() to check if any bit of a
given cpumask is set. We can do it more efficiently with cpumask_empty()
because cpumask_empty() stops traversing the cpumask as soon as it finds
first set bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This is a rarely used function, so uninlining its 3 instructions
is probably a win or a wash - but the main motivation is to
make <linux/rcuwait.h> independent of task_struct details.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
KCSAN reports data races between the rcu_segcblist_clear_flags() and
rcu_segcblist_set_flags() functions, though misreporting the latter
as a call to rcu_segcblist_is_enabled() from call_rcu(). This commit
converts the updates of this field to WRITE_ONCE(), relying on the
resulting unmarked reads to continue to detect buggy concurrent writes
to this field.
Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Recording the work creation stack trace for KASAN reports in
call_rcu() is expensive, due to unwinding the stack, but also
due to acquiring depot_lock inside stackdepot (which may be contended).
Because calling kasan_record_aux_stack_noalloc() does not require
interrupts to already be disabled, this may unnecessarily extend
the time with interrupts disabled.
Therefore, move calling kasan_record_aux_stack() before the section
with interrupts disabled.
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Because __call_rcu() is invoked only by call_rcu(), this commit inlines
the former into the latter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
As we handle parallel CPU bringup, we will need to take care to avoid
spawning multiple boost threads, or race conditions when setting their
affinity. Spotted by Paul McKenney.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, call_rcu_tasks_generic() sets ->percpu_enqueue_shift to
order_base_2(nr_cpu_ids) upon encountering sufficient contention.
This does not shift to use of non-CPU-0 callback queues as intended, but
rather continues using only CPU 0's queue. Although this does provide
some decrease in contention due to spreading work over multiple locks,
it is not the dramatic decrease that was intended.
This commit therefore makes call_rcu_tasks_generic() set
->percpu_enqueue_shift to 0.
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The ilog2() function can be used to generate a shift count, but it will
generate the same count for a power of two as for one greater than a power
of two. This results in shift counts that are larger than necessary for
systems with a power-of-two number of CPUs because the CPUs are numbered
from zero, so that the maximum CPU number is one less than that power
of two.
This commit therefore substitutes order_base_2(), which appears to have
been designed for exactly this use case.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The pattern "rdp->grpmask & rcu_rnp_online_cpus(rnp)" occurs frequently
in RCU code in order to determine whether rdp->cpu is online from an
RCU perspective. This commit therefore creates an rcu_rdp_cpu_online()
function to replace it.
[ paulmck: Apply kernel test robot unused-variable feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit removes the cpus_read_lock() and cpus_read_unlock() calls
from rcu_barrier(), thus allowing CPUs to come and go during the course
of rcu_barrier() execution. Posting of the ->barrier_head callbacks does
synchronize with portions of RCU's CPU-hotplug notifiers, but these locks
are held for short time periods on both sides. Thus, full CPU-hotplug
operations could both start and finish during the execution of a given
rcu_barrier() invocation.
Additional synchronization is provided by a global ->barrier_lock.
Since the ->barrier_lock is only used during rcu_barrier() execution and
during onlining/offlining a CPU, the contention for this lock should
be low. It might be tempting to make use of a per-CPU lock just on
general principles, but straightforward attempts to do this have the
problems shown below.
Initial state: 3 CPUs present, CPU 0 and CPU1 do not have
any callback and CPU2 has callbacks.
1. CPU0 calls rcu_barrier().
2. CPU1 starts offlining for CPU2. CPU1 calls
rcutree_migrate_callbacks(). rcu_barrier_entrain() is called
from rcutree_migrate_callbacks(), with CPU2's rdp->barrier_lock.
It does not entrain ->barrier_head for CPU2, as rcu_barrier()
on CPU0 hasn't started the barrier sequence (by calling
rcu_seq_start(&rcu_state.barrier_sequence)) yet.
3. CPU0 starts new barrier sequence. It iterates over
CPU0 and CPU1, after acquiring their per-cpu ->barrier_lock
and finds 0 segcblist length. It updates ->barrier_seq_snap
for CPU0 and CPU1 and continues loop iteration to CPU2.
for_each_possible_cpu(cpu) {
raw_spin_lock_irqsave(&rdp->barrier_lock, flags);
if (!rcu_segcblist_n_cbs(&rdp->cblist)) {
WRITE_ONCE(rdp->barrier_seq_snap, gseq);
raw_spin_unlock_irqrestore(&rdp->barrier_lock, flags);
rcu_barrier_trace(TPS("NQ"), cpu, rcu_state.barrier_sequence);
continue;
}
4. rcutree_migrate_callbacks() completes execution on CPU1.
Segcblist len for CPU2 becomes 0.
5. The loop iteration on CPU0, checks rcu_segcblist_n_cbs(&rdp->cblist)
for CPU2 and completes the loop iteration after setting
->barrier_seq_snap.
6. As there isn't any ->barrier_head callback entrained; at
this point, rcu_barrier() in CPU0 returns.
7. The callbacks, which migrated from CPU2 to CPU1, execute.
Straightforward per-CPU locking is also subject to the following race
condition noted by Boqun Feng:
1. CPU0 calls rcu_barrier(), starting a new barrier sequence by invoking
rcu_seq_start() and init_completion(), but does not yet initialize
rcu_state.barrier_cpu_count.
2. CPU1 starts offlining for CPU2, calling rcutree_migrate_callbacks(),
which in turn calls rcu_barrier_entrain() holding CPU2's.
rdp->barrier_lock. It then entrains ->barrier_head for CPU2
and atomically increments rcu_state.barrier_cpu_count, which is
unfortunately not yet initialized to the value 2.
3. The just-entrained RCU callback is invoked. It atomically
decrements rcu_state.barrier_cpu_count and sees that it is
now zero. This callback therefore invokes complete().
4. CPU0 continues executing rcu_barrier(), but is not blocked
by its call to wait_for_completion(). This results in rcu_barrier()
returning before all pre-existing callbacks have been invoked,
which is a bug.
Therefore, synchronization is provided by rcu_state.barrier_lock,
which is also held across the initialization sequence, especially the
rcu_seq_start() and the atomic_set() that sets rcu_state.barrier_cpu_count
to the value 2. In addition, this lock is held when entraining the
rcu_barrier() callback, when deciding whether or not a CPU has callbacks
that rcu_barrier() must wait on, when setting the ->qsmaskinitnext for
incoming CPUs, and when migrating callbacks from a CPU that is going
offline.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit reworks rcu_barrier() and callback-migration logic to
permit allowing rcu_barrier() to run concurrently with CPU-hotplug
operations. The key trick is for callback migration to check to see if
an rcu_barrier() is in flight, and, if so, enqueue the ->barrier_head
callback on its behalf.
This commit adds synchronization with RCU's CPU-hotplug notifiers. Taken
together, this will permit a later commit to remove the cpus_read_lock()
and cpus_read_unlock() calls from rcu_barrier().
[ paulmck: Updated per kbuild test robot feedback. ]
[ paulmck: Updated per reviews session with Neeraj, Frederic, Uladzislau, and Boqun. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit saves a few lines by checking first for an empty callback
list. If the callback list is empty, then that CPU is taken care of,
regardless of its online or nocb state. Also simplify tracing accordingly
and fold a few lines together.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If we allow architectures to bring APs online in parallel, then we end
up requiring rcu_cpu_starting() to be reentrant. But currently, the
manipulation of rnp->ofl_seq is not thread-safe.
However, rnp->ofl_seq is also fairly much pointless anyway since both
rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
fairly much the whole time that rnp->ofl_seq is set to an odd number
to indicate that an operation is in progress.
So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.
This has a couple of minor complexities: lockdep will complain when we
take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
avoid that false positive complaint. Since we're killing rnp->ofl_seq
of course that 'excuse' has to be changed too, so make it check for
arch_spin_is_locked(rcu_state.ofl_lock).
There's no arch_spin_lock_irqsave() so we have to manually save and
restore local interrupts around the locking.
At Paul's request based on Neeraj's analysis, make rcu_gp_init not just
wait but *exclude* any CPU online/offline activity, which was fairly
much true already by virtue of it holding rcu_state.ofl_lock.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit allows up to 50,000 callbacks worth of callback-flooding
tests of SRCU. The goal of this change is to exercise Tree SRCU's
ability to transition from SRCU_SIZE_SMALL to SRCU_SIZE_BIG triggered
by callback-queue-time lock contention.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_torture_fwd_cb_hist() function acquires rcu_fwd_mutex, but is
invoked from rcutorture_oom_notify() function, which hold this same
mutex across this call. This commit fixes the resulting deadlock.
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The second and subsequent forward-progress kthreads loop waiting for
the first forward-progress kthread to start the next test interval.
Unfortunately, if the test ends while one of those kthreads is waiting,
the test will hang. This hang occurs because that wait loop fails to
check for the end of the test. This commit therefore adds an end-of-test
check to that wait loop.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Back when only one rcutorture kthread could do forward-progress testing,
it was just fine for rcu_fwd_cb_nodelay to be a non-atomic bool. It was
set at the start of forward-progress testing and cleared at the end.
But now that there are multiple threads, the value can be cleared while
one of the threads is still doing forward-progress testing. This commit
therefore makes rcu_fwd_cb_nodelay be an atomic counter, replacing the
WRITE_ONCE() operations with atomic_inc() and atomic_dec().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a few pr_alert() calls to rcutorture's forward-progress
testing in order to better diagnose shutdown-time hangs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The various ->cb_barrier() functions, for example, rcu_barrier(),
sometimes cause rcutorture hangs. But currently, the last console
message is the unenlightening "Stopping rcu_torture_stats". This commit
therefore prints a message of the form "rcu_torture_cleanup: Invoking
rcu_barrier+0x0/0x1e0()" to help point people in the right direction.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the rcutree.use_softirq kernel boot parameter is set to zero, all
RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads.
If these kthreads are being starved, quiescent states will not be
reported, which in turn means that the grace period will not end, which
can in turn trigger RCU CPU stall warnings. This commit therefore dumps
stack traces of stalled CPUs' rcuc kthreads, which can help identify
what is preventing those kthreads from running.
Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently rcu_preempt_deferred_qs_irqrestore() releases rnp->boost_mtx
before reporting the expedited quiescent state. Under heavy real-time
load, this can result in this function being preempted before the
quiescent state is reported, which can in turn prevent the expedited grace
period from completing. Tim Murray reports that the resulting expedited
grace periods can take hundreds of milliseconds and even more than one
second, when they should normally complete in less than a millisecond.
This was fine given that there were no particular response-time
constraints for synchronize_rcu_expedited(), as it was designed
for throughput rather than latency. However, some users now need
sub-100-millisecond response-time constratints.
This patch therefore follows Neeraj's suggestion (seconded by Tim and
by Uladzislau Rezki) of simply reversing the two operations.
Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Joel Fernandes <joelaf@google.com>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Tim Murray <timmurray@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: <stable@vger.kernel.org> # 5.4.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When CONFIG_PREEMPT_RT=y, the rcutree.kthread_prio command-line
parameter signals initialization code to boost the priority of rcuc
callbacks to the designated value. With the additional
CONFIG_RCU_NOCB_CPU=y configuration and an additional rcu_nocbs
command-line parameter, the callbacks on the listed cores are
offloaded to new rcuop kthreads that are not pinned to the cores whose
post-grace-period work is performed. While the rcuop kthreads perform
the same function as the rcuc kthreads they offload, the kthread_prio
parameter only boosts the priority of the rcuc kthreads. Fix this
inconsistency by elevating rcuop kthreads to the same priority as the rcuc
kthreads.
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The priority of RCU grace period threads is set to kthread_prio when
they are launched from rcu_spawn_gp_kthread(). The same is not true
of rcu_spawn_one_nocb_kthread(). Accordingly, add priority elevation
to rcu_spawn_one_nocb_kthread().
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Move the bounds-check of the kthread_prio cmdline parameter to a new
function in order to faciliate a different callsite.
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The per-CPU "rcuc" kthreads are used only by kernels booted with
rcutree.use_softirq=0, but they are nevertheless unconditionally created
by kernels built with CONFIG_RCU_BOOST=y. This results in "rcuc"
kthreads being created that are never actually used. This commit
therefore refrains from creating these kthreads unless the kernel
is actually booted with rcutree.use_softirq=0.
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When multiple CPUs in the same nocb gp/cb group concurrently
come online, they might try to concurrently create the same
rcuog kthread. Fix this by using nocb gp CPU's spawn mutex to
provide mutual exclusion for the rcuog kthread creation code.
[ paulmck: Whitespace fixes per kernel test robot feedback. ]
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The boost_starttime shared variable has conflicting unmarked C-language
accesses, which are dangerous at best. This commit therefore adds
appropriate marking. This was found by KCSAN.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a READ_ONCE() to an access to the rcu_node structure's
->expmask field to prevent compiler mischief. Detected by KCSAN.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
For PREEMPT_RCU, the rcu_exp_handler() function checks
whether the current CPU is in idle, by calling
rcu_dynticks_curr_cpu_in_eqs(). However, rcu_exp_handler()
is called in IPI handler context. So, it should be checking
the idle context using rcu_is_cpu_rrupt_from_idle(). Fix this
by using rcu_is_cpu_rrupt_from_idle() instead of
rcu_dynticks_curr_cpu_in_eqs(). Non-preempt configuration
already uses the correct check.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The ->percpu_enqueue_shift field is used to map from the running CPU
number to the index of the corresponding callback list. This mapping
can change at runtime in response to varying callback load, resulting
in varying levels of contention on the callback-list locks.
Unfortunately, the initial value of this field is correct only if the
system happens to have a power-of-two number of CPUs, otherwise the
callbacks from the high-numbered CPUs can be placed into the callback list
indexed by 1 (rather than 0), and those index-1 callbacks will be ignored.
This can result in soft lockups and hangs.
This commit therefore corrects this mapping, adding one to this shift
count as needed for systems having odd numbers of CPUs.
Fixes: 7a30871b6a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
The rcu_spawn_one_nocb_kthread() function is called only from
rcu_spawn_cpu_nocb_kthread(). Therefore, inline the former into
the latter, saving a few lines of code.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Allow the rcu_nocbs kernel parameter to be specified just by itself,
without specifying any CPUs. This allows systems administrators to use
"rcu_nocbs" to specify that none of the CPUs are to be offloaded at boot
time, but than any of them may be offloaded at runtime via cpusets.
In contrast, if the "rcu_nocbs" or "nohz_full" kernel parameters are not
specified at all, then not only are none of the CPUs offloaded at boot,
none of them can be offloaded at runtime, either.
While in the area, modernize the description of the "rcuo" kthreads'
naming scheme.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In order to be able to (de-)offload any CPU using cpusets in the future,
create the NOCB data structures for all possible CPUs. For now this is
done only as long as the "rcu_nocbs=" or "nohz_full=" kernel parameters
are passed to avoid the unnecessary overhead for most users.
Note that the rcuog and rcuoc kthreads are not created until at least
one of the corresponding CPUs comes online. This approach avoids the
creation of excess kthreads when firmware lies about the number of CPUs
present on the system.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently cpumask_available() is used to prevent from unwanted NOCB
initialization. However if neither "rcu_nocbs=" nor "nohz_full="
parameters are passed to a kernel built with CONFIG_CPUMASK_OFFSTACK=n,
the initialization path is still taken, running through all sorts of
needless operations and iterations on an empty cpumask.
Fix this by relying on a real initialization state instead. This also
optimizes kthread creation, preventing needless iteration over all online
CPUs when the kernel is booted without any offloaded CPUs.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In order to be able to toggle the offloaded state from cpusets, a nocb
kthread will need to be created for all possible CPUs whenever either
of the "rcu_nocbs=" or "nohz_full=" parameters are specified.
Therefore, the nocb_cb_wait() kthread must be prepared to start running
on a de-offloaded rdp. To accomplish this, simply move the sleeping
condition to the beginning of the nocb_cb_wait() function, which prevents
this kthread from attempting to invoke callbacks before the corresponding
CPU is offloaded.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The nocb_gp_wait() function iterates over all CPUs in its group,
including even those CPUs that have been de-offloaded. This is of
course suboptimal, especially if none of the CPUs within the group are
currently offloaded. This will become even more of a problem once a
nocb kthread is created for all possible CPUs.
Therefore use a standard double linked list to link all the offloaded
rcu_data structures and safely add or delete these structure as we
offload or de-offload them, respectively.
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
By default, when lock contention is encountered, the RCU Tasks flavors
of RCU switch to using per-CPU queueing. However, if the callback
flood ends, per-CPU queueing continues to be used, which introduces
significant additional overhead, especially for callback invocation,
which fans out a series of workqueue handlers.
This commit therefore switches back to single-queue operation if at the
beginning of a grace period there are very few callbacks. The definition
of "very few" is set by the rcupdate.rcu_task_collapse_lim module
parameter, which defaults to 10. This switch happens in two phases,
with the first phase causing future callbacks to be enqueued on CPU 0's
queue, but with all queues continuing to be checked for grace periods
and callback invocation. The second phase checks to see if an RCU grace
period has elapsed and if all remaining RCU-Tasks callbacks are queued
on CPU 0. If so, only CPU 0 is checked for future grace periods and
callback operation.
Of course, the return of contention anywhere during this process will
result in returning to per-CPU callback queueing.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Decreasing the number of callback queues is a bit tricky because it
is necessary to handle callbacks that were queued before the number of
queues decreased, but which were not ready to invoke until afterwards.
This commit takes a first step in this direction by maintaining a separate
->percpu_dequeue_lim to control callback dequeueing, in addition to the
existing ->percpu_enqueue_lim which now controls only enqueueing.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcupdate.rcu_task_enqueue_lim module parameter allows system
administrators to tune the number of callback queues used by the RCU
Tasks flavors. However if callback storms are infrequent, it would
be better to operate with a single queue on a given system unless and
until that system actually needed more queues. Systems not needing
more queues can then avoid the overhead of checking the extra queues
and especially avoid the overhead of fanning workqueue handlers out to
all CPUs to invoke callbacks.
This commit therefore switches to using all the CPUs' callback queues if
call_rcu_tasks_generic() encounters too much lock contention. The amount
of lock contention to tolerate defaults to 100 contended lock acquisitions
per jiffy, and can be adjusted using the new rcupdate.rcu_task_contend_lim
module parameter.
Such switching is undertaken only if the rcupdate.rcu_task_enqueue_lim
module parameter is negative, which is its default value (-1).
This allows savvy systems administrators to set the number of queues
to some known good value and to not have to worry about the kernel doing
any second guessing.
[ paulmck: Apply feedback from Guillaume Tucker and kernelci. ]
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the caller of of call_rcu_tasks(), call_rcu_tasks_rude(),
or call_rcu_tasks_trace() holds a raw spinlock, and then if
call_rcu_tasks_generic() determines that the grace-period kthread must
be awakened, then the wakeup might acquire a normal spinlock while a
raw spinlock is held. This results in lockdep splats when the
kernel is built with CONFIG_PROVE_RAW_LOCK_NESTING=y.
This commit therefore defers the wakeup using irq_work_queue().
It would be nice to directly invoke wakeup when a raw spinlock is not
held, but there is currently no way to check for this in all kernels.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit converts the unconditional raw_spin_lock_rcu_node() lock
acquisition in call_rcu_tasks_generic() to a trylock followed by an
unconditional acquisition if the trylock fails. If the trylock fails,
the failure is counted, but the count is reset to zero on each new jiffy.
This statistic will be used to determine when to move from a single
callback queue to per-CPU callback queues.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a rcupdate.rcu_task_enqueue_lim module parameter that
sets the initial number of callback queues to use for the RCU Tasks
family of RCU implementations. This parameter allows testing of various
fanout values.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_barrier_tasks(), rcu_barrier_tasks_rude(),
and rcu_barrier_tasks_trace() simply invoke the corresponding
synchronize_rcu_tasks*() function. This works because there is only
one callback queue.
However, there will soon be multiple callback queues. This commit
therefore scans the queues currently in use, entraining a callback on
each non-empty queue. Sequence numbers and reference counts are used
to synchronize this process in a manner similar to the approach taken
by rcu_barrier().
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If there is a flood of callbacks, it is necessary to put multiple
CPUs to work invoking those callbacks. This commit therefore uses a
workqueue-flooding approach to parallelize RCU Tasks callback execution.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a rcu_tasks_invoke_cbs() function that invokes all
ready callbacks on all of the per-CPU lists that are currently in use.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a rcu_tasks_need_gpcb() function that returns an
indication of whether another grace period is required, and if no grace
period is required, whether there are callbacks that need to be invoked.
The function scans all per-CPU lists currently in use.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a ->percpu_enqueue_lim field to the rcu_tasks structure.
This field contains two to the power of the ->percpu_enqueue_shift
field, easing construction of iterators over the per-CPU queues that
might contain RCU Tasks callbacks. Such iterators will be introduced
in later commits.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit renames the rcu_tasks_percpu structure's ->cbs_pcpu_lock
to ->lock and then uses spin_lock_rcu_node() and friends to acquire and
release this lock, preparing for upcoming commits that will spread the
grace-period process across multiple CPUs and kthreads.
[ paulmck: Apply feedback from kernel test robot. ]
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
With the addition of multiple callback-flood kthreads, the maximum number
of callbacks from any one of those kthreads is reported in the rcutorture
run summary. This commit changes this to report the sum of each kthread's
maximum number of callbacks in a given callback-flooding episode.
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The RCU tasks flavors of RCU now need concurrent callback flooding to
test their ability to switch between single-queue mode and per-CPU queue
mode, but their lack of heavy-duty forward-progress features rules out
the use of rcutorture's current callback-flooding code. This commit
therefore provides the ability to limit the intensity of the callback
floods using a new ->cbflood_max field in the rcu_operations structure.
When this field is zero, there is no limit, otherwise, each callback-flood
kthread allocates at most ->cbflood_max callbacks.
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit converts the rcutorture.fwd_progress module parameter from
bool to int, so that it specifies the number of callback-flood kthreads.
Values less than zero specify one kthread per CPU, however, the number of
kthreads executing concurrently is limited to the number of online CPUs.
This commit also reverse the order of the need-resched and callback-flood
operations to cause the callback flooding to happen more nearly at the
same time.
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If we use the module stall_cpu option, we may get a soft lockup warning
in case we also don't pass the stall_cpu_block option.
Introduce the stall_no_softlockup option to avoid a soft lockup on
cpu stall even if we don't use the stall_cpu_block option.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Unconditionally log messages corresponding to errors.
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Unconditionally log messages corresponding to errors.
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Add '\n' to macros to flush message for each call.
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
An OOM is a serious error that should be logged even in non-verbose runs.
This commit therefore adds an unconditional SCALEOUT_ERRSTRING() macro
and uses it instead of VERBOSE_SCALEOUT_ERRSTRING() when reporting an OOM.
[ paulmck: Drop do-while from SCALEOUT_ERRSTRING() due to only single statement. ]
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit moves from a bespoke head and tail pointer in the
rcu_tasks_percpu structure to an rcu_segcblist structure, thus allowing
associating the grace-period sequence number with groups of callbacks.
This in turn will allow callbacks to be invoked independently on
different CPUs.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit moves the rcu_tasks structure's ->n_gps grace-period-counter
field to a ->task_gp_seq grce-period sequence number in order to enable
use of the rcu_segcblist structure for the callback lists. This in turn
permits CPUs to lag behind the RCU Tasks grace-period sequence number
without suffering long-term slowdowns in callback invocation.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit introduces a ->percpu_enqueue_shift field to the rcu_tasks
structure, and uses it to shift down the CPU number in order to
select a rcu_tasks_percpu structure. This field is currently set to a
sufficiently large shift count to always select the CPU-0 instance of
the rcu_tasks_percpu structure, and later commits will adjust this.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, RCU Tasks Trace (as well as the other two flavors of RCU Tasks)
use a single global callback list. This works well and is simple, but
expected changes in workload will cause this list to become a bottleneck.
This commit therefore creates per-CPU callback lists for the various
flavors of RCU Tasks, but continues queueing on a single list, namely
that of CPU 0. Later commits will dynamically vary the number of lists
in use to accommodate dynamic changes in workload.
Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Tested-by: kernel test robot <beibei.si@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcu_core() tries to ensure that its self-invocation in case of callbacks
overload only happen in softirq/rcuc mode. Indeed it doesn't make sense
to trigger local RCU core from nocb_cb kthread since it can execute
on a CPU different from the target rdp. Also in case of overload, the
nocb_cb kthread simply iterates a new loop of callbacks processing.
However the "offloaded" check that aims at preventing misplaced
rcu_core() invocations is wrong. First of all that state is volatile
and second: softirq/rcuc can execute while the target rdp is offloaded.
As a result rcu_core() can be invoked on the wrong CPU while in the
process of (de-)offloading.
Fix that with moving the rcu_core() self-invocation to rcu_core() itself,
irrespective of the rdp offloaded state.
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Time limit only makes sense when callbacks are serviced in softirq mode
because:
_ In case we need to get back to the scheduler,
cond_resched_tasks_rcu_qs() is called after each callback.
_ In case some other softirq vector needs the CPU, the call to
local_bh_enable() before cond_resched_tasks_rcu_qs() takes care about
them via a call to do_softirq().
Therefore, make sure the time limit only applies to softirq mode.
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The callbacks processing time limit makes sure we are not exceeding a
given amount of time executing the queue.
However its "continue" clause bypasses the cond_resched() call on
rcuc and NOCB kthreads, delaying it until we reach the limit, which can
be very long...
Make sure the scheduler has a higher priority than the time limit.
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The current condition to limit the number of callbacks executed in a
row checks the offloaded state of the rdp. Not only is it volatile
but it is also misleading: the rcu_core() may well be executing
callbacks concurrently with NOCB kthreads, and the offloaded state
would then be verified on both cases. As a result the limit would
spuriously not apply anymore on softirq while in the middle of
(de-)offloading process.
Fix and clarify the condition with those constraints in mind:
_ If callbacks are processed either by rcuc or NOCB kthread, the call
to cond_resched_tasks_rcu_qs() is enough to take care of the overload.
_ If instead callbacks are processed by softirqs:
* If need_resched(), exit the callbacks processing
* Otherwise if CPU is idle we can continue
* Otherwise exit because a softirq shouldn't interrupt a task for too
long nor deprive other pending softirq vectors of the CPU.
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Instead of hardcoding IRQ save and nocb lock, use the consolidated
API (and fix a comment as per Valentin Schneider's suggestion).
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
It's not entirely obvious why rdp->qlen_last_fqs_check is updated before
processing the queue only on offloaded rdp. There can be different
effect to that, either in favour of triggering the force quiescent state
path or not. For example:
1) If the number of callbacks has decreased since the last
rdp->qlen_last_fqs_check update (because we recently called
rcu_do_batch() and we executed below qhimark callbacks) and the number
of processed callbacks on a subsequent do_batch() arranges for
exceeding qhimark on non-offloaded but not on offloaded setup, then we
may spare a later run to the force quiescent state
slow path on __call_rcu_nocb_wake(), as compared to the non-offloaded
counterpart scenario.
Here is such an offloaded scenario instance:
qhimark = 1000
rdp->last_qlen_last_fqs_check = 3000
rcu_segcblist_n_cbs(rdp) = 2000
rcu_do_batch() {
if (offloaded)
rdp->last_qlen_fqs_check = rcu_segcblist_n_cbs(rdp) // 2000
// run 1000 callback
rcu_segcblist_n_cbs(rdp) = 1000
// Not updating rdp->qlen_last_fqs_check
if (count < rdp->qlen_last_fqs_check - qhimark)
rdp->qlen_last_fqs_check = count;
}
call_rcu() * 1001 {
__call_rcu_nocb_wake() {
// not taking the fqs slowpath:
// rcu_segcblist_n_cbs(rdp) == 2001
// rdp->qlen_last_fqs_check == 2000
// qhimark == 1000
if (len > rdp->qlen_last_fqs_check + qhimark)
...
}
In the case of a non-offloaded scenario, rdp->qlen_last_fqs_check
would be 1000 and the fqs slowpath would have executed.
2) If the number of callbacks has increased since the last
rdp->qlen_last_fqs_check update (because we recently queued below
qhimark callbacks) and the number of callbacks executed in rcu_do_batch()
doesn't exceed qhimark for either offloaded or non-offloaded setup,
then it's possible that the offloaded scenario later run the force
quiescent state slow path on __call_rcu_nocb_wake() while the
non-offloaded doesn't.
qhimark = 1000
rdp->last_qlen_last_fqs_check = 3000
rcu_segcblist_n_cbs(rdp) = 2000
rcu_do_batch() {
if (offloaded)
rdp->last_qlen_last_fqs_check = rcu_segcblist_n_cbs(rdp) // 2000
// run 100 callbacks
// concurrent queued 100
rcu_segcblist_n_cbs(rdp) = 2000
// Not updating rdp->qlen_last_fqs_check
if (count < rdp->qlen_last_fqs_check - qhimark)
rdp->qlen_last_fqs_check = count;
}
call_rcu() * 1001 {
__call_rcu_nocb_wake() {
// Taking the fqs slowpath:
// rcu_segcblist_n_cbs(rdp) == 3001
// rdp->qlen_last_fqs_check == 2000
// qhimark == 1000
if (len > rdp->qlen_last_fqs_check + qhimark)
...
}
In the case of a non-offloaded scenario, rdp->qlen_last_fqs_check
would be 3000 and the fqs slowpath would have executed.
The reason for updating rdp->qlen_last_fqs_check when invoking callbacks
for offloaded CPUs is that there is usually no point in waking up either
the rcuog or rcuoc kthreads while in this state. After all, both threads
are prohibited from indefinite sleeps.
The exception is when some huge number of callbacks are enqueued while
rcu_do_batch() is in the midst of invoking, in which case interrupting
the rcuog kthread's timed sleep might get more callbacks set up for the
next grace period.
Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Original-patch-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When callbacks are offloaded, the NOCB kthreads handle the callbacks
progression on behalf of rcu_core().
However during the (de-)offloading process, the kthread may not be
entirely up to the task. As a result some callbacks grace period
sequence number may remain stale for a while because rcu_core() won't
take care of them either.
Fix this with forcing callbacks acceleration from rcu_core() as long
as the offloading process isn't complete.
Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
While reporting a quiescent state for a given CPU, rcu_core() takes
advantage of the freshly loaded grace period sequence number and the
locked rnp to accelerate the callbacks whose sequence number have been
assigned a stale value.
This action is only necessary when the rdp isn't offloaded, otherwise
the NOCB kthreads already take care of the callbacks progression.
However the check for the offloaded state is volatile because it is
performed outside the IRQs disabled section. It's possible for the
offloading process to preempt rcu_core() at that point on PREEMPT_RT.
This is dangerous because rcu_core() may end up accelerating callbacks
concurrently with NOCB kthreads without appropriate locking.
Fix this with moving the offloaded check inside the rnp locking section.
Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
On PREEMPT_RT, if rcu_core() is preempted by the de-offloading process,
some work, such as callbacks acceleration and invocation, may be left
unattended due to the volatile checks on the offloaded state.
In the worst case this work is postponed until the next rcu_pending()
check that can take a jiffy to reach, which can be a problem in case
of callbacks flooding.
Solve that with invoking rcu_core() early in the de-offloading process.
This way any work dismissed by an ongoing rcu_core() call fooled by
a preempting deoffloading process will be caught up by a nearby future
recall to rcu_core(), this time fully aware of the de-offloading state.
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently SEGCBLIST_SOFTIRQ_ONLY is a bit of an exception among the
segcblist flags because it is an exclusive state that doesn't mix up
with the other flags. Remove it in favour of:
_ A flag specifying that rcu_core() needs to perform callbacks execution
and acceleration
and
_ A flag specifying we want the nocb lock to be held in any needed
circumstances
This clarifies the code and is more flexible: It allows to have a state
where rcu_core() runs with locking while offloading hasn't started yet.
This is a necessary step to prepare for triggering rcu_core() at the
very beginning of the de-offloading process so that rcu_core() won't
dismiss work while being preempted by the de-offloading process, at
least not without a pending subsequent rcu_core() that will quickly
catch up.
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcu_nocb_lock_irqsave() can be preempted between the call to
rcu_segcblist_is_offloaded() and the actual locking. This matters now
that rcu_core() is preemptible on PREEMPT_RT and the (de-)offloading
process can interrupt the softirq or the rcuc kthread.
As a result we may locklessly call into code that requires nocb locking.
In practice this is a problem while we accelerate callbacks on rcu_core().
Simply disabling interrupts before (instead of after) checking the NOCB
offload state fixes the issue.
Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_advance_cbs_nowake() checks that a grace period is in
progress, however, that grace period could end just after the check.
This commit rechecks that a grace period is still in progress while
holding the rcu_node structure's lock. The grace period cannot end while
the current CPU's rcu_node structure's ->lock is held, thus avoiding
false positives from the WARN_ON_ONCE().
As Daniel Vacek noted, it is not necessary for the rcu_node structure
to have a CPU that has not yet passed through its quiescent state.
Tested-by: Guillaume Morin <guillaume@morinfr.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Expedited RCU grace periods invoke sync_rcu_exp_select_node_cpus(), which
takes two passes over the leaf rcu_node structure's CPUs. The first
pass gathers up the current CPU and CPUs that are in dynticks idle mode.
The workqueue will report a quiescent state on their behalf later.
The second pass sends IPIs to the rest of the CPUs, but excludes the
current CPU, incorrectly assuming it has been included in the first
pass's list of CPUs.
Unfortunately the current CPU may have changed between the first and
second pass, due to the fact that the various rcu_node structures'
->lock fields have been dropped, thus momentarily enabling preemption.
This means that if the second pass's CPU was not on the first pass's
list, it will be ignored completely. There will be no IPI sent to
it, and there will be no reporting of quiescent states on its behalf.
Unfortunately, the expedited grace period will nevertheless be waiting
for that CPU to report a quiescent state, but with that CPU having no
reason to believe that such a report is needed.
The result will be an expedited grace period stall.
Fix this by no longer excluding the current CPU from consideration during
the second pass.
Fixes: b9ad4d6ed1 ("rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus()")
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In non-preemptible kernels, an unfortunately timed expedited grace period
can result in the rcu_exp_handler() IPI handler setting the rcu_data
structure's cpu_no_qs.b.exp field just as the target CPU enters idle.
There are situations in which this field will not be checked until after
that CPU exits idle. The resulting grace-period latency does not qualify
as "expedited".
This commit therefore checks this field upon non-preemptible idle entry in
the rcu_preempt_deferred_qs() function. It also qualifies the rcu_core()
preempt_count() check with IS_ENABLED(CONFIG_PREEMPT_COUNT) to prevent
false-positive quiescent states from count-free kernels.
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If an RCU expedited grace period starts just when a CPU is in the process
of going offline, so that the outgoing CPU has completed its pass through
stop-machine but has not yet completed its final dive into the idle loop,
RCU will attempt to enable that CPU's scheduling-clock tick via a call
to tick_dep_set_cpu(). For this to happen, that CPU has to have been
online when the expedited grace period completed its CPU-selection phase.
This is pointless: The outgoing CPU has interrupts disabled, so it cannot
take a scheduling-clock tick anyway. In addition, the tick_dep_set_cpu()
function's eventual call to irq_work_queue_on() will splat as follows:
smpboot: CPU 1 is now offline
WARNING: CPU: 6 PID: 124 at kernel/irq_work.c:95
+irq_work_queue_on+0x57/0x60
Modules linked in:
CPU: 6 PID: 124 Comm: kworker/6:2 Not tainted 5.15.0-rc1+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
+rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
Workqueue: rcu_gp wait_rcu_exp_gp
RIP: 0010:irq_work_queue_on+0x57/0x60
Code: 8b 05 1d c7 ea 62 a9 00 00 f0 00 75 21 4c 89 ce 44 89 c7 e8
+9b 37 fa ff ba 01 00 00 00 89 d0 c3 4c 89 cf e8 3b ff ff ff eb ee <0f> 0b eb b7
+0f 0b eb db 90 48 c7 c0 98 2a 02 00 65 48 03 05 91
6f
RSP: 0000:ffffb12cc038fe48 EFLAGS: 00010282
RAX: 0000000000000001 RBX: 0000000000005208 RCX: 0000000000000020
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9ad01f45a680
RBP: 000000000004c990 R08: 0000000000000001 R09: ffff9ad01f45a680
R10: ffffb12cc0317db0 R11: 0000000000000001 R12: 00000000fffecee8
R13: 0000000000000001 R14: 0000000000026980 R15: ffffffff9e53ae00
FS: 0000000000000000(0000) GS:ffff9ad01f580000(0000)
+knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000de0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
tick_nohz_dep_set_cpu+0x59/0x70
rcu_exp_wait_wake+0x54e/0x870
? sync_rcu_exp_select_cpus+0x1fc/0x390
process_one_work+0x1ef/0x3c0
? process_one_work+0x3c0/0x3c0
worker_thread+0x28/0x3c0
? process_one_work+0x3c0/0x3c0
kthread+0x115/0x140
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
---[ end trace c5bf75eb6aa80bc6 ]---
This commit therefore avoids invoking tick_dep_set_cpu() on offlined
CPUs to limit both futility and false-positive splats.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The sync_sched_exp_online_cleanup() is called from rcutree_online_cpu(),
which can be invoked with interrupts enabled. This means that
the ->cpu_no_qs.b.exp field is subject to data races from the
rcu_exp_handler() IPI handler, so this commit marks the load from
that field.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Having two fields for the same purpose with subtle differences on
different RCU flavours is confusing, especially when both fields always
exist on both RCU flavours.
Fortunately, it is now safe for preemptible RCU to rely on the rcu_data
structure's ->cpu_no_qs.b.exp field, just like non-preemptible RCU.
This commit therefore removes the ad-hoc ->exp_deferred_qs field.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
On non-preemptible RCU, move clearing of the rcu_data structure's
->cpu_no_qs.b.exp filed to the actual expedited quiescent state report
function, matching hw preemptible RCU handles the ->exp_deferred_qs field.
This prepares for removing ->exp_deferred_qs in favor of ->cpu_no_qs.b.exp
for both preemptible and non-preemptible RCU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Preemptible RCU does not use the rcu_data structure's ->cpu_no_qs.b.exp,
instead using a separate ->exp_deferred_qs field to record the need for
an expedited quiescent state.
In fact ->cpu_no_qs.b.exp should never be set in preemptible RCU because
preemptible RCU's expedited grace periods use other mechanisms to record
quiescent states.
This commit therefore removes the implicit rcu_qs() reference to
->cpu_no_qs.b.exp in favor of a direct reference to ->cpu_no_qs.b.norm.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
0Day/LKP observed that the refscale results fail to complete when larger
values of nrun (such as 300) are specified. The problem is that printk()
can accept at most a 1024-byte buffer. This commit therefore prints
the buffer whenever its length exceeds 800 bytes.
CC: Philip Li <philip.li@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There is only the one OOM error case in main_func(), so this commit
eliminates the errexit local variable in favor of a branch to cleanup
code.
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Because Tiny srcu_read_unlock() directly calls swake_up_one(), lockdep
complains when a pi lock is held across that srcu_read_unlock().
Although this is a lockdep false positive (there is no other CPU to
complete the deadlock cycle), lockdep is what it is at the moment.
This commit therefore prevents rcutorture from holding pi lock across
a Tiny srcu_read_unlock().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, nested readers occur only when a timer handler interrupts a
reader. This is rare, and is thus insufficient testing of the transition
between nesting levels. This commit therefore causes rcutorture nested
readers to be the rule rather than the exception.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCUTORTURE_RDR_MASK is currently not the bit indicated by
RCUTORTURE_RDR_SHIFT, but is instead all the bits less significant than
that one. This is an accident waiting to happen, so this commit makes
RCUTORTURE_RDR_MASK be that one bit and adjusts uses accordingly.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the check_all_holdout_tasks_trace() function removes all tasks
marked with ->trc_reader_checked from the holdout list, including those
with IPIs pending. This means that the IPI handler might arrive at
a task that has already been removed from the list, which is at best
an accident waiting to happen.
This commit therefore avoids removing tasks with IPIs pending from
the holdout list. This in turn means that the "if" condition in the
for_each_online_cpu() loop in rcu_tasks_trace_postgp() should always
evaluate to false, so a WARN_ON_ONCE() is added to check that.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tiny SRCU readers can appear at task level, but also in interrupt and
softirq handlers. Because Tiny SRCU is selected only in kernels built
with CONFIG_SMP=n and CONFIG_PREEMPTION=n, it is not possible for a grace
period to start while there is a non-task-level SRCU reader executing.
This means that it does not make sense for __srcu_read_unlock() to awaken
the Tiny SRCU grace period, because that can only happen when the grace
period is waiting for one value of ->srcu_idx and __srcu_read_unlock()
is ending the last reader for some other value of ->srcu_idx. After all,
any such wakeup will be redundant.
Worse yet, in some cases, such wakeups generate lockdep splats:
======================================================
WARNING: possible circular locking dependency detected
5.15.0-rc1+ #3758 Not tainted
------------------------------------------------------
rcu_torture_rea/53 is trying to acquire lock:
ffffffff9514e6a8 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}, at:
xa/0x30
but task is already holding lock:
ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
_extend+0x370/0x400
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x2f/0x50
try_to_wake_up+0x50/0x580
swake_up_locked.part.7+0xe/0x30
swake_up_one+0x22/0x30
rcutorture_one_extend+0x1b6/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_timer+0x1a/0x70
call_timer_fn+0xa6/0x230
run_timer_softirq+0x493/0x4c0
__do_softirq+0xc0/0x371
irq_exit+0x73/0x90
sysvec_apic_timer_interrupt+0x63/0x80
asm_sysvec_apic_timer_interrupt+0x12/0x20
default_idle+0xb/0x10
default_idle_call+0x5e/0x170
do_idle+0x18a/0x1f0
cpu_startup_entry+0xa/0x10
start_kernel+0x678/0x69f
secondary_startup_64_no_verify+0xc2/0xcb
-> #0 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}:
__lock_acquire+0x130c/0x2440
lock_acquire+0xc2/0x270
_raw_spin_lock_irqsave+0x2f/0x50
swake_up_one+0xa/0x30
rcutorture_one_extend+0x387/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_reader+0xac/0x200
kthread+0x12d/0x150
ret_from_fork+0x22/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&p->pi_lock);
lock(srcu_ctl.srcu_wq.lock);
lock(&p->pi_lock);
lock(srcu_ctl.srcu_wq.lock);
*** DEADLOCK ***
1 lock held by rcu_torture_rea/53:
#0: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
_extend+0x370/0x400
stack backtrace:
CPU: 0 PID: 53 Comm: rcu_torture_rea Not tainted 5.15.0-rc1+
Hardware name: Red Hat KVM/RHEL-AV, BIOS
e_el8.5.0+746+bbd5d70c 04/01/2014
Call Trace:
check_noncircular+0xfe/0x110
? find_held_lock+0x2d/0x90
__lock_acquire+0x130c/0x2440
lock_acquire+0xc2/0x270
? swake_up_one+0xa/0x30
? find_held_lock+0x72/0x90
_raw_spin_lock_irqsave+0x2f/0x50
? swake_up_one+0xa/0x30
swake_up_one+0xa/0x30
rcutorture_one_extend+0x387/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_reader+0xac/0x200
? rcutorture_oom_notify+0xf0/0xf0
? __kthread_parkme+0x61/0x90
? rcu_torture_one_read+0x5d0/0x5d0
kthread+0x12d/0x150
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
This is a false positive because there is only one CPU, and both locks
are raw (non-preemptible) spinlocks. However, it is worthwhile getting
rid of the redundant wakeup, which has the side effect of breaking
the theoretical deadlock cycle. This commit therefore eliminates the
redundant wakeups.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT,
which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however,
it is not even possible to use either GFP_ATOMIC nor GFP_NOWAIT in certain
non-preemptive contexts/RT kernel including raw_spin_locks (see gfp.h and ab00db216c).
Fix it by instructing stackdepot to not expand stack storage via alloc_pages()
in case it runs out by using kasan_record_aux_stack_noalloc().
Jianwei Hu reported:
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
softirqs last enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 6 PID: 15319 Comm: python3 Tainted: G W O 5.15-rc7-preempt-rt #1
Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018
Call Trace:
show_stack+0x52/0x58
dump_stack+0xa1/0xd6
___might_sleep.cold+0x11c/0x12d
rt_spin_lock+0x3f/0xc0
rmqueue+0x100/0x1460
rmqueue+0x100/0x1460
mark_usage+0x1a0/0x1a0
ftrace_graph_ret_addr+0x2a/0xb0
rmqueue_pcplist.constprop.0+0x6a0/0x6a0
__kasan_check_read+0x11/0x20
__zone_watermark_ok+0x114/0x270
get_page_from_freelist+0x148/0x630
is_module_text_address+0x32/0xa0
__alloc_pages_nodemask+0x2f6/0x790
__alloc_pages_slowpath.constprop.0+0x12d0/0x12d0
create_prof_cpu_mask+0x30/0x30
alloc_pages_current+0xb1/0x150
stack_depot_save+0x39f/0x490
kasan_save_stack+0x42/0x50
kasan_save_stack+0x23/0x50
kasan_record_aux_stack+0xa9/0xc0
__call_rcu+0xff/0x9c0
call_rcu+0xe/0x10
put_object+0x53/0x70
__delete_object+0x7b/0x90
kmemleak_free+0x46/0x70
slab_free_freelist_hook+0xb4/0x160
kfree+0xe5/0x420
kfree_const+0x17/0x30
kobject_cleanup+0xaa/0x230
kobject_put+0x76/0x90
netdev_queue_update_kobjects+0x17d/0x1f0
... ...
ksys_write+0xd9/0x180
__x64_sys_write+0x42/0x50
do_syscall_64+0x38/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Links: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642
Fixes: 84109ab585 ("rcu: Record kvfree_call_rcu() call stack for KASAN")
Fixes: 26e760c9a7 ("rcu: kasan: record and print call_rcu() call stack")
Reported-by: Jianwei Hu <jianwei.hu@windriver.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the boost kthreads are created on systems with nohz_full CPUs,
the cpus_allowed_ptr is set to housekeeping_cpumask(HK_FLAG_KTHREAD).
However, when the rcu_boost_kthread_setaffinity() is called, the original
affinity will be changed and these kthreads can subsequently run on
nohz_full CPUs. This commit makes rcu_boost_kthread_setaffinity()
restrict these boost kthreads to housekeeping CPUs.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit cleans up some comments and code in kernel/rcu/tree_plugin.h.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit replaces the obsolete and ambiguous macro in_irq() with its
shiny new in_hardirq() equivalent.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that RCU_FAST_NO_HZ is no more, there is but one implementation of
the rcu_needs_cpu() function. This commit therefore moves this function
from kernel/rcu/tree_plugin.c to kernel/rcu/tree.c.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve
systems with RCU callbacks offloaded. In this situation, all that this
Kconfig option does is slow down idle entry/exit with an additional
allways-taken early exit. If this is the only use case, then this
Kconfig option nothing but an attractive nuisance that needs to go away.
This commit therefore removes the RCU_FAST_NO_HZ Kconfig option.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This pull request contains the following branches:
fixes.2021.10.07a: Miscellaneous fixes.
scftorture.2021.09.16a: smp_call_function torture-test updates, most
notably better checking of module parameters.
tasks.2021.09.15a: Tasks-trace RCU updates that fix a number of rare
but important race-condition bugs.
torture.2021.09.13b: Other torture-test updates, most notably
better checking of module parameters. In addition, rcutorture
may now be run on CONFIG_PREEMPT_RT kernels.
torturescript.2021.09.16a: Torture-test scripting updates, most notably
specifying the new CONFIG_KCSAN_STRICT kconfig option rather
than maintaining an ever-changing list of individual KCSAN
kconfig options.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmGAVMMTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jGJBD/9ld6USOpedBLAbTYVMQYvIKoSqqDIG
74ZFhKvZ5I6Y8OZAGxXjb5U06rh4V2brlTN7IJ7XLEA1t401ENffsGeQSCxEmpEf
PqQN04dbmVvaWjD4jiLZCcl3oDp+w1gIKwmX6wh0Weogr3KZWu5aNvD5tl9qIz4a
uPC1JqTBxf7WDrLhqNxG5N4MXs27+KvukCd9wftk3NTzRJ9tyLM/YNGOVArM8rW2
QpEh8n6veB5dEoXBxmRHzuxYHN1k0Fhkbm3irMjcI0T5wj8TDod89zbg9mdFXMIj
AjZ9CGpIBa4frThdu654ZNuEQHDCsPWtMi925xNOWxh5lkPGjeWnwYpcRrwfI2pj
op0xVlur+Nam5CT/AJNT9+KogpZthAWXvwqCs5GbYNSU30Rlw99bw1vyAsJUD+af
Mv08/z4o7Kuhr4cw2vkd2UfF9zuIQsJ1jWCIjMxfj4ctBnIpedrEnEISp8Y61fWk
w9vXgCRhZCSkxoURoNss+nAUsiePUafptsvqKLu6Z53ufPA5yL0rVS778xq8vurP
Xyd34TVlQ94ydZDC5pkSNpri1HGV1U7pztFwey5GloE66iV+7TSQCfMhzLd4CM0K
wW96wimHrDtIxD6LedCZOHLHkS9AJd7F9uSoNodKspTH0tJowQztrzPW1eZifDE3
iJP8xcJ+vL67Og==
=nmaP
-----END PGP SIGNATURE-----
Merge tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Miscellaneous fixes
- Torture-test updates for smp_call_function(), most notably improved
checking of module parameters.
- Tasks-trace RCU updates that fix a number of rare but important
race-condition bugs.
- Other torture-test updates, most notably better checking of module
parameters. In addition, rcutorture may once again be run on
CONFIG_PREEMPT_RT kernels.
- Torture-test scripting updates, most notably specifying the new
CONFIG_KCSAN_STRICT kconfig option rather than maintaining an
ever-changing list of individual KCSAN kconfig options.
* tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (46 commits)
rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr
rcu: Always inline rcu_dynticks_task*_{enter,exit}()
torture: Make kvm-remote.sh print size of downloaded tarball
torture: Allot 1G of memory for scftorture runs
tools/rcu: Add an extract-stall script
scftorture: Warn on individual scf_torture_init() error conditions
scftorture: Count reschedule IPIs
scftorture: Account for weight_resched when checking for all zeroes
scftorture: Shut down if nonsensical arguments given
scftorture: Allow zero weight to exclude an smp_call_function*() category
rcu: Avoid unneeded function call in rcu_read_unlock()
rcu-tasks: Update comments to cond_resched_tasks_rcu_qs()
rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace
rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives
rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace
rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace
rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment
rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
rcu-tasks: Fix s/instruction/instructions/ typo in comment
...
- Revert the printk format based wchan() symbol resolution as it can leak
the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset and
__sched_setscheduler() introduced a new lock dependency which is now
triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/OUkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoR/5D/9ikdGNpKg9osNqJ3GjAmxsK6kVkB29
iFe2k8pIpWDToWQf/wQRGih4Yj3Cl49QSnZcPIibh2/12EB1qrrW6iSPJkInz8Ec
/1LS5/Vewn2OyoxyXZjdvGC5gTXEodSbIazASvX7nvdMeI4gsAsL5etzrMJirT/t
aymqvr7zovvywrwMTQJrGjUMo9l4ewE8tafMNNhRu1BHU1U4ojM9yvThyRAAcmp7
3Xy49A+Yq3IgrvYI4u8FMK5Zh08KaxSFjiLhePGm/bF+wSfYmWop2TP1jY05W2Uo
ti8hfbJMUoFRYuMxAiEldkItnc0wV4M9PtWZZ/x+B71bs65Y4Zjt9cW+rxJv2+m1
vzV31EsQwGnOti072dzWN4c/cZqngVXAjaNtErvDwJUr+Tw1ayv9KUvuodMQqZY6
mu68bFUO2kV9EMe1CBOv51Uy1RGHyLj3rlNqrkw+Xp5ISE9Ad2vhUEiRp5bQx5Ci
V/XFhGZkGUluh0vccrdFlNYZwhj8cZEzkOPCnPSeZ+bq8SyZE6xuHH/lTP1CJCOy
s800rW1huM+kgV+zRN8adDkGXibAk9N3RtVGnQXmuEy8gB9LZmQg+JeM2wsc9B+6
i0gdqZnsjNAfoK+BBAG4holxptSL8/eOJsFH8ZNIoxQ+iqooyPx9tFX7yXnRTBQj
d2qWG7UvoseT+g==
=fgtS
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x36: call to __kasan_check_read() leaves .noinstr.text section
noinstr cannot have atomic_*() functions in because they're explicitly
annotated, use arch_atomic_*().
Fixes: 2be57f7328 ("rcu: Weaken ->dynticks accesses and updates")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU managed to grow a few noinstr violations:
vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x0: call to rcu_dynticks_task_trace_enter() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0xe: call to rcu_dynticks_task_trace_exit() leaves .noinstr.text section
Fix them by adding __always_inline to the relevant trivial functions.
Also replace the noinstr with __always_inline for the existing
rcu_dynticks_task_*() functions since noinstr would force noinline
them, even when empty, which seems silly.
Fixes: 7d0c9c50c5 ("rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Give try_invoke_on_locked_down_task() a saner name and have it return
an int so that the caller might distinguish between different reasons
of failure.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.649944917@infradead.org
Comments in wait-type checks be improved by mentioning the
PREEPT_RT kernel configure option.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20210811025920.20751-1-zhouzhouyi@gmail.com
Since commit aa40c138cc ("rcu: Report QS for outermost PREEMPT=n
rcu_read_unlock() for strict GPs") the function rcu_read_unlock_strict()
is invoked by the inlined rcu_read_unlock() function. However,
rcu_read_unlock_strict() is an empty function in production kernels,
which are built with CONFIG_RCU_STRICT_GRACE_PERIOD=n.
There is a mention of rcu_read_unlock_strict() in the BPF verifier,
but this is in a deny-list, meaning that BPF does not care whether
rcu_read_unlock_strict() is ever called.
This commit therefore provides a slight performance improvement
by hoisting the check of CONFIG_RCU_STRICT_GRACE_PERIOD from
rcu_read_unlock_strict() into rcu_read_unlock(), thus avoiding the
pointless call to an empty function.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The cond_resched_rcu_qs() function no longer exists, despite being mentioned
several times in kernel/rcu/tasks.h. This commit therefore updates it to
the current cond_resched_tasks_rcu_qs().
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The trc_wait_for_one_reader() function is called at multiple stages
of trace rcu-tasks GP function, rcu_tasks_wait_gp():
- First, it is called as part of per task function -
rcu_tasks_trace_pertask(), for all non-idle tasks. As part of per task
processing, this function add the task in the holdout list and if the
task is currently running on a CPU, it sends IPI to the task's CPU.
The IPI handler takes action depending on whether task is in trace
rcu-tasks read side critical section or not:
- a. If the task is in trace rcu-tasks read side critical section
(t->trc_reader_nesting != 0), the IPI handler sets the task's
->trc_reader_special.b.need_qs, so that this task notifies exit
from its outermost read side critical section (by decrementing
trc_n_readers_need_end) to the GP handling function.
trc_wait_for_one_reader() also increments trc_n_readers_need_end,
so that the trace rcu-tasks GP handler function waits for this
task's read side exit notification. The IPI handler also sets
t->trc_reader_checked to true, and no further IPIs are sent for
this task, for this trace rcu-tasks grace period and this
task can be removed from holdout list.
- b. If the task is in the process of exiting its trace rcu-tasks
read side critical section, (t->trc_reader_nesting < 0), defer
this task's processing to future calls to trc_wait_for_one_reader().
- c. If task is not in rcu-task read side critical section,
t->trc_reader_nesting == 0, ->trc_reader_checked is set for this
task, so that this task is removed from holdout list.
- Second, trc_wait_for_one_reader() is called as part of post scan, in
function rcu_tasks_trace_postscan(), for all idle tasks.
- Third, in function check_all_holdout_tasks_trace(), this function is
called for each task in the holdout list, but only if there isn't
a pending IPI for the task (->trc_ipi_to_cpu == -1). This function
removed the task from holdout list, if IPI handler has completed the
required work, to ensure that the current trace rcu-tasks grace period
either waits for this task, or this task is not in a trace rcu-tasks
read side critical section.
Now, considering the scenario where smp_call_function_single() fails in
first case, inside rcu_tasks_trace_pertask(). In this case,
->trc_ipi_to_cpu is set to the current CPU for that task. This will
result in trc_wait_for_one_reader() getting skipped in third case,
inside check_all_holdout_tasks_trace(), for this task. This further
results in ->trc_reader_checked never getting set for this task,
and the task not getting removed from holdout list. This can cause
the current trace rcu-tasks grace period to stall.
Fix the above problem, by resetting ->trc_ipi_to_cpu to -1, on
smp_call_function_single() failure, so that future IPI calls can
be send for this task.
Note that all three of the trc_wait_for_one_reader() function's
callers (rcu_tasks_trace_pertask(), rcu_tasks_trace_postscan(),
check_all_holdout_tasks_trace()) hold cpu_read_lock(). This means
that smp_call_function_single() cannot race with CPU hotplug, and thus
should never fail. Therefore, also add a warning in order to report
any such failure in case smp_call_function_single() grows some other
reason for failure.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
call_rcu_tasks_trace() does have read-side primitives - rcu_read_lock_trace()
and rcu_read_unlock_trace(). Fix this information in the comments.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU tasks rude variant does not check whether the current
running context on a CPU is usermode. Read side critical section ends
on transition to usermode execution, by the virtue of usermode
execution being schedulable. Clarify this in comments for
call_rcu_tasks_rude() and synchronize_rcu_tasks_rude().
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Valid CPU numbers can be zero or greater, but the checks for
->trc_ipi_to_cpu and tick_nohz_full_cpu()'s argument are for strictly
greater than. This commit therefore corrects the check for no_hz_full
cpu in show_stalled_task_trace() so as to include cpu 0.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In check_all_holdout_tasks_trace(), firstreport is a pointer argument;
so, check the dereferenced value, instead of checking the pointer.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Early in debugging, it made some sense to differentiate the first
iteration from subsequent iterations, but now this just causes confusion.
This commit therefore moves the "set_tasks_gp_state(rtp, RTGS_WAIT_CBS)"
statement to the beginning of the "for" loop in rcu_tasks_kthread().
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The second argument of rcu_read_unlock_trace_special() is always zero.
When called from exit_tasks_rcu_finish_trace(), it is the constant
zero, and rcu_read_unlock_trace_special() doesn't get called from
rcu_read_unlock_trace() unless the value of local variable "nesting"
is zero because in that case the early return is taken instead.
This commit therefore removes the "nesting" argument from the
rcu_read_unlock_trace_special() function, substituting the constant
zero within that function. This commit also adds a WARN_ON_ONCE()
to rcu_read_lock_trace_held() in case non-zeroness some day appears.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, trc_inspect_reader() treats a task exiting its RCU Tasks
Trace read-side critical section the same as being within that critical
section. However, this can fail because that task might have already
checked its .need_qs field, which means that it might never decrement
the all-important trc_n_readers_need_end counter. Of course, for that
to happen, the task would need to never again execute an RCU Tasks Trace
read-side critical section, but this really could happen if the system's
last trampoline was removed. Note that exit from such a critical section
cannot be treated as a quiescent state due to the possibility of nested
critical sections. This means that if trc_inspect_reader() sees a
negative nesting value, it must set up to try again later.
This commit therefore ignores tasks that are exiting their RCU Tasks
Trace read-side critical sections so that they will be rechecked later.
[ paulmck: Apply feedback from Neeraj Upadhyay and Boqun Feng. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, trc_wait_for_one_reader() atomically increments
the trc_n_readers_need_end counter before sending the IPI
invoking trc_read_check_handler(). All failure paths out of
trc_read_check_handler() and also from the smp_call_function_single()
within trc_wait_for_one_reader() must carefully atomically decrement
this counter. This is more complex than it needs to be.
This commit therefore simplifies things and saves a few lines of
code by dispensing with the atomic decrements in favor of having
trc_read_check_handler() do the atomic increment only in the success case.
In theory, this represents no change in functionality.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcutorture is generating some nesting scenarios that are not compatible on PREEMPT_RT.
For example:
preempt_disable();
rcu_read_lock_bh();
preempt_enable();
rcu_read_unlock_bh();
The problem here is that on PREEMPT_RT the bottom halves have to be
disabled and enabled in preemptible context.
Reorder locking: start with BH locking and continue with then with
disabling preemption or interrupts. In the unlocking do it reverse by
first enabling interrupts and preemption and BH at the very end.
Ensure that on PREEMPT_RT BH locking remains unchanged if in
non-preemptible context.
Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com
Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1
Signed-off-by: Scott Wood <swood@redhat.com>
[bigeasy: Drop ATOM_BH, make it only about changing BH in atomic
context. Allow enabling RCU in IRQ-off section. Reword commit message.]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, in CONFIG_RCU_BOOST kernels, if the rcu_torture_init()
function's call to cpuhp_setup_state() fails, rcu_torture_cleanup()
gamely passes nonsense to cpuhp_remove_state(). This results in
strange and misleading splats. This commit therefore ensures that if
the rcu_torture_init() function's call to cpuhp_setup_state() fails,
rcu_torture_cleanup() avoids invoking cpuhp_remove_state().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When running rcuscale as a module, any rcu_scale_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running rcuscale
built-in, for example, when using the kvm.sh script. This commit
therefore adds WARN_ON_ONCE() to allow distinguishing rcu_scale_init()
errors when running rcuscale built-in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When running refscale as a module, any ref_scale_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running refscale
built-in, for example, when using the kvm.sh script. This commit
therefore adds WARN_ON_ONCE() to allow distinguishing ref_scale_init()
errors when running refscale built-in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When running rcutorture as a module, any rcu_torture_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running rcutorture
built-in, for example, when using the kvm.sh script. This commit
therefore adds WARN_ON_ONCE() to allow distinguishing rcu_torture_init()
errors when running rcutorture built-in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, specifying the rcutorture.read_exit_burst=0 kernel boot
parameter will result in a -EINVAL exit code that will stop the rcutorture
test run before it has fully initialized. This commit therefore uses a
zero exit code in that case, thus allowing rcutorture.read_exit_burst=0
to complete normally.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, RCU Tasks Trace initializes the trc_n_readers_need_end counter
to the value one, increments it before each trc_read_check_handler()
IPI, then decrements it within trc_read_check_handler() if the target
task was in a quiescent state (or if the target task moved to some other
CPU while the IPI was in flight), complaining if the new value was zero.
The rationale for complaining is that the initial value of one must be
decremented away before zero can be reached, and this decrement has not
yet happened.
Except that trc_read_check_handler() is initiated with an asynchronous
smp_call_function_single(), which might be significantly delayed. This
can result in false-positive complaints about the counter reaching zero.
This commit therefore waits for in-flight IPI handlers to complete before
decrementing away the initial value of one from the trc_n_readers_need_end
counter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The sync_sched_exp_online_cleanup() checks to see if RCU needs
an expedited quiescent state from the incoming CPU, sending it
an IPI if so. Before sending IPI, it checks whether expedited
qs need has been already requested for the incoming CPU, by
checking rcu_data.cpu_no_qs.b.exp for the current cpu, on which
sync_sched_exp_online_cleanup() is running. This works for the
case where incoming CPU is same as self. However, for the case
where incoming CPU is different from self, expedited request
won't get marked, which can potentially delay reporting of
expedited quiescent state for the incoming CPU.
Fixes: e015a34112 ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcu update module parameters currently don't appear in sysfs and this is
a serviceability issue as it might be needed to access their default
values at runtime.
Fix this issue by changing rcu update module parameters permissions to
world-readable.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Certain configurations (e.g., systems that make heavy use of netns)
need to use synchronize_rcu_expedited() to service RCU grace periods
even after boot.
Even though synchronize_rcu_expedited() has been traditionally
considered harmful for RT for the heavy use of IPIs, it is perfectly
usable under certain conditions (e.g. nohz_full).
Make rcupdate.rcu_normal_after_boot= again writeable on RT (if NO_HZ_
FULL is defined), but keep its default value to 1 (enabled) to avoid
regressions. Users who need synchronize_rcu_expedited() will boot with
rcupdate.rcu_normal_after_ boot=0 in the kernel cmdline.
Reflect the change in synchronize_rcu_expedited_wait() by removing the
WARN related to CONFIG_PREEMPT_RT.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The CPU-hotplug functions take a "cpu" parameter, but rcutree_dying_cpu()
ignores it in favor of this_cpu_ptr(). This works at the moment, but
it would be better to be consistent. This might also work better given
some possible future changes. This commit therefore uses per_cpu_ptr()
to avoid ignoring the rcutree_dying_cpu() function's argument.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_report_dead() disables preemption across its call to
rcu_report_exp_rdp(), but this is pointless because interrupts are
already disabled by the caller. In addition, rcu_report_dead() computes
the address of the outgoing CPU's rcu_data structure, which is also
pointless because this address is already present in local variable rdp.
This commit therefore drops the preemption disabling and passes rdp
to rcu_report_exp_rdp().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
counter of an incoming CPU when required. It is currently invoked
from rcutree_prepare_cpu(), which runs before the incoming CPU is
running, and thus on some other CPU. This makes the per-CPU accesses in
rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
the running CPU cannot possibly be in dyntick-idle mode, which means
that rcu_dynticks_eqs_online() never has any effect.
It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
only because the CPU-offline process just happens to leave ->dynticks in
the correct state. After all, if ->dynticks were in the wrong state on a
just-onlined CPU, rcutorture would complain bitterly the next time that
CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
for example, those built by rcutorture scenario TREE04. One could
argue that this means that rcu_dynticks_eqs_online() is unnecessary,
however, removing it would make the CPU-online process vulnerable to
slight changes in the CPU-offline process.
One could also ask why it is safe to move the rcu_dynticks_eqs_online()
call so late in the CPU-online process. Indeed, there was a time when it
would not have been safe, which does much to explain its current location.
However, the marking of a CPU as online from an RCU perspective has long
since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
that is required is that ->dynticks be set correctly by the time that
the CPU is marked as online from an RCU perspective. After all, the RCU
grace-period kthread does not check to see if offline CPUs are also idle.
(In case you were curious, this is one reason why there is quiescent-state
reporting as part of the offlining process.)
This commit therefore moves the call to rcu_dynticks_eqs_online() from
rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
to be running on the incoming CPU. The call to this function must of
course be placed before this rcu_cpu_starting() announces this CPU's
presence to RCU.
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Near the beginning of rcu_gp_init() is a per-rcu_node loop that waits
for CPU-hotplug operations that might have started before the new
grace period did. This commit adds a comment explaining that this
wait does not exclude CPU-hotplug operations.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Invoking scripts/checkkconfigsymbols.py in the Linux-kernel source tree
located the following issues:
1. TREE_PREEMPT_RCU
Referencing files: arch/sh/configs/sdk7786_defconfig
It should now be CONFIG_PREEMPT_RCU. Except that the CONFIG_PREEMPT=y in
that same file implies CONFIG_PREEMPT_RCU=y. Therefore, delete the
CONFIG_TREE_PREEMPT_RCU=y line.
The reason is as follows:
In kernel/rcu/Kconfig, we have
config PREEMPT_RCU
bool
default y if PREEMPTION
https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt says,
"The default value is only assigned to the config symbol if no other value
was set by the user (via the input prompt above)."
there is no prompt in config PREEMPT_RCU entry, so we are guaranteed to
get CONFIG_PREEMPT_RCU=y when CONFIG_PREEMPT is present.
2. RCU_CPU_STALL_INFO
Referencing files: arch/xtensa/configs/nommu_kc705_defconfig
The old Kconfig option RCU_CPU_STALL_INFO was removed by commit
75c27f119b ("rcu: Remove CONFIG_RCU_CPU_STALL_INFO"), and the kernel
now acts as if this Kconfig option was unconditionally enabled.
3. RCU_NOCB_CPU_ALL
Referencing files:
Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
This is an old snapshot of the code. I update this from the real
rcu_prepare_for_idle() function in kernel/rcu/tree_plugin.h.
This change was tested by invoking "make htmldocs".
4. RCU_TORTURE_TESTS
Referencing files: kernel/rcu/rcutorture.c
Forward-progress checking conflicts with CPU-stall testing, so we should
complain at "modprobe rcutorture" when both are enabled.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_implicit_dynticks_qs() function's local variable ruqp references
the ->rcu_urgent_qs field in the rcu_data structure referenced by the
function parameter rdp, with a rather odd method for computing the
pointer to this field. This commit therefore simplifies things and
saves a couple of lines of code by replacing each instance of ruqp with
&rdp->need_heavy_qs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_implicit_dynticks_qs() function's local variable rnhqp references
the ->rcu_need_heavy_qs field in the rcu_data structure referenced by
the function parameter rdp, with a rather odd method for computing
the pointer to this field. This commit therefore simplifies things
and saves a few lines of code by replacing each instance of rnhqp with
&rdp->need_heavy_qs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit removes a non-value-returning "return" statement at the end
of __call_rcu_nocb_wake() and adds a blank line following declarations
in nocb_cb_can_run().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit marks accesses to the rcu_state.n_force_qs. These data
races are hard to make happen, but syzkaller was equal to the task.
Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The regular pile:
- A few improvements to the mutex code
- Documentation updates for atomics to clarify the difference between
cmpxchg() and try_cmpxchg() and to explain the forward progress
expectations.
- Simplification of the atomics fallback generator
- The addition of arch_atomic_long*() variants and generic arch_*()
bitops based on them.
- Add the missing might_sleep() invocations to the down*() operations of
semaphores.
The PREEMPT_RT locking core:
- Scheduler updates to support the state preserving mechanism for
'sleeping' spin- and rwlocks on RT. This mechanism is carefully
preserving the state of the task when blocking on a 'sleeping' spin- or
rwlock and takes regular wake-ups targeted at the same task into
account. The preserved or updated (via a regular wakeup) state is
restored when the lock has been acquired.
- Restructuring of the rtmutex code so it can be utilized and extended
for the RT specific lock variants.
- Restructuring of the ww_mutex code to allow sharing of the ww_mutex
specific functionality for rtmutex based ww_mutexes.
- Header file disentangling to allow substitution of the regular lock
implementations with the PREEMPT_RT variants without creating an
unmaintainable #ifdef mess.
- Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock
implementations. Contrary to the regular rw_semaphores and rwlocks the
PREEMPT_RT implementation is writer unfair because it is infeasible to
do priority inheritance on multiple readers. Experience over the years
has shown that real-time workloads are not the typical workloads which
are sensitive to writer starvation. The alternative solution would be
to allow only a single reader which has been tried and discarded as it
is a major bottleneck especially for mmap_sem. Aside of that many of
the writer starvation critical usage sites have been converted to a
writer side mutex/spinlock and RCU read side protections in the past
decade so that the issue is less prominent than it used to be.
- The actual rtmutex based lock substitutions for PREEMPT_RT enabled
kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
rwlock_t. The spin/rw_lock*() functions disable migration across the
critical section to preserve the existing semantics vs. per CPU
variables.
- Rework of the futex REQUEUE_PI mechanism to handle the case of early
wake-ups which interleave with a re-queue operation to prevent the
situation that a task would be blocked on both the rtmutex associated
to the outer futex and the rtmutex based hash bucket spinlock.
While this situation cannot happen on !RT enabled kernels the changes
make the underlying concurrency problems easier to understand in
general. As a result the difference between !RT and RT kernels is
reduced to the handling of waiting for the critical section. !RT
kernels simply spin-wait as before and RT kernels utilize rcu_wait().
- The substitution of local_lock for PREEMPT_RT with a spinlock which
protects the critical section while staying preemptible. The CPU
locality is established by disabling migration.
The underlying concepts of this code have been in use in PREEMPT_RT for
way more than a decade. The code has been refactored several times over
the years and this final incarnation has been optimized once again to be
as non-intrusive as possible, i.e. the RT specific parts are mostly
isolated.
It has been extensively tested in the 5.14-rt patch series and it has
been verified that !RT kernels are not affected by these changes.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnuMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaeWD/wLNMoAZXslS0prfr64ANjRgLXIqMFA
r6xgioiwxxaxbmZ/GNPraoLC//ENo6mwobuUovq8yKljv2oBu6AmlUkBwrmMBc8Q
nnm7jjGM3bZ1REup7rWERnjdOZfdGVSL5CUAAfthyC744XmXaepwrrrqfXG22GxJ
QwLXBTAwXFVDxKfUjDKzEo5zgLNHRvHbzc0DpTYYn6WcuDJOmlyWnhfDTu2mNG9Z
rqjqy+OgOUEUprQDgitk5hedfeic2kPm1mxxZrXkpkuPef5be2inQq2siC7GxR4g
0AKeUsMFgFmSqiD4iJTALJ+8WXkgMnD9VgooeWHk4OaqZfaGzi/iwRSnrlnf7+OV
GTmrsmX+TX/Wz2BDjB+3zylQnYqYh3quE5w4UO6uUyJXfdhlnvsjVc8bEajDFjeM
yUapaWxdAri7k2n+vjXQthAngxtYPgXtFbZPoOl109JcDcG6jJsCdM5TdenegaRs
WeUh05JqrH8+qI+Nwzc4rO+PmKHQ8on2wKdgLp11dviiPOf8OguH65nDQSGZ/fGv
7cnD9A1/MUd0sdrvc52AqkIYxh+Rp9GnCs1xA82JsTXgAPcXqAWjjR2JFPHL4neV
eW2upZekl8lMR7hkfcQbhe4MVjQIjff3iFOkQXittxMzfzFdi0tly8xB8AzpTHOx
h91MycvmMR2zRw==
=IEqE
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking and atomics updates from Thomas Gleixner:
"The regular pile:
- A few improvements to the mutex code
- Documentation updates for atomics to clarify the difference between
cmpxchg() and try_cmpxchg() and to explain the forward progress
expectations.
- Simplification of the atomics fallback generator
- The addition of arch_atomic_long*() variants and generic arch_*()
bitops based on them.
- Add the missing might_sleep() invocations to the down*() operations
of semaphores.
The PREEMPT_RT locking core:
- Scheduler updates to support the state preserving mechanism for
'sleeping' spin- and rwlocks on RT.
This mechanism is carefully preserving the state of the task when
blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups
targeted at the same task into account. The preserved or updated
(via a regular wakeup) state is restored when the lock has been
acquired.
- Restructuring of the rtmutex code so it can be utilized and
extended for the RT specific lock variants.
- Restructuring of the ww_mutex code to allow sharing of the ww_mutex
specific functionality for rtmutex based ww_mutexes.
- Header file disentangling to allow substitution of the regular lock
implementations with the PREEMPT_RT variants without creating an
unmaintainable #ifdef mess.
- Shared base code for the PREEMPT_RT specific rw_semaphore and
rwlock implementations.
Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT
implementation is writer unfair because it is infeasible to do
priority inheritance on multiple readers. Experience over the years
has shown that real-time workloads are not the typical workloads
which are sensitive to writer starvation.
The alternative solution would be to allow only a single reader
which has been tried and discarded as it is a major bottleneck
especially for mmap_sem. Aside of that many of the writer
starvation critical usage sites have been converted to a writer
side mutex/spinlock and RCU read side protections in the past
decade so that the issue is less prominent than it used to be.
- The actual rtmutex based lock substitutions for PREEMPT_RT enabled
kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
rwlock_t. The spin/rw_lock*() functions disable migration across
the critical section to preserve the existing semantics vs per-CPU
variables.
- Rework of the futex REQUEUE_PI mechanism to handle the case of
early wake-ups which interleave with a re-queue operation to
prevent the situation that a task would be blocked on both the
rtmutex associated to the outer futex and the rtmutex based hash
bucket spinlock.
While this situation cannot happen on !RT enabled kernels the
changes make the underlying concurrency problems easier to
understand in general. As a result the difference between !RT and
RT kernels is reduced to the handling of waiting for the critical
section. !RT kernels simply spin-wait as before and RT kernels
utilize rcu_wait().
- The substitution of local_lock for PREEMPT_RT with a spinlock which
protects the critical section while staying preemptible. The CPU
locality is established by disabling migration.
The underlying concepts of this code have been in use in PREEMPT_RT for
way more than a decade. The code has been refactored several times over
the years and this final incarnation has been optimized once again to be
as non-intrusive as possible, i.e. the RT specific parts are mostly
isolated.
It has been extensively tested in the 5.14-rt patch series and it has
been verified that !RT kernels are not affected by these changes"
* tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits)
locking/rtmutex: Return success on deadlock for ww_mutex waiters
locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes
locking/rtmutex: Dequeue waiter on ww_mutex deadlock
locking/rtmutex: Dont dereference waiter lockless
locking/semaphore: Add might_sleep() to down_*() family
locking/ww_mutex: Initialize waiter.ww_ctx properly
static_call: Update API documentation
locking/local_lock: Add PREEMPT_RT support
locking/spinlock/rt: Prepare for RT local_lock
locking/rtmutex: Add adaptive spinwait mechanism
locking/rtmutex: Implement equal priority lock stealing
preempt: Adjust PREEMPT_LOCK_OFFSET for RT
locking/rtmutex: Prevent lockdep false positive with PI futexes
futex: Prevent requeue_pi() lock nesting issue on RT
futex: Simplify handle_early_requeue_pi_wakeup()
futex: Reorder sanity checks in futex_requeue()
futex: Clarify comment in futex_requeue()
futex: Restructure futex_requeue()
futex: Correct the number of requeued waiters for PI
futex: Remove bogus condition for requeue PI
...
RT builds substitutions for rwsem, mutex, spinlock and rwlock around
rtmutexes. Split the inner working out so each lock substitution can use
them with the appropriate lockdep annotations. This avoids having an extra
unused lockdep map in the wrapped rtmutex.
No functional change.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211302.784739994@linutronix.de
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The WARN_ON_ONCE() invocation within the CONFIG_PREEMPT=y version of
rcu_note_context_switch() triggers when there is a voluntary context
switch in an RCU read-side critical section, but there is quite a gap
between the output of that WARN_ON_ONCE() and this RCU-usage error.
This commit therefore converts the WARN_ON_ONCE() to a WARN_ONCE()
that explicitly describes the problem in its message.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are a few remaining locations in kernel/rcu that still use
"&per_cpu()". This commit replaces them with "per_cpu_ptr(&)", and does
not introduce any functional change.
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Within rcu_gp_fqs_loop(), the "ret" local variable is set to the
return value from swait_event_idle_timeout_exclusive(), but "ret" is
unconditionally overwritten later in the code. This commit therefore
removes this useless assignment.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit marks the accesses in tree_stall.h so as to both avoid
undesirable compiler optimizations and to keep KCSAN focused on the
accesses of the core algorithm.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The kbuild test project found an oversized stack frame in rcu_gp_kthread()
for some kernel configurations. This oversizing was due to a very large
amount of inlining, which is unnecessary due to the fact that this code
executes infrequently. This commit therefore marks rcu_gp_init() and
rcu_gp_fqs_loop noinline_for_stack to conserve stack space.
Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Rong Chen <rong.a.chen@intel.com>
[ paulmck: noinline_for_stack per Nathan Chancellor. ]
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>