The sync_sched_exp_online_cleanup() is called from rcutree_online_cpu(),
which can be invoked with interrupts enabled. This means that
the ->cpu_no_qs.b.exp field is subject to data races from the
rcu_exp_handler() IPI handler, so this commit marks the load from
that field.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Having two fields for the same purpose with subtle differences on
different RCU flavours is confusing, especially when both fields always
exist on both RCU flavours.
Fortunately, it is now safe for preemptible RCU to rely on the rcu_data
structure's ->cpu_no_qs.b.exp field, just like non-preemptible RCU.
This commit therefore removes the ad-hoc ->exp_deferred_qs field.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
On non-preemptible RCU, move clearing of the rcu_data structure's
->cpu_no_qs.b.exp filed to the actual expedited quiescent state report
function, matching hw preemptible RCU handles the ->exp_deferred_qs field.
This prepares for removing ->exp_deferred_qs in favor of ->cpu_no_qs.b.exp
for both preemptible and non-preemptible RCU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Preemptible RCU does not use the rcu_data structure's ->cpu_no_qs.b.exp,
instead using a separate ->exp_deferred_qs field to record the need for
an expedited quiescent state.
In fact ->cpu_no_qs.b.exp should never be set in preemptible RCU because
preemptible RCU's expedited grace periods use other mechanisms to record
quiescent states.
This commit therefore removes the implicit rcu_qs() reference to
->cpu_no_qs.b.exp in favor of a direct reference to ->cpu_no_qs.b.norm.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
0Day/LKP observed that the refscale results fail to complete when larger
values of nrun (such as 300) are specified. The problem is that printk()
can accept at most a 1024-byte buffer. This commit therefore prints
the buffer whenever its length exceeds 800 bytes.
CC: Philip Li <philip.li@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There is only the one OOM error case in main_func(), so this commit
eliminates the errexit local variable in favor of a branch to cleanup
code.
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Because Tiny srcu_read_unlock() directly calls swake_up_one(), lockdep
complains when a pi lock is held across that srcu_read_unlock().
Although this is a lockdep false positive (there is no other CPU to
complete the deadlock cycle), lockdep is what it is at the moment.
This commit therefore prevents rcutorture from holding pi lock across
a Tiny srcu_read_unlock().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, nested readers occur only when a timer handler interrupts a
reader. This is rare, and is thus insufficient testing of the transition
between nesting levels. This commit therefore causes rcutorture nested
readers to be the rule rather than the exception.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCUTORTURE_RDR_MASK is currently not the bit indicated by
RCUTORTURE_RDR_SHIFT, but is instead all the bits less significant than
that one. This is an accident waiting to happen, so this commit makes
RCUTORTURE_RDR_MASK be that one bit and adjusts uses accordingly.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the check_all_holdout_tasks_trace() function removes all tasks
marked with ->trc_reader_checked from the holdout list, including those
with IPIs pending. This means that the IPI handler might arrive at
a task that has already been removed from the list, which is at best
an accident waiting to happen.
This commit therefore avoids removing tasks with IPIs pending from
the holdout list. This in turn means that the "if" condition in the
for_each_online_cpu() loop in rcu_tasks_trace_postgp() should always
evaluate to false, so a WARN_ON_ONCE() is added to check that.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tiny SRCU readers can appear at task level, but also in interrupt and
softirq handlers. Because Tiny SRCU is selected only in kernels built
with CONFIG_SMP=n and CONFIG_PREEMPTION=n, it is not possible for a grace
period to start while there is a non-task-level SRCU reader executing.
This means that it does not make sense for __srcu_read_unlock() to awaken
the Tiny SRCU grace period, because that can only happen when the grace
period is waiting for one value of ->srcu_idx and __srcu_read_unlock()
is ending the last reader for some other value of ->srcu_idx. After all,
any such wakeup will be redundant.
Worse yet, in some cases, such wakeups generate lockdep splats:
======================================================
WARNING: possible circular locking dependency detected
5.15.0-rc1+ #3758 Not tainted
------------------------------------------------------
rcu_torture_rea/53 is trying to acquire lock:
ffffffff9514e6a8 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}, at:
xa/0x30
but task is already holding lock:
ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
_extend+0x370/0x400
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x2f/0x50
try_to_wake_up+0x50/0x580
swake_up_locked.part.7+0xe/0x30
swake_up_one+0x22/0x30
rcutorture_one_extend+0x1b6/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_timer+0x1a/0x70
call_timer_fn+0xa6/0x230
run_timer_softirq+0x493/0x4c0
__do_softirq+0xc0/0x371
irq_exit+0x73/0x90
sysvec_apic_timer_interrupt+0x63/0x80
asm_sysvec_apic_timer_interrupt+0x12/0x20
default_idle+0xb/0x10
default_idle_call+0x5e/0x170
do_idle+0x18a/0x1f0
cpu_startup_entry+0xa/0x10
start_kernel+0x678/0x69f
secondary_startup_64_no_verify+0xc2/0xcb
-> #0 (srcu_ctl.srcu_wq.lock){..-.}-{2:2}:
__lock_acquire+0x130c/0x2440
lock_acquire+0xc2/0x270
_raw_spin_lock_irqsave+0x2f/0x50
swake_up_one+0xa/0x30
rcutorture_one_extend+0x387/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_reader+0xac/0x200
kthread+0x12d/0x150
ret_from_fork+0x22/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&p->pi_lock);
lock(srcu_ctl.srcu_wq.lock);
lock(&p->pi_lock);
lock(srcu_ctl.srcu_wq.lock);
*** DEADLOCK ***
1 lock held by rcu_torture_rea/53:
#0: ffff95c642479d80 (&p->pi_lock){-.-.}-{2:2}, at:
_extend+0x370/0x400
stack backtrace:
CPU: 0 PID: 53 Comm: rcu_torture_rea Not tainted 5.15.0-rc1+
Hardware name: Red Hat KVM/RHEL-AV, BIOS
e_el8.5.0+746+bbd5d70c 04/01/2014
Call Trace:
check_noncircular+0xfe/0x110
? find_held_lock+0x2d/0x90
__lock_acquire+0x130c/0x2440
lock_acquire+0xc2/0x270
? swake_up_one+0xa/0x30
? find_held_lock+0x72/0x90
_raw_spin_lock_irqsave+0x2f/0x50
? swake_up_one+0xa/0x30
swake_up_one+0xa/0x30
rcutorture_one_extend+0x387/0x400
rcu_torture_one_read+0x290/0x5d0
rcu_torture_reader+0xac/0x200
? rcutorture_oom_notify+0xf0/0xf0
? __kthread_parkme+0x61/0x90
? rcu_torture_one_read+0x5d0/0x5d0
kthread+0x12d/0x150
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
This is a false positive because there is only one CPU, and both locks
are raw (non-preemptible) spinlocks. However, it is worthwhile getting
rid of the redundant wakeup, which has the side effect of breaking
the theoretical deadlock cycle. This commit therefore eliminates the
redundant wakeups.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT,
which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however,
it is not even possible to use either GFP_ATOMIC nor GFP_NOWAIT in certain
non-preemptive contexts/RT kernel including raw_spin_locks (see gfp.h and ab00db216c).
Fix it by instructing stackdepot to not expand stack storage via alloc_pages()
in case it runs out by using kasan_record_aux_stack_noalloc().
Jianwei Hu reported:
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:969
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 15319, name: python3
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
softirqs last enabled at (0): [<ffffffff856c8b13>] copy_process+0xaf3/0x2590
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 6 PID: 15319 Comm: python3 Tainted: G W O 5.15-rc7-preempt-rt #1
Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1b 12/17/2018
Call Trace:
show_stack+0x52/0x58
dump_stack+0xa1/0xd6
___might_sleep.cold+0x11c/0x12d
rt_spin_lock+0x3f/0xc0
rmqueue+0x100/0x1460
rmqueue+0x100/0x1460
mark_usage+0x1a0/0x1a0
ftrace_graph_ret_addr+0x2a/0xb0
rmqueue_pcplist.constprop.0+0x6a0/0x6a0
__kasan_check_read+0x11/0x20
__zone_watermark_ok+0x114/0x270
get_page_from_freelist+0x148/0x630
is_module_text_address+0x32/0xa0
__alloc_pages_nodemask+0x2f6/0x790
__alloc_pages_slowpath.constprop.0+0x12d0/0x12d0
create_prof_cpu_mask+0x30/0x30
alloc_pages_current+0xb1/0x150
stack_depot_save+0x39f/0x490
kasan_save_stack+0x42/0x50
kasan_save_stack+0x23/0x50
kasan_record_aux_stack+0xa9/0xc0
__call_rcu+0xff/0x9c0
call_rcu+0xe/0x10
put_object+0x53/0x70
__delete_object+0x7b/0x90
kmemleak_free+0x46/0x70
slab_free_freelist_hook+0xb4/0x160
kfree+0xe5/0x420
kfree_const+0x17/0x30
kobject_cleanup+0xaa/0x230
kobject_put+0x76/0x90
netdev_queue_update_kobjects+0x17d/0x1f0
... ...
ksys_write+0xd9/0x180
__x64_sys_write+0x42/0x50
do_syscall_64+0x38/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Links: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/linux/kasan.h?id=7cb3007ce2da27ec02a1a3211941e7fe6875b642
Fixes: 84109ab585 ("rcu: Record kvfree_call_rcu() call stack for KASAN")
Fixes: 26e760c9a7 ("rcu: kasan: record and print call_rcu() call stack")
Reported-by: Jianwei Hu <jianwei.hu@windriver.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the boost kthreads are created on systems with nohz_full CPUs,
the cpus_allowed_ptr is set to housekeeping_cpumask(HK_FLAG_KTHREAD).
However, when the rcu_boost_kthread_setaffinity() is called, the original
affinity will be changed and these kthreads can subsequently run on
nohz_full CPUs. This commit makes rcu_boost_kthread_setaffinity()
restrict these boost kthreads to housekeeping CPUs.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit cleans up some comments and code in kernel/rcu/tree_plugin.h.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit replaces the obsolete and ambiguous macro in_irq() with its
shiny new in_hardirq() equivalent.
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that RCU_FAST_NO_HZ is no more, there is but one implementation of
the rcu_needs_cpu() function. This commit therefore moves this function
from kernel/rcu/tree_plugin.c to kernel/rcu/tree.c.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve
systems with RCU callbacks offloaded. In this situation, all that this
Kconfig option does is slow down idle entry/exit with an additional
allways-taken early exit. If this is the only use case, then this
Kconfig option nothing but an attractive nuisance that needs to go away.
This commit therefore removes the RCU_FAST_NO_HZ Kconfig option.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This pull request contains the following branches:
fixes.2021.10.07a: Miscellaneous fixes.
scftorture.2021.09.16a: smp_call_function torture-test updates, most
notably better checking of module parameters.
tasks.2021.09.15a: Tasks-trace RCU updates that fix a number of rare
but important race-condition bugs.
torture.2021.09.13b: Other torture-test updates, most notably
better checking of module parameters. In addition, rcutorture
may now be run on CONFIG_PREEMPT_RT kernels.
torturescript.2021.09.16a: Torture-test scripting updates, most notably
specifying the new CONFIG_KCSAN_STRICT kconfig option rather
than maintaining an ever-changing list of individual KCSAN
kconfig options.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmGAVMMTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jGJBD/9ld6USOpedBLAbTYVMQYvIKoSqqDIG
74ZFhKvZ5I6Y8OZAGxXjb5U06rh4V2brlTN7IJ7XLEA1t401ENffsGeQSCxEmpEf
PqQN04dbmVvaWjD4jiLZCcl3oDp+w1gIKwmX6wh0Weogr3KZWu5aNvD5tl9qIz4a
uPC1JqTBxf7WDrLhqNxG5N4MXs27+KvukCd9wftk3NTzRJ9tyLM/YNGOVArM8rW2
QpEh8n6veB5dEoXBxmRHzuxYHN1k0Fhkbm3irMjcI0T5wj8TDod89zbg9mdFXMIj
AjZ9CGpIBa4frThdu654ZNuEQHDCsPWtMi925xNOWxh5lkPGjeWnwYpcRrwfI2pj
op0xVlur+Nam5CT/AJNT9+KogpZthAWXvwqCs5GbYNSU30Rlw99bw1vyAsJUD+af
Mv08/z4o7Kuhr4cw2vkd2UfF9zuIQsJ1jWCIjMxfj4ctBnIpedrEnEISp8Y61fWk
w9vXgCRhZCSkxoURoNss+nAUsiePUafptsvqKLu6Z53ufPA5yL0rVS778xq8vurP
Xyd34TVlQ94ydZDC5pkSNpri1HGV1U7pztFwey5GloE66iV+7TSQCfMhzLd4CM0K
wW96wimHrDtIxD6LedCZOHLHkS9AJd7F9uSoNodKspTH0tJowQztrzPW1eZifDE3
iJP8xcJ+vL67Og==
=nmaP
-----END PGP SIGNATURE-----
Merge tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Miscellaneous fixes
- Torture-test updates for smp_call_function(), most notably improved
checking of module parameters.
- Tasks-trace RCU updates that fix a number of rare but important
race-condition bugs.
- Other torture-test updates, most notably better checking of module
parameters. In addition, rcutorture may once again be run on
CONFIG_PREEMPT_RT kernels.
- Torture-test scripting updates, most notably specifying the new
CONFIG_KCSAN_STRICT kconfig option rather than maintaining an
ever-changing list of individual KCSAN kconfig options.
* tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (46 commits)
rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr
rcu: Always inline rcu_dynticks_task*_{enter,exit}()
torture: Make kvm-remote.sh print size of downloaded tarball
torture: Allot 1G of memory for scftorture runs
tools/rcu: Add an extract-stall script
scftorture: Warn on individual scf_torture_init() error conditions
scftorture: Count reschedule IPIs
scftorture: Account for weight_resched when checking for all zeroes
scftorture: Shut down if nonsensical arguments given
scftorture: Allow zero weight to exclude an smp_call_function*() category
rcu: Avoid unneeded function call in rcu_read_unlock()
rcu-tasks: Update comments to cond_resched_tasks_rcu_qs()
rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader
rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace
rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives
rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace
rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace
rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment
rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop
rcu-tasks: Fix s/instruction/instructions/ typo in comment
...
- Revert the printk format based wchan() symbol resolution as it can leak
the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset and
__sched_setscheduler() introduced a new lock dependency which is now
triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/OUkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoR/5D/9ikdGNpKg9osNqJ3GjAmxsK6kVkB29
iFe2k8pIpWDToWQf/wQRGih4Yj3Cl49QSnZcPIibh2/12EB1qrrW6iSPJkInz8Ec
/1LS5/Vewn2OyoxyXZjdvGC5gTXEodSbIazASvX7nvdMeI4gsAsL5etzrMJirT/t
aymqvr7zovvywrwMTQJrGjUMo9l4ewE8tafMNNhRu1BHU1U4ojM9yvThyRAAcmp7
3Xy49A+Yq3IgrvYI4u8FMK5Zh08KaxSFjiLhePGm/bF+wSfYmWop2TP1jY05W2Uo
ti8hfbJMUoFRYuMxAiEldkItnc0wV4M9PtWZZ/x+B71bs65Y4Zjt9cW+rxJv2+m1
vzV31EsQwGnOti072dzWN4c/cZqngVXAjaNtErvDwJUr+Tw1ayv9KUvuodMQqZY6
mu68bFUO2kV9EMe1CBOv51Uy1RGHyLj3rlNqrkw+Xp5ISE9Ad2vhUEiRp5bQx5Ci
V/XFhGZkGUluh0vccrdFlNYZwhj8cZEzkOPCnPSeZ+bq8SyZE6xuHH/lTP1CJCOy
s800rW1huM+kgV+zRN8adDkGXibAk9N3RtVGnQXmuEy8gB9LZmQg+JeM2wsc9B+6
i0gdqZnsjNAfoK+BBAG4holxptSL8/eOJsFH8ZNIoxQ+iqooyPx9tFX7yXnRTBQj
d2qWG7UvoseT+g==
=fgtS
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x36: call to __kasan_check_read() leaves .noinstr.text section
noinstr cannot have atomic_*() functions in because they're explicitly
annotated, use arch_atomic_*().
Fixes: 2be57f7328 ("rcu: Weaken ->dynticks accesses and updates")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU managed to grow a few noinstr violations:
vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x0: call to rcu_dynticks_task_trace_enter() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0xe: call to rcu_dynticks_task_trace_exit() leaves .noinstr.text section
Fix them by adding __always_inline to the relevant trivial functions.
Also replace the noinstr with __always_inline for the existing
rcu_dynticks_task_*() functions since noinstr would force noinline
them, even when empty, which seems silly.
Fixes: 7d0c9c50c5 ("rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Give try_invoke_on_locked_down_task() a saner name and have it return
an int so that the caller might distinguish between different reasons
of failure.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390
Link: https://lkml.kernel.org/r/20210929152428.649944917@infradead.org
Comments in wait-type checks be improved by mentioning the
PREEPT_RT kernel configure option.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20210811025920.20751-1-zhouzhouyi@gmail.com
Since commit aa40c138cc ("rcu: Report QS for outermost PREEMPT=n
rcu_read_unlock() for strict GPs") the function rcu_read_unlock_strict()
is invoked by the inlined rcu_read_unlock() function. However,
rcu_read_unlock_strict() is an empty function in production kernels,
which are built with CONFIG_RCU_STRICT_GRACE_PERIOD=n.
There is a mention of rcu_read_unlock_strict() in the BPF verifier,
but this is in a deny-list, meaning that BPF does not care whether
rcu_read_unlock_strict() is ever called.
This commit therefore provides a slight performance improvement
by hoisting the check of CONFIG_RCU_STRICT_GRACE_PERIOD from
rcu_read_unlock_strict() into rcu_read_unlock(), thus avoiding the
pointless call to an empty function.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The cond_resched_rcu_qs() function no longer exists, despite being mentioned
several times in kernel/rcu/tasks.h. This commit therefore updates it to
the current cond_resched_tasks_rcu_qs().
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The trc_wait_for_one_reader() function is called at multiple stages
of trace rcu-tasks GP function, rcu_tasks_wait_gp():
- First, it is called as part of per task function -
rcu_tasks_trace_pertask(), for all non-idle tasks. As part of per task
processing, this function add the task in the holdout list and if the
task is currently running on a CPU, it sends IPI to the task's CPU.
The IPI handler takes action depending on whether task is in trace
rcu-tasks read side critical section or not:
- a. If the task is in trace rcu-tasks read side critical section
(t->trc_reader_nesting != 0), the IPI handler sets the task's
->trc_reader_special.b.need_qs, so that this task notifies exit
from its outermost read side critical section (by decrementing
trc_n_readers_need_end) to the GP handling function.
trc_wait_for_one_reader() also increments trc_n_readers_need_end,
so that the trace rcu-tasks GP handler function waits for this
task's read side exit notification. The IPI handler also sets
t->trc_reader_checked to true, and no further IPIs are sent for
this task, for this trace rcu-tasks grace period and this
task can be removed from holdout list.
- b. If the task is in the process of exiting its trace rcu-tasks
read side critical section, (t->trc_reader_nesting < 0), defer
this task's processing to future calls to trc_wait_for_one_reader().
- c. If task is not in rcu-task read side critical section,
t->trc_reader_nesting == 0, ->trc_reader_checked is set for this
task, so that this task is removed from holdout list.
- Second, trc_wait_for_one_reader() is called as part of post scan, in
function rcu_tasks_trace_postscan(), for all idle tasks.
- Third, in function check_all_holdout_tasks_trace(), this function is
called for each task in the holdout list, but only if there isn't
a pending IPI for the task (->trc_ipi_to_cpu == -1). This function
removed the task from holdout list, if IPI handler has completed the
required work, to ensure that the current trace rcu-tasks grace period
either waits for this task, or this task is not in a trace rcu-tasks
read side critical section.
Now, considering the scenario where smp_call_function_single() fails in
first case, inside rcu_tasks_trace_pertask(). In this case,
->trc_ipi_to_cpu is set to the current CPU for that task. This will
result in trc_wait_for_one_reader() getting skipped in third case,
inside check_all_holdout_tasks_trace(), for this task. This further
results in ->trc_reader_checked never getting set for this task,
and the task not getting removed from holdout list. This can cause
the current trace rcu-tasks grace period to stall.
Fix the above problem, by resetting ->trc_ipi_to_cpu to -1, on
smp_call_function_single() failure, so that future IPI calls can
be send for this task.
Note that all three of the trc_wait_for_one_reader() function's
callers (rcu_tasks_trace_pertask(), rcu_tasks_trace_postscan(),
check_all_holdout_tasks_trace()) hold cpu_read_lock(). This means
that smp_call_function_single() cannot race with CPU hotplug, and thus
should never fail. Therefore, also add a warning in order to report
any such failure in case smp_call_function_single() grows some other
reason for failure.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
call_rcu_tasks_trace() does have read-side primitives - rcu_read_lock_trace()
and rcu_read_unlock_trace(). Fix this information in the comments.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
RCU tasks rude variant does not check whether the current
running context on a CPU is usermode. Read side critical section ends
on transition to usermode execution, by the virtue of usermode
execution being schedulable. Clarify this in comments for
call_rcu_tasks_rude() and synchronize_rcu_tasks_rude().
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Valid CPU numbers can be zero or greater, but the checks for
->trc_ipi_to_cpu and tick_nohz_full_cpu()'s argument are for strictly
greater than. This commit therefore corrects the check for no_hz_full
cpu in show_stalled_task_trace() so as to include cpu 0.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In check_all_holdout_tasks_trace(), firstreport is a pointer argument;
so, check the dereferenced value, instead of checking the pointer.
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Early in debugging, it made some sense to differentiate the first
iteration from subsequent iterations, but now this just causes confusion.
This commit therefore moves the "set_tasks_gp_state(rtp, RTGS_WAIT_CBS)"
statement to the beginning of the "for" loop in rcu_tasks_kthread().
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The second argument of rcu_read_unlock_trace_special() is always zero.
When called from exit_tasks_rcu_finish_trace(), it is the constant
zero, and rcu_read_unlock_trace_special() doesn't get called from
rcu_read_unlock_trace() unless the value of local variable "nesting"
is zero because in that case the early return is taken instead.
This commit therefore removes the "nesting" argument from the
rcu_read_unlock_trace_special() function, substituting the constant
zero within that function. This commit also adds a WARN_ON_ONCE()
to rcu_read_lock_trace_held() in case non-zeroness some day appears.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, trc_inspect_reader() treats a task exiting its RCU Tasks
Trace read-side critical section the same as being within that critical
section. However, this can fail because that task might have already
checked its .need_qs field, which means that it might never decrement
the all-important trc_n_readers_need_end counter. Of course, for that
to happen, the task would need to never again execute an RCU Tasks Trace
read-side critical section, but this really could happen if the system's
last trampoline was removed. Note that exit from such a critical section
cannot be treated as a quiescent state due to the possibility of nested
critical sections. This means that if trc_inspect_reader() sees a
negative nesting value, it must set up to try again later.
This commit therefore ignores tasks that are exiting their RCU Tasks
Trace read-side critical sections so that they will be rechecked later.
[ paulmck: Apply feedback from Neeraj Upadhyay and Boqun Feng. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, trc_wait_for_one_reader() atomically increments
the trc_n_readers_need_end counter before sending the IPI
invoking trc_read_check_handler(). All failure paths out of
trc_read_check_handler() and also from the smp_call_function_single()
within trc_wait_for_one_reader() must carefully atomically decrement
this counter. This is more complex than it needs to be.
This commit therefore simplifies things and saves a few lines of
code by dispensing with the atomic decrements in favor of having
trc_read_check_handler() do the atomic increment only in the success case.
In theory, this represents no change in functionality.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcutorture is generating some nesting scenarios that are not compatible on PREEMPT_RT.
For example:
preempt_disable();
rcu_read_lock_bh();
preempt_enable();
rcu_read_unlock_bh();
The problem here is that on PREEMPT_RT the bottom halves have to be
disabled and enabled in preemptible context.
Reorder locking: start with BH locking and continue with then with
disabling preemption or interrupts. In the unlocking do it reverse by
first enabling interrupts and preemption and BH at the very end.
Ensure that on PREEMPT_RT BH locking remains unchanged if in
non-preemptible context.
Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com
Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1
Signed-off-by: Scott Wood <swood@redhat.com>
[bigeasy: Drop ATOM_BH, make it only about changing BH in atomic
context. Allow enabling RCU in IRQ-off section. Reword commit message.]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, in CONFIG_RCU_BOOST kernels, if the rcu_torture_init()
function's call to cpuhp_setup_state() fails, rcu_torture_cleanup()
gamely passes nonsense to cpuhp_remove_state(). This results in
strange and misleading splats. This commit therefore ensures that if
the rcu_torture_init() function's call to cpuhp_setup_state() fails,
rcu_torture_cleanup() avoids invoking cpuhp_remove_state().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When running rcuscale as a module, any rcu_scale_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running rcuscale
built-in, for example, when using the kvm.sh script. This commit
therefore adds WARN_ON_ONCE() to allow distinguishing rcu_scale_init()
errors when running rcuscale built-in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When running refscale as a module, any ref_scale_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running refscale
built-in, for example, when using the kvm.sh script. This commit
therefore adds WARN_ON_ONCE() to allow distinguishing ref_scale_init()
errors when running refscale built-in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When running rcutorture as a module, any rcu_torture_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running rcutorture
built-in, for example, when using the kvm.sh script. This commit
therefore adds WARN_ON_ONCE() to allow distinguishing rcu_torture_init()
errors when running rcutorture built-in.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, specifying the rcutorture.read_exit_burst=0 kernel boot
parameter will result in a -EINVAL exit code that will stop the rcutorture
test run before it has fully initialized. This commit therefore uses a
zero exit code in that case, thus allowing rcutorture.read_exit_burst=0
to complete normally.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, RCU Tasks Trace initializes the trc_n_readers_need_end counter
to the value one, increments it before each trc_read_check_handler()
IPI, then decrements it within trc_read_check_handler() if the target
task was in a quiescent state (or if the target task moved to some other
CPU while the IPI was in flight), complaining if the new value was zero.
The rationale for complaining is that the initial value of one must be
decremented away before zero can be reached, and this decrement has not
yet happened.
Except that trc_read_check_handler() is initiated with an asynchronous
smp_call_function_single(), which might be significantly delayed. This
can result in false-positive complaints about the counter reaching zero.
This commit therefore waits for in-flight IPI handlers to complete before
decrementing away the initial value of one from the trc_n_readers_need_end
counter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The sync_sched_exp_online_cleanup() checks to see if RCU needs
an expedited quiescent state from the incoming CPU, sending it
an IPI if so. Before sending IPI, it checks whether expedited
qs need has been already requested for the incoming CPU, by
checking rcu_data.cpu_no_qs.b.exp for the current cpu, on which
sync_sched_exp_online_cleanup() is running. This works for the
case where incoming CPU is same as self. However, for the case
where incoming CPU is different from self, expedited request
won't get marked, which can potentially delay reporting of
expedited quiescent state for the incoming CPU.
Fixes: e015a34112 ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcu update module parameters currently don't appear in sysfs and this is
a serviceability issue as it might be needed to access their default
values at runtime.
Fix this issue by changing rcu update module parameters permissions to
world-readable.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Certain configurations (e.g., systems that make heavy use of netns)
need to use synchronize_rcu_expedited() to service RCU grace periods
even after boot.
Even though synchronize_rcu_expedited() has been traditionally
considered harmful for RT for the heavy use of IPIs, it is perfectly
usable under certain conditions (e.g. nohz_full).
Make rcupdate.rcu_normal_after_boot= again writeable on RT (if NO_HZ_
FULL is defined), but keep its default value to 1 (enabled) to avoid
regressions. Users who need synchronize_rcu_expedited() will boot with
rcupdate.rcu_normal_after_ boot=0 in the kernel cmdline.
Reflect the change in synchronize_rcu_expedited_wait() by removing the
WARN related to CONFIG_PREEMPT_RT.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The CPU-hotplug functions take a "cpu" parameter, but rcutree_dying_cpu()
ignores it in favor of this_cpu_ptr(). This works at the moment, but
it would be better to be consistent. This might also work better given
some possible future changes. This commit therefore uses per_cpu_ptr()
to avoid ignoring the rcutree_dying_cpu() function's argument.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_report_dead() disables preemption across its call to
rcu_report_exp_rdp(), but this is pointless because interrupts are
already disabled by the caller. In addition, rcu_report_dead() computes
the address of the outgoing CPU's rcu_data structure, which is also
pointless because this address is already present in local variable rdp.
This commit therefore drops the preemption disabling and passes rdp
to rcu_report_exp_rdp().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks
counter of an incoming CPU when required. It is currently invoked
from rcutree_prepare_cpu(), which runs before the incoming CPU is
running, and thus on some other CPU. This makes the per-CPU accesses in
rcu_dynticks_eqs_online() iffy at best, and it all "works" only because
the running CPU cannot possibly be in dyntick-idle mode, which means
that rcu_dynticks_eqs_online() never has any effect.
It is currently OK for rcu_dynticks_eqs_online() to have no effect, but
only because the CPU-offline process just happens to leave ->dynticks in
the correct state. After all, if ->dynticks were in the wrong state on a
just-onlined CPU, rcutorture would complain bitterly the next time that
CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y,
for example, those built by rcutorture scenario TREE04. One could
argue that this means that rcu_dynticks_eqs_online() is unnecessary,
however, removing it would make the CPU-online process vulnerable to
slight changes in the CPU-offline process.
One could also ask why it is safe to move the rcu_dynticks_eqs_online()
call so late in the CPU-online process. Indeed, there was a time when it
would not have been safe, which does much to explain its current location.
However, the marking of a CPU as online from an RCU perspective has long
since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all
that is required is that ->dynticks be set correctly by the time that
the CPU is marked as online from an RCU perspective. After all, the RCU
grace-period kthread does not check to see if offline CPUs are also idle.
(In case you were curious, this is one reason why there is quiescent-state
reporting as part of the offlining process.)
This commit therefore moves the call to rcu_dynticks_eqs_online() from
rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed
to be running on the incoming CPU. The call to this function must of
course be placed before this rcu_cpu_starting() announces this CPU's
presence to RCU.
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Near the beginning of rcu_gp_init() is a per-rcu_node loop that waits
for CPU-hotplug operations that might have started before the new
grace period did. This commit adds a comment explaining that this
wait does not exclude CPU-hotplug operations.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Invoking scripts/checkkconfigsymbols.py in the Linux-kernel source tree
located the following issues:
1. TREE_PREEMPT_RCU
Referencing files: arch/sh/configs/sdk7786_defconfig
It should now be CONFIG_PREEMPT_RCU. Except that the CONFIG_PREEMPT=y in
that same file implies CONFIG_PREEMPT_RCU=y. Therefore, delete the
CONFIG_TREE_PREEMPT_RCU=y line.
The reason is as follows:
In kernel/rcu/Kconfig, we have
config PREEMPT_RCU
bool
default y if PREEMPTION
https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt says,
"The default value is only assigned to the config symbol if no other value
was set by the user (via the input prompt above)."
there is no prompt in config PREEMPT_RCU entry, so we are guaranteed to
get CONFIG_PREEMPT_RCU=y when CONFIG_PREEMPT is present.
2. RCU_CPU_STALL_INFO
Referencing files: arch/xtensa/configs/nommu_kc705_defconfig
The old Kconfig option RCU_CPU_STALL_INFO was removed by commit
75c27f119b ("rcu: Remove CONFIG_RCU_CPU_STALL_INFO"), and the kernel
now acts as if this Kconfig option was unconditionally enabled.
3. RCU_NOCB_CPU_ALL
Referencing files:
Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
This is an old snapshot of the code. I update this from the real
rcu_prepare_for_idle() function in kernel/rcu/tree_plugin.h.
This change was tested by invoking "make htmldocs".
4. RCU_TORTURE_TESTS
Referencing files: kernel/rcu/rcutorture.c
Forward-progress checking conflicts with CPU-stall testing, so we should
complain at "modprobe rcutorture" when both are enabled.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_implicit_dynticks_qs() function's local variable ruqp references
the ->rcu_urgent_qs field in the rcu_data structure referenced by the
function parameter rdp, with a rather odd method for computing the
pointer to this field. This commit therefore simplifies things and
saves a couple of lines of code by replacing each instance of ruqp with
&rdp->need_heavy_qs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_implicit_dynticks_qs() function's local variable rnhqp references
the ->rcu_need_heavy_qs field in the rcu_data structure referenced by
the function parameter rdp, with a rather odd method for computing
the pointer to this field. This commit therefore simplifies things
and saves a few lines of code by replacing each instance of rnhqp with
&rdp->need_heavy_qs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit removes a non-value-returning "return" statement at the end
of __call_rcu_nocb_wake() and adds a blank line following declarations
in nocb_cb_can_run().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit marks accesses to the rcu_state.n_force_qs. These data
races are hard to make happen, but syzkaller was equal to the task.
Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The regular pile:
- A few improvements to the mutex code
- Documentation updates for atomics to clarify the difference between
cmpxchg() and try_cmpxchg() and to explain the forward progress
expectations.
- Simplification of the atomics fallback generator
- The addition of arch_atomic_long*() variants and generic arch_*()
bitops based on them.
- Add the missing might_sleep() invocations to the down*() operations of
semaphores.
The PREEMPT_RT locking core:
- Scheduler updates to support the state preserving mechanism for
'sleeping' spin- and rwlocks on RT. This mechanism is carefully
preserving the state of the task when blocking on a 'sleeping' spin- or
rwlock and takes regular wake-ups targeted at the same task into
account. The preserved or updated (via a regular wakeup) state is
restored when the lock has been acquired.
- Restructuring of the rtmutex code so it can be utilized and extended
for the RT specific lock variants.
- Restructuring of the ww_mutex code to allow sharing of the ww_mutex
specific functionality for rtmutex based ww_mutexes.
- Header file disentangling to allow substitution of the regular lock
implementations with the PREEMPT_RT variants without creating an
unmaintainable #ifdef mess.
- Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock
implementations. Contrary to the regular rw_semaphores and rwlocks the
PREEMPT_RT implementation is writer unfair because it is infeasible to
do priority inheritance on multiple readers. Experience over the years
has shown that real-time workloads are not the typical workloads which
are sensitive to writer starvation. The alternative solution would be
to allow only a single reader which has been tried and discarded as it
is a major bottleneck especially for mmap_sem. Aside of that many of
the writer starvation critical usage sites have been converted to a
writer side mutex/spinlock and RCU read side protections in the past
decade so that the issue is less prominent than it used to be.
- The actual rtmutex based lock substitutions for PREEMPT_RT enabled
kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
rwlock_t. The spin/rw_lock*() functions disable migration across the
critical section to preserve the existing semantics vs. per CPU
variables.
- Rework of the futex REQUEUE_PI mechanism to handle the case of early
wake-ups which interleave with a re-queue operation to prevent the
situation that a task would be blocked on both the rtmutex associated
to the outer futex and the rtmutex based hash bucket spinlock.
While this situation cannot happen on !RT enabled kernels the changes
make the underlying concurrency problems easier to understand in
general. As a result the difference between !RT and RT kernels is
reduced to the handling of waiting for the critical section. !RT
kernels simply spin-wait as before and RT kernels utilize rcu_wait().
- The substitution of local_lock for PREEMPT_RT with a spinlock which
protects the critical section while staying preemptible. The CPU
locality is established by disabling migration.
The underlying concepts of this code have been in use in PREEMPT_RT for
way more than a decade. The code has been refactored several times over
the years and this final incarnation has been optimized once again to be
as non-intrusive as possible, i.e. the RT specific parts are mostly
isolated.
It has been extensively tested in the 5.14-rt patch series and it has
been verified that !RT kernels are not affected by these changes.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnuMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaeWD/wLNMoAZXslS0prfr64ANjRgLXIqMFA
r6xgioiwxxaxbmZ/GNPraoLC//ENo6mwobuUovq8yKljv2oBu6AmlUkBwrmMBc8Q
nnm7jjGM3bZ1REup7rWERnjdOZfdGVSL5CUAAfthyC744XmXaepwrrrqfXG22GxJ
QwLXBTAwXFVDxKfUjDKzEo5zgLNHRvHbzc0DpTYYn6WcuDJOmlyWnhfDTu2mNG9Z
rqjqy+OgOUEUprQDgitk5hedfeic2kPm1mxxZrXkpkuPef5be2inQq2siC7GxR4g
0AKeUsMFgFmSqiD4iJTALJ+8WXkgMnD9VgooeWHk4OaqZfaGzi/iwRSnrlnf7+OV
GTmrsmX+TX/Wz2BDjB+3zylQnYqYh3quE5w4UO6uUyJXfdhlnvsjVc8bEajDFjeM
yUapaWxdAri7k2n+vjXQthAngxtYPgXtFbZPoOl109JcDcG6jJsCdM5TdenegaRs
WeUh05JqrH8+qI+Nwzc4rO+PmKHQ8on2wKdgLp11dviiPOf8OguH65nDQSGZ/fGv
7cnD9A1/MUd0sdrvc52AqkIYxh+Rp9GnCs1xA82JsTXgAPcXqAWjjR2JFPHL4neV
eW2upZekl8lMR7hkfcQbhe4MVjQIjff3iFOkQXittxMzfzFdi0tly8xB8AzpTHOx
h91MycvmMR2zRw==
=IEqE
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking and atomics updates from Thomas Gleixner:
"The regular pile:
- A few improvements to the mutex code
- Documentation updates for atomics to clarify the difference between
cmpxchg() and try_cmpxchg() and to explain the forward progress
expectations.
- Simplification of the atomics fallback generator
- The addition of arch_atomic_long*() variants and generic arch_*()
bitops based on them.
- Add the missing might_sleep() invocations to the down*() operations
of semaphores.
The PREEMPT_RT locking core:
- Scheduler updates to support the state preserving mechanism for
'sleeping' spin- and rwlocks on RT.
This mechanism is carefully preserving the state of the task when
blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups
targeted at the same task into account. The preserved or updated
(via a regular wakeup) state is restored when the lock has been
acquired.
- Restructuring of the rtmutex code so it can be utilized and
extended for the RT specific lock variants.
- Restructuring of the ww_mutex code to allow sharing of the ww_mutex
specific functionality for rtmutex based ww_mutexes.
- Header file disentangling to allow substitution of the regular lock
implementations with the PREEMPT_RT variants without creating an
unmaintainable #ifdef mess.
- Shared base code for the PREEMPT_RT specific rw_semaphore and
rwlock implementations.
Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT
implementation is writer unfair because it is infeasible to do
priority inheritance on multiple readers. Experience over the years
has shown that real-time workloads are not the typical workloads
which are sensitive to writer starvation.
The alternative solution would be to allow only a single reader
which has been tried and discarded as it is a major bottleneck
especially for mmap_sem. Aside of that many of the writer
starvation critical usage sites have been converted to a writer
side mutex/spinlock and RCU read side protections in the past
decade so that the issue is less prominent than it used to be.
- The actual rtmutex based lock substitutions for PREEMPT_RT enabled
kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and
rwlock_t. The spin/rw_lock*() functions disable migration across
the critical section to preserve the existing semantics vs per-CPU
variables.
- Rework of the futex REQUEUE_PI mechanism to handle the case of
early wake-ups which interleave with a re-queue operation to
prevent the situation that a task would be blocked on both the
rtmutex associated to the outer futex and the rtmutex based hash
bucket spinlock.
While this situation cannot happen on !RT enabled kernels the
changes make the underlying concurrency problems easier to
understand in general. As a result the difference between !RT and
RT kernels is reduced to the handling of waiting for the critical
section. !RT kernels simply spin-wait as before and RT kernels
utilize rcu_wait().
- The substitution of local_lock for PREEMPT_RT with a spinlock which
protects the critical section while staying preemptible. The CPU
locality is established by disabling migration.
The underlying concepts of this code have been in use in PREEMPT_RT for
way more than a decade. The code has been refactored several times over
the years and this final incarnation has been optimized once again to be
as non-intrusive as possible, i.e. the RT specific parts are mostly
isolated.
It has been extensively tested in the 5.14-rt patch series and it has
been verified that !RT kernels are not affected by these changes"
* tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits)
locking/rtmutex: Return success on deadlock for ww_mutex waiters
locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes
locking/rtmutex: Dequeue waiter on ww_mutex deadlock
locking/rtmutex: Dont dereference waiter lockless
locking/semaphore: Add might_sleep() to down_*() family
locking/ww_mutex: Initialize waiter.ww_ctx properly
static_call: Update API documentation
locking/local_lock: Add PREEMPT_RT support
locking/spinlock/rt: Prepare for RT local_lock
locking/rtmutex: Add adaptive spinwait mechanism
locking/rtmutex: Implement equal priority lock stealing
preempt: Adjust PREEMPT_LOCK_OFFSET for RT
locking/rtmutex: Prevent lockdep false positive with PI futexes
futex: Prevent requeue_pi() lock nesting issue on RT
futex: Simplify handle_early_requeue_pi_wakeup()
futex: Reorder sanity checks in futex_requeue()
futex: Clarify comment in futex_requeue()
futex: Restructure futex_requeue()
futex: Correct the number of requeued waiters for PI
futex: Remove bogus condition for requeue PI
...
RT builds substitutions for rwsem, mutex, spinlock and rwlock around
rtmutexes. Split the inner working out so each lock substitution can use
them with the appropriate lockdep annotations. This avoids having an extra
unused lockdep map in the wrapped rtmutex.
No functional change.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211302.784739994@linutronix.de
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The WARN_ON_ONCE() invocation within the CONFIG_PREEMPT=y version of
rcu_note_context_switch() triggers when there is a voluntary context
switch in an RCU read-side critical section, but there is quite a gap
between the output of that WARN_ON_ONCE() and this RCU-usage error.
This commit therefore converts the WARN_ON_ONCE() to a WARN_ONCE()
that explicitly describes the problem in its message.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are a few remaining locations in kernel/rcu that still use
"&per_cpu()". This commit replaces them with "per_cpu_ptr(&)", and does
not introduce any functional change.
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Within rcu_gp_fqs_loop(), the "ret" local variable is set to the
return value from swait_event_idle_timeout_exclusive(), but "ret" is
unconditionally overwritten later in the code. This commit therefore
removes this useless assignment.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit marks the accesses in tree_stall.h so as to both avoid
undesirable compiler optimizations and to keep KCSAN focused on the
accesses of the core algorithm.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The kbuild test project found an oversized stack frame in rcu_gp_kthread()
for some kernel configurations. This oversizing was due to a very large
amount of inlining, which is unnecessary due to the fact that this code
executes infrequently. This commit therefore marks rcu_gp_init() and
rcu_gp_fqs_loop noinline_for_stack to conserve stack space.
Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Rong Chen <rong.a.chen@intel.com>
[ paulmck: noinline_for_stack per Nathan Chancellor. ]
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Accesses to ->qsmask are normally protected by ->lock, but there is an
exception in the diagnostic code in rcu_check_boost_fail(). This commit
therefore applies data_race() to this access to avoid KCSAN complaining
about the C-language writes protected by ->lock.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit marks some interrupt-induced read-side data races in
__srcu_read_lock(), __srcu_read_unlock(), and srcu_torture_stats_print().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Systems with low-bandwidth consoles can have very large printk()
latencies, and on such systems it makes no sense to have the next RCU CPU
stall warning message start output before the prior message completed.
This commit therefore sets the time of the next stall only after the
prints have completed. While printing, the time of the next stall
message is set to ULONG_MAX/2 jiffies into the future.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcu_cpu_stall_reset() is one of the functions virtual CPUs
execute during VM resume in order to handle jiffies skew
that can trigger false positive stall warnings. Paul has
pointed out that this approach is problematic because
rcu_cpu_stall_reset() disables RCU grace period stall-detection
virtually forever, while in fact it can just restart the
stall-detection timeout.
Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The soft watchdog timer function checks if a virtual machine
was suspended and hence what looks like a lockup in fact
is a false positive.
This is what kvm_check_and_clear_guest_paused() does: it
tests guest PVCLOCK_GUEST_STOPPED (which is set by the host)
and if it's set then we need to touch all watchdogs and bail
out.
Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED
check works fine.
There is, however, one more watchdog that runs from IRQ, so
watchdog timer fn races with it, and that watchdog is not aware
of PVCLOCK_GUEST_STOPPED - RCU stall detector.
apic_timer_interrupt()
smp_apic_timer_interrupt()
hrtimer_interrupt()
__hrtimer_run_queues()
tick_sched_timer()
tick_sched_handle()
update_process_times()
rcu_sched_clock_irq()
This triggers RCU stalls on our devices during VM resume.
If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
before watchdog_timer_fn()->kvm_check_and_clear_guest_paused()
then there is nothing on this VCPU that touches watchdogs and
RCU reads stale gp stall timestamp and new jiffies value, which
makes it think that RCU has stalled.
Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
don't report RCU stalls when we resume the VM.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
KCSAN flags accesses to ->rcu_read_lock_nesting as data races, but
in the past, the overhead of marked accesses was excessive. However,
that was long ago, and much has changed since then, both in terms of
hardware and of compilers. Here is data taken on an eight-core laptop
using Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz with a kernel built
using gcc version 9.3.0, with all data in nanoseconds.
Unmarked accesses (status quo), measured by three refscale runs:
Minimum reader duration: 3.286 2.851 3.395
Median reader duration: 3.698 3.531 3.4695
Maximum reader duration: 4.481 5.215 5.157
Marked accesses, also measured by three refscale runs:
Minimum reader duration: 3.501 3.677 3.580
Median reader duration: 4.053 3.723 3.895
Maximum reader duration: 7.307 4.999 5.511
This focused microbenhmark shows only sub-nanosecond differences which
are unlikely to be visible at the system level. This commit therefore
marks data-racing accesses to ->rcu_read_lock_nesting.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Accesses to the rcu_data structure's ->dynticks field have always been
fully ordered because it was not possible to prove that weaker ordering
was safe. However, with the removal of the rcu_eqs_special_set() function
and the advent of the Linux-kernel memory model, it is now easy to show
that two of the four original full memory barriers can be weakened to
acquire and release operations. The remaining pair must remain full
memory barriers. This change makes the memory ordering requirements
more evident, and it might well also speed up the to-idle and from-idle
fastpaths on some architectures.
The following litmus test, adapted from one supplied off-list by Frederic
Weisbecker, models the RCU grace-period kthread detecting an idle CPU
that is concurrently transitioning to non-idle:
C dynticks-from-idle
{
DYNTICKS=0; (* Initially idle. *)
}
P0(int *X, int *DYNTICKS)
{
int dynticks;
int x;
// Idle.
dynticks = READ_ONCE(*DYNTICKS);
smp_store_release(DYNTICKS, dynticks + 1);
smp_mb();
// Now non-idle
x = READ_ONCE(*X);
}
P1(int *X, int *DYNTICKS)
{
int dynticks;
WRITE_ONCE(*X, 1);
smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
}
exists (1:dynticks=0 /\ 0:x=1)
Running "herd7 -conf linux-kernel.cfg dynticks-from-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread (P1)
sees another CPU as idle (P0), then any memory access prior to the start
of the grace period (P1's write to X) will be seen by any RCU read-side
critical section following the to-non-idle transition (P0's read from X).
This is a straightforward use of full memory barriers to force ordering
in a store-buffering (SB) litmus test.
The following litmus test, also adapted from the one supplied off-list
by Frederic Weisbecker, models the RCU grace-period kthread detecting
a non-idle CPU that is concurrently transitioning to idle:
C dynticks-into-idle
{
DYNTICKS=1; (* Initially non-idle. *)
}
P0(int *X, int *DYNTICKS)
{
int dynticks;
// Non-idle.
WRITE_ONCE(*X, 1);
dynticks = READ_ONCE(*DYNTICKS);
smp_store_release(DYNTICKS, dynticks + 1);
smp_mb();
// Now idle.
}
P1(int *X, int *DYNTICKS)
{
int x;
int dynticks;
smp_mb();
dynticks = smp_load_acquire(DYNTICKS);
x = READ_ONCE(*X);
}
exists (1:dynticks=2 /\ 1:x=0)
Running "herd7 -conf linux-kernel.cfg dynticks-into-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread
(P1) sees another CPU as newly idle (P0), then any pre-idle memory access
(P0's write to X) will be seen by any code following the grace period
(P1's read from X). This is a simple release-acquire pair forcing
ordering in a message-passing (MP) litmus test.
Of course, if the grace-period kthread detects the CPU as non-idle,
it will refrain from reporting a quiescent state on behalf of that CPU,
so there are no ordering requirements from the grace-period kthread in
that case. However, other subsystems call rcu_is_idle_cpu() to check
for CPUs being non-idle from an RCU perspective. That case is also
verified by the above litmus tests with the proviso that the sense of
the low-order bit of the DYNTICKS counter be inverted.
Unfortunately, on x86 smp_mb() is as expensive as a cache-local atomic
increment. This commit therefore weakens only the read from ->dynticks.
However, the updates are abstracted into a rcu_dynticks_inc() function
to ease any future changes that might be needed.
[ paulmck: Apply Linus Torvalds feedback. ]
Link: https://lore.kernel.org/lkml/20210721202127.2129660-4-paulmck@kernel.org/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Commit b8c17e6664 ("rcu: Maintain special bits at bottom of ->dynticks
counter") reserved a bit at the bottom of the ->dynticks counter to defer
flushing of TLBs, but this facility never has been used. This commit
therefore removes this capability along with the rcu_eqs_special_set()
function used to trigger it.
Link: https://lore.kernel.org/linux-doc/CALCETrWNPOOdTrFabTDd=H7+wc6xJ9rJceg6OL1S0rTV5pfSsA@mail.gmail.com/
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
[ paulmck: Forward-port to v5.13-rc1. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If rcu_print_task_stall() is invoked on an rcu_node structure that does
not contain any tasks blocking the current grace period, it takes an
early exit that fails to release that rcu_node structure's lock. This
results in a self-deadlock, which is detected by lockdep.
To reproduce this bug:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
This will also result in other complaints, including RCU's scheduler
hook complaining about blocking rather than preemption and an rcutorture
writer stall.
Only a partial RCU CPU stall warning message will be printed because of
the self-deadlock.
This commit therefore releases the lock on the rcu_print_task_stall()
function's early exit path.
Fixes: c583bcb8f5 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Tested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The for loop in rcu_print_task_stall() always omits ts[0], which points
to the first task blocking the stalled grace period. This in turn fails
to count this first task, which means that ndetected will be equal to
zero when all CPUs have passed through their quiescent states and only
one task is blocking the stalled grace period. This zero value for
ndetected will in turn result in an incorrect "All QSes seen" message:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 12-23):
(detected by 15, t=6504 jiffies, g=164777, q=9011209)
rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2
BUG: sleeping function called from invalid context at include/linux/uaccess.h:156
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04
INFO: lockdep is turned off.
Preemption disabled at:
[<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0
CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted
5.12.2-yoctodev-standard #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x2cc
show_stack+0x24/0x30
dump_stack+0x110/0x188
___might_sleep+0x214/0x2d0
__might_sleep+0x7c/0xe0
This commit therefore fixes the loop to include ts[0].
Fixes: c583bcb8f5 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Tested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcuscale console output claims N grace periods, numbered from zero
to N, which means that there were really N+1 grace periods. The root
cause of this bug is that rcu_scale_writer() stores the number of the
last grace period (numbered from zero) into writer_n_durations[me]
instead of the number of grace periods. This commit therefore assigns
the actual number of grace periods to writer_n_durations[me], and also
makes the corresponding adjustment to the loop outputting per-grace-period
measurements.
Sample of old console output:
rcu-scale: writer 0 gps: 133
......
rcu-scale: 0 writer-duration: 0 44003961
rcu-scale: 0 writer-duration: 1 32003582
......
rcu-scale: 0 writer-duration: 132 28004391
rcu-scale: 0 writer-duration: 133 27996410
Sample of new console output:
rcu-scale: writer 0 gps: 134
......
rcu-scale: 0 writer-duration: 0 44003961
rcu-scale: 0 writer-duration: 1 32003582
......
rcu-scale: 0 writer-duration: 132 28004391
rcu-scale: 0 writer-duration: 133 27996410
Signed-off-by: Jiangong.Han <jiangong.han@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_torture_stall() does a one-jiffy timed wait when
stall_cpu_block is set. This works, but emits a pointless splat in
CONFIG_PREEMPT=y kernels. This commit avoids this splat by instead
invoking preempt_schedule() in CONFIG_PREEMPT=y kernels.
This uses an admittedly ugly #ifdef, but abstracted approaches just
looked worse. A prettier approach would provide a preempt_schedule()
definition with a WARN_ON() for CONFIG_PREEMPT=n kernels, but this seems
quite silly.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a "clock" type to refscale, which checks the performance
of ktime_get_real_fast_ns(). Use the "clocksource=" kernel boot parameter
to select the underlying clock source.
[ paulmck: Work around compiler false positive per kernel test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit fixes several typos where CONFIG_TASKS_RCU_TRACE should
instead be CONFIG_TASKS_TRACE_RCU. Among other things, these typos
could cause CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels to suffer from
memory-ordering bugs that could result in false-positive quiescent
states and too-short grace periods.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit replaces the fictitious synchronize_rcu_rude() function with
its real-world synchronize_rcu_tasks_rude() counterpart.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are several ->trc_reader_special.b.need_qs data races that are
too low-probability for KCSAN to notice, but which will happen sooner
or later. This commit therefore marks these accesses.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are several ->trc_reader_nesting data races that are too
low-probability for KCSAN to notice, but which will happen sooner or
later. This commit therefore marks these accesses, and comments one
that cannot race.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Accesses to task_struct structures must be either protected by RCU
or by get_task_struct(). Tasks trace RCU uses these in a non-obvious
combination, in conjunction with an IPI handler. This commit therefore
adds comments explaining this usage.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
At CPU offline time, we must handle any pending wakeup for the nocb_gp
kthread linked to the outgoing CPU.
Now we are making sure of that twice:
1) From rcu_report_dead() when the outgoing CPU makes the very last
local cleanups by itself before switching offline.
2) From rcutree_dead_cpu(). Here the offlining CPU has gone and is truly
now offline. Another CPU takes care of post-portem cleaning up and
check if the offline CPU had pending wakeup.
Both ways are fine but we have to choose one or the other because we
don't need to repeat that action. Simply benefit from cache locality
and keep only the first solution.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The kernel/rcu/tree_plugin.h file contains not only the plugins for
preemptible RCU, but also many other features including rcu_nocbs
callback offloading. This offloading has become large and complex,
so it is time to put it in its own file.
This commit starts that process.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Rename to tree_nocb.h, add Frederic as author. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit changes from "%lx" to "%x" and from "0x1ffffL" to "0x1ffff"
to match the change in type between the old field ->state (unsigned long)
and the new field ->__state (unsigned int).
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Invoking trc_del_holdout() from within trc_wait_for_one_reader() is
only a performance optimization because the RCU Tasks Trace grace-period
kthread will eventually do this within check_all_holdout_tasks_trace().
But it is not a particularly important performance optimization because
it only applies to the grace-period kthread, of which there is but one.
This commit therefore removes this invocation of trc_del_holdout() in
favor of the one in check_all_holdout_tasks_trace() in the grace-period
kthread.
Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
As Yanfei pointed out, although invoking trc_del_holdout() is safe
from the viewpoint of the integrity of the holdout list itself,
the put_task_struct() invoked by trc_del_holdout() can result in
use-after-free errors due to later accesses to this task_struct structure
by the RCU Tasks Trace grace-period kthread.
This commit therefore removes this call to trc_del_holdout() from
trc_inspect_reader() in favor of the grace-period thread's existing call
to trc_del_holdout(), thus eliminating that particular class of
use-after-free errors.
Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the call to set_cpus_allowed_ptr() in ref_scale_reader()
fails, a later WARN_ONCE() complains. But with the advent of
570a752b7a ("lib/smp_processor_id: Use is_percpu_thread() instead of
nr_cpus_allowed"), this complaint can be drowned out by complaints from
smp_processor_id(). The rationale for this change is that refscale's
kthreads are not marked with PF_NO_SETAFFINITY, which means that a system
administrator could change affinity at any time.
However, refscale is a performance/stress test, and the system
administrator might well have a valid test-the-test reason for changing
affinity. This commit therefore changes to raw_smp_processor_id()
in order to avoid the noise, and also adds a WARN_ON_ONCE() to the
call to set_cpus_allowed_ptr() in order to directly detect immediate
failure. There is no WARN_ON_ONCE() within the test loop, allowing
human-reflex-based affinity resetting, if desired.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Pull RCU updates from Paul McKenney:
- Bitmap parsing support for "all" as an alias for all bits
- Documentation updates
- Miscellaneous fixes, including some that overlap into mm and lockdep
- kvfree_rcu() updates
- mem_dump_obj() updates, with acks from one of the slab-allocator
maintainers
- RCU NOCB CPU updates, including limited deoffloading
- SRCU updates
- Tasks-RCU updates
- Torture-test updates
* 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
rcu: Add missing __releases() annotation
rcu: Remove obsolete rcu_read_unlock() deadlock commentary
rcu: Improve comments describing RCU read-side critical sections
rcu: Create an unrcu_pointer() to remove __rcu from a pointer
srcu: Early test SRCU polling start
rcu: Fix various typos in comments
rcu/nocb: Unify timers
rcu/nocb: Prepare for fine-grained deferred wakeup
rcu/nocb: Only cancel nocb timer if not polling
rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
rcu/nocb: Allow de-offloading rdp leader
rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
rcu: Don't penalize priority boosting when there is nothing to boost
rcu: Point to documentation of ordering guarantees
rcu: Make rcu_gp_cleanup() be noinline for tracing
rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
...
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
In some architectures, the no-op variant of show_rcu_tasks_gp_kthreads()
get "no previous prototype" compiler warnings. These are false positives
given that kernel/rcu/tasks.h is included only once. But why put up
with the compiler noise?
This commit therefore adds "static inline" to this definition to force
the compiler to accept this situation, while also moving it to its proper
place in kernel/rcu/rcu.h.
Reported-by: kernel test robot <lkp@intel.com>
[ paulmck: Update per Stephen Rothwell feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Heavy networking load can cause a CPU to execute continuously and
indefinitely within ksoftirqd, in which case there will be no voluntary
task switches and thus no RCU-tasks quiescent states. This commit
therefore causes the exiting rcu_softirq_qs() to provide an RCU-tasks
quiescent state.
This of course means that __do_softirq() and its callers cannot be
invoked from within a tracing trampoline.
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Sparse reports a warning at rcu_print_task_stall():
"warning: context imbalance in rcu_print_task_stall - unexpected unlock"
The root cause is a missing annotation on rcu_print_task_stall().
This commit therefore adds the missing __releases(rnp->lock) annotation.
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are a number of places that call out the fact that preempt-disable
regions of code now act as RCU read-side critical sections, where
preempt-disable regions of code include irq-disable regions of code,
bh-disable regions of code, hardirq handlers, and NMI handlers. However,
someone relying solely on (for example) the call_rcu() header comment
might well have no idea that preempt-disable regions of code have RCU
semantics.
This commit therefore updates the header comments for
call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
rcu_dereference_sched_check() to call out these new(ish) forms of RCU
readers.
Reported-by: Michel Lespinasse <michel@lespinasse.org>
[ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Place an early call to start_poll_synchronize_srcu() before the invocation
of call_srcu() on the same srcu_struct structure.
After the later call to srcu_barrier(), the completion of the
first grace period should be visible to a subsequent invocation of
poll_state_synchronize_srcu(), and if not, warn.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>