2019-01-18 02:23:39 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0+ */
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
/*
|
|
|
|
* Read-Copy Update mechanism for mutual exclusion (tree-based version)
|
|
|
|
* Internal non-public definitions that provide either classic
|
2011-03-03 05:15:15 +08:00
|
|
|
* or preemptible semantics.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*
|
|
|
|
* Copyright Red Hat, 2009
|
|
|
|
* Copyright IBM Corporation, 2009
|
|
|
|
*
|
|
|
|
* Author: Ingo Molnar <mingo@elte.hu>
|
2019-01-18 02:23:39 +08:00
|
|
|
* Paul E. McKenney <paulmck@linux.ibm.com>
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*/
|
|
|
|
|
2014-06-13 04:30:25 +08:00
|
|
|
#include "../locking/rtmutex_common.h"
|
2011-08-20 02:39:11 +08:00
|
|
|
|
2020-11-12 08:51:21 +08:00
|
|
|
static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* In order to read the offloaded state of an rdp is a safe
|
|
|
|
* and stable way and prevent from its value to be changed
|
|
|
|
* under us, we must either hold the barrier mutex, the cpu
|
|
|
|
* hotplug lock (read or write) or the nocb lock. Local
|
|
|
|
* non-preemptible reads are also safe. NOCB kthreads and
|
|
|
|
* timers have their own means of synchronization against the
|
|
|
|
* offloaded state updaters.
|
|
|
|
*/
|
|
|
|
RCU_LOCKDEP_WARN(
|
|
|
|
!(lockdep_is_held(&rcu_state.barrier_mutex) ||
|
|
|
|
(IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
|
|
|
|
rcu_lockdep_is_held_nocb(rdp) ||
|
|
|
|
(rdp == this_cpu_ptr(&rcu_data) &&
|
|
|
|
!(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
|
2021-02-23 08:10:03 +08:00
|
|
|
rcu_current_is_nocb_kthread(rdp)),
|
2020-11-12 08:51:21 +08:00
|
|
|
"Unsafe read of RCU_NOCB offloaded state"
|
|
|
|
);
|
|
|
|
|
|
|
|
return rcu_segcblist_is_offloaded(&rdp->cblist);
|
|
|
|
}
|
|
|
|
|
2010-04-14 05:19:23 +08:00
|
|
|
/*
|
|
|
|
* Check the RCU kernel configuration parameters and print informative
|
2015-09-29 23:47:49 +08:00
|
|
|
* messages about anything out of the ordinary.
|
2010-04-14 05:19:23 +08:00
|
|
|
*/
|
|
|
|
static void __init rcu_bootup_announce_oddness(void)
|
|
|
|
{
|
2015-01-22 08:58:06 +08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_TRACE))
|
2017-05-16 06:30:32 +08:00
|
|
|
pr_info("\tRCU event tracing is enabled.\n");
|
2015-04-21 05:27:43 +08:00
|
|
|
if ((IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 64) ||
|
|
|
|
(!IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 32))
|
2018-05-15 04:27:33 +08:00
|
|
|
pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d.\n",
|
|
|
|
RCU_FANOUT);
|
2015-04-21 01:27:15 +08:00
|
|
|
if (rcu_fanout_exact)
|
2015-01-22 08:58:06 +08:00
|
|
|
pr_info("\tHierarchical RCU autobalancing is disabled.\n");
|
|
|
|
if (IS_ENABLED(CONFIG_RCU_FAST_NO_HZ))
|
|
|
|
pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n");
|
2017-05-13 05:37:19 +08:00
|
|
|
if (IS_ENABLED(CONFIG_PROVE_RCU))
|
2015-01-22 08:58:06 +08:00
|
|
|
pr_info("\tRCU lockdep checking is enabled.\n");
|
2020-08-06 06:51:20 +08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
|
|
|
|
pr_info("\tRCU strict (and thus non-scalable) grace periods enabled.\n");
|
2015-06-03 14:18:31 +08:00
|
|
|
if (RCU_NUM_LVLS >= 4)
|
|
|
|
pr_info("\tFour(or more)-level hierarchy is enabled.\n");
|
2015-04-22 00:12:13 +08:00
|
|
|
if (RCU_FANOUT_LEAF != 16)
|
2015-01-22 12:58:57 +08:00
|
|
|
pr_info("\tBuild-time adjustment of leaf fanout to %d.\n",
|
2015-04-22 00:12:13 +08:00
|
|
|
RCU_FANOUT_LEAF);
|
|
|
|
if (rcu_fanout_leaf != RCU_FANOUT_LEAF)
|
2018-05-15 04:27:33 +08:00
|
|
|
pr_info("\tBoot-time adjustment of leaf fanout to %d.\n",
|
|
|
|
rcu_fanout_leaf);
|
2012-05-09 12:00:28 +08:00
|
|
|
if (nr_cpu_ids != NR_CPUS)
|
2017-09-09 07:14:18 +08:00
|
|
|
pr_info("\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%u.\n", NR_CPUS, nr_cpu_ids);
|
2017-04-29 02:12:34 +08:00
|
|
|
#ifdef CONFIG_RCU_BOOST
|
2018-05-15 04:27:33 +08:00
|
|
|
pr_info("\tRCU priority boosting: priority %d delay %d ms.\n",
|
|
|
|
kthread_prio, CONFIG_RCU_BOOST_DELAY);
|
2017-04-29 02:12:34 +08:00
|
|
|
#endif
|
|
|
|
if (blimit != DEFAULT_RCU_BLIMIT)
|
|
|
|
pr_info("\tBoot-time adjustment of callback invocation limit to %ld.\n", blimit);
|
|
|
|
if (qhimark != DEFAULT_RCU_QHIMARK)
|
|
|
|
pr_info("\tBoot-time adjustment of callback high-water mark to %ld.\n", qhimark);
|
|
|
|
if (qlowmark != DEFAULT_RCU_QLOMARK)
|
|
|
|
pr_info("\tBoot-time adjustment of callback low-water mark to %ld.\n", qlowmark);
|
2019-10-31 02:56:10 +08:00
|
|
|
if (qovld != DEFAULT_RCU_QOVLD)
|
2019-12-13 01:36:43 +08:00
|
|
|
pr_info("\tBoot-time adjustment of callback overload level to %ld.\n", qovld);
|
2017-04-29 02:12:34 +08:00
|
|
|
if (jiffies_till_first_fqs != ULONG_MAX)
|
|
|
|
pr_info("\tBoot-time adjustment of first FQS scan delay to %ld jiffies.\n", jiffies_till_first_fqs);
|
|
|
|
if (jiffies_till_next_fqs != ULONG_MAX)
|
|
|
|
pr_info("\tBoot-time adjustment of subsequent FQS scan delay to %ld jiffies.\n", jiffies_till_next_fqs);
|
rcu: Compute jiffies_till_sched_qs from other kernel parameters
The jiffies_till_sched_qs value used to determine how old a grace period
must be before RCU enlists the help of the scheduler to force a quiescent
state on the holdout CPU. Currently, this defaults to HZ/10 regardless of
system size and may be set only at boot time. This can be a problem for
very large systems, because if the values of the jiffies_till_first_fqs
and jiffies_till_next_fqs kernel parameters are left at their defaults,
they are calculated to increase as the number of CPUs actually configured
on the system increases. Thus, on a sufficiently large system, RCU would
enlist the help of the scheduler before the grace-period kthread had a
chance to scan for idle CPUs, which wastes CPU time.
This commit therefore allows jiffies_till_sched_qs to be set, if desired,
but if left as default, computes is as jiffies_till_first_fqs plus twice
jiffies_till_next_fqs, thus allowing three force-quiescent-state scans
for idle CPUs. This scales with the number of CPUs, providing sensible
default values.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-07-26 02:25:23 +08:00
|
|
|
if (jiffies_till_sched_qs != ULONG_MAX)
|
|
|
|
pr_info("\tBoot-time adjustment of scheduler-enlistment delay to %ld jiffies.\n", jiffies_till_sched_qs);
|
2017-04-29 02:12:34 +08:00
|
|
|
if (rcu_kick_kthreads)
|
|
|
|
pr_info("\tKick kthreads if too-long grace period.\n");
|
|
|
|
if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
|
|
|
|
pr_info("\tRCU callback double-/use-after-free debug enabled.\n");
|
rcu: Remove *_SLOW_* Kconfig options
The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
RCU_TORTURE_TEST_SLOW_PREINIT_DELAY, RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead. The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.
This commit therefore simplifies Tree RCU a bit by removing the Kconfig
options and adding the corresponding kernel parameters to rcutorture's
.boot files instead. However, this commit also leaves out the kernel
parameters for TREE02, TREE04, and TREE07 in order to have about the
same number of tests slowed as not slowed. TREE01, TREE03, TREE05,
and TREE06 are slowed, and the rest are not slowed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-05-11 05:36:55 +08:00
|
|
|
if (gp_preinit_delay)
|
2017-04-29 02:12:34 +08:00
|
|
|
pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
|
rcu: Remove *_SLOW_* Kconfig options
The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
RCU_TORTURE_TEST_SLOW_PREINIT_DELAY, RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead. The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.
This commit therefore simplifies Tree RCU a bit by removing the Kconfig
options and adding the corresponding kernel parameters to rcutorture's
.boot files instead. However, this commit also leaves out the kernel
parameters for TREE02, TREE04, and TREE07 in order to have about the
same number of tests slowed as not slowed. TREE01, TREE03, TREE05,
and TREE06 are slowed, and the rest are not slowed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-05-11 05:36:55 +08:00
|
|
|
if (gp_init_delay)
|
2017-04-29 02:12:34 +08:00
|
|
|
pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
|
rcu: Remove *_SLOW_* Kconfig options
The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
RCU_TORTURE_TEST_SLOW_PREINIT_DELAY, RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead. The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.
This commit therefore simplifies Tree RCU a bit by removing the Kconfig
options and adding the corresponding kernel parameters to rcutorture's
.boot files instead. However, this commit also leaves out the kernel
parameters for TREE02, TREE04, and TREE07 in order to have about the
same number of tests slowed as not slowed. TREE01, TREE03, TREE05,
and TREE06 are slowed, and the rest are not slowed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-05-11 05:36:55 +08:00
|
|
|
if (gp_cleanup_delay)
|
2017-04-29 02:12:34 +08:00
|
|
|
pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_cleanup_delay);
|
2019-03-21 05:13:33 +08:00
|
|
|
if (!use_softirq)
|
|
|
|
pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
|
2017-04-29 02:12:34 +08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
|
|
|
|
pr_info("\tRCU debug extended QS entry/exit.\n");
|
2017-04-29 01:20:28 +08:00
|
|
|
rcupdate_announce_bootup_oddness();
|
2010-04-14 05:19:23 +08:00
|
|
|
}
|
|
|
|
|
2014-09-23 02:00:48 +08:00
|
|
|
#ifdef CONFIG_PREEMPT_RCU
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
2018-07-04 08:22:34 +08:00
|
|
|
static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
|
2018-05-09 06:29:10 +08:00
|
|
|
static void rcu_read_unlock_special(struct task_struct *t);
|
2009-12-03 04:10:15 +08:00
|
|
|
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
/*
|
|
|
|
* Tell them what RCU they are running.
|
|
|
|
*/
|
2009-11-12 03:28:06 +08:00
|
|
|
static void __init rcu_bootup_announce(void)
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
{
|
2013-03-19 07:24:11 +08:00
|
|
|
pr_info("Preemptible hierarchical RCU implementation.\n");
|
2010-04-14 05:19:23 +08:00
|
|
|
rcu_bootup_announce_oddness();
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
}
|
|
|
|
|
2015-08-03 04:53:17 +08:00
|
|
|
/* Flags for rcu_preempt_ctxt_queue() decision table. */
|
|
|
|
#define RCU_GP_TASKS 0x8
|
|
|
|
#define RCU_EXP_TASKS 0x4
|
|
|
|
#define RCU_GP_BLKD 0x2
|
|
|
|
#define RCU_EXP_BLKD 0x1
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Queues a task preempted within an RCU-preempt read-side critical
|
|
|
|
* section into the appropriate location within the ->blkd_tasks list,
|
|
|
|
* depending on the states of any ongoing normal and expedited grace
|
|
|
|
* periods. The ->gp_tasks pointer indicates which element the normal
|
|
|
|
* grace period is waiting on (NULL if none), and the ->exp_tasks pointer
|
|
|
|
* indicates which element the expedited grace period is waiting on (again,
|
|
|
|
* NULL if none). If a grace period is waiting on a given element in the
|
|
|
|
* ->blkd_tasks list, it also waits on all subsequent elements. Thus,
|
|
|
|
* adding a task to the tail of the list blocks any grace period that is
|
|
|
|
* already waiting on one of the elements. In contrast, adding a task
|
|
|
|
* to the head of the list won't block any grace period that is already
|
|
|
|
* waiting on one of the elements.
|
|
|
|
*
|
|
|
|
* This queuing is imprecise, and can sometimes make an ongoing grace
|
|
|
|
* period wait for a task that is not strictly speaking blocking it.
|
|
|
|
* Given the choice, we needlessly block a normal grace period rather than
|
|
|
|
* blocking an expedited grace period.
|
|
|
|
*
|
|
|
|
* Note that an endless sequence of expedited grace periods still cannot
|
|
|
|
* indefinitely postpone a normal grace period. Eventually, all of the
|
|
|
|
* fixed number of preempted tasks blocking the normal grace period that are
|
|
|
|
* not also blocking the expedited grace period will resume and complete
|
|
|
|
* their RCU read-side critical sections. At that point, the ->gp_tasks
|
|
|
|
* pointer will equal the ->exp_tasks pointer, at which point the end of
|
|
|
|
* the corresponding expedited grace period will also be the end of the
|
|
|
|
* normal grace period.
|
|
|
|
*/
|
2015-10-08 00:10:48 +08:00
|
|
|
static void rcu_preempt_ctxt_queue(struct rcu_node *rnp, struct rcu_data *rdp)
|
|
|
|
__releases(rnp->lock) /* But leaves rrupts disabled. */
|
2015-08-03 04:53:17 +08:00
|
|
|
{
|
|
|
|
int blkd_state = (rnp->gp_tasks ? RCU_GP_TASKS : 0) +
|
|
|
|
(rnp->exp_tasks ? RCU_EXP_TASKS : 0) +
|
|
|
|
(rnp->qsmask & rdp->grpmask ? RCU_GP_BLKD : 0) +
|
|
|
|
(rnp->expmask & rdp->grpmask ? RCU_EXP_BLKD : 0);
|
|
|
|
struct task_struct *t = current;
|
|
|
|
|
2018-01-17 22:24:30 +08:00
|
|
|
raw_lockdep_assert_held_rcu_node(rnp);
|
2017-07-12 12:52:31 +08:00
|
|
|
WARN_ON_ONCE(rdp->mynode != rnp);
|
2018-04-14 08:11:44 +08:00
|
|
|
WARN_ON_ONCE(!rcu_is_leaf_node(rnp));
|
2018-05-04 01:35:33 +08:00
|
|
|
/* RCU better not be waiting on newly onlined CPUs! */
|
|
|
|
WARN_ON_ONCE(rnp->qsmaskinitnext & ~rnp->qsmaskinit & rnp->qsmask &
|
|
|
|
rdp->grpmask);
|
2017-04-29 04:19:28 +08:00
|
|
|
|
2015-08-03 04:53:17 +08:00
|
|
|
/*
|
|
|
|
* Decide where to queue the newly blocked task. In theory,
|
|
|
|
* this could be an if-statement. In practice, when I tried
|
|
|
|
* that, it was quite messy.
|
|
|
|
*/
|
|
|
|
switch (blkd_state) {
|
|
|
|
case 0:
|
|
|
|
case RCU_EXP_TASKS:
|
|
|
|
case RCU_EXP_TASKS + RCU_GP_BLKD:
|
|
|
|
case RCU_GP_TASKS:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_TASKS:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Blocking neither GP, or first task blocking the normal
|
|
|
|
* GP but not blocking the already-waiting expedited GP.
|
|
|
|
* Queue at the head of the list to avoid unnecessarily
|
|
|
|
* blocking the already-waiting GPs.
|
|
|
|
*/
|
|
|
|
list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_BLKD:
|
|
|
|
case RCU_GP_BLKD + RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First task arriving that blocks either GP, or first task
|
|
|
|
* arriving that blocks the expedited GP (with the normal
|
|
|
|
* GP already waiting), or a task arriving that blocks
|
|
|
|
* both GPs with both GPs already waiting. Queue at the
|
|
|
|
* tail of the list to avoid any GP waiting on any of the
|
|
|
|
* already queued tasks that are not blocking it.
|
|
|
|
*/
|
|
|
|
list_add_tail(&t->rcu_node_entry, &rnp->blkd_tasks);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case RCU_EXP_TASKS + RCU_EXP_BLKD:
|
|
|
|
case RCU_EXP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_EXP_BLKD:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Second or subsequent task blocking the expedited GP.
|
|
|
|
* The task either does not block the normal GP, or is the
|
|
|
|
* first task blocking the normal GP. Queue just after
|
|
|
|
* the first task blocking the expedited GP.
|
|
|
|
*/
|
|
|
|
list_add(&t->rcu_node_entry, rnp->exp_tasks);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case RCU_GP_TASKS + RCU_GP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_GP_BLKD:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Second or subsequent task blocking the normal GP.
|
|
|
|
* The task does not block the expedited GP. Queue just
|
|
|
|
* after the first task blocking the normal GP.
|
|
|
|
*/
|
|
|
|
list_add(&t->rcu_node_entry, rnp->gp_tasks);
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
|
|
|
|
/* Yet another exercise in excessive paranoia. */
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We have now queued the task. If it was the first one to
|
|
|
|
* block either grace period, update the ->gp_tasks and/or
|
|
|
|
* ->exp_tasks pointers, respectively, to reference the newly
|
|
|
|
* blocked tasks.
|
|
|
|
*/
|
2017-11-28 07:13:56 +08:00
|
|
|
if (!rnp->gp_tasks && (blkd_state & RCU_GP_BLKD)) {
|
2019-10-10 05:21:54 +08:00
|
|
|
WRITE_ONCE(rnp->gp_tasks, &t->rcu_node_entry);
|
2018-04-29 09:50:06 +08:00
|
|
|
WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq);
|
2017-11-28 07:13:56 +08:00
|
|
|
}
|
2015-08-03 04:53:17 +08:00
|
|
|
if (!rnp->exp_tasks && (blkd_state & RCU_EXP_BLKD))
|
2020-01-04 06:18:12 +08:00
|
|
|
WRITE_ONCE(rnp->exp_tasks, &t->rcu_node_entry);
|
2017-07-12 12:52:31 +08:00
|
|
|
WARN_ON_ONCE(!(blkd_state & RCU_GP_BLKD) !=
|
|
|
|
!(rnp->qsmask & rdp->grpmask));
|
|
|
|
WARN_ON_ONCE(!(blkd_state & RCU_EXP_BLKD) !=
|
|
|
|
!(rnp->expmask & rdp->grpmask));
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_rcu_node(rnp); /* interrupts remain disabled. */
|
2015-08-03 04:53:17 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Report the quiescent state for the expedited GP. This expedited
|
|
|
|
* GP should not be able to end until we report, so there should be
|
|
|
|
* no need to check for a subsequent expedited GP. (Though we are
|
|
|
|
* still in a quiescent state in any case.)
|
|
|
|
*/
|
2019-03-28 06:51:25 +08:00
|
|
|
if (blkd_state & RCU_EXP_BLKD && rdp->exp_deferred_qs)
|
2018-07-04 08:22:34 +08:00
|
|
|
rcu_report_exp_rdp(rdp);
|
2018-06-28 22:39:59 +08:00
|
|
|
else
|
2019-03-28 06:51:25 +08:00
|
|
|
WARN_ON_ONCE(rdp->exp_deferred_qs);
|
2015-08-03 04:53:17 +08:00
|
|
|
}
|
|
|
|
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
/*
|
2018-05-17 09:00:17 +08:00
|
|
|
* Record a preemptible-RCU quiescent state for the specified CPU.
|
|
|
|
* Note that this does not necessarily mean that the task currently running
|
|
|
|
* on the CPU is in a quiescent state: Instead, it means that the current
|
|
|
|
* grace period need not wait on any RCU read-side critical section that
|
|
|
|
* starts later on this CPU. It also means that if the current task is
|
|
|
|
* in an RCU read-side critical section, it has already added itself to
|
|
|
|
* some leaf rcu_node structure's ->blkd_tasks list. In addition to the
|
|
|
|
* current task, there might be any number of other tasks blocked while
|
|
|
|
* in an RCU read-side critical section.
|
rcu: refactor RCU's context-switch handling
The addition of preemptible RCU to treercu resulted in a bit of
confusion and inefficiency surrounding the handling of context switches
for RCU-sched and for RCU-preempt. For RCU-sched, a context switch
is a quiescent state, pure and simple, just like it always has been.
For RCU-preempt, a context switch is in no way a quiescent state, but
special handling is required when a task blocks in an RCU read-side
critical section.
However, the callout from the scheduler and the outer loop in ksoftirqd
still calls something named rcu_sched_qs(), whose name is no longer
accurate. Furthermore, when rcu_check_callbacks() notes an RCU-sched
quiescent state, it ends up unnecessarily (though harmlessly, aside
from the performance hit) enqueuing the current task if it happens to
be running in an RCU-preempt read-side critical section. This not only
increases the maximum latency of scheduler_tick(), it also needlessly
increases the overhead of the next outermost rcu_read_unlock() invocation.
This patch addresses this situation by separating the notion of RCU's
context-switch handling from that of RCU-sched's quiescent states.
The context-switch handling is covered by rcu_note_context_switch() in
general and by rcu_preempt_note_context_switch() for preemptible RCU.
This permits rcu_sched_qs() to handle quiescent states and only quiescent
states. It also reduces the maximum latency of scheduler_tick(), though
probably by much less than a microsecond. Finally, it means that tasks
within preemptible-RCU read-side critical sections avoid incurring the
overhead of queuing unless there really is a context switch.
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
2010-04-02 08:37:01 +08:00
|
|
|
*
|
2018-05-17 09:00:17 +08:00
|
|
|
* Callers to this function must disable preemption.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*/
|
2018-07-03 05:30:37 +08:00
|
|
|
static void rcu_qs(void)
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
{
|
2018-07-03 05:30:37 +08:00
|
|
|
RCU_LOCKDEP_WARN(preemptible(), "rcu_qs() invoked with preemption enabled!!!\n");
|
2018-07-04 06:54:39 +08:00
|
|
|
if (__this_cpu_read(rcu_data.cpu_no_qs.s)) {
|
2014-08-15 07:38:46 +08:00
|
|
|
trace_rcu_grace_period(TPS("rcu_preempt"),
|
2018-07-04 06:54:39 +08:00
|
|
|
__this_cpu_read(rcu_data.gp_seq),
|
2014-08-15 07:38:46 +08:00
|
|
|
TPS("cpuqs"));
|
2018-07-04 06:54:39 +08:00
|
|
|
__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
|
2018-11-22 03:35:03 +08:00
|
|
|
barrier(); /* Coordinate with rcu_flavor_sched_clock_irq(). */
|
2019-03-27 01:22:22 +08:00
|
|
|
WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
|
2014-08-15 07:38:46 +08:00
|
|
|
}
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2009-09-14 00:15:10 +08:00
|
|
|
* We have entered the scheduler, and the current task might soon be
|
|
|
|
* context-switched away from. If this task is in an RCU read-side
|
|
|
|
* critical section, we will no longer be able to rely on the CPU to
|
2010-11-30 13:56:39 +08:00
|
|
|
* record that fact, so we enqueue the task on the blkd_tasks list.
|
|
|
|
* The task will dequeue itself when it exits the outermost enclosing
|
|
|
|
* RCU read-side critical section. Therefore, the current grace period
|
|
|
|
* cannot be permitted to complete until the blkd_tasks list entries
|
|
|
|
* predating the current grace period drain, in other words, until
|
|
|
|
* rnp->gp_tasks becomes NULL.
|
2009-09-14 00:15:10 +08:00
|
|
|
*
|
2015-10-08 00:10:48 +08:00
|
|
|
* Caller must disable interrupts.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*/
|
2018-07-03 05:30:37 +08:00
|
|
|
void rcu_note_context_switch(bool preempt)
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
{
|
|
|
|
struct task_struct *t = current;
|
2018-07-04 06:37:16 +08:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
struct rcu_node *rnp;
|
|
|
|
|
2018-07-03 05:30:37 +08:00
|
|
|
trace_rcu_utilization(TPS("Start context switch"));
|
2017-11-06 23:01:30 +08:00
|
|
|
lockdep_assert_irqs_disabled();
|
2021-07-20 02:52:12 +08:00
|
|
|
WARN_ONCE(!preempt && rcu_preempt_depth() > 0, "Voluntary context switch within RCU read-side critical section!");
|
2019-11-16 06:08:53 +08:00
|
|
|
if (rcu_preempt_depth() > 0 &&
|
2014-08-15 07:01:53 +08:00
|
|
|
!t->rcu_read_unlock_special.b.blocked) {
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
|
|
|
/* Possibly blocking in an RCU read-side critical section. */
|
|
|
|
rnp = rdp->mynode;
|
2015-10-08 00:10:48 +08:00
|
|
|
raw_spin_lock_rcu_node(rnp);
|
2014-08-15 07:01:53 +08:00
|
|
|
t->rcu_read_unlock_special.b.blocked = true;
|
2009-08-28 06:00:12 +08:00
|
|
|
t->rcu_blocked_node = rnp;
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
|
|
|
/*
|
2015-08-03 04:53:17 +08:00
|
|
|
* Verify the CPU's sanity, trace the preemption, and
|
|
|
|
* then queue the task as required based on the states
|
|
|
|
* of any ongoing and expedited grace periods.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*/
|
rcu: Process offlining and onlining only at grace-period start
Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events. Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.
This commit avoids these problems by buffering online and offline events
in a new ->qsmaskinitnext field in the leaf rcu_node structures. When a
grace period starts, the events accumulated in this mask are applied to
the ->qsmaskinit field, and, if needed, up the rcu_node tree. The special
case of all CPUs corresponding to a given leaf rcu_node structure being
offline while there are still elements in that structure's ->blkd_tasks
list is handled using a new ->wait_blkd_tasks field. In this case,
propagating the offline bits up the tree is deferred until the beginning
of the grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared. If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.
This of course means that RCU's notion of which CPUs are offline can be
out of date. This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started. In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-24 13:52:37 +08:00
|
|
|
WARN_ON_ONCE((rdp->grpmask & rcu_rnp_online_cpus(rnp)) == 0);
|
2009-09-19 00:50:18 +08:00
|
|
|
WARN_ON_ONCE(!list_empty(&t->rcu_node_entry));
|
2018-07-05 05:45:00 +08:00
|
|
|
trace_rcu_preempt_task(rcu_state.name,
|
rcu: Add grace-period, quiescent-state, and call_rcu trace events
Add trace events to record grace-period start and end, quiescent states,
CPUs noticing grace-period start and end, grace-period initialization,
call_rcu() invocation, tasks blocking in RCU read-side critical sections,
tasks exiting those same critical sections, force_quiescent_state()
detection of dyntick-idle and offline CPUs, CPUs entering and leaving
dyntick-idle mode (except from NMIs), CPUs coming online and going
offline, and CPUs being kicked for staying in dyntick-idle mode for too
long (as in many weeks, even on 32-bit systems).
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
rcu: Add the rcu flavor to callback trace events
The earlier trace events for registering RCU callbacks and for invoking
them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched).
This commit adds the RCU flavor to those trace events.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-06-25 21:36:56 +08:00
|
|
|
t->pid,
|
|
|
|
(rnp->qsmask & rdp->grpmask)
|
2018-05-02 04:35:20 +08:00
|
|
|
? rnp->gp_seq
|
|
|
|
: rcu_seq_snap(&rnp->gp_seq));
|
2015-10-08 00:10:48 +08:00
|
|
|
rcu_preempt_ctxt_queue(rnp, rdp);
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
} else {
|
|
|
|
rcu_preempt_deferred_qs(t);
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Either we were not in an RCU read-side critical section to
|
|
|
|
* begin with, or we have now recorded that critical section
|
|
|
|
* globally. Either way, we can now note a quiescent state
|
|
|
|
* for this CPU. Again, if we were in an RCU read-side critical
|
|
|
|
* section, and if that critical section was blocking the current
|
|
|
|
* grace period, then the fact that the task has been enqueued
|
|
|
|
* means that we continue to block the current grace period.
|
|
|
|
*/
|
2018-07-03 05:30:37 +08:00
|
|
|
rcu_qs();
|
2019-03-28 06:51:25 +08:00
|
|
|
if (rdp->exp_deferred_qs)
|
2018-07-04 08:22:34 +08:00
|
|
|
rcu_report_exp_rdp(rdp);
|
2020-03-17 11:38:29 +08:00
|
|
|
rcu_tasks_qs(current, preempt);
|
2018-07-03 05:30:37 +08:00
|
|
|
trace_rcu_utilization(TPS("End context switch"));
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
}
|
2018-07-03 05:30:37 +08:00
|
|
|
EXPORT_SYMBOL_GPL(rcu_note_context_switch);
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
2009-09-24 00:50:41 +08:00
|
|
|
/*
|
|
|
|
* Check for preempted RCU readers blocking the current grace period
|
|
|
|
* for the specified rcu_node structure. If the caller needs a reliable
|
|
|
|
* answer, it must hold the rcu_node's ->lock.
|
|
|
|
*/
|
2011-02-08 04:47:15 +08:00
|
|
|
static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
|
2009-09-24 00:50:41 +08:00
|
|
|
{
|
2019-10-10 05:21:54 +08:00
|
|
|
return READ_ONCE(rnp->gp_tasks) != NULL;
|
2009-09-24 00:50:41 +08:00
|
|
|
}
|
|
|
|
|
2020-02-16 07:23:26 +08:00
|
|
|
/* limit value for ->rcu_read_lock_nesting. */
|
2018-10-29 22:36:50 +08:00
|
|
|
#define RCU_NEST_PMAX (INT_MAX / 2)
|
|
|
|
|
2019-11-16 06:08:53 +08:00
|
|
|
static void rcu_preempt_read_enter(void)
|
|
|
|
{
|
2021-05-21 04:35:50 +08:00
|
|
|
WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1);
|
2019-11-16 06:08:53 +08:00
|
|
|
}
|
|
|
|
|
2020-02-16 07:23:26 +08:00
|
|
|
static int rcu_preempt_read_exit(void)
|
2019-11-16 06:08:53 +08:00
|
|
|
{
|
2021-05-21 04:35:50 +08:00
|
|
|
int ret = READ_ONCE(current->rcu_read_lock_nesting) - 1;
|
|
|
|
|
|
|
|
WRITE_ONCE(current->rcu_read_lock_nesting, ret);
|
|
|
|
return ret;
|
2019-11-16 06:08:53 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void rcu_preempt_depth_set(int val)
|
|
|
|
{
|
2021-05-21 04:35:50 +08:00
|
|
|
WRITE_ONCE(current->rcu_read_lock_nesting, val);
|
2019-11-16 06:08:53 +08:00
|
|
|
}
|
|
|
|
|
2018-03-19 23:05:04 +08:00
|
|
|
/*
|
|
|
|
* Preemptible RCU implementation for rcu_read_lock().
|
|
|
|
* Just increment ->rcu_read_lock_nesting, shared state will be updated
|
|
|
|
* if we block.
|
|
|
|
*/
|
|
|
|
void __rcu_read_lock(void)
|
|
|
|
{
|
2019-11-16 06:08:53 +08:00
|
|
|
rcu_preempt_read_enter();
|
2018-10-29 22:36:50 +08:00
|
|
|
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
|
2019-11-16 06:08:53 +08:00
|
|
|
WARN_ON_ONCE(rcu_preempt_depth() > RCU_NEST_PMAX);
|
2020-08-07 00:40:18 +08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) && rcu_state.gp_kthread)
|
|
|
|
WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
|
2018-03-19 23:05:04 +08:00
|
|
|
barrier(); /* critical section after entry code. */
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__rcu_read_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Preemptible RCU implementation for rcu_read_unlock().
|
|
|
|
* Decrement ->rcu_read_lock_nesting. If the result is zero (outermost
|
|
|
|
* rcu_read_unlock()) and ->rcu_read_unlock_special is non-zero, then
|
|
|
|
* invoke rcu_read_unlock_special() to clean up after a context switch
|
|
|
|
* in an RCU read-side critical section and other special cases.
|
|
|
|
*/
|
|
|
|
void __rcu_read_unlock(void)
|
|
|
|
{
|
|
|
|
struct task_struct *t = current;
|
|
|
|
|
2021-02-27 03:25:29 +08:00
|
|
|
barrier(); // critical section before exit code.
|
2020-02-16 07:23:26 +08:00
|
|
|
if (rcu_preempt_read_exit() == 0) {
|
2021-02-27 03:25:29 +08:00
|
|
|
barrier(); // critical-section exit before .s check.
|
2018-03-19 23:05:04 +08:00
|
|
|
if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
|
|
|
|
rcu_read_unlock_special(t);
|
|
|
|
}
|
2018-10-29 22:36:50 +08:00
|
|
|
if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
|
2019-11-16 06:08:53 +08:00
|
|
|
int rrln = rcu_preempt_depth();
|
2018-03-19 23:05:04 +08:00
|
|
|
|
2020-02-16 07:23:26 +08:00
|
|
|
WARN_ON_ONCE(rrln < 0 || rrln > RCU_NEST_PMAX);
|
2018-03-19 23:05:04 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__rcu_read_unlock);
|
|
|
|
|
2010-11-30 13:56:39 +08:00
|
|
|
/*
|
|
|
|
* Advance a ->blkd_tasks-list pointer to the next entry, instead
|
|
|
|
* returning NULL if at the end of the list.
|
|
|
|
*/
|
|
|
|
static struct list_head *rcu_next_node_entry(struct task_struct *t,
|
|
|
|
struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
struct list_head *np;
|
|
|
|
|
|
|
|
np = t->rcu_node_entry.next;
|
|
|
|
if (np == &rnp->blkd_tasks)
|
|
|
|
np = NULL;
|
|
|
|
return np;
|
|
|
|
}
|
|
|
|
|
2014-11-01 02:22:37 +08:00
|
|
|
/*
|
|
|
|
* Return true if the specified rcu_node structure has tasks that were
|
|
|
|
* preempted within an RCU read-side critical section.
|
|
|
|
*/
|
|
|
|
static bool rcu_preempt_has_tasks(struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
return !list_empty(&rnp->blkd_tasks);
|
|
|
|
}
|
|
|
|
|
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure pass through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs go offline. (The events in step
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->block_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-23 00:53:48 +08:00
|
|
|
/*
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
* Report deferred quiescent states. The deferral time can
|
|
|
|
* be quite short, for example, in the case of the call from
|
|
|
|
* rcu_read_unlock_special().
|
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure pass through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs go offline. (The events in step
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->block_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-23 00:53:48 +08:00
|
|
|
*/
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
static void
|
|
|
|
rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
{
|
2014-11-01 03:05:04 +08:00
|
|
|
bool empty_exp;
|
|
|
|
bool empty_norm;
|
|
|
|
bool empty_exp_now;
|
2010-11-30 13:56:39 +08:00
|
|
|
struct list_head *np;
|
2014-06-13 04:30:25 +08:00
|
|
|
bool drop_boost_mutex = false;
|
2015-08-03 04:53:17 +08:00
|
|
|
struct rcu_data *rdp;
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
struct rcu_node *rnp;
|
2014-08-15 07:01:53 +08:00
|
|
|
union rcu_special special;
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
|
|
|
/*
|
2015-08-03 04:53:17 +08:00
|
|
|
* If RCU core is waiting for this CPU to exit its critical section,
|
|
|
|
* report the fact that it has exited. Because irqs are disabled,
|
2014-08-15 07:01:53 +08:00
|
|
|
* t->rcu_read_unlock_special cannot change.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*/
|
|
|
|
special = t->rcu_read_unlock_special;
|
2018-07-04 06:37:16 +08:00
|
|
|
rdp = this_cpu_ptr(&rcu_data);
|
2019-03-28 06:51:25 +08:00
|
|
|
if (!special.s && !rdp->exp_deferred_qs) {
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
local_irq_restore(flags);
|
|
|
|
return;
|
|
|
|
}
|
2019-11-01 20:06:21 +08:00
|
|
|
t->rcu_read_unlock_special.s = 0;
|
rcu: Do full report for .need_qs for strict GPs
The rcu_preempt_deferred_qs_irqrestore() function is invoked at
the end of an RCU read-side critical section (for example, directly
from rcu_read_unlock()) and, if .need_qs is set, invokes rcu_qs() to
report the new quiescent state. This works, except that rcu_qs() only
updates per-CPU state, leaving reporting of the actual quiescent state
to a later call to rcu_report_qs_rdp(), for example from within a later
RCU_SOFTIRQ instance. Although this approach is exactly what you want if
you are more concerned about efficiency than about short grace periods,
in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, short grace periods are
the name of the game.
This commit therefore makes rcu_preempt_deferred_qs_irqrestore() directly
invoke rcu_report_qs_rdp() in CONFIG_RCU_STRICT_GRACE_PERIOD=y, thus
shortening grace periods.
Historical note: To the best of my knowledge, causing rcu_read_unlock()
to directly report a quiescent state first appeared in Jim Houston's
and Joe Korty's JRCU. This is the second instance of a Linux-kernel RCU
feature being inspired by JRCU, the first being RCU callback offloading
(as in the RCU_NOCB_CPU Kconfig option).
Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-07 06:12:50 +08:00
|
|
|
if (special.b.need_qs) {
|
2020-08-08 04:44:10 +08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) {
|
2020-08-21 02:26:14 +08:00
|
|
|
rcu_report_qs_rdp(rdp);
|
2020-08-08 04:44:10 +08:00
|
|
|
udelay(rcu_unlock_delay);
|
|
|
|
} else {
|
rcu: Do full report for .need_qs for strict GPs
The rcu_preempt_deferred_qs_irqrestore() function is invoked at
the end of an RCU read-side critical section (for example, directly
from rcu_read_unlock()) and, if .need_qs is set, invokes rcu_qs() to
report the new quiescent state. This works, except that rcu_qs() only
updates per-CPU state, leaving reporting of the actual quiescent state
to a later call to rcu_report_qs_rdp(), for example from within a later
RCU_SOFTIRQ instance. Although this approach is exactly what you want if
you are more concerned about efficiency than about short grace periods,
in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, short grace periods are
the name of the game.
This commit therefore makes rcu_preempt_deferred_qs_irqrestore() directly
invoke rcu_report_qs_rdp() in CONFIG_RCU_STRICT_GRACE_PERIOD=y, thus
shortening grace periods.
Historical note: To the best of my knowledge, causing rcu_read_unlock()
to directly report a quiescent state first appeared in Jim Houston's
and Joe Korty's JRCU. This is the second instance of a Linux-kernel RCU
feature being inspired by JRCU, the first being RCU callback offloading
(as in the RCU_NOCB_CPU Kconfig option).
Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-07 06:12:50 +08:00
|
|
|
rcu_qs();
|
2020-08-08 04:44:10 +08:00
|
|
|
}
|
rcu: Do full report for .need_qs for strict GPs
The rcu_preempt_deferred_qs_irqrestore() function is invoked at
the end of an RCU read-side critical section (for example, directly
from rcu_read_unlock()) and, if .need_qs is set, invokes rcu_qs() to
report the new quiescent state. This works, except that rcu_qs() only
updates per-CPU state, leaving reporting of the actual quiescent state
to a later call to rcu_report_qs_rdp(), for example from within a later
RCU_SOFTIRQ instance. Although this approach is exactly what you want if
you are more concerned about efficiency than about short grace periods,
in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, short grace periods are
the name of the game.
This commit therefore makes rcu_preempt_deferred_qs_irqrestore() directly
invoke rcu_report_qs_rdp() in CONFIG_RCU_STRICT_GRACE_PERIOD=y, thus
shortening grace periods.
Historical note: To the best of my knowledge, causing rcu_read_unlock()
to directly report a quiescent state first appeared in Jim Houston's
and Joe Korty's JRCU. This is the second instance of a Linux-kernel RCU
feature being inspired by JRCU, the first being RCU callback offloading
(as in the RCU_NOCB_CPU Kconfig option).
Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-07 06:12:50 +08:00
|
|
|
}
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
2015-08-03 04:53:17 +08:00
|
|
|
/*
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
* Respond to a request by an expedited grace period for a
|
|
|
|
* quiescent state from this CPU. Note that requests from
|
|
|
|
* tasks are handled when removing the task from the
|
|
|
|
* blocked-tasks list below.
|
2015-08-03 04:53:17 +08:00
|
|
|
*/
|
2019-11-01 20:06:21 +08:00
|
|
|
if (rdp->exp_deferred_qs)
|
2018-07-04 08:22:34 +08:00
|
|
|
rcu_report_exp_rdp(rdp);
|
2015-08-03 04:53:17 +08:00
|
|
|
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
/* Clean up if blocked during RCU read-side critical section. */
|
2014-08-15 07:01:53 +08:00
|
|
|
if (special.b.blocked) {
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
2009-08-28 05:58:16 +08:00
|
|
|
/*
|
2015-03-09 05:20:30 +08:00
|
|
|
* Remove this task from the list it blocked on. The task
|
2015-09-29 22:55:41 +08:00
|
|
|
* now remains queued on the rcu_node corresponding to the
|
|
|
|
* CPU it first blocked on, so there is no longer any need
|
|
|
|
* to loop. Retain a WARN_ON_ONCE() out of sheer paranoia.
|
2009-08-28 05:58:16 +08:00
|
|
|
*/
|
2015-09-29 22:55:41 +08:00
|
|
|
rnp = t->rcu_blocked_node;
|
|
|
|
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
|
|
|
|
WARN_ON_ONCE(rnp != t->rcu_blocked_node);
|
2018-04-14 08:11:44 +08:00
|
|
|
WARN_ON_ONCE(!rcu_is_leaf_node(rnp));
|
2014-10-31 12:08:53 +08:00
|
|
|
empty_norm = !rcu_preempt_blocked_readers_cgp(rnp);
|
2018-04-29 09:50:06 +08:00
|
|
|
WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
|
2017-11-28 07:13:56 +08:00
|
|
|
(!empty_norm || rnp->qsmask));
|
2019-11-28 05:59:37 +08:00
|
|
|
empty_exp = sync_rcu_exp_done(rnp);
|
2009-12-03 04:10:15 +08:00
|
|
|
smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
|
2010-11-30 13:56:39 +08:00
|
|
|
np = rcu_next_node_entry(t, rnp);
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
list_del_init(&t->rcu_node_entry);
|
2011-08-04 22:55:34 +08:00
|
|
|
t->rcu_blocked_node = NULL;
|
2013-07-13 05:18:47 +08:00
|
|
|
trace_rcu_unlock_preempted_task(TPS("rcu_preempt"),
|
2018-05-02 04:35:20 +08:00
|
|
|
rnp->gp_seq, t->pid);
|
2010-11-30 13:56:39 +08:00
|
|
|
if (&t->rcu_node_entry == rnp->gp_tasks)
|
2019-10-10 05:21:54 +08:00
|
|
|
WRITE_ONCE(rnp->gp_tasks, np);
|
2010-11-30 13:56:39 +08:00
|
|
|
if (&t->rcu_node_entry == rnp->exp_tasks)
|
2020-01-04 06:18:12 +08:00
|
|
|
WRITE_ONCE(rnp->exp_tasks, np);
|
2015-03-04 06:49:26 +08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_BOOST)) {
|
|
|
|
/* Snapshot ->boost_mtx ownership w/rnp->lock held. */
|
2021-08-16 05:27:58 +08:00
|
|
|
drop_boost_mutex = rt_mutex_owner(&rnp->boost_mtx.rtmutex) == t;
|
2017-07-12 12:52:31 +08:00
|
|
|
if (&t->rcu_node_entry == rnp->boost_tasks)
|
2020-01-05 02:44:41 +08:00
|
|
|
WRITE_ONCE(rnp->boost_tasks, np);
|
2015-03-04 06:49:26 +08:00
|
|
|
}
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this was the last task on the current list, and if
|
|
|
|
* we aren't waiting on any CPUs, report the quiescent state.
|
2011-09-22 05:41:37 +08:00
|
|
|
* Note that rcu_report_unblock_qs_rnp() releases rnp->lock,
|
|
|
|
* so we must take a snapshot of the expedited state.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*/
|
2019-11-28 05:59:37 +08:00
|
|
|
empty_exp_now = sync_rcu_exp_done(rnp);
|
2014-10-31 12:08:53 +08:00
|
|
|
if (!empty_norm && !rcu_preempt_blocked_readers_cgp(rnp)) {
|
2013-07-13 05:18:47 +08:00
|
|
|
trace_rcu_quiescent_state_report(TPS("preempt_rcu"),
|
2018-05-02 04:35:20 +08:00
|
|
|
rnp->gp_seq,
|
rcu: Add grace-period, quiescent-state, and call_rcu trace events
Add trace events to record grace-period start and end, quiescent states,
CPUs noticing grace-period start and end, grace-period initialization,
call_rcu() invocation, tasks blocking in RCU read-side critical sections,
tasks exiting those same critical sections, force_quiescent_state()
detection of dyntick-idle and offline CPUs, CPUs entering and leaving
dyntick-idle mode (except from NMIs), CPUs coming online and going
offline, and CPUs being kicked for staying in dyntick-idle mode for too
long (as in many weeks, even on 32-bit systems).
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
rcu: Add the rcu flavor to callback trace events
The earlier trace events for registering RCU callbacks and for invoking
them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched).
This commit adds the RCU flavor to those trace events.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-06-25 21:36:56 +08:00
|
|
|
0, rnp->qsmask,
|
|
|
|
rnp->level,
|
|
|
|
rnp->grplo,
|
|
|
|
rnp->grphi,
|
|
|
|
!!rnp->gp_tasks);
|
2018-07-04 08:22:34 +08:00
|
|
|
rcu_report_unblock_qs_rnp(rnp, flags);
|
2012-06-28 23:08:25 +08:00
|
|
|
} else {
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2012-06-28 23:08:25 +08:00
|
|
|
}
|
2009-12-03 04:10:15 +08:00
|
|
|
|
2011-02-08 04:47:15 +08:00
|
|
|
/* Unboost if we were boosted. */
|
2015-03-04 06:49:26 +08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex)
|
2021-08-16 05:27:58 +08:00
|
|
|
rt_mutex_futex_unlock(&rnp->boost_mtx.rtmutex);
|
2011-02-08 04:47:15 +08:00
|
|
|
|
2009-12-03 04:10:15 +08:00
|
|
|
/*
|
|
|
|
* If this was the last task on the expedited lists,
|
|
|
|
* then we need to report up the rcu_node hierarchy.
|
|
|
|
*/
|
2011-09-22 05:41:37 +08:00
|
|
|
if (!empty_exp && empty_exp_now)
|
2018-07-04 08:22:34 +08:00
|
|
|
rcu_report_exp_rnp(rnp, true);
|
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure pass through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs go offline. (The events in step
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->block_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-23 00:53:48 +08:00
|
|
|
} else {
|
|
|
|
local_irq_restore(flags);
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
/*
|
|
|
|
* Is a deferred quiescent-state pending, and are we also not in
|
|
|
|
* an RCU read-side critical section? It is the caller's responsibility
|
|
|
|
* to ensure it is otherwise safe to report any deferred quiescent
|
|
|
|
* states. The reason for this is that it is safe to report a
|
|
|
|
* quiescent state during context switch even though preemption
|
|
|
|
* is disabled. This function cannot be expected to understand these
|
|
|
|
* nuances, so the caller must handle them.
|
|
|
|
*/
|
|
|
|
static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
|
|
|
|
{
|
2019-03-28 06:51:25 +08:00
|
|
|
return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
READ_ONCE(t->rcu_read_unlock_special.s)) &&
|
2020-02-16 07:23:26 +08:00
|
|
|
rcu_preempt_depth() == 0;
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Report a deferred quiescent state if needed and safe to do so.
|
|
|
|
* As with rcu_preempt_need_deferred_qs(), "safe" involves only
|
|
|
|
* not being in an RCU read-side critical section. The caller must
|
|
|
|
* evaluate safety in terms of interrupt, softirq, and preemption
|
|
|
|
* disabling.
|
|
|
|
*/
|
|
|
|
static void rcu_preempt_deferred_qs(struct task_struct *t)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
if (!rcu_preempt_need_deferred_qs(t))
|
|
|
|
return;
|
|
|
|
local_irq_save(flags);
|
|
|
|
rcu_preempt_deferred_qs_irqrestore(t, flags);
|
|
|
|
}
|
|
|
|
|
2019-04-05 03:19:25 +08:00
|
|
|
/*
|
|
|
|
* Minimal handler to give the scheduler a chance to re-evaluate.
|
|
|
|
*/
|
|
|
|
static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
|
|
|
|
{
|
|
|
|
struct rcu_data *rdp;
|
|
|
|
|
|
|
|
rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
|
|
|
|
rdp->defer_qs_iw_pending = false;
|
|
|
|
}
|
|
|
|
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
/*
|
|
|
|
* Handle special cases during rcu_read_unlock(), such as needing to
|
|
|
|
* notify RCU core processing or task having blocked during the RCU
|
|
|
|
* read-side critical section.
|
|
|
|
*/
|
|
|
|
static void rcu_read_unlock_special(struct task_struct *t)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
bool irqs_were_disabled;
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
bool preempt_bh_were_disabled =
|
|
|
|
!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
|
|
|
|
|
|
|
|
/* NMI handlers cannot block and cannot safely manipulate state. */
|
|
|
|
if (in_nmi())
|
|
|
|
return;
|
|
|
|
|
|
|
|
local_irq_save(flags);
|
|
|
|
irqs_were_disabled = irqs_disabled_flags(flags);
|
rcu: Speed up expedited GPs when interrupting RCU reader
In PREEMPT kernels, an expedited grace period might send an IPI to a
CPU that is executing an RCU read-side critical section. In that case,
it would be nice if the rcu_read_unlock() directly interacted with the
RCU core code to immediately report the quiescent state. And this does
happen in the case where the reader has been preempted. But it would
also be a nice performance optimization if immediate reporting also
happened in the preemption-free case.
This commit therefore adds an ->exp_hint field to the task_struct structure's
->rcu_read_unlock_special field. The IPI handler sets this hint when
it has interrupted an RCU read-side critical section, and this causes
the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
which, if preemption is enabled, reports the quiescent state immediately.
If preemption is disabled, then the report is required to be deferred
until preemption (or bottom halves or interrupts or whatever) is re-enabled.
Because this is a hint, it does nothing for more complicated cases. For
example, if the IPI interrupts an RCU reader, but interrupts are disabled
across the rcu_read_unlock(), but another rcu_read_lock() is executed
before interrupts are re-enabled, the hint will already have been cleared.
If you do crazy things like this, reporting will be deferred until some
later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-10-16 19:12:58 +08:00
|
|
|
if (preempt_bh_were_disabled || irqs_were_disabled) {
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
bool expboost; // Expedited GP in flight or possible boosting.
|
2019-04-02 05:12:50 +08:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
|
|
|
struct rcu_node *rnp = rdp->mynode;
|
|
|
|
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
|
|
|
|
(rdp->grpmask & READ_ONCE(rnp->expmask)) ||
|
2021-01-28 05:57:16 +08:00
|
|
|
IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ||
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
|
|
|
|
t->rcu_blocked_node);
|
rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
kthreads, a full and unconditional wakeup is required to initiate RCU
core processing. In contrast, when RCU core processing is carried
out by RCU_SOFTIRQ, a raise_softirq() suffices. Of course, there are
situations where raise_softirq() does a full wakeup, but these do not
occur with normal usage of rcu_read_unlock().
The reason that full wakeups can be problematic is that the scheduler
sometimes invokes rcu_read_unlock() with its pi or rq locks held,
which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
rcu_read_unlock() invokes the scheduler. Scheduler invocations can happen
in the following situations: (1) The just-ended reader has been subjected
to RCU priority boosting, in which case rcu_read_unlock() must deboost,
(2) Interrupts were disabled across the call to rcu_read_unlock(), so
the quiescent state must be deferred, requiring a wakeup of the rcuc
kthread corresponding to the current CPU.
Now, the scheduler may hold one of its locks across rcu_read_unlock()
only if preemption has been disabled across the entire RCU read-side
critical section, which in the days prior to RCU flavor consolidation
meant that rcu_read_unlock() never needed to do wakeups. However, this
is no longer the case for any but the first rcu_read_unlock() following a
condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
attention. For example, an RCU read-side critical section might be
preempted, but preemption might be disabled across the rcu_read_unlock().
The rcu_read_unlock() must defer the quiescent state, and therefore
leaves the task queued on its leaf rcu_node structure. If a scheduler
interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
one of its locks held. However, the preempted task is still queued, so
rcu_read_unlock() will attempt to defer the quiescent state once more.
When RCU core processing is carried out by RCU_SOFTIRQ, this works just
fine: The raise_softirq() function simply sets a bit in a per-CPU mask
and the RCU core processing will be undertaken upon return from interrupt.
Not so when RCU core processing is carried out by the rcuc kthread: In this
case, the required wakeup can result in deadlock.
The initial solution to this problem was to use set_tsk_need_resched() and
set_preempt_need_resched() to force a future context switch, which allows
rcu_preempt_note_context_switch() to report the deferred quiescent state
to RCU's core processing. Unfortunately for expedited grace periods,
there can be a significant delay between the call for a context switch
and the actual context switch.
This commit therefore introduces a ->deferred_qs flag to the task_struct
structure's rcu_special structure. This flag is initially false, and
is set to true by the first call to rcu_read_unlock() requiring special
attention, then finally reset back to false when the quiescent state is
finally reported. Then rcu_read_unlock() attempts full wakeups only when
->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
special attention. Note that a chain of RCU readers linked by some other
sort of reader may find that a later rcu_read_unlock() is once again able
to do a full wakeup, courtesy of an intervening preemption:
rcu_read_lock();
/* preempted */
local_irq_disable();
rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
rcu_read_lock();
local_irq_enable();
preempt_disable()
rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
rcu_read_lock();
preempt_enable();
/* preempted, >deferred_qs reset. */
local_irq_disable();
rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */
Such linked RCU readers do not yet seem to appear in the Linux kernel, and
it is probably best if they don't. However, RCU needs to handle them, and
some variations on this theme could make even raise_softirq() unsafe due to
the possibility of its doing a full wakeup. This commit therefore also
avoids invoking raise_softirq() when the ->deferred_qs set flag is set.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-03-25 06:25:51 +08:00
|
|
|
// Need to defer quiescent state until everything is enabled.
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) {
|
2020-02-16 06:18:09 +08:00
|
|
|
// Using softirq, safe to awaken, and either the
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
// wakeup is free or there is either an expedited
|
|
|
|
// GP in flight or a potential need to deboost.
|
rcu: Speed up expedited GPs when interrupting RCU reader
In PREEMPT kernels, an expedited grace period might send an IPI to a
CPU that is executing an RCU read-side critical section. In that case,
it would be nice if the rcu_read_unlock() directly interacted with the
RCU core code to immediately report the quiescent state. And this does
happen in the case where the reader has been preempted. But it would
also be a nice performance optimization if immediate reporting also
happened in the preemption-free case.
This commit therefore adds an ->exp_hint field to the task_struct structure's
->rcu_read_unlock_special field. The IPI handler sets this hint when
it has interrupted an RCU read-side critical section, and this causes
the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
which, if preemption is enabled, reports the quiescent state immediately.
If preemption is disabled, then the report is required to be deferred
until preemption (or bottom halves or interrupts or whatever) is re-enabled.
Because this is a hint, it does nothing for more complicated cases. For
example, if the IPI interrupts an RCU reader, but interrupts are disabled
across the rcu_read_unlock(), but another rcu_read_lock() is executed
before interrupts are re-enabled, the hint will already have been cleared.
If you do crazy things like this, reporting will be deferred until some
later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-10-16 19:12:58 +08:00
|
|
|
raise_softirq_irqoff(RCU_SOFTIRQ);
|
|
|
|
} else {
|
rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
kthreads, a full and unconditional wakeup is required to initiate RCU
core processing. In contrast, when RCU core processing is carried
out by RCU_SOFTIRQ, a raise_softirq() suffices. Of course, there are
situations where raise_softirq() does a full wakeup, but these do not
occur with normal usage of rcu_read_unlock().
The reason that full wakeups can be problematic is that the scheduler
sometimes invokes rcu_read_unlock() with its pi or rq locks held,
which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
rcu_read_unlock() invokes the scheduler. Scheduler invocations can happen
in the following situations: (1) The just-ended reader has been subjected
to RCU priority boosting, in which case rcu_read_unlock() must deboost,
(2) Interrupts were disabled across the call to rcu_read_unlock(), so
the quiescent state must be deferred, requiring a wakeup of the rcuc
kthread corresponding to the current CPU.
Now, the scheduler may hold one of its locks across rcu_read_unlock()
only if preemption has been disabled across the entire RCU read-side
critical section, which in the days prior to RCU flavor consolidation
meant that rcu_read_unlock() never needed to do wakeups. However, this
is no longer the case for any but the first rcu_read_unlock() following a
condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
attention. For example, an RCU read-side critical section might be
preempted, but preemption might be disabled across the rcu_read_unlock().
The rcu_read_unlock() must defer the quiescent state, and therefore
leaves the task queued on its leaf rcu_node structure. If a scheduler
interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
one of its locks held. However, the preempted task is still queued, so
rcu_read_unlock() will attempt to defer the quiescent state once more.
When RCU core processing is carried out by RCU_SOFTIRQ, this works just
fine: The raise_softirq() function simply sets a bit in a per-CPU mask
and the RCU core processing will be undertaken upon return from interrupt.
Not so when RCU core processing is carried out by the rcuc kthread: In this
case, the required wakeup can result in deadlock.
The initial solution to this problem was to use set_tsk_need_resched() and
set_preempt_need_resched() to force a future context switch, which allows
rcu_preempt_note_context_switch() to report the deferred quiescent state
to RCU's core processing. Unfortunately for expedited grace periods,
there can be a significant delay between the call for a context switch
and the actual context switch.
This commit therefore introduces a ->deferred_qs flag to the task_struct
structure's rcu_special structure. This flag is initially false, and
is set to true by the first call to rcu_read_unlock() requiring special
attention, then finally reset back to false when the quiescent state is
finally reported. Then rcu_read_unlock() attempts full wakeups only when
->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
special attention. Note that a chain of RCU readers linked by some other
sort of reader may find that a later rcu_read_unlock() is once again able
to do a full wakeup, courtesy of an intervening preemption:
rcu_read_lock();
/* preempted */
local_irq_disable();
rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
rcu_read_lock();
local_irq_enable();
preempt_disable()
rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
rcu_read_lock();
preempt_enable();
/* preempted, >deferred_qs reset. */
local_irq_disable();
rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */
Such linked RCU readers do not yet seem to appear in the Linux kernel, and
it is probably best if they don't. However, RCU needs to handle them, and
some variations on this theme could make even raise_softirq() unsafe due to
the possibility of its doing a full wakeup. This commit therefore also
avoids invoking raise_softirq() when the ->deferred_qs set flag is set.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-03-25 06:25:51 +08:00
|
|
|
// Enabling BH or preempt does reschedule, so...
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
// Also if no expediting and no possible deboosting,
|
|
|
|
// slow is OK. Plus nohz_full CPUs eventually get
|
|
|
|
// tick enabled.
|
rcu: Speed up expedited GPs when interrupting RCU reader
In PREEMPT kernels, an expedited grace period might send an IPI to a
CPU that is executing an RCU read-side critical section. In that case,
it would be nice if the rcu_read_unlock() directly interacted with the
RCU core code to immediately report the quiescent state. And this does
happen in the case where the reader has been preempted. But it would
also be a nice performance optimization if immediate reporting also
happened in the preemption-free case.
This commit therefore adds an ->exp_hint field to the task_struct structure's
->rcu_read_unlock_special field. The IPI handler sets this hint when
it has interrupted an RCU read-side critical section, and this causes
the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
which, if preemption is enabled, reports the quiescent state immediately.
If preemption is disabled, then the report is required to be deferred
until preemption (or bottom halves or interrupts or whatever) is re-enabled.
Because this is a hint, it does nothing for more complicated cases. For
example, if the IPI interrupts an RCU reader, but interrupts are disabled
across the rcu_read_unlock(), but another rcu_read_lock() is executed
before interrupts are re-enabled, the hint will already have been cleared.
If you do crazy things like this, reporting will be deferred until some
later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-10-16 19:12:58 +08:00
|
|
|
set_tsk_need_resched(current);
|
|
|
|
set_preempt_need_resched();
|
2019-06-23 03:05:54 +08:00
|
|
|
if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
|
2019-04-05 03:19:25 +08:00
|
|
|
// Get scheduler to re-evaluate and call hooks.
|
|
|
|
// If !IRQ_WORK, FQS scan will eventually IPI.
|
rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time. However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time. During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y. These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task. This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting. Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y. It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.
While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-15 02:39:31 +08:00
|
|
|
init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
|
2019-04-05 03:19:25 +08:00
|
|
|
rdp->defer_qs_iw_pending = true;
|
|
|
|
irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
|
|
|
|
}
|
rcu: Speed up expedited GPs when interrupting RCU reader
In PREEMPT kernels, an expedited grace period might send an IPI to a
CPU that is executing an RCU read-side critical section. In that case,
it would be nice if the rcu_read_unlock() directly interacted with the
RCU core code to immediately report the quiescent state. And this does
happen in the case where the reader has been preempted. But it would
also be a nice performance optimization if immediate reporting also
happened in the preemption-free case.
This commit therefore adds an ->exp_hint field to the task_struct structure's
->rcu_read_unlock_special field. The IPI handler sets this hint when
it has interrupted an RCU read-side critical section, and this causes
the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
which, if preemption is enabled, reports the quiescent state immediately.
If preemption is disabled, then the report is required to be deferred
until preemption (or bottom halves or interrupts or whatever) is re-enabled.
Because this is a hint, it does nothing for more complicated cases. For
example, if the IPI interrupts an RCU reader, but interrupts are disabled
across the rcu_read_unlock(), but another rcu_read_lock() is executed
before interrupts are re-enabled, the hint will already have been cleared.
If you do crazy things like this, reporting will be deferred until some
later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-10-16 19:12:58 +08:00
|
|
|
}
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
local_irq_restore(flags);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
rcu_preempt_deferred_qs_irqrestore(t, flags);
|
|
|
|
}
|
|
|
|
|
2009-09-14 00:15:09 +08:00
|
|
|
/*
|
|
|
|
* Check that the list of blocked tasks for the newly completed grace
|
|
|
|
* period is in fact empty. It is a serious bug to complete a grace
|
|
|
|
* period that still has RCU readers blocked! This function must be
|
2019-10-11 00:05:27 +08:00
|
|
|
* invoked -before- updating this rnp's ->gp_seq.
|
2010-11-30 13:56:39 +08:00
|
|
|
*
|
|
|
|
* Also, if there are blocked tasks on the list, they automatically
|
|
|
|
* block the newly created grace period, so set up ->gp_tasks accordingly.
|
2009-09-14 00:15:09 +08:00
|
|
|
*/
|
2018-07-04 08:22:34 +08:00
|
|
|
static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
|
2009-09-14 00:15:09 +08:00
|
|
|
{
|
2017-06-20 01:32:23 +08:00
|
|
|
struct task_struct *t;
|
|
|
|
|
2017-04-29 04:19:28 +08:00
|
|
|
RCU_LOCKDEP_WARN(preemptible(), "rcu_preempt_check_blocked_tasks() invoked with preemption enabled!!!\n");
|
2019-10-11 00:05:27 +08:00
|
|
|
raw_lockdep_assert_held_rcu_node(rnp);
|
2017-11-28 07:13:56 +08:00
|
|
|
if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)))
|
2018-07-04 08:22:34 +08:00
|
|
|
dump_blkd_tasks(rnp, 10);
|
rcu: Suppress false-positive splats from mid-init task resume
Consider the following sequence of events in a PREEMPT=y kernel:
1. All CPUs corresponding to a given leaf rcu_node structure are
offline.
2. The first phase of the rcu_gp_init() function's grace-period
initialization runs, and sets that rcu_node structure's
->qsmaskinit to zero, as it should.
3. One of the CPUs corresponding to that rcu_node structure comes
back online. Note that because this CPU came online after the
grace period started, this grace period can safely ignore this
newly onlined CPU.
4. A task running on the newly onlined CPU enters an RCU-preempt
read-side critical section, and is then preempted. Because
the corresponding rcu_node structure's ->qsmask is zero,
rcu_preempt_ctxt_queue() leaves the rcu_node structure's
->gp_tasks field NULL, as it should.
5. The rcu_gp_init() function continues running the second phase of
grace-period initialization. The ->qsmask field of the parent of
the aforementioned leaf rcu_node structure is set to not expect
a quiescent state from the leaf, as is only right and proper.
However, when rcu_gp_init() reaches the leaf, it invokes
rcu_preempt_check_blocked_tasks(), which sees that the leaf's
->blkd_tasks list is non-empty, and therefore sets the leaf's
->gp_tasks field to reference the first task on that list.
6. The grace period ends before the preempted task resumes, which
is perfectly fine, given that this grace period was under no
obligation to wait for that task to exit its late-starting
RCU-preempt read-side critical section. Unfortunately, the
leaf's ->gp_tasks field is non-NULL, so rcu_gp_cleanup() splats.
After all, it appears to rcu_gp_cleanup() that the grace period
failed to wait for a task that was supposed to be blocking that
grace period.
This commit avoids this false-positive splat by adding a check of both
->qsmaskinit and ->wait_blkd_tasks to rcu_preempt_check_blocked_tasks().
If both ->qsmaskinit and ->wait_blkd_tasks are zero, then the task must
have entered its RCU-preempt read-side critical section late (after all,
the CPU that it is running on was not online at that time), which means
that the upper-level rcu_node structure won't be waiting for anything
on the leaf anyway.
If ->wait_blkd_tasks is non-zero, then there is at least one task on
ths rcu_node structure's ->blkd_tasks list whose RCU read-side
critical section predates the current grace period. If ->qsmaskinit
is non-zero, there is at least one CPU that was online at the start
of the current grace period. Thus, if both are zero, there is nothing
to wait for.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-05-09 07:18:28 +08:00
|
|
|
if (rcu_preempt_has_tasks(rnp) &&
|
|
|
|
(rnp->qsmaskinit || rnp->wait_blkd_tasks)) {
|
2019-10-10 05:21:54 +08:00
|
|
|
WRITE_ONCE(rnp->gp_tasks, rnp->blkd_tasks.next);
|
2017-06-20 01:32:23 +08:00
|
|
|
t = container_of(rnp->gp_tasks, struct task_struct,
|
|
|
|
rcu_node_entry);
|
|
|
|
trace_rcu_unlock_preempted_task(TPS("rcu_preempt-GPS"),
|
2018-05-02 04:35:20 +08:00
|
|
|
rnp->gp_seq, t->pid);
|
2017-06-20 01:32:23 +08:00
|
|
|
}
|
2009-09-19 00:50:17 +08:00
|
|
|
WARN_ON_ONCE(rnp->qsmask);
|
2009-09-14 00:15:09 +08:00
|
|
|
}
|
|
|
|
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
/*
|
2018-11-22 03:35:03 +08:00
|
|
|
* Check for a quiescent state from the current CPU, including voluntary
|
|
|
|
* context switches for Tasks RCU. When a task blocks, the task is
|
|
|
|
* recorded in the corresponding CPU's rcu_node structure, which is checked
|
|
|
|
* elsewhere, hence this function need only check for quiescent states
|
|
|
|
* related to the current CPU, not to those related to tasks.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*/
|
2018-11-22 03:35:03 +08:00
|
|
|
static void rcu_flavor_sched_clock_irq(int user)
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
{
|
|
|
|
struct task_struct *t = current;
|
|
|
|
|
2020-11-20 02:13:06 +08:00
|
|
|
lockdep_assert_irqs_disabled();
|
2018-07-03 05:30:37 +08:00
|
|
|
if (user || rcu_is_cpu_rrupt_from_idle()) {
|
|
|
|
rcu_note_voluntary_context_switch(current);
|
|
|
|
}
|
2019-11-16 06:08:53 +08:00
|
|
|
if (rcu_preempt_depth() > 0 ||
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
|
|
|
|
/* No QS, force context switch if deferred. */
|
2018-07-27 04:44:00 +08:00
|
|
|
if (rcu_preempt_need_deferred_qs(t)) {
|
|
|
|
set_tsk_need_resched(t);
|
|
|
|
set_preempt_need_resched();
|
|
|
|
}
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
} else if (rcu_preempt_need_deferred_qs(t)) {
|
|
|
|
rcu_preempt_deferred_qs(t); /* Report deferred QS. */
|
|
|
|
return;
|
2020-02-16 07:23:26 +08:00
|
|
|
} else if (!WARN_ON_ONCE(rcu_preempt_depth())) {
|
2018-07-03 05:30:37 +08:00
|
|
|
rcu_qs(); /* Report immediate QS. */
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
return;
|
|
|
|
}
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
|
|
|
|
/* If GP is oldish, ask for help from rcu_read_unlock_special(). */
|
2019-11-16 06:08:53 +08:00
|
|
|
if (rcu_preempt_depth() > 0 &&
|
2018-07-04 06:54:39 +08:00
|
|
|
__this_cpu_read(rcu_data.core_needs_qs) &&
|
|
|
|
__this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
|
2018-05-17 05:41:41 +08:00
|
|
|
!t->rcu_read_unlock_special.b.need_qs &&
|
2018-07-05 05:52:04 +08:00
|
|
|
time_after(jiffies, rcu_state.gp_start + HZ))
|
2014-08-15 07:01:53 +08:00
|
|
|
t->rcu_read_unlock_special.b.need_qs = true;
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
}
|
|
|
|
|
2013-04-12 01:15:52 +08:00
|
|
|
/*
|
|
|
|
* Check for a task exiting while in a preemptible-RCU read-side
|
rcu: Make exit_rcu() handle non-preempted RCU readers
The purpose of exit_rcu() is to handle cases where buggy code causes a
task to exit within an RCU read-side critical section. It currently
does that in the case where said RCU read-side critical section was
preempted at least once, but fails to handle cases where preemption did
not occur. This case needs to be handled because otherwise the final
context switch away from the exiting task will incorrectly behave as if
task exit were instead a preemption of an RCU read-side critical section,
and will therefore queue the exiting task. The exiting task will have
exited, and thus won't ever execute rcu_read_unlock(), which means that
it will remain queued forever, blocking all subsequent grace periods,
and eventually resulting in OOM.
Although this is arguably better than letting grace periods proceed
and having a later rcu_read_unlock() access the now-freed task
structure that once belonged to the exiting tasks, it would obviously
be better to correctly handle this case. This commit therefore sets
->rcu_read_lock_nesting to 1 in that case, so that the subsequence call
to __rcu_read_unlock() causes the exiting task to exit its dangling RCU
read-side critical section.
Note that deferred quiescent states need not be considered. The reason
is that removing the task from the ->blkd_tasks[] list in the call to
rcu_preempt_deferred_qs() handles the per-task component of any deferred
quiescent state, and all other components of any deferred quiescent state
are associated with the CPU, which isn't going anywhere until some later
CPU-hotplug operation, which will report any remaining deferred quiescent
states from within the rcu_report_dead() function.
Note also that negative values of ->rcu_read_lock_nesting need not be
considered. First, these won't show up in exit_rcu() unless there is
a serious bug in RCU, and second, setting ->rcu_read_lock_nesting sets
the state so that the RCU read-side critical section will be exited
normally.
Again, this code has no effect unless there has been some prior bug
that prevents a task from leaving an RCU read-side critical section
before exiting. Furthermore, there have been no reports of the bug
fixed by this commit appearing in production. This commit is therefore
absolutely -not- recommended for backporting to -stable.
Reported-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
Reported-by: BHARATH Y MOURYA <bharathm@iisc.ac.in>
Reported-by: Aravinda Prasad <aravinda@iisc.ac.in>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
2019-02-11 23:21:29 +08:00
|
|
|
* critical section, clean up if so. No need to issue warnings, as
|
|
|
|
* debug_check_no_locks_held() already does this if lockdep is enabled.
|
|
|
|
* Besides, if this function does anything other than just immediately
|
|
|
|
* return, there was a bug of some sort. Spewing warnings from this
|
|
|
|
* function is like as not to simply obscure important prior warnings.
|
2013-04-12 01:15:52 +08:00
|
|
|
*/
|
|
|
|
void exit_rcu(void)
|
|
|
|
{
|
|
|
|
struct task_struct *t = current;
|
|
|
|
|
rcu: Make exit_rcu() handle non-preempted RCU readers
The purpose of exit_rcu() is to handle cases where buggy code causes a
task to exit within an RCU read-side critical section. It currently
does that in the case where said RCU read-side critical section was
preempted at least once, but fails to handle cases where preemption did
not occur. This case needs to be handled because otherwise the final
context switch away from the exiting task will incorrectly behave as if
task exit were instead a preemption of an RCU read-side critical section,
and will therefore queue the exiting task. The exiting task will have
exited, and thus won't ever execute rcu_read_unlock(), which means that
it will remain queued forever, blocking all subsequent grace periods,
and eventually resulting in OOM.
Although this is arguably better than letting grace periods proceed
and having a later rcu_read_unlock() access the now-freed task
structure that once belonged to the exiting tasks, it would obviously
be better to correctly handle this case. This commit therefore sets
->rcu_read_lock_nesting to 1 in that case, so that the subsequence call
to __rcu_read_unlock() causes the exiting task to exit its dangling RCU
read-side critical section.
Note that deferred quiescent states need not be considered. The reason
is that removing the task from the ->blkd_tasks[] list in the call to
rcu_preempt_deferred_qs() handles the per-task component of any deferred
quiescent state, and all other components of any deferred quiescent state
are associated with the CPU, which isn't going anywhere until some later
CPU-hotplug operation, which will report any remaining deferred quiescent
states from within the rcu_report_dead() function.
Note also that negative values of ->rcu_read_lock_nesting need not be
considered. First, these won't show up in exit_rcu() unless there is
a serious bug in RCU, and second, setting ->rcu_read_lock_nesting sets
the state so that the RCU read-side critical section will be exited
normally.
Again, this code has no effect unless there has been some prior bug
that prevents a task from leaving an RCU read-side critical section
before exiting. Furthermore, there have been no reports of the bug
fixed by this commit appearing in production. This commit is therefore
absolutely -not- recommended for backporting to -stable.
Reported-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
Reported-by: BHARATH Y MOURYA <bharathm@iisc.ac.in>
Reported-by: Aravinda Prasad <aravinda@iisc.ac.in>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
2019-02-11 23:21:29 +08:00
|
|
|
if (unlikely(!list_empty(¤t->rcu_node_entry))) {
|
2019-11-16 06:08:53 +08:00
|
|
|
rcu_preempt_depth_set(1);
|
rcu: Make exit_rcu() handle non-preempted RCU readers
The purpose of exit_rcu() is to handle cases where buggy code causes a
task to exit within an RCU read-side critical section. It currently
does that in the case where said RCU read-side critical section was
preempted at least once, but fails to handle cases where preemption did
not occur. This case needs to be handled because otherwise the final
context switch away from the exiting task will incorrectly behave as if
task exit were instead a preemption of an RCU read-side critical section,
and will therefore queue the exiting task. The exiting task will have
exited, and thus won't ever execute rcu_read_unlock(), which means that
it will remain queued forever, blocking all subsequent grace periods,
and eventually resulting in OOM.
Although this is arguably better than letting grace periods proceed
and having a later rcu_read_unlock() access the now-freed task
structure that once belonged to the exiting tasks, it would obviously
be better to correctly handle this case. This commit therefore sets
->rcu_read_lock_nesting to 1 in that case, so that the subsequence call
to __rcu_read_unlock() causes the exiting task to exit its dangling RCU
read-side critical section.
Note that deferred quiescent states need not be considered. The reason
is that removing the task from the ->blkd_tasks[] list in the call to
rcu_preempt_deferred_qs() handles the per-task component of any deferred
quiescent state, and all other components of any deferred quiescent state
are associated with the CPU, which isn't going anywhere until some later
CPU-hotplug operation, which will report any remaining deferred quiescent
states from within the rcu_report_dead() function.
Note also that negative values of ->rcu_read_lock_nesting need not be
considered. First, these won't show up in exit_rcu() unless there is
a serious bug in RCU, and second, setting ->rcu_read_lock_nesting sets
the state so that the RCU read-side critical section will be exited
normally.
Again, this code has no effect unless there has been some prior bug
that prevents a task from leaving an RCU read-side critical section
before exiting. Furthermore, there have been no reports of the bug
fixed by this commit appearing in production. This commit is therefore
absolutely -not- recommended for backporting to -stable.
Reported-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
Reported-by: BHARATH Y MOURYA <bharathm@iisc.ac.in>
Reported-by: Aravinda Prasad <aravinda@iisc.ac.in>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
2019-02-11 23:21:29 +08:00
|
|
|
barrier();
|
2019-03-27 01:22:22 +08:00
|
|
|
WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
|
2019-11-16 06:08:53 +08:00
|
|
|
} else if (unlikely(rcu_preempt_depth())) {
|
|
|
|
rcu_preempt_depth_set(1);
|
rcu: Make exit_rcu() handle non-preempted RCU readers
The purpose of exit_rcu() is to handle cases where buggy code causes a
task to exit within an RCU read-side critical section. It currently
does that in the case where said RCU read-side critical section was
preempted at least once, but fails to handle cases where preemption did
not occur. This case needs to be handled because otherwise the final
context switch away from the exiting task will incorrectly behave as if
task exit were instead a preemption of an RCU read-side critical section,
and will therefore queue the exiting task. The exiting task will have
exited, and thus won't ever execute rcu_read_unlock(), which means that
it will remain queued forever, blocking all subsequent grace periods,
and eventually resulting in OOM.
Although this is arguably better than letting grace periods proceed
and having a later rcu_read_unlock() access the now-freed task
structure that once belonged to the exiting tasks, it would obviously
be better to correctly handle this case. This commit therefore sets
->rcu_read_lock_nesting to 1 in that case, so that the subsequence call
to __rcu_read_unlock() causes the exiting task to exit its dangling RCU
read-side critical section.
Note that deferred quiescent states need not be considered. The reason
is that removing the task from the ->blkd_tasks[] list in the call to
rcu_preempt_deferred_qs() handles the per-task component of any deferred
quiescent state, and all other components of any deferred quiescent state
are associated with the CPU, which isn't going anywhere until some later
CPU-hotplug operation, which will report any remaining deferred quiescent
states from within the rcu_report_dead() function.
Note also that negative values of ->rcu_read_lock_nesting need not be
considered. First, these won't show up in exit_rcu() unless there is
a serious bug in RCU, and second, setting ->rcu_read_lock_nesting sets
the state so that the RCU read-side critical section will be exited
normally.
Again, this code has no effect unless there has been some prior bug
that prevents a task from leaving an RCU read-side critical section
before exiting. Furthermore, there have been no reports of the bug
fixed by this commit appearing in production. This commit is therefore
absolutely -not- recommended for backporting to -stable.
Reported-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
Reported-by: BHARATH Y MOURYA <bharathm@iisc.ac.in>
Reported-by: Aravinda Prasad <aravinda@iisc.ac.in>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
2019-02-11 23:21:29 +08:00
|
|
|
} else {
|
2013-04-12 01:15:52 +08:00
|
|
|
return;
|
rcu: Make exit_rcu() handle non-preempted RCU readers
The purpose of exit_rcu() is to handle cases where buggy code causes a
task to exit within an RCU read-side critical section. It currently
does that in the case where said RCU read-side critical section was
preempted at least once, but fails to handle cases where preemption did
not occur. This case needs to be handled because otherwise the final
context switch away from the exiting task will incorrectly behave as if
task exit were instead a preemption of an RCU read-side critical section,
and will therefore queue the exiting task. The exiting task will have
exited, and thus won't ever execute rcu_read_unlock(), which means that
it will remain queued forever, blocking all subsequent grace periods,
and eventually resulting in OOM.
Although this is arguably better than letting grace periods proceed
and having a later rcu_read_unlock() access the now-freed task
structure that once belonged to the exiting tasks, it would obviously
be better to correctly handle this case. This commit therefore sets
->rcu_read_lock_nesting to 1 in that case, so that the subsequence call
to __rcu_read_unlock() causes the exiting task to exit its dangling RCU
read-side critical section.
Note that deferred quiescent states need not be considered. The reason
is that removing the task from the ->blkd_tasks[] list in the call to
rcu_preempt_deferred_qs() handles the per-task component of any deferred
quiescent state, and all other components of any deferred quiescent state
are associated with the CPU, which isn't going anywhere until some later
CPU-hotplug operation, which will report any remaining deferred quiescent
states from within the rcu_report_dead() function.
Note also that negative values of ->rcu_read_lock_nesting need not be
considered. First, these won't show up in exit_rcu() unless there is
a serious bug in RCU, and second, setting ->rcu_read_lock_nesting sets
the state so that the RCU read-side critical section will be exited
normally.
Again, this code has no effect unless there has been some prior bug
that prevents a task from leaving an RCU read-side critical section
before exiting. Furthermore, there have been no reports of the bug
fixed by this commit appearing in production. This commit is therefore
absolutely -not- recommended for backporting to -stable.
Reported-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
Reported-by: BHARATH Y MOURYA <bharathm@iisc.ac.in>
Reported-by: Aravinda Prasad <aravinda@iisc.ac.in>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
2019-02-11 23:21:29 +08:00
|
|
|
}
|
2013-04-12 01:15:52 +08:00
|
|
|
__rcu_read_unlock();
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
rcu_preempt_deferred_qs(current);
|
2013-04-12 01:15:52 +08:00
|
|
|
}
|
|
|
|
|
2017-11-28 07:13:56 +08:00
|
|
|
/*
|
|
|
|
* Dump the blocked-tasks state, but limit the list dump to the
|
|
|
|
* specified number of elements.
|
|
|
|
*/
|
2018-05-09 05:18:57 +08:00
|
|
|
static void
|
2018-07-04 08:22:34 +08:00
|
|
|
dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
|
2017-11-28 07:13:56 +08:00
|
|
|
{
|
2018-05-09 05:18:57 +08:00
|
|
|
int cpu;
|
2017-11-28 07:13:56 +08:00
|
|
|
int i;
|
|
|
|
struct list_head *lhp;
|
2018-05-09 05:18:57 +08:00
|
|
|
bool onl;
|
|
|
|
struct rcu_data *rdp;
|
2018-05-09 03:50:14 +08:00
|
|
|
struct rcu_node *rnp1;
|
2017-11-28 07:13:56 +08:00
|
|
|
|
2018-03-09 09:32:18 +08:00
|
|
|
raw_lockdep_assert_held_rcu_node(rnp);
|
2018-05-09 03:50:14 +08:00
|
|
|
pr_info("%s: grp: %d-%d level: %d ->gp_seq %ld ->completedqs %ld\n",
|
2018-05-02 06:00:10 +08:00
|
|
|
__func__, rnp->grplo, rnp->grphi, rnp->level,
|
2020-01-05 03:33:17 +08:00
|
|
|
(long)READ_ONCE(rnp->gp_seq), (long)rnp->completedqs);
|
2018-05-09 03:50:14 +08:00
|
|
|
for (rnp1 = rnp; rnp1; rnp1 = rnp1->parent)
|
|
|
|
pr_info("%s: %d:%d ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx\n",
|
|
|
|
__func__, rnp1->grplo, rnp1->grphi, rnp1->qsmask, rnp1->qsmaskinit, rnp1->qsmaskinitnext);
|
2018-05-02 06:00:10 +08:00
|
|
|
pr_info("%s: ->gp_tasks %p ->boost_tasks %p ->exp_tasks %p\n",
|
2020-01-04 07:22:01 +08:00
|
|
|
__func__, READ_ONCE(rnp->gp_tasks), data_race(rnp->boost_tasks),
|
2020-01-04 06:18:12 +08:00
|
|
|
READ_ONCE(rnp->exp_tasks));
|
2018-05-02 06:00:10 +08:00
|
|
|
pr_info("%s: ->blkd_tasks", __func__);
|
2017-11-28 07:13:56 +08:00
|
|
|
i = 0;
|
|
|
|
list_for_each(lhp, &rnp->blkd_tasks) {
|
|
|
|
pr_cont(" %p", lhp);
|
2019-03-29 17:55:52 +08:00
|
|
|
if (++i >= ncheck)
|
2017-11-28 07:13:56 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
pr_cont("\n");
|
2018-05-09 05:18:57 +08:00
|
|
|
for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++) {
|
2018-07-04 06:37:16 +08:00
|
|
|
rdp = per_cpu_ptr(&rcu_data, cpu);
|
2018-05-09 05:18:57 +08:00
|
|
|
onl = !!(rdp->grpmask & rcu_rnp_online_cpus(rnp));
|
|
|
|
pr_info("\t%d: %c online: %ld(%d) offline: %ld(%d)\n",
|
|
|
|
cpu, ".o"[onl],
|
|
|
|
(long)rdp->rcu_onl_gp_seq, rdp->rcu_onl_gp_flags,
|
|
|
|
(long)rdp->rcu_ofl_gp_seq, rdp->rcu_ofl_gp_flags);
|
|
|
|
}
|
2017-11-28 07:13:56 +08:00
|
|
|
}
|
|
|
|
|
2014-09-23 02:00:48 +08:00
|
|
|
#else /* #ifdef CONFIG_PREEMPT_RCU */
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
2020-08-11 00:58:03 +08:00
|
|
|
/*
|
|
|
|
* If strict grace periods are enabled, and if the calling
|
|
|
|
* __rcu_read_unlock() marks the beginning of a quiescent state, immediately
|
|
|
|
* report that quiescent state and, if requested, spin for a bit.
|
|
|
|
*/
|
|
|
|
void rcu_read_unlock_strict(void)
|
|
|
|
{
|
|
|
|
struct rcu_data *rdp;
|
|
|
|
|
2021-08-27 10:21:22 +08:00
|
|
|
if (irqs_disabled() || preempt_count() || !rcu_state.gp_kthread)
|
2020-08-11 00:58:03 +08:00
|
|
|
return;
|
|
|
|
rdp = this_cpu_ptr(&rcu_data);
|
2020-08-21 02:26:14 +08:00
|
|
|
rcu_report_qs_rdp(rdp);
|
2020-08-11 00:58:03 +08:00
|
|
|
udelay(rcu_unlock_delay);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rcu_read_unlock_strict);
|
|
|
|
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
/*
|
|
|
|
* Tell them what RCU they are running.
|
|
|
|
*/
|
2009-11-12 03:28:06 +08:00
|
|
|
static void __init rcu_bootup_announce(void)
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
{
|
2013-03-19 07:24:11 +08:00
|
|
|
pr_info("Hierarchical RCU implementation.\n");
|
2010-04-14 05:19:23 +08:00
|
|
|
rcu_bootup_announce_oddness();
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
}
|
|
|
|
|
2018-07-03 05:30:37 +08:00
|
|
|
/*
|
2019-10-16 03:18:14 +08:00
|
|
|
* Note a quiescent state for PREEMPTION=n. Because we do not need to know
|
2018-07-03 05:30:37 +08:00
|
|
|
* how many quiescent states passed, just if there was at least one since
|
|
|
|
* the start of the grace period, this just sets a flag. The caller must
|
|
|
|
* have disabled preemption.
|
|
|
|
*/
|
|
|
|
static void rcu_qs(void)
|
2018-06-29 05:45:25 +08:00
|
|
|
{
|
2018-07-03 05:30:37 +08:00
|
|
|
RCU_LOCKDEP_WARN(preemptible(), "rcu_qs() invoked with preemption enabled!!!");
|
|
|
|
if (!__this_cpu_read(rcu_data.cpu_no_qs.s))
|
|
|
|
return;
|
|
|
|
trace_rcu_grace_period(TPS("rcu_sched"),
|
|
|
|
__this_cpu_read(rcu_data.gp_seq), TPS("cpuqs"));
|
|
|
|
__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
|
|
|
|
if (!__this_cpu_read(rcu_data.cpu_no_qs.b.exp))
|
|
|
|
return;
|
|
|
|
__this_cpu_write(rcu_data.cpu_no_qs.b.exp, false);
|
2018-07-04 08:22:34 +08:00
|
|
|
rcu_report_exp_rdp(this_cpu_ptr(&rcu_data));
|
2018-06-29 05:45:25 +08:00
|
|
|
}
|
|
|
|
|
2018-07-11 05:00:14 +08:00
|
|
|
/*
|
|
|
|
* Register an urgently needed quiescent state. If there is an
|
|
|
|
* emergency, invoke rcu_momentary_dyntick_idle() to do a heavy-weight
|
|
|
|
* dyntick-idle quiescent state visible to other CPUs, which will in
|
|
|
|
* some cases serve for expedited as well as normal grace periods.
|
|
|
|
* Either way, register a lightweight quiescent state.
|
|
|
|
*/
|
|
|
|
void rcu_all_qs(void)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
2018-08-04 12:00:38 +08:00
|
|
|
if (!raw_cpu_read(rcu_data.rcu_urgent_qs))
|
2018-07-11 05:00:14 +08:00
|
|
|
return;
|
|
|
|
preempt_disable();
|
|
|
|
/* Load rcu_urgent_qs before other flags. */
|
2018-08-04 12:00:38 +08:00
|
|
|
if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
|
2018-07-11 05:00:14 +08:00
|
|
|
preempt_enable();
|
|
|
|
return;
|
|
|
|
}
|
2018-08-04 12:00:38 +08:00
|
|
|
this_cpu_write(rcu_data.rcu_urgent_qs, false);
|
|
|
|
if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
|
2018-07-11 05:00:14 +08:00
|
|
|
local_irq_save(flags);
|
|
|
|
rcu_momentary_dyntick_idle();
|
|
|
|
local_irq_restore(flags);
|
|
|
|
}
|
2018-07-11 23:09:28 +08:00
|
|
|
rcu_qs();
|
2018-07-11 05:00:14 +08:00
|
|
|
preempt_enable();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rcu_all_qs);
|
|
|
|
|
2012-07-02 22:08:42 +08:00
|
|
|
/*
|
2019-10-16 03:18:14 +08:00
|
|
|
* Note a PREEMPTION=n context switch. The caller must have disabled interrupts.
|
2012-07-02 22:08:42 +08:00
|
|
|
*/
|
2018-07-03 05:30:37 +08:00
|
|
|
void rcu_note_context_switch(bool preempt)
|
2012-07-02 22:08:42 +08:00
|
|
|
{
|
2018-07-03 05:30:37 +08:00
|
|
|
trace_rcu_utilization(TPS("Start context switch"));
|
|
|
|
rcu_qs();
|
|
|
|
/* Load rcu_urgent_qs before other flags. */
|
2018-08-04 12:00:38 +08:00
|
|
|
if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
|
2018-07-03 05:30:37 +08:00
|
|
|
goto out;
|
2018-08-04 12:00:38 +08:00
|
|
|
this_cpu_write(rcu_data.rcu_urgent_qs, false);
|
|
|
|
if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs)))
|
2018-07-03 05:30:37 +08:00
|
|
|
rcu_momentary_dyntick_idle();
|
2020-03-17 11:38:29 +08:00
|
|
|
rcu_tasks_qs(current, preempt);
|
2018-07-03 05:30:37 +08:00
|
|
|
out:
|
|
|
|
trace_rcu_utilization(TPS("End context switch"));
|
2012-07-02 22:08:42 +08:00
|
|
|
}
|
2018-07-03 05:30:37 +08:00
|
|
|
EXPORT_SYMBOL_GPL(rcu_note_context_switch);
|
2012-07-02 22:08:42 +08:00
|
|
|
|
2009-09-24 00:50:41 +08:00
|
|
|
/*
|
2011-03-03 05:15:15 +08:00
|
|
|
* Because preemptible RCU does not exist, there are never any preempted
|
2009-09-24 00:50:41 +08:00
|
|
|
* RCU readers.
|
|
|
|
*/
|
2011-02-08 04:47:15 +08:00
|
|
|
static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
|
2009-09-24 00:50:41 +08:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-11-01 02:22:37 +08:00
|
|
|
/*
|
|
|
|
* Because there is no preemptible RCU, there can be no readers blocked.
|
|
|
|
*/
|
|
|
|
static bool rcu_preempt_has_tasks(struct rcu_node *rnp)
|
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure pass through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs go offline. (The events in step
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->block_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-23 00:53:48 +08:00
|
|
|
{
|
2014-11-01 02:22:37 +08:00
|
|
|
return false;
|
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure pass through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs go offline. (The events in step
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->block_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_lock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-23 00:53:48 +08:00
|
|
|
}
|
|
|
|
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-22 03:50:01 +08:00
|
|
|
/*
|
|
|
|
* Because there is no preemptible RCU, there can be no deferred quiescent
|
|
|
|
* states.
|
|
|
|
*/
|
|
|
|
static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
static void rcu_preempt_deferred_qs(struct task_struct *t) { }
|
|
|
|
|
2009-09-14 00:15:09 +08:00
|
|
|
/*
|
2011-03-03 05:15:15 +08:00
|
|
|
* Because there is no preemptible RCU, there can be no readers blocked,
|
2009-09-19 00:50:19 +08:00
|
|
|
* so there is no need to check for blocked tasks. So check only for
|
|
|
|
* bogus qsmask values.
|
2009-09-14 00:15:09 +08:00
|
|
|
*/
|
2018-07-04 08:22:34 +08:00
|
|
|
static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
|
2009-09-14 00:15:09 +08:00
|
|
|
{
|
2009-09-19 00:50:19 +08:00
|
|
|
WARN_ON_ONCE(rnp->qsmask);
|
2009-09-14 00:15:09 +08:00
|
|
|
}
|
|
|
|
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
/*
|
2018-11-22 03:35:03 +08:00
|
|
|
* Check to see if this CPU is in a non-context-switch quiescent state,
|
|
|
|
* namely user mode and idle loop.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
*/
|
2018-11-22 03:35:03 +08:00
|
|
|
static void rcu_flavor_sched_clock_irq(int user)
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
{
|
2018-07-03 05:30:37 +08:00
|
|
|
if (user || rcu_is_cpu_rrupt_from_idle()) {
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-23 04:56:52 +08:00
|
|
|
|
2018-07-03 05:30:37 +08:00
|
|
|
/*
|
|
|
|
* Get here if this CPU took its interrupt from user
|
|
|
|
* mode or from the idle loop, and if this is not a
|
|
|
|
* nested interrupt. In this case, the CPU is in
|
|
|
|
* a quiescent state, so note it.
|
|
|
|
*
|
|
|
|
* No memory barrier is required here because rcu_qs()
|
|
|
|
* references only CPU-local variables that other CPUs
|
|
|
|
* neither access nor modify, at least not while the
|
|
|
|
* corresponding CPU is online.
|
|
|
|
*/
|
|
|
|
|
|
|
|
rcu_qs();
|
|
|
|
}
|
2009-10-07 12:48:17 +08:00
|
|
|
}
|
|
|
|
|
2013-04-12 01:15:52 +08:00
|
|
|
/*
|
|
|
|
* Because preemptible RCU does not exist, tasks cannot possibly exit
|
|
|
|
* while in preemptible RCU read-side critical sections.
|
|
|
|
*/
|
|
|
|
void exit_rcu(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2017-11-28 07:13:56 +08:00
|
|
|
/*
|
|
|
|
* Dump the guaranteed-empty blocked-tasks state. Trust but verify.
|
|
|
|
*/
|
2018-05-09 05:18:57 +08:00
|
|
|
static void
|
2018-07-04 08:22:34 +08:00
|
|
|
dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
|
2017-11-28 07:13:56 +08:00
|
|
|
{
|
|
|
|
WARN_ON_ONCE(!list_empty(&rnp->blkd_tasks));
|
|
|
|
}
|
|
|
|
|
2014-09-23 02:00:48 +08:00
|
|
|
#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
|
2010-02-23 09:04:59 +08:00
|
|
|
|
2019-03-21 05:13:33 +08:00
|
|
|
/*
|
|
|
|
* If boosting, set rcuc kthreads to realtime priority.
|
|
|
|
*/
|
|
|
|
static void rcu_cpu_kthread_setup(unsigned int cpu)
|
|
|
|
{
|
2011-02-08 04:47:15 +08:00
|
|
|
#ifdef CONFIG_RCU_BOOST
|
2019-03-21 05:13:33 +08:00
|
|
|
struct sched_param sp;
|
2011-02-08 04:47:15 +08:00
|
|
|
|
2019-03-21 05:13:33 +08:00
|
|
|
sp.sched_priority = kthread_prio;
|
|
|
|
sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
|
|
|
|
#endif /* #ifdef CONFIG_RCU_BOOST */
|
2012-07-16 18:42:35 +08:00
|
|
|
}
|
|
|
|
|
2019-03-21 05:13:33 +08:00
|
|
|
#ifdef CONFIG_RCU_BOOST
|
|
|
|
|
2011-02-08 04:47:15 +08:00
|
|
|
/*
|
|
|
|
* Carry out RCU priority boosting on the task indicated by ->exp_tasks
|
|
|
|
* or ->boost_tasks, advancing the pointer to the next task in the
|
|
|
|
* ->blkd_tasks list.
|
|
|
|
*
|
|
|
|
* Note that irqs must be enabled: boosting the task can block.
|
|
|
|
* Returns 1 if there are more tasks needing to be boosted.
|
|
|
|
*/
|
|
|
|
static int rcu_boost(struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
struct task_struct *t;
|
|
|
|
struct list_head *tb;
|
|
|
|
|
2015-03-04 06:57:58 +08:00
|
|
|
if (READ_ONCE(rnp->exp_tasks) == NULL &&
|
|
|
|
READ_ONCE(rnp->boost_tasks) == NULL)
|
2011-02-08 04:47:15 +08:00
|
|
|
return 0; /* Nothing left to boost. */
|
|
|
|
|
2015-10-08 18:24:23 +08:00
|
|
|
raw_spin_lock_irqsave_rcu_node(rnp, flags);
|
2011-02-08 04:47:15 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Recheck under the lock: all tasks in need of boosting
|
|
|
|
* might exit their RCU read-side critical sections on their own.
|
|
|
|
*/
|
|
|
|
if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL) {
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2011-02-08 04:47:15 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Preferentially boost tasks blocking expedited grace periods.
|
|
|
|
* This cannot starve the normal grace periods because a second
|
|
|
|
* expedited grace period must boost all blocked tasks, including
|
|
|
|
* those blocking the pre-existing normal grace period.
|
|
|
|
*/
|
2018-01-11 04:16:42 +08:00
|
|
|
if (rnp->exp_tasks != NULL)
|
2011-02-08 04:47:15 +08:00
|
|
|
tb = rnp->exp_tasks;
|
2018-01-11 04:16:42 +08:00
|
|
|
else
|
2011-02-08 04:47:15 +08:00
|
|
|
tb = rnp->boost_tasks;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We boost task t by manufacturing an rt_mutex that appears to
|
|
|
|
* be held by task t. We leave a pointer to that rt_mutex where
|
|
|
|
* task t can find it, and task t will release the mutex when it
|
|
|
|
* exits its outermost RCU read-side critical section. Then
|
|
|
|
* simply acquiring this artificial rt_mutex will boost task
|
|
|
|
* t's priority. (Thanks to tglx for suggesting this approach!)
|
|
|
|
*
|
|
|
|
* Note that task t must acquire rnp->lock to remove itself from
|
|
|
|
* the ->blkd_tasks list, which it will do from exit() if from
|
|
|
|
* nowhere else. We therefore are guaranteed that task t will
|
|
|
|
* stay around at least until we drop rnp->lock. Note that
|
|
|
|
* rnp->lock also resolves races between our priority boosting
|
|
|
|
* and task t's exiting its outermost RCU read-side critical
|
|
|
|
* section.
|
|
|
|
*/
|
|
|
|
t = container_of(tb, struct task_struct, rcu_node_entry);
|
2021-08-16 05:27:58 +08:00
|
|
|
rt_mutex_init_proxy_locked(&rnp->boost_mtx.rtmutex, t);
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2014-06-13 04:30:25 +08:00
|
|
|
/* Lock only for side effect: boosts task t's priority. */
|
|
|
|
rt_mutex_lock(&rnp->boost_mtx);
|
|
|
|
rt_mutex_unlock(&rnp->boost_mtx); /* Then keep lockdep happy. */
|
2021-04-07 07:31:42 +08:00
|
|
|
rnp->n_boosts++;
|
2011-02-08 04:47:15 +08:00
|
|
|
|
2015-03-04 06:57:58 +08:00
|
|
|
return READ_ONCE(rnp->exp_tasks) != NULL ||
|
|
|
|
READ_ONCE(rnp->boost_tasks) != NULL;
|
2011-02-08 04:47:15 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-06-06 23:11:43 +08:00
|
|
|
* Priority-boosting kthread, one per leaf rcu_node.
|
2011-02-08 04:47:15 +08:00
|
|
|
*/
|
|
|
|
static int rcu_boost_kthread(void *arg)
|
|
|
|
{
|
|
|
|
struct rcu_node *rnp = (struct rcu_node *)arg;
|
|
|
|
int spincnt = 0;
|
|
|
|
int more2boost;
|
|
|
|
|
2013-07-13 05:18:47 +08:00
|
|
|
trace_rcu_utilization(TPS("Start boost kthread@init"));
|
2011-02-08 04:47:15 +08:00
|
|
|
for (;;) {
|
2020-01-09 12:12:59 +08:00
|
|
|
WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_WAITING);
|
2013-07-13 05:18:47 +08:00
|
|
|
trace_rcu_utilization(TPS("End boost kthread@rcu_wait"));
|
2020-01-04 07:22:01 +08:00
|
|
|
rcu_wait(READ_ONCE(rnp->boost_tasks) ||
|
|
|
|
READ_ONCE(rnp->exp_tasks));
|
2013-07-13 05:18:47 +08:00
|
|
|
trace_rcu_utilization(TPS("Start boost kthread@rcu_wait"));
|
2020-01-09 12:12:59 +08:00
|
|
|
WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_RUNNING);
|
2011-02-08 04:47:15 +08:00
|
|
|
more2boost = rcu_boost(rnp);
|
|
|
|
if (more2boost)
|
|
|
|
spincnt++;
|
|
|
|
else
|
|
|
|
spincnt = 0;
|
|
|
|
if (spincnt > 10) {
|
2020-01-09 12:12:59 +08:00
|
|
|
WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_YIELDING);
|
2013-07-13 05:18:47 +08:00
|
|
|
trace_rcu_utilization(TPS("End boost kthread@rcu_yield"));
|
2020-05-08 07:34:38 +08:00
|
|
|
schedule_timeout_idle(2);
|
2013-07-13 05:18:47 +08:00
|
|
|
trace_rcu_utilization(TPS("Start boost kthread@rcu_yield"));
|
2011-02-08 04:47:15 +08:00
|
|
|
spincnt = 0;
|
|
|
|
}
|
|
|
|
}
|
2011-05-05 12:43:49 +08:00
|
|
|
/* NOTREACHED */
|
2013-07-13 05:18:47 +08:00
|
|
|
trace_rcu_utilization(TPS("End boost kthread@notreached"));
|
2011-02-08 04:47:15 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check to see if it is time to start boosting RCU readers that are
|
|
|
|
* blocking the current grace period, and, if so, tell the per-rcu_node
|
|
|
|
* kthread to start boosting them. If there is an expedited grace
|
|
|
|
* period in progress, it is always time to boost.
|
|
|
|
*
|
2012-08-02 06:57:54 +08:00
|
|
|
* The caller must hold rnp->lock, which this function releases.
|
|
|
|
* The ->boost_kthread_task is immortal, so we don't need to worry
|
|
|
|
* about it going away.
|
2011-02-08 04:47:15 +08:00
|
|
|
*/
|
2011-05-05 12:43:49 +08:00
|
|
|
static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
|
2014-06-12 04:39:40 +08:00
|
|
|
__releases(rnp->lock)
|
2011-02-08 04:47:15 +08:00
|
|
|
{
|
2018-01-17 22:24:30 +08:00
|
|
|
raw_lockdep_assert_held_rcu_node(rnp);
|
2011-02-23 05:42:43 +08:00
|
|
|
if (!rcu_preempt_blocked_readers_cgp(rnp) && rnp->exp_tasks == NULL) {
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2011-02-08 04:47:15 +08:00
|
|
|
return;
|
2011-02-23 05:42:43 +08:00
|
|
|
}
|
2011-02-08 04:47:15 +08:00
|
|
|
if (rnp->exp_tasks != NULL ||
|
|
|
|
(rnp->gp_tasks != NULL &&
|
|
|
|
rnp->boost_tasks == NULL &&
|
|
|
|
rnp->qsmask == 0 &&
|
2020-04-11 06:52:53 +08:00
|
|
|
(!time_after(rnp->boost_time, jiffies) || rcu_state.cbovld))) {
|
2011-02-08 04:47:15 +08:00
|
|
|
if (rnp->exp_tasks == NULL)
|
2020-01-05 02:44:41 +08:00
|
|
|
WRITE_ONCE(rnp->boost_tasks, rnp->gp_tasks);
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2019-03-22 07:29:50 +08:00
|
|
|
rcu_wake_cond(rnp->boost_kthread_task,
|
2020-01-09 12:12:59 +08:00
|
|
|
READ_ONCE(rnp->boost_kthread_status));
|
2011-05-05 12:43:49 +08:00
|
|
|
} else {
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2011-05-05 12:43:49 +08:00
|
|
|
}
|
2011-02-08 04:47:15 +08:00
|
|
|
}
|
|
|
|
|
2011-11-30 07:57:13 +08:00
|
|
|
/*
|
|
|
|
* Is the current CPU running the RCU-callbacks kthread?
|
|
|
|
* Caller must have preemption disabled.
|
|
|
|
*/
|
|
|
|
static bool rcu_is_callbacks_kthread(void)
|
|
|
|
{
|
2018-12-01 08:11:14 +08:00
|
|
|
return __this_cpu_read(rcu_data.rcu_cpu_kthread_task) == current;
|
2011-11-30 07:57:13 +08:00
|
|
|
}
|
|
|
|
|
2011-02-08 04:47:15 +08:00
|
|
|
#define RCU_BOOST_DELAY_JIFFIES DIV_ROUND_UP(CONFIG_RCU_BOOST_DELAY * HZ, 1000)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do priority-boost accounting for the start of a new grace period.
|
|
|
|
*/
|
|
|
|
static void rcu_preempt_boost_start_gp(struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
rnp->boost_time = jiffies + RCU_BOOST_DELAY_JIFFIES;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create an RCU-boost kthread for the specified node if one does not
|
|
|
|
* already exist. We only create this kthread for preemptible RCU.
|
|
|
|
* Returns zero if all is well, a negated errno otherwise.
|
|
|
|
*/
|
2019-07-01 08:40:39 +08:00
|
|
|
static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
|
2011-02-08 04:47:15 +08:00
|
|
|
{
|
|
|
|
unsigned long flags;
|
rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one. Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread. Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure. There is no point in checking because
we know there is a CPU on its way in.
The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time. Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.
This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate. In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.
With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-04-06 11:42:09 +08:00
|
|
|
int rnp_index = rnp - rcu_get_root();
|
2011-02-08 04:47:15 +08:00
|
|
|
struct sched_param sp;
|
|
|
|
struct task_struct *t;
|
|
|
|
|
rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one. Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread. Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure. There is no point in checking because
we know there is a CPU on its way in.
The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time. Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.
This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate. In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.
With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-04-06 11:42:09 +08:00
|
|
|
if (rnp->boost_kthread_task || !rcu_scheduler_fully_active)
|
2019-07-01 08:40:39 +08:00
|
|
|
return;
|
2012-07-16 18:42:35 +08:00
|
|
|
|
2018-07-04 08:22:34 +08:00
|
|
|
rcu_state.boost = 1;
|
2019-07-01 08:40:39 +08:00
|
|
|
|
2011-02-08 04:47:15 +08:00
|
|
|
t = kthread_create(rcu_boost_kthread, (void *)rnp,
|
2011-08-20 02:39:11 +08:00
|
|
|
"rcub/%d", rnp_index);
|
2019-07-01 08:40:39 +08:00
|
|
|
if (WARN_ON_ONCE(IS_ERR(t)))
|
|
|
|
return;
|
|
|
|
|
2015-10-08 18:24:23 +08:00
|
|
|
raw_spin_lock_irqsave_rcu_node(rnp, flags);
|
2011-02-08 04:47:15 +08:00
|
|
|
rnp->boost_kthread_task = t;
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2014-09-13 10:21:09 +08:00
|
|
|
sp.sched_priority = kthread_prio;
|
2011-02-08 04:47:15 +08:00
|
|
|
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
|
2011-05-31 11:38:55 +08:00
|
|
|
wake_up_process(t); /* get to TASK_INTERRUPTIBLE quickly. */
|
2011-02-08 04:47:15 +08:00
|
|
|
}
|
|
|
|
|
2011-06-16 23:26:32 +08:00
|
|
|
/*
|
|
|
|
* Set the per-rcu_node kthread's affinity to cover all CPUs that are
|
|
|
|
* served by the rcu_node in question. The CPU hotplug lock is still
|
|
|
|
* held, so the value of rnp->qsmaskinit will be stable.
|
|
|
|
*
|
|
|
|
* We don't include outgoingcpu in the affinity set, use -1 if there is
|
|
|
|
* no outgoing CPU. If there are no CPUs left in the affinity set,
|
|
|
|
* this function allows the kthread to execute on any CPU.
|
|
|
|
*/
|
2012-07-16 18:42:35 +08:00
|
|
|
static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
|
2011-06-16 23:26:32 +08:00
|
|
|
{
|
2012-07-16 18:42:35 +08:00
|
|
|
struct task_struct *t = rnp->boost_kthread_task;
|
rcu: Process offlining and onlining only at grace-period start
Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events. Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.
This commit avoids these problems by buffering online and offline events
in a new ->qsmaskinitnext field in the leaf rcu_node structures. When a
grace period starts, the events accumulated in this mask are applied to
the ->qsmaskinit field, and, if needed, up the rcu_node tree. The special
case of all CPUs corresponding to a given leaf rcu_node structure being
offline while there are still elements in that structure's ->blkd_tasks
list is handled using a new ->wait_blkd_tasks field. In this case,
propagating the offline bits up the tree is deferred until the beginning
of the grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared. If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.
This of course means that RCU's notion of which CPUs are offline can be
out of date. This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started. In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-24 13:52:37 +08:00
|
|
|
unsigned long mask = rcu_rnp_online_cpus(rnp);
|
2011-06-16 23:26:32 +08:00
|
|
|
cpumask_var_t cm;
|
|
|
|
int cpu;
|
|
|
|
|
2012-07-16 18:42:35 +08:00
|
|
|
if (!t)
|
2011-06-16 23:26:32 +08:00
|
|
|
return;
|
2012-07-16 18:42:35 +08:00
|
|
|
if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
|
2011-06-16 23:26:32 +08:00
|
|
|
return;
|
rcu: Correctly handle sparse possible cpus
In many cases in the RCU tree code, we iterate over the set of cpus for
a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
per-cpu data for each cpu in this range. However, if the set of possible
cpus is sparse, some cpus described in this range are not possible, and
thus no per-cpu region will have been allocated (or initialised) for
them by the generic percpu code.
Erroneous accesses to a per-cpu area for these !possible cpus may fault
or may hit other data depending on the addressed generated when the
erroneous per cpu offset is applied. In practice, both cases have been
observed on arm64 hardware (the former being silent, but detectable with
additional patches).
To avoid issues resulting from this, we must iterate over the set of
*possible* cpus for a given leaf node. This patch add a new helper,
for_each_leaf_node_possible_cpu, to enable this. As iteration is often
intertwined with rcu_node local bitmask manipulation, a new
leaf_node_cpu_bit helper is added to make this simpler and more
consistent. The RCU tree code is made to use both of these where
appropriate.
Without this patch, running reboot at a shell can result in an oops
like:
[ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
[ 3369.083881] pgd = ffffffc3ecdda000
[ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
[ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[ 3369.101781] Modules linked in:
[ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G W 4.6.0+ #3
[ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
[ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
[ 3369.139479] pc : [<ffffff80081109a8>] lr : [<ffffff8008110924>] pstate: 200001c5
[ 3369.146860] sp : ffffffc3eb9435a0
[ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
[ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
[ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
[ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
[ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
[ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
[ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
[ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
[ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
[ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
[ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
[ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
[ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
[ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
[ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
...
[ 3369.972257] [<ffffff80081109a8>] sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.978685] [<ffffff80081128b4>] synchronize_rcu_expedited+0x64/0xa8
[ 3369.985026] [<ffffff80086b987c>] synchronize_net+0x24/0x30
[ 3369.990499] [<ffffff80086ddb54>] dev_deactivate_many+0x28c/0x298
[ 3369.996493] [<ffffff80086b6bb8>] __dev_close_many+0x60/0xd0
[ 3370.002052] [<ffffff80086b6d48>] __dev_close+0x28/0x40
[ 3370.007178] [<ffffff80086bf62c>] __dev_change_flags+0x8c/0x158
[ 3370.012999] [<ffffff80086bf718>] dev_change_flags+0x20/0x60
[ 3370.018558] [<ffffff80086cf7f0>] do_setlink+0x288/0x918
[ 3370.023771] [<ffffff80086d0798>] rtnl_newlink+0x398/0x6a8
[ 3370.029158] [<ffffff80086cee84>] rtnetlink_rcv_msg+0xe4/0x220
[ 3370.034891] [<ffffff80086e274c>] netlink_rcv_skb+0xc4/0xf8
[ 3370.040364] [<ffffff80086ced8c>] rtnetlink_rcv+0x2c/0x40
[ 3370.045663] [<ffffff80086e1fe8>] netlink_unicast+0x160/0x238
[ 3370.051309] [<ffffff80086e24b8>] netlink_sendmsg+0x2f0/0x358
[ 3370.056956] [<ffffff80086a0070>] sock_sendmsg+0x18/0x30
[ 3370.062168] [<ffffff80086a21cc>] ___sys_sendmsg+0x26c/0x280
[ 3370.067728] [<ffffff80086a30ac>] __sys_sendmsg+0x44/0x88
[ 3370.073027] [<ffffff80086a3100>] SyS_sendmsg+0x10/0x20
[ 3370.078153] [<ffffff8008085e70>] el0_svc_naked+0x24/0x28
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Dennis Chen <dennis.chen@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-06-03 22:20:04 +08:00
|
|
|
for_each_leaf_node_possible_cpu(rnp, cpu)
|
|
|
|
if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
|
|
|
|
cpu != outgoingcpu)
|
2011-06-16 23:26:32 +08:00
|
|
|
cpumask_set_cpu(cpu, cm);
|
2014-11-11 00:07:08 +08:00
|
|
|
if (cpumask_weight(cm) == 0)
|
2011-06-16 23:26:32 +08:00
|
|
|
cpumask_setall(cm);
|
2012-07-16 18:42:35 +08:00
|
|
|
set_cpus_allowed_ptr(t, cm);
|
2011-06-16 23:26:32 +08:00
|
|
|
free_cpumask_var(cm);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2014-07-14 03:00:53 +08:00
|
|
|
* Spawn boost kthreads -- called as soon as the scheduler is running.
|
2011-06-16 23:26:32 +08:00
|
|
|
*/
|
2014-07-14 03:00:53 +08:00
|
|
|
static void __init rcu_spawn_boost_kthreads(void)
|
2011-06-16 23:26:32 +08:00
|
|
|
{
|
|
|
|
struct rcu_node *rnp;
|
|
|
|
|
2018-07-05 05:33:59 +08:00
|
|
|
rcu_for_each_leaf_node(rnp)
|
rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one. Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread. Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure. There is no point in checking because
we know there is a CPU on its way in.
The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time. Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.
This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate. In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.
With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-04-06 11:42:09 +08:00
|
|
|
if (rcu_rnp_online_cpus(rnp))
|
|
|
|
rcu_spawn_one_boost_kthread(rnp);
|
2011-06-16 23:26:32 +08:00
|
|
|
}
|
|
|
|
|
2011-02-08 04:47:15 +08:00
|
|
|
#else /* #ifdef CONFIG_RCU_BOOST */
|
|
|
|
|
2011-05-05 12:43:49 +08:00
|
|
|
static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
|
2014-06-12 04:39:40 +08:00
|
|
|
__releases(rnp->lock)
|
2011-02-08 04:47:15 +08:00
|
|
|
{
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2011-02-08 04:47:15 +08:00
|
|
|
}
|
|
|
|
|
2011-11-30 07:57:13 +08:00
|
|
|
static bool rcu_is_callbacks_kthread(void)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2011-02-08 04:47:15 +08:00
|
|
|
static void rcu_preempt_boost_start_gp(struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one. Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread. Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure. There is no point in checking because
we know there is a CPU on its way in.
The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time. Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.
This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate. In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.
With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-04-06 11:42:09 +08:00
|
|
|
static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
|
2011-06-16 23:26:32 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one. Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread. Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure. There is no point in checking because
we know there is a CPU on its way in.
The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time. Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.
This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate. In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.
With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-04-06 11:42:09 +08:00
|
|
|
static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
|
2011-07-11 06:57:35 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
rcu: Make RCU priority boosting work on single-CPU rcu_node structures
When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one. Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread. Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure. There is no point in checking because
we know there is a CPU on its way in.
The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time. Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.
This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate. In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.
With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-04-06 11:42:09 +08:00
|
|
|
static void __init rcu_spawn_boost_kthreads(void)
|
2011-06-16 23:26:32 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-02-08 04:47:15 +08:00
|
|
|
#endif /* #else #ifdef CONFIG_RCU_BOOST */
|
|
|
|
|
2010-02-23 09:04:59 +08:00
|
|
|
#if !defined(CONFIG_RCU_FAST_NO_HZ)
|
|
|
|
|
|
|
|
/*
|
2019-08-13 01:28:08 +08:00
|
|
|
* Check to see if any future non-offloaded RCU-related work will need
|
|
|
|
* to be done by the current CPU, even if none need be done immediately,
|
|
|
|
* returning 1 if so. This function is part of the RCU implementation;
|
|
|
|
* it is -not- an exported member of the RCU API.
|
2010-02-23 09:04:59 +08:00
|
|
|
*
|
2018-07-08 09:12:26 +08:00
|
|
|
* Because we not have RCU_FAST_NO_HZ, just check whether or not this
|
|
|
|
* CPU has RCU callbacks queued.
|
2010-02-23 09:04:59 +08:00
|
|
|
*/
|
2015-04-15 05:08:58 +08:00
|
|
|
int rcu_needs_cpu(u64 basemono, u64 *nextevt)
|
2010-02-23 09:04:59 +08:00
|
|
|
{
|
2015-04-15 05:08:58 +08:00
|
|
|
*nextevt = KTIME_MAX;
|
2019-08-13 01:28:08 +08:00
|
|
|
return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
|
2020-11-12 08:51:21 +08:00
|
|
|
!rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
|
rcu: Permit dyntick-idle with callbacks pending
The current implementation of RCU_FAST_NO_HZ prevents CPUs from entering
dyntick-idle state if they have RCU callbacks pending. Unfortunately,
this has the side-effect of often preventing them from entering this
state, especially if at least one other CPU is not in dyntick-idle state.
However, the resulting per-tick wakeup is wasteful in many cases: if the
CPU has already fully responded to the current RCU grace period, there
will be nothing for it to do until this grace period ends, which will
frequently take several jiffies.
This commit therefore permits a CPU that has done everything that the
current grace period has asked of it (rcu_pending() == 0) even if it
still as RCU callbacks pending. However, such a CPU posts a timer to
wake it up several jiffies later (6 jiffies, based on experience with
grace-period lengths). This wakeup is required to handle situations
that can result in all CPUs being in dyntick-idle mode, thus failing
to ever complete the current grace period. If a CPU wakes up before
the timer goes off, then it cancels that timer, thus avoiding spurious
wakeups.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-11-29 04:28:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Because we do not have RCU_FAST_NO_HZ, don't bother cleaning up
|
|
|
|
* after it.
|
|
|
|
*/
|
2014-10-23 06:07:37 +08:00
|
|
|
static void rcu_cleanup_after_idle(void)
|
rcu: Permit dyntick-idle with callbacks pending
The current implementation of RCU_FAST_NO_HZ prevents CPUs from entering
dyntick-idle state if they have RCU callbacks pending. Unfortunately,
this has the side-effect of often preventing them from entering this
state, especially if at least one other CPU is not in dyntick-idle state.
However, the resulting per-tick wakeup is wasteful in many cases: if the
CPU has already fully responded to the current RCU grace period, there
will be nothing for it to do until this grace period ends, which will
frequently take several jiffies.
This commit therefore permits a CPU that has done everything that the
current grace period has asked of it (rcu_pending() == 0) even if it
still as RCU callbacks pending. However, such a CPU posts a timer to
wake it up several jiffies later (6 jiffies, based on experience with
grace-period lengths). This wakeup is required to handle situations
that can result in all CPUs being in dyntick-idle mode, thus failing
to ever complete the current grace period. If a CPU wakes up before
the timer goes off, then it cancels that timer, thus avoiding spurious
wakeups.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-11-29 04:28:34 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-11-02 21:54:54 +08:00
|
|
|
/*
|
2012-01-17 05:29:10 +08:00
|
|
|
* Do the idle-entry grace-period work, which, because CONFIG_RCU_FAST_NO_HZ=n,
|
2011-11-02 21:54:54 +08:00
|
|
|
* is nothing.
|
|
|
|
*/
|
2014-10-23 06:03:43 +08:00
|
|
|
static void rcu_prepare_for_idle(void)
|
2011-11-02 21:54:54 +08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2010-02-23 09:04:59 +08:00
|
|
|
#else /* #if !defined(CONFIG_RCU_FAST_NO_HZ) */
|
|
|
|
|
2011-12-01 07:41:14 +08:00
|
|
|
/*
|
|
|
|
* This code is invoked when a CPU goes idle, at which point we want
|
|
|
|
* to have the CPU do everything required for RCU so that it can enter
|
2019-08-31 00:36:32 +08:00
|
|
|
* the energy-efficient dyntick-idle mode.
|
2011-12-01 07:41:14 +08:00
|
|
|
*
|
2019-08-31 00:36:32 +08:00
|
|
|
* The following preprocessor symbol controls this:
|
2011-12-01 07:41:14 +08:00
|
|
|
*
|
|
|
|
* RCU_IDLE_GP_DELAY gives the number of jiffies that a CPU is permitted
|
|
|
|
* to sleep in dyntick-idle mode with RCU callbacks pending. This
|
|
|
|
* is sized to be roughly one RCU grace period. Those energy-efficiency
|
|
|
|
* benchmarkers who might otherwise be tempted to set this to a large
|
|
|
|
* number, be warned: Setting RCU_IDLE_GP_DELAY too high can hang your
|
|
|
|
* system. And if you are -that- concerned about energy efficiency,
|
|
|
|
* just power the system down and be done with it!
|
|
|
|
*
|
2019-08-31 00:36:32 +08:00
|
|
|
* The value below works well in practice. If future workloads require
|
2011-12-01 07:41:14 +08:00
|
|
|
* adjustment, they can be converted into kernel config parameters, though
|
|
|
|
* making the state machine smarter might be a better option.
|
|
|
|
*/
|
2012-06-05 11:45:10 +08:00
|
|
|
#define RCU_IDLE_GP_DELAY 4 /* Roughly one grace period. */
|
2011-12-01 07:41:14 +08:00
|
|
|
|
2012-12-13 04:35:29 +08:00
|
|
|
static int rcu_idle_gp_delay = RCU_IDLE_GP_DELAY;
|
|
|
|
module_param(rcu_idle_gp_delay, int, 0644);
|
2012-01-07 06:11:30 +08:00
|
|
|
|
|
|
|
/*
|
2018-07-08 09:12:26 +08:00
|
|
|
* Try to advance callbacks on the current CPU, but only if it has been
|
|
|
|
* awhile since the last time we did so. Afterwards, if there are any
|
|
|
|
* callbacks ready for immediate invocation, return true.
|
2012-01-07 06:11:30 +08:00
|
|
|
*/
|
2013-11-18 13:08:07 +08:00
|
|
|
static bool __maybe_unused rcu_try_advance_all_cbs(void)
|
2012-01-07 06:11:30 +08:00
|
|
|
{
|
2012-12-29 03:30:36 +08:00
|
|
|
bool cbs_ready = false;
|
2018-08-04 12:00:38 +08:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
2012-12-29 03:30:36 +08:00
|
|
|
struct rcu_node *rnp;
|
2012-01-07 06:11:30 +08:00
|
|
|
|
2013-08-26 12:20:47 +08:00
|
|
|
/* Exit early if we advanced recently. */
|
2018-08-04 12:00:38 +08:00
|
|
|
if (jiffies == rdp->last_advance_all)
|
2014-07-09 06:26:13 +08:00
|
|
|
return false;
|
2018-08-04 12:00:38 +08:00
|
|
|
rdp->last_advance_all = jiffies;
|
2013-08-26 12:20:47 +08:00
|
|
|
|
2018-07-05 06:35:00 +08:00
|
|
|
rnp = rdp->mynode;
|
2012-01-07 06:11:30 +08:00
|
|
|
|
2018-07-05 06:35:00 +08:00
|
|
|
/*
|
|
|
|
* Don't bother checking unless a grace period has
|
|
|
|
* completed since we last checked and there are
|
|
|
|
* callbacks not yet ready to invoke.
|
|
|
|
*/
|
|
|
|
if ((rcu_seq_completed_gp(rdp->gp_seq,
|
|
|
|
rcu_seq_current(&rnp->gp_seq)) ||
|
|
|
|
unlikely(READ_ONCE(rdp->gpwrap))) &&
|
|
|
|
rcu_segcblist_pend_cbs(&rdp->cblist))
|
|
|
|
note_gp_changes(rdp);
|
|
|
|
|
|
|
|
if (rcu_segcblist_ready_cbs(&rdp->cblist))
|
|
|
|
cbs_ready = true;
|
2012-12-29 03:30:36 +08:00
|
|
|
return cbs_ready;
|
2012-01-07 06:11:30 +08:00
|
|
|
}
|
|
|
|
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11 07:41:44 +08:00
|
|
|
/*
|
2012-12-29 03:30:36 +08:00
|
|
|
* Allow the CPU to enter dyntick-idle mode unless it has callbacks ready
|
|
|
|
* to invoke. If the CPU has callbacks, try to advance them. Tell the
|
2019-08-31 00:36:32 +08:00
|
|
|
* caller about what to set the timeout.
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11 07:41:44 +08:00
|
|
|
*
|
2012-12-29 03:30:36 +08:00
|
|
|
* The caller must have disabled interrupts.
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11 07:41:44 +08:00
|
|
|
*/
|
2015-04-15 05:08:58 +08:00
|
|
|
int rcu_needs_cpu(u64 basemono, u64 *nextevt)
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11 07:41:44 +08:00
|
|
|
{
|
2018-08-04 12:00:38 +08:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
2015-04-15 05:08:58 +08:00
|
|
|
unsigned long dj;
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11 07:41:44 +08:00
|
|
|
|
2017-11-06 23:01:30 +08:00
|
|
|
lockdep_assert_irqs_disabled();
|
2015-03-05 07:41:24 +08:00
|
|
|
|
2019-08-13 01:28:08 +08:00
|
|
|
/* If no non-offloaded callbacks, RCU doesn't need the CPU. */
|
|
|
|
if (rcu_segcblist_empty(&rdp->cblist) ||
|
2020-11-12 08:51:21 +08:00
|
|
|
rcu_rdp_is_offloaded(rdp)) {
|
2015-04-15 05:08:58 +08:00
|
|
|
*nextevt = KTIME_MAX;
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11 07:41:44 +08:00
|
|
|
return 0;
|
|
|
|
}
|
2012-12-29 03:30:36 +08:00
|
|
|
|
|
|
|
/* Attempt to advance callbacks. */
|
|
|
|
if (rcu_try_advance_all_cbs()) {
|
|
|
|
/* Some ready to invoke, so initiate later invocation. */
|
|
|
|
invoke_rcu_core();
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11 07:41:44 +08:00
|
|
|
return 1;
|
|
|
|
}
|
2018-08-04 12:00:38 +08:00
|
|
|
rdp->last_accelerate = jiffies;
|
2012-12-29 03:30:36 +08:00
|
|
|
|
2019-08-31 00:36:32 +08:00
|
|
|
/* Request timer and round. */
|
|
|
|
dj = round_up(rcu_idle_gp_delay + jiffies, rcu_idle_gp_delay) - jiffies;
|
|
|
|
|
2015-04-15 05:08:58 +08:00
|
|
|
*nextevt = basemono + dj * TICK_NSEC;
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-11 07:41:44 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-05-01 05:16:19 +08:00
|
|
|
/*
|
2019-08-31 00:36:32 +08:00
|
|
|
* Prepare a CPU for idle from an RCU perspective. The first major task is to
|
|
|
|
* sense whether nohz mode has been enabled or disabled via sysfs. The second
|
|
|
|
* major task is to accelerate (that is, assign grace-period numbers to) any
|
|
|
|
* recently arrived callbacks.
|
2011-11-02 21:54:54 +08:00
|
|
|
*
|
|
|
|
* The caller must have disabled interrupts.
|
2010-02-23 09:04:59 +08:00
|
|
|
*/
|
2014-10-23 06:03:43 +08:00
|
|
|
static void rcu_prepare_for_idle(void)
|
2010-02-23 09:04:59 +08:00
|
|
|
{
|
rcu: Make callers awaken grace-period kthread
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread. This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.
Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off. This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.
However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs. In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress. Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling. In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls. Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.
In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.
This commit therefore makes this change. The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-03-12 04:02:16 +08:00
|
|
|
bool needwake;
|
2018-08-04 12:00:38 +08:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
2012-12-29 03:30:36 +08:00
|
|
|
struct rcu_node *rnp;
|
2012-06-25 01:15:02 +08:00
|
|
|
int tne;
|
|
|
|
|
2017-11-06 23:01:30 +08:00
|
|
|
lockdep_assert_irqs_disabled();
|
2020-11-12 08:51:21 +08:00
|
|
|
if (rcu_rdp_is_offloaded(rdp))
|
2015-03-05 07:41:24 +08:00
|
|
|
return;
|
|
|
|
|
2012-06-25 01:15:02 +08:00
|
|
|
/* Handle nohz enablement switches conservatively. */
|
2015-03-04 06:57:58 +08:00
|
|
|
tne = READ_ONCE(tick_nohz_active);
|
2018-08-04 12:00:38 +08:00
|
|
|
if (tne != rdp->tick_nohz_enabled_snap) {
|
2018-11-30 05:28:49 +08:00
|
|
|
if (!rcu_segcblist_empty(&rdp->cblist))
|
2012-06-25 01:15:02 +08:00
|
|
|
invoke_rcu_core(); /* force nohz to see update. */
|
2018-08-04 12:00:38 +08:00
|
|
|
rdp->tick_nohz_enabled_snap = tne;
|
2012-06-25 01:15:02 +08:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (!tne)
|
|
|
|
return;
|
2012-03-16 03:16:26 +08:00
|
|
|
|
2011-11-23 09:07:11 +08:00
|
|
|
/*
|
2012-12-29 03:30:36 +08:00
|
|
|
* If we have not yet accelerated this jiffy, accelerate all
|
|
|
|
* callbacks on this CPU.
|
2011-11-23 09:07:11 +08:00
|
|
|
*/
|
2018-08-04 12:00:38 +08:00
|
|
|
if (rdp->last_accelerate == jiffies)
|
2011-11-02 21:54:54 +08:00
|
|
|
return;
|
2018-08-04 12:00:38 +08:00
|
|
|
rdp->last_accelerate = jiffies;
|
2018-07-05 06:35:00 +08:00
|
|
|
if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
|
2012-12-29 03:30:36 +08:00
|
|
|
rnp = rdp->mynode;
|
2015-10-08 18:24:23 +08:00
|
|
|
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
|
2018-07-04 08:22:34 +08:00
|
|
|
needwake = rcu_accelerate_cbs(rnp, rdp);
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
|
rcu: Make callers awaken grace-period kthread
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread. This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.
Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off. This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.
However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs. In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress. Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling. In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls. Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.
In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.
This commit therefore makes this change. The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-03-12 04:02:16 +08:00
|
|
|
if (needwake)
|
2018-07-04 08:22:34 +08:00
|
|
|
rcu_gp_kthread_wake();
|
2010-04-26 12:04:29 +08:00
|
|
|
}
|
2012-12-29 03:30:36 +08:00
|
|
|
}
|
2011-11-23 09:07:11 +08:00
|
|
|
|
2012-12-29 03:30:36 +08:00
|
|
|
/*
|
|
|
|
* Clean up for exit from idle. Attempt to advance callbacks based on
|
|
|
|
* any grace periods that elapsed while the CPU was idle, and if any
|
|
|
|
* callbacks are now ready to invoke, initiate invocation.
|
|
|
|
*/
|
2014-10-23 06:07:37 +08:00
|
|
|
static void rcu_cleanup_after_idle(void)
|
2012-12-29 03:30:36 +08:00
|
|
|
{
|
2019-04-13 06:58:34 +08:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
|
|
|
|
2017-11-06 23:01:30 +08:00
|
|
|
lockdep_assert_irqs_disabled();
|
2020-11-12 08:51:21 +08:00
|
|
|
if (rcu_rdp_is_offloaded(rdp))
|
2011-11-02 21:54:54 +08:00
|
|
|
return;
|
2013-08-23 09:16:16 +08:00
|
|
|
if (rcu_try_advance_all_cbs())
|
|
|
|
invoke_rcu_core();
|
2010-02-23 09:04:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
|
2012-01-17 05:29:10 +08:00
|
|
|
|
2013-11-09 01:03:10 +08:00
|
|
|
/*
|
|
|
|
* Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
|
|
|
|
* grace-period kthread will do force_quiescent_state() processing?
|
|
|
|
* The idea is to avoid waking up RCU core processing on such a
|
|
|
|
* CPU unless the grace period has extended for too long.
|
|
|
|
*
|
|
|
|
* This code relies on the fact that all NO_HZ_FULL CPUs are also
|
2014-02-09 21:35:11 +08:00
|
|
|
* CONFIG_RCU_NOCB_CPU CPUs.
|
2013-11-09 01:03:10 +08:00
|
|
|
*/
|
2018-07-04 08:22:34 +08:00
|
|
|
static bool rcu_nohz_full_cpu(void)
|
2013-11-09 01:03:10 +08:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_NO_HZ_FULL
|
|
|
|
if (tick_nohz_full_cpu(smp_processor_id()) &&
|
2018-07-04 08:22:34 +08:00
|
|
|
(!rcu_gp_in_progress() ||
|
2020-04-11 08:05:22 +08:00
|
|
|
time_before(jiffies, READ_ONCE(rcu_state.gp_start) + HZ)))
|
2015-03-31 07:46:16 +08:00
|
|
|
return true;
|
2013-11-09 01:03:10 +08:00
|
|
|
#endif /* #ifdef CONFIG_NO_HZ_FULL */
|
2015-03-31 07:46:16 +08:00
|
|
|
return false;
|
2013-11-09 01:03:10 +08:00
|
|
|
}
|
2014-04-02 02:20:36 +08:00
|
|
|
|
|
|
|
/*
|
2018-03-20 02:53:22 +08:00
|
|
|
* Bind the RCU grace-period kthreads to the housekeeping CPU.
|
2014-04-02 02:20:36 +08:00
|
|
|
*/
|
|
|
|
static void rcu_bind_gp_kthread(void)
|
|
|
|
{
|
2014-06-05 04:46:03 +08:00
|
|
|
if (!tick_nohz_full_enabled())
|
2014-04-02 02:20:36 +08:00
|
|
|
return;
|
2017-10-27 10:42:35 +08:00
|
|
|
housekeeping_affine(current, HK_FLAG_RCU);
|
2014-04-02 02:20:36 +08:00
|
|
|
}
|
2014-08-05 08:43:50 +08:00
|
|
|
|
|
|
|
/* Record the current task on dyntick-idle entry. */
|
2020-03-14 00:32:17 +08:00
|
|
|
static void noinstr rcu_dynticks_task_enter(void)
|
2014-08-05 08:43:50 +08:00
|
|
|
{
|
|
|
|
#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
|
2015-03-04 06:57:58 +08:00
|
|
|
WRITE_ONCE(current->rcu_tasks_idle_cpu, smp_processor_id());
|
2014-08-05 08:43:50 +08:00
|
|
|
#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Record no current task on dyntick-idle exit. */
|
2020-03-14 00:32:17 +08:00
|
|
|
static void noinstr rcu_dynticks_task_exit(void)
|
2014-08-05 08:43:50 +08:00
|
|
|
{
|
|
|
|
#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
|
2015-03-04 06:57:58 +08:00
|
|
|
WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
|
2014-08-05 08:43:50 +08:00
|
|
|
#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
|
|
|
|
}
|
rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks. Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode. Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.
Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU. This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task. And that task could switch from one CPU to
another at any time.
This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn. If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.
With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-20 06:33:12 +08:00
|
|
|
|
|
|
|
/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
|
|
|
|
static void rcu_dynticks_task_trace_enter(void)
|
|
|
|
{
|
2021-07-13 08:56:45 +08:00
|
|
|
#ifdef CONFIG_TASKS_TRACE_RCU
|
rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks. Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode. Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.
Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU. This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task. And that task could switch from one CPU to
another at any time.
This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn. If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.
With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-20 06:33:12 +08:00
|
|
|
if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
|
|
|
|
current->trc_reader_special.b.need_mb = true;
|
2021-07-13 08:56:45 +08:00
|
|
|
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
|
rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks. Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode. Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.
Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU. This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task. And that task could switch from one CPU to
another at any time.
This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn. If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.
With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-20 06:33:12 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
|
|
|
|
static void rcu_dynticks_task_trace_exit(void)
|
|
|
|
{
|
2021-07-13 08:56:45 +08:00
|
|
|
#ifdef CONFIG_TASKS_TRACE_RCU
|
rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks. Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode. Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.
Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU. This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task. And that task could switch from one CPU to
another at any time.
This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn. If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.
With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-20 06:33:12 +08:00
|
|
|
if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
|
|
|
|
current->trc_reader_special.b.need_mb = false;
|
2021-07-13 08:56:45 +08:00
|
|
|
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
|
rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks. Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode. Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.
Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU. This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task. And that task could switch from one CPU to
another at any time.
This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn. If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.
With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-20 06:33:12 +08:00
|
|
|
}
|