It is possible that the outgoing CPU is unaware of recent grace periods,
and so it is also possible that some of its pending callbacks are actually
ready to be invoked. The current callback-migration code would needlessly
force these callbacks to pass through another grace period. This commit
therefore invokes rcu_advance_cbs() on the outgoing CPU's callbacks in
order to give them full credit for having passed through any recent
grace periods.
This also fixes an odd theoretical bug where there are no callbacks in
the system except for those on the outgoing CPU, none of those callbacks
have yet been associated with a grace-period number, there is never again
another callback registered, and the surviving CPU never again takes a
scheduling-clock interrupt, never goes idle, and never enters nohz_full
userspace execution. Yes, this is (just barely) possible. It requires
that the surviving CPU be a nohz_full CPU, that its scheduler-clock
interrupt be shut off, and that it loop forever in the kernel. You get
bonus points if you can make this one happen! ;-)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU's CPU-hotplug callback-migration code first moves the outgoing
CPU's callbacks to ->orphan_done and ->orphan_pend, and only then
moves them to the NOCB callback list. This commit avoids the
extra step (and simplifies the code) by moving the callbacks directly
from the outgoing CPU's callback list to the NOCB callback list.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current CPU-hotplug RCU-callback-migration code checks
for the source (newly offlined) CPU being a NOCBs CPU down in
rcu_send_cbs_to_orphanage(). This commit simplifies callback migration a
bit by moving this check up to rcu_migrate_callbacks(). This commit also
adds a check for the source CPU having no callbacks, which eases analysis
of the rcu_send_cbs_to_orphanage() and rcu_adopt_orphan_cbs() functions.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_node structure's ->n_cbs_orphaned and ->n_cbs_adopted fields
are updated, but never read. This commit therefore removes them.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The update of the ->expmaskinitnext and of ->ncpus are unsynchronized,
with the value of ->ncpus being incremented long before the corresponding
->expmaskinitnext mask is updated. If an RCU expedited grace period
sees ->ncpus change, it will update the ->expmaskinit masks from the new
->expmaskinitnext masks. But it is possible that ->ncpus has already
been updated, but the ->expmaskinitnext masks still have their old values.
For the current expedited grace period, no harm done. The CPU could not
have been online before the grace period started, so there is no need to
wait for its non-existent pre-existing readers.
But the next RCU expedited grace period is in a world of hurt. The value
of ->ncpus has already been updated, so this grace period will assume
that the ->expmaskinitnext masks have not changed. But they have, and
they won't be taken into account until the next never-been-online CPU
comes online. This means that RCU will be ignoring some CPUs that it
should be paying attention to.
The solution is to update ->ncpus and ->expmaskinitnext while holding
the ->lock for the rcu_node structure containing the ->expmaskinitnext
mask. Because smp_store_release() is now used to update ->ncpus and
smp_load_acquire() is now used to locklessly read it, if the expedited
grace period sees ->ncpus change, then the updating CPU has to
already be holding the corresponding ->lock. Therefore, when the
expedited grace period later acquires that ->lock, it is guaranteed
to see the new value of ->expmaskinitnext.
On the other hand, if the expedited grace period loads ->ncpus just
before an update, earlier full memory barriers guarantee that
the incoming CPU isn't far enough along to be running any RCU readers.
This commit therefore makes the required change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU callbacks must be migrated away from an outgoing CPU, and this is
done near the end of the CPU-hotplug operation, after the outgoing CPU is
long gone. Unfortunately, this means that other CPU-hotplug callbacks
can execute while the outgoing CPU's callbacks are still immobilized
on the long-gone CPU's callback lists. If any of these CPU-hotplug
callbacks must wait, either directly or indirectly, for the invocation
of any of the immobilized RCU callbacks, the system will hang.
This commit avoids such hangs by migrating the callbacks away from the
outgoing CPU immediately upon its departure, shortly after the return
from __cpu_die() in takedown_cpu(). Thus, RCU is able to advance these
callbacks and invoke them, which allows all the after-the-fact CPU-hotplug
callbacks to wait on these RCU callbacks without risk of a hang.
While in the neighborhood, this commit also moves rcu_send_cbs_to_orphanage()
and rcu_adopt_orphan_cbs() under a pre-existing #ifdef to avoid including
dead code on the one hand and to avoid define-without-use warnings on the
other hand.
Reported-by: Jeffrey Hugo <jhugo@codeaurora.org>
Link: http://lkml.kernel.org/r/db9c91f6-1b17-6136-84f0-03c3c2581ab4@codeaurora.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Richard Weinberger <richard@nod.at>
This commit augments the grace-period-kthread starvation debugging
messages by adding the last CPU that ran the kthread.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The NO_HZ_FULL_SYSIDLE full-system-idle capability was added in 2013
by commit 0edd1b1784 ("nohz_full: Add full-system-idle state machine"),
but has not been used. This commit therefore removes it.
If it turns out to be needed later, this commit can always be reverted.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Anything that can be done with the RCU_KTHREAD_PRIO Kconfig option can
also be done with the rcutree.kthread_prio kernel boot parameter.
This commit therefore removes this Kconfig option.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
RCU_TORTURE_TEST_SLOW_PREINIT_DELAY, RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead. The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.
This commit therefore simplifies Tree RCU a bit by removing the Kconfig
options and adding the corresponding kernel parameters to rcutorture's
.boot files instead. However, this commit also leaves out the kernel
parameters for TREE02, TREE04, and TREE07 in order to have about the
same number of tests slowed as not slowed. TREE01, TREE03, TREE05,
and TREE06 are slowed, and the rest are not slowed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The "__call_rcu(): Leaked duplicate callback" error message from
__call_rcu() has proven to be unhelpful. This commit therefore changes
it to "__call_rcu(): Double-freed CB" and adds the value of the pointer
passed in. The value of the pointer improves debuggability by allowing
correlation with tracing output, for example, the rcu:rcu_callback trace
event.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The __rcu_is_watching() function is currently not used, aside from
to implement the rcu_is_watching() function. This commit therefore
eliminates __rcu_is_watching(), which has the beneficial side-effect
of shrinking include/linux/rcupdate.h a bit.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The include/linux/rcupdate.h file is included by more than 200
files, so shrinking it should provide some build-time benefits.
This commit therefore moves several docbook comments from rcupdate.h to
kernel/rcu/update.c, kernel/rcu/tree.c, and kernel/rcu/tree_plugin.h, thus
reducing the number of times that the compiler has to scan these comments.
This likely provides only a small benefit, but every little bit helps.
This commit also fixes a malformed bulleted list noted by the 0day
Test Robot.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Comments can be helpful, but assertions carry more force. This
commit therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN()
calls to enforce lock-held and interrupt-disabled preconditions.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit updates rcu_bootup_announce_oddness() to check additional
Kconfig options and module/boot parameters.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds WARN_ON_ONCE() calls that trigger if either
rcu_sched_qs() or rcu_bh_qs() are invoked with preemption enabled.
In the immortal words of Peter Zijlstra: "these are much harder to ignore
than comments".
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The synchronize_kernel() primitive was removed in favor of
synchronize_sched() more than a decade ago, and it seems likely that
rather few kernel hackers are familiar with it. Its continued presence
is therefore providing more confusion than enlightenment. This commit
therefore removes the reference from the synchronize_sched() header
comment, and adds the corresponding information to the synchronize_rcu(0
header comment.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although preemptible RCU allows its read-side critical sections to be
preempted, general blocking is forbidden. The reason for this is that
excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a
voluntarily blocked task doesn't care how high you boost its priority.
Because preemptible RCU is a global mechanism, one ill-behaved reader
hurts everyone. Hence the prohibition against general blocking in
RCU-preempt read-side critical sections. Preemption yes, blocking no.
This commit enforces this prohibition.
There is a special exception for the -rt patchset (which they kindly
volunteered to implement): It is OK to block (as opposed to merely being
preempted) within an RCU-preempt read-side critical section, but only if
the blocking is subject to priority inheritance. This exception permits
CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble.
Why doesn't this exception also apply to mainline's rt_mutex? Because
of the possibility that someone does general blocking while holding
an rt_mutex. Yes, the priority boosting will affect the rt_mutex,
but it won't help with the task doing general blocking while holding
that rt_mutex.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently rcu_barrier() uses call_rcu() to enqueue new callbacks
on each CPU with a non-empty callback list. This works, but means
that rcu_barrier() forces grace periods that are not otherwise needed.
The key point is that rcu_barrier() never needs to wait for a grace
period, but instead only for all pre-existing callbacks to be invoked.
This means that rcu_barrier()'s new callbacks should be placed in
the callback-list segment containing the last pre-existing callback.
This commit makes this change using the new rcu_segcblist_entrain()
function.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull RCU updates from Ingo Molnar:
"The main changes are:
- Debloat RCU headers
- Parallelize SRCU callback handling (plus overlapping patches)
- Improve the performance of Tree SRCU on a CPU-hotplug stress test
- Documentation updates
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
rcu: Open-code the rcu_cblist_n_lazy_cbs() function
rcu: Open-code the rcu_cblist_n_cbs() function
rcu: Open-code the rcu_cblist_empty() function
rcu: Separately compile large rcu_segcblist functions
srcu: Debloat the <linux/rcu_segcblist.h> header
srcu: Adjust default auto-expediting holdoff
srcu: Specify auto-expedite holdoff time
srcu: Expedite first synchronize_srcu() when idle
srcu: Expedited grace periods with reduced memory contention
srcu: Make rcutorture writer stalls print SRCU GP state
srcu: Exact tracking of srcu_data structures containing callbacks
srcu: Make SRCU be built by default
srcu: Fix Kconfig botch when SRCU not selected
rcu: Make non-preemptive schedule be Tasks RCU quiescent state
srcu: Expedite srcu_schedule_cbs_snp() callback invocation
srcu: Parallelize callback handling
kvm: Move srcu_struct fields to end of struct kvm
rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
rcu: Use true/false in assignment to bool
rcu: Use bool value directly
...
Because the rcu_cblist_n_lazy_cbs() just samples the ->len_lazy counter,
and because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_n_lazy_cbs(p) as p->len_lazy, cutting out
a level of indirection. This commit makes this change.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Because the rcu_cblist_n_cbs() just samples the ->len counter, and
because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_n_cbs(p) as p->len, cutting out a level
of indirection. This commit makes this change.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Because the rcu_cblist_empty() just samples the ->head pointer, and
because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_empty(p) as !p->head, cutting out a
level of indirection. This commit makes this change.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
In the past, SRCU was simple enough that there was little point in
making the rcutorture writer stall messages print the SRCU grace-period
number state. With the advent of Tree SRCU, this has changed. This
commit therefore makes Classic, Tiny, and Tree SRCU report this state
to rcutorture as needed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Currently, a call to schedule() acts as a Tasks RCU quiescent state
only if a context switch actually takes place. However, just the
call to schedule() guarantees that the calling task has moved off of
whatever tracing trampoline that it might have been one previously.
This commit therefore plumbs schedule()'s "preempt" parameter into
rcu_note_context_switch(), which then records the Tasks RCU quiescent
state, but only if this call to schedule() was -not- due to a preemption.
To avoid adding overhead to the common-case context-switch path,
this commit hides the rcu_note_context_switch() check under an existing
non-common-case check.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2],
however, there are workloads that could result in a high volume of
concurrent invocations of call_srcu(), which with current SRCU would
result in excessive lock contention on the srcu_struct structure's
->queue_lock, which protects SRCU's callback lists. This commit therefore
moves SRCU to per-CPU callback lists, thus greatly reducing contention.
Because a given SRCU instance no longer has a single centralized callback
list, starting grace periods and invoking callbacks are both more complex
than in the single-list Classic SRCU implementation. Starting grace
periods and handling callbacks are now handled using an srcu_node tree
that is in some ways similar to the rcu_node trees used by RCU-bh,
RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
controlled by exactly the same Kconfig options and boot parameters that
control the shape of the rcu_node tree).
In addition, the old per-CPU srcu_array structure is now named srcu_data
and contains an rcu_segcblist structure named ->srcu_cblist for its
callbacks (and a spinlock to protect this). The srcu_struct gets
an srcu_gp_seq that is used to associate callback segments with the
corresponding completion-time grace-period number. These completion-time
grace-period numbers are propagated up the srcu_node tree so that the
grace-period workqueue handler can determine whether additional grace
periods are needed on the one hand and where to look for callbacks that
are ready to be invoked.
The srcu_barrier() function must now wait on all instances of the per-CPU
->srcu_cblist. Because each ->srcu_cblist is protected by ->lock,
srcu_barrier() can remotely add the needed callbacks. In theory,
it could also remotely start grace periods, but in practice doing so
is complex and racy. And interestingly enough, it is never necessary
for srcu_barrier() to start a grace period because srcu_barrier() only
enqueues a callback when a callback is already present--and it turns out
that a grace period has to have already been started for this pre-existing
callback. Furthermore, it is only the callback that srcu_barrier()
needs to wait on, not any particular grace period. Therefore, a new
rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
callback into the same segment occupied by the last pre-existing callback
in the list. The special case where all the pre-existing callbacks are
on a different list (because they are in the process of being invoked)
is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
segment, relying on the done-callbacks check that takes place after all
callbacks are inovked.
Note that the readers use the same algorithm as before. Note that there
is a separate srcu_idx that tells the readers what counter to increment.
This unfortunately cannot be combined with srcu_gp_seq because they
need to be incremented at different times.
This commit introduces some ugly #ifdefs in rcutorture. These will go
away when I feel good enough about Tree SRCU to ditch Classic SRCU.
Some crude performance comparisons, courtesy of a quickly hacked rcuperf
asynchronous-grace-period capability:
Callback Queuing Overhead
-------------------------
# CPUS Classic SRCU Tree SRCU
------ ------------ ---------
2 0.349 us 0.342 us
16 31.66 us 0.4 us
41 --------- 0.417 us
The times are the 90th percentiles, a statistic that was chosen to reject
the overheads of the occasional srcu_barrier() call needed to avoid OOMing
the test machine. The rcuperf test hangs when running Classic SRCU at 41
CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code
and that statistics, this is a convincing demonstration of Tree SRCU's
performance and scalability advantages.
[1] https://lwn.net/Articles/309030/
[2] https://patchwork.kernel.org/patch/5108281/
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
This commit just changes a "the the" to "the" to reduce repetition.
Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The beenonline variable is declared bool so there is no need for an
explicit comparison, especially not against the constant zero.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_nocb_gp_cleanup() function is now invoked elsewhere, so this
commit drags this comment into the year 2017.
Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the num_rcu_lvl[] array external so that SRCU can
make use of it for initializing its upcoming srcu_node tree.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves the rcu_init_levelspread() function from
kernel/rcu/tree.c to kernel/rcu/rcu.h so that SRCU can access it. This is
another step towards enabling SRCU to create its own combining tree.
This commit is code-movement only, give or take knock-on adjustments.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(),
and rcu_seq_done() from kernel/rcu/tree.c to kernel/rcu/rcu.h.
This will allow SRCU to use these functions, which in turn will
allow SRCU to move from a single global callback queue to a
per-CPU callback queue.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This is primarily a code-movement commit in preparation for allowing
SRCU to handle early-boot SRCU grace periods.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU has only one multi-tail callback list, which is implemented via
the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the
rcu_data structure, and whose operations are open-code throughout the
Tree RCU implementation. This has been more or less OK in the past,
but upcoming callback-list optimizations in SRCU could really use
a multi-tail callback list there as well.
This commit therefore abstracts the multi-tail callback list handling
into a new kernel/rcu/rcu_segcblist.h file, and uses this new API.
The simple head-and-tail pointer callback list is also abstracted and
applied everywhere except for the NOCB callback-offload lists. (Yes,
the plan is to apply them there as well, but this commit is already
bigger than would be good.)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_all_qs() and rcu_note_context_switch() do a series of checks,
taking various actions to supply RCU with quiescent states, depending
on the outcomes of the various checks. This is a bit much for scheduling
fastpaths, so this commit creates a separate ->rcu_urgent_qs field in
the rcu_dynticks structure that acts as a global guard for these checks.
Thus, in the common case, rcu_all_qs() and rcu_note_context_switch()
check the ->rcu_urgent_qs field, find it false, and simply return.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking
that one of them still needs a quiescent state before doing an expensive
atomic operation on the ->dynticks counter. However, this check reduces
overhead only after a rare race condition, and increases complexity. This
commit therefore removes the scan and the mechanism enabling the scan.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_qs_ctr variable is yet another isolated per-CPU variable,
so this commit pulls it into the pre-existing rcu_dynticks per-CPU
structure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_sched_qs_mask variable is yet another isolated per-CPU variable,
so this commit pulls it into the pre-existing rcu_dynticks per-CPU
structure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current use of "RCU_TRACE(statement);" can cause odd bugs, especially
where "statement" is a local-variable declaration, as it can leave a
misplaced ";" in the source code. This commit therefore converts these
to "RCU_TRACE(statement;)", which avoids the misplaced ";".
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, IPIs are used to force other CPUs to invalidate their TLBs
in response to a kernel virtual-memory mapping change. This works, but
degrades both battery lifetime (for idle CPUs) and real-time response
(for nohz_full CPUs), and in addition results in unnecessary IPIs due to
the fact that CPUs executing in usermode are unaffected by stale kernel
mappings. It would be better to cause a CPU executing in usermode to
wait until it is entering kernel mode to do the flush, first to avoid
interrupting usemode tasks and second to handle multiple flush requests
with a single flush in the case of a long-running user task.
This commit therefore reserves a bit at the bottom of the ->dynticks
counter, which is checked upon exit from extended quiescent states.
If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
invoked, which, if not supplied, is an empty single-pass do-while loop.
If this bottom bit is set on -entry- to an extended quiescent state,
then a WARN_ON_ONCE() triggers.
This bottom bit may be set using a new rcu_eqs_special_set() function,
which returns true if the bit was set, or false if the CPU turned
out to not be in an extended quiescent state. Please note that this
function refuses to set the bit for a non-nohz_full CPU when that CPU
is executing in usermode because usermode execution is tracked by RCU
as a dyntick-idle extended quiescent state only for nohz_full CPUs.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tracing uses rcu_irq_enter() as a way to make sure that RCU is watching when
it needs to use rcu_read_lock() and friends. This is because tracing can
happen as RCU is about to enter user space, or about to go idle, and RCU
does not watch for RCU read side critical sections as it makes the
transition.
There is a small location within the RCU infrastructure that rcu_irq_enter()
itself will not work. If tracing were to occur in that section it will break
if it tries to use rcu_irq_enter().
Originally, this happens with the stack_tracer, because it will call
save_stack_trace when it encounters stack usage that is greater than any
stack usage it had encountered previously. There was a case where that
happened in the RCU section where rcu_irq_enter() did not work, and lockdep
complained loudly about it. To fix it, stack tracing added a call to be
disabled and RCU would disable stack tracing during the critical section
that rcu_irq_enter() was inoperable. This solution worked, but there are
other cases that use rcu_irq_enter() and it would be a good idea to let RCU
give a way to let others know that rcu_irq_enter() will not work. For
example, in trace events.
Another helpful aspect of this change is that it also moves the per cpu
variable called in the RCU critical section into a cache locale along with
other RCU per cpu variables used in that same location.
I'm keeping the stack_trace_disable() code, as that still could be used in
the future by places that really need to disable it. And since it's only a
static inline, it wont take up any kernel text if it is not used.
Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.home
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The tracing subsystem started using rcu_irq_entry() and rcu_irq_exit()
(with my blessing) to allow the current _rcuidle alternative tracepoint
name to be dispensed with while still maintaining good performance.
Unfortunately, this causes RCU's dyntick-idle entry code's tracing to
appear to RCU like an interrupt that occurs where RCU is not designed
to handle interrupts.
This commit fixes this problem by moving the zeroing of ->dynticks_nesting
after the offending trace_rcu_dyntick() statement, which narrows the
window of vulnerability to a pair of adjacent statements that are now
marked with comments to that effect.
Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.home
Link: http://lkml.kernel.org/r/20170405193928.GM1600@linux.vnet.ibm.com
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.
Create empty placeholder header that maps to <linux/types.h>.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So rcupdate.h is a pretty complex header, in particular it includes
<linux/completion.h> which includes <linux/wait.h> - creating a
dependency that includes <linux/wait.h> in <linux/sched.h>,
which prevents the isolation of <linux/sched.h> from the derived
<linux/wait.h> header.
Solve part of the problem by decoupling rcupdate.h from completions:
this can be done by separating out the rcu_synchronize types and APIs,
and updating their usage sites.
Since this is a mostly RCU-internal types this will not just simplify
<linux/sched.h>'s dependencies, but will make all the hundreds of
.c files that include rcupdate.h but not completions or wait.h build
faster.
( For rcutiny this means that two dependent APIs have to be uninlined,
but that shouldn't be much of a problem as they are rare variants. )
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 7ec99de36f ("rcu: Provide exact CPU-online tracking for RCU"),
as its title suggests, got rid of RCU's remaining CPU-hotplug timing
guesswork. This commit therefore removes the one-jiffy kludge that was
used to paper over this guesswork.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Commit 4a81e8328d ("rcu: Reduce overhead of cond_resched() checks
for RCU") moved quiescent-state generation out of cond_resched()
and commit bde6c3aa99 ("rcu: Provide cond_resched_rcu_qs() to force
quiescent states in long loops") introduced cond_resched_rcu_qs(), and
commit 5cd37193ce ("rcu: Make cond_resched_rcu_qs() apply to normal RCU
flavors") introduced the per-CPU rcu_qs_ctr variable, which is frequently
polled by the RCU core state machine.
This frequent polling can increase grace-period rate, which in turn
increases grace-period overhead, which is visible in some benchmarks
(for example, the "open1" benchmark in Anton Blanchard's "will it scale"
suite). This commit therefore reduces the rate at which rcu_qs_ctr
is polled by moving that polling into the force-quiescent-state (FQS)
machinery, and by further polling it only after the grace period has
been in effect for at least jiffies_till_sched_qs jiffies.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit is the fourth step towards full abstraction of all accesses
to the ->dynticks counter, implementing previously open-coded checks and
comparisons in new rcu_dynticks_in_eqs() and rcu_dynticks_in_eqs_since()
functions. This abstraction will ease changes to the ->dynticks counter
operation.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit is the third step towards full abstraction of all accesses
to the ->dynticks counter, implementing the previously open-coded atomic
add of 1 and entry checks in a new rcu_dynticks_eqs_enter() function, and
the same but with exit checks in a new rcu_dynticks_eqs_exit() function.
This abstraction will ease changes to the ->dynticks counter operation.
Note that this commit gets rid of the smp_mb__before_atomic() and the
smp_mb__after_atomic() calls that were previously present. The reason
that this is OK from a memory-ordering perspective is that the atomic
operation is now atomic_add_return(), which, as a value-returning atomic,
guarantees full ordering.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fixed RCU_TRACE() statements added by this commit. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu_cpu_starting() function uses this_cpu_ptr() to locate the
incoming CPU's rcu_data structure. This works for the boot CPU and for
all CPUs onlined after rcu_init() executes (during very early boot).
Currently, this is the full set of CPUs, so all is well. But if
anyone ever parallelizes boot before rcu_init() time, it will fail.
This commit therefore substitutes the rcu_cpu_starting() function's
this_cpu_pointer() for per_cpu_ptr(), future-proofing the code and
(arguably) improving readability.
This commit inadvertently fixes a latent bug: If there ever had been
more than just the boot CPU online at rcu_init() time, the old code
would not initialize the non-boot CPUs, but rather would repeatedly
initialize the boot CPU.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Chris Friesen notice that rcuc/X kthreads were consuming CPU even on
NOCB CPUs. This makes no sense because the only purpose or these
kthreads is to invoke normal (non-offloaded) callbacks, of which there
will never be any on NOCB CPUs. This problem was due to a bug in
cpu_has_callbacks_ready_to_invoke(), which should have been checking
->nxttail[RCU_NEXT_TAIL] for NULL, but which was instead (incorrectly)
checking ->nxttail[RCU_DONE_TAIL]. Because ->nxttail[RCU_DONE_TAIL] is
never NULL, the only effect is to cause the rcuc/X kthread to execute
when it should not do so.
This commit therefore checks ->nxttail[RCU_NEXT_TAIL], which is NULL
for NOCB CPUs.
Reported-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit is for all intents and purposes a revert of bc1dce514e
("rcu: Don't use NMIs to dump other CPUs' stacks"). The reason to suppose
that this can now safely be reverted is the presence of 42a0bb3f71
("printk/nmi: generic solution for safe printk in NMI"), which is said
to have made NMI-based stack dumps safe.
However, this reversion keeps one nice property of bc1dce514e
("rcu: Don't use NMIs to dump other CPUs' stacks"), namely that
only those CPUs blocking the grace period are dumped. The new
trigger_single_cpu_backtrace() is used to make this happen, as
suggested by Josh Poimboeuf.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Commit 4914950aaa ("rcu: Stop treating in-kernel CPU-bound workloads
as errors") added a (relatively) short-timeout call to resched_cpu().
This was inspired by as issue that was fixed by b7e7ade34e ("sched/core:
Fix remote wakeups"). But given that this issue was fixed, it is time
for the current commit to remove this call to resched_cpu().
Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit prepares for the removal of short-term CPU kicking (in a
subsequent commit). It does so by starting to invoke resched_cpu()
for each holdout at each force-quiescent-state interval that is more
than halfway through the stall-warning interval.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Since commit 7ec99de36f ("rcu: Provide exact CPU-online tracking for
RCU"), the variable mask in rcu_init_percpu_data is set but no longer
used. Remove it to fix the following warning when building with 'W=1':
kernel/rcu/tree.c: In function ‘rcu_init_percpu_data’:
kernel/rcu/tree.c:3765:16: warning: variable ‘mask’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The print_other_cpu_stall() function currently unconditionally invokes
rcu_print_detail_task_stall(). This is OK because if there was a stall
sufficient to cause print_other_cpu_stall() to be invoked, that stall
is very likely to persist through the entire print_other_cpu_stall()
execution. However, if the stall did not persist, the variable ndetected
will be zero, and that variable is already tested in an "if" statement.
Therefore, this commit moves the call to rcu_print_detail_task_stall()
under that pre-existing "if" to improve readability, with a very rare
reduction in overhead.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ paulmck: Reworked commit log. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit is the second step towards full abstraction of all accesses to
the ->dynticks counter, implementing the previously open-coded atomic
add of zero in a new rcu_dynticks_snap() function. This abstraction will
ease changes o the ->dynticks counter operation.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit is the first step towards full abstraction of all accesses to
the ->dynticks counter, implementing the previously open-coded atomic add
of two in a new rcu_dynticks_momentary_idle() function. This abstraction
will ease changes to the ->dynticks counter operation.
Note that this commit gets rid of the smp_mb__before_atomic() and the
smp_mb__after_atomic() calls that were previously present. The reason
that this is OK from a memory-ordering perspective is that the atomic
operation is now atomic_add_return(), which, as a value-returning atomic,
guarantees full ordering.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The current preemptible RCU implementation goes through three phases
during bootup. In the first phase, there is only one CPU that is running
with preemption disabled, so that a no-op is a synchronous grace period.
In the second mid-boot phase, the scheduler is running, but RCU has
not yet gotten its kthreads spawned (and, for expedited grace periods,
workqueues are not yet running. During this time, any attempt to do
a synchronous grace period will hang the system (or complain bitterly,
depending). In the third and final phase, RCU is fully operational and
everything works normally.
This has been OK for some time, but there has recently been some
synchronous grace periods showing up during the second mid-boot phase.
This code worked "by accident" for awhile, but started failing as soon
as expedited RCU grace periods switched over to workqueues in commit
8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue").
Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal ksysfs
parameter). The callchain from the failure case is as follows:
early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited
The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.
This commit therefore reworks RCU to permit synchronous grace periods
to proceed during this mid-boot phase. This commit is therefore a
fix to a regression introduced in v4.9, and is therefore being put
forward post-merge-window in v4.10.
This commit sets a flag from the existing rcu_scheduler_starting()
function which causes all synchronous grace periods to take the expedited
path. The expedited path now checks this flag, using the requesting task
to drive the expedited grace period forward during the mid-boot phase.
Finally, this flag is updated by a core_initcall() function named
rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.
Note that this arrangement assumes that tasks are not sent POSIX signals
(or anything similar) from the time that the first task is spawned
through core_initcall() time.
Fixes: 8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue")
Reported-by: "Zheng, Lv" <lv.zheng@intel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Stan Kain <stan.kain@gmail.com>
Tested-by: Ivan <waffolz@hotmail.com>
Tested-by: Emanuel Castelo <emanuel.castelo@gmail.com>
Tested-by: Bruno Pesavento <bpesavento@infinito.it>
Tested-by: Borislav Petkov <bp@suse.de>
Tested-by: Frederic Bezies <fredbezies@gmail.com>
Cc: <stable@vger.kernel.org> # 4.9.0-
The current code can result in spurious kicks when there are no grace
periods in progress and no grace-period-related requests. This is
sort of OK for a diagnostic aid, but the resulting ftrace-dump messages
in dmesg are annoying. This commit therefore avoids spurious kicks
in the common case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The __call_rcu() comment about opportunistically noting grace period
beginnings and endings is obsolete. RCU still does such opportunistic
noting, but in __call_rcu_core() rather than __call_rcu(), and there
already is an appropriate comment in __call_rcu_core(). This commit
therefore removes the obsolete comment.
Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In the deep past, rcu_check_callbacks() was only invoked if rcu_pending()
returned true. Which was fine, but these days rcu_check_callbacks()
is invoked unconditionally. This commit therefore removes the obsolete
sentence from the header comment.
Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Commit 720abae3d6 ("rcu: force alignment on struct
callback_head/rcu_head") forced the rcu_head (AKA callback_head)
structure's alignment to pointer size, that is, to 4-byte boundaries on
32-bit systems and to 8-byte boundaries on 64-bit systems. This
commit therefore checks for this same alignment in __call_rcu(),
which used to contain a looser check for two-byte alignment.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
pa/A7CNQwibIV6YD8+/p
=1dUK
-----END PGP SIGNATURE-----
Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc plugins update from Kees Cook:
"This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot
time as possible, hoping to capitalize on any possible variation in
CPU operation (due to runtime data differences, hardware differences,
SMP ordering, thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example
for how to manipulate kernel code using the gcc plugin internals"
* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
latent_entropy: Mark functions with __latent_entropy
gcc-plugins: Add latent_entropy plugin
The __latent_entropy gcc attribute can be used only on functions and
variables. If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents. The variable must
be an integer, an integer array type or a structure with integer fields.
These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.
Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
Up to now, RCU has assumed that the CPU-online process makes it from
CPU_UP_PREPARE to set_cpu_online() within one jiffy. Given the recent
rise of virtualized environments, this assumption is very clearly
obsolete. Failing to meet this deadline can result in RCU paying
attention to an incoming CPU for one jiffy, then ignoring it until the
grace period following the one in which that CPU sets itself online.
This situation might prove to be fatally disappointing to any RCU
read-side critical sections that had the misfortune to execute during
the time in which RCU was ignoring the slow-to-come-online CPU.
This commit therefore updates RCU's internal CPU state-tracking
information at notify_cpu_starting() time, thus providing RCU with
an exact transition of the CPU's state from offline to online.
Note that this means that incoming CPUs must not use RCU read-side
critical section (other than those of SRCU) until notify_cpu_starting()
time. Note also that the CPU_STARTING notifiers -are- allowed to use
RCU read-side critical sections. (Of course, CPU-hotplug notifiers are
rapidly becoming obsolete, so you need to act fast!)
If a given architecture or CPU family needs to use RCU read-side
critical sections earlier, the call to rcu_cpu_starting() from
notify_cpu_starting() will need to be architecture-specific, with
architectures that need early use being required to hand-place
the call to rcu_cpu_starting() at some point preceding the call to
notify_cpu_starting().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, __note_gp_changes() checks to see if the CPU has slept through
multiple grace periods. If it has, it resynchronizes that CPU's view
of the grace-period state, which includes whether or not the current
grace period needs a quiescent state from this CPU. The fact of this
need (or lack thereof) needs to be in two places, rdp->cpu_no_qs.b.norm
and rdp->core_needs_qs. The former tells RCU's context-switch code to
go get a quiescent state and the latter says that it needs to be reported.
The current code unconditionally sets the former to true, but correctly
sets the latter.
This does not result in failures, but it does unnecessarily increase
the amount of work done on average at context-switch time. This commit
therefore correctly sets both fields.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The Kconfig currently controlling compilation of tree.c is:
init/Kconfig:config TREE_RCU
init/Kconfig: bool
...and update.c and sync.c are "obj-y" meaning that none are ever
built as a module by anyone.
Since MODULE_ALIAS is a no-op for non-modular code, we can remove
them from these files.
We leave moduleparam.h behind since the files instantiate some boot
time configuration parameters with module_param() still.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit abedf8e241 ("rcu: Use simple wait queues where possible in
rcutree") converts Tree RCU's wait queues to simple wait queues,
but it incorrectly reverts the commit 2aa792e6fa ("rcu: Use
rcu_gp_kthread_wake() to wake up grace period kthreads"). This can
result in redundant self-wakeups.
This commit therefore replaces the simple wait-queue wakeups with
rcu_gp_kthread_wake(), thus avoiding the redundant wakeups.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Straight forward conversion to the state machine. Though the question arises
whether this needs really all these state transitions to work.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.982013161@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In many cases in the RCU tree code, we iterate over the set of cpus for
a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
per-cpu data for each cpu in this range. However, if the set of possible
cpus is sparse, some cpus described in this range are not possible, and
thus no per-cpu region will have been allocated (or initialised) for
them by the generic percpu code.
Erroneous accesses to a per-cpu area for these !possible cpus may fault
or may hit other data depending on the addressed generated when the
erroneous per cpu offset is applied. In practice, both cases have been
observed on arm64 hardware (the former being silent, but detectable with
additional patches).
To avoid issues resulting from this, we must iterate over the set of
*possible* cpus for a given leaf node. This patch add a new helper,
for_each_leaf_node_possible_cpu, to enable this. As iteration is often
intertwined with rcu_node local bitmask manipulation, a new
leaf_node_cpu_bit helper is added to make this simpler and more
consistent. The RCU tree code is made to use both of these where
appropriate.
Without this patch, running reboot at a shell can result in an oops
like:
[ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
[ 3369.083881] pgd = ffffffc3ecdda000
[ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
[ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[ 3369.101781] Modules linked in:
[ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G W 4.6.0+ #3
[ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
[ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
[ 3369.139479] pc : [<ffffff80081109a8>] lr : [<ffffff8008110924>] pstate: 200001c5
[ 3369.146860] sp : ffffffc3eb9435a0
[ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
[ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
[ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
[ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
[ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
[ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
[ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
[ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
[ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
[ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
[ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
[ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
[ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
[ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
[ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
...
[ 3369.972257] [<ffffff80081109a8>] sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.978685] [<ffffff80081128b4>] synchronize_rcu_expedited+0x64/0xa8
[ 3369.985026] [<ffffff80086b987c>] synchronize_net+0x24/0x30
[ 3369.990499] [<ffffff80086ddb54>] dev_deactivate_many+0x28c/0x298
[ 3369.996493] [<ffffff80086b6bb8>] __dev_close_many+0x60/0xd0
[ 3370.002052] [<ffffff80086b6d48>] __dev_close+0x28/0x40
[ 3370.007178] [<ffffff80086bf62c>] __dev_change_flags+0x8c/0x158
[ 3370.012999] [<ffffff80086bf718>] dev_change_flags+0x20/0x60
[ 3370.018558] [<ffffff80086cf7f0>] do_setlink+0x288/0x918
[ 3370.023771] [<ffffff80086d0798>] rtnl_newlink+0x398/0x6a8
[ 3370.029158] [<ffffff80086cee84>] rtnetlink_rcv_msg+0xe4/0x220
[ 3370.034891] [<ffffff80086e274c>] netlink_rcv_skb+0xc4/0xf8
[ 3370.040364] [<ffffff80086ced8c>] rtnetlink_rcv+0x2c/0x40
[ 3370.045663] [<ffffff80086e1fe8>] netlink_unicast+0x160/0x238
[ 3370.051309] [<ffffff80086e24b8>] netlink_sendmsg+0x2f0/0x358
[ 3370.056956] [<ffffff80086a0070>] sock_sendmsg+0x18/0x30
[ 3370.062168] [<ffffff80086a21cc>] ___sys_sendmsg+0x26c/0x280
[ 3370.067728] [<ffffff80086a30ac>] __sys_sendmsg+0x44/0x88
[ 3370.073027] [<ffffff80086a3100>] SyS_sendmsg+0x10/0x20
[ 3370.078153] [<ffffff8008085e70>] el0_svc_naked+0x24/0x28
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Dennis Chen <dennis.chen@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is not always easy to determine the cause of an RCU stall just by
analysing the RCU stall messages, mainly when the problem is caused
by the indirect starvation of rcu threads. For example, when preempt_rcu
is not awakened due to the starvation of a timer softirq.
We have been hard coding panic() in the RCU stall functions for
some time while testing the kernel-rt. But this is not possible in
some scenarios, like when supporting customers.
This patch implements the sysctl kernel.panic_on_rcu_stall. If
set to 1, the system will panic() when an RCU stall takes place,
enabling the capture of a vmcore. The vmcore provides a way to analyze
all kernel/tasks states, helping out to point to the culprit and the
solution for the stall.
The kernel.panic_on_rcu_stall sysctl is disabled by default.
Changes from v1:
- Fixed a typo in the git log
- The if(sysctl_panic_on_rcu_stall) panic() is in a static function
- Fixed the CONFIG_TINY_RCU compilation issue
- The var sysctl_panic_on_rcu_stall is now __read_mostly
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
People have been having some difficulty finding their way around the
RCU code. This commit therefore pulls some of the expedited grace-period
code from tree.c to a new tree_exp.h file. This commit is strictly code
movement, with the exception of a forward declaration that was added
for the sync_sched_exp_online_cleanup() function.
A subsequent commit will move the remaining expedited grace-period code
from tree_plugin.h to tree_exp.h.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
I think you'll find this condition is superfluous, as the whole function
is under #ifdef of that same.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In the past, RCU grace-period initialization excluded CPU-hotplug
operations, but this is no longer the case. This commit therefore
removed an outdated comment in rcu_gp_init() claiming that these
are excluded.
Reported-by: Lihao Liang <lihao.liang@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The comment header for rcu_scheduler_active states that it is used
to optimize synchronize_sched() at early boot. This is incorrect.
The synchronize_sched() function instead checks the number of online
CPUs. This commit therefore replaces the comment's synchronize_sched()
with synchronize_rcu(), which really does use rcu_scheduler_active for
this purpose.
Reported-by: Lihao Liang <lihao.liang@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit provides rcu_exp_batches_completed() and
rcu_exp_batches_completed_sched() functions to allow torture-test modules
to check how many expedited grace period batches have completed.
These are analogous to the existing rcu_batches_completed(),
rcu_batches_completed_bh(), and rcu_batches_completed_sched() functions.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If it is necessary to kick the grace-period kthread, that is a good
time to dump the trace buffer in order to learn why kicking was needed.
This commit therefore does the dump.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Recent kernels can fail to awaken the grace-period kthread for
quiescent-state forcing. This commit is a crude hack that does
a wakeup if a scheduling-clock interrupt sees that it has been
too long since force-quiescent-state (FQS) processing.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the force-quiescent-state (FQS) code in rcu_gp_kthread() can
advance the next FQS even if one was not executed last time. This can
happen due timeout-duration uncertainty. This commit therefore avoids
advancing the FQS schedule unless an FQS was just executed. In the
corner case where an FQS was not executed, but is due now, the code does
a one-jiffy wait.
This change prepares for kthread kicking.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Recent kernels can fail to awaken the grace-period kthread for
quiescent-state forcing. This commit is a crude hack that does
a wakeup any time a stall is detected.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current expedited grace-period implementation makes subsequent grace
periods wait on wakeups for the prior grace period. This does not fit
the dictionary definition of "expedited", so this commit allows these two
phases to overlap. Doing this requires four waitqueues rather than two
because tasks can now be waiting on the previous, current, and next grace
periods. The fourth waitqueue makes the bit masking work out nicely.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit pulls the grace-period-start counter adjustment and tracing
from synchronize_rcu_expedited() and synchronize_sched_expedited()
into exp_funnel_lock(), thus eliminating some code duplication.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves some duplicate code from synchronize_rcu_expedited()
and synchronize_sched_expedited() into rcu_exp_gp_seq_snap(). This
doesn't save lines of code, but does eliminate a "tell me twice" issue.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, synchronize_rcu_expedited() and rcu_sched_expedited() have
significant duplicate code. This commit therefore consolidates some of
this code into rcu_exp_wake(), which is now renamed to rcu_exp_wait_wake()
in recognition of its added responsibilities.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit speeds up the low-contention case, especially for systems
with large rcu_node trees, by attempting to directly acquire the
->exp_mutex. This fastpath checks the leaves and root first in
order to avoid excessive memory contention on the mutex itself.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current mutex-based funnel-locking approach used by expedited grace
periods is subject to severe unfairness. The problem arises when a
few tasks, making a path from leaves to root, all wake up before other
tasks do. A new task can then follow this path all the way to the root,
which needlessly delays tasks whose grace period is done, but who do
not happen to acquire the lock quickly enough.
This commit avoids this problem by maintaining per-rcu_node wait queues,
along with a per-rcu_node counter that tracks the latest grace period
sought by an earlier task to visit this node. If that grace period
would satisfy the current task, instead of proceeding up the tree,
it waits on the current rcu_node structure using a pair of wait queues
provided for that purpose. This decouples awakening of old tasks from
the arrival of new tasks.
If the wakeups prove to be a bottleneck, additional kthreads can be
brought to bear for that purpose.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The cpu_online() function can return values other than 0 and 1, which
can result in subscript overflow when applied to a two-element array.
This commit allows for this behavior by using "!!" on the return value
from cpu_online() when used as a subscript.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit #cdacbe1f91264 ("rcu: Add fastpath bypassing funnel locking")
turns out to be a pessimization at high load because it forces a tree
full of tasks to wait for an expedited grace period that they probably
do not need. This commit therefore removes this optimization.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although cond_resched_rcu_qs() supplies quiescent states to all flavors
of normal RCU grace periods, it does nothing for expedited RCU-sched
grace periods. This commit therefore adds a check for a need for a
quiescent state from the current CPU by an expedited RCU-sched grace
period, and invokes rcu_sched_qs() to supply that quiescent state if so.
Note that the check is racy in that we might be migrated to some other
CPU just after checking the per-CPU variable. This is OK because the
act of migration will do a context switch, which will supply the needed
quiescent state. The only downside is that we might do an unnecessary
call to rcu_sched_qs(), but the probability is low and the overhead
is small.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, synchronize_sched_expedited_wait() simply sets the ndetected
variable to the rcu_print_task_exp_stall() return value. This means
that if the last rcu_node structure has no stalled tasks, record of
any stalled tasks in previous rcu_node structures is lost, which can
in turn result in failure to dump out the blocking rcu_node structures.
Or could, had the test been correct.
This commit therefore adds the return value of rcu_print_task_exp_stall()
to ndetected and corrects the later test for ndetected.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, sync_sched_exp_handler() will force a reschedule unless
this CPU has already checked in or unless a reschedule has already
been called for. This is clearly wasteful if sync_sched_exp_handler()
interrupted an idle CPU, so this commit immediately reports the
quiescent state in that case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit consolidates a couple definitions and several calls for
single-shot ftrace-buffer dumping.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull cpu hotplug updates from Thomas Gleixner:
"This is the first part of the ongoing cpu hotplug rework:
- Initial implementation of the state machine
- Runs all online and prepare down callbacks on the plugged cpu and
not on some random processor
- Replaces busy loop waiting with completions
- Adds tracepoints so the states can be followed"
More detailed commentary on this work from an earlier email:
"What's wrong with the current cpu hotplug infrastructure?
- Asymmetry
The hotplug notifier mechanism is asymmetric versus the bringup and
teardown. This is mostly caused by the notifier mechanism.
- Largely undocumented dependencies
While some notifiers use explicitely defined notifier priorities,
we have quite some notifiers which use numerical priorities to
express dependencies without any documentation why.
- Control processor driven
Most of the bringup/teardown of a cpu is driven by a control
processor. While it is understandable, that preperatory steps,
like idle thread creation, memory allocation for and initialization
of essential facilities needs to be done before a cpu can boot,
there is no reason why everything else must run on a control
processor. Before this patch series, bringup looks like this:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
bring the rest up
- All or nothing approach
There is no way to do partial bringups. That's something which is
really desired because we waste e.g. at boot substantial amount of
time just busy waiting that the cpu comes to life. That's stupid
as we could very well do preparatory steps and the initial IPI for
other cpus and then go back and do the necessary low level
synchronization with the freshly booted cpu.
- Minimal debuggability
Due to the notifier based design, it's impossible to switch between
two stages of the bringup/teardown back and forth in order to test
the correctness. So in many hotplug notifiers the cancel
mechanisms are either not existant or completely untested.
- Notifier [un]registering is tedious
To [un]register notifiers we need to protect against hotplug at
every callsite. There is no mechanism that bringup/teardown
callbacks are issued on the online cpus, so every caller needs to
do it itself. That also includes error rollback.
What's the new design?
The base of the new design is a symmetric state machine, where both
the control processor and the booting/dying cpu execute a well
defined set of states. Each state is symmetric in the end, except
for some well defined exceptions, and the bringup/teardown can be
stopped and reversed at almost all states.
So the bringup of a cpu will look like this in the future:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
bring itself up
The synchronization step does not require the control cpu to wait.
That mechanism can be done asynchronously via a worker or some
other mechanism.
The teardown can be made very similar, so that the dying cpu cleans
up and brings itself down. Cleanups which need to be done after
the cpu is gone, can be scheduled asynchronously as well.
There is a long way to this, as we need to refactor the notion when a
cpu is available. Today we set the cpu online right after it comes
out of the low level bringup, which is not really correct.
The proper mechanism is to set it to available, i.e. cpu local
threads, like softirqd, hotplug thread etc. can be scheduled on that
cpu, and once it finished all booting steps, it's set to online, so
general workloads can be scheduled on it. The reverse happens on
teardown. First thing to do is to forbid scheduling of general
workloads, then teardown all the per cpu resources and finally shut it
off completely.
This patch series implements the basic infrastructure for this at the
core level. This includes the following:
- Basic state machine implementation with well defined states, so
ordering and prioritization can be expressed.
- Interfaces to [un]register state callbacks
This invokes the bringup/teardown callback on all online cpus with
the proper protection in place and [un]installs the callbacks in
the state machine array.
For callbacks which have no particular ordering requirement we have
a dynamic state space, so that drivers don't have to register an
explicit hotplug state.
If a callback fails, the code automatically does a rollback to the
previous state.
- Sysfs interface to drive the state machine to a particular step.
This is only partially functional today. Full functionality and
therefor testability will be achieved once we converted all
existing hotplug notifiers over to the new scheme.
- Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
processor:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
wait for boot
bring itself up
Signal completion to control cpu
In a previous step of this work we've done a full tree mechanical
conversion of all hotplug notifiers to the new scheme. The balance
is a net removal of about 4000 lines of code.
This is not included in this series, as we decided to take a
different approach. Instead of mechanically converting everything
over, we will do a proper overhaul of the usage sites one by one so
they nicely fit into the symmetric callback scheme.
I decided to do that after I looked at the ugliness of some of the
converted sites and figured out that their hotplug mechanism is
completely buggered anyway. So there is no point to do a
mechanical conversion first as we need to go through the usage
sites one by one again in order to achieve a full symmetric and
testable behaviour"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
cpu/hotplug: Document states better
cpu/hotplug: Fix smpboot thread ordering
cpu/hotplug: Remove redundant state check
cpu/hotplug: Plug death reporting race
rcu: Make CPU_DYING_IDLE an explicit call
cpu/hotplug: Make wait for dead cpu completion based
cpu/hotplug: Let upcoming cpu bring itself fully up
arch/hotplug: Call into idle with a proper state
cpu/hotplug: Move online calls to hotplugged cpu
cpu/hotplug: Create hotplug threads
cpu/hotplug: Split out the state walk into functions
cpu/hotplug: Unpark smpboot threads from the state machine
cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
cpu/hotplug: Implement setup/removal interface
cpu/hotplug: Make target state writeable
cpu/hotplug: Add sysfs state interface
cpu/hotplug: Hand in target state to _cpu_up/down
cpu/hotplug: Convert the hotplugged cpu work to a state machine
cpu/hotplug: Convert to a state machine for the control processor
cpu/hotplug: Add tracepoints
...
Make the RCU CPU_DYING_IDLE callback an explicit function call, so it gets
invoked at the proper place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.870167933@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As of commit dae6e64d2b ("rcu: Introduce proper blocking to no-CBs kthreads
GP waits") the RCU subsystem started making use of wait queues.
Here we convert all additions of RCU wait queues to use simple wait queues,
since they don't need the extra overhead of the full wait queue features.
Originally this was done for RT kernels[1], since we would get things like...
BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
Pid: 8, comm: rcu_preempt Not tainted
Call Trace:
[<ffffffff8106c8d0>] __might_sleep+0xd0/0xf0
[<ffffffff817d77b4>] rt_spin_lock+0x24/0x50
[<ffffffff8106fcf6>] __wake_up+0x36/0x70
[<ffffffff810c4542>] rcu_gp_kthread+0x4d2/0x680
[<ffffffff8105f910>] ? __init_waitqueue_head+0x50/0x50
[<ffffffff810c4070>] ? rcu_gp_fqs+0x80/0x80
[<ffffffff8105eabb>] kthread+0xdb/0xe0
[<ffffffff8106b912>] ? finish_task_switch+0x52/0x100
[<ffffffff817e0754>] kernel_thread_helper+0x4/0x10
[<ffffffff8105e9e0>] ? __init_kthread_worker+0x60/0x60
[<ffffffff817e0750>] ? gs_change+0xb/0xb
...and hence simple wait queues were deployed on RT out of necessity
(as simple wait uses a raw lock), but mainline might as well take
advantage of the more streamline support as well.
[1] This is a carry forward of work from v3.10-rt; the original conversion
was by Thomas on an earlier -rt version, and Sebastian extended it to
additional post-3.10 added RCU waiters; here I've added a commit log and
unified the RCU changes into one, and uprev'd it to match mainline RCU.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently,
this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will
not enable the IRQs. lockdep is happy.
By switching over using swait this is not true anymore. swake_up_all()
enables the IRQs while processing the waiters. __do_softirq() can now
run and will eventually call rcu_process_callbacks() which wants to
grap nrp->lock.
Let's move the rcu_nocb_gp_cleanup() call outside the lock before we
switch over to swait.
If we would hold the rnp->lock and use swait, lockdep reports
following:
=================================
[ INFO: inconsistent lock state ]
4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
(rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
{IN-SOFTIRQ-W} state was registered at:
[<ffffffff81109b9f>] __lock_acquire+0xd5f/0x21e0
[<ffffffff8110be0f>] lock_acquire+0xdf/0x2b0
[<ffffffff81841cc9>] _raw_spin_lock_irqsave+0x59/0xa0
[<ffffffff81136991>] rcu_process_callbacks+0x141/0x3c0
[<ffffffff810b1a9d>] __do_softirq+0x14d/0x670
[<ffffffff810b2214>] irq_exit+0x104/0x110
[<ffffffff81844e96>] smp_apic_timer_interrupt+0x46/0x60
[<ffffffff81842e70>] apic_timer_interrupt+0x70/0x80
[<ffffffff810dba66>] rq_attach_root+0xa6/0x100
[<ffffffff810dbc2d>] cpu_attach_domain+0x16d/0x650
[<ffffffff810e4b42>] build_sched_domains+0x942/0xb00
[<ffffffff821777c2>] sched_init_smp+0x509/0x5c1
[<ffffffff821551e3>] kernel_init_freeable+0x172/0x28f
[<ffffffff8182cdce>] kernel_init+0xe/0xe0
[<ffffffff8184231f>] ret_from_fork+0x3f/0x70
irq event stamp: 76
hardirqs last enabled at (75): [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
hardirqs last disabled at (76): [<ffffffff8184116f>] _raw_spin_lock_irq+0x1f/0x90
softirqs last enabled at (0): [<ffffffff810a8df2>] copy_process.part.26+0x602/0x1cf0
softirqs last disabled at (0): [< (null)>] (null)
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(rcu_node_1);
<Interrupt>
lock(rcu_node_1);
*** DEADLOCK ***
1 lock held by rcu_preempt/8:
#0: (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
stack backtrace:
CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
Call Trace:
[<ffffffff818379e0>] dump_stack+0x4f/0x7b
[<ffffffff8110813b>] print_usage_bug+0x1db/0x1e0
[<ffffffff8102fa4f>] ? save_stack_trace+0x2f/0x50
[<ffffffff811087ad>] mark_lock+0x66d/0x6e0
[<ffffffff81107790>] ? check_usage_forwards+0x150/0x150
[<ffffffff81108898>] mark_held_locks+0x78/0xa0
[<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff81108a28>] trace_hardirqs_on_caller+0x168/0x220
[<ffffffff81108aed>] trace_hardirqs_on+0xd/0x10
[<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
[<ffffffff810fd1c7>] swake_up_all+0xb7/0xe0
[<ffffffff811386e1>] rcu_gp_kthread+0xab1/0xeb0
[<ffffffff811089bf>] ? trace_hardirqs_on_caller+0xff/0x220
[<ffffffff81841341>] ? _raw_spin_unlock_irq+0x41/0x60
[<ffffffff81137c30>] ? rcu_barrier+0x20/0x20
[<ffffffff810d2014>] kthread+0x104/0x120
[<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
[<ffffffff8184231f>] ret_from_fork+0x3f/0x70
[<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In patch:
"rcu: Add transitivity to remaining rcu_node ->lock acquisitions"
All locking operations on rcu_node::lock are replaced with the wrappers
because of the need of transitivity, which indicates we should never
write code using LOCK primitives alone(i.e. without a proper barrier
following) on rcu_node::lock outside those wrappers. We could detect
this kind of misuses on rcu_node::lock in the future by adding __private
modifier on rcu_node::lock.
To privatize rcu_node::lock, unlock wrappers are also needed. Replacing
spinlock unlocks with these wrappers not only privatizes rcu_node::lock
but also makes it easier to figure out critical sections of rcu_node.
This patch adds __private modifier to rcu_node::lock and makes every
access to it wrapped by ACCESS_PRIVATE(). Besides, unlock wrappers are
added and raw_spin_unlock(&rnp->lock) and its friends are replaced with
those wrappers.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The related warning from gcc 6.0:
In file included from kernel/rcu/tree.c:4630:0:
kernel/rcu/tree_plugin.h:810:40: warning: ‘rcu_data_p’ defined but not used [-Wunused-const-variable]
static struct rcu_data __percpu *const rcu_data_p = &rcu_sched_data;
^~~~~~~~~~
Also remove always redundant rcu_data_p in tree.c.
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit #e3663b1024d1 ("rcu: Handle gpnum/completed wrap while dyntick
idle") sets rdp->gpwrap on the wrong side of the "if" statement in
dyntick_save_progress_counter(), that is, it sets it when the CPU is
not idle instead of when it is idle. Of course, if the CPU is not idle,
its rdp->gpnum won't be lagging beind the global rsp->gpnum, which means
that rdp->gpwrap will never be set.
This commit therefore moves this code to the proper leg of that "if"
statement. This change means that the "else" cause is just "return 0"
and the "then" clause ends with "return 1", so also move the "return 0"
to follow the "if", dropping the "else" clause.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 4a81e8328d ("Reduce overhead of cond_resched() checks for RCU")
handles the error case where a nohz_full loops indefinitely in the kernel
with the scheduling-clock interrupt disabled. However, this handling
includes IPIing the CPU running the offending loop, which is not what
we want for real-time workloads. And there are starting to be real-time
CPU-bound in-kernel workloads, and these must be handled without IPIing
the CPU, at least not in the common case. Therefore, this situation can
no longer be dismissed as an error case.
This commit therefore splits the handling out, so that the setting of
bits in the per-CPU rcu_sched_qs_mask variable is done relatively early,
but if the problem persists, resched_cpu() is eventually used to IPI the
CPU containing the offending loop. Assuming that in-kernel CPU-bound
loops used by real-time tasks contain frequent calls cond_resched_rcu_qs()
(as in more than once per few tens of milliseconds), the real-time tasks
will never be IPIed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
The header comment for rcu_report_qs_rsp() was obsolete, dating well
before the advent of RCU grace-period kthreads. This commit therefore
brings this comment back into alignment with current reality.
Reported-by: Lihao Liang <lihao.liang@cs.ox.ac.uk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
A zero seems to have escaped earlier true/false substitution efforts,
so this commit changes 0 to false for the ->core_needs_qs boolean field.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The return value from rcu_gp_init() is always used as a bool, so
this commit makes it be a bool.
Reported-by: Iftekhar Ahmed <ahmedi@oregonstate.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch removes a potential deadlock hazard by moving the
wake_up_process() in rcu_spawn_gp_kthread() out from under rnp->lock.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit replaces a local_irq_save()/local_irq_restore() pair with
a lockdep assertion that interrupts are already disabled. This should
remove the corresponding overhead from the interrupt entry/exit fastpaths.
This change was inspired by the fact that Iftekhar Ahmed's mutation
testing showed that removing rcu_irq_enter()'s call to local_ird_restore()
had no effect, which might indicate that interrupts were always enabled
anyway.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The cpu_needs_another_gp() function is currently of type int, but only
returns zero or one. Bow to reality and make it be of type bool.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that the rcu_state structure's ->rda field is compile-time initialized,
there is no need to pass the per-CPU rcu_data structure into rcu_init_one().
This commit therefore eliminates this now-unused parameter.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, ->gp_state is printed as an integer, which slows debugging.
This commit therefore prints a symbolic name in addition to the integer.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Updated to fix relational operator called out by Dan Carpenter. ]
[ paulmck: More "const", as suggested by Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit increases debug information in the case where the grace-period
kthread is being prevented from running by dumping that kthread's stack.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Split into prior commit and this commit, as suggested by
Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, if the RCU grace-period kthread has not yet been created,
in which case the starvation-check code will print zero for the state,
which maps to TASK_RUNNING. This could clearly be quite confusing, so
this commit prints ~0, which does not map to any legal ->state value.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
We need the scheduler's fastpaths to be, well, fast, and unnecessarily
disabling and re-enabling interrupts is not necessarily consistent with
this goal. Especially given that there are regions of the scheduler that
already have interrupts disabled.
This commit therefore moves the call to rcu_note_context_switch()
to one of the interrupts-disabled regions of the scheduler, and
removes the now-redundant disabling and re-enabling of interrupts from
rcu_note_context_switch() and the functions it calls.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
by Peter Zijlstra. ]
This commit applies an early-exit approach to rcu_sched_qs(), reducing
the nesting level and saving a line of code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the rcu_node_class[], rcu_fqs_class[], and rcu_exp_class[]
arrays needlessly pollute the global namespace within tree.c. This
commit therefore converts them to static local variables within
rcu_init_one().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although expedited grace periods can be quite useful, and although their
OS jitter has been greatly reduced, they can still pose problems for
extreme real-time workloads. This commit therefore adds a rcu_normal
kernel boot parameter (which can also be manipulated via sysfs)
to suppress expedited grace periods, that is, to treat requests for
expedited grace periods as if they were requests for normal grace periods.
If both rcu_expedited and rcu_normal are specified, rcu_normal wins.
This means that if you are relying on expedited grace periods to speed up
boot, you will want to specify rcu_expedited on the kernel command line,
and then specify rcu_normal via sysfs once boot completes.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds print statements that check the rcu_node structure to
find which ->expmask bits and which ->exp_tasks structures are blocking
the current expedited grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, if a grace period ends just as the stall-warning timeout
fires, an empty stall warning will be printed. This is not helpful,
so this commit avoids these useless warnings by rechecking completion
after awakening in synchronize_sched_expedited_wait().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the piggybacked-work checks carried out by sync_exp_work_done()
atomically increment a small set of variables (the ->expedited_workdone0,
->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
fields in the rcu_state structure), which will form a memory-contention
bottleneck given a sufficiently large number of CPUs concurrently invoking
either synchronize_rcu_expedited() or synchronize_sched_expedited().
This commit therefore moves these for fields to the per-CPU rcu_data
structure, eliminating the memory contention. The show_rcuexp() function
also changes to sum up each field in the rcu_data structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit saves a couple lines of code and reduces indentation
by inverting the sense of an "if" statement in the function
sync_rcu_exp_select_cpus().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The memory barrier in rcu_seq_snap() is needed only for grace periods,
so this commit moves it to the grace-period-oriented wrapper
rcu_exp_gp_seq_snap().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If there is only one CPU, then invoking synchronize_sched_expedited()
is by definition a grace period. This commit checks for this condition
and does a short-circuit return in that case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rule is that all acquisitions of the rcu_node structure's ->lock
must provide transitivity: The lock is not acquired that frequently,
and sorting out exactly which required it and which did not would be
a maintenance nightmare. This commit therefore supplies the needed
transitivity to the remaining ->lock acquisitions.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Providing RCU's memory-ordering guarantees requires that the rcu_node
tree's locking provide transitive memory ordering, which the Linux kernel's
spinlocks currently do not provide unless smp_mb__after_unlock_lock()
is used. Having a separate smp_mb__after_unlock_lock() after each and
every lock acquisition is error-prone, hard to read, and a bit annoying,
so this commit provides wrapper functions that pull in the
smp_mb__after_unlock_lock() invocations.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Earlier versions of synchronize_sched_expedited() can prematurely end
grace periods due to the fact that a CPU marked as cpu_is_offline()
can still be using RCU read-side critical sections during the time that
CPU makes its last pass through the scheduler and into the idle loop
and during the time that a given CPU is in the process of coming online.
This commit therefore eliminates this window by adding additional
interaction with the CPU-hotplug operations.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds task-print ability to the expedited RCU CPU stall
warning messages in preparation for adding stall warnings to
synchornize_rcu_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the RCU CPU stall warning message print online/offline
indications immediately after the CPU number. A "O" indicates global
offline, a "." global online, and a "o" indicates RCU believes that the
CPU is offline for the current grace period and "." otherwise, and an
"N" indicates that RCU believes that the CPU will be offline for the
next grace period, and "." otherwise, all right after the CPU number.
So for CPU 10, you would normally see "10-...:" indicating that everything
believes that the CPU is online.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that sync_sched_exp_select_cpus() and sync_rcu_exp_select_cpus()
are identical aside from the the argument to smp_call_function_single(),
this commit consolidates them with a functional argument.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit brings sync_sched_exp_select_cpus() into alignment with
sync_rcu_exp_select_cpus(), as a first step towards consolidating them
into one function.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that synchronize_sched_expedited() uses IPIs, a hook in
rcu_sched_qs(), and the ->expmask field in the rcu_node combining
tree, it is no longer necessary to exclude CPU hotplug. Any
races with CPU hotplug will be detected when attempting to send
the IPI. This commit therefore removes the code excluding
CPU hotplug operations.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This reverts commit af859beaab (rcu: Silence lockdep false positive
for expedited grace periods). Because synchronize_rcu_expedited()
no longer invokes synchronize_sched_expedited(), ->exp_funnel_mutex
acquisition is no longer nested, so the false positive no longer happens.
This commit therefore removes the extra lockdep data structures, as they
are no longer needed.
This commit switches synchronize_sched_expedited() from stop_one_cpu_nowait()
to smp_call_function_single(), thus moving from an IPI and a pair of
context switches to an IPI and a single pass through the scheduler.
Of course, if the scheduler actually does decide to switch to a different
task, there will still be a pair of context switches, but there would
likely have been a pair of context switches anyway, just a bit later.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit commit 4cdfc175c2 ("rcu: Move quiescent-state forcing
into kthread") started the process of folding the old ->fqs_state into
->gp_state, but did not complete it. This situation does not cause
any malfunction, but can result in extremely confusing trace output.
This commit completes this task of eliminating ->fqs_state in favor
of ->gp_state.
The old ->fqs_state was also used to decide when to collect dyntick-idle
snapshots. For this purpose, we add a boolean variable into the kthread,
which is set on the first call to rcu_gp_fqs() for a given grace period
and clear otherwise.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit loosens rcutree.rcu_fanout_leaf range checks
and replaces a panic() with a fallback to compile-time values.
This fallback is accompanied by a WARN_ON(), and both occur when the
rcutree.rcu_fanout_leaf value is too small to accommodate the number of
CPUs. For example, given the current four-level limit for the rcu_node
tree, a system with more than 16 CPUs built with CONFIG_FANOUT=2 must
have rcutree.rcu_fanout_leaf larger than 2.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Because preempt_disable() maps to barrier() for non-debug builds,
it forces the compiler to spill and reload registers. Because Tree
RCU and Tiny RCU now only appear in CONFIG_PREEMPT=n builds, these
barrier() instances generate needless extra code for each instance of
rcu_read_lock() and rcu_read_unlock(). This extra code slows down Tree
RCU and bloats Tiny RCU.
This commit therefore removes the preempt_disable() and preempt_enable()
from the non-preemptible implementations of __rcu_read_lock() and
__rcu_read_unlock(), respectively. However, for debug purposes,
preempt_disable() and preempt_enable() are still invoked if
CONFIG_PREEMPT_COUNT=y, because this allows detection of sleeping inside
atomic sections in non-preemptible kernels.
However, Tiny and Tree RCU operates by coalescing all RCU read-side
critical sections on a given CPU that lie between successive quiescent
states. It is therefore necessary to compensate for removing barriers
from __rcu_read_lock() and __rcu_read_unlock() by adding them to a
couple of the RCU functions invoked during quiescent states, namely to
rcu_all_qs() and rcu_note_context_switch(). However, note that the latter
is more paranoia than necessity, at least until link-time optimizations
become more aggressive.
This is based on an earlier patch by Paul E. McKenney, fixing
a bug encountered in kernels built with CONFIG_PREEMPT=n and
CONFIG_PREEMPT_COUNT=y.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
As we now have rcu_callback_t typedefs as the type of rcu callbacks, we
should use it in call_rcu*() and friends as the type of parameters. This
could save us a few lines of code and make it clear which function
requires an rcu callbacks rather than other callbacks as its argument.
Besides, this can also help cscope to generate a better database for
code reading.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit converts the rcu_data structure's ->cpu_no_qs field
to a union. The bytewise side of this union allows individual access
to indications as to whether this CPU needs to find a quiescent state
for a normal (.norm) and/or expedited (.exp) grace period. The setwise
side of the union allows testing whether or not a quiescent state is
needed at all, for either type of grace period.
For now, only .norm is used. A later commit will introduce the expedited
usage.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit inverts the sense of the rcu_data structure's ->passed_quiesce
field and renames it to ->cpu_no_qs. This will allow a later commit to
use an "aggregate OR" operation to test expedited as well as normal grace
periods without added overhead.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
An upcoming commit needs to invert the sense of the ->passed_quiesce
rcu_data structure field, so this commit is taking this opportunity
to clarify things a bit by renaming ->qs_pending to ->core_needs_qs.
So if !rdp->core_needs_qs, then this CPU need not concern itself with
quiescent states, in particular, it need not acquire its leaf rcu_node
structure's ->lock to check. Otherwise, it needs to report the next
quiescent state.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, synchronize_sched_expedited() uses a single global counter
to track the number of remaining context switches that the current
expedited grace period must wait on. This is problematic on large
systems, where the resulting memory contention can be pathological.
This commit therefore makes synchronize_sched_expedited() instead use
the combining tree in the same manner as synchronize_rcu_expedited(),
keeping memory contention down to a dull roar.
This commit creates a temporary function sync_sched_exp_select_cpus()
that is very similar to sync_rcu_exp_select_cpus(). A later commit
will consolidate these two functions, which becomes possible when
synchronize_sched_expedited() switches from stop_one_cpu_nowait() to
smp_call_function_single().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current preemptible-RCU expedited grace-period algorithm invokes
synchronize_sched_expedited() to enqueue all tasks currently running
in a preemptible-RCU read-side critical section, then waits for all the
->blkd_tasks lists to drain. This works, but results in both an IPI and
a double context switch even on CPUs that do not happen to be running
in a preemptible RCU read-side critical section.
This commit implements a new algorithm that causes less OS jitter.
This new algorithm IPIs all online CPUs that are not idle (from an
RCU perspective), but refrains from self-IPIs. If a CPU receiving
this IPI is not in a preemptible RCU read-side critical section (or
is just now exiting one), it pushes quiescence up the rcu_node tree,
otherwise, it sets a flag that will be handled by the upcoming outermost
rcu_read_unlock(), which will then push quiescence up the tree.
The expedited grace period must of course wait on any pre-existing blocked
readers, and newly blocked readers must be queued carefully based on
the state of both the normal and the expedited grace periods. This
new queueing approach also avoids the need to update boost state,
courtesy of the fact that blocked tasks are no longer ever migrated to
the root rcu_node structure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit replaces sync_rcu_preempt_exp_init1(() and
sync_rcu_preempt_exp_init2() with sync_exp_reset_tree_hotplug()
and sync_exp_reset_tree(), which will also be used by
synchronize_sched_expedited(), and sync_rcu_exp_select_nodes(), which
contains code specific to synchronize_rcu_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This is a nearly pure code-movement commit, moving rcu_report_exp_rnp(),
sync_rcu_preempt_exp_done(), and rcu_preempted_readers_exp() so
that later commits can make synchronize_sched_expedited() use them.
The non-code-movement portion of this commit tags rcu_report_exp_rnp()
as __maybe_unused to avoid build errors when CONFIG_PREEMPT=n.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that there is an ->expedited_wq waitqueue in each rcu_state structure,
there is no need for the sync_rcu_preempt_exp_wq global variable. This
commit therefore substitutes ->expedited_wq for sync_rcu_preempt_exp_wq.
It also initializes ->expedited_wq only once at boot instead of at the
start of each expedited grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In kernels built with CONFIG_PREEMPT=y, synchronize_rcu_expedited()
invokes synchronize_sched_expedited() while holding RCU-preempt's
root rcu_node structure's ->exp_funnel_mutex, which is acquired after
the rcu_data structure's ->exp_funnel_mutex. The first thing that
synchronize_sched_expedited() will do is acquire RCU-sched's rcu_data
structure's ->exp_funnel_mutex. There is no danger of an actual deadlock
because the locking order is always from RCU-preempt's expedited mutexes
to those of RCU-sched. Unfortunately, lockdep considers both rcu_data
structures' ->exp_funnel_mutex to be in the same lock class and therefore
reports a deadlock cycle.
This commit silences this false positive by placing RCU-sched's rcu_data
structures' ->exp_funnel_mutex locks into their own lock class.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In a CONFIG_PREEMPT=y kernel, synchronize_rcu_expedited()
acquires the ->exp_funnel_mutex in rcu_preempt_state, then invokes
synchronize_sched_expedited, which acquires the ->exp_funnel_mutex in
rcu_sched_state. There can be no deadlock because rcu_preempt_state
->exp_funnel_mutex acquisition always precedes that of rcu_sched_state.
But lockdep does not know that, so it gives false-positive splats.
This commit therefore associates a separate lock_class_key structure
with the rcu_sched_state structure's ->exp_funnel_mutex, allowing
lockdep to see the lock ordering, avoiding the false positives.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for
consistency with the WARN() series of macros. This also requires
inverting the sense of the conditional, which this commit also does.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Although rcu_is_watching() is marked notrace, it invokes preempt_disable()
and preempt_enable(), both of which can be traced. This defeats the
purpose of the notrace on rcu_is_watching(), so this commit substitutes
preempt_disable_notrace() and preempt_enable_notrace().
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
The get_state_synchronize_rcu() and cond_synchronize_rcu() functions
allow polling for grace-period completion, with an actual wait for a
grace period occurring only when cond_synchronize_rcu() is called too
soon after the corresponding get_state_synchronize_rcu(). However,
these functions work only for vanilla RCU. This commit adds the
get_state_synchronize_sched() and cond_synchronize_sched(), which provide
the same capability for RCU-sched.
Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In the common case, there will be only one expedited grace period in
the system at a given time, in which case it is not helpful to use
funnel locking. This commit therefore adds a fastpath that bypasses
funnel locking when the root ->exp_funnel_mutex is not held.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The grace-period kthread sleeps waiting to do a force-quiescent-state
scan, and when awakened sets rsp->gp_state to RCU_GP_DONE_FQS.
However, this is confusing because the kthread has not done the
force-quiescent-state, but is instead just starting to do it. This commit
therefore renames RCU_GP_DONE_FQS to RCU_GP_DOING_FQS in order to make
things a bit easier on reviewers.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The condition for the wait_event_interruptible_timeout() that waits
to do the next force-quiescent-state scan is a bit ornate:
((gf = READ_ONCE(rsp->gp_flags)) &
RCU_GP_FLAG_FQS) ||
(!READ_ONCE(rnp->qsmask) &&
!rcu_preempt_blocked_readers_cgp(rnp))
This commit therefore pulls this condition out into a helper function
and comments its component conditions.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although synchronize_sched_expedited() historically has no RCU CPU stall
warnings, the availability of the rcupdate.rcu_expedited boot parameter
invalidates the old assumption that synchronize_sched()'s stall warnings
would suffice. This commit therefore adds RCU CPU stall warnings to
synchronize_sched_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The strictly rcu_node based funnel-locking scheme works well in many
cases, but systems with CONFIG_RCU_FANOUT_LEAF=64 won't necessarily get
all that much concurrency. This commit therefore extends the funnel
locking into the per-CPU rcu_data structure, providing concurrency equal
to the number of CPUs.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
One of the requirements on RCU grace periods is that if there is a
causal chain of operations that starts after one grace period and
ends before another grace period, then the two grace periods must
be serialized. There has been (and might still be) code that relies
on this, for example, certain types of reference-counting code that
does a call_rcu() within an RCU callback function.
This requirement is why there is an smp_mb() at the end of both
synchronize_sched_expedited() and synchronize_rcu_expedited().
However, this is the only smp_mb() in these functions, so it would
be nicer to consolidate it into rcu_exp_gp_seq_end(). This commit
does just that.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_seq operations were open-coded in _rcu_barrier(), so this commit
replaces the open-coding with the shiny new rcu_seq operations.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit gets rid of synchronize_rcu_expedited()'s mutex_trylock()
polling loop in favor of the funnel-locking scheme that was abstracted
from synchronize_sched_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The type of "s" has been "long" rather than the correct "unsigned long"
for quite some time. This commit fixes this type error.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit abstracts funnel locking from synchronize_sched_expedited()
so that it may be used by synchronize_rcu_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit creates rcu_exp_gp_seq_start() and rcu_exp_gp_seq_end() to
bracket an expedited grace period, rcu_exp_gp_seq_snap() to snapshot the
sequence counter, and rcu_exp_gp_seq_done() to check to see if a full
expedited grace period has elapsed since the snapshot. These will be
applied to synchronize_rcu_expedited(). These are defined in terms of
underlying rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(), rcu_seq_done(),
which will be applied to _rcu_barrier().
One reason that this commit doesn't use the seqcount primitives themselves
is that the smp_wmb() in those primitive is insufficient due to the fact
that expedited grace periods do reads as well as writes. In addition,
the read-side seqcount primitives detect a potentially partial change,
where the expedited primitives instead need a guaranteed full change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Sequentially stopping the CPUs slows down expedited grace periods by
at least a factor of two, based on rcutorture's grace-period-per-second
rate. This is a conservative measure because rcutorture uses unusually
long RCU read-side critical sections and because rcutorture periodically
quiesces the system in order to test RCU's ability to ramp down to and
up from the idle state. This commit therefore replaces the stop_one_cpu()
with stop_one_cpu_nowait(), using an atomic-counter scheme to determine
when all CPUs have passed through the stopped state.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit gets rid of synchronize_sched_expedited()'s mutex_trylock()
polling loop in favor of a funnel-locking scheme based on the rcu_node
tree. The work-done check is done at each level of the tree, allowing
high-contention situations to be resolved quickly with reasonable levels
of mutex contention.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that synchronize_sched_expedited() have a mutex, it can use simpler
work-already-done detection scheme. This commit simplifies this scheme
by using something similar to the sequence-locking counter scheme.
A counter is incremented before and after each grace period, so that
the counter is odd in the midst of the grace period and even otherwise.
So if the counter has advanced to the second even number that is
greater than or equal to the snapshot, the required grace period has
already happened.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The synchronize_sched_expedited() currently invokes try_stop_cpus(),
which schedules the stopper kthreads on each online non-idle CPU,
and waits until all those kthreads are running before letting any
of them stop. This is disastrous for real-time workloads, which
get hit with a preemption that is as long as the longest scheduling
latency on any CPU, including any non-realtime housekeeping CPUs.
This commit therefore switches to using stop_one_cpu() on each CPU
in turn. This avoids inflicting the worst-case scheduling latency
on the worst-case CPU onto all other CPUs, and also simplifies the
code a little bit.
Follow-up commits will simplify the counter-snapshotting algorithm
and convert a number of the counters that are now protected by the
new ->expedited_mutex to non-atomic.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ paulmck: Kept stop_one_cpu(), dropped disabling of "guardrails". ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently if the rcu_fanout_leaf boot parameter is out of bounds (that
is, less than RCU_FANOUT_LEAF or greater than the number of bits in an
unsigned long), a warning is issued and execution continues with the
out-of-bounds value. This can result in all manner of failures, so this
patch resets rcu_fanout_leaf to RCU_FANOUT_LEAF when an out-of-bounds
condition is detected.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although a number of RCU levels may be less than the current
maximum of four, some static data associated with each level
are allocated for all four levels. As result, the extra data
never get accessed and just wast memory. This update limits
count of allocated items to the number of used RCU levels.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Members rcu_state::levelcnt[] and rcu_state::levelspread[]
are only used at init. There is no reason to keep them
afterwards.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Number of items in rcu_capacity[] array is defined by macro
MAX_RCU_LVLS. However, that array is never accessed beyond
RCU_NUM_LVLS index. Therefore, we can limit the array to
RCU_NUM_LVLS items and eliminate MAX_RCU_LVLS. As result,
in most cases the memory is conserved.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Current code suggests that introducing the extra level to
rcu_capacity[] array makes some of the arithmetic easier.
Well, in fact it appears rather confusing and unnecessary.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This update simplifies rcu_init_geometry() code flow
and makes calculation of the total number of rcu_node
structures more easy to read.
The update relies on the fact num_rcu_lvl[] is never
accessed beyond rcu_num_lvls index by the rest of the
code. Therefore, there is no need initialize the whole
num_rcu_lvl[].
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Local variable 'n' mimics 'nr_cpu_ids' while the both are
used within one function. There is no reason for 'n' to
exist whatsoever.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently a condition when RCU tree is unable to accommodate
the configured number of CPUs is not permitted and causes
a fall back to compile-time values. However, the code has no
means to exceed the RCU tree capacity neither at compile-time
nor in run-time. Therefore, if the condition is met in run-
time then it indicates a serios problem elsewhere and should
be handled with a panic.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The RCU_USER_QS Kconfig parameter is now just a synonym for NO_HZ_FULL,
so this commit eliminates RCU_USER_QS, replacing all uses with NO_HZ_FULL.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
"monitonic raw". Also some enhancements to make the ring buffer even
faster. But the biggest and most noticeable change is the renaming of
the ftrace* files, structures and variables that have to deal with
trace events.
Over the years I've had several developers tell me about their confusion
with what ftrace is compared to events. Technically, "ftrace" is the
infrastructure to do the function hooks, which include tracing and also
helps with live kernel patching. But the trace events are a separate
entity altogether, and the files that affect the trace events should
not be named "ftrace". These include:
include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h
Also, functions that are specific for trace events have also been renamed:
ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled()-> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()
Structures have been renamed:
ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call
And a few various variables and flags have also been updated.
This has been sitting in linux-next for some time, and I have not heard
a single complaint about this rename breaking anything. Mostly because
these functions, variables and structures are mostly internal to the
tracing system and are seldom (if ever) used by anything external to that.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJViYhVAAoJEEjnJuOKh9ldcJ0IAI+mytwoMAN/CWDE8pXrTrgs
aHlcr1zorSzZ0Lq6lKsWP+V0VGVhP8KWO16vl35HaM5ZB9U+cDzWiGobI8JTHi/3
eeTAPTjQdgrr/L+ZO1ApzS1jYPhN3Xi5L7xublcYMJjKfzU+bcYXg/x8gRt0QbG3
S9QN/kBt0JIIjT7McN64m5JVk2OiU36LxXxwHgCqJvVCPHUrriAdIX7Z5KRpEv13
zxgCN4d7Jiec/FsMW8dkO0vRlVAvudZWLL7oDmdsvNhnLy8nE79UOeHos2c1qifQ
LV4DeQ+2Hlu7w9wxixHuoOgNXDUEiQPJXzPc/CuCahiTL9N/urQSGQDoOVMltR4=
=hkdz
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This patch series contains several clean ups and even a new trace
clock "monitonic raw". Also some enhancements to make the ring buffer
even faster. But the biggest and most noticeable change is the
renaming of the ftrace* files, structures and variables that have to
deal with trace events.
Over the years I've had several developers tell me about their
confusion with what ftrace is compared to events. Technically,
"ftrace" is the infrastructure to do the function hooks, which include
tracing and also helps with live kernel patching. But the trace
events are a separate entity altogether, and the files that affect the
trace events should not be named "ftrace". These include:
include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h
Also, functions that are specific for trace events have also been renamed:
ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled() -> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()
Structures have been renamed:
ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call
And a few various variables and flags have also been updated.
This has been sitting in linux-next for some time, and I have not
heard a single complaint about this rename breaking anything. Mostly
because these functions, variables and structures are mostly internal
to the tracing system and are seldom (if ever) used by anything
external to that"
* tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
ring_buffer: Allow to exit the ring buffer benchmark immediately
ring-buffer-benchmark: Fix the wrong type
ring-buffer-benchmark: Fix the wrong param in module_param
ring-buffer: Add enum names for the context levels
ring-buffer: Remove useless unused tracing_off_permanent()
ring-buffer: Give NMIs a chance to lock the reader_lock
ring-buffer: Add trace_recursive checks to ring_buffer_write()
ring-buffer: Allways do the trace_recursive checks
ring-buffer: Move recursive check to per_cpu descriptor
ring-buffer: Add unlikelys to make fast path the default
tracing: Rename ftrace_get_offsets_##call() to trace_event_get_offsets_##call()
tracing: Rename ftrace_define_fields_##call() to trace_event_define_fields_##call()
tracing: Rename ftrace_event_type_funcs_##call to trace_event_type_funcs_##call
tracing: Rename ftrace_data_offset_##call to trace_event_data_offset_##call
tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##call
tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled()
tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_*
tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dir
tracing: Rename ftrace_event_name() to trace_event_name()
tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAX
...
This commit applies some warning-omission micro-optimizations to RCU's
various extended-quiescent-state functions, which are on the kernel/user
hotpath for CONFIG_NO_HZ_FULL=y.
Reported-by: Rik van Riel <riel@redhat.com>
Reported by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit updates the initialization of the kthread_prio boot parameter
so that RCU will build even when CONFIG_RCU_KTHREAD_PRIO is undefined.
The kthread_prio boot parameter is set to CONFIG_RCU_KTHREAD_PRIO if
that is defined, otherwise to 1 if CONFIG_RCU_BOOST is defined and
to zero otherwise. This commit then makes CONFIG_RCU_KTHREAD_PRIO
depend on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
CONFIG_RCU_KTHREAD_PRIO unless they want to be.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
This commit introduces an RCU_FANOUT_LEAF C-preprocessor macro so
that RCU will build even when CONFIG_RCU_FANOUT_LEAF is undefined.
The RCU_FANOUT_LEAF macro is set to the value of CONFIG_RCU_FANOUT_LEAF
when defined, otherwise it is set to 32 for 32-bit systems and 64 for
64-bit systems. This commit then makes CONFIG_RCU_FANOUT_LEAF depend
on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
CONFIG_RCU_FANOUT_LEAF unless they want to be.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
This commit introduces an RCU_FANOUT C-preprocessor macro so that RCU will
build even when CONFIG_RCU_FANOUT is undefined. The RCU_FANOUT macro is
set to the value of CONFIG_RCU_FANOUT when defined, otherwise it is set
to 32 for 32-bit systems and 64 for 64-bit systems. This commit then
makes CONFIG_RCU_FANOUT depend on CONFIG_RCU_EXPERT, so that Kconfig
users won't be asked about CONFIG_RCU_FANOUT unless they want to be.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The purpose of this commit is to make it easier to verify that RCU's
combining tree is set up correctly, which is useful to have when making
changes in how that tree is initialized.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
[ paulmck: Fold fix found by Fengguang's 0-day test robot. ]
The CONFIG_RCU_FANOUT_EXACT Kconfig parameter is used primarily (and
perhaps only) by rcutorture to verify that RCU works correctly in specific
rcu_node combining-tree configurations. It therefore does not make
much sense have this as a question to people attempting to configure
their kernels. So this commit creates an rcutree.rcu_fanout_exact=
boot parameter that rcutorture can use, and eliminates the original
CONFIG_RCU_FANOUT_EXACT Kconfig parameter.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Grace-period scans of the rcu_node combining tree normally
proceed quite quickly, so that it is very difficult to reproduce
races against them. This commit therefore allows grace-period
pre-initialization and cleanup to be artificially slowed down,
increasing race-reproduction probability. A pair of pairs of new
Kconfig parameters are provided, RCU_TORTURE_TEST_SLOW_PREINIT to
enable the slowing down of propagating CPU-hotplug changes up the
combining tree along with RCU_TORTURE_TEST_SLOW_PREINIT_DELAY to
specify the delay in jiffies, and RCU_TORTURE_TEST_SLOW_CLEANUP
to enable the slowing down of the end-of-grace-period cleanup scan
along with RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY to specify the delay
in jiffies. Boot-time parameters named rcutree.gp_preinit_delay and
rcutree.gp_cleanup_delay allow these delays to be specified at boot time.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because gcc doesn't realize that rcu_num_lvls must be strictly greater
than zero, some versions give a spurious warning about levelcnt[0] being
uninitialized in rcu_init_one(). This commit updates the condition on
the pre-existing panic() in order to educate gcc on this point.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the larger the gp_init_delay boot parameter, the slower
rcutorture will sequence through grace periods. This commit avoids this
issue by decreasing the probability of slowing initialization of a given
grace period as the degree of slowness increases.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_data structure's ->rcu_qs_ctr_snap field is initialized at
CPU-online time from the current CPU's element of the per-CPU rcu_qs_ctr
variable. Unfortunately, this is at CPU_UP_PREPARE time, so has nothing
to do with the CPU being onlined. This commit therefore initializes
this variable from the incoming CPU's element of rcu_qs_ctr.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because offline CPUs are propagated up the rcu_node tree's ->qsmaskinit
bits just before each grace period starts, the ->qsmaskinit bit cannot
be clear when the corresponding ->qsmask bit is set. Furthermore, this
condition used to correspond to a CPU that was on its way offline, and
making RCU's notion of an offline CPU more precise has eliminated this
situation. This commit therefore removes the now-redundant offline
check from force_qs_rnp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>