Consider the following sequence of events in a PREEMPT=y kernel:
1. All but one of the CPUs corresponding to a given leaf rcu_node
structure go offline. Each of these CPUs clears its bit in that
structure's ->qsmaskinitnext field.
2. A new grace period starts, and rcu_gp_init() scans the leaf
rcu_node structures, applying CPU-hotplug changes since the
start of the previous grace period, including those changes in
#1 above. This copies each leaf structure's ->qsmaskinitnext
to its ->qsmask field, which represents the CPUs that this new
grace period will wait on. Each copy operation is done holding
the corresponding leaf rcu_node structure's ->lock, and at the
end of this scan, rcu_gp_init() holds no locks.
3. The last CPU corresponding to #1's leaf rcu_node structure goes
offline, clearing its bit in that structure's ->qsmaskinitnext
field, but not touching the ->qsmaskinit field. Note that
rcu_gp_init() is not currently holding any locks! This CPU does
-not- report a quiescent state because the grace period has not
yet initialized itself sufficiently to have set any bits in any
of the leaf rcu_node structures' ->qsmask fields.
4. The rcu_gp_init() function continues initializing the new grace
period, copying each leaf rcu_node structure's ->qsmaskinit
field to its ->qsmask field while holding the corresponding ->lock.
This sets the ->qsmask bit corresponding to #3's CPU.
5. Before the grace period ends, #3's CPU comes back online.
Because the grace period has not yet done any force-quiescent-state
scans (which would report a quiescent state on behalf of any
offline CPUs), this CPU's ->qsmask bit is still set.
6. A task running on the newly onlined CPU is preempted while in
an RCU read-side critical section. Because this CPU's ->qsmask
bit is set, not only does this task queue itself on the leaf
rcu_node structure's ->blkd_tasks list, it also sets that
structure's ->gp_tasks pointer to reference it.
7. The grace period started in #2 above comes to an end. This
results in rcu_gp_cleanup() being invoked, which, among other
things, checks to make sure that there are no tasks blocking the
just-ended grace period, that is, that all ->gp_tasks pointers
are NULL. The ->gp_tasks pointer corresponding to the task
preempted in #6 above is non-NULL, which results in a splat.
This splat is a false positive. The task's RCU read-side critical
section cannot have begun before the just-ended grace period because
this would mean either: (1) The CPU came online before the grace period
started, which cannot have happened because the grace period started
before that CPU went offline, or (2) The task started its RCU read-side
critical section on some other CPU, but then it would have had to have
been preempted before migrating to this CPU, which would mean that it
would have instead queued itself on that other CPU's rcu_node structure.
RCU's grace periods are thus working correctly. Or, more accurately,
any remaining bugs in RCU's grace periods are elsewhere.
This commit eliminates this false positive by adding code to the end
of rcu_cpu_starting() that reports a quiescent state to RCU, which has
the side-effect of clearing that CPU's ->qsmask bit, preventing the
above scenario. This approach has the added benefit of more promptly
reporting quiescent states corresponding to offline CPUs. Nevertheless,
this commit does -not- remove the need for the force-quiescent-state
scans to check for offline CPUs, given that a CPU might remain offline
indefinitely. And without the checks in the force-quiescent-state scans,
the grace period would also persist indefinitely, which could result in
hangs or memory exhaustion.
Note well that the call to rcu_report_qs_rnp() reporting the quiescent
state must come -after- the setting of this CPU's bit in the leaf rcu_node
structure's ->qsmaskinitnext field. Otherwise, lockdep-RCU will complain
bitterly about quiescent states coming from an offline CPU.
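As a rough illustration of the ordering constraint and of why the early
report prevents the splat, here is a minimal user-space sketch. The
struct leaf_model type and the model_*() functions are hypothetical
stand-ins rather than kernel code, and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one leaf rcu_node structure. */
struct leaf_model {
	unsigned long qsmask;         /* CPUs the current grace period waits on */
	unsigned long qsmaskinitnext; /* CPUs that RCU considers online */
	bool gp_tasks;                /* a preempted reader blocks the current GP */
};

/* Model of a CPU coming online, including the quiescent-state report. */
static void model_cpu_starting(struct leaf_model *rnp, int cpu)
{
	unsigned long mask;

	assert(cpu >= 0 && cpu < (int)(8 * sizeof(unsigned long)));
	mask = 1UL << cpu;

	/* First mark the CPU online, so the report comes from an online CPU. */
	rnp->qsmaskinitnext |= mask;

	/* Then report the quiescent state, clearing any stale ->qsmask bit. */
	if (rnp->qsmask & mask)
		rnp->qsmask &= ~mask;
}

/* Model of a reader being preempted on the given CPU. */
static void model_preempt_reader(struct leaf_model *rnp, int cpu)
{
	/* Only a CPU that the grace period still waits on sets ->gp_tasks. */
	if (rnp->qsmask & (1UL << cpu))
		rnp->gp_tasks = true;
}
```

In this model, a CPU whose ->qsmask bit was left set while it was offline
clears that bit in model_cpu_starting(), so a later call to
model_preempt_reader() leaves ->gp_tasks untouched.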
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Consider the following sequence of events in a PREEMPT=y kernel:
1. All CPUs corresponding to a given rcu_node structure go offline.
A new grace period starts just after the CPU-hotplug code path
does its synchronize_rcu() for the last CPU, so at least this
CPU is present in that structure's ->qsmask.
2. Before the grace period ends, a CPU comes back online, and not
just any CPU, but the one corresponding to a non-zero bit in
the leaf rcu_node structure's ->qsmask.
3. A task running on the newly onlined CPU is preempted while in
an RCU read-side critical section. Because this CPU's ->qsmask
bit is set, not only does this task queue itself on the leaf
rcu_node structure's ->blkd_tasks list, it also sets that
structure's ->gp_tasks pointer to reference it.
4. The grace period started in #1 above comes to an end. This
results in rcu_gp_cleanup() being invoked, which, among other
things, checks to make sure that there are no tasks blocking the
just-ended grace period, that is, that all ->gp_tasks pointers
are NULL. The ->gp_tasks pointer corresponding to the task
preempted in #3 above is non-NULL, which results in a splat.
This splat is a false positive. The task's RCU read-side critical
section cannot have begun before the just-ended grace period because
this would mean either: (1) The CPU came online before the grace period
started, which cannot have happened because the grace period started
before that CPU was all the way offline, or (2) The task started its
RCU read-side critical section on some other CPU, but then it would
have had to have been preempted before migrating to this CPU, which
would mean that it would have instead queued itself on that other CPU's
rcu_node structure.
This commit eliminates this false positive by adding code to the end
of rcu_cleanup_dying_idle_cpu() that reports a quiescent state to RCU,
which has the side-effect of clearing that CPU's ->qsmask bit, preventing
the above scenario. This approach has the added benefit of more promptly
reporting quiescent states corresponding to offline CPUs.
Note well that the call to rcu_report_qs_rnp() reporting the quiescent
state must come -before- the clearing of this CPU's bit in the leaf
rcu_node structure's ->qsmaskinitnext field. Otherwise, lockdep-RCU
will complain bitterly about quiescent states coming from an offline CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_lockdep_current_cpu_online() function currently checks only the
RCU-sched data structures to determine whether or not RCU believes that a
given CPU is offline. Unfortunately, there are multiple flavors of RCU,
which means that there is a short window of time during which the various
flavors disagree as to whether or not a given CPU is offline. This can
result in false-positive lockdep-RCU splats in which some other flavor
of RCU tries to do something based on its view that the CPU is online,
only to get hit with a lockdep-RCU splat because RCU-sched instead
believes that the CPU is offline.
This commit therefore changes rcu_lockdep_current_cpu_online() to scan
all RCU flavors and to consider a given CPU to be online if any of the
RCU flavors believe it to be online, thus preventing these false-positive
splats.
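The "online if any flavor says so" rule can be sketched in user-space C
as follows. The struct flavor_model type and its per-flavor bitmask are
assumptions made for illustration only. The kernel tracks this state in
its rcu_node structures instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-flavor view: which CPUs this flavor believes are online. */
struct flavor_model {
	const char *name;
	unsigned long online_mask;
};

/* A CPU counts as online if at least one flavor believes it to be online. */
static bool model_cpu_online_any_flavor(const struct flavor_model *flavors,
					size_t n, int cpu)
{
	assert(cpu >= 0 && cpu < (int)(8 * sizeof(unsigned long)));
	for (size_t i = 0; i < n; i++)
		if (flavors[i].online_mask & (1UL << cpu))
			return true;
	return false;
}
```

During the window in which the flavors disagree, this check errs on the
side of "online", which is what avoids the false-positive splats.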
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The force_qs_rnp() function checks for ->qsmask being all zero, that is,
all CPUs for the current rcu_node structure having already passed through
quiescent states. But with RCU-preempt, this is not sufficient to report
quiescent states further up the tree, so there are further checks that
can initiate RCU priority boosting and also for races with CPU-hotplug
operations. However, if neither of these further checks apply, the code
proceeds to carry out a useless scan of an all-zero ->qsmask.
This commit therefore adds code to release the current rcu_node
structure's lock and continue on to the next rcu_node structure, thereby
avoiding this useless scan.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit gets rid of the smp_wmb() in record_gp_stall_check_time()
in favor of an smp_store_release().
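The general pattern can be sketched with C11 atomics in user space.
This sketch assumes that the barrier orders a plain store to a
start-time field before the store that publishes the stall deadline.
The struct stall_model type, its field layout, and the model_*()
functions are hypothetical, and the C11 release store merely stands in
for smp_store_release().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the stall-timing fields. */
struct stall_model {
	unsigned long gp_start;              /* plain field, written first */
	_Atomic unsigned long jiffies_stall; /* published with release ordering */
};

static void model_record_stall_time(struct stall_model *s,
				    unsigned long now, unsigned long timeout)
{
	assert(s);
	s->gp_start = now;
	/* Release store: the ->gp_start write above is ordered before it. */
	atomic_store_explicit(&s->jiffies_stall, now + timeout,
			      memory_order_release);
}

static unsigned long model_read_stall_time(struct stall_model *s,
					   unsigned long *gp_start)
{
	/* Acquire load pairs with the release store in the writer. */
	unsigned long js = atomic_load_explicit(&s->jiffies_stall,
						memory_order_acquire);

	*gp_start = s->gp_start;
	return js;
}
```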
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit fixes a typo and adds some additional debugging to the
message emitted when a task blocking the current grace period is listed
as blocking it when either that grace period ends or the next grace
period begins. This commit also reformats the console message for
readability.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If rcu_report_unblock_qs_rnp() is invoked on something other than
preemptible RCU or if there are still preempted tasks blocking the
current grace period, something went badly wrong in the caller.
This commit therefore adds WARN_ON_ONCE() to these conditions, while
leaving the legitimate reason for early exit (rnp->qsmask != 0)
unwarned.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, rcu_init_new_rnp() walks up the rcu_node combining tree,
setting bits in the ->qsmaskinit fields on the way up. It walks up
unconditionally, regardless of the initial state of these bits. This is
OK because only the corresponding RCU grace-period kthread ever tests
or sets these bits during runtime. However, it is also pointless, and
it increases both memory and lock contention (albeit only slightly), so
this commit stops the walk as soon as an already-set bit is encountered.
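A user-space sketch of the shortened walk follows. The struct node_model
type and model_init_new_rnp() are hypothetical, locking is omitted, and
the stopping rule (stop once a parent's mask was already non-zero before
the update) is one plausible reading of "already-set bit".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a node in the combining tree. */
struct node_model {
	struct node_model *parent;
	unsigned long grpmask;    /* this node's bit in its parent's mask */
	unsigned long qsmaskinit; /* children that have online CPUs */
};

/* Propagate "has online CPUs" upward; return the number of nodes updated. */
static int model_init_new_rnp(struct node_model *leaf)
{
	struct node_model *rnp = leaf;
	int updated = 0;

	assert(leaf);
	while (rnp->parent != NULL) {
		unsigned long mask = rnp->grpmask;
		unsigned long oldmask;

		rnp = rnp->parent;
		oldmask = rnp->qsmaskinit;
		rnp->qsmaskinit |= mask;
		updated++;
		/* A previously non-zero mask means the ancestors are already set. */
		if (oldmask)
			break;
	}
	return updated;
}
```

With a root, one interior node, and two leaves, the first leaf to come
online updates both ancestors, while the second updates only the
interior node.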
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Back in the old days, when grace-period initialization blocked CPU
hotplug, the ->qsmaskinit mask was indeed updated at the time that
a given CPU went offline. However, with the deferral of these updates
until the beginning of the next grace period in commit 0aa04b055e
("rcu: Process offlining and onlining only at grace-period start"),
it is instead ->qsmaskinitnext that gets updated at that time.
This commit therefore updates the obsolete comment. It also fixes
punctuation while on the topic of comments mentioning ->qsmaskinit.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 0aa04b055e ("rcu: Process offlining and onlining only at
grace-period start") deferred handling of CPU-hotplug events until the
start of the next grace period, but consider the following sequence
of events:
1. A task is preempted within an RCU-preempt read-side critical
section.
2. The CPU that this task was running on goes offline, along with all
other CPUs sharing the corresponding leaf rcu_node structure.
3. The task resumes execution.
4. One of those CPUs comes back online before a new grace period starts.
In step 2, the code in the next rcu_gp_init() invocation will (correctly)
defer removing the leaf rcu_node structure from the upper-level bitmasks,
and will (correctly) set that structure's ->wait_blkd_tasks field. During
the ensuing interval, RCU will (correctly) track the tasks preempted on
that structure because they must block any subsequent grace period.
In step 3, the code in rcu_read_unlock_special() will (correctly) remove
the task from the leaf rcu_node structure. From this point forward, RCU
need not pay attention to this structure, at least not until one of the
corresponding CPUs comes back online.
In step 4, the code in the next rcu_gp_init() invocation will
(incorrectly) invoke rcu_init_new_rnp(). This is incorrect because
the corresponding rcu_cleanup_dead_rnp() was never invoked. This is
nevertheless harmless because the upper-level bits are still set.
So, no harm, no foul, right?
At least, all is well until a little further into that rcu_gp_init()
invocation, which will notice that there are no longer any tasks blocked
on the leaf rcu_node structure, conclude that there is no longer anything
left over from step 2's offline operation, and will therefore invoke
rcu_cleanup_dead_rnp(). But this invocation of rcu_cleanup_dead_rnp()
is for the beginning of the earlier offline interval, and the previous
invocation of rcu_init_new_rnp() is for the end of that same interval.
That is right, they are invoked out of order.
That cannot be good, can it?
It turns out that this is not a (correctness!) problem because
rcu_cleanup_dead_rnp() checks to see if any of the corresponding CPUs
are online, and refuses to do anything if so. In other words, in the
case where rcu_init_new_rnp() and rcu_cleanup_dead_rnp() execute out of
order, they both have no effect.
But this is at best an accident waiting to happen.
This commit therefore adds logic to rcu_gp_init() so that
rcu_init_new_rnp() and rcu_cleanup_dead_rnp() are always invoked in
order, and so that neither is invoked at all in cases where RCU had to
pay attention to the leaf rcu_node structure during the entire time that
all corresponding CPUs were offline.
And, while in the area, this commit reduces confusion by using formal
parameters rather than local variables that just happen to have the same
value at that particular point in the code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There's no need to keep checking the same starting node for whether a
grace period is in progress as we advance up the funnel lock loop. It's
sufficient to check it once at the start, and then to check the internal
nodes as we advance up the combining tree. This also makes sense because
grace-period updates propagate from the root to the leaves, so there's
a chance we may find that a grace period has started as we advance up;
let's check for that.
Reported-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The funnel locking loop in rcu_start_this_gp() uses rcu_root as a
temporary variable while walking the combining tree. This forces the
reader to keep remembering that rcu_root may not actually be the root.
Let's just call it rnp, and rename other variables as well to be more
appropriate.
Original patch: https://patchwork.kernel.org/patch/10396577/
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix name in comment as well. ]
The name 'c' is used for variables and parameters holding the requested
grace-period sequence number. However it is no longer very meaningful
given the conversions from ->gpnum and (especially) ->completed to
->gp_seq. This commit therefore renames 'c' to 'gp_seq_req'.
Previous patch discussion is at:
https://patchwork.kernel.org/patch/10396579/
Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_data structure's ->gpwrap indicator is currently reset only
when the CPU in question detects a new grace period. This is in theory
sufficient because any CPU that has been out of action for long enough
that its ->gpwrap indicator is set is guaranteed to see both the end
of an old grace period and the start of a new one.
However, the current code leaves a short window during which the ->gpwrap
indicator has been reset but the corresponding ->gp_seq counter has not
yet been brought up to date. This is harmless because interrupts are
disabled, but it is likely to (at the very least) cause confusion.
This commit therefore moves the resetting of ->gpwrap to follow the
updating of ->gp_seq. While in the area, it also resets ->gp_seq_needed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The new ->gp_seq grace-period sequence numbers must be shifted down,
which gives artifacts when these numbers wrap. This commit therefore
enables rcutorture and rcuperf to handle grace-period sequence numbers
even if they do wrap. It does this by allowing a special subtraction
function to be specified, and this function subtracts before shifting.
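The difference between the two orders of operation can be seen in a small
user-space sketch. It assumes two low-order state bits, and both helper
functions are hypothetical illustrations rather than the functions that
this commit adds.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_SEQ_CTR_SHIFT 2 /* assumed number of low-order state bits */

/* Shift first, then subtract: produces an artifact when the counter wraps. */
static unsigned long model_diff_shift_first(unsigned long newer,
					    unsigned long older)
{
	return (newer >> MODEL_SEQ_CTR_SHIFT) - (older >> MODEL_SEQ_CTR_SHIFT);
}

/* Subtract first, then shift: modular subtraction absorbs the wrap. */
static unsigned long model_diff_subtract_first(unsigned long newer,
					       unsigned long older)
{
	return (newer - older) >> MODEL_SEQ_CTR_SHIFT;
}
```

For example, with older = ULONG_MAX - 3 and newer = 4, the counter has
advanced twice across the wrap. Subtracting first gives 8 >> 2 = 2,
whereas shifting first gives a huge bogus value because the shifted
operands no longer wrap at the same point as the unshifted ones.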
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In the old days of ->gpnum and ->completed, the code requesting a new
grace period checked to see if that grace period had already started,
bailing early if so. The new-age ->gp_seq approach instead checks
whether the grace period has already finished. A compensating change
pushed the requested grace period down to the bottom of the tree, thus
reducing lock contention and even eliminating it in some cases. But why
not further reduce contention, especially on large systems, by doing both,
especially given that the cost of doing both is extremely small?
This commit therefore adds a new rcu_seq_started() function that checks
whether a specified grace period has already started. It then uses
this new function in place of rcu_seq_done() in the rcu_start_this_gp()
function's funnel locking code.
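The distinction between "started" and "done" can be sketched in user
space as shown below. The helpers are modeled on the sequence-number
scheme described above, assuming two low-order state bits and
wrap-tolerant comparisons. The model_*() names are hypothetical and the
kernel's actual definitions may differ in detail.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_SEQ_CTR_SHIFT  2
#define MODEL_SEQ_STATE_MASK ((1UL << MODEL_SEQ_CTR_SHIFT) - 1)

/* Wrap-tolerant comparisons on unsigned long sequence numbers. */
static bool model_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

static bool model_cmp_lt(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 < a - b;
}

/* Sequence value at which a full grace period starting after "cur" has ended. */
static unsigned long model_seq_snap(unsigned long cur)
{
	return (cur + 2 * MODEL_SEQ_STATE_MASK + 1) & ~MODEL_SEQ_STATE_MASK;
}

/* Has the grace period corresponding to snapshot "s" already ended? */
static bool model_seq_done(unsigned long cur, unsigned long s)
{
	return model_cmp_ge(cur, s);
}

/* Has the grace period corresponding to snapshot "s" at least started? */
static bool model_seq_started(unsigned long cur, unsigned long s)
{
	return model_cmp_lt((s - 1) & ~MODEL_SEQ_STATE_MASK, cur);
}
```

With the counter idle at 8, a snapshot yields 12. Once the counter moves
to 9, the requested grace period has started but is not yet done, so a
funnel-locking walk that checks "started" can bail out earlier than one
that checks "done".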
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The "cpustart" trace event shows a stale gp_seq. This is because it uses
rdp->gp_seq, which is updated only at the end of the __note_gp_changes()
function. This commit therefore instead uses rnp->gp_seq.
An alternative fix would be to update rdp->gp_seq earlier, but this would
break RCU's detection of the beginning of a new-to-this-CPU grace period.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently Tree RCU's clean-up code emits a "CleanupMore" trace event in
response to late-arriving grace-period requests even if the grace period
was already requested. This makes "CleanupMore" show up an extra time (in
addition to once for each rcu_node structure that was previously marked
with the request), and for no good reason. This commit therefore avoids
emitting this trace message unless the only request for this next
grace period arrived during or after the cleanup scan of the rcu_node
structures.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The old grace-period start code would acquire only the leaf's rcu_node
structure's ->lock if that structure believed that a grace period was
in progress. The new code advances to the leaf's parent in this case,
needlessly acquiring the leaf's parent's ->lock. This commit therefore
checks the grace-period state after marking the leaf with the need for
the specified grace period, and if the leaf believes that a grace period
is in progress, takes an early exit.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Add "Startedleaf" tracing as suggested by Joel Fernandes. ]
Now that the rcu_data structure contains ->gp_seq_needed, create an
rcu_accelerate_cbs_unlocked() helper function that locklessly checks to
see if new callbacks' required grace period has already been requested.
If so, update the callback list locally and again locklessly. (Though
interrupts must be and are disabled to avoid racing with conflicting
updates in interrupt handlers.)
Otherwise, call rcu_accelerate_cbs() as before.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that everything has been converted to use ->gp_seq instead of
->gpnum and ->completed, this commit removes ->gpnum and ->completed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the rcu_quiescent_state_report tracepoint use ->gp_seq
instead of ->gpnum.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the rcu_unlock_preempted_task tracepoint use ->gp_seq
instead of ->gpnum.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the rcu_future_grace_period tracepoint use gp_seq
instead of ->gpnum and ->completed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the rcu_grace_period tracepoint use gp_seq instead
of ->gpnum or ->completed. It also introduces a "cpuofl-bgp" string to
less obscurely indicate when a CPU has gone offline while a grace period
is waiting on it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_nocb_wait_gp() check rdp->gp_seq_needed to see
if the current CPU already knows about the needed grace period having
already been requested. If so, it avoids acquiring the corresponding
leaf rcu_node structure's ->lock, thus decreasing contention. This
optimization is intended for cases where either multiple leader rcuo
kthreads are running on the same CPU or these kthreads are running on
a non-offloaded (e.g., housekeeping) CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Move lock release past "if" as suggested by Joel Fernandes. ]
[ paulmck: Fix caching of furthest-future requested grace period. ]
One problem with the ->need_future_gp[] array is that the grace-period
assignment of each element changes as the grace periods complete.
This means that it is necessary to hold a lock when checking this
array to learn if a given grace period has already been requested.
This increases lock contention, which is the opposite of helpful.
This commit therefore replaces the ->need_future_gp[] with a single
->gp_seq_needed value and keeps it updated in the rcu_data structure.
This will enable reliable lockless checking of whether or not a given
grace period has already been requested.
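The kind of lockless check that a single ->gp_seq_needed value permits
can be sketched in user-space C11 as follows. The struct rdp_model type
and the model_*() functions are hypothetical, and relaxed atomics stand
in for the kernel's READ_ONCE() and WRITE_ONCE().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU record of the furthest-future requested grace period. */
struct rdp_model {
	_Atomic unsigned long gp_seq_needed;
};

/* Wrap-tolerant "a >= b" on sequence numbers. */
static bool model_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* Lockless check: has grace period "s" (or a later one) been requested? */
static bool model_gp_already_requested(struct rdp_model *rdp, unsigned long s)
{
	unsigned long needed = atomic_load_explicit(&rdp->gp_seq_needed,
						    memory_order_relaxed);

	return model_cmp_ge(needed, s);
}

/* Record a request, never moving the recorded value backwards. */
static void model_record_request(struct rdp_model *rdp, unsigned long s)
{
	assert(rdp);
	if (!model_gp_already_requested(rdp, s))
		atomic_store_explicit(&rdp->gp_seq_needed, s,
				      memory_order_relaxed);
}
```

Because the recorded value only ever advances, a stale read can cause
an unnecessary trip to the locked slow path but cannot cause a needed
request to be skipped in this model.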
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
SRCU has long used ->srcu_gp_seq, and now RCU uses ->gp_seq. This
commit therefore moves the rcutorture_get_gp_data() function from
a ->gpnum / ->completed pair to ->gp_seq.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes the RCU CPU stall-warning code in print_other_cpu_stall(),
print_cpu_stall(), and check_cpu_stall() use ->gp_seq instead of ->gpnum
and ->completed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit converts the grace-period request code paths from ->completed
and ->gpnum to ->gp_seq. The need_future_gp_element() macro encapsulates
the shift operation required to use ->gp_seq as an index to the
->need_future_gp[] array. The rcu_cbs_completed() function is removed
in favor of the rcu_seq_snap() function. The rcu_start_this_gp()
gets some temporary consistency checks and uses rcu_seq_done(),
rcu_seq_current(), rcu_seq_state(), and rcu_gp_in_progress() in place
of the earlier open-coded comparisons of ->gpnum and ->completed.
The rcu_future_gp_cleanup() function replaces use of ->completed
with ->gp_seq. The rcu_accelerate_cbs() function replaces a call to
rcu_cbs_completed() with one to rcu_seq_snap(). The rcu_advance_cbs()
function replaces an access to ->completed with one to ->gp_seq and adds
some temporary warnings. The rcu_nocb_wait_gp() function replaces a
call to rcu_cbs_completed() with one to rcu_seq_snap() and an open-coded
comparison with rcu_seq_done().
The temporary warnings will be removed when the various ->gpnum and
->completed fields are removed. Their purpose is to locate code that
might still be using ->gpnum and ->completed. (Much easier that way
than trying to trace down the causes of too-short grace periods and
grace-period hangs!)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit switches the quiescent-state no-backtracking checks from
->gpnum and ->completed to ->gp_seq.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit switches the interrupt-disabled detection mechanism to
->gp_seq. This mechanism is used as part of RCU CPU stall warnings,
and detects cases where the stall is due to a CPU having interrupts
disabled.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_gp_in_progress() use ->gp_seq instead of
->completed and ->gpnum. The READ_ONCE() invocations are buried
in rcu_seq_current().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_try_advance_all_cbs() use ->gp_seq. It uses
rcu_seq_ctr() in order to shift away the state bits, so that the
low-order bits of the result may safely be used to index ->nocb_gp_wq[].
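Why the shift matters for indexing can be shown with a short user-space
sketch. It assumes two low-order state bits and a two-element wait-queue
array. The model_*() helpers are hypothetical.

```c
#include <assert.h>

#define MODEL_SEQ_CTR_SHIFT 2 /* assumed number of low-order state bits */

/* Counter portion of a sequence number, with the state bits shifted away. */
static unsigned long model_seq_ctr(unsigned long s)
{
	return s >> MODEL_SEQ_CTR_SHIFT;
}

/* Index into an assumed two-element array, alternating per grace period. */
static unsigned int model_wq_index(unsigned long gp_seq)
{
	return (unsigned int)(model_seq_ctr(gp_seq) & 0x1);
}
```

Without the shift, the values 8 (idle) and 9 (grace period in progress)
would select different elements even though they carry the same counter
value. With the shift, both select element 0 and the next counter
value, 12, selects element 1.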
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_try_advance_all_cbs() use ->gp_seq, with the
exception of tracing, which will be converted later.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit makes rcu_implicit_dynticks_qs() use ->gp_seq, with the
exception of tracing, which will be converted later.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit converts rcu_gpnum_ovf() to use ->gp_seq instead of ->gpnum.
Same size unsigned long, so same approach.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves __note_gp_changes(), note_gp_changes(), and
__rcu_pending() to ->gp_seq, creating new rcu_seq_completed_gp() and
rcu_seq_new_gp() functions for this purpose.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Reinstate "cpuend" trace as suggested by Joel Fernandes. ]
This commit converts get_state_synchronize_rcu(), cond_synchronize_rcu(),
get_state_synchronize_sched(), and cond_synchronize_sched() from ->gpnum
and ->completed to ->gp_seq. Note that this also introduces a full
memory barrier in the already-done paths of cond_synchronize_rcu() and
cond_synchronize_sched(), as work with LKMM indicates that the earlier
smp_load_acquire() invocations were insufficiently strong in some situations where
these two functions were called just as the grace period ended. In such
cases, these two functions would not gain the benefit of memory ordering
at the end of the grace period.
Please note that the performance impact is negligible, as you shouldn't
be using either function anywhere near a fastpath in any case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit switches the functions reporting quiescent states from
use of ->gpnum to ->gp_seq. In either case, the point is to handle
races where a given grace period ends before a quiescent state can
be reported. Failing to catch these races would result in too-short
grace periods, hence the checking.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit switches rcu_check_gp_kthread_starvation() from printing
->gpnum and ->completed to printing ->gp_seq upon detecting a starving
RCU grace-period kthread during an RCU CPU stall warning.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcutorture test invokes rcu_batches_started(),
rcu_batches_completed(), rcu_batches_started_bh(),
rcu_batches_completed_bh(), rcu_batches_started_sched(), and
rcu_batches_completed_sched() to do grace-period consistency checks,
and rcuperf uses the _completed variants for statistics.
These functions use ->gpnum and ->completed. This commit therefore
replaces them with rcu_get_gp_seq(), rcu_bh_get_gp_seq(), and
rcu_sched_get_gp_seq(), adjusting rcutorture and rcuperf to make
use of them.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit moves rcu_gp_slow() to ->gp_seq. This function only uses
the grace-period number to modulate delay, so rcu_seq_ctr(rsp->gp_seq)
gets the same effect, at least in cases where the delay is to happen
more than four times per wrap of an unsigned long.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds grace-period sequence numbers (->gp_seq) to the
rcu_state, rcu_node, and rcu_data structures, and updates them.
It also checks for consistency between rsp->gpnum and rsp->gp_seq.
These ->gp_seq counters will eventually replace the existing ->gpnum
and ->completed counters, allowing a single memory access to determine
whether or not a grace period is in progress and if so, which one.
This in turn will enable changes that will reduce ->lock contention on
the leaf rcu_node structures.
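How one value can answer both questions can be sketched in user space.
This sketch assumes that the low-order two bits of the sequence number
hold the grace-period state and that the remaining bits hold the
counter. The struct gp_view type and model_decode_gp_seq() are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SEQ_CTR_SHIFT  2
#define MODEL_SEQ_STATE_MASK ((1UL << MODEL_SEQ_CTR_SHIFT) - 1)

/* Decoded view of a single sequence-number read. */
struct gp_view {
	bool in_progress;  /* non-zero state bits: a grace period is running */
	unsigned long ctr; /* counter portion identifying the grace period */
};

/* Decode one snapshot of the sequence number into both pieces. */
static struct gp_view model_decode_gp_seq(unsigned long gp_seq)
{
	struct gp_view v;

	v.in_progress = (gp_seq & MODEL_SEQ_STATE_MASK) != 0;
	v.ctr = gp_seq >> MODEL_SEQ_CTR_SHIFT;
	return v;
}
```

With separate ->gpnum and ->completed counters, the same information
requires two reads that can be torn by an intervening update.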
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
At the end of rcu_gp_cleanup(), if another grace period is needed, but
not via rcu_accelerate_cbs(), the ->gp_flags field is written twice,
once when making the new grace-period request, and once when clearing
all other types of requests. This commit therefore adds an else-clause
to avoid this double write.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit causes a splat if RCU is idle and a request for a new grace
period is ignored for more than one second. This splat normally indicates
that some code path asked for a new grace period, but failed to wake up
the RCU grace-period kthread.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix bug located by Dan Carpenter and his static checker. ]
[ paulmck: Fix self-deadlock bug located by 0day test robot. ]
[ paulmck: Disable unless CONFIG_PROVE_RCU=y. ]
Currently, the parallelized initialization of expedited grace periods uses
the workqueue associated with each rcu_node structure's ->grplo field.
This works fine unless that CPU is offline. This commit therefore uses
the lowest-numbered online CPU corresponding to this rcu_node structure,
or just queues the work on WORK_CPU_UNBOUND if none of that structure's
CPUs are online.
Note that this patch uses cpu_is_offline() instead of the usual approach
of checking bits in the rcu_node structure's ->qsmaskinitnext field. This
is safe because preemption is disabled across both the cpu_is_offline()
check and the call to queue_work_on().
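The selection policy can be sketched in user space as follows. The
online bitmask, the model_pick_work_cpu() function, and the
MODEL_WORK_CPU_UNBOUND value are hypothetical stand-ins. In the kernel,
the check is cpu_is_offline() and the fallback is WORK_CPU_UNBOUND, both
used with preemption disabled.

```c
#include <assert.h>

#define MODEL_WORK_CPU_UNBOUND (-1) /* hypothetical "any CPU" marker */

/*
 * Pick the lowest-numbered online CPU in [grplo, grphi], falling back
 * to the unbound marker if every CPU in the range is offline.
 */
static int model_pick_work_cpu(int grplo, int grphi, unsigned long online_mask)
{
	assert(grplo >= 0 && grplo <= grphi);
	assert(grphi < (int)(8 * sizeof(unsigned long)));

	for (int cpu = grplo; cpu <= grphi; cpu++)
		if (online_mask & (1UL << cpu))
			return cpu;
	return MODEL_WORK_CPU_UNBOUND;
}
```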
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
[ paulmck: Disable preemption to close offline race window. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply Peter Zijlstra feedback on CPU selection. ]
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
There is a two-jiffy delay between the time that a CPU will self-report
an RCU CPU stall warning and the time that some other CPU will report a
warning on behalf of the first CPU. This has worked well in the past,
but on busy systems, it is possible for the two warnings to overlap,
which makes interpreting them extremely difficult.
This commit therefore uses a cmpxchg-based timing decision that
allows only one report in a given one-minute period (assuming default
stall-warning Kconfig parameters). This approach will of course fail
if you are seeing minute-long vCPU preemption, but in that case the
overlapping RCU CPU stall warnings are the least of your worries.
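A compare-and-exchange decision of this kind can be sketched with C11
atomics in user space. The model_try_report() function, its parameters,
and the holdoff value are hypothetical. The point is only that, of all
would-be reporters that sampled the same deadline, exactly one succeeds
in advancing it.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to claim the right to print a stall warning.
 * "snapshot" is the deadline that this would-be reporter sampled earlier.
 * Returns true only for the one caller that advances the deadline.
 */
static bool model_try_report(_Atomic unsigned long *jiffies_stall,
			     unsigned long snapshot, unsigned long now,
			     unsigned long holdoff)
{
	assert(jiffies_stall);

	/* Wrap-tolerant check: the deadline has not yet been reached. */
	if (now - snapshot > ULONG_MAX / 2)
		return false;

	/* Succeeds only if nobody else has moved the deadline meanwhile. */
	return atomic_compare_exchange_strong(jiffies_stall, &snapshot,
					      now + holdoff);
}
```

If two CPUs both sample a deadline of 100 and both try at time 102, the
first moves the deadline to 102 plus the holdoff and reports, and the
second finds the deadline changed and stays quiet.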
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Sparse reported this:
| kernel/rcu/tree_plugin.h:814:9: warning: incorrect type in argument 1 (different modifiers)
| kernel/rcu/tree_plugin.h:814:9: expected struct lockdep_map const *lock
| kernel/rcu/tree_plugin.h:814:9: got struct lockdep_map [noderef] *<noident>
This is caused by using vanilla lockdep annotations on rcu_node::lock,
and that requires accessing ->lock of rcu_node directly. However we need
to keep rcu_node::lock __private to avoid breaking its extra ordering
guarantee. And we have a dedicated lockdep annotation for
rcu_node::lock, so use it.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp()) in
rcu_gp_cleanup() triggers (inexplicably, of course) every so often.
This commit therefore extracts more information.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds "#define pr_fmt(fmt) fmt" to the torture-test files
in order to keep the current dmesg format. Once Joe's commits have
hit mainline, these definitions will be changed in order to automatically
generate the dmesg line prefix that the scripts expect. This will have
the beneficial side-effect of allowing printk() formats to be used more
widely and of shortening some pr_*() lines.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Joe Perches <joe@perches.com>
Some bugs reproduce quickly only at high CPU-hotplug rates, so the
rcutorture TREE03 scenario now has only 200 milliseconds spacing between
CPU-hotplug operations. At this rate, the torture-test pair of console
messages per operation becomes a bit voluminous. This commit therefore
converts the torture-test set of "verbose" kernel-boot arguments from
bool to int, and prints the extra console messages only when verbose=2.
The default is still verbose=1.
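A minimal sketch of the bool-to-int conversion follows. The helper
names are hypothetical, and treating every level above 1 as "extra"
is an assumption of this sketch; the text above only specifies the
behavior for verbose=1 and verbose=2.

```c
#include <stdbool.h>

/* Verbosity level, formerly a bool: 0 = quiet, 1 = default, 2 = extra. */
static int verbose = 1;

/* Hypothetical setter standing in for the kernel-boot argument. */
void set_verbose(int level)
{
	verbose = level;
}

/* Ordinary verbose messages appear at any nonzero level. */
bool verbose_enabled(void)
{
	return verbose > 0;
}

/* Per-CPU-hotplug-operation messages appear only at level 2 (or above). */
bool verbose_hotplug_enabled(void)
{
	return verbose > 1;
}
```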
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds the address of the first callback to the per-CPU rcutorture
output in order to allow lost wakeups to be more efficiently tracked down.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit updates the header comment of srcu_funnel_gp_start() to
document the fact that srcu_funnel_gp_start() does the work of
srcu_funnel_exp_start(), in some cases by invoking it directly.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit simply changes some copy-pasta call_rcu() instances to
the correct call_srcu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
During expedited grace-period initialization, a work item is scheduled
for each leaf rcu_node structure. However, that initialization code
is itself (normally) executing from a workqueue, so one of the leaf
rcu_node structures could just as well be handled by that pre-existing
workqueue, and with less overhead. This commit therefore uses a
shiny new rcu_is_leaf_node() macro to execute the last leaf rcu_node
structure's initialization directly from the pre-existing workqueue.
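The idea can be sketched in single-threaded userspace C as follows.
The leaf_work structure and both functions are hypothetical, and the
"workqueue" is simulated by a flag, so this shows only which leaves
are queued and which one is run directly.

```c
#include <stdbool.h>
#include <stddef.h>

struct leaf_work {
	bool queued;	/* Handed to the simulated workqueue. */
	bool done;	/* Initialization has run. */
};

/* Stand-in for the per-leaf initialization handler. */
static void init_leaf(struct leaf_work *w)
{
	w->done = true;
}

/* Initialize n leaves, returning the number of work items queued. */
size_t init_all_leaves(struct leaf_work *leaves, size_t n)
{
	size_t i, nqueued = 0;

	for (i = 0; i < n; i++) {
		if (i + 1 == n) {
			/* Last leaf: run from the current context. */
			init_leaf(&leaves[i]);
			continue;
		}
		leaves[i].queued = true;	/* Stand-in for queueing work. */
		nqueued++;
	}

	/* Stand-in for waiting for the queued work items to complete. */
	for (i = 0; i < n; i++)
		if (leaves[i].queued)
			init_leaf(&leaves[i]);
	return nqueued;
}
```

With four leaves, three work items are queued and the fourth leaf is
handled without any queueing overhead.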
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The x86/mtrr code does horrific things because hardware. It uses
stop_machine_from_inactive_cpu(), which does a wakeup (of the stopper
thread on another CPU), which uses RCU, all before the CPU is onlined.
RCU complains about this, because wakeups use RCU and RCU does
(rightfully) not consider offline CPUs for grace-periods.
Fix this by initializing RCU way early in the MTRR case.
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Add !SMP support, per 0day Test Robot report. ]
This commit adds end-of-test state printout to help check whether RCU
shut down nicely. Note that this printout only helps for flavors of
RCU that are not used much by the kernel. In particular, for normal
RCU having a grace period in progress is expected behavior.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Now that grace-period requests use funnel locking and now that they
set ->gp_flags to RCU_GP_FLAG_INIT even when the RCU grace-period
kthread has not yet started, rcu_gp_kthread() no longer needs to check
need_any_future_gp() at startup time. This commit therefore removes
this check.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Now that RCU no longer relies on failsafe checks, cpu_needs_another_gp()
can be greatly simplified. This simplification eliminates the last
call to rcu_future_needs_gp() and to rcu_segcblist_future_gp_needed(),
both of which can then be eliminated. And then, because
cpu_needs_another_gp() is called only from __rcu_pending(), it can be
inlined and eliminated.
This commit carries out the simplification, inlining, and elimination
called out above.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
All of the cpu_needs_another_gp() function's checks (except for
newly arrived callbacks) have been subsumed into the rcu_gp_cleanup()
function's scan of the rcu_node tree. This commit therefore drops the
call to cpu_needs_another_gp(). The check for newly arrived callbacks
is supplied by rcu_accelerate_cbs(). Any needed advancing (as in the
earlier rcu_advance_cbs() call) will be supplied when the corresponding
CPU becomes aware of the end of the now-completed grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
If rcu_start_this_gp() is invoked with a requested grace period more
than three in the future, then either the ->need_future_gp[] array
needs to be bigger or the caller needs to be repaired. This commit
therefore adds a WARN_ON_ONCE() checking for this condition.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The rcu_start_this_gp() function had a simple form of funnel locking that
used only the leaves and root of the rcu_node tree, which is fine for
systems with only a few hundred CPUs, but sub-optimal for systems having
thousands of CPUs. This commit therefore adds full-tree funnel locking.
This variant of funnel locking is unusual in the following ways:
1. The leaf-level rcu_node structure's ->lock is held throughout.
Other funnel-locking implementations drop the leaf-level lock
before progressing to the next level of the tree.
2. Funnel locking can be started at the root, which is convenient
for code that already holds the root rcu_node structure's ->lock.
Other funnel-locking implementations start at the leaves.
3. If an rcu_node structure other than the initial one believes
that a grace period is in progress, it is not necessary to
go further up the tree. This is because grace-period cleanup
scans the full tree, so that marking the need for a subsequent
grace period anywhere in the tree suffices -- but only if
a grace period is currently in progress.
4. It is possible that the RCU grace-period kthread has not yet
started, and this case must be handled appropriately.
However, the general approach of using a tree to control lock contention
is still in place.
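The tree walk described above might look roughly like the following
single-threaded sketch. The fnode structure and funnel_request_gp()
are hypothetical, locking is indicated only in comments, and a single
need_gp flag stands in for the ->need_future_gp[] array.

```c
#include <stdbool.h>
#include <stddef.h>

struct fnode {
	struct fnode *parent;	/* NULL at the root. */
	bool gp_in_progress;	/* This node's view of grace-period state. */
	bool need_gp;		/* A future grace period was requested here. */
};

/*
 * Record a grace-period request starting at "start", which may be a
 * leaf or the root.  Returns true if the caller needs to get a new
 * grace period started.  In the kernel, start's ->lock would be held
 * throughout, and each other node's ->lock while it is examined.
 */
bool funnel_request_gp(struct fnode *start)
{
	struct fnode *n;

	for (n = start; ; n = n->parent) {
		if (n->need_gp)
			return false;	/* An earlier request covers this one. */
		if (n != start && n->gp_in_progress)
			return false;	/* Cleanup will find the mark below. */
		n->need_gp = true;
		if (!n->parent)
			break;		/* Reached the root. */
	}
	return !n->gp_in_progress;	/* Idle at the root: start one. */
}
```

The early returns are what limit contention: most requests stop
before reaching the root.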
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The rcu_accelerate_cbs() function selects a grace-period target, which
it uses to have rcu_segcblist_accelerate() assign numbers to recently
queued callbacks. Then it invokes rcu_start_future_gp(), which selects
a grace-period target again, which is a bit pointless. This commit
therefore changes rcu_start_future_gp() to take the grace-period target as
a parameter, thus avoiding double selection. This commit also changes
the name of rcu_start_future_gp() to rcu_start_this_gp() to reflect
this change in functionality, and also makes a similar change to the
name of trace_rcu_future_gp().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The rcu_start_gp_advanced() function is invoked only from rcu_start_future_gp() and
much of its code is redundant when invoked from that context. This commit
therefore inlines rcu_start_gp_advanced() into rcu_start_future_gp(),
then removes rcu_start_gp_advanced().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Once the grace period has ended, any RCU_GP_FLAG_FQS requests are
irrelevant: The grace period has ended, so there is no longer any
point in forcing quiescent states in order to try to make it end sooner.
This commit therefore causes rcu_gp_cleanup() to clear any bits other
than RCU_GP_FLAG_INIT from ->gp_flags at the end of the grace period.
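In terms of bit operations, the clearing amounts to a single mask, as
in the sketch below. The flag values, the unsigned int type, and the
function name are assumptions made for illustration.

```c
/* Assumed flag values for this sketch. */
#define RCU_GP_FLAG_INIT 0x1U	/* Need grace-period initialization. */
#define RCU_GP_FLAG_FQS  0x2U	/* Need quiescent-state forcing. */

/*
 * Hypothetical helper: compute the ->gp_flags value to keep at the end
 * of a grace period.  Only a pending initialization request survives.
 */
unsigned int gp_flags_at_cleanup(unsigned int gp_flags)
{
	return gp_flags & RCU_GP_FLAG_INIT;
}
```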
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
It is true that currently only the low-order two bits are used, so
there should be no problem given modern machines and compilers, but
good hygiene and maintainability dictate use of an unsigned long
instead of an int. This commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The __rcu_process_callbacks() function currently checks to see if
the current CPU needs a grace period and also if there is any other
reason to kick off a new grace period. This is one of the fail-safe
checks that have been rendered unnecessary by the changes that increase
the accuracy of rcu_gp_cleanup()'s estimate as to whether another grace
period is required. Because this particular fail-safe involved acquiring
the root rcu_node structure's ->lock, which has seen excessive contention
in real life, this fail-safe needs to go.
However, one check must remain, namely the check for newly arrived
RCU callbacks that have not yet been associated with a grace period.
One might hope that the checks in __note_gp_changes(), which is invoked
indirectly from rcu_check_quiescent_state(), would suffice, but this
function won't be invoked at all if RCU is idle. It is therefore necessary
to replace the fail-safe checks with a simpler check for newly arrived
callbacks during an RCU idle period, which is exactly what this commit
does. This change removes the final call to rcu_start_gp(), so this
function is removed as well.
Note that lockless use of cpu_needs_another_gp() is racy, but that
these races are harmless in this case. If RCU really is idle, the
values will not change, so the return value from cpu_needs_another_gp()
will be correct. If RCU is not idle, the resulting redundant call to
rcu_accelerate_cbs() will be harmless, and might even have the benefit
of reducing grace-period latency a bit.
This commit also moves interrupt disabling into the "if" statement to
improve real-time response a bit.
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
When __call_rcu_core() notices excessive numbers of callbacks pending
on the current CPU, we know that at least one of them is not yet
classified, namely the one that was just now queued. Therefore, it
is not necessary to invoke rcu_start_gp() and thus not necessary to
acquire the root rcu_node structure's ->lock. This commit therefore
replaces the rcu_start_gp() with rcu_accelerate_cbs(), thus replacing
an acquisition of the root rcu_node structure's ->lock with that of
this CPU's leaf rcu_node structure.
This decreases contention on the root rcu_node structure's ->lock.
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The rcu_migrate_callbacks() function invokes rcu_advance_cbs()
twice, ignoring the return value. This is OK at present because of
failsafe code that does the wakeup when needed. However, this failsafe
code acquires the root rcu_node structure's lock frequently, while
rcu_migrate_callbacks() does so only once per CPU-offline operation.
This commit therefore makes rcu_migrate_callbacks() wake up the RCU
GP kthread when either call to rcu_advance_cbs() returns true, thus
removing the need for the failsafe code.
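The change in return-value handling can be sketched as follows. The
cb_list structure and advance_cbs() are hypothetical stand-ins whose
"needs wakeup" rule is invented for illustration; only the pattern of
combining the two return values reflects the text above.

```c
#include <stdbool.h>

struct cb_list {
	int pending;	/* Callbacks still waiting for a grace period. */
	int ready;	/* Callbacks ready to invoke. */
};

/*
 * Hypothetical stand-in for rcu_advance_cbs(): returns true if the
 * grace-period kthread should be awakened.
 */
static bool advance_cbs(struct cb_list *l)
{
	bool had_pending = l->pending > 0;

	l->ready += l->pending;
	l->pending = 0;
	return had_pending;
}

/* Returns true if the caller should wake the grace-period kthread. */
bool migrate_callbacks(struct cb_list *mine, struct cb_list *outgoing)
{
	bool wake_mine = advance_cbs(mine);
	bool wake_outgoing = advance_cbs(outgoing);

	/* Previously both results were ignored; now either one counts. */
	return wake_mine || wake_outgoing;
}
```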
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
There is no longer any need for ->need_future_gp[] to count the number of
requests for future grace periods, so this commit converts the additions
to assignments to "true" and reduces the size of each element to one byte.
While we are in the area, fix an obsolete comment.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Currently, the rcu_future_needs_gp() function checks only the current
element of the ->need_future_gp[] array, which might miss elements that
were offset from the expected element, for example, due to races with
the start or the end of a grace period. This commit therefore makes
rcu_future_needs_gp() use the need_any_future_gp() macro to check all
of the elements of this array.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The rcu_cbs_completed() function provides the value of ->completed
at which new callbacks can safely be invoked. This is recorded in
two-element ->need_future_gp[] arrays in the rcu_node structure, and
the elements of these arrays corresponding to the just-completed grace
period are zeroed at the end of that grace period. However, the
rcu_cbs_completed() function can return the current ->completed value
plus either one or two, so it is possible for the corresponding
->need_future_gp[] entry to be cleared just after it was set, thus
losing a request for a future grace period.
This commit avoids this race by expanding ->need_future_gp[] to four
elements.
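The index arithmetic behind this race can be illustrated with the
sketch below, which uses hypothetical names and shows only the slot
aliasing, not the exact interleaving of CPUs. With two slots, grace
periods c and c+2 share a slot, so clearing one loses the other. With
four slots they do not.

```c
#include <stdbool.h>

#define MAX_SLOTS 4

struct gp_requests {
	unsigned long nslots;	/* Power of two, at most MAX_SLOTS. */
	bool need[MAX_SLOTS];
};

/* Map a grace-period number to its slot in the request array. */
static unsigned long gp_slot(const struct gp_requests *r, unsigned long gp)
{
	return gp & (r->nslots - 1);
}

void record_request(struct gp_requests *r, unsigned long gp)
{
	r->need[gp_slot(r, gp)] = true;
}

/* End-of-grace-period cleanup: zero the slot for grace period gp. */
void clear_request(struct gp_requests *r, unsigned long gp)
{
	r->need[gp_slot(r, gp)] = false;
}

bool request_pending(const struct gp_requests *r, unsigned long gp)
{
	return r->need[gp_slot(r, gp)];
}
```

For example, with ->completed equal to 5, a request for grace period 7
survives the clearing of slot 5 only in the four-element case.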
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Currently, rcu_gp_cleanup() scans the rcu_node tree in order to reset
state to reflect the end of the grace period. It also checks to see
whether a new grace period is needed, but in a number of cases, rather
than directly cause the new grace period to be immediately started, it
instead leaves the grace-period-needed state where various fail-safes
can find it. This works fine, but results in higher contention on the
root rcu_node structure's ->lock, which is undesirable, and contention
on that lock has recently become noticeable.
This commit therefore makes rcu_gp_cleanup() immediately start a new
grace period if there is any need for one.
It is quite possible that it will later be necessary to throttle the
grace-period rate, but that can be dealt with when and if.
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The rcu_gp_kthread() function immediately sleeps waiting to be notified
of the need for a new grace period, which currently works because there
are a number of code sequences that will provide the needed wakeup later.
However, some of these code sequences need to acquire the root rcu_node
structure's ->lock, and contention on that lock has started manifesting.
This commit therefore makes rcu_gp_kthread() check for early-boot activity
when it starts up, omitting the initial sleep in that case.
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Accessors for the ->need_future_gp[] array are currently open-coded,
which makes them difficult to change. To improve maintainability, this
commit adds need_future_gp_mask() to compute the indexing mask from the
array size, need_future_gp_element() to access the element corresponding
to the specified grace-period number, and need_any_future_gp() to
determine if any future grace period is needed. This commit also applies
need_future_gp_element() to existing open-coded single-element accesses.
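A userspace approximation of the three accessors is sketched below.
The rnp_sketch structure and its four-element array are assumptions,
and need_any_future_gp() is written as an inline function here for
portability rather than as a macro.

```c
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical cut-down rcu_node; the array size must be a power of two. */
struct rnp_sketch {
	bool need_future_gp[4];
};

/* Indexing mask derived from the array size, so resizing is one edit. */
#define need_future_gp_mask() \
	(ARRAY_SIZE(((struct rnp_sketch *)NULL)->need_future_gp) - 1)

/* Element (an lvalue) for grace-period number c. */
#define need_future_gp_element(rnp, c) \
	((rnp)->need_future_gp[(c) & need_future_gp_mask()])

/* Is any future grace period needed? */
static inline bool need_any_future_gp(const struct rnp_sketch *rnp)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(rnp->need_future_gp); i++)
		if (rnp->need_future_gp[i])
			return true;
	return false;
}
```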
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The rcu_start_future_gp() function uses a sloppy check for a grace
period being in progress, which works today because there are a number
of code sequences that resolve the resulting races. However, some of
these race-resolution code sequences must acquire the root rcu_node
structure's ->lock, and contention on that lock has started manifesting.
This commit therefore makes the rcu_start_future_gp() check more precise,
eliminating the sloppy lockless check of the rcu_state structure's ->gpnum
and ->completed fields. The effect is that rcu_start_future_gp() will
sometimes unnecessarily attempt to start a new grace period, but this
overhead will be reduced later using funnel locking.
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
When rcu_cbs_completed() is invoked on a non-root rcu_node structure,
it unconditionally assumes that two grace periods must complete before
the callbacks at hand can be invoked. This is overly conservative because
if that non-root rcu_node structure believes that no grace period is in
progress, and if the corresponding rcu_state structure's ->gpnum field
has not yet been incremented, then these callbacks may safely be invoked
after only one grace period has completed.
This change is required to permit grace-period start requests to use
funnel locking, which is in turn required to reduce root rcu_node ->lock
contention, which has been observed by Nick Piggin. Furthermore, such
contention will likely be increased by the merging of RCU-bh, RCU-preempt,
and RCU-sched, so it makes sense to take steps to decrease it.
This commit therefore improves the accuracy of rcu_cbs_completed() when
invoked on a non-root rcu_node structure as described above.
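The decision described above can be sketched as follows. The gp_state
structure and the function signature are hypothetical, and the memory
ordering that a real implementation would need is omitted.

```c
#include <stdbool.h>

/* Hypothetical grace-period counters, for a node or for global state. */
struct gp_state {
	unsigned long gpnum;		/* Most recently started grace period. */
	unsigned long completed;	/* Most recently completed grace period. */
};

/*
 * Return the ->completed value at which newly queued callbacks may
 * safely be invoked, as seen from "node".
 */
unsigned long cbs_completed(const struct gp_state *node,
			    const struct gp_state *global, bool is_root)
{
	bool node_idle = node->gpnum == node->completed;

	/* The root's view is authoritative: idle means one GP suffices. */
	if (is_root && node_idle)
		return node->completed + 1;

	/*
	 * A non-root node that believes RCU is idle can be trusted only
	 * if the global state has not yet started a new grace period.
	 */
	if (node_idle && node->gpnum == global->gpnum)
		return node->completed + 1;

	/* Otherwise wait out a possible partial GP plus one full GP. */
	return node->completed + 2;
}
```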
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
This commit adds rcu_first_leaf_node() that returns a pointer to
the first leaf rcu_node structure in the specified RCU flavor and an
rcu_is_leaf_node() that returns true iff the specified rcu_node structure
is a leaf. This commit also uses these macros where appropriate.
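A possible shape for these two macros is sketched below on a tiny
two-level tree. The tree layout, the constants, and the helper
functions are assumptions for illustration, and the macro names drop
the rcu_ prefix to make clear that they are not the kernel's
definitions.

```c
#define NUM_LVLS  2	/* Assumed tree depth for this sketch. */
#define NUM_NODES 3	/* One root followed by two leaves. */

struct tnode {
	int level;	/* 0 for the root, NUM_LVLS - 1 for leaves. */
};

struct tree {
	struct tnode node[NUM_NODES];	/* Breadth-first, so leaves are last. */
	struct tnode *level[NUM_LVLS];	/* First node at each level. */
};

/* Pointer to the first leaf node of the tree. */
#define first_leaf_node(t) ((t)->level[NUM_LVLS - 1])

/* True iff the specified node is a leaf. */
#define is_leaf_node(n) ((n)->level == NUM_LVLS - 1)

void tree_init(struct tree *t)
{
	t->node[0].level = 0;
	t->node[1].level = 1;
	t->node[2].level = 1;
	t->level[0] = &t->node[0];
	t->level[1] = &t->node[1];
}

/* Example use: visit only the leaves. */
int count_leaves(struct tree *t)
{
	struct tnode *n;
	int count = 0;

	for (n = first_leaf_node(t); n < &t->node[NUM_NODES]; n++)
		if (is_leaf_node(n))
			count++;
	return count;
}
```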
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The current cleanup_srcu_struct() flushes work, which prevents it
from being invoked from some workqueue contexts, as well as from
atomic (non-blocking) contexts. This patch therefore introduces a
cleanup_srcu_struct_quiesced(), which can be invoked only after all
activity on the specified srcu_struct has completed. This restriction
allows cleanup_srcu_struct_quiesced() to be invoked from workqueue
contexts as well as from atomic contexts.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nitzan Carmi <nitzanc@mellanox.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Because rcu_eqs_special_set() is declared only in internal header
kernel/rcu/tree.h and stubbed in include/linux/rcutiny.h, it is
inaccessible outside of the RCU implementation. This patch therefore
moves the rcu_eqs_special_set() declaration to include/linux/rcutree.h,
which allows it to be used in non-RCU kernel code.
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The header comment for rcu_bind_gp_kthread() refers to sysidle, which
is no longer with us. However, it is still important to bind RCU's
grace-period kthreads to the housekeeping CPU(s), so rather than remove
rcu_bind_gp_kthread(), this commit updates the comment.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The __rcu_read_lock() and __rcu_read_unlock() functions were moved
to kernel/rcu/update.c in order to implement tiny preemptible RCU.
However, tiny preemptible RCU was removed from the kernel a long time
ago, so this commit belatedly moves them back into the only remaining
preemptible-RCU code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Commit e31d28b6ab ("trace: Eliminate cond_resched_rcu_qs() in favor
of cond_resched()") substituted cond_resched() for the earlier call
to cond_resched_rcu_qs(). However, the new-age cond_resched() does
not do anything to help RCU-tasks grace periods because (1) RCU-tasks
is only enabled when CONFIG_PREEMPT=y and (2) cond_resched() is a
complete no-op when preemption is enabled. This situation results
in hangs when running the trace benchmarks.
A number of potential fixes were discussed on LKML
(https://lkml.kernel.org/r/20180224151240.0d63a059@vmware.local.home),
including making cond_resched() not be a no-op; making cond_resched()
not be a no-op, but only when running tracing benchmarks; reverting
the aforementioned commit (which works because cond_resched_rcu_qs()
does provide an RCU-tasks quiescent state); and adding a call to the
scheduler/RCU rcu_note_voluntary_context_switch() function. All were
deemed unsatisfactory, either due to added cond_resched() overhead or
due to magic functions inviting cargo culting.
This commit renames cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs(),
which provides a clear hint as to what this function is doing and
why and where it should be used, and then replaces the call to
cond_resched() with cond_resched_tasks_rcu_qs() in the trace benchmark's
benchmark_event_kthread() function.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Commit ae91aa0adb ("rcu: Remove debugfs tracing") removed the
RCU debugfs tracing code, but did not remove the no-longer used
->exp_workdone{0,1,2,3} fields in the srcu_data structure. This commit
therefore removes these fields along with the code that uselessly
updates them.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
If an excessive number of callbacks have been queued, but the NOCB
leader kthread's wakeup must be deferred, then we should wake up the
leader unconditionally once it is safe to do so.
This was handled correctly in commit fbce7497ee ("rcu: Parallelize and
economize NOCB kthread wakeups"), but then commit 8be6e1b15c ("rcu:
Use timer as backstop for NOCB deferred wakeups") passed RCU_NOCB_WAKE
instead of the correct RCU_NOCB_WAKE_FORCE to wake_nocb_leader_defer().
As an interesting aside, RCU_NOCB_WAKE_FORCE is never passed to anything,
which should have been taken as a hint. ;-)
This commit therefore passes RCU_NOCB_WAKE_FORCE instead of RCU_NOCB_WAKE
to wake_nocb_leader_defer() when a callback is queued onto a NOCB CPU
that already has an excessive number of callbacks pending.
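The selection of the deferred-wakeup level might be sketched as
follows. The enum, the function, and its parameters are hypothetical
simplifications; in particular, the real enqueue path tracks more
state than a queue length and a single threshold.

```c
#include <stdbool.h>

enum nocb_wake {
	NOCB_WAKE_NOT,		/* No deferred wakeup needed. */
	NOCB_WAKE,		/* Ordinary deferred wakeup. */
	NOCB_WAKE_FORCE,	/* Wake unconditionally once it is safe. */
};

/*
 * Choose the level to record when the wakeup must be deferred.
 * "was_empty" means the queue was empty before this callback arrived,
 * and "qhimark" is the threshold for an excessive number of callbacks.
 */
enum nocb_wake deferred_wake_level(long qlen, long qhimark, bool was_empty)
{
	if (was_empty)
		return NOCB_WAKE;
	if (qlen > qhimark)
		return NOCB_WAKE_FORCE;	/* The case fixed by this commit. */
	return NOCB_WAKE_NOT;
}
```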
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Commit 44c65ff2e3 ("rcu: Eliminate NOCBs CPU-state Kconfig options")
made allocation of rcu_nocb_mask depend only on the rcu_nocbs=,
nohz_full=, or isolcpus= kernel boot parameters. However, it failed
to change the initial value of rcu_init_nohz()'s local variable
need_rcu_nocb_mask to false, which can result in useless allocation
of an all-zero rcu_nocb_mask. This commit therefore fixes this bug by
changing the initial value of need_rcu_nocb_mask to false.
While we are in the area, also correct the error message that is printed
when someone specifies that can-never-exist CPUs should be NOCBs CPUs.
Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Byungchul Park <byungchul.park@lge.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The rcu_preempt_do_callbacks() function was introduced in commit
09223371dea ("rcu: Use softirq to address performance regression"), where it
was necessary to handle kernel builds both containing and not containing
RCU-preempt. Since then, various changes (most notably f8b7fc6b51
("rcu: use softirq instead of kthreads except when RCU_BOOST=y")) have
resulted in this function being invoked only from rcu_kthread_do_work(),
which is present only in kernels containing RCU-preempt, which in turn
means that the rcu_preempt_do_callbacks() function is no longer needed.
This commit therefore inlines rcu_preempt_do_callbacks() into its
sole remaining caller and also removes the rcu_state_p and rcu_data_p
indirection for added clarity.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[ paulmck: Remove the rcu_state_p and rcu_data_p indirection. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Currently some callsites of sync_rcu_preempt_exp_done() are not called
with the corresponding rcu_node's ->lock held, which could introduce
bugs as per Paul:
o CPU 0 in sync_rcu_preempt_exp_done() reads ->exp_tasks and
sees that it is NULL.
o CPU 1 blocks within an RCU read-side critical section, so
it enqueues the task and points ->exp_tasks at it and
clears CPU 1's bit in ->expmask.
o All other CPUs clear their bits in ->expmask.
o CPU 0 reads ->expmask, sees that it is zero, so incorrectly
concludes that all quiescent states have completed, despite
the fact that ->exp_tasks is non-NULL.
To fix this, sync_rcu_preempt_exp_unlocked() is introduced to replace
lockless callsites of sync_rcu_preempt_exp_done().
Further, a lockdep annotation is added into sync_rcu_preempt_exp_done()
to prevent mis-use in the future.
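The locked/unlocked split can be sketched in single-threaded userspace
C as shown below. The exp_node structure and all function names are
hypothetical, a boolean stands in for the rcu_node structure's ->lock,
and assert() stands in for the lockdep annotation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct exp_node {
	bool lock_held;		/* Single-threaded stand-in for ->lock. */
	void *exp_tasks;	/* Non-NULL if a blocked reader is queued. */
	unsigned long expmask;	/* CPUs still owing a quiescent state. */
};

static void node_lock(struct exp_node *n)
{
	assert(!n->lock_held);
	n->lock_held = true;
}

static void node_unlock(struct exp_node *n)
{
	assert(n->lock_held);
	n->lock_held = false;
}

/*
 * Caller must hold the lock, so that ->exp_tasks and ->expmask are
 * read as a consistent pair.  The assert() plays the role of lockdep.
 */
bool exp_done(struct exp_node *n)
{
	assert(n->lock_held);
	return n->exp_tasks == NULL && n->expmask == 0;
}

/* Wrapper for callers that do not already hold the lock. */
bool exp_done_unlocked(struct exp_node *n)
{
	bool ret;

	node_lock(n);
	ret = exp_done(n);
	node_unlock(n);
	return ret;
}
```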
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Since commit d9a3da0699 ("rcu: Add expedited grace-period support
for preemptible RCU"), there are comments for some functions in
rcu_report_exp_rnp()'s call-chain saying that exp_mutex or its
predecessors need to be held.
However, exp_mutex and its predecessors were used only to synchronize
between GPs, and it is clear that all variables visited by those functions
are under the protection of rcu_node's ->lock. Moreover, those functions
are currently called without exp_mutex held, and that does not seem to
introduce any trouble.
So this patch fixes this problem by updating the comments to match the
current code.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Fixes: d9a3da0699 ("rcu: Add expedited grace-period support for preemptible RCU")
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The latency of RCU expedited grace periods grows with increasing numbers
of CPUs, eventually failing to be all that expedited. Much of the growth
in latency is in the initialization phase, so this commit uses workqueues
to carry out this initialization concurrently on a rcu_node-by-rcu_node
basis.
This change makes use of a new rcu_par_gp_wq because flushing a work
item from another work item running from the same workqueue can result
in deadlock.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
RCU's expedited grace periods can participate in out-of-memory deadlocks
due to all available system_wq kthreads being blocked and there not being
memory available to create more. This commit prevents such deadlocks
by allocating an RCU-specific workqueue_struct at early boot time, and
providing it with a rescuer to ensure forward progress. This uses the
shiny new init_rescuer() function provided by Tejun (but indirectly).
This commit also causes SRCU to use this new RCU-specific
workqueue_struct. Note that SRCU's use of workqueues never blocks them
waiting for readers, so this should be safe from a forward-progress
viewpoint. Note that this moves SRCU from system_power_efficient_wq
to a normal workqueue. In the unlikely event that this results in
measurable degradation, a separate power-efficient workqueue will be
created for SRCU.
Reported-by: Prateek Sood <prsood@codeaurora.org>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
The default values for nreaders and nwriters are apparently not all that
user-friendly, resulting in people doing scalability tests in which every
run was at large scale. This commit therefore makes both the nreaders and
nwriters module parameters default to the number of CPUs, and adds a comment to
rcuperf.c stating that the number of CPUs should be specified using the
nr_cpus kernel boot parameter. This commit also eliminates the redundant
rcuperf scripting specification of default values for these parameters.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>