1392 lines
55 KiB
HTML
1392 lines
55 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
|
"http://www.w3.org/TR/html4/loose.dtd">
|
|
<html>
|
|
<head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
|
|
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
|
|
|
|
<p>December 18, 2016</p>
|
|
<p>This article was contributed by Paul E. McKenney</p>
|
|
|
|
<h3>Introduction</h3>
|
|
|
|
This document describes RCU's major data structures and their relationship
|
|
to each other.
|
|
|
|
<ol>
|
|
<li> <a href="#Data-Structure Relationships">
|
|
Data-Structure Relationships</a>
|
|
<li> <a href="#The rcu_state Structure">
|
|
The <tt>rcu_state</tt> Structure</a>
|
|
<li> <a href="#The rcu_node Structure">
|
|
The <tt>rcu_node</tt> Structure</a>
|
|
<li> <a href="#The rcu_segcblist Structure">
|
|
The <tt>rcu_segcblist</tt> Structure</a>
|
|
<li> <a href="#The rcu_data Structure">
|
|
The <tt>rcu_data</tt> Structure</a>
|
|
<li> <a href="#The rcu_head Structure">
|
|
The <tt>rcu_head</tt> Structure</a>
|
|
<li> <a href="#RCU-Specific Fields in the task_struct Structure">
|
|
RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
|
|
<li> <a href="#Accessor Functions">
|
|
Accessor Functions</a>
|
|
</ol>
|
|
|
|
<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
|
|
|
|
<p>RCU is for all intents and purposes a large state machine, and its
|
|
data structures maintain the state in such a way as to allow RCU readers
|
|
to execute extremely quickly, while also processing the RCU grace periods
|
|
requested by updaters in an efficient and extremely scalable fashion.
|
|
The efficiency and scalability of RCU updaters is provided primarily
|
|
by a combining tree, as shown below:
|
|
|
|
</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
|
|
|
|
</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
|
|
containing a tree of <tt>rcu_node</tt> structures.
|
|
Each leaf node of the <tt>rcu_node</tt> tree has up to 16
|
|
<tt>rcu_data</tt> structures associated with it, so that there
|
|
are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures,
|
|
one for each possible CPU.
|
|
This structure is adjusted at boot time, if needed, to handle the
|
|
common case where <tt>nr_cpu_ids</tt> is much less than
|
|
<tt>NR_CPUs</tt>.
|
|
For example, a number of Linux distributions set <tt>NR_CPUs=4096</tt>,
|
|
which results in a three-level <tt>rcu_node</tt> tree.
|
|
If the actual hardware has only 16 CPUs, RCU will adjust itself
|
|
at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
|
|
|
|
</p><p>The purpose of this combining tree is to allow per-CPU events
|
|
such as quiescent states, dyntick-idle transitions,
|
|
and CPU hotplug operations to be processed efficiently
|
|
and scalably.
|
|
Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
|
|
and other events are recorded by the leaf-level <tt>rcu_node</tt>
|
|
structures.
|
|
All of these events are combined at each level of the tree until finally
|
|
grace periods are completed at the tree's root <tt>rcu_node</tt>
|
|
structure.
|
|
A grace period can be completed at the root once every CPU
|
|
(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
|
|
has passed through a quiescent state.
|
|
Once a grace period has completed, record of that fact is propagated
|
|
back down the tree.
|
|
|
|
</p><p>As can be seen from the diagram, on a 64-bit system
|
|
a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
|
|
of 64 at the root and a fanout of 16 at the leaves.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Why isn't the fanout at the leaves also 64?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Because there are more types of events that affect the leaf-level
|
|
<tt>rcu_node</tt> structures than further up the tree.
|
|
Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
|
|
64, the contention on these structures' <tt>->structures</tt>
|
|
becomes excessive.
|
|
Experimentation on a wide variety of systems has shown that a fanout
|
|
of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
|
|
</font>
|
|
|
|
<p><font color="ffffff">Of course, further experience with
|
|
systems having hundreds or thousands of CPUs may demonstrate
|
|
that the fanout for the non-leaf <tt>rcu_node</tt> structures
|
|
must also be reduced.
|
|
Such reduction can be easily carried out when and if it proves
|
|
necessary.
|
|
In the meantime, if you are using such a system and running into
|
|
contention problems on the non-leaf <tt>rcu_node</tt> structures,
|
|
you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
|
|
parameter to reduce the non-leaf fanout as needed.
|
|
</font>
|
|
|
|
<p><font color="ffffff">Kernels built for systems with
|
|
strong NUMA characteristics might also need to adjust
|
|
<tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
|
|
<tt>rcu_node</tt> structures align with hardware boundaries.
|
|
However, there has thus far been no need for this.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
|
|
a 32-bit system), then RCU will automatically add more levels to the
|
|
tree.
|
|
For example, if you are crazy enough to build a 64-bit system with 65,536
|
|
CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
|
|
|
|
</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
|
|
|
|
</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
|
|
accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
|
|
32-bit systems.
|
|
On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and
|
|
<tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result
|
|
in a 16-CPU test using a 4-level tree.
|
|
This can be useful for testing large-system capabilities on small test
|
|
machines.
|
|
|
|
</p><p>This multi-level combining tree allows us to get most of the
|
|
performance and scalability
|
|
benefits of partitioning, even though RCU grace-period detection is
|
|
inherently a global operation.
|
|
The trick here is that only the last CPU to report a quiescent state
|
|
into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
|
|
structure at the next level up the tree.
|
|
This means that at the leaf-level <tt>rcu_node</tt> structure, only
|
|
one access out of sixteen will progress up the tree.
|
|
For the internal <tt>rcu_node</tt> structures, the situation is even
|
|
more extreme: Only one access out of sixty-four will progress up
|
|
the tree.
|
|
Because the vast majority of the CPUs do not progress up the tree,
|
|
the lock contention remains roughly constant up the tree.
|
|
No matter how many CPUs there are in the system, at most 64 quiescent-state
|
|
reports per grace period will progress all the way to the root
|
|
<tt>rcu_node</tt> structure, thus ensuring that the lock contention
|
|
on that root <tt>rcu_node</tt> structure remains acceptably low.
|
|
|
|
</p><p>In effect, the combining tree acts like a big shock absorber,
|
|
keeping lock contention under control at all tree levels regardless
|
|
of the level of loading on the system.
|
|
|
|
</p><p>RCU updaters wait for normal grace periods by registering
|
|
RCU callbacks, either directly via <tt>call_rcu()</tt>
|
|
or indirectly via <tt>synchronize_rcu()</tt> and friends.
|
|
RCU callbacks are represented by <tt>rcu_head</tt> structures,
|
|
which are queued on <tt>rcu_data</tt> structures while they are
|
|
waiting for a grace period to elapse, as shown in the following figure:
|
|
|
|
</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
|
|
|
|
</p><p>This figure shows how <tt>TREE_RCU</tt>'s and
|
|
<tt>PREEMPT_RCU</tt>'s major data structures are related.
|
|
Lesser data structures will be introduced with the algorithms that
|
|
make use of them.
|
|
|
|
</p><p>Note that each of the data structures in the above figure has
|
|
its own synchronization:
|
|
|
|
<p><ol>
|
|
<li> Each <tt>rcu_state</tt> structures has a lock and a mutex,
|
|
and some fields are protected by the corresponding root
|
|
<tt>rcu_node</tt> structure's lock.
|
|
<li> Each <tt>rcu_node</tt> structure has a spinlock.
|
|
<li> The fields in <tt>rcu_data</tt> are private to the corresponding
|
|
CPU, although a few can be read and written by other CPUs.
|
|
</ol>
|
|
|
|
<p>It is important to note that different data structures can have
|
|
very different ideas about the state of RCU at any given time.
|
|
For but one example, awareness of the start or end of a given RCU
|
|
grace period propagates slowly through the data structures.
|
|
This slow propagation is absolutely necessary for RCU to have good
|
|
read-side performance.
|
|
If this balkanized implementation seems foreign to you, one useful
|
|
trick is to consider each instance of these data structures to be
|
|
a different person, each having the usual slightly different
|
|
view of reality.
|
|
|
|
</p><p>The general role of each of these data structures is as
|
|
follows:
|
|
|
|
</p><ol>
|
|
<li> <tt>rcu_state</tt>:
|
|
This structure forms the interconnection between the
|
|
<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
|
|
tracks grace periods, serves as short-term repository
|
|
for callbacks orphaned by CPU-hotplug events,
|
|
maintains <tt>rcu_barrier()</tt> state,
|
|
tracks expedited grace-period state,
|
|
and maintains state used to force quiescent states when
|
|
grace periods extend too long,
|
|
<li> <tt>rcu_node</tt>: This structure forms the combining
|
|
tree that propagates quiescent-state
|
|
information from the leaves to the root, and also propagates
|
|
grace-period information from the root to the leaves.
|
|
It provides local copies of the grace-period state in order
|
|
to allow this information to be accessed in a synchronized
|
|
manner without suffering the scalability limitations that
|
|
would otherwise be imposed by global locking.
|
|
In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
|
|
of tasks that have blocked while in their current
|
|
RCU read-side critical section.
|
|
In <tt>CONFIG_PREEMPT_RCU</tt> with
|
|
<tt>CONFIG_RCU_BOOST</tt>, it manages the
|
|
per-<tt>rcu_node</tt> priority-boosting
|
|
kernel threads (kthreads) and state.
|
|
Finally, it records CPU-hotplug state in order to determine
|
|
which CPUs should be ignored during a given grace period.
|
|
<li> <tt>rcu_data</tt>: This per-CPU structure is the
|
|
focus of quiescent-state detection and RCU callback queuing.
|
|
It also tracks its relationship to the corresponding leaf
|
|
<tt>rcu_node</tt> structure to allow more-efficient
|
|
propagation of quiescent states up the <tt>rcu_node</tt>
|
|
combining tree.
|
|
Like the <tt>rcu_node</tt> structure, it provides a local
|
|
copy of the grace-period information to allow for-free
|
|
synchronized
|
|
access to this information from the corresponding CPU.
|
|
Finally, this structure records past dyntick-idle state
|
|
for the corresponding CPU and also tracks statistics.
|
|
<li> <tt>rcu_head</tt>:
|
|
This structure represents RCU callbacks, and is the
|
|
only structure allocated and managed by RCU users.
|
|
The <tt>rcu_head</tt> structure is normally embedded
|
|
within the RCU-protected data structure.
|
|
</ol>
|
|
|
|
<p>If all you wanted from this article was a general notion of how
|
|
RCU's data structures are related, you are done.
|
|
Otherwise, each of the following sections give more details on
|
|
the <tt>rcu_state</tt>, <tt>rcu_node</tt> and <tt>rcu_data</tt> data
|
|
structures.
|
|
|
|
<h3><a name="The rcu_state Structure">
|
|
The <tt>rcu_state</tt> Structure</a></h3>
|
|
|
|
<p>The <tt>rcu_state</tt> structure is the base structure that
|
|
represents the state of RCU in the system.
|
|
This structure forms the interconnection between the
|
|
<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
|
|
tracks grace periods, contains the lock used to
|
|
synchronize with CPU-hotplug events,
|
|
and maintains state used to force quiescent states when
|
|
grace periods extend too long,
|
|
|
|
</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
|
|
singly and in groups, in the following sections.
|
|
The more specialized fields are covered in the discussion of their
|
|
use.
|
|
|
|
<h5>Relationship to rcu_node and rcu_data Structures</h5>
|
|
|
|
This portion of the <tt>rcu_state</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 struct rcu_node node[NUM_RCU_NODES];
|
|
2 struct rcu_node *level[NUM_RCU_LVLS + 1];
|
|
3 struct rcu_data __percpu *rda;
|
|
</pre>
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Wait a minute!
|
|
You said that the <tt>rcu_node</tt> structures formed a tree,
|
|
but they are declared as a flat array!
|
|
What gives?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
The tree is laid out in the array.
|
|
The first node In the array is the head, the next set of nodes in the
|
|
array are children of the head node, and so on until the last set of
|
|
nodes in the array are the leaves.
|
|
</font>
|
|
|
|
<p><font color="ffffff">See the following diagrams to see how
|
|
this works.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>The <tt>rcu_node</tt> tree is embedded into the
|
|
<tt>->node[]</tt> array as shown in the following figure:
|
|
|
|
</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
|
|
|
|
</p><p>One interesting consequence of this mapping is that a
|
|
breadth-first traversal of the tree is implemented as a simple
|
|
linear scan of the array, which is in fact what the
|
|
<tt>rcu_for_each_node_breadth_first()</tt> macro does.
|
|
This macro is used at the beginning and ends of grace periods.
|
|
|
|
</p><p>Each entry of the <tt>->level</tt> array references
|
|
the first <tt>rcu_node</tt> structure on the corresponding level
|
|
of the tree, for example, as shown below:
|
|
|
|
</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
|
|
|
|
</p><p>The zero<sup>th</sup> element of the array references the root
|
|
<tt>rcu_node</tt> structure, the first element references the
|
|
first child of the root <tt>rcu_node</tt>, and finally the second
|
|
element references the first leaf <tt>rcu_node</tt> structure.
|
|
|
|
</p><p>For whatever it is worth, if you draw the tree to be tree-shaped
|
|
rather than array-shaped, it is easy to draw a planar representation:
|
|
|
|
</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
|
|
|
|
</p><p>Finally, the <tt>->rda</tt> field references a per-CPU
|
|
pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
|
|
|
|
</p><p>All of these fields are constant once initialization is complete,
|
|
and therefore need no protection.
|
|
|
|
<h5>Grace-Period Tracking</h5>
|
|
|
|
<p>This portion of the <tt>rcu_state</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 unsigned long gp_seq;
|
|
</pre>
|
|
|
|
<p>RCU grace periods are numbered, and
|
|
the <tt>->gp_seq</tt> field contains the current grace-period
|
|
sequence number.
|
|
The bottom two bits are the state of the current grace period,
|
|
which can be zero for not yet started or one for in progress.
|
|
In other words, if the bottom two bits of <tt>->gp_seq</tt> are
|
|
zero, then RCU is idle.
|
|
Any other value in the bottom two bits indicates that something is broken.
|
|
This field is protected by the root <tt>rcu_node</tt> structure's
|
|
<tt>->lock</tt> field.
|
|
|
|
</p><p>There are <tt>->gp_seq</tt> fields
|
|
in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
|
|
as well.
|
|
The fields in the <tt>rcu_state</tt> structure represent the
|
|
most current value, and those of the other structures are compared
|
|
in order to detect the beginnings and ends of grace periods in a distributed
|
|
fashion.
|
|
The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
|
|
(down the tree from the root to the leaves) to <tt>rcu_data</tt>.
|
|
|
|
<h5>Miscellaneous</h5>
|
|
|
|
<p>This portion of the <tt>rcu_state</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 unsigned long gp_max;
|
|
2 char abbr;
|
|
3 char *name;
|
|
</pre>
|
|
|
|
<p>The <tt>->gp_max</tt> field tracks the duration of the longest
|
|
grace period in jiffies.
|
|
It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>.
|
|
|
|
<p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish
|
|
between preemptible RCU (“rcu_preempt” and “p”)
|
|
and non-preemptible RCU (“rcu_sched” and “s”).
|
|
These fields are used for diagnostic and tracing purposes.
|
|
|
|
<h3><a name="The rcu_node Structure">
|
|
The <tt>rcu_node</tt> Structure</a></h3>
|
|
|
|
<p>The <tt>rcu_node</tt> structures form the combining
|
|
tree that propagates quiescent-state
|
|
information from the leaves to the root and also that propagates
|
|
grace-period information from the root down to the leaves.
|
|
They provides local copies of the grace-period state in order
|
|
to allow this information to be accessed in a synchronized
|
|
manner without suffering the scalability limitations that
|
|
would otherwise be imposed by global locking.
|
|
In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
|
|
of tasks that have blocked while in their current
|
|
RCU read-side critical section.
|
|
In <tt>CONFIG_PREEMPT_RCU</tt> with
|
|
<tt>CONFIG_RCU_BOOST</tt>, they manage the
|
|
per-<tt>rcu_node</tt> priority-boosting
|
|
kernel threads (kthreads) and state.
|
|
Finally, they record CPU-hotplug state in order to determine
|
|
which CPUs should be ignored during a given grace period.
|
|
|
|
</p><p>The <tt>rcu_node</tt> structure's fields are discussed,
|
|
singly and in groups, in the following sections.
|
|
|
|
<h5>Connection to Combining Tree</h5>
|
|
|
|
<p>This portion of the <tt>rcu_node</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 struct rcu_node *parent;
|
|
2 u8 level;
|
|
3 u8 grpnum;
|
|
4 unsigned long grpmask;
|
|
5 int grplo;
|
|
6 int grphi;
|
|
</pre>
|
|
|
|
<p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt>
|
|
one level up in the tree, and is <tt>NULL</tt> for the root
|
|
<tt>rcu_node</tt>.
|
|
The RCU implementation makes heavy use of this field to push quiescent
|
|
states up the tree.
|
|
The <tt>->level</tt> field gives the level in the tree, with
|
|
the root being at level zero, its children at level one, and so on.
|
|
The <tt>->grpnum</tt> field gives this node's position within
|
|
the children of its parent, so this number can range between 0 and 31
|
|
on 32-bit systems and between 0 and 63 on 64-bit systems.
|
|
The <tt>->level</tt> and <tt>->grpnum</tt> fields are
|
|
used only during initialization and for tracing.
|
|
The <tt>->grpmask</tt> field is the bitmask counterpart of
|
|
<tt>->grpnum</tt>, and therefore always has exactly one bit set.
|
|
This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
|
|
structure in its parent's bitmasks, which are described later.
|
|
Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields
|
|
contain the lowest and highest numbered CPU served by this
|
|
<tt>rcu_node</tt> structure, respectively.
|
|
|
|
</p><p>All of these fields are constant, and thus do not require any
|
|
synchronization.
|
|
|
|
<h5>Synchronization</h5>
|
|
|
|
<p>This field of the <tt>rcu_node</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 raw_spinlock_t lock;
|
|
</pre>
|
|
|
|
<p>This field is used to protect the remaining fields in this structure,
|
|
unless otherwise stated.
|
|
That said, all of the fields in this structure can be accessed without
|
|
locking for tracing purposes.
|
|
Yes, this can result in confusing traces, but better some tracing confusion
|
|
than to be heisenbugged out of existence.
|
|
|
|
<h5>Grace-Period Tracking</h5>
|
|
|
|
<p>This portion of the <tt>rcu_node</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 unsigned long gp_seq;
|
|
2 unsigned long gp_seq_needed;
|
|
</pre>
|
|
|
|
<p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are
|
|
the counterparts of the field of the same name in the <tt>rcu_state</tt>
|
|
structure.
|
|
They each may lag up to one step behind their <tt>rcu_state</tt>
|
|
counterpart.
|
|
If the bottom two bits of a given <tt>rcu_node</tt> structure's
|
|
<tt>->gp_seq</tt> field is zero, then this <tt>rcu_node</tt>
|
|
structure believes that RCU is idle.
|
|
</p><p>The <tt>>gp_seq</tt> field of each <tt>rcu_node</tt>
|
|
structure is updated at the beginning and the end
|
|
of each grace period.
|
|
|
|
<p>The <tt>->gp_seq_needed</tt> fields record the
|
|
furthest-in-the-future grace period request seen by the corresponding
|
|
<tt>rcu_node</tt> structure. The request is considered fulfilled when
|
|
the value of the <tt>->gp_seq</tt> field equals or exceeds that of
|
|
the <tt>->gp_seq_needed</tt> field.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Suppose that this <tt>rcu_node</tt> structure doesn't see
|
|
a request for a very long time.
|
|
Won't wrapping of the <tt>->gp_seq</tt> field cause
|
|
problems?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
No, because if the <tt>->gp_seq_needed</tt> field lags behind the
|
|
<tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field
|
|
will be updated at the end of the grace period.
|
|
Modulo-arithmetic comparisons therefore will always get the
|
|
correct answer, even with wrapping.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<h5>Quiescent-State Tracking</h5>
|
|
|
|
<p>These fields manage the propagation of quiescent states up the
|
|
combining tree.
|
|
|
|
</p><p>This portion of the <tt>rcu_node</tt> structure has fields
|
|
as follows:
|
|
|
|
<pre>
|
|
1 unsigned long qsmask;
|
|
2 unsigned long expmask;
|
|
3 unsigned long qsmaskinit;
|
|
4 unsigned long expmaskinit;
|
|
</pre>
|
|
|
|
<p>The <tt>->qsmask</tt> field tracks which of this
|
|
<tt>rcu_node</tt> structure's children still need to report
|
|
quiescent states for the current normal grace period.
|
|
Such children will have a value of 1 in their corresponding bit.
|
|
Note that the leaf <tt>rcu_node</tt> structures should be
|
|
thought of as having <tt>rcu_data</tt> structures as their
|
|
children.
|
|
Similarly, the <tt>->expmask</tt> field tracks which
|
|
of this <tt>rcu_node</tt> structure's children still need to report
|
|
quiescent states for the current expedited grace period.
|
|
An expedited grace period has
|
|
the same conceptual properties as a normal grace period, but the
|
|
expedited implementation accepts extreme CPU overhead to obtain
|
|
much lower grace-period latency, for example, consuming a few
|
|
tens of microseconds worth of CPU time to reduce grace-period
|
|
duration from milliseconds to tens of microseconds.
|
|
The <tt>->qsmaskinit</tt> field tracks which of this
|
|
<tt>rcu_node</tt> structure's children cover for at least
|
|
one online CPU.
|
|
This mask is used to initialize <tt>->qsmask</tt>,
|
|
and <tt>->expmaskinit</tt> is used to initialize
|
|
<tt>->expmask</tt> and the beginning of the
|
|
normal and expedited grace periods, respectively.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Why are these bitmasks protected by locking?
|
|
Come on, haven't you heard of atomic instructions???
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Lockless grace-period computation! Such a tantalizing possibility!
|
|
</font>
|
|
|
|
<p><font color="ffffff">But consider the following sequence of events:
|
|
</font>
|
|
|
|
<ol>
|
|
<li> <font color="ffffff">CPU 0 has been in dyntick-idle
|
|
mode for quite some time.
|
|
When it wakes up, it notices that the current RCU
|
|
grace period needs it to report in, so it sets a
|
|
flag where the scheduling clock interrupt will find it.
|
|
</font><p>
|
|
<li> <font color="ffffff">Meanwhile, CPU 1 is running
|
|
<tt>force_quiescent_state()</tt>,
|
|
and notices that CPU 0 has been in dyntick idle mode,
|
|
which qualifies as an extended quiescent state.
|
|
</font><p>
|
|
<li> <font color="ffffff">CPU 0's scheduling clock
|
|
interrupt fires in the
|
|
middle of an RCU read-side critical section, and notices
|
|
that the RCU core needs something, so commences RCU softirq
|
|
processing.
|
|
</font>
|
|
<p>
|
|
<li> <font color="ffffff">CPU 0's softirq handler
|
|
executes and is just about ready
|
|
to report its quiescent state up the <tt>rcu_node</tt>
|
|
tree.
|
|
</font><p>
|
|
<li> <font color="ffffff">But CPU 1 beats it to the punch,
|
|
completing the current
|
|
grace period and starting a new one.
|
|
</font><p>
|
|
<li> <font color="ffffff">CPU 0 now reports its quiescent
|
|
state for the wrong
|
|
grace period.
|
|
That grace period might now end before the RCU read-side
|
|
critical section.
|
|
If that happens, disaster will ensue.
|
|
</font>
|
|
</ol>
|
|
|
|
<p><font color="ffffff">So the locking is absolutely required in
|
|
order to coordinate clearing of the bits with updating of the
|
|
grace-period sequence number in <tt>->gp_seq</tt>.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<h5>Blocked-Task Management</h5>
|
|
|
|
<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
|
|
midst of their RCU read-side critical sections, and these tasks
|
|
must be tracked explicitly.
|
|
The details of exactly why and how they are tracked will be covered
|
|
in a separate article on RCU read-side processing.
|
|
For now, it is enough to know that the <tt>rcu_node</tt>
|
|
structure tracks them.
|
|
|
|
<pre>
|
|
1 struct list_head blkd_tasks;
|
|
2 struct list_head *gp_tasks;
|
|
3 struct list_head *exp_tasks;
|
|
4 bool wait_blkd_tasks;
|
|
</pre>
|
|
|
|
<p>The <tt>->blkd_tasks</tt> field is a list header for
|
|
the list of blocked and preempted tasks.
|
|
As tasks undergo context switches within RCU read-side critical
|
|
sections, their <tt>task_struct</tt> structures are enqueued
|
|
(via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt>
|
|
field) onto the head of the <tt>->blkd_tasks</tt> list for the
|
|
leaf <tt>rcu_node</tt> structure corresponding to the CPU
|
|
on which the outgoing context switch executed.
|
|
As these tasks later exit their RCU read-side critical sections,
|
|
they remove themselves from the list.
|
|
This list is therefore in reverse time order, so that if one of the tasks
|
|
is blocking the current grace period, all subsequent tasks must
|
|
also be blocking that same grace period.
|
|
Therefore, a single pointer into this list suffices to track
|
|
all tasks blocking a given grace period.
|
|
That pointer is stored in <tt>->gp_tasks</tt> for normal
|
|
grace periods and in <tt>->exp_tasks</tt> for expedited
|
|
grace periods.
|
|
These last two fields are <tt>NULL</tt> if either there is
|
|
no grace period in flight or if there are no blocked tasks
|
|
preventing that grace period from completing.
|
|
If either of these two pointers is referencing a task that
|
|
removes itself from the <tt>->blkd_tasks</tt> list,
|
|
then that task must advance the pointer to the next task on
|
|
the list, or set the pointer to <tt>NULL</tt> if there
|
|
are no subsequent tasks on the list.
|
|
|
|
</p><p>For example, suppose that tasks T1, T2, and T3 are
|
|
all hard-affinitied to the largest-numbered CPU in the system.
|
|
Then if task T1 blocked in an RCU read-side
|
|
critical section, then an expedited grace period started,
|
|
then task T2 blocked in an RCU read-side critical section,
|
|
then a normal grace period started, and finally task 3 blocked
|
|
in an RCU read-side critical section, then the state of the
|
|
last leaf <tt>rcu_node</tt> structure's blocked-task list
|
|
would be as shown below:
|
|
|
|
</p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
|
|
|
|
</p><p>Task T1 is blocking both grace periods, task T2 is
|
|
blocking only the normal grace period, and task T3 is blocking
|
|
neither grace period.
|
|
Note that these tasks will not remove themselves from this list
|
|
immediately upon resuming execution.
|
|
They will instead remain on the list until they execute the outermost
|
|
<tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
|
|
section.
|
|
|
|
<p>
|
|
The <tt>->wait_blkd_tasks</tt> field indicates whether or not
|
|
the current grace period is waiting on a blocked task.
|
|
|
|
<h5>Sizing the <tt>rcu_node</tt> Array</h5>
|
|
|
|
<p>The <tt>rcu_node</tt> array is sized via a series of
|
|
C-preprocessor expressions as follows:
|
|
|
|
<pre>
|
|
1 #ifdef CONFIG_RCU_FANOUT
|
|
2 #define RCU_FANOUT CONFIG_RCU_FANOUT
|
|
3 #else
|
|
4 # ifdef CONFIG_64BIT
|
|
5 # define RCU_FANOUT 64
|
|
6 # else
|
|
7 # define RCU_FANOUT 32
|
|
8 # endif
|
|
9 #endif
|
|
10
|
|
11 #ifdef CONFIG_RCU_FANOUT_LEAF
|
|
12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
|
|
13 #else
|
|
14 # ifdef CONFIG_64BIT
|
|
15 # define RCU_FANOUT_LEAF 64
|
|
16 # else
|
|
17 # define RCU_FANOUT_LEAF 32
|
|
18 # endif
|
|
19 #endif
|
|
20
|
|
21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
|
|
22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
|
|
23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
|
|
24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
|
|
25
|
|
26 #if NR_CPUS <= RCU_FANOUT_1
|
|
27 # define RCU_NUM_LVLS 1
|
|
28 # define NUM_RCU_LVL_0 1
|
|
29 # define NUM_RCU_NODES NUM_RCU_LVL_0
|
|
30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
|
|
31 # define RCU_NODE_NAME_INIT { "rcu_node_0" }
|
|
32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
|
|
33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
|
|
34 #elif NR_CPUS <= RCU_FANOUT_2
|
|
35 # define RCU_NUM_LVLS 2
|
|
36 # define NUM_RCU_LVL_0 1
|
|
37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
|
|
38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
|
|
39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
|
|
40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
|
|
41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
|
|
42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
|
|
43 #elif NR_CPUS <= RCU_FANOUT_3
|
|
44 # define RCU_NUM_LVLS 3
|
|
45 # define NUM_RCU_LVL_0 1
|
|
46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
|
|
47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
|
|
48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
|
|
49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
|
|
50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
|
|
51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
|
|
52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
|
|
53 #elif NR_CPUS <= RCU_FANOUT_4
|
|
54 # define RCU_NUM_LVLS 4
|
|
55 # define NUM_RCU_LVL_0 1
|
|
56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
|
|
57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
|
|
58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
|
|
59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
|
|
60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
|
|
61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
|
|
62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
|
|
63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
|
|
64 #else
|
|
65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
|
|
66 #endif
|
|
</pre>
|
|
|
|
<p>The maximum number of levels in the <tt>rcu_node</tt> structure
|
|
is currently limited to four, as specified by lines 21-24
|
|
and the structure of the subsequent “if” statement.
|
|
For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which
|
|
should be sufficient for the next few years at least.
|
|
For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which
|
|
should see us through the next decade or so.
|
|
This four-level tree also allows kernels built with
|
|
<tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs,
|
|
which might be useful in very large systems having eight CPUs per
|
|
socket (but please note that no one has yet shown any measurable
|
|
performance degradation due to misaligned socket and <tt>rcu_node</tt>
|
|
boundaries).
|
|
In addition, building kernels with a full four levels of <tt>rcu_node</tt>
|
|
tree permits better testing of RCU's combining-tree code.
|
|
|
|
</p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children
|
|
are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
|
|
If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified,
|
|
it is set based on the word size of the system, which is also
|
|
the Kconfig default.
|
|
|
|
</p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are
|
|
handled by each leaf <tt>rcu_node</tt> structure.
|
|
Experience has shown that allowing a given leaf <tt>rcu_node</tt>
|
|
structure to handle 64 CPUs, as permitted by the number of bits in
|
|
the <tt>->qsmask</tt> field on a 64-bit system, results in
|
|
excessive contention for the leaf <tt>rcu_node</tt> structures'
|
|
<tt>->lock</tt> fields.
|
|
The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore
|
|
limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>.
|
|
If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value
|
|
selected is based on the word size of the system, just as for
|
|
<tt>CONFIG_RCU_FANOUT</tt>.
|
|
Lines 11-19 perform this computation.
|
|
|
|
</p><p>Lines 21-24 compute the maximum number of CPUs supported by
|
|
a single-level (which contains a single <tt>rcu_node</tt> structure),
|
|
two-level, three-level, and four-level <tt>rcu_node</tt> tree,
|
|
respectively, given the fanout specified by <tt>RCU_FANOUT</tt>
|
|
and <tt>RCU_FANOUT_LEAF</tt>.
|
|
These numbers of CPUs are retained in the
|
|
<tt>RCU_FANOUT_1</tt>,
|
|
<tt>RCU_FANOUT_2</tt>,
|
|
<tt>RCU_FANOUT_3</tt>, and
|
|
<tt>RCU_FANOUT_4</tt>
|
|
C-preprocessor variables, respectively.
|
|
|
|
</p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
|
|
statement spanning lines 26-66 that computes the number of
|
|
<tt>rcu_node</tt> structures required for each level of the tree,
|
|
as well as the number of levels required.
|
|
The number of levels is placed in the <tt>NUM_RCU_LVLS</tt>
|
|
C-preprocessor variable by lines 27, 35, 44, and 54.
|
|
The number of <tt>rcu_node</tt> structures for the topmost level
|
|
of the tree is always exactly one, and this value is unconditionally
|
|
placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55.
|
|
The rest of the levels (if any) of the <tt>rcu_node</tt> tree
|
|
are computed by dividing the maximum number of CPUs by the
|
|
fanout supported by the number of levels from the current level down,
|
|
rounding up. This computation is performed by lines 37,
|
|
46-47, and 56-58.
|
|
Lines 31-33, 40-42, 50-52, and 62-63 create initializers
|
|
for lockdep lock-class names.
|
|
Finally, lines 64-66 produce an error if the maximum number of
|
|
CPUs is too large for the specified fanout.
|
|
|
|
<h3><a name="The rcu_segcblist Structure">
|
|
The <tt>rcu_segcblist</tt> Structure</a></h3>
|
|
|
|
The <tt>rcu_segcblist</tt> structure maintains a segmented list of
|
|
callbacks as follows:
|
|
|
|
<pre>
|
|
1 #define RCU_DONE_TAIL 0
|
|
2 #define RCU_WAIT_TAIL 1
|
|
3 #define RCU_NEXT_READY_TAIL 2
|
|
4 #define RCU_NEXT_TAIL 3
|
|
5 #define RCU_CBLIST_NSEGS 4
|
|
6
|
|
7 struct rcu_segcblist {
|
|
8 struct rcu_head *head;
|
|
9 struct rcu_head **tails[RCU_CBLIST_NSEGS];
|
|
10 unsigned long gp_seq[RCU_CBLIST_NSEGS];
|
|
11 long len;
|
|
12 long len_lazy;
|
|
13 };
|
|
</pre>
|
|
|
|
<p>
|
|
The segments are as follows:
|
|
|
|
<ol>
|
|
<li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed.
|
|
These callbacks are ready to be invoked.
|
|
<li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the
|
|
current grace period.
|
|
Note that different CPUs can have different ideas about which
|
|
grace period is current, hence the <tt>->gp_seq</tt> field.
|
|
<li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next
|
|
grace period to start.
|
|
<li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been
|
|
associated with a grace period.
|
|
</ol>
|
|
|
|
<p>
|
|
The <tt>->head</tt> pointer references the first callback or
|
|
is <tt>NULL</tt> if the list contains no callbacks (which is
|
|
<i>not</i> the same as being empty).
|
|
Each element of the <tt>->tails[]</tt> array references the
|
|
<tt>->next</tt> pointer of the last callback in the corresponding
|
|
segment of the list, or the list's <tt>->head</tt> pointer if
|
|
that segment and all previous segments are empty.
|
|
If the corresponding segment is empty but some previous segment is
|
|
not empty, then the array element is identical to its predecessor.
|
|
Older callbacks are closer to the head of the list, and new callbacks
|
|
are added at the tail.
|
|
This relationship between the <tt>->head</tt> pointer, the
|
|
<tt>->tails[]</tt> array, and the callbacks is shown in this
|
|
diagram:
|
|
|
|
</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
|
|
|
|
</p><p>In this figure, the <tt>->head</tt> pointer references the
|
|
first
|
|
RCU callback in the list.
|
|
The <tt>->tails[RCU_DONE_TAIL]</tt> array element references
|
|
the <tt>->head</tt> pointer itself, indicating that none
|
|
of the callbacks is ready to invoke.
|
|
The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback
|
|
CB 2's <tt>->next</tt> pointer, which indicates that
|
|
CB 1 and CB 2 are both waiting on the current grace period,
|
|
give or take possible disagreements about exactly which grace period
|
|
is the current one.
|
|
The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element
|
|
references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt>
|
|
does, which indicates that there are no callbacks waiting on the next
|
|
RCU grace period.
|
|
The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references
|
|
CB 4's <tt>->next</tt> pointer, indicating that all the
|
|
remaining RCU callbacks have not yet been assigned to an RCU grace
|
|
period.
|
|
Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element
|
|
always references the last RCU callback's <tt>->next</tt> pointer
|
|
unless the callback list is empty, in which case it references
|
|
the <tt>->head</tt> pointer.
|
|
|
|
<p>
|
|
There is one additional important special case for the
|
|
<tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt>
|
|
when this list is <i>disabled</i>.
|
|
Lists are disabled when the corresponding CPU is offline or when
|
|
the corresponding CPU's callbacks are offloaded to a kthread,
|
|
both of which are described elsewhere.
|
|
|
|
</p><p>CPUs advance their callbacks from the
|
|
<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
|
|
<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
|
|
as grace periods advance.
|
|
|
|
</p><p>The <tt>->gp_seq[]</tt> array records grace-period
|
|
numbers corresponding to the list segments.
|
|
This is what allows different CPUs to have different ideas as to
|
|
which is the current grace period while still avoiding premature
|
|
invocation of their callbacks.
|
|
In particular, this allows CPUs that go idle for extended periods
|
|
to determine which of their callbacks are ready to be invoked after
|
|
reawakening.
|
|
|
|
</p><p>The <tt>->len</tt> counter contains the number of
|
|
callbacks in <tt>->head</tt>, and the
|
|
<tt>->len_lazy</tt> contains the number of those callbacks that
|
|
are known to only free memory, and whose invocation can therefore
|
|
be safely deferred.
|
|
|
|
<p><b>Important note</b>: It is the <tt>->len</tt> field that
|
|
determines whether or not there are callbacks associated with
|
|
this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt>
|
|
pointer.
|
|
The reason for this is that all the ready-to-invoke callbacks
|
|
(that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted
|
|
all at once at callback-invocation time (<tt>rcu_do_batch</tt>), due
|
|
to which <tt>->head</tt> may be set to NULL if there are no not-done
|
|
callbacks remaining in the <tt>rcu_segcblist</tt>.
|
|
If callback invocation must be postponed, for example, because a
|
|
high-priority process just woke up on this CPU, then the remaining
|
|
callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment and
|
|
<tt>->head</tt> once again points to the start of the segment.
|
|
In short, the head field can briefly be <tt>NULL</tt> even though the
|
|
CPU has callbacks present the entire time.
|
|
Therefore, it is not appropriate to test the <tt>->head</tt> pointer
|
|
for <tt>NULL</tt>.
|
|
|
|
<p>In contrast, the <tt>->len</tt> and <tt>->len_lazy</tt> counts
|
|
are adjusted only after the corresponding callbacks have been invoked.
|
|
This means that the <tt>->len</tt> count is zero only if
|
|
the <tt>rcu_segcblist</tt> structure really is devoid of callbacks.
|
|
Of course, off-CPU sampling of the <tt>->len</tt> count requires
|
|
careful use of appropriate synchronization, for example, memory barriers.
|
|
This synchronization can be a bit subtle, particularly in the case
|
|
of <tt>rcu_barrier()</tt>.
|
|
|
|
<h3><a name="The rcu_data Structure">
|
|
The <tt>rcu_data</tt> Structure</a></h3>
|
|
|
|
<p>The <tt>rcu_data</tt> maintains the per-CPU state for the RCU subsystem.
|
|
The fields in this structure may be accessed only from the corresponding
|
|
CPU (and from tracing) unless otherwise stated.
|
|
This structure is the
|
|
focus of quiescent-state detection and RCU callback queuing.
|
|
It also tracks its relationship to the corresponding leaf
|
|
<tt>rcu_node</tt> structure to allow more-efficient
|
|
propagation of quiescent states up the <tt>rcu_node</tt>
|
|
combining tree.
|
|
Like the <tt>rcu_node</tt> structure, it provides a local
|
|
copy of the grace-period information to allow for-free
|
|
synchronized
|
|
access to this information from the corresponding CPU.
|
|
Finally, this structure records past dyntick-idle state
|
|
for the corresponding CPU and also tracks statistics.
|
|
|
|
</p><p>The <tt>rcu_data</tt> structure's fields are discussed,
|
|
singly and in groups, in the following sections.
|
|
|
|
<h5>Connection to Other Data Structures</h5>
|
|
|
|
<p>This portion of the <tt>rcu_data</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 int cpu;
|
|
2 struct rcu_node *mynode;
|
|
3 unsigned long grpmask;
|
|
4 bool beenonline;
|
|
</pre>
|
|
|
|
<p>The <tt>->cpu</tt> field contains the number of the
|
|
corresponding CPU and the <tt>->mynode</tt> field references the
|
|
corresponding <tt>rcu_node</tt> structure.
|
|
The <tt>->mynode</tt> is used to propagate quiescent states
|
|
up the combining tree.
|
|
These two fields are constant and therefore do not require synchronization.
|
|
|
|
<p>The <tt>->grpmask</tt> field indicates the bit in
|
|
the <tt>->mynode->qsmask</tt> corresponding to this
|
|
<tt>rcu_data</tt> structure, and is also used when propagating
|
|
quiescent states.
|
|
The <tt>->beenonline</tt> flag is set whenever the corresponding
|
|
CPU comes online, which means that the debugfs tracing need not dump
|
|
out any <tt>rcu_data</tt> structure for which this flag is not set.
|
|
|
|
<h5>Quiescent-State and Grace-Period Tracking</h5>
|
|
|
|
<p>This portion of the <tt>rcu_data</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 unsigned long gp_seq;
|
|
2 unsigned long gp_seq_needed;
|
|
3 bool cpu_no_qs;
|
|
4 bool core_needs_qs;
|
|
5 bool gpwrap;
|
|
</pre>
|
|
|
|
<p>The <tt>->gp_seq</tt> field is the counterpart of the field of the same
|
|
name in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures. The
|
|
<tt>->gp_seq_needed</tt> field is the counterpart of the field of the same
|
|
name in the rcu_node</tt> structure.
|
|
They may each lag up to one behind their <tt>rcu_node</tt>
|
|
counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and
|
|
<tt>CONFIG_NO_HZ_FULL</tt> kernels can lag
|
|
arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
|
|
will catch up upon exit from dyntick-idle mode).
|
|
If the lower two bits of a given <tt>rcu_data</tt> structure's
|
|
<tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt>
|
|
structure believes that RCU is idle.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
All this replication of the grace period numbers can only cause
|
|
massive confusion.
|
|
Why not just keep a global sequence number and be done with it???
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Because if there was only a single global sequence
|
|
numbers, there would need to be a single global lock to allow
|
|
safely accessing and updating it.
|
|
And if we are not going to have a single global lock, we need
|
|
to carefully manage the numbers on a per-node basis.
|
|
Recall from the answer to a previous Quick Quiz that the consequences
|
|
of applying a previously sampled quiescent state to the wrong
|
|
grace period are quite severe.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>The <tt>->cpu_no_qs</tt> flag indicates that the
|
|
CPU has not yet passed through a quiescent state,
|
|
while the <tt>->core_needs_qs</tt> flag indicates that the
|
|
RCU core needs a quiescent state from the corresponding CPU.
|
|
The <tt>->gpwrap</tt> field indicates that the corresponding
|
|
CPU has remained idle for so long that the
|
|
<tt>gp_seq</tt> counter is in danger of overflow, which
|
|
will cause the CPU to disregard the values of its counters on
|
|
its next exit from idle.
|
|
|
|
<h5>RCU Callback Handling</h5>
|
|
|
|
<p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
|
|
the same CPU that registered them.
|
|
This is strictly a cache-locality optimization: callbacks can and
|
|
do get invoked on CPUs other than the one that registered them.
|
|
After all, if the CPU that registered a given callback has gone
|
|
offline before the callback can be invoked, there really is no other
|
|
choice.
|
|
|
|
</p><p>This portion of the <tt>rcu_data</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 struct rcu_segcblist cblist;
|
|
2 long qlen_last_fqs_check;
|
|
3 unsigned long n_cbs_invoked;
|
|
4 unsigned long n_nocbs_invoked;
|
|
5 unsigned long n_cbs_orphaned;
|
|
6 unsigned long n_cbs_adopted;
|
|
7 unsigned long n_force_qs_snap;
|
|
8 long blimit;
|
|
</pre>
|
|
|
|
<p>The <tt>->cblist</tt> structure is the segmented callback list
|
|
described earlier.
|
|
The CPU advances the callbacks in its <tt>rcu_data</tt> structure
|
|
whenever it notices that another RCU grace period has completed.
|
|
The CPU detects the completion of an RCU grace period by noticing
|
|
that the value of its <tt>rcu_data</tt> structure's
|
|
<tt>->gp_seq</tt> field differs from that of its leaf
|
|
<tt>rcu_node</tt> structure.
|
|
Recall that each <tt>rcu_node</tt> structure's
|
|
<tt>->gp_seq</tt> field is updated at the beginnings and ends of each
|
|
grace period.
|
|
|
|
<p>
|
|
The <tt>->qlen_last_fqs_check</tt> and
|
|
<tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent
|
|
states from <tt>call_rcu()</tt> and friends when callback
|
|
lists grow excessively long.
|
|
|
|
</p><p>The <tt>->n_cbs_invoked</tt>,
|
|
<tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt>
|
|
fields count the number of callbacks invoked,
|
|
sent to other CPUs when this CPU goes offline,
|
|
and received from other CPUs when those other CPUs go offline.
|
|
The <tt>->n_nocbs_invoked</tt> is used when the CPU's callbacks
|
|
are offloaded to a kthread.
|
|
|
|
<p>
|
|
Finally, the <tt>->blimit</tt> counter is the maximum number of
|
|
RCU callbacks that may be invoked at a given time.
|
|
|
|
<h5>Dyntick-Idle Handling</h5>
|
|
|
|
<p>This portion of the <tt>rcu_data</tt> structure is declared
|
|
as follows:
|
|
|
|
<pre>
|
|
1 int dynticks_snap;
|
|
2 unsigned long dynticks_fqs;
|
|
</pre>
|
|
|
|
The <tt>->dynticks_snap</tt> field is used to take a snapshot
|
|
of the corresponding CPU's dyntick-idle state when forcing
|
|
quiescent states, and is therefore accessed from other CPUs.
|
|
Finally, the <tt>->dynticks_fqs</tt> field is used to
|
|
count the number of times this CPU is determined to be in
|
|
dyntick-idle state, and is used for tracing and debugging purposes.
|
|
|
|
<p>
|
|
This portion of the rcu_data structure is declared as follows:
|
|
|
|
<pre>
|
|
1 long dynticks_nesting;
|
|
2 long dynticks_nmi_nesting;
|
|
3 atomic_t dynticks;
|
|
4 bool rcu_need_heavy_qs;
|
|
5 bool rcu_urgent_qs;
|
|
</pre>
|
|
|
|
<p>These fields in the rcu_data structure maintain the per-CPU dyntick-idle
|
|
state for the corresponding CPU.
|
|
The fields may be accessed only from the corresponding CPU (and from tracing)
|
|
unless otherwise stated.
|
|
|
|
<p>The <tt>->dynticks_nesting</tt> field counts the
|
|
nesting depth of process execution, so that in normal circumstances
|
|
this counter has value zero or one.
|
|
NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt>
|
|
field.
|
|
Because NMIs cannot be masked, changes to this variable have to be
|
|
undertaken carefully using an algorithm provided by Andy Lutomirski.
|
|
The initial transition from idle adds one, and nested transitions
|
|
add two, so that a nesting level of five is represented by a
|
|
<tt>->dynticks_nmi_nesting</tt> value of nine.
|
|
This counter can therefore be thought of as counting the number
|
|
of reasons why this CPU cannot be permitted to enter dyntick-idle
|
|
mode, aside from process-level transitions.
|
|
|
|
<p>However, it turns out that when running in non-idle kernel context,
|
|
the Linux kernel is fully capable of entering interrupt handlers that
|
|
never exit and perhaps also vice versa.
|
|
Therefore, whenever the <tt>->dynticks_nesting</tt> field is
|
|
incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field
|
|
is set to a large positive number, and whenever the
|
|
<tt>->dynticks_nesting</tt> field is decremented down to zero,
|
|
the the <tt>->dynticks_nmi_nesting</tt> field is set to zero.
|
|
Assuming that the number of misnested interrupts is not sufficient
|
|
to overflow the counter, this approach corrects the
|
|
<tt>->dynticks_nmi_nesting</tt> field every time the corresponding
|
|
CPU enters the idle loop from process context.
|
|
|
|
</p><p>The <tt>->dynticks</tt> field counts the corresponding
|
|
CPU's transitions to and from either dyntick-idle or user mode, so
|
|
that this counter has an even value when the CPU is in dyntick-idle
|
|
mode or user mode and an odd value otherwise. The transitions to/from
|
|
user mode need to be counted for user mode adaptive-ticks support
|
|
(see timers/NO_HZ.txt).
|
|
|
|
</p><p>The <tt>->rcu_need_heavy_qs</tt> field is used
|
|
to record the fact that the RCU core code would really like to
|
|
see a quiescent state from the corresponding CPU, so much so that
|
|
it is willing to call for heavy-weight dyntick-counter operations.
|
|
This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
|
|
code, which provide a momentary idle sojourn in response.
|
|
|
|
</p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record
|
|
the fact that the RCU core code would really like to see a quiescent state from
|
|
the corresponding CPU, with the various other fields indicating just how badly
|
|
RCU wants this quiescent state.
|
|
This flag is checked by RCU's context-switch path
|
|
(<tt>rcu_note_context_switch</tt>) and the cond_resched code.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Why not simply combine the <tt>->dynticks_nesting</tt>
|
|
and <tt>->dynticks_nmi_nesting</tt> counters into a
|
|
single counter that just counts the number of reasons that
|
|
the corresponding CPU is non-idle?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
Because this would fail in the presence of interrupts whose
|
|
handlers never return and of handlers that manage to return
|
|
from a made-up interrupt.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<p>Additional fields are present for some special-purpose
|
|
builds, and are discussed separately.
|
|
|
|
<h3><a name="The rcu_head Structure">
|
|
The <tt>rcu_head</tt> Structure</a></h3>
|
|
|
|
<p>Each <tt>rcu_head</tt> structure represents an RCU callback.
|
|
These structures are normally embedded within RCU-protected data
|
|
structures whose algorithms use asynchronous grace periods.
|
|
In contrast, when using algorithms that block waiting for RCU grace periods,
|
|
RCU users need not provide <tt>rcu_head</tt> structures.
|
|
|
|
</p><p>The <tt>rcu_head</tt> structure has fields as follows:
|
|
|
|
<pre>
|
|
1 struct rcu_head *next;
|
|
2 void (*func)(struct rcu_head *head);
|
|
</pre>
|
|
|
|
<p>The <tt>->next</tt> field is used
|
|
to link the <tt>rcu_head</tt> structures together in the
|
|
lists within the <tt>rcu_data</tt> structures.
|
|
The <tt>->func</tt> field is a pointer to the function
|
|
to be called when the callback is ready to be invoked, and
|
|
this function is passed a pointer to the <tt>rcu_head</tt>
|
|
structure.
|
|
However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt>
|
|
field to record the offset of the <tt>rcu_head</tt>
|
|
structure within the enclosing RCU-protected data structure.
|
|
|
|
</p><p>Both of these fields are used internally by RCU.
|
|
From the viewpoint of RCU users, this structure is an
|
|
opaque “cookie”.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
Given that the callback function <tt>->func</tt>
|
|
is passed a pointer to the <tt>rcu_head</tt> structure,
|
|
how is that function supposed to find the beginning of the
|
|
enclosing RCU-protected data structure?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
In actual practice, there is a separate callback function per
|
|
type of RCU-protected data structure.
|
|
The callback function can therefore use the <tt>container_of()</tt>
|
|
macro in the Linux kernel (or other pointer-manipulation facilities
|
|
in other software environments) to find the beginning of the
|
|
enclosing structure.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<h3><a name="RCU-Specific Fields in the task_struct Structure">
|
|
RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>
|
|
|
|
<p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some
|
|
additional fields in the <tt>task_struct</tt> structure:
|
|
|
|
<pre>
|
|
1 #ifdef CONFIG_PREEMPT_RCU
|
|
2 int rcu_read_lock_nesting;
|
|
3 union rcu_special rcu_read_unlock_special;
|
|
4 struct list_head rcu_node_entry;
|
|
5 struct rcu_node *rcu_blocked_node;
|
|
6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
|
|
7 #ifdef CONFIG_TASKS_RCU
|
|
8 unsigned long rcu_tasks_nvcsw;
|
|
9 bool rcu_tasks_holdout;
|
|
10 struct list_head rcu_tasks_holdout_list;
|
|
11 int rcu_tasks_idle_cpu;
|
|
12 #endif /* #ifdef CONFIG_TASKS_RCU */
|
|
</pre>
|
|
|
|
<p>The <tt>->rcu_read_lock_nesting</tt> field records the
|
|
nesting level for RCU read-side critical sections, and
|
|
the <tt>->rcu_read_unlock_special</tt> field is a bitmask
|
|
that records special conditions that require <tt>rcu_read_unlock()</tt>
|
|
to do additional work.
|
|
The <tt>->rcu_node_entry</tt> field is used to form lists of
|
|
tasks that have blocked within preemptible-RCU read-side critical
|
|
sections and the <tt>->rcu_blocked_node</tt> field references
|
|
the <tt>rcu_node</tt> structure whose list this task is a member of,
|
|
or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
|
|
read-side critical section.
|
|
|
|
<p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of
|
|
voluntary context switches that this task had undergone at the
|
|
beginning of the current tasks-RCU grace period,
|
|
<tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU
|
|
grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt>
|
|
is a list element enqueuing this task on the holdout list,
|
|
and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
|
|
idle task is running, but only if the task is currently running,
|
|
that is, if the CPU is currently idle.
|
|
|
|
<h3><a name="Accessor Functions">
|
|
Accessor Functions</a></h3>
|
|
|
|
<p>The following listing shows the
|
|
<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt> and
|
|
<tt>rcu_for_each_leaf_node()</tt> function and macros:
|
|
|
|
<pre>
|
|
1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
|
|
2 {
|
|
3 return &rsp->node[0];
|
|
4 }
|
|
5
|
|
6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
|
|
7 for ((rnp) = &(rsp)->node[0]; \
|
|
8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
|
|
9
|
|
10 #define rcu_for_each_leaf_node(rsp, rnp) \
|
|
11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
|
|
12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
|
|
</pre>
|
|
|
|
<p>The <tt>rcu_get_root()</tt> simply returns a pointer to the
|
|
first element of the specified <tt>rcu_state</tt> structure's
|
|
<tt>->node[]</tt> array, which is the root <tt>rcu_node</tt>
|
|
structure.
|
|
|
|
</p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
|
|
macro takes advantage of the layout of the <tt>rcu_node</tt>
|
|
structures in the <tt>rcu_state</tt> structure's
|
|
<tt>->node[]</tt> array, performing a breadth-first traversal by
|
|
simply traversing the array in order.
|
|
Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
|
|
the last part of the array, thus traversing only the leaf
|
|
<tt>rcu_node</tt> structures.
|
|
|
|
<table>
|
|
<tr><th> </th></tr>
|
|
<tr><th align="left">Quick Quiz:</th></tr>
|
|
<tr><td>
|
|
What does
|
|
<tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
|
|
contains only a single node?
|
|
</td></tr>
|
|
<tr><th align="left">Answer:</th></tr>
|
|
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
|
In the single-node case,
|
|
<tt>rcu_for_each_leaf_node()</tt> traverses the single node.
|
|
</font></td></tr>
|
|
<tr><td> </td></tr>
|
|
</table>
|
|
|
|
<h3><a name="Summary">
|
|
Summary</a></h3>
|
|
|
|
So the state of RCU is represented by an <tt>rcu_state</tt> structure,
|
|
which contains a combining tree of <tt>rcu_node</tt> and
|
|
<tt>rcu_data</tt> structures.
|
|
Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
|
|
state is tracked by dynticks-related fields in the <tt>rcu_data</tt> structure.
|
|
|
|
If you made it this far, you are well prepared to read the code
|
|
walkthroughs in the other articles in this series.
|
|
|
|
<h3><a name="Acknowledgments">
|
|
Acknowledgments</a></h3>
|
|
|
|
I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
|
|
Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
|
|
for helping me get this document into a more human-readable state.
|
|
|
|
<h3><a name="Legal Statement">
|
|
Legal Statement</a></h3>
|
|
|
|
<p>This work represents the view of the author and does not necessarily
|
|
represent the view of IBM.
|
|
|
|
</p><p>Linux is a registered trademark of Linus Torvalds.
|
|
|
|
</p><p>Other company, product, and service names may be trademarks or
|
|
service marks of others.
|
|
|
|
</body></html>
|