Merge branch 'lkmm' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/core

Pull LKMM changes for v5.10 from Paul E. McKenney.

Various documentation updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit is contained in:
Ingo Molnar 2020-10-09 08:46:53 +02:00
commit 2116d708b0
6 changed files with 1410 additions and 134 deletions

View File

@ -3,9 +3,9 @@
C Self R W RMW Self R W DR DW RMW SV
-- ---- - - --- ---- - - -- -- --- --
Store, e.g., WRITE_ONCE() Y Y
Load, e.g., READ_ONCE() Y Y Y Y
Unsuccessful RMW operation Y Y Y Y
Relaxed store Y Y
Relaxed load Y Y Y Y
Relaxed RMW operation Y Y Y Y
rcu_dereference() Y Y Y Y
Successful *_acquire() R Y Y Y Y Y Y
Successful *_release() C Y Y Y W Y
@ -17,14 +17,19 @@ smp_mb__before_atomic() CP Y Y Y a a a a Y
smp_mb__after_atomic() CP a a Y Y Y Y Y Y
Key: C: Ordering is cumulative
P: Ordering propagates
R: Read, for example, READ_ONCE(), or read portion of RMW
W: Write, for example, WRITE_ONCE(), or write portion of RMW
Y: Provides ordering
a: Provides ordering given intervening RMW atomic operation
DR: Dependent read (address dependency)
DW: Dependent write (address, data, or control dependency)
RMW: Atomic read-modify-write operation
SELF: Orders self, as opposed to accesses before and/or after
SV: Orders later accesses to the same variable
Key: Relaxed: A relaxed operation is either READ_ONCE(), WRITE_ONCE(),
a *_relaxed() RMW operation, an unsuccessful RMW
operation, a non-value-returning RMW operation such
as atomic_inc(), or one of the atomic*_read() and
atomic*_set() family of operations.
C: Ordering is cumulative
P: Ordering propagates
R: Read, for example, READ_ONCE(), or read portion of RMW
W: Write, for example, WRITE_ONCE(), or write portion of RMW
Y: Provides ordering
a: Provides ordering given intervening RMW atomic operation
DR: Dependent read (address dependency)
DW: Dependent write (address, data, or control dependency)
RMW: Atomic read-modify-write operation
SELF: Orders self, as opposed to accesses before and/or after
SV: Orders later accesses to the same variable

File diff suppressed because it is too large Load Diff

View File

@ -1,7 +1,7 @@
This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances. Many of these recipes include example code from
v4.13 of the Linux kernel.
v5.7 of the Linux kernel.
The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
@ -278,7 +278,7 @@ is present if the value loaded determines the address of a later access
first place (control dependency). Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
In lib/prime_numbers.c, the expand_to_next_prime() function invokes
In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference(). This combination mediates access to a bit vector
that is expanded as additional primes are needed.

View File

@ -120,7 +120,7 @@ o Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. "Herding
o Jade Alglave, Patrick Cousot, and Luc Maranget. 2016. "Syntax and
semantics of the weak consistency model specification language
cat". CoRR abs/1608.07531 (2016). http://arxiv.org/abs/1608.07531
cat". CoRR abs/1608.07531 (2016). https://arxiv.org/abs/1608.07531
Memory-model comparisons

View File

@ -0,0 +1,271 @@
This document provides options for those wishing to keep their
memory-ordering lives simple, as is necessary for those whose domain
is complex. After all, there are bugs other than memory-ordering bugs,
and the time spent gaining memory-ordering knowledge is not available
for gaining domain knowledge. Furthermore Linux-kernel memory model
(LKMM) is quite complex, with subtle differences in code often having
dramatic effects on correctness.
The options near the beginning of this list are quite simple. The idea
is not that kernel hackers don't already know about them, but rather
that they might need the occasional reminder.
Please note that this is a generic guide, and that specific subsystems
will often have special requirements or idioms. For example, developers
of MMIO-based device drivers will often need to use mb(), rmb(), and
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
to be more natural than smp_load_acquire() and smp_store_release().
On the other hand, those coming in from other environments will likely
be more familiar with these last two.
Single-threaded code
====================
In single-threaded code, there is no reordering, at least assuming
that your toolchain and hardware are working correctly. In addition,
it is generally a mistake to assume your code will only run in a single
threaded context as the kernel can enter the same code path on multiple
CPUs at the same time. One important exception is a function that makes
no external data references.
In the general case, you will need to take explicit steps to ensure that
your code really is executed within a single thread that does not access
shared variables. A simple way to achieve this is to define a global lock
that you acquire at the beginning of your code and release at the end,
taking care to ensure that all references to your code's shared data are
also carried out under that same lock. Because only one thread can hold
this lock at a given time, your code will be executed single-threaded.
This approach is called "code locking".
Code locking can severely limit both performance and scalability, so it
should be used with caution, and only on code paths that execute rarely.
After all, a huge amount of effort was required to remove the Linux
kernel's old "Big Kernel Lock", so let's please be very careful about
adding new "little kernel locks".
One of the advantages of locking is that, in happy contrast with the
year 1981, almost all kernel developers are very familiar with locking.
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
the formerly feared deadlock scenarios.
Please use the standard locking primitives provided by the kernel rather
than rolling your own. For one thing, the standard primitives interact
properly with lockdep. For another thing, these primitives have been
tuned to deal better with high contention. And for one final thing, it is
surprisingly hard to correctly code production-quality lock acquisition
and release functions. After all, even simple non-production-quality
locking functions must carefully prevent both the CPU and the compiler
from moving code in either direction across the locking function.
Despite the scalability limitations of single-threaded code, RCU
takes this approach for much of its grace-period processing and also
for early-boot operation. The reason RCU is able to scale despite
single-threaded grace-period processing is use of batching, where all
updates that accumulated during one grace period are handled by the
next one. In other words, slowing down grace-period processing makes
it more efficient. Nor is RCU unique: Similar batching optimizations
are used in many I/O operations.
Packaged code
=============
Even if performance and scalability concerns prevent your code from
being completely single-threaded, it is often possible to use library
functions that handle the concurrency nearly or entirely on their own.
This approach delegates any LKMM worries to the library maintainer.
In the kernel, what is the "library"? Quite a bit. It includes the
contents of the lib/ directory, much of the include/linux/ directory along
with a lot of other heavily used APIs. But heavily used examples include
the list macros (for example, include/linux/{,rcu}list.h), workqueues,
smp_call_function(), and the various hash tables and search trees.
Data locking
============
With code locking, we use single-threaded code execution to guarantee
serialized access to the data that the code is accessing. However,
we can also achieve this by instead associating the lock with specific
instances of the data structures. This creates a "critical section"
in the code execution that will execute as though it is single threaded.
By placing all the accesses and modifications to a shared data structure
inside a critical section, we ensure that the execution context that
holds the lock has exclusive access to the shared data.
The poster boy for this approach is the hash table, where placing a lock
in each hash bucket allows operations on different buckets to proceed
concurrently. This works because the buckets do not overlap with each
other, so that an operation on one bucket does not interfere with any
other bucket.
As the number of buckets increases, data locking scales naturally.
In particular, if the amount of data increases with the number of CPUs,
increasing the number of buckets as the number of CPUs increase results
in a naturally scalable data structure.
Per-CPU processing
==================
Partitioning processing and data over CPUs allows each CPU to take
a single-threaded approach while providing excellent performance and
scalability. Of course, there is no free lunch: The dark side of this
excellence is substantially increased memory footprint.
In addition, it is sometimes necessary to occasionally update some global
view of this processing and data, in which case something like locking
must be used to protect this global view. This is the approach taken
by the percpu_counter infrastructure. In many cases, there are already
generic/library variants of commonly used per-cpu constructs available.
Please use them rather than rolling your own.
RCU uses DEFINE_PER_CPU*() declaration to create a number of per-CPU
data sets. For example, each CPU does private quiescent-state processing
within its instance of the per-CPU rcu_data structure, and then uses data
locking to report quiescent states up the grace-period combining tree.
Packaged primitives: Sequence locking
=====================================
Lockless programming is considered by many to be more difficult than
lock-based programming, but there are a few lockless design patterns that
have been built out into an API. One of these APIs is sequence locking.
Although this APIs can be used in extremely complex ways, there are simple
and effective ways of using it that avoid the need to pay attention to
memory ordering.
The basic keep-things-simple rule for sequence locking is "do not write
in read-side code". Yes, you can do writes from within sequence-locking
readers, but it won't be so simple. For example, such writes will be
lockless and should be idempotent.
For more sophisticated use cases, LKMM can guide you, including use
cases involving combining sequence locking with other synchronization
primitives. (LKMM does not yet know about sequence locking, so it is
currently necessary to open-code it in your litmus tests.)
Additional information may be found in include/linux/seqlock.h.
Packaged primitives: RCU
========================
Another lockless design pattern that has been baked into an API
is RCU. The Linux kernel makes sophisticated use of RCU, but the
keep-things-simple rules for RCU are "do not write in read-side code"
and "do not update anything that is visible to and accessed by readers",
and "protect updates with locking".
These rules are illustrated by the functions foo_update_a() and
foo_get_a() shown in Documentation/RCU/whatisRCU.rst. Additional
RCU usage patterns maybe found in Documentation/RCU and in the
source code.
Packaged primitives: Atomic operations
======================================
Back in the day, the Linux kernel had three types of atomic operations:
1. Initialization and read-out, such as atomic_set() and atomic_read().
2. Operations that did not return a value and provided no ordering,
such as atomic_inc() and atomic_dec().
3. Operations that returned a value and provided full ordering, such as
atomic_add_return() and atomic_dec_and_test(). Note that some
value-returning operations provide full ordering only conditionally.
For example, cmpxchg() provides ordering only upon success.
More recent kernels have operations that return a value but do not
provide full ordering. These are flagged with either a _relaxed()
suffix (providing no ordering), or an _acquire() or _release() suffix
(providing limited ordering).
Additional information may be found in these files:
Documentation/atomic_t.txt
Documentation/atomic_bitops.txt
Documentation/core-api/atomic_ops.rst
Documentation/core-api/refcount-vs-atomic.rst
Reading code using these primitives is often also quite helpful.
Lockless, fully ordered
=======================
When using locking, there often comes a time when it is necessary
to access some variable or another without holding the data lock
that serializes access to that variable.
If you want to keep things simple, use the initialization and read-out
operations from the previous section only when there are no racing
accesses. Otherwise, use only fully ordered operations when accessing
or modifying the variable. This approach guarantees that code prior
to a given access to that variable will be seen by all CPUs has having
happened before any code following any later access to that same variable.
Please note that per-CPU functions are not atomic operations and
hence they do not provide any ordering guarantees at all.
If the lockless accesses are frequently executed reads that are used
only for heuristics, or if they are frequently executed writes that
are used only for statistics, please see the next section.
Lockless statistics and heuristics
==================================
Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
WRITE_ONCE() can safely be used in some cases. These primitives provide
no ordering, but they do prevent the compiler from carrying out a number
of destructive optimizations (for which please see the next section).
One example use for these primitives is statistics, such as per-CPU
counters exemplified by the rt_cache_stat structure's routing-cache
statistics counters. Another example use case is heuristics, such as
the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
controlling how often RCU scans for idle CPUs.
But be careful. "Unordered" really does mean "unordered". It is all
too easy to assume ordering, and this assumption must be avoided when
using these primitives.
Don't let the compiler trip you up
==================================
It can be quite tempting to use plain C-language accesses for lockless
loads from and stores to shared variables. Although this is both
possible and quite common in the Linux kernel, it does require a
surprising amount of analysis, care, and knowledge about the compiler.
Yes, some decades ago it was not unfair to consider a C compiler to be
an assembler with added syntax and better portability, but the advent of
sophisticated optimizing compilers mean that those days are long gone.
Today's optimizing compilers can profoundly rewrite your code during the
translation process, and have long been ready, willing, and able to do so.
Therefore, if you really need to use C-language assignments instead of
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
understanding of both the C standard and your compiler. Here are some
introductory references and some tooling to start you on this noble quest:
Who's afraid of a big bad optimizing compiler?
https://lwn.net/Articles/793253/
Calibrating your fear of big bad optimizing compilers
https://lwn.net/Articles/799218/
Concurrency bugs should fear the big bad data-race detector (part 1)
https://lwn.net/Articles/816850/
Concurrency bugs should fear the big bad data-race detector (part 2)
https://lwn.net/Articles/816854/
More complex use cases
======================
If the alternatives above do not do what you need, please look at the
recipes-pairs.txt file to peel off the next layer of the memory-ordering
onion.

View File

@ -63,10 +63,32 @@ BASIC USAGE: HERD7
==================
The memory model is used, in conjunction with "herd7", to exhaustively
explore the state space of small litmus tests.
explore the state space of small litmus tests. Documentation describing
the format, features, capabilities and limitations of these litmus
tests is available in tools/memory-model/Documentation/litmus-tests.txt.
For example, to run SB+fencembonceonces.litmus against the memory model:
Example litmus tests may be found in the Linux-kernel source tree:
tools/memory-model/litmus-tests/
Documentation/litmus-tests/
Several thousand more example litmus tests are available here:
https://github.com/paulmckrcu/litmus
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/CodeSamples/formal/herd
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git/tree/CodeSamples/formal/litmus
Documentation describing litmus tests and now to use them may be found
here:
tools/memory-model/Documentation/litmus-tests.txt
The remainder of this section uses the SB+fencembonceonces.litmus test
located in the tools/memory-model directory.
To run SB+fencembonceonces.litmus against the memory model:
$ cd $LINUX_SOURCE_TREE/tools/memory-model
$ herd7 -conf linux-kernel.cfg litmus-tests/SB+fencembonceonces.litmus
Here is the corresponding output:
@ -87,7 +109,11 @@ Here is the corresponding output:
The "Positive: 0 Negative: 3" and the "Never 0 3" each indicate that
this litmus test's "exists" clause can not be satisfied.
See "herd7 -help" or "herdtools7/doc/" for more information.
See "herd7 -help" or "herdtools7/doc/" for more information on running the
tool itself, but please be aware that this documentation is intended for
people who work on the memory model itself, that is, people making changes
to the tools/memory-model/linux-kernel.* files. It is not intended for
people focusing on writing, understanding, and running LKMM litmus tests.
=====================
@ -124,7 +150,11 @@ that during two million trials, the state specified in this litmus
test's "exists" clause was not reached.
And, as with "herd7", please see "klitmus7 -help" or "herdtools7/doc/"
for more information.
for more information. And again, please be aware that this documentation
is intended for people who work on the memory model itself, that is,
people making changes to the tools/memory-model/linux-kernel.* files.
It is not intended for people focusing on writing, understanding, and
running LKMM litmus tests.
====================
@ -137,12 +167,21 @@ Documentation/cheatsheet.txt
Documentation/explanation.txt
Describes the memory model in detail.
Documentation/litmus-tests.txt
Describes the format, features, capabilities, and limitations
of the litmus tests that LKMM can evaluate.
Documentation/recipes.txt
Lists common memory-ordering patterns.
Documentation/references.txt
Provides background reading.
Documentation/simple.txt
Starting point for someone new to Linux-kernel concurrency.
And also for those needing a reminder of the simpler approaches
to concurrency!
linux-kernel.bell
Categorizes the relevant instructions, including memory
references, memory barriers, atomic read-modify-write operations,
@ -187,116 +226,3 @@ README
This file.
scripts Various scripts, see scripts/README.
===========
LIMITATIONS
===========
The Linux-kernel memory model (LKMM) has the following limitations:
1. Compiler optimizations are not accurately modeled. Of course,
the use of READ_ONCE() and WRITE_ONCE() limits the compiler's
ability to optimize, but under some circumstances it is possible
for the compiler to undermine the memory model. For more
information, see Documentation/explanation.txt (in particular,
the "THE PROGRAM ORDER RELATION: po AND po-loc" and "A WARNING"
sections).
Note that this limitation in turn limits LKMM's ability to
accurately model address, control, and data dependencies.
For example, if the compiler can deduce the value of some variable
carrying a dependency, then the compiler can break that dependency
by substituting a constant of that value.
2. Multiple access sizes for a single variable are not supported,
and neither are misaligned or partially overlapping accesses.
3. Exceptions and interrupts are not modeled. In some cases,
this limitation can be overcome by modeling the interrupt or
exception with an additional process.
4. I/O such as MMIO or DMA is not supported.
5. Self-modifying code (such as that found in the kernel's
alternatives mechanism, function tracer, Berkeley Packet Filter
JIT compiler, and module loader) is not supported.
6. Complete modeling of all variants of atomic read-modify-write
operations, locking primitives, and RCU is not provided.
For example, call_rcu() and rcu_barrier() are not supported.
However, a substantial amount of support is provided for these
operations, as shown in the linux-kernel.def file.
a. When rcu_assign_pointer() is passed NULL, the Linux
kernel provides no ordering, but LKMM models this
case as a store release.
b. The "unless" RMW operations are not currently modeled:
atomic_long_add_unless(), atomic_inc_unless_negative(),
and atomic_dec_unless_positive(). These can be emulated
in litmus tests, for example, by using atomic_cmpxchg().
One exception of this limitation is atomic_add_unless(),
which is provided directly by herd7 (so no corresponding
definition in linux-kernel.def). atomic_add_unless() is
modeled by herd7 therefore it can be used in litmus tests.
c. The call_rcu() function is not modeled. It can be
emulated in litmus tests by adding another process that
invokes synchronize_rcu() and the body of the callback
function, with (for example) a release-acquire from
the site of the emulated call_rcu() to the beginning
of the additional process.
d. The rcu_barrier() function is not modeled. It can be
emulated in litmus tests emulating call_rcu() via
(for example) a release-acquire from the end of each
additional call_rcu() process to the site of the
emulated rcu-barrier().
e. Although sleepable RCU (SRCU) is now modeled, there
are some subtle differences between its semantics and
those in the Linux kernel. For example, the kernel
might interpret the following sequence as two partially
overlapping SRCU read-side critical sections:
1 r1 = srcu_read_lock(&my_srcu);
2 do_something_1();
3 r2 = srcu_read_lock(&my_srcu);
4 do_something_2();
5 srcu_read_unlock(&my_srcu, r1);
6 do_something_3();
7 srcu_read_unlock(&my_srcu, r2);
In contrast, LKMM will interpret this as a nested pair of
SRCU read-side critical sections, with the outer critical
section spanning lines 1-7 and the inner critical section
spanning lines 3-5.
This difference would be more of a concern had anyone
identified a reasonable use case for partially overlapping
SRCU read-side critical sections. For more information,
please see: https://paulmck.livejournal.com/40593.html
f. Reader-writer locking is not modeled. It can be
emulated in litmus tests using atomic read-modify-write
operations.
The "herd7" tool has some additional limitations of its own, apart from
the memory model:
1. Non-trivial data structures such as arrays or structures are
not supported. However, pointers are supported, allowing trivial
linked lists to be constructed.
2. Dynamic memory allocation is not supported, although this can
be worked around in some cases by supplying multiple statically
allocated variables.
Some of these limitations may be overcome in the future, but others are
more likely to be addressed by incorporating the Linux-kernel memory model
into other tools.
Finally, please note that LKMM is subject to change as hardware, use cases,
and compilers evolve.