207 lines
9.2 KiB
ReStructuredText
207 lines
9.2 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0
|
|
|
|
=============
|
|
False Sharing
|
|
=============
|
|
|
|
What is False Sharing
|
|
=====================
|
|
False sharing is related with cache mechanism of maintaining the data
|
|
coherence of one cache line stored in multiple CPU's caches; then
|
|
academic definition for it is in [1]_. Consider a struct with a
|
|
refcount and a string::
|
|
|
|
struct foo {
|
|
refcount_t refcount;
|
|
...
|
|
char name[16];
|
|
} ____cacheline_internodealigned_in_smp;
|
|
|
|
Member 'refcount'(A) and 'name'(B) _share_ one cache line like below::
|
|
|
|
+-----------+ +-----------+
|
|
| CPU 0 | | CPU 1 |
|
|
+-----------+ +-----------+
|
|
/ |
|
|
/ |
|
|
V V
|
|
+----------------------+ +----------------------+
|
|
| A B | Cache 0 | A B | Cache 1
|
|
+----------------------+ +----------------------+
|
|
| |
|
|
---------------------------+------------------+-----------------------------
|
|
| |
|
|
+----------------------+
|
|
| |
|
|
+----------------------+
|
|
Main Memory | A B |
|
|
+----------------------+
|
|
|
|
'refcount' is modified frequently, but 'name' is set once at object
|
|
creation time and is never modified. When many CPUs access 'foo' at
|
|
the same time, with 'refcount' being only bumped by one CPU frequently
|
|
and 'name' being read by other CPUs, all those reading CPUs have to
|
|
reload the whole cache line over and over due to the 'sharing', even
|
|
though 'name' is never changed.
|
|
|
|
There are many real-world cases of performance regressions caused by
|
|
false sharing. One of these is a rw_semaphore 'mmap_lock' inside
|
|
mm_struct struct, whose cache line layout change triggered a
|
|
regression and Linus analyzed in [2]_.
|
|
|
|
There are two key factors for a harmful false sharing:
|
|
|
|
* A global datum accessed (shared) by many CPUs
|
|
* In the concurrent accesses to the data, there is at least one write
|
|
operation: write/write or write/read cases.
|
|
|
|
The sharing could be from totally unrelated kernel components, or
|
|
different code paths of the same kernel component.
|
|
|
|
|
|
False Sharing Pitfalls
|
|
======================
|
|
Back in time when one platform had only one or a few CPUs, hot data
|
|
members could be purposely put in the same cache line to make them
|
|
cache hot and save cacheline/TLB, like a lock and the data protected
|
|
by it. But for recent large system with hundreds of CPUs, this may
|
|
not work when the lock is heavily contended, as the lock owner CPU
|
|
could write to the data, while other CPUs are busy spinning the lock.
|
|
|
|
Looking at past cases, there are several frequently occurring patterns
|
|
for false sharing:
|
|
|
|
* lock (spinlock/mutex/semaphore) and data protected by it are
|
|
purposely put in one cache line.
|
|
* global data being put together in one cache line. Some kernel
|
|
subsystems have many global parameters of small size (4 bytes),
|
|
which can easily be grouped together and put into one cache line.
|
|
* data members of a big data structure randomly sitting together
|
|
without being noticed (cache line is usually 64 bytes or more),
|
|
like 'mem_cgroup' struct.
|
|
|
|
Following 'mitigation' section provides real-world examples.
|
|
|
|
False sharing could easily happen unless they are intentionally
|
|
checked, and it is valuable to run specific tools for performance
|
|
critical workloads to detect false sharing affecting performance case
|
|
and optimize accordingly.
|
|
|
|
|
|
How to detect and analyze False Sharing
|
|
========================================
|
|
perf record/report/stat are widely used for performance tuning, and
|
|
once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
|
|
be further used to detect and pinpoint the possible false sharing
|
|
data structures. 'addr2line' is also good at decoding instruction
|
|
pointer when there are multiple layers of inline functions.
|
|
|
|
perf-c2c can capture the cache lines with most false sharing hits,
|
|
decoded functions (line number of file) accessing that cache line,
|
|
and in-line offset of the data. Simple commands are::
|
|
|
|
$ perf c2c record -ag sleep 3
|
|
$ perf c2c report --call-graph none -k vmlinux
|
|
|
|
When running above during testing will-it-scale's tlb_flush1 case,
|
|
perf reports something like::
|
|
|
|
Total records : 1658231
|
|
Locked Load/Store Operations : 89439
|
|
Load Operations : 623219
|
|
Load Local HITM : 92117
|
|
Load Remote HITM : 139
|
|
|
|
#----------------------------------------------------------------------
|
|
4 0 2374 0 0 0 0xff1100088366d880
|
|
#----------------------------------------------------------------------
|
|
0.00% 42.29% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff81373b7b 0 231 129 5312 64 [k] __mod_lruvec_page_state [kernel.vmlinux] memcontrol.h:752 1
|
|
0.00% 13.10% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff81374718 0 226 97 3551 64 [k] folio_lruvec_lock_irqsave [kernel.vmlinux] memcontrol.h:752 1
|
|
0.00% 11.20% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff812c29bf 0 170 136 555 64 [k] lru_add_fn [kernel.vmlinux] mm_inline.h:41 1
|
|
0.00% 7.62% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff812c3ec5 0 175 108 632 64 [k] release_pages [kernel.vmlinux] mm_inline.h:41 1
|
|
0.00% 23.29% 0.00% 0.00% 0.00% 0x10 1 1 0xffffffff81372d0a 0 234 279 1051 64 [k] __mod_memcg_lruvec_state [kernel.vmlinux] memcontrol.c:736 1
|
|
|
|
A nice introduction for perf-c2c is [3]_.
|
|
|
|
'pahole' decodes data structure layouts delimited in cache line
|
|
granularity. Users can match the offset in perf-c2c output with
|
|
pahole's decoding to locate the exact data members. For global
|
|
data, users can search the data address in System.map.
|
|
|
|
|
|
Possible Mitigations
|
|
====================
|
|
False sharing does not always need to be mitigated. False sharing
|
|
mitigations should balance performance gains with complexity and
|
|
space consumption. Sometimes, lower performance is OK, and it's
|
|
unnecessary to hyper-optimize every rarely used data structure or
|
|
a cold data path.
|
|
|
|
False sharing hurting performance cases are seen more frequently with
|
|
core count increasing. Because of these detrimental effects, many
|
|
patches have been proposed across variety of subsystems (like
|
|
networking and memory management) and merged. Some common mitigations
|
|
(with examples) are:
|
|
|
|
* Separate hot global data in its own dedicated cache line, even if it
|
|
is just a 'short' type. The downside is more consumption of memory,
|
|
cache line and TLB entries.
|
|
|
|
- Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")
|
|
|
|
* Reorganize the data structure, separate the interfering members to
|
|
different cache lines. One downside is it may introduce new false
|
|
sharing of other members.
|
|
|
|
- Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")
|
|
|
|
* Replace 'write' with 'read' when possible, especially in loops.
|
|
Like for some global variable, use compare(read)-then-write instead
|
|
of unconditional write. For example, use::
|
|
|
|
if (!test_bit(XXX))
|
|
set_bit(XXX);
|
|
|
|
instead of directly "set_bit(XXX);", similarly for atomic_t data::
|
|
|
|
if (atomic_read(XXX) == AAA)
|
|
atomic_set(XXX, BBB);
|
|
|
|
- Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
|
|
- Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
|
|
|
|
* Turn hot global data to 'per-cpu data + global data' when possible,
|
|
or reasonably increase the threshold for syncing per-cpu data to
|
|
global data, to reduce or postpone the 'write' to that global data.
|
|
|
|
- Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
|
|
- Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
|
|
|
|
Surely, all mitigations should be carefully verified to not cause side
|
|
effects. To avoid introducing false sharing when coding, it's better
|
|
to:
|
|
|
|
* Be aware of cache line boundaries
|
|
* Group mostly read-only fields together
|
|
* Group things that are written at the same time together
|
|
* Separate frequently read and frequently written fields on
|
|
different cache lines.
|
|
|
|
and better add a comment stating the false sharing consideration.
|
|
|
|
One note is, sometimes even after a severe false sharing is detected
|
|
and solved, the performance may still have no obvious improvement as
|
|
the hotspot switches to a new place.
|
|
|
|
|
|
Miscellaneous
|
|
=============
|
|
One open issue is that kernel has an optional data structure
|
|
randomization mechanism, which also randomizes the situation of cache
|
|
line sharing of data members.
|
|
|
|
|
|
.. [1] https://en.wikipedia.org/wiki/False_sharing
|
|
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
|
|
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/
|