OpenCloudOS-Kernel/Documentation
Darrick J. Wong dd81dc0559 xfs: improve CIL scalability
This series aims to improve the scalability of XFS transaction
 commits on large CPU count machines. My 32p machine hits contention
 limits in xlog_cil_commit() at about 700,000 transaction commits a
 second. It hits this at 16 thread workloads, and 32 thread
 workloads go no faster and just burn CPU on the CIL spinlocks.
 
 This patchset gets rid of spinlocks and global serialisation points
 in the xlog_cil_commit() path. It does this by moving to a
 combination of per-cpu counters, unordered per-cpu lists and
 post-ordered per-cpu lists.
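
As an illustration of the per-cpu counter idea only, here is a minimal userspace sketch. It is not the kernel's percpu API: the names `pcp_counter`, `pcp_counter_add()` and `pcp_counter_sum()` are hypothetical, and the CPU count and 64-byte cacheline size are assumptions. The commit fast path touches only its own CPU's slot, and the sum across slots is computed only when a total is actually needed.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS        4    /* assumed CPU count for the sketch */
#define CACHELINE_SIZE 64   /* assumed cacheline size in bytes */

/* Hypothetical per-cpu counter: one padded slot per CPU so that
 * updates from different CPUs do not share a cacheline. */
struct pcp_counter {
    struct {
        atomic_long val;
        char pad[CACHELINE_SIZE - sizeof(atomic_long)];
    } slot[NR_CPUS];
};

static void pcp_counter_init(struct pcp_counter *c)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        atomic_init(&c->slot[cpu].val, 0);
}

/* Fast path: update only the local CPU's slot, no global lock. */
static void pcp_counter_add(struct pcp_counter *c, int cpu, long delta)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    atomic_fetch_add_explicit(&c->slot[cpu].val, delta,
                              memory_order_relaxed);
}

/* Slow path: walk every slot to compute the total. */
static long pcp_counter_sum(struct pcp_counter *c)
{
    long sum = 0;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        sum += atomic_load_explicit(&c->slot[cpu].val,
                                    memory_order_relaxed);
    return sum;
}
```

The total is only approximate while other CPUs are still adding to their slots, which is why a design like this wants to keep the summation out of the hot path.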
 
 This results in transaction commit rates exceeding 1.4 million
 commits/s under certain unlink workloads, and while the log lock
 contention is largely gone there is still significant lock
 contention in the VFS (dentry cache, inode cache and security layers)
 at >600,000 transactions/s that still limits scalability.
 
 The changes to the CIL accounting and behaviour, combined with the
 structural changes to xlog_write() in prior patchsets make the
 per-cpu restructuring possible and sane. This allows us to move to
 precalculated reservation requirements that allow for reservation
 stealing to be accounted across multiple CPUs accurately.
 
 That is, instead of trying to account for continuation log opheaders
 on a "growth" basis, we pre-calculate how many iclogs we'll need to
 write out a maximally sized CIL checkpoint and steal that reserved
 space one commit at a time until the CIL has a full
 reservation. If we ever run a commit when we are already at the hard
 limit (because of post-throttling) we simply take an extra reservation
 from each commit that is run when over the limit. Hence we don't
 need to do space usage math in the fast path and so never need to
 sum the per-cpu counters in this fast path.
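
The precalculated reservation described above can be modelled roughly as follows. This is a userspace sketch under stated assumptions, not the XFS implementation: the structure, the function names, and the iclog, header and checkpoint sizes are all hypothetical values chosen for the example. The number of iclog headers for a maximally sized checkpoint is computed once, commits hand over header space until that count reaches zero, and a commit that runs while over the hard limit gives up one extra header's worth.

```c
#include <assert.h>
#include <stdatomic.h>

#define ICLOG_SIZE     (32L * 1024)        /* assumed iclog size, bytes */
#define ICLOG_HDR_SIZE 512L                /* assumed per-iclog header cost */
#define CIL_MAX_BYTES  (8L * 1024 * 1024)  /* assumed maximal checkpoint size */

/* Hypothetical model of the CIL's header reservation state. */
struct cil_res_model {
    atomic_long hdrs_needed;  /* iclog headers still to be reserved */
    atomic_long hdr_bytes;    /* header bytes stolen from commits so far */
};

/* Precalculate how many iclogs a maximally sized checkpoint needs. */
static long cil_max_iclog_hdrs(void)
{
    return (CIL_MAX_BYTES + ICLOG_SIZE - 1) / ICLOG_SIZE;
}

static void cil_res_init(struct cil_res_model *cil)
{
    atomic_init(&cil->hdrs_needed, cil_max_iclog_hdrs());
    atomic_init(&cil->hdr_bytes, 0);
}

/*
 * Called once per commit. ticket_hdrs is the number of iclog headers
 * the committing transaction reserved. Returns the bytes stolen from
 * that transaction. No per-cpu space counters are summed here.
 */
static long cil_steal_hdr_res(struct cil_res_model *cil, long ticket_hdrs,
                              int over_hard_limit)
{
    long steal = 0;

    assert(ticket_hdrs > 0);
    if (atomic_load(&cil->hdrs_needed) > 0) {
        /* CIL reservation not yet full: take this commit's headers. */
        steal = ticket_hdrs * ICLOG_HDR_SIZE;
        atomic_fetch_sub(&cil->hdrs_needed, ticket_hdrs);
    } else if (over_hard_limit) {
        /* Already past the hard limit: take one extra header. */
        steal = ICLOG_HDR_SIZE;
    }
    atomic_fetch_add(&cil->hdr_bytes, steal);
    return steal;
}
```

With these assumed sizes the checkpoint needs 256 iclog headers, so the first 256 single-header commits each give up 512 bytes and later commits give up nothing unless they run over the hard limit.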
 
 Similarly, per-cpu lists have the problem of ordering - we can't
 remove an item from a per-cpu list if we want to move it forward in
 the CIL. We solve this problem by using an atomic counter to give
 every commit a sequence number that is copied into the log items in
 that transaction. Hence relogging items just overwrites the sequence
 number in the log item, and does not move it in the per-cpu lists.
 Once we reaggregate the per-cpu lists back into a single list in the
 CIL push work, we can run it through list_sort() and reorder it back
 into a globally ordered list. This costs a bit of CPU time, but now
 that the CIL can run multiple works and pipelines properly, this is
 not a limiting factor for performance. It does increase fsync
 latency when the CIL is full, but workloads issuing large numbers of
 fsync()s or sync transactions end up with very small CILs and so the
 latency impact of sorting is not measurable for such workloads.
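
The sequence-number ordering scheme can be sketched in userspace as below. This is an illustration only: it uses fixed-size arrays and the C library's `qsort()` in place of kernel linked lists and `list_sort()`, and every structure and function name here is hypothetical. A relogged item keeps its position on whichever per-cpu list it was first added to and only has its order ID overwritten. The push path gathers all lists and sorts by order ID.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NR_CPUS   4   /* assumed CPU count for the sketch */
#define MAX_ITEMS 16  /* assumed per-cpu list capacity */

struct log_item {
    int id;                  /* hypothetical object identifier */
    unsigned long order_id;  /* sequence of the last commit that logged it */
    int in_cil;              /* already on some per-cpu list? */
};

struct pcp_list {
    struct log_item *items[MAX_ITEMS];
    int count;
};

struct cil_lists {
    atomic_ulong commit_seq;       /* global commit sequence counter */
    struct pcp_list pcp[NR_CPUS];  /* unordered per-cpu item lists */
};

static void cil_lists_init(struct cil_lists *cil)
{
    int cpu;

    atomic_init(&cil->commit_seq, 0);
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        cil->pcp[cpu].count = 0;
}

/* Each commit takes one sequence number for all of its items. */
static unsigned long cil_next_seq(struct cil_lists *cil)
{
    return atomic_fetch_add(&cil->commit_seq, 1) + 1;
}

static void cil_insert_item(struct cil_lists *cil, int cpu,
                            struct log_item *lip, unsigned long seq)
{
    struct pcp_list *l = &cil->pcp[cpu];

    lip->order_id = seq;  /* relogging only overwrites the order ID */
    if (lip->in_cil)
        return;           /* already listed somewhere: do not move it */
    assert(l->count < MAX_ITEMS);
    l->items[l->count++] = lip;
    lip->in_cil = 1;
}

static int cmp_order(const void *a, const void *b)
{
    const struct log_item *x = *(struct log_item *const *)a;
    const struct log_item *y = *(struct log_item *const *)b;

    return (x->order_id > y->order_id) - (x->order_id < y->order_id);
}

/* Push path: gather every per-cpu list, then restore global order.
 * out must have room for NR_CPUS * MAX_ITEMS pointers. */
static int cil_push_gather(struct cil_lists *cil, struct log_item **out)
{
    int cpu, i, n = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        for (i = 0; i < cil->pcp[cpu].count; i++) {
            out[n] = cil->pcp[cpu].items[i];
            out[n++]->in_cil = 0;
        }
        cil->pcp[cpu].count = 0;
    }
    qsort(out, n, sizeof(out[0]), cmp_order);
    return n;
}
```

Note that `qsort()` is not a stable sort, so items sharing one order ID may come out in any relative order in this sketch.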
 
 Overall, this pushes the transaction commit bottleneck out to the
 lockless reservation grant head updates. These atomic updates don't
 start to be a limiting factor until > 1.5 million transactions/s are
 being run, at which point the accounting functions start to show up
 in profiles as the highest CPU users. Still, this series doubles
 transaction throughput without increasing CPU usage before we get
 to that cacheline contention breakdown point...
 Signed-off-by: Dave Chinner <dchinner@redhat.com>

Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: expanding delayed logging design with background material
  xfs: xlog_sync() manually adjusts grant head space
  xfs: avoid cil push lock if possible
  xfs: move CIL ordering to the logvec chain
  xfs: convert log vector chain to use list heads
  xfs: convert CIL to unordered per cpu lists
  xfs: Add order IDs to log items in CIL
  xfs: convert CIL busy extents to per-cpu
  xfs: track CIL ticket reservation in percpu structure
  xfs: implement percpu cil space used calculation
  xfs: introduce per-cpu CIL tracking structure
  xfs: rework per-iclog header CIL reservation
  xfs: lift init CIL reservation out of xc_cil_lock
  xfs: use the CIL space used counter for emptiness checks
2022-07-09 10:55:21 -07:00
..
ABI 1st set of IIO fixes for the 5.19 cycle. 2022-06-20 09:49:52 +02:00
PCI PCI/doc: Update obsolete pci_set_dma_mask() references 2022-04-21 12:10:44 -05:00
RCU Merge branch 'exp.2022.05.11a' into HEAD 2022-05-11 11:49:35 -07:00
accounting delayacct: track delays from write-protect copy 2022-06-01 15:55:25 -07:00
admin-guide ARM64: 2022-06-14 07:57:18 -07:00
arc
arm docs: arm: tcm: Fix typo in description of TCM and MMU usage 2022-06-09 12:56:33 -06:00
arm64 arm64/sme: Fix SVE/SME typo in ABI documentation 2022-06-08 18:38:31 +01:00
block
bpf bpf, docs: Fix typo "respetively" to "respectively" 2022-04-28 17:20:48 +02:00
cdrom It was a moderately busy cycle for documentation; highlights include: 2022-05-25 11:17:41 -07:00
core-api It was a moderately busy cycle for documentation; highlights include: 2022-05-25 11:17:41 -07:00
cpu-freq
crypto
dev-tools Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00
devicetree USB driver fixes for 5.19-rc4 2022-06-25 10:02:05 -07:00
doc-guide Documentation/process: use scripts/get_maintainer.pl on patches 2022-05-09 16:12:16 -06:00
driver-api A NULL pointer dereference fix for vc4, and 3 patches to improve the 2022-07-01 09:27:55 +10:00
fault-injection docs: fault-injection: fix defaults 2022-04-16 02:46:44 -06:00
fb
features Documentation/features: Update the arch support status files 2022-06-09 09:35:57 -06:00
filesystems xfs: improve CIL scalability 2022-07-09 10:55:21 -07:00
firmware-guide TTY / Serial driver changes for 5.19-rc1 2022-06-03 11:08:40 -07:00
firmware_class
fpga Documentation: fpga: dfl: add link address of feature id table 2022-05-10 16:05:27 +08:00
gpu drm/todo: Add entry for using kunit in the subsystem 2022-05-05 10:09:06 +02:00
hid
hwmon hwmon: Make chip parameter for with_info API mandatory 2022-05-22 11:32:31 -07:00
i2c docs: i2c: reference simple probes 2022-05-04 22:35:19 +02:00
ia64
iio
images docs: add SVG version of the Linux logo 2022-06-01 09:32:45 -06:00
infiniband
input documentation: Format button_dev as a pointer. 2022-06-01 09:34:28 -06:00
isdn
kbuild Documentation/llvm: Update Supported Arch table 2022-06-20 08:21:29 +09:00
kernel-hacking
leds leds: qcom-lpg: Require pattern to follow documentation 2022-05-24 22:08:10 +02:00
litmus-tests
livepatch
locking
loongarch docs/LoongArch: Fix notes rendering by using reST directives 2022-06-17 22:09:05 +08:00
m68k
maintainer
mhi
mips
misc-devices Documentation: Wire Oxford Semiconductor PCIe (Tornado) 950 2022-05-19 18:24:22 +02:00
netlabel
networking docs: networking: phy: Fix a typo 2022-06-13 23:12:44 -07:00
nios2
nvdimm
openrisc
parisc
pcmcia
peci
power Documentation: EM: Add artificial EM registration description 2022-04-13 16:26:18 +02:00
powerpc powerpc: Enable the DAWR on POWER9 DD2.3 and above 2022-05-22 15:59:53 +10:00
process scripts/check-local-export: avoid 'wait $!' for process substitution 2022-06-10 03:47:13 +09:00
riscv Documentation: riscv: Add sv48 description to VM layout 2022-06-01 20:38:34 -07:00
s390
scheduler docs/scheduler: fix unit error 2022-04-16 02:54:32 -06:00
scsi
security integrity-v5.19 2022-05-24 13:50:39 -07:00
sh
sound ALSA: usb-audio: Add quirk bits for enabling/disabling generic implicit fb 2022-04-21 10:17:17 +02:00
sparc
sphinx docs: pdfdocs: Add space for chapter counts >= 100 in TOC 2022-05-17 13:41:26 -06:00
sphinx-static
spi
staging
target
timers
tools Updates to Real Time Linux Analysis tool for 5.19: 2022-05-29 10:48:58 -07:00
trace tracing/timerlat: Print stacktrace in the IRQ handler if needed 2022-05-26 21:13:00 -04:00
translations docs/zh_CN/LoongArch: Fix notes rendering by using reST directives 2022-06-17 22:09:05 +08:00
usb docs: usb: fix literal block marker in usbmon verification example 2022-06-09 09:50:03 -06:00
userspace-api media: lirc: add missing exceptions for lirc uapi header file 2022-05-26 14:30:17 -07:00
virt S390: 2022-05-26 14:20:14 -07:00
vm mm/memory-failure: disable unpoison once hw error happens 2022-06-16 19:11:32 -07:00
w1
watchdog
x86 It was a moderately busy cycle for documentation; highlights include: 2022-05-25 11:17:41 -07:00
xtensa
.gitignore
Changes
CodingStyle
Kconfig
Makefile
SubmittingPatches
arch.rst Documentation: LoongArch: Add basic documentations 2022-06-03 20:09:27 +08:00
asm-annotations.rst
atomic_bitops.txt
atomic_t.txt
conf.py docs/conf.py: Cope with removal of language=None in Sphinx 5.0.0 2022-06-01 09:26:05 -06:00
docutils.conf
dontdiff randstruct: Move seed generation into scripts/basic/ 2022-05-08 01:33:07 -07:00
index.rst docs: Move the HTE documentation to driver-api/ 2022-06-09 10:02:47 -06:00
memory-barriers.txt