OpenCloudOS-Kernel

Darrick J. Wong dd81dc0559 xfs: improve CIL scalability

This series aims to improve the scalability of XFS transaction commits on large CPU count machines. My 32p machine hits contention limits in xlog_cil_commit() at about 700,000 transaction commits a second. It hits this at 16 thread workloads, and 32 thread workloads go no faster and just burn CPU on the CIL spinlocks.

This patchset gets rid of spinlocks and global serialisation points in the xlog_cil_commit() path. It does this by moving to a combination of per-cpu counters, unordered per-cpu lists and post-ordered per-cpu lists.

This results in transaction commit rates exceeding 1.4 million commits/s under certain unlink workloads, and while the log lock contention is largely gone there is still significant lock contention in the VFS (dentry cache, inode cache and security layers) at >600,000 transactions/s that still limits scalability.

The changes to the CIL accounting and behaviour, combined with the structural changes to xlog_write() in prior patchsets, make the per-cpu restructuring possible and sane. This allows us to move to precalculated reservation requirements that allow for reservation stealing to be accounted across multiple CPUs accurately.

That is, instead of trying to account for continuation log opheaders on a "growth" basis, we pre-calculate how many iclogs we'll need to write out a maximally sized CIL checkpoint and steal that reserved space one commit at a time until the CIL has a full reservation. If we ever run a commit when we are already at the hard limit (because of post-throttling) we simply take an extra reservation from each commit that is run when over the limit. Hence we don't need to do space usage math in the fast path and so never need to sum the per-cpu counters in this fast path.

Similarly, per-cpu lists have the problem of ordering - we can't remove an item from a per-cpu list if we want to move it forward in the CIL. We solve this problem by using an atomic counter to give every commit a sequence number that is copied into the log items in that transaction. Hence relogging items just overwrites the sequence number in the log item, and does not move it in the per-cpu lists. Once we reaggregate the per-cpu lists back into a single list in the CIL push work, we can run it through list_sort() and reorder it back into a globally ordered list. This costs a bit of CPU time, but now that the CIL can run multiple works and pipelines properly, this is not a limiting factor for performance. It does increase fsync latency when the CIL is full, but workloads issuing large numbers of fsync()s or sync transactions end up with very small CILs and so the latency impact of sorting is not measurable for such workloads.

Overall, this pushes the transaction commit bottleneck out to the lockless reservation grant head updates. These atomic updates don't start to be a limiting factor until > 1.5 million transactions/s are being run, at which point the accounting functions start to show up in profiles as the highest CPU users. Still, this series doubles transaction throughput without increasing CPU usage before we get to that cacheline contention breakdown point...

Signed-off-by: Dave Chinner <dchinner@redhat.com>

Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: expanding delayed logging design with background material
  xfs: xlog_sync() manually adjusts grant head space
  xfs: avoid cil push lock if possible
  xfs: move CIL ordering to the logvec chain
  xfs: convert log vector chain to use list heads
  xfs: convert CIL to unordered per cpu lists
  xfs: Add order IDs to log items in CIL
  xfs: convert CIL busy extents to per-cpu
  xfs: track CIL ticket reservation in percpu structure
  xfs: implement percpu cil space used calculation
  xfs: introduce per-cpu CIL tracking structure
  xfs: rework per-iclog header CIL reservation
  xfs: lift init CIL reservation out of xc_cil_lock
  xfs: use the CIL space used counter for emptiness checks

2022-07-09 10:55:21 -07:00
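
The ordering scheme described in the commit message above (a global commit sequence number copied into each log item, unordered per-cpu lists, and a sort at push time) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: `ToyCIL`, `LogItem`, `order_id` and `on_list` are hypothetical names, `itertools.count` stands in for the atomic counter, and Python's built-in stable sort stands in for the kernel's list_sort().

```python
import itertools
from dataclasses import dataclass


@dataclass
class LogItem:
    """Hypothetical stand-in for a log item tracked by the CIL."""
    name: str
    order_id: int = 0      # sequence number of the last commit that logged it
    on_list: bool = False  # already linked into some per-cpu list?


class ToyCIL:
    """Toy single-threaded model of a CIL built from unordered per-cpu lists."""

    def __init__(self, ncpus):
        # Stand-in for the atomic commit sequence counter.
        self._seq = itertools.count(1)
        self.percpu = [[] for _ in range(ncpus)]

    def commit(self, cpu, items):
        """Stamp every item with this commit's sequence number."""
        seq = next(self._seq)
        for item in items:
            # Relogging only overwrites the order id; the item is never
            # moved between or within the per-cpu lists.
            item.order_id = seq
            if not item.on_list:
                self.percpu[cpu].append(item)
                item.on_list = True
        return seq

    def push(self):
        """Reaggregate the per-cpu lists and restore global commit order."""
        merged = []
        for cpu_list in self.percpu:
            merged.extend(cpu_list)
            cpu_list.clear()
        merged.sort(key=lambda item: item.order_id)  # stable sort
        for item in merged:
            item.on_list = False
        return merged


cil = ToyCIL(ncpus=2)
a, b, c = LogItem("a"), LogItem("b"), LogItem("c")
cil.commit(0, [a])       # sequence 1
cil.commit(1, [b])       # sequence 2
cil.commit(1, [a, c])    # sequence 3: relogs "a", which stays on CPU 0's list
print([item.name for item in cil.push()])  # prints ['b', 'a', 'c']
```

In this model the commit path touches only the list of the CPU it runs on plus one shared counter, and all ordering cost is deferred to `push()`, which mirrors the trade-off the commit message describes: a little extra CPU time at push in exchange for no global list manipulation in the commit fast path.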
..
ABI	1st set of IIO fixes for the 5.19 cycle.	2022-06-20 09:49:52 +02:00
PCI	PCI/doc: Update obsolete pci_set_dma_mask() references	2022-04-21 12:10:44 -05:00
RCU	Merge branch 'exp.2022.05.11a' into HEAD	2022-05-11 11:49:35 -07:00
accounting	delayacct: track delays from write-protect copy	2022-06-01 15:55:25 -07:00
admin-guide	ARM64:	2022-06-14 07:57:18 -07:00
arc	…
arm	docs: arm: tcm: Fix typo in description of TCM and MMU usage	2022-06-09 12:56:33 -06:00
arm64	arm64/sme: Fix SVE/SME typo in ABI documentation	2022-06-08 18:38:31 +01:00
block	…
bpf	bpf, docs: Fix typo "respetively" to "respectively"	2022-04-28 17:20:48 +02:00
cdrom	It was a moderately busy cycle for documentation; highlights include:	2022-05-25 11:17:41 -07:00
core-api	It was a moderately busy cycle for documentation; highlights include:	2022-05-25 11:17:41 -07:00
cpu-freq	…
crypto	…
dev-tools	Yang Shi has improved the behaviour of khugepaged collapsing of readonly	2022-05-26 12:32:41 -07:00
devicetree	USB driver fixes for 5.19-rc4	2022-06-25 10:02:05 -07:00
doc-guide	Documentation/process: use scripts/get_maintainer.pl on patches	2022-05-09 16:12:16 -06:00
driver-api	A NULL pointer dereference fix for vc4, and 3 patches to improve the	2022-07-01 09:27:55 +10:00
fault-injection	docs: fault-injection: fix defaults	2022-04-16 02:46:44 -06:00
fb	…
features	Documentation/features: Update the arch support status files	2022-06-09 09:35:57 -06:00
filesystems	xfs: improve CIL scalability	2022-07-09 10:55:21 -07:00
firmware-guide	TTY / Serial driver changes for 5.19-rc1	2022-06-03 11:08:40 -07:00
firmware_class	…
fpga	Documentation: fpga: dfl: add link address of feature id table	2022-05-10 16:05:27 +08:00
gpu	drm/todo: Add entry for using kunit in the subsystem	2022-05-05 10:09:06 +02:00
hid	…
hwmon	hwmon: Make chip parameter for with_info API mandatory	2022-05-22 11:32:31 -07:00
i2c	docs: i2c: reference simple probes	2022-05-04 22:35:19 +02:00
ia64	…
iio	…
images	docs: add SVG version of the Linux logo	2022-06-01 09:32:45 -06:00
infiniband	…
input	documentation: Format button_dev as a pointer.	2022-06-01 09:34:28 -06:00
isdn	…
kbuild	Documentation/llvm: Update Supported Arch table	2022-06-20 08:21:29 +09:00
kernel-hacking	…
leds	leds: qcom-lpg: Require pattern to follow documentation	2022-05-24 22:08:10 +02:00
litmus-tests	…
livepatch	…
locking	…
loongarch	docs/LoongArch: Fix notes rendering by using reST directives	2022-06-17 22:09:05 +08:00
m68k	…
maintainer	…
mhi	…
mips	…
misc-devices	Documentation: Wire Oxford Semiconductor PCIe (Tornado) 950	2022-05-19 18:24:22 +02:00
netlabel	…
networking	docs: networking: phy: Fix a typo	2022-06-13 23:12:44 -07:00
nios2	…
nvdimm	…
openrisc	…
parisc	…
pcmcia	…
peci	…
power	…
powerpc	powerpc: Enable the DAWR on POWER9 DD2.3 and above	2022-05-22 15:59:53 +10:00
process	scripts/check-local-export: avoid 'wait $!' for process substitution	2022-06-10 03:47:13 +09:00
riscv	Documentation: riscv: Add sv48 description to VM layout	2022-06-01 20:38:34 -07:00
s390	…
scheduler	docs/scheduler: fix unit error	2022-04-16 02:54:32 -06:00
scsi	…
security	integrity-v5.19	2022-05-24 13:50:39 -07:00
sh	…
sound	ALSA: usb-audio: Add quirk bits for enabling/disabling generic implicit fb	2022-04-21 10:17:17 +02:00
sparc	…
sphinx	docs: pdfdocs: Add space for chapter counts >= 100 in TOC	2022-05-17 13:41:26 -06:00
sphinx-static	…
spi	…
staging	…
target	…
timers	…
tools	Updates to Real Time Linux Analysis tool for 5.19:	2022-05-29 10:48:58 -07:00
trace	tracing/timerlat: Print stacktrace in the IRQ handler if needed	2022-05-26 21:13:00 -04:00
translations	docs/zh_CN/LoongArch: Fix notes rendering by using reST directives	2022-06-17 22:09:05 +08:00
usb	docs: usb: fix literal block marker in usbmon verification example	2022-06-09 09:50:03 -06:00
userspace-api	media: lirc: add missing exceptions for lirc uapi header file	2022-05-26 14:30:17 -07:00
virt	S390:	2022-05-26 14:20:14 -07:00
vm	mm/memory-failure: disable unpoison once hw error happens	2022-06-16 19:11:32 -07:00
w1	…
watchdog	…
x86	It was a moderately busy cycle for documentation; highlights include:	2022-05-25 11:17:41 -07:00
xtensa	…
.gitignore	…
Changes	…
CodingStyle	…
Kconfig	…
Makefile	…
SubmittingPatches	…
arch.rst	Documentation: LoongArch: Add basic documentations	2022-06-03 20:09:27 +08:00
asm-annotations.rst	…
atomic_bitops.txt	…
atomic_t.txt	…
conf.py	docs/conf.py: Cope with removal of language=None in Sphinx 5.0.0	2022-06-01 09:26:05 -06:00
docutils.conf	…
dontdiff	randstruct: Move seed generation into scripts/basic/	2022-05-08 01:33:07 -07:00
index.rst	docs: Move the HTE documentation to driver-api/	2022-06-09 10:02:47 -06:00
memory-barriers.txt	…