OpenCloudOS-Kernel

Barry Song 43b3dfdd04 arm64: support batched/deferred tlb shootdown during page reclamation/migration	2023-08-18 10:12:37 -07:00

On x86, batched and deferred tlb shootdown has led to a 90% performance increase on tlb shootdown. On arm64, HW can do tlb shootdown without software IPI, but sync tlbi is still quite expensive.

Even the simplest program that requires swapout shows this:

    #include <sys/types.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <string.h>

    int main()
    {
    #define SIZE (1 * 1024 * 1024)
            volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);

            memset(p, 0x88, SIZE);

            for (int k = 0; k < 10000; k++) {
                    /* swap in */
                    for (int i = 0; i < SIZE; i += 4096) {
                            (void)p[i];
                    }

                    /* swap out */
                    madvise(p, SIZE, MADV_PAGEOUT);
            }
    }

Perf result on snapdragon 888 with 8 cores, using zRAM as the swap block device:

    ~ # perf record taskset -c 4 ./a.out
    [ perf record: Woken up 10 times to write data ]
    [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]

    ~ # perf report
    # To display the perf.data header info, please use --header/--header-only options.
    #
    #
    # Total Lost Samples: 0
    #
    # Samples: 60K of event 'cycles'
    # Event count (approx.): 35706225414
    #
    # Overhead  Command  Shared Object      Symbol
    # ........  .......  .................  ......
    #
        21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
         8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
         6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
         6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
         5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
         3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
         3.49%  a.out    [kernel.kallsyms]  [k] memset64
         1.63%  a.out    [kernel.kallsyms]  [k] clear_page
         1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
         1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
         1.23%  a.out    [kernel.kallsyms]  [k] xas_load
         1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark, which swaps in/out a page mapped by only one process. If the page is mapped by multiple processes (typically more than 100 on a phone), the overhead would be much higher, as we have to run tlb flush 100 times for one single page. Plus, tlb flush overhead will increase with the number of CPU cores due to the bad scalability of tlb shootdown in HW, so ARM64 servers should expect much higher overhead.

Further perf annotate shows that 95% of the cpu time of ptep_clear_flush is actually spent in the final dsb(), waiting for the completion of the tlb flush. This gives us a very good chance to leverage the existing batched tlb in the kernel. The minimum modification is that we only send async tlbi in the first stage, and we send dsb when we have to sync in the second stage.

With the above micro-benchmark, the elapsed time to finish the program decreases by around 5%.

Typical elapsed time w/o patch:

    ~ # time taskset -c 4 ./a.out
    0.21user 14.34system 0:14.69elapsed

w/ patch:

    ~ # time taskset -c 4 ./a.out
    0.22user 13.45system 0:13.80elapsed

Also tested with the benchmark in the commit on a Kunpeng920 arm64 server, and observed an improvement of around 12.5% with the command `time ./swap_bench`.

            w/o             w/
    real    0m13.460s       0m11.771s
    user    0m0.248s        0m0.279s
    sys     0m12.039s       0m11.458s

Originally a 16.99% overhead of ptep_clear_flush() was noticed, which has been eliminated by this patch:

    [root@localhost yang]# perf record -- ./swap_bench && perf report
    [...]
    16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It was tested on 4-, 8-, and 128-CPU platforms and shown to be beneficial on large systems, but it may bring no improvement on small systems such as a 4-CPU platform.

This patch also improves the performance of page migration. Using pmbench to migrate the pages of pmbench between node 0 and node 1 100 times with 1G of memory, this patch decreases the time used by around 20% (before: 18.338318910 sec, after: 13.981866350 sec) and saves the time used by ptep_clear_flush().

Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
..
ABI	HWPOISON: offline support: fix spelling in Documentation/ABI/	2023-08-18 10:12:18 -07:00
PCI	Merge branch 'pci/controller/endpoint'	2023-06-26 13:00:00 -05:00
RCU	rcu: Remove RCU_NONIDLE()	2023-05-11 13:42:04 -07:00
accel	…
accounting	…
admin-guide	mm/memory_hotplug: document the signal_pending() check in offline_pages()	2023-08-18 10:12:19 -07:00
arch	- Work around an erratum on GIC700, where a race between a CPU	2023-07-30 10:59:19 -07:00
block	Documentation/block: drop the request.rst file	2023-05-12 11:04:58 -06:00
bpf	sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)	2023-06-24 15:50:13 -07:00
cdrom	Documentation: use capitalization for chapters and acronyms	2023-05-16 12:49:31 -06:00
core-api	workqueue: Changes for v6.5	2023-06-27 16:32:52 -07:00
cpu-freq	…
crypto	docs: crypto: async-tx-api: fix typo in struct name	2023-06-09 01:59:30 -06:00
dev-tools	- Yosry Ahmed brought back some cgroup v1 stats in OOM logs.	2023-06-28 10:28:11 -07:00
devicetree	Devicetree fixes for v6.5:	2023-07-22 10:28:22 -07:00
doc-guide	docs/doc-guide: Clarify how to write tables	2023-06-09 01:57:56 -06:00
driver-api	Fixes for pci_clean_master, error handling in driver inits, and various	2023-07-09 09:35:51 -07:00
fault-injection	lkdtm: replace ll_rw_block with submit_bh	2023-05-31 20:26:57 +01:00
fb	…
features	arm64: support batched/deferred tlb shootdown during page reclamation/migration	2023-08-18 10:12:37 -07:00
filesystems	tmpfs: fix Documentation of noswap and huge mount options	2023-07-27 13:07:03 -07:00
firmware-guide	…
firmware_class	…
fpga	Documentation: use capitalization for chapters and acronyms	2023-05-16 12:49:31 -06:00
gpu	Merge tag 'amd-drm-next-6.5-2023-06-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next	2023-06-15 14:11:22 +10:00
hid	…
hwmon	hwmon: (oxp-sensors) Add support for AOKZOE A1 PRO	2023-06-24 20:17:18 -07:00
i2c	i2c: i801: Add support for Intel Meteor Lake PCH-S	2023-06-05 10:13:48 +02:00
iio	…
images	…
infiniband	…
input	Input: xpad - spelling fixes for "Xbox"	2023-05-22 17:28:16 -07:00
isdn	…
kbuild	kernel-doc: don't let V=1 change outcome	2023-06-10 16:39:02 +09:00
kernel-hacking	…
leds	- New Drivers	2023-07-03 11:26:05 -07:00
litmus-tests	…
livepatch	…
locking	Documentation: use capitalization for chapters and acronyms	2023-05-16 12:49:31 -06:00
loongarch	…
maintainer	Documentation: update git configuration for Link: tag	2023-06-21 09:15:15 -06:00
mhi	…
mips	…
misc-devices	Documentation: Add TI TPS6594 PFSM	2023-06-15 13:41:53 +02:00
mm	- Yosry Ahmed brought back some cgroup v1 stats in OOM logs.	2023-06-28 10:28:11 -07:00
netlabel	…
netlink	netlink: specs: add display hints to ovs_flow	2023-06-24 15:45:49 -07:00
networking	docs: net: clarify the NAPI rules around XDP Tx	2023-07-21 18:51:37 -07:00
nvdimm	…
nvme	…
pcmcia	Documentation: use capitalization for chapters and acronyms	2023-05-16 12:49:31 -06:00
peci	…
power	…
powerpc	Documentation: Document PowerPC kernel DEXCR interface	2023-06-19 17:36:27 +10:00
process	Documentation: embargoed-hardware-issues.rst: add AMD to the list	2023-07-26 09:39:34 +02:00
riscv	Documentation: RISC-V: hwprobe: Fix a formatting error	2023-07-11 10:43:51 -07:00
rust	docs: rust: point directly to the standalone installers	2023-05-31 18:52:35 +02:00
s390	s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMU	2023-05-17 15:20:18 +02:00
scheduler	sched/deadline: Update GRUB description in the documentation	2023-06-16 22:08:12 +02:00
scsi	scsi: docs: sym53c8xx_2: Shorten chapter heading	2023-05-22 18:36:07 -04:00
security	…
sound	ALSA: compress: allow setting codec params after next track	2023-06-21 07:28:31 +02:00
sphinx	…
sphinx-static	…
spi	…
staging	Documentation: use capitalization for chapters and acronyms	2023-05-16 12:49:31 -06:00
target	scsi: target: docs: Remove tcm_mod_builder.py	2023-06-28 22:01:32 -04:00
timers	Documentation: use capitalization for chapters and acronyms	2023-05-16 12:49:31 -06:00
tools	Documentation: Add tools/rtla timerlat -u option documentation	2023-06-13 16:43:37 -04:00
trace	Char/Misc and other driver subsystem updates for 6.5-rc1	2023-07-03 12:46:47 -07:00
translations	A half-dozen late arriving docs patches. They are mostly fixes, but we	2023-07-06 22:15:38 -07:00
usb	…
userspace-api	media updates for v6.5-rc1	2023-07-05 10:42:32 -07:00
virt	A half-dozen late arriving docs patches. They are mostly fixes, but we	2023-07-06 22:15:38 -07:00
w1	…
watchdog	…
wmi	platform/x86: dell-ddv: Fix mangled list in documentation	2023-07-11 12:15:30 +02:00
.gitignore	…
Changes	…
CodingStyle	…
Kconfig	…
Makefile	…
SubmittingPatches	…
atomic_bitops.txt	…
atomic_t.txt	…
conf.py	Documentation: conf.py: Add __force to c_id_attributes	2023-05-19 08:58:10 -06:00
docutils.conf	…
dontdiff	…
index.rst	…
memory-barriers.txt	…
subsystem-apis.rst	platform-drivers-x86 for v6.5-1	2023-06-30 14:50:00 -07:00