OpenCloudOS-Kernel/Documentation
Barry Song 43b3dfdd04 arm64: support batched/deferred tlb shootdown during page reclamation/migration
On x86, batched and deferred tlb shootdown has lead to 90% performance
increase on tlb shootdown.  on arm64, HW can do tlb shootdown without
software IPI.  But sync tlbi is still quite expensive.

Even running a simplest program which requires swapout can
prove this is true,
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ......
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping in/out
a page mapped by only one process.  If the page is mapped by multiple
processes, typically, like more than 100 on a phone, the overhead would be
much higher as we have to run tlb flush 100 times for one single page. 
Plus, tlb flush overhead will increase with the number of CPU cores due to
the bad scalability of tlb shootdown in HW, so those ARM64 servers should
expect much higher overhead.

Further perf annonate shows 95% cpu time of ptep_clear_flush is actually
used by the final dsb() to wait for the completion of tlb flush.  This
provides us a very good chance to leverage the existing batched tlb in
kernel.  The minimum modification is that we only send async tlbi in the
first stage and we send dsb while we have to sync in the second stage.

With the above simplest micro benchmark, collapsed time to finish the
program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also tested with benchmark in the commit on Kunpeng920 arm64 server
and observed an improvement around 12.5% with command
`time ./swap_bench`.
        w/o             w/
real    0m13.460s       0m11.771s
user    0m0.248s        0m0.279s
sys     0m12.039s       0m11.458s

Originally it's noticed a 16.99% overhead of ptep_clear_flush()
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It is tested on 4,8,128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on
a 4 CPU platform.

Also this patch improve the performance of page migration. Using pmbench
and tries to migrate the pages of pmbench between node 0 and node 1 for
100 times for 1G memory, this patch decrease the time used around 20%
(prev 18.338318910 sec after 13.981866350 sec) and saved the time used
by ptep_clear_flush().

Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:37 -07:00
..
ABI HWPOISON: offline support: fix spelling in Documentation/ABI/ 2023-08-18 10:12:18 -07:00
PCI Merge branch 'pci/controller/endpoint' 2023-06-26 13:00:00 -05:00
RCU rcu: Remove RCU_NONIDLE() 2023-05-11 13:42:04 -07:00
accel
accounting
admin-guide mm/memory_hotplug: document the signal_pending() check in offline_pages() 2023-08-18 10:12:19 -07:00
arch - Work around an erratum on GIC700, where a race between a CPU 2023-07-30 10:59:19 -07:00
block Documentation/block: drop the request.rst file 2023-05-12 11:04:58 -06:00
bpf sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES) 2023-06-24 15:50:13 -07:00
cdrom Documentation: use capitalization for chapters and acronyms 2023-05-16 12:49:31 -06:00
core-api workqueue: Changes for v6.5 2023-06-27 16:32:52 -07:00
cpu-freq
crypto docs: crypto: async-tx-api: fix typo in struct name 2023-06-09 01:59:30 -06:00
dev-tools - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
devicetree Devicetree fixes for v6.5: 2023-07-22 10:28:22 -07:00
doc-guide docs/doc-guide: Clarify how to write tables 2023-06-09 01:57:56 -06:00
driver-api Fixes for pci_clean_master, error handling in driver inits, and various 2023-07-09 09:35:51 -07:00
fault-injection lkdtm: replace ll_rw_block with submit_bh 2023-05-31 20:26:57 +01:00
fb
features arm64: support batched/deferred tlb shootdown during page reclamation/migration 2023-08-18 10:12:37 -07:00
filesystems tmpfs: fix Documentation of noswap and huge mount options 2023-07-27 13:07:03 -07:00
firmware-guide
firmware_class
fpga Documentation: use capitalization for chapters and acronyms 2023-05-16 12:49:31 -06:00
gpu Merge tag 'amd-drm-next-6.5-2023-06-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next 2023-06-15 14:11:22 +10:00
hid
hwmon hwmon: (oxp-sensors) Add support for AOKZOE A1 PRO 2023-06-24 20:17:18 -07:00
i2c i2c: i801: Add support for Intel Meteor Lake PCH-S 2023-06-05 10:13:48 +02:00
iio
images
infiniband
input Input: xpad - spelling fixes for "Xbox" 2023-05-22 17:28:16 -07:00
isdn
kbuild kernel-doc: don't let V=1 change outcome 2023-06-10 16:39:02 +09:00
kernel-hacking
leds - New Drivers 2023-07-03 11:26:05 -07:00
litmus-tests
livepatch
locking Documentation: use capitalization for chapters and acronyms 2023-05-16 12:49:31 -06:00
loongarch
maintainer Documentation: update git configuration for Link: tag 2023-06-21 09:15:15 -06:00
mhi
mips
misc-devices Documentation: Add TI TPS6594 PFSM 2023-06-15 13:41:53 +02:00
mm - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
netlabel
netlink netlink: specs: add display hints to ovs_flow 2023-06-24 15:45:49 -07:00
networking docs: net: clarify the NAPI rules around XDP Tx 2023-07-21 18:51:37 -07:00
nvdimm
nvme
pcmcia Documentation: use capitalization for chapters and acronyms 2023-05-16 12:49:31 -06:00
peci
power
powerpc Documentation: Document PowerPC kernel DEXCR interface 2023-06-19 17:36:27 +10:00
process Documentation: embargoed-hardware-issues.rst: add AMD to the list 2023-07-26 09:39:34 +02:00
riscv Documentation: RISC-V: hwprobe: Fix a formatting error 2023-07-11 10:43:51 -07:00
rust docs: rust: point directly to the standalone installers 2023-05-31 18:52:35 +02:00
s390 s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMU 2023-05-17 15:20:18 +02:00
scheduler sched/deadline: Update GRUB description in the documentation 2023-06-16 22:08:12 +02:00
scsi scsi: docs: sym53c8xx_2: Shorten chapter heading 2023-05-22 18:36:07 -04:00
security
sound ALSA: compress: allow setting codec params after next track 2023-06-21 07:28:31 +02:00
sphinx
sphinx-static
spi
staging Documentation: use capitalization for chapters and acronyms 2023-05-16 12:49:31 -06:00
target scsi: target: docs: Remove tcm_mod_builder.py 2023-06-28 22:01:32 -04:00
timers Documentation: use capitalization for chapters and acronyms 2023-05-16 12:49:31 -06:00
tools Documentation: Add tools/rtla timerlat -u option documentation 2023-06-13 16:43:37 -04:00
trace Char/Misc and other driver subsystem updates for 6.5-rc1 2023-07-03 12:46:47 -07:00
translations A half-dozen late arriving docs patches. They are mostly fixes, but we 2023-07-06 22:15:38 -07:00
usb
userspace-api media updates for v6.5-rc1 2023-07-05 10:42:32 -07:00
virt A half-dozen late arriving docs patches. They are mostly fixes, but we 2023-07-06 22:15:38 -07:00
w1
watchdog
wmi platform/x86: dell-ddv: Fix mangled list in documentation 2023-07-11 12:15:30 +02:00
.gitignore
Changes
CodingStyle
Kconfig
Makefile
SubmittingPatches
atomic_bitops.txt
atomic_t.txt
conf.py Documentation: conf.py: Add __force to c_id_attributes 2023-05-19 08:58:10 -06:00
docutils.conf
dontdiff
index.rst
memory-barriers.txt
subsystem-apis.rst platform-drivers-x86 for v6.5-1 2023-06-30 14:50:00 -07:00