OpenCloudOS-Kernel

Kairui Song 4564eafa9e emm: workingset: simplify and use a more intuitive model		2024-04-03 16:58:55 +08:00

    Upstream: pending

    This basically removes workingset_activation and reduces calls to
    workingset_age_nonresident.

    The idea behind this change is a new way to calculate the refault
    distance, in preparation for adapting refault-distance-based file page
    protection to multi-gen LRU.

    Currently, refault distance re-activation for active/inactive LRU helps
    keep working set pages in memory. It works by estimating the refault
    (re-access) distance of a page; if the distance is small enough, the
    page is put on the active LRU instead of the inactive LRU.

    The estimation, as described in mm/workingset.c, is based on two
    assumptions:

    1. Activation of an inactive page will left-shift LRU pages
       (considering the LRU starts from the right).
    2. Eviction of an inactive page will left-shift LRU pages.

    Assumption 2 is correct, but assumption 1 is not always true. An
    activated page could be anywhere in the LRU list (through
    mark_page_accessed), so it only left-shifts the pages on its right
    side. Besides, one page can get activated and deactivated multiple
    times.

    Multi-gen LRU also doesn't fit this model well: pages are aged in
    generations and are promoted frequently between generations.

    So instead we introduce a simpler idea here: just presume the evicted
    pages are still in memory, each with a corresponding eviction timestamp
    (nonresident_age) that is increased and recorded upon each eviction.
    These timestamps logically form a "Shadow LRU", a read-only imaginary
    LRU.

    Let `nonresident_age` still be NA, then we have:

      Let SP = ((NA's reading @ current) - (NA's reading @ eviction))

                        +-memory available to cache-+
                        |                           |
     +-------------------------+===============+===========+
     | *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
     +-+-----------------------+===============+===========+
       |                       |
       +-----------------------+
       |         SP
     fault page          O -> Hole left by refaulted-in pages.
                              Entries are supposed to be removed
                              upon access, but this is not a real
                              LRU so it can't really be updated.
                         * -> The page corresponding to SP

    It can easily be seen that SP stands for the offset of a page in the
    imaginary LRU, which is also how far the current workload could push a
    page out of available memory. Since every evicted page was once at the
    head of the INACTIVE list, the estimated minimum value of the refault
    distance is:

      SP + NR_INACTIVE

    On refault, the page may stay in memory if we put it on the active LRU
    when:

      SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE

    Which can be simplified to:

      SP < NR_ACTIVE

    Then the page is worth re-activating so it starts from the active LRU,
    since the access distance is smaller than the total memory.

    This is only an estimation based on several hypotheses, and it could
    break the ability of the LRU to distinguish a workingset from caches.
    In extreme cases, having every refault cause an activation will lead
    to worse thrashing, so throttle this by two factors:

    1. Previously refaulted-in pages may leave "holes" in the shadow part
       of the LRU. That part is left unhandled on purpose to decrease the
       re-activation rate for pages that have a large SP value (the larger
       the SP value of a page, the more likely it is to be affected by
       such holes).
    2. When the active LRU is long enough, challenging active pages by
       re-activating a previously evicted/inactive page that was accessed
       only once may not be a good idea, so throttle the re-activation
       when NR_ACTIVE > NR_INACTIVE by comparing with NR_INACTIVE instead.

    Another effect of the refault activation throttling worth noticing is
    that, when the cache size is larger than total memory and hotness is
    similar among all cache pages, it can help hold a portion of the caches
    (possibly with slightly higher hotness) in memory instead of letting
    caches get evicted cyclically due to the nature of LRU. That's because
    the established workingset (active LRU) will tend to stay, since we
    throttle re-activation when NR_ACTIVE is high.

    This side effect is actually similar to the algorithm before, which
    introduced such an effect by increasing nonresident_age in extra call
    paths, throttling re-activation when activation/re-activation was
    happening massively.

    Combining all of the above, we have the following simple rules:

    Upon refault, if any of the following conditions is met, mark the page
    as active:

    - If the active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
      SP < NR_ACTIVE
    - If the active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
      SP < NR_INACTIVE

    Code-wise, this is simpler than before since there is no longer a need
    to update lruvec workingset data when activating a page, and so far a
    few benchmarks show a similar or better result under memory pressure.

    The performance should also be better when there is no memory pressure,
    since some memcg iteration and atomic operations are no longer needed.

    When combined with multi-gen LRU (in later commits) it shows a
    measurable performance gain for some workloads.

    Using the memtier and fio tests from commit `ac35a49023`, scaled down
    to fit in my test environment, and some other test results:

    memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
      memcached -u nobody -m 16384 -s /tmp/memcached.socket \
        -a 0766 -t 12 -B binary &
      memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
        --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
        -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

    fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
      fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
        --buffered=1 --ioengine=io_uring --iodepth=128 \
        --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
        --rw=randread --random_distribution=random --norandommap \
        --time_based --ramp_time=5m --runtime=5m --group_reporting

    fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
      fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
        --buffered=1 --ioengine=io_uring --iodepth=128 \
        --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
        --rw=randread --random_distribution=zipf:1.2 --norandommap \
        --time_based --ramp_time=10m --runtime=5m --group_reporting

    mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a
    10G memcg):
      sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
        --tables=36 --table-size=2000000 --threads=12 --time=1800

    The kernel build test was done with a 3G memcg limit on an i7-9700.

    Before (average of 6 test runs):
      fio: IOPS=5125.5k
      fio2: IOPS=7291.16k
      memcached: 57600.926 ops/s
      mysql: 6280.08 tps
      kernel-build: 1817.13499 seconds

    After (average of 6 test runs):
      fio: IOPS=5137.5k (+0.23%)
      fio2: IOPS=7300.67k (+0.13%)
      memcached: 57878.422 ops/s (+0.48%)
      mysql: 6312.06 tps (+0.5%)
      kernel-build: 1813.66231 seconds (+0.2%)

    Signed-off-by: Kairui Song <kasong@tencent.com>
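The refault rules in the commit message above can be sketched in a few lines of C. This is only an illustration of the arithmetic, not the code in mm/workingset.c: the function names `shadow_position` and `should_activate_on_refault` and their parameters are hypothetical, and the sketch ignores memcg handling, shadow entry packing and locking. Checking the two cases separately is equivalent to comparing SP against the smaller of NR_ACTIVE and NR_INACTIVE.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): SP is the difference between
 * the current reading of the eviction counter NA and the reading that
 * was recorded when the page was evicted. Unsigned subtraction keeps
 * the distance correct even if the counter has wrapped around.
 */
static inline uint64_t shadow_position(uint64_t na_now, uint64_t na_evict)
{
	return na_now - na_evict;
}

/*
 * Hypothetical helper (not a kernel API): decide whether a refaulting
 * page should start on the active LRU.
 *
 *   NR_ACTIVE <  NR_INACTIVE: activate if SP < NR_ACTIVE
 *   NR_ACTIVE >= NR_INACTIVE: activate if SP < NR_INACTIVE
 *
 * Both branches reduce to SP < min(NR_ACTIVE, NR_INACTIVE).
 */
static inline bool should_activate_on_refault(uint64_t sp,
					      uint64_t nr_active,
					      uint64_t nr_inactive)
{
	uint64_t limit = nr_active < nr_inactive ? nr_active : nr_inactive;

	return sp < limit;
}
```

For example, with 800 active and 2000 inactive pages, a page that refaults with SP = 500 would be activated, while one with SP = 900 would go to the inactive LRU.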
Documentation	docs/perf: Add ampere_cspmu to toctree to fix a build warning	2024-03-14 17:15:27 +08:00
LICENSES	LICENSES: Add the copyleft-next-0.3.1 license	2022-11-08 15:44:01 +01:00
arch	dist: config: update config	2024-04-03 16:58:47 +08:00
block	Merge branch 'linux-6.6.y'	2024-03-04 10:25:33 +08:00
certs	certs: Reference revocation list for all keyrings	2023-08-17 20:12:41 +00:00
crypto	crypto: algif_hash - Remove bogus SGL free on zero-length error path	2024-02-23 09:25:11 +01:00
dist	dist: disable non kernel pkg on non default config	2024-04-03 16:58:46 +08:00
drivers	emm: block: introduce basic flags for ramdisk swap optimization	2024-04-03 16:58:52 +08:00
fs	Merge branch 'linux-6.6.y'	2024-03-04 16:09:50 +08:00
include	emm: workingset: simplify and use a more intuitive model	2024-04-03 16:58:55 +08:00
init	emm: memcg, zram: add support for ZRAM memory accounting	2024-04-03 16:58:50 +08:00
io_uring	io_uring/net: fix multishot accept overflow handling	2024-02-23 09:25:10 +01:00
ipc	Add x86 shadow stack support	2023-08-31 12:20:12 -07:00
kernel	emm: allow modules to access more mm data	2024-04-03 16:58:54 +08:00
lib	lib/Kconfig.debug: TEST_IOV_ITER depends on MMU	2024-03-01 13:34:59 +01:00
mm	emm: workingset: simplify and use a more intuitive model	2024-04-03 16:58:55 +08:00
net	eks: net/toa: add ali_cip support	2024-04-03 15:36:21 +08:00
rust	rust: upgrade to Rust 1.73.0	2024-02-16 19:10:43 +01:00
samples	work around gcc bugs with 'asm goto' with outputs	2024-02-23 09:24:47 +01:00
scripts	checkpatch: add Signed-off-by check if commit cherry-pick from upstream	2024-03-27 21:59:03 +08:00
security	Merge remote-tracking branch 'stable/linux-6.6.y' into ocks-2401	2024-03-01 17:21:23 +08:00
sound	Merge branch 'linux-6.6.y'	2024-03-04 10:25:33 +08:00
tools	selftests/bpf: Test pinning bpf timer to a core	2024-03-27 18:09:18 +08:00
usr	initramfs: Encode dependency on KBUILD_BUILD_TIMESTAMP	2023-06-06 17:54:49 +09:00
virt	ARM:	2023-09-07 13:52:20 -07:00
.clang-format	iommu: Add for_each_group_device()	2023-05-23 08:15:51 +02:00
.cocciconfig	…
.get_maintainer.ignore	get_maintainer: add Alan to .get_maintainer.ignore	2022-08-20 15:17:44 -07:00
.gitattributes	dist: initial support	2023-12-12 15:56:34 +08:00
.gitignore	dist: initial support	2023-12-12 15:56:34 +08:00
.mailmap	20 hotfixes. 12 are cc:stable and the remainder address post-6.5 issues	2023-10-24 09:52:16 -10:00
.rustfmt.toml	rust: add `.rustfmt.toml`	2022-09-28 09:02:20 +02:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	USB: Remove Wireless USB and UWB documentation	2023-08-09 14:17:32 +02:00
Kbuild	Kbuild updates for v6.1	2022-10-10 12:00:45 -07:00
Kconfig	tkernel: netatop: add netatop module in kernel/tkernel/	2023-12-12 15:56:47 +08:00
MAINTAINERS	MAINTAINERS: add Catherine as xfs maintainer for 6.6.y	2024-02-16 19:10:43 +01:00
Makefile	kabi: provide kabi check/update/create commands for local users	2024-03-26 14:10:56 +08:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00
config-readme	Makefile, dist: add "make tencentconfig" support	2024-03-04 13:25:26 +08:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.