Go to file
Kairui Song 4564eafa9e emm: workingset: simplify and use a more intuitive model
Upstream: pending

This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.

The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based file page
protection for multi-gen LRU.

Currently, refault distance re-activation for active/inactive can help
keep working set pages in memory, it works by estimating the refault
(re-access) distance of a page, if it's small enough, then put it
on active LRU instead of inactive LRU.

The estimation, as described in mm/workingset.c, is based on two assumptions:

1. Activation of an inactive page will left-shift LRU pages (considering
   LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.

Assumption 2 is correct, but assumption 1 is not always true, an activated
page could be anywhere in the LRU list (through mark_page_accessed), it
only left-shift the pages on its right side.

And besides, one page can get activate/deactivated for multiple times.

And multi-gen LRU doesn't fit with this model well, pages are getting
aged in generations, and getting promoted frequently between generations.

So instead we introduce a simpler idea here: Just presume the evicted
pages are still in memory, each has an corresponding eviction timestamp
(nonresistence_age) that is increased and recorded upon each eviction.
These timestamp could logically form a "Shadow LRU", a read-only
imaginary LRU. Let the `nonresistence_age` still be NA, then we have:

  Let SP = ((NA's reading @ current) - (NA's reading @ eviction))

                           +-memory available to cache-+
                           |                           |
 +-------------------------+===============+===========+
 | *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
 +-+-----------------------+===============+===========+
   |                       |
   +-----------------------+
   |         SP
 fault page         O -> Hole left by refaulted in pages.
                         Entries are suppose to be removed
                         upon access but this is not a real
                         LRU so can't really update it.
                    * -> The page corresponding to SP

It can be easily seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workflow could push
a page out of available memory. Since all evicted page was once head
of INACTIVE list, the estimated minimum value of refault distance is:

  SP + NR_INACTIVE

On refault, the page *may* get activated and stay in memory if we put
it to active LRU if:

  SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE

Which can be simplified to:

  SP < NR_ACTIVE

Then the page is worth getting re-activated to start from active LRU,
since the access distance is smaller than the total memory.

And since this is only an estimation, based on several hypotheses, and
it could break the ability of LRU to distinguish a workingset out of
caches, in extreme cases all refault causing activation will lead to
worse thrashing, so throttle this by two factors:

1. Notice previously re-faulted in pages may leave "holes" on the shadow
   part of LRU, that part is left unhandled on purpose to decrease
   re-activate rate for pages that have a large SP value (the larger
   SP value a page has, the more likely it will be affected by such
   holes).
2. When the active LRU is long enough, chanllaging active pages
   by re-activating a one-time access previously evicted/inactive page
   may not be a good idea, so throttle the re-activation when
   NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead.

Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion (possible have slightly
higher hotness) of the caches in memory instead of letting caches get
evicted permutably due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttled reactivation when NR_ACTIVE is high.

This side effect is actually similar with the algoritm before, which
introduce such effect by increasing nonresistence_age in extra call
paths, trottled the re-activation when activition/reactivation is
massively happenning.

Combined all above, we have following simple rules:

  Upon refault, if any of following conditions is met, mark page as active:

- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
  SP < NR_ACTIVE

- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
  SP < NR_INACTIVE

Code-wise, this is simpler than before since no longer need to do lruvec
workingset data update when activating a page, and so far, a few benchmarks
shows a similar or better result under memore pressure. The performance
should also be better when there is no memory pressure since some memcg
iteration and atomic operation is no longer needed.

When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.

Using memtier and fio test from commit ac35a49023 but scaled down
to fit in my test environment, and some other test results:

  memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0766 -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
    --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

  fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

  fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
  fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:1.2 --norandommap \
    --time_based --ramp_time=10m --runtime=5m --group_reporting

  mysql (using oltp_read_only from sysbench, with 12G of buffer pool
  in a 10G memcg):
  sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
    --tables=36 --table-size=2000000 --threads=12 --time=1800

  kernel build test done with 3G memcg limit on an i7-9700.

Before (Average of 6 test run):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds

After (Average of 6 test run):
fio: IOPS=5137.5k (+2.3%)
fio2: IOPS=7300.67k (+1.3%)
memcached: 57878.422 ops/s (+4.8%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (+2.0%)

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:55 +08:00
Documentation docs/perf: Add ampere_cspmu to toctree to fix a build warning 2024-03-14 17:15:27 +08:00
LICENSES LICENSES: Add the copyleft-next-0.3.1 license 2022-11-08 15:44:01 +01:00
arch dist: config: update config 2024-04-03 16:58:47 +08:00
block Merge branch 'linux-6.6.y' 2024-03-04 10:25:33 +08:00
certs certs: Reference revocation list for all keyrings 2023-08-17 20:12:41 +00:00
crypto crypto: algif_hash - Remove bogus SGL free on zero-length error path 2024-02-23 09:25:11 +01:00
dist dist: disable non kernel pkg on non default config 2024-04-03 16:58:46 +08:00
drivers emm: block: introduce basic flags for ramdisk swap optimization 2024-04-03 16:58:52 +08:00
fs Merge branch 'linux-6.6.y' 2024-03-04 16:09:50 +08:00
include emm: workingset: simplify and use a more intuitive model 2024-04-03 16:58:55 +08:00
init emm: memcg, zram: add support for ZRAM memory accounting 2024-04-03 16:58:50 +08:00
io_uring io_uring/net: fix multishot accept overflow handling 2024-02-23 09:25:10 +01:00
ipc Add x86 shadow stack support 2023-08-31 12:20:12 -07:00
kernel emm: allow modules to access more mm data 2024-04-03 16:58:54 +08:00
lib lib/Kconfig.debug: TEST_IOV_ITER depends on MMU 2024-03-01 13:34:59 +01:00
mm emm: workingset: simplify and use a more intuitive model 2024-04-03 16:58:55 +08:00
net eks: net/toa: add ali_cip support 2024-04-03 15:36:21 +08:00
rust rust: upgrade to Rust 1.73.0 2024-02-16 19:10:43 +01:00
samples work around gcc bugs with 'asm goto' with outputs 2024-02-23 09:24:47 +01:00
scripts checkpatch: add Signed-off-by check if commit cherry-pick from upstream 2024-03-27 21:59:03 +08:00
security Merge remote-tracking branch 'stable/linux-6.6.y' into ocks-2401 2024-03-01 17:21:23 +08:00
sound Merge branch 'linux-6.6.y' 2024-03-04 10:25:33 +08:00
tools selftests/bpf: Test pinning bpf timer to a core 2024-03-27 18:09:18 +08:00
usr initramfs: Encode dependency on KBUILD_BUILD_TIMESTAMP 2023-06-06 17:54:49 +09:00
virt ARM: 2023-09-07 13:52:20 -07:00
.clang-format iommu: Add for_each_group_device() 2023-05-23 08:15:51 +02:00
.cocciconfig
.get_maintainer.ignore get_maintainer: add Alan to .get_maintainer.ignore 2022-08-20 15:17:44 -07:00
.gitattributes dist: initial support 2023-12-12 15:56:34 +08:00
.gitignore dist: initial support 2023-12-12 15:56:34 +08:00
.mailmap 20 hotfixes. 12 are cc:stable and the remainder address post-6.5 issues 2023-10-24 09:52:16 -10:00
.rustfmt.toml rust: add `.rustfmt.toml` 2022-09-28 09:02:20 +02:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS USB: Remove Wireless USB and UWB documentation 2023-08-09 14:17:32 +02:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig tkernel: netatop: add netatop module in kernel/tkernel/ 2023-12-12 15:56:47 +08:00
MAINTAINERS MAINTAINERS: add Catherine as xfs maintainer for 6.6.y 2024-02-16 19:10:43 +01:00
Makefile kabi: provide kabi check/update/create commands for local users 2024-03-26 14:10:56 +08:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00
config-readme Makefile, dist: add "make tencentconfig" support 2024-03-04 13:25:26 +08:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.