License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". The results were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                             # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
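The scanner-comparison heuristics above can be sketched in C roughly as
follows. This is only an illustration under assumed names: compare_scans(),
default_license() and enum verdict are hypothetical and are not part of the
tooling used for these patches, which worked from a spreadsheet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Outcome of comparing the two scanner results for one file. */
enum verdict { USE_DEFAULT, USE_DETECTED, MANUAL_REVIEW };

/*
 * `a` and `b` are the license expressions reported by the two scanners,
 * or NULL when a scanner found no license trace (hypothetical interface).
 */
enum verdict compare_scans(const char *a, const char *b)
{
	if (!a && !b)
		return USE_DEFAULT;	/* top level COPYING license applies */
	if (a && b && strcmp(a, b) == 0)
		return USE_DETECTED;	/* both scanners agree */
	return MANUAL_REVIEW;		/* one-sided or conflicting detection */
}

/* Default identifier for a file with no license information in it. */
const char *default_license(const char *path)
{
	assert(path);
	return strstr(path, "/uapi/") ? "GPL-2.0 WITH Linux-syscall-note"
				      : "GPL-2.0";
}
```

The sketch treats any textual difference between the two scanner results as
a disagreement, which is stricter than a human reviewer would be.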
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights.
The Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
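As a rough illustration of the comment-type distinction mentioned above, a
C sketch follows. The helper names has_suffix() and spdx_line() are
hypothetical; the actual script is not shown in this message. The sketch
assumes that .c files take a "//" comment (as in the tagged source file
below) and that headers take a "/* */" comment; other file types are simply
rejected here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` ends with `suffix`. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);

	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Format the SPDX line for `path` into `buf`.
 * Assumption: .c files take a C++ style comment and .h files take a
 * C style comment; any other file type returns -1.
 */
int spdx_line(const char *path, const char *license, char *buf, size_t len)
{
	assert(path && license && buf);

	if (has_suffix(path, ".c"))
		return snprintf(buf, len, "// SPDX-License-Identifier: %s",
				license);
	if (has_suffix(path, ".h"))
		return snprintf(buf, len, "/* SPDX-License-Identifier: %s */",
				license);
	return -1;
}
```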
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
2014-04-04 05:47:51 +08:00
/*
 * Workingset detection
 *
 * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
 */

#include <linux/memcontrol.h>
2020-08-12 09:30:43 +08:00
#include <linux/mm_inline.h>
2014-04-04 05:47:51 +08:00
#include <linux/writeback.h>
2017-02-25 06:59:36 +08:00
#include <linux/shmem_fs.h>
2014-04-04 05:47:51 +08:00
#include <linux/pagemap.h>
#include <linux/atomic.h>
#include <linux/module.h>
#include <linux/swap.h>
2016-12-13 08:43:52 +08:00
#include <linux/dax.h>
2014-04-04 05:47:51 +08:00
#include <linux/fs.h>
#include <linux/mm.h>

/*
 *		Double CLOCK lists
 *
2016-07-29 06:46:08 +08:00
 * Per node, two clock lists are maintained for file pages: the
2014-04-04 05:47:51 +08:00
 * inactive and the active list. Freshly faulted pages start out at
 * the head of the inactive list and page reclaim scans pages from the
 * tail. Pages that are accessed multiple times on the inactive list
 * are promoted to the active list, to protect them from reclaim,
 * whereas active pages are demoted to the inactive list when the
 * active list grows too big.
 *
 *   fault ------------------------+
 *                                 |
 *              +--------------+   |            +-------------+
 *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
 *              +--------------+                +-------------+    |
 *                     |                                           |
 *                     +-------------- promotion ------------------+
 *
 *
 *		Access frequency and refault distance
 *
 * A workload is thrashing when its pages are frequently used but they
 * are evicted from the inactive list every time before another access
 * would have promoted them to the active list.
 *
 * In cases where the average access distance between thrashing pages
 * is bigger than the size of memory there is nothing that can be
 * done - the thrashing set could never fit into memory under any
 * circumstance.
 *
 * However, the average access distance could be bigger than the
 * inactive list, yet smaller than the size of memory. In this case,
 * the set could fit into memory if it weren't for the currently
 * active pages - which may be used more, hopefully less frequently:
 *
 *      +-memory available to cache-+
 *      |                           |
 *      +-inactive------+-active----+
 *  a b | c d e f g h i | J K L M N |
 *      +---------------+-----------+
 *
 * It is prohibitively expensive to accurately track access frequency
 * of pages. But a reasonable approximation can be made to measure
 * thrashing on the inactive list, after which refaulting pages can be
 * activated optimistically to compete with the existing active pages.
 *
2023-12-29 15:37:53 +08:00
 * For such approximation, we introduce a counter `eviction` (E)
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based file page
protection for multi-gen LRU.
Currently, refault distance re-activation for active/inactive can help
keep working set pages in memory, it works by estimating the refault
(re-access) distance of a page, if it's small enough, then put it
on active LRU instead of inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true: an activated
page could be anywhere in the LRU list (through mark_page_accessed), so it
only left-shifts the pages on its right side.
Besides, one page can get activated and deactivated multiple times.
And multi-gen LRU doesn't fit with this model well, pages are getting
aged in generations, and getting promoted frequently between generations.
So instead we introduce a simpler idea here: Just presume the evicted
pages are still in memory, each has a corresponding eviction timestamp
(nonresistence_age) that is increased and recorded upon each eviction.
These timestamps could logically form a "Shadow LRU", a read-only
imaginary LRU. Let the `nonresistence_age` still be NA, then we have:
Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
                          +-memory available to cache-+
                          |                           |
+-------------------------+===============+===========+
| *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
+-+-----------------------+===============+===========+
  |                       |
  +-----------------------+
  |         SP
fault page          O -> Hole left by refaulted in pages.
                         Entries are supposed to be removed
                         upon access but this is not a real
                         LRU so can't really update it.
                    * -> The page corresponding to SP
It can be easily seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workload could push
a page out of available memory. Since every evicted page was once at the
head of the INACTIVE list, the estimated minimum value of refault distance is:
SP + NR_INACTIVE
On refault, the page *may* get activated and stay in memory if we put
it to active LRU if:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
Then the page is worth getting re-activated to start from active LRU,
since the access distance is smaller than the total memory.
And since this is only an estimation, based on several hypotheses, and
it could break the ability of LRU to distinguish a workingset out of
caches, in extreme cases all refault causing activation will lead to
worse thrashing, so throttle this by two factors:
1. Notice previously re-faulted in pages may leave "holes" on the shadow
part of LRU, that part is left unhandled on purpose to decrease
re-activate rate for pages that have a large SP value (the larger
SP value a page has, the more likely it will be affected by such
holes).
2. When the active LRU is long enough, challenging active pages
by re-activating a previously evicted/inactive page that was accessed only
once may not be a good idea, so throttle the re-activation when
NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion (possibly with slightly
higher hotness) of the caches in memory instead of letting caches get
evicted permutably due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttled reactivation when NR_ACTIVE is high.
This side effect is actually similar to the algorithm before, which
introduced such an effect by increasing nonresistence_age in extra call
paths, throttling the re-activation when activation/reactivation is
happening massively.
Combining all of the above, we have the following simple rules:
Upon refault, if any of the following conditions is met, mark the page as active:
- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE
- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE
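The two rules above can be sketched in C as follows. This is a minimal
illustration under assumed names, not the code from this patch:
struct lru_sizes, shadow_position() and should_activate() are hypothetical,
and the counter readings are modelled as plain unsigned long values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the two LRU list sizes, in pages. */
struct lru_sizes {
	unsigned long nr_active;
	unsigned long nr_inactive;
};

/*
 * SP = (NA's reading @ current) - (NA's reading @ eviction).
 * Unsigned subtraction keeps the result correct if the counter has
 * wrapped around once between the two readings.
 */
unsigned long shadow_position(unsigned long na_now, unsigned long na_evicted)
{
	return na_now - na_evicted;
}

/* Decide on refault whether the page should start on the active list. */
bool should_activate(unsigned long sp, const struct lru_sizes *lru)
{
	assert(lru);

	if (lru->nr_active < lru->nr_inactive)
		return sp < lru->nr_active;	/* active LRU is low */
	return sp < lru->nr_inactive;		/* active LRU is high */
}
```

Both branches compare SP with the shorter of the two lists, so the check
is equivalent to SP < min(NR_ACTIVE, NR_INACTIVE).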
Code-wise, this is simpler than before since there is no longer a need to
update lruvec workingset data when activating a page, and so far, a few
benchmarks show a similar or better result under memory pressure. The
performance should also be better when there is no memory pressure since
some memcg iteration and atomic operations are no longer needed.
When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.
Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (average of 6 test runs):
fio: IOPS=5137.5k (+0.23%)
fio2: IOPS=7300.67k (+0.13%)
memcached: 57878.422 ops/s (+0.48%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (+0.2%)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
 * here. This counter increases each time a page is evicted, and each evicted
 * page will have a shadow that stores the counter reading at the eviction
 * time as a timestamp. So when an evicted page was faulted again, we have:
2014-04-04 05:47:51 +08:00
 *
2023-12-29 15:37:53 +08:00
 * Let SP = ((E's reading @ current) - (E's reading @ eviction))
2014-04-04 05:47:51 +08:00
 *
2023-12-15 10:45:43 +08:00
 *                                +-memory available to cache-+
 *                                |                           |
 *      +-------------------------+===============+===========+
 *      | *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
 *      +-+-----------------------+===============+===========+
 *        |                       |
 *        +-----------------------+
 *        |         SP
 *      fault page          O -> Hole left by previously faulted in pages
 *                          * -> The page corresponding to SP
2014-04-04 05:47:51 +08:00
 *
2023-12-15 10:45:43 +08:00
 * Here SP stands for how far the current workload could push a page
 * out of available memory. Since every evicted page was once at the head
 * of the INACTIVE list, the page could have an access distance of:
2014-04-04 05:47:51 +08:00
 *
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based file page
protection for multi-gen LRU.
Currently, refault distance re-activation for active/inactive can help
keep working set pages in memory, it works by estimating the refault
(re-access) distance of a page, if it's small enough, then put it
on active LRU instead of inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true: an activated
page could be anywhere in the LRU list (through mark_page_accessed), and it
only left-shifts the pages on its right side.
Besides, one page can get activated/deactivated multiple times.
Multi-gen LRU doesn't fit this model well either: pages are aged in
generations and get promoted frequently between generations.
So instead we introduce a simpler idea here: just presume the evicted
pages are still in memory, each with a corresponding eviction timestamp
(nonresident_age) that is increased and recorded upon each eviction.
These timestamps could logically form a "Shadow LRU", a read-only
imaginary LRU. Let `nonresident_age` be NA, then we have:
Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
                          +-memory available to cache-+
                          |                           |
+-------------------------+===============+===========+
| *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
+-+-----------------------+===============+===========+
  |                       |
  +-----------------------+
  |         SP
fault page          O -> Hole left by refaulted in pages.
                         Entries are supposed to be removed
                         upon access but this is not a real
                         LRU so can't really update it.
                    * -> The page corresponding to SP
It can be easily seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workflow could push
a page out of available memory. Since every evicted page was once at the
head of the INACTIVE list, the estimated minimum value of the refault
distance is:
SP + NR_INACTIVE
On refault, the page *may* stay in memory if we put it on the active LRU,
provided that:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
In that case the page is worth re-activating so that it starts from the
active LRU, since the access distance is smaller than the total memory.
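As a rough illustration of the derivation above, the following C sketch
computes SP from two readings of the eviction counter and applies the
unthrottled check. The function names and the plain unsigned long counters
are assumptions made for this example; they are not the kernel's helpers.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: SP is how far the eviction counter (NA) advanced
 * between the eviction of a page and its refault.
 */
static unsigned long shadow_position(unsigned long na_now,
				     unsigned long na_at_eviction)
{
	/* Unsigned subtraction stays correct across one counter wrap. */
	return na_now - na_at_eviction;
}

/* Estimated minimum refault distance: SP + NR_INACTIVE. */
static unsigned long min_refault_distance(unsigned long sp,
					  unsigned long nr_inactive)
{
	return sp + nr_inactive;
}

/*
 * Unthrottled check: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE,
 * which is equivalent to SP < NR_ACTIVE.
 */
static bool fits_in_memory(unsigned long sp, unsigned long nr_inactive,
			   unsigned long nr_active)
{
	return min_refault_distance(sp, nr_inactive) < nr_inactive + nr_active;
}
```

For example, if NA advanced from 400 to 1000, SP is 600. With 300 inactive
and 700 active pages the estimated minimum distance is 900, which is below
the 1000 pages of cache, so the page qualifies.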
Since this is only an estimation based on several hypotheses, it could
break the ability of the LRU to distinguish a workingset from other
caches. In extreme cases, where every refault causes an activation, it
would lead to worse thrashing, so the activation is throttled by two
factors:
1. Previously refaulted pages may leave "holes" in the shadow part of
the LRU. That part is left unhandled on purpose to decrease the
re-activation rate for pages that have a large SP value (the larger
the SP value of a page, the more likely it is affected by such
holes).
2. When the active LRU is long enough, challenging active pages by
re-activating a previously evicted/inactive page that was accessed
only once may not be a good idea, so throttle the re-activation when
NR_ACTIVE > NR_INACTIVE by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion (possibly with slightly
higher hotness) of the caches in memory instead of letting caches get
evicted in turn due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttle reactivation when NR_ACTIVE is high.
This side effect is actually similar to the previous algorithm, which
introduced such an effect by increasing nonresident_age in extra call
paths, throttling the re-activation when activation/reactivation is
happening massively.
Combining all of the above, we have the following simple rules:
Upon refault, if either of the following conditions is met, mark the page
as active:
- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
  SP < NR_ACTIVE
- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
  SP < NR_INACTIVE
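The two rules above amount to comparing SP with the shorter of the two
lists. A minimal C sketch follows; refault_should_activate() is a
hypothetical name chosen for illustration and is not the function used in
mm/workingset.c.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a refaulting page should start on
 * the active LRU. sp is the shadow position (NA now minus NA at
 * eviction); all values are in pages.
 */
static bool refault_should_activate(unsigned long sp,
				    unsigned long nr_active,
				    unsigned long nr_inactive)
{
	unsigned long limit;

	if (nr_active < nr_inactive)
		limit = nr_active;	/* active LRU is low */
	else
		limit = nr_inactive;	/* active LRU is high: throttle */

	return sp < limit;
}
```

With NR_ACTIVE = 300 and NR_INACTIVE = 200, a page with SP = 250 is not
activated even though SP < NR_ACTIVE, because the comparison is made
against NR_INACTIVE.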
Code-wise, this is simpler than before since there is no longer a need to
update lruvec workingset data when activating a page, and so far a few
benchmarks show a similar or better result under memory pressure. The
performance should also be better when there is no memory pressure, since
some memcg iteration and atomic operations are no longer needed.
When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.
Using the memtier and fio tests from commit ac35a4902374, scaled down
to fit my test environment, plus some other tests:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0766 -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
  fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:1.2 --norandommap \
    --time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
  sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
    --tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (average of 6 test runs):
fio: IOPS=5137.5k (+0.23%)
fio2: IOPS=7300.67k (+0.13%)
memcached: 57878.422 ops/s (+0.48%)
mysql: 6312.06 tps (+0.51%)
kernel-build: 1813.66231 seconds (-0.19%, lower is better)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
|
|
|
* SP + NR_INACTIVE
|
2014-04-04 05:47:51 +08:00
|
|
|
*
|
2023-12-15 10:45:43 +08:00
|
|
|
* So if:
|
2014-04-04 05:47:51 +08:00
|
|
|
*
|
2023-12-15 10:45:43 +08:00
|
|
|
* SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
|
2014-04-04 05:47:51 +08:00
|
|
|
*
|
2023-12-15 10:45:43 +08:00
|
|
|
* Which can be simplified to:
|
2014-04-04 05:47:51 +08:00
|
|
|
*
|
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based file page
protection for multi-gen LRU.
Currently, refault distance re-activation for active/inactive can help
keep working set pages in memory, it works by estimating the refault
(re-access) distance of a page, if it's small enough, then put it
on active LRU instead of inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true, an activated
page could be anywhere in the LRU list (through mark_page_accessed), it
only left-shift the pages on its right side.
And besides, one page can get activate/deactivated for multiple times.
And multi-gen LRU doesn't fit with this model well, pages are getting
aged in generations, and getting promoted frequently between generations.
So instead we introduce a simpler idea here: Just presume the evicted
pages are still in memory, each has an corresponding eviction timestamp
(nonresistence_age) that is increased and recorded upon each eviction.
These timestamp could logically form a "Shadow LRU", a read-only
imaginary LRU. Let the `nonresistence_age` still be NA, then we have:
Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
+-memory available to cache-+
| |
+-------------------------+===============+===========+
| * shadows O O O | INACTIVE | ACTIVE |
+-+-----------------------+===============+===========+
| |
+-----------------------+
| SP
fault page O -> Hole left by refaulted in pages.
Entries are suppose to be removed
upon access but this is not a real
LRU so can't really update it.
* -> The page corresponding to SP
It can be easily seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workflow could push
a page out of available memory. Since all evicted page was once head
of INACTIVE list, the estimated minimum value of refault distance is:
SP + NR_INACTIVE
On refault, the page *may* get activated and stay in memory if we put
it to active LRU if:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
Then the page is worth getting re-activated to start from active LRU,
since the access distance is smaller than the total memory.
And since this is only an estimation, based on several hypotheses, and
it could break the ability of LRU to distinguish a workingset out of
caches, in extreme cases all refault causing activation will lead to
worse thrashing, so throttle this by two factors:
1. Notice previously re-faulted in pages may leave "holes" on the shadow
part of LRU, that part is left unhandled on purpose to decrease
re-activate rate for pages that have a large SP value (the larger
SP value a page has, the more likely it will be affected by such
holes).
2. When the active LRU is long enough, chanllaging active pages
by re-activating a one-time access previously evicted/inactive page
may not be a good idea, so throttle the re-activation when
NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion (possible have slightly
higher hotness) of the caches in memory instead of letting caches get
evicted permutably due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttled reactivation when NR_ACTIVE is high.
This side effect is actually similar with the algoritm before, which
introduce such effect by increasing nonresistence_age in extra call
paths, trottled the re-activation when activition/reactivation is
massively happenning.
Combined all above, we have following simple rules:
Upon refault, if any of following conditions is met, mark page as active:
- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE
- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE
Code-wise, this is simpler than before since no longer need to do lruvec
workingset data update when activating a page, and so far, a few benchmarks
shows a similar or better result under memore pressure. The performance
should also be better when there is no memory pressure since some memcg
iteration and atomic operation is no longer needed.
When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.
Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (average of 6 test runs):
fio: IOPS=5137.5k (+0.23%)
fio2: IOPS=7300.67k (+0.13%)
memcached: 57878.422 ops/s (+0.48%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (-0.19%, lower is better)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* SP < NR_ACTIVE
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based file page
protection for multi-gen LRU.
Currently, refault distance re-activation for active/inactive can help
keep working set pages in memory. It works by estimating the refault
(re-access) distance of a page; if it's small enough, the page is put
on the active LRU instead of the inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true: an activated
page could be anywhere in the LRU list (through mark_page_accessed), and it
only left-shifts the pages on its right side.
Besides, one page can get activated/deactivated multiple times.
And multi-gen LRU doesn't fit this model well: pages are aged
in generations, and are promoted frequently between generations.
So instead we introduce a simpler idea here: just presume the evicted
pages are still in memory, each with a corresponding eviction timestamp
(nonresident_age) that is increased and recorded upon each eviction.
These timestamps could logically form a "Shadow LRU", a read-only
imaginary LRU. Let the `nonresident_age` still be NA, then we have:
Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
                          +-memory available to cache-+
                          |                           |
+-------------------------+===============+===========+
| *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
+-+-----------------------+===============+===========+
  |                       |
  +-----------------------+
  |         SP
fault page          O -> Hole left by refaulted in pages.
                         Entries are supposed to be removed
                         upon access but this is not a real
                         LRU so can't really update it.
                    * -> The page corresponding to SP
It can be easily seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workload could push
a page out of available memory. Since every evicted page was once at the
head of the INACTIVE list, the estimated minimum value of refault distance is:
SP + NR_INACTIVE
On refault, the page *may* stay in memory if we put it on the active
LRU, provided that:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
Then the page is worth getting re-activated to start from active LRU,
since the access distance is smaller than the total memory.
And since this is only an estimation based on several hypotheses, it
could break the ability of LRU to distinguish a workingset out of
caches. In extreme cases, activating on every refault will lead to
worse thrashing, so throttle this by two factors:
1. Notice previously re-faulted in pages may leave "holes" on the shadow
part of LRU; that part is left unhandled on purpose to decrease the
re-activation rate for pages that have a large SP value (the larger
SP value a page has, the more likely it will be affected by such
holes).
2. When the active LRU is long enough, challenging active pages
by re-activating a previously evicted/inactive page that was accessed
only once may not be a good idea, so throttle the re-activation when
NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion (possibly with slightly
higher hotness) of the caches in memory instead of letting caches get
evicted in rotation due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttled reactivation when NR_ACTIVE is high.
This side effect is actually similar to the previous algorithm, which
introduced such an effect by increasing nonresident_age in extra call
paths, throttling re-activation when activation/reactivation is
happening massively.
Combining all of the above, we have the following simple rules:
Upon refault, if any of the following conditions is met, mark the page as active:
- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE
- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE
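The two rules above can be sketched as a small standalone C function. This is only an illustration of the decision, not the code from the patch: the name `should_activate` and its parameters are hypothetical, and the counters are assumed to be plain unsigned values rather than the kernel's per-lruvec statistics.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel function): decide whether a refaulting
 * page should be marked active.
 *
 * na_now      - reading of the eviction counter (NA) at refault time
 * na_evict    - reading of NA recorded when the page was evicted
 * nr_active   - number of pages on the active LRU
 * nr_inactive - number of pages on the inactive LRU
 */
static bool should_activate(unsigned long na_now, unsigned long na_evict,
                            unsigned long nr_active, unsigned long nr_inactive)
{
    /* SP: offset of the page in the imaginary shadow LRU. */
    unsigned long sp = na_now - na_evict;

    if (nr_active < nr_inactive)
        return sp < nr_active;   /* active LRU is low */

    return sp < nr_inactive;     /* active LRU is high: throttle */
}
```

For example, with NA readings of 1000 (now) and 900 (at eviction), SP is 100, so a page is activated when NR_ACTIVE=200 and NR_INACTIVE=500. Both branches together amount to comparing SP against the smaller of the two list sizes.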
Code-wise, this is simpler than before since there is no longer a need to
update lruvec workingset data when activating a page, and so far, a few
benchmarks show a similar or better result under memory pressure. The
performance should also be better when there is no memory pressure since
some memcg iteration and atomic operations are no longer needed.
When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.
Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys \
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (average of 6 test runs):
fio: IOPS=5137.5k (+0.23%)
fio2: IOPS=7300.67k (+0.13%)
memcached: 57878.422 ops/s (+0.48%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (-0.19%, lower is better)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* Then the page is worth getting re-activated to start from ACTIVE part,
* since the access distance is shorter than total memory to make it stay.
2014-04-04 05:47:51 +08:00
*
2023-12-15 10:45:43 +08:00
* And since this is only an estimation, based on several hypotheses, and
* it could break the ability of LRU to distinguish a workingset out of
* caches, so throttle this by two factors:
2014-04-04 05:47:51 +08:00
*
2023-12-15 10:45:43 +08:00
* 1. Notice that re-faulted in pages may leave "holes" on the shadow
* part of LRU, that part is left unhandled on purpose to decrease
* re-activate rate for pages that have a large SP value (the larger
* SP value a page have, the more likely it will be affected by such
* holes).
* 2. When the ACTIVE part of LRU is long enough, challenging ACTIVE pages
* by re-activating a one-time faulted previously INACTIVE page may not
* be a good idea, so throttle the re-activation when ACTIVE > INACTIVE
* by comparing with INACTIVE instead.
2014-04-04 05:47:51 +08:00
*
2023-12-15 10:45:43 +08:00
* Combined all above, we have:
* Upon refault, if any of the following conditions is met, mark the page
* as active:
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based file page
protection for multi-gen LRU.
Currently, refault distance re-activation for active/inactive can help
keep working set pages in memory. It works by estimating the refault
(re-access) distance of a page; if it's small enough, the page is put
on the active LRU instead of the inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true: an activated
page could be anywhere in the LRU list (through mark_page_accessed), and it
only left-shifts the pages on its right side.
Besides, one page can get activated/deactivated multiple times.
And multi-gen LRU doesn't fit this model well: pages are aged
in generations, and are promoted frequently between generations.
So instead we introduce a simpler idea here: just presume the evicted
pages are still in memory, each with a corresponding eviction timestamp
(nonresident_age) that is increased and recorded upon each eviction.
These timestamps could logically form a "Shadow LRU", a read-only
imaginary LRU. Let the `nonresident_age` still be NA, then we have:
Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
                          +-memory available to cache-+
                          |                           |
+-------------------------+===============+===========+
| *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
+-+-----------------------+===============+===========+
  |                       |
  +-----------------------+
  |         SP
fault page          O -> Hole left by refaulted in pages.
                         Entries are supposed to be removed
                         upon access but this is not a real
                         LRU so can't really update it.
                    * -> The page corresponding to SP
It can be easily seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workload could push
a page out of available memory. Since every evicted page was once at the
head of the INACTIVE list, the estimated minimum value of refault distance is:
SP + NR_INACTIVE
On refault, the page *may* stay in memory if we put it on the active
LRU, provided that:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
Then the page is worth getting re-activated to start from active LRU,
since the access distance is smaller than the total memory.
And since this is only an estimation, based on several hypotheses, and
it could break the ability of LRU to distinguish a workingset out of
caches, in extreme cases all refault causing activation will lead to
worse thrashing, so throttle this by two factors:
1. Previously re-faulted pages may leave "holes" in the shadow part of
the LRU. That part is left unhandled on purpose, to decrease the
re-activation rate for pages that have a large SP value (the larger
the SP value of a page, the more likely it is affected by such
holes).
2. When the active LRU is long enough, challenging active pages
by re-activating a previously evicted/inactive page that was accessed
only once may not be a good idea, so throttle the re-activation when
NR_ACTIVE > NR_INACTIVE by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion of the caches (possibly
the slightly hotter ones) in memory instead of letting the caches get
evicted in turn due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttle reactivation when NR_ACTIVE is high.
This side effect is actually similar to the previous algorithm, which
introduced it by increasing nonresistence_age in extra call paths,
throttling re-activation when activation/reactivation happens
massively.
Combining all of the above, we have the following simple rules:
Upon refault, if either of the following conditions is met, mark the page
as active:
- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE
- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE
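A minimal sketch of these two rules in userspace C follows. The function
name and parameters are hypothetical and chosen for illustration; this is
not the kernel code itself.

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the activation rule described above.
 * sp is the shadow position of the refaulting page; nr_active and
 * nr_inactive are the current sizes of the two LRU lists.
 */
static bool sketch_should_activate(unsigned long sp,
                                   unsigned long nr_active,
                                   unsigned long nr_inactive)
{
    if (nr_active < nr_inactive)
        return sp < nr_active;    /* active LRU is low */

    return sp < nr_inactive;      /* active LRU is high: throttled */
}
```

Both branches amount to comparing SP against the smaller of the two
lists, which is what makes the throttling take effect only when the
active LRU is the longer one.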
Code-wise, this is simpler than before since there is no longer a need to
update lruvec workingset data when activating a page, and so far a few
benchmarks show a similar or better result under memory pressure. The
performance should also be better when there is no memory pressure since
some memcg iteration and atomic operations are no longer needed.
When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.
Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys \
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (average of 6 test runs):
fio: IOPS=5137.5k (+0.23%)
fio2: IOPS=7300.67k (+0.13%)
memcached: 57878.422 ops/s (+0.48%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (0.2% faster)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
 * - If ACTIVE LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
 *     SP < NR_ACTIVE
2014-04-04 05:47:51 +08:00
 *
2023-12-15 10:45:43 +08:00
 * - If ACTIVE LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
 *     SP < NR_INACTIVE
2014-04-04 05:47:51 +08:00
 *
mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to balance pressure more
smartly between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.
Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
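As a rough illustration of storing that bit in the shadow entry, the
following userspace C sketch packs an eviction timestamp, a node id and
the workingset bit into one word. The field widths and all names here are
assumptions made for the example; the kernel derives its own widths from
the configuration and also stores further identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative field widths only (assumptions, not kernel values). */
#define SKETCH_WORKINGSET_BITS 1
#define SKETCH_NODE_BITS       2

/* Pack eviction timestamp, node id and workingset bit into one word. */
static unsigned long sketch_pack_shadow(unsigned long eviction,
                                        unsigned int node,
                                        bool workingset)
{
    assert(node < (1u << SKETCH_NODE_BITS));
    eviction = (eviction << SKETCH_NODE_BITS) | node;
    eviction = (eviction << SKETCH_WORKINGSET_BITS) |
               (workingset ? 1UL : 0UL);
    return eviction;
}

/* Reverse of sketch_pack_shadow(): peel fields off from the low bits. */
static void sketch_unpack_shadow(unsigned long shadow,
                                 unsigned long *eviction,
                                 unsigned int *node,
                                 bool *workingset)
{
    *workingset = (shadow & ((1UL << SKETCH_WORKINGSET_BITS) - 1)) != 0;
    shadow >>= SKETCH_WORKINGSET_BITS;
    *node = (unsigned int)(shadow & ((1UL << SKETCH_NODE_BITS) - 1));
    shadow >>= SKETCH_NODE_BITS;
    *eviction = shadow;
}
```

On refault, the unpacked workingset bit is what would let the refault be
classified as thrashing (bit set) rather than a transition (bit clear).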
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
 * Refaulting inactive pages
2014-04-04 05:47:51 +08:00
 *
 * All that is known about the active list is that the pages have been
 * accessed more than once in the past. This means that at any given
 * time there is actually a good chance that pages on the active list
 * are no longer in active use.
 *
 * So when a refault distance of (R - E) is observed and there are at
2023-04-13 16:34:49 +08:00
 * least (R - E) pages in the userspace workingset, the refaulting page
 * is activated optimistically in the hope that (R - E) pages are actually
2014-04-04 05:47:51 +08:00
 * used less frequently than the refaulting page - or even not used at
 * all anymore.
 *
2018-10-27 06:06:04 +08:00
 * That means if inactive cache is refaulting with a suitable refault
 * distance, we assume the cache workingset is transitioning and put
2023-04-13 16:34:49 +08:00
 * pressure on the current workingset.
2018-10-27 06:06:04 +08:00
 *
2014-04-04 05:47:51 +08:00
 * If this is wrong and demotion kicks in, the pages which are truly
 * used more frequently will be reactivated while the less frequently
 * used ones will be evicted from memory.
 *
 * But if this is right, the stale pages will be pushed out of memory
 * and the used pages get to stay in cache.
 *
2018-10-27 06:06:04 +08:00
 * Refaulting active pages
 *
 * If on the other hand the refaulting pages have recently been
 * deactivated, it means that the active list is no longer protecting
 * actively used cache from reclaim. The cache is NOT transitioning to
 * a different workingset; the existing workingset is thrashing in the
 * space allocated to the page cache.
 *
2014-04-04 05:47:51 +08:00
 *
 * Implementation
 *
2020-06-26 11:30:31 +08:00
 * For each node's LRU lists, a counter for inactive evictions and
2023-12-29 15:37:53 +08:00
 * activations is maintained (node->evictions).
2014-04-04 05:47:51 +08:00
 *
 * On eviction, a snapshot of this counter (along with some bits to
2017-11-25 03:24:59 +08:00
 * identify the node) is stored in the now empty page cache
2014-04-04 05:47:51 +08:00
 * slot of the evicted page. This is called a shadow entry.
 *
 * On cache misses for which there are shadow entries, an eligible
 * refault distance will immediately activate the refaulting page.
 */

2021-07-01 09:49:54 +08:00
#define WORKINGSET_SHIFT 1
2024-04-01 19:50:55 +08:00
#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
2021-07-01 09:49:54 +08:00
 WORKINGSET_SHIFT + NODES_SHIFT + \
 MEM_CGROUP_ID_SHIFT)
2024-04-01 19:50:55 +08:00
#define EVICTION_BITS (BITS_PER_LONG - (EVICTION_SHIFT))
2016-03-16 05:57:07 +08:00
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed MGLRU not working very well on certain workloads, as
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
LRU starts from right) of file pages is also larger than total memory.
All file pages are stuck on the oldest generation and keep getting
read in and then evicted in turn. Despite anon pages being idle,
they never get aged. The PID controller didn't kick in until there were
some minor access pattern changes. And file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in memory, reducing the IO workload significantly.
So apply it for MGLRU as well. The updated refault-distance model
fits well for MGLRU in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
Some adjustments are made to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
of MGLRU:
- If a tier-0 page has a qualified refault-distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault-distance, mark it as
active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
refault-distance and increase the PID controlled refault rate
of the updated tier, in the hope that similar pages will be protected
next time upon eviction.
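The first two rules above can be sketched as a small decision function in
userspace C. The enum, its values and the function name are hypothetical
and exist only for this illustration; the reference and refault-rate
bookkeeping of the third rule is not modeled.

```c
#include <stdbool.h>

/* Hypothetical outcome of handling one refault under MGLRU. */
enum sketch_refault_action {
    SKETCH_NO_PROMOTION,       /* refault distance did not qualify */
    SKETCH_TO_SECOND_OLDEST,   /* tier 0: higher tier, second oldest gen */
    SKETCH_TO_YOUNGEST_ACTIVE, /* tier >= 1: mark active, youngest gen */
};

/* Map (tier, qualified refault distance) to the action described above. */
static enum sketch_refault_action
sketch_classify_refault(unsigned int tier, bool distance_qualifies)
{
    if (!distance_qualifies)
        return SKETCH_NO_PROMOTION;

    return tier == 0 ? SKETCH_TO_SECOND_OLDEST
                     : SKETCH_TO_YOUNGEST_ACTIVE;
}
```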
NOTE: This also changed the meaning of workingset_* fields in
/proc/vmstat, workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking,
workingset_restore_* now stands for all pages promoted by
any reason.
The following benchmark showed a roughly 4.4x improvement (2542 vs 577
transactions). To simulate the optimized workload, I set up a
3-replica mongodb cluster, each replica in a different cgroup, using 5 GB
of WiredTiger cache and 10 GB of oplog, on a 32G VM with no limit set. The
benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.
Test is done on an EPYC 7K62 with 32G RAM with SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     577             27584645283.7   0.02 txn/s
------------------------------------------------------------------
  TOTAL           577             27584645283.7   0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
               total        used        free      shared  buff/cache   available
Mem:           31849        6829         790          23       24229       24542
Swap:          31848           0       31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2542            27121571486.2   0.09 txn/s
------------------------------------------------------------------
  TOTAL           2542            27121571486.2   0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
               total        used        free      shared  buff/cache   available
Mem:           31849        7093         283           4       24472       24289
Swap:          31848        1652       30196
The performance is about 4.4x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement.
There is no regression on other tests so far, and a performance gain
is observed on file page heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
#define LRU_GEN_EVICTION_BITS (EVICTION_BITS - LRU_REFS_WIDTH)
2016-03-16 05:57:07 +08:00
mm: workingset: eviction buckets for bigmem/lowbit machines
For per-cgroup thrash detection, we need to store the memcg ID inside
the radix tree cookie as well. However, on 32 bit that doesn't leave
enough bits for the eviction timestamp to cover the necessary range of
recently evicted pages. The radix tree entry would look like this:
[ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]
12 bits means 4096 pages, means 16M worth of recently evicted pages.
But refaults are actionable up to distances covering half of memory. To
not miss refaults, we have to stretch out the range at the cost of how
precisely we can tell when a page was evicted. This way we can shave
off lower bits from the eviction timestamp until the necessary range is
covered. E.g. grouping evictions into 1M buckets (256 pages) will
stretch the longest representable refault distance to 4G.
This patch implements eviction buckets that are automatically sized
according to the available bits and the necessary refault range, in
preparation for per-cgroup thrash detection.
The maximum actionable distance is currently half of memory, but to
support memory hotplug of up to 200% of boot-time memory, we size the
buckets to cover double the distance. Beyond that, thrashing won't be
detectable anymore.
During boot, the kernel will print out the exact parameters, like so:
[ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6
In this example, there are 12 radix entry bits available for the
eviction timestamp, to cover a maximum distance of 2^18 pages (this is a
1G machine). Consequently, evictions must be grouped into buckets of
2^6 pages, or 256K.
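The numbers in this example can be reproduced with a small userspace C
sketch. The helper names are hypothetical, and taking max_order as the
order of the total page count is an assumption based on the boot message
above, not a quote of the kernel code.

```c
#include <assert.h>

/* Position of the highest set bit, counting from 1 (0 when x == 0). */
static unsigned int sketch_fls_long(unsigned long x)
{
    unsigned int bits = 0;

    while (x) {
        bits++;
        x >>= 1;
    }
    return bits;
}

/*
 * Hypothetical sizing helper: when the timestamp has fewer bits than are
 * needed to cover total_pages, drop the difference from the low end by
 * grouping evictions into buckets of 2^bucket_order pages.
 */
static unsigned int sketch_bucket_order(unsigned int timestamp_bits,
                                        unsigned long total_pages)
{
    unsigned int max_order;

    assert(total_pages > 0);
    max_order = sketch_fls_long(total_pages - 1);

    return max_order > timestamp_bits ? max_order - timestamp_bits : 0;
}
```

With 12 timestamp bits and 2^18 pages this yields a bucket order of 6,
that is 64 pages or 256K per bucket with 4K pages. When enough bits are
available the order is 0 and every eviction gets its own timestamp.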
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
/*
 * Eviction timestamps need to be able to cover the full range of
2017-11-25 03:24:59 +08:00
 * actionable refaults. However, bits are tight in the xarray
2016-03-16 05:57:13 +08:00
 * entry, and after storing the identifier for the lruvec there might
 * not be enough left to represent every single actionable refault. In
 * that case, we have to sacrifice granularity for distance, and group
 * evictions into coarser buckets by shaving off lower timestamp bits.
 */
static unsigned int bucket_order __read_mostly;
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed that MGLRU does not work well on certain workloads, observed
on some heavily stressed databases. This happens when the file page
workingset size exceeds total memory, and the access distance (the
left-shift time of a page before it gets activated, considering that the
LRU starts from the right) of file pages is also larger than total memory.
All file pages are stuck on the oldest generation and are read in and
then evicted again, over and over. Although the anon pages are idle,
they never get aged. The PID controller does not kick in until there are
some minor access pattern changes, and file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in memory and reduce the IO workload significantly.
So apply it for MGLRU as well. The updated refault-distance model
fits MGLRU well in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
Some adjustments are made to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
in MGLRU:
- If a tier-0 page has a qualified refault distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault distance, mark it as
active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
refault distance, and increase the PID-controlled refault rate
of the updated tier, in the hope that similar pages will be protected
at the next eviction.
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking, and
workingset_restore_* now stands for all pages promoted for
any reason.
The following benchmark showed an improvement of about 4.4x. To simulate
the optimized workload, I set up a 3-replica MongoDB cluster, each replica
in a different cgroup, using 5 GB of WiredTiger cache and 10 GB of oplog,
on a 32G VM with no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.
Test is done on an EPYC 7K62 with 32G RAM with SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
               Executed   Time (µs)       Rate
STOCK_LEVEL    577        27584645283.7   0.02 txn/s
------------------------------------------------------------------
TOTAL          577        27584645283.7   0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
        total    used    free   shared   buff/cache   available
Mem:    31849    6829     790       23        24229       24542
Swap:   31848       0   31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
               Executed   Time (µs)       Rate
STOCK_LEVEL    2542       27121571486.2   0.09 txn/s
------------------------------------------------------------------
TOTAL          2542       27121571486.2   0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
        total    used    free   shared   buff/cache   available
Mem:    31849    7093     283        4        24472       24289
Swap:   31848    1652   30196
Throughput is about 4.4x higher than before (2542 vs. 577 transactions
with the same --duration=900 setting), and the idle anon pages can now
get swapped out as expected. Testing with lower stress also shows an
improvement. There is no regression on other tests so far, and a
performance gain is observed on file-page-heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
static unsigned int lru_gen_bucket_order __read_mostly;
mm: workingset: eviction buckets for bigmem/lowbit machines
For per-cgroup thrash detection, we need to store the memcg ID inside
the radix tree cookie as well. However, on 32 bit that doesn't leave
enough bits for the eviction timestamp to cover the necessary range of
recently evicted pages. The radix tree entry would look like this:
[ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]
12 bits means 4096 pages, means 16M worth of recently evicted pages.
But refaults are actionable up to distances covering half of memory. To
not miss refaults, we have to stretch out the range at the cost of how
precisely we can tell when a page was evicted. This way we can shave
off lower bits from the eviction timestamp until the necessary range is
covered. E.g. grouping evictions into 1M buckets (256 pages) will
stretch the longest representable refault distance to 4G.
This patch implements eviction buckets that are automatically sized
according to the available bits and the necessary refault range, in
preparation for per-cgroup thrash detection.
The maximum actionable distance is currently half of memory, but to
support memory hotplug of up to 200% of boot-time memory, we size the
buckets to cover double the distance. Beyond that, thrashing won't be
detectable anymore.
During boot, the kernel will print out the exact parameters, like so:
[ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6
In this example, there are 12 radix entry bits available for the
eviction timestamp, to cover a maximum distance of 2^18 pages (this is a
1G machine). Consequently, evictions must be grouped into buckets of
2^6 pages, or 256K.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
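As a rough illustration of the sizing described in the commit message above, the sketch below derives a bucket order from the number of timestamp bits and the order of the distance that has to be covered. The names `demo_order_of` and `demo_bucket_order` are hypothetical and are not the kernel's own helpers; the real initialisation code may differ in detail.

```c
/*
 * Hypothetical helper: smallest order such that (1UL << order) >= nr_pages.
 * Assumes nr_pages is well below the top bit of unsigned long.
 */
static unsigned int demo_order_of(unsigned long nr_pages)
{
	unsigned int order = 0;

	while ((1UL << order) < nr_pages)
		order++;
	return order;
}

/*
 * If the distance to cover needs more bits than the timestamp field has,
 * the surplus low bits are shaved off, which groups evictions into
 * buckets of 2^bucket_order pages. Otherwise no bucketing is needed.
 */
static unsigned int demo_bucket_order(unsigned int timestamp_bits,
				      unsigned int max_order)
{
	return max_order > timestamp_bits ? max_order - timestamp_bits : 0;
}
```

With the boot-log values quoted above, `demo_bucket_order(12, 18)` gives 6, that is, buckets of 64 pages (256K with 4K pages). Covering 4G (2^20 pages of 4K) with the same 12 bits gives 8, which matches the 256-page (1M) buckets mentioned in the message.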
mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to balance pressure more
intelligently between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.
Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
			 bool workingset)
2014-04-04 05:47:51 +08:00
{
2017-11-04 01:30:42 +08:00
	eviction &= EVICTION_MASK;
2016-03-16 05:57:16 +08:00
	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
2016-07-29 06:46:08 +08:00
	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
2021-07-01 09:49:54 +08:00
	eviction = (eviction << WORKINGSET_SHIFT) | workingset;
2014-04-04 05:47:51 +08:00

2017-11-04 01:30:42 +08:00
	return xa_mk_value(eviction);
2014-04-04 05:47:51 +08:00
}

2016-07-29 06:46:08 +08:00
static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
2018-10-27 06:06:04 +08:00
			  unsigned long *evictionp, bool *workingsetp)
2014-04-04 05:47:51 +08:00
{
2017-11-04 01:30:42 +08:00
	unsigned long entry = xa_to_value(shadow);
2016-07-29 06:46:08 +08:00
	int memcgid, nid;
2018-10-27 06:06:04 +08:00
	bool workingset;
2014-04-04 05:47:51 +08:00

2021-07-01 09:49:54 +08:00
	workingset = entry & ((1UL << WORKINGSET_SHIFT) - 1);
	entry >>= WORKINGSET_SHIFT;
2014-04-04 05:47:51 +08:00
	nid = entry & ((1UL << NODES_SHIFT) - 1);
	entry >>= NODES_SHIFT;
2016-03-16 05:57:16 +08:00
	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
	entry >>= MEM_CGROUP_ID_SHIFT;
2014-04-04 05:47:51 +08:00

2016-03-16 05:57:16 +08:00
	*memcgidp = memcgid;
2016-07-29 06:46:08 +08:00
	*pgdat = NODE_DATA(nid);
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
Server benchmark results:
Single workload:
fio (buffered I/O): +[30, 32]%
IOPS BW
5.19-rc1: 2673k 10.2GiB/s
patch1-6: 3491k 13.3GiB/s
Single workload:
memcached (anon): -[4, 6]%
Ops/sec KB/sec
5.19-rc1: 1161501.04 45177.25
patch1-6: 1106168.46 43025.04
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.19-rc1
40.33% page_vma_mapped_walk (overhead)
21.80% lzo1x_1_do_compress (real work)
7.53% do_raw_spin_lock
3.95% _raw_spin_unlock_irq
2.52% vma_interval_tree_iter_next
2.37% folio_referenced_one
2.28% vma_interval_tree_subtree_search
1.97% anon_vma_interval_tree_iter_first
1.60% ptep_clear_flush
1.06% __zram_bvec_write
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
ChromeOS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
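The tier rule quoted in the commit message above (a page accessed N times through file descriptors is in tier order_base_2(N)) can be sketched as follows. `demo_tier` is a hypothetical stand-in written for this note; it reproduces the rounded-up base-2 logarithm with 0 and 1 both mapping to 0, and it does not model any cap on the number of tiers.

```c
/*
 * Hypothetical stand-in for order_base_2(): ceil(log2(n)), with n = 0
 * and n = 1 both yielding 0. Assumes n is well below the top bit of
 * unsigned long.
 */
static unsigned int demo_tier(unsigned long nr_accesses)
{
	unsigned int order = 0;

	while ((1UL << order) < nr_accesses)
		order++;
	return order;
}
```

Under this rule, N = 0 and N = 1 land in the first tier (tier 0), N = 2 in tier 1, N = 3 and N = 4 in tier 2, and N = 5 in tier 3, which is consistent with the message's description of the first tier as the baseline for N = 0,1.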
	*evictionp = entry;
2018-10-27 06:06:04 +08:00
	*workingsetp = workingset;
2014-04-04 05:47:51 +08:00
}
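The packing order used by pack_shadow() and its reversal in unpack_shadow() can be tried out in user space with the sketch below. The field widths (`DEMO_*_SHIFT`) are assumptions chosen for illustration; the kernel derives the real ones from its configuration. The `demo_*` names are hypothetical, and the xarray value encoding and the EVICTION_MASK step are deliberately left out.

```c
#include <stdbool.h>

/* Assumed field widths, for illustration only. */
#define DEMO_WORKINGSET_SHIFT	1
#define DEMO_NODES_SHIFT	2
#define DEMO_MEMCG_ID_SHIFT	16

struct demo_shadow {
	unsigned long eviction;
	int memcgid;
	int nid;
	bool workingset;
};

/* Fields are shifted in from the top down: eviction, memcg ID, node, flag. */
static unsigned long demo_pack(struct demo_shadow s)
{
	unsigned long v = s.eviction;

	v = (v << DEMO_MEMCG_ID_SHIFT) | (unsigned long)s.memcgid;
	v = (v << DEMO_NODES_SHIFT) | (unsigned long)s.nid;
	v = (v << DEMO_WORKINGSET_SHIFT) | (unsigned long)s.workingset;
	return v;
}

/* Unpacking peels the fields off in the reverse order, lowest bits first. */
static struct demo_shadow demo_unpack(unsigned long v)
{
	struct demo_shadow s;

	s.workingset = v & ((1UL << DEMO_WORKINGSET_SHIFT) - 1);
	v >>= DEMO_WORKINGSET_SHIFT;
	s.nid = v & ((1UL << DEMO_NODES_SHIFT) - 1);
	v >>= DEMO_NODES_SHIFT;
	s.memcgid = v & ((1UL << DEMO_MEMCG_ID_SHIFT) - 1);
	v >>= DEMO_MEMCG_ID_SHIFT;
	s.eviction = v;
	return s;
}
```

A round trip through `demo_pack()` and `demo_unpack()` returns the original fields as long as each value fits in its assumed width and the eviction value fits in the bits that remain.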
2023-10-08 17:35:57 +08:00
#ifdef CONFIG_EMM_WORKINGSET_TRACKING
static void workingset_eviction_file(struct lruvec *lruvec, unsigned long nr_pages)
{
	do {
		atomic_long_add(nr_pages, &lruvec->evicted_file);
	} while ((lruvec = parent_lruvec(lruvec)));
}

/*
 * If a page is evicted and never comes back, either the page is really cold
 * or it has been deleted on disk.
 *
 * A cold page could take up all of memory until kswapd starts to shrink it.
 * For a deleted page, the shadow entry is gone too, so there is no refault.
 *
 * If a page comes back before its shadow entry is released, that is a
 * refault: file page reclaim has been over-aggressive, and the page would
 * not have been evicted if all the pages, itself included, had stayed in
 * memory.
 */
static void workingset_refault_track(struct lruvec *lruvec, unsigned long refault_distance)
{
	do {
		/*
		 * No lock is taken, for better performance. Some events may
		 * get lost, but this is only a rough estimate anyway.
		 */
		WRITE_ONCE(lruvec->refault_count, READ_ONCE(lruvec->refault_count) + 1);
		WRITE_ONCE(lruvec->total_distance, READ_ONCE(lruvec->total_distance) + refault_distance);
	} while ((lruvec = parent_lruvec(lruvec)));
}
#else
static void workingset_eviction_file(struct lruvec *lruvec, unsigned long nr_pages)
{
}
static void workingset_refault_track(struct lruvec *lruvec, unsigned long refault_distance)
{
}
#endif

2024-04-01 23:43:38 +08:00
static inline struct mem_cgroup *try_get_flush_memcg(int memcgid)
{
	struct mem_cgroup *memcg;

	/*
	 * Look up the memcg associated with the stored ID. It might
	 * have been deleted since the folio's eviction.
	 *
	 * Note that in rare events the ID could have been recycled
	 * for a new cgroup that refaults a shared folio. This is
	 * impossible to tell from the available data. However, this
	 * should be a rare and limited disturbance, and activations
	 * are always speculative anyway. Ultimately, it's the aging
	 * algorithm's job to shake out the minimum access frequency
	 * for the active cache.
	 *
	 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
	 * would be better if the root_mem_cgroup existed in all
	 * configurations instead.
	 */
	rcu_read_lock();
	memcg = mem_cgroup_from_id(memcgid);
	if (!mem_cgroup_disabled() &&
	    (!memcg || !mem_cgroup_tryget(memcg))) {
		rcu_read_unlock();
		return NULL;
	}
	rcu_read_unlock();

	/*
	 * Flush stats (and potentially sleep) outside the RCU read section.
	 * XXX: With per-memcg flushing and thresholding, is ratelimiting
	 * still needed here?
	 */
	mem_cgroup_flush_stats_ratelimited(memcg);

	return memcg;
}

2024-04-01 19:50:55 +08:00
/**
 * lru_eviction - age non-resident entries as LRU ages
 *
 * As in-memory pages are aged, non-resident pages need to be aged as
 * well, in order for the refault distances later on to be comparable
 * to the in-memory dimensions. This function allows reclaim and LRU
 * operations to drive the non-resident aging along in parallel.
 */
static inline unsigned long lru_eviction(struct lruvec *lruvec, int type,
					 int nr_pages, int bits, int bucket_order)
{
	unsigned long eviction;

2024-04-01 17:43:25 +08:00
	if (type)
		workingset_eviction_file(lruvec, nr_pages);

2024-04-01 19:50:55 +08:00
	/*
	 * Reclaiming a cgroup means reclaiming all its children in a
	 * round-robin fashion. That means that each cgroup has an LRU
	 * order that is composed of the LRU orders of its child
	 * cgroups; and every page has an LRU position not just in the
	 * cgroup that owns it, but in all of that group's ancestors.
	 *
	 * So when the physical inactive list of a leaf cgroup ages,
	 * the virtual inactive lists of all its parents, including
	 * the root cgroup's, age as well.
	 */
	eviction = atomic_long_fetch_add_relaxed(nr_pages, &lruvec->evictions[type]);
	while ((lruvec = parent_lruvec(lruvec)))
		atomic_long_add(nr_pages, &lruvec->evictions[type]);

	/* Truncate the timestamp to fit in limited bits */
	eviction >>= bucket_order;
	eviction &= ~0UL >> (BITS_PER_LONG - bits);
	return eviction;
}
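The truncation step at the end of lru_eviction() can be exercised in isolation with the sketch below. `demo_timestamp` mirrors the shift-and-mask shown above. `demo_distance` is an assumption about how such a bucketed timestamp could be turned back into a distance in pages (widen it again, then subtract modulo the representable range); it is not taken from this file. All `demo_*` names are hypothetical, and both functions assume `bits > 0` and `bits + bucket_order` no larger than the width of `unsigned long`.

```c
#define DEMO_BITS_PER_LONG ((int)(sizeof(unsigned long) * 8))

/* Drop the low bucket_order bits, then keep only `bits` bits of timestamp. */
static unsigned long demo_timestamp(unsigned long counter, int bits,
				    int bucket_order)
{
	counter >>= bucket_order;
	counter &= ~0UL >> (DEMO_BITS_PER_LONG - bits);
	return counter;
}

/*
 * Assumed inverse: scale the stored timestamp back to page units and
 * subtract it from the current counter, wrapping at 2^(bits + bucket_order).
 * The result is rounded up to the start of the eviction's bucket.
 */
static unsigned long demo_distance(unsigned long now, unsigned long stored,
				   int bits, int bucket_order)
{
	unsigned long mask = ~0UL >> (DEMO_BITS_PER_LONG - bits - bucket_order);

	return (now - (stored << bucket_order)) & mask;
}
```

With 12 timestamp bits and a bucket order of 6, an eviction at counter value 1000 is stored as 15 (bucket start 960). A refault at counter value 5000 then yields a distance of 4040 pages rather than the exact 4000, which is the granularity traded away for range.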

/*
 * lru_distance - calculate the refault distance based on non-resident age
 */
static inline unsigned long lru_distance(struct lruvec *lruvec, int type,
					 unsigned long eviction, int bits,
					 int bucket_order)
{
	unsigned long refault = atomic_long_read(&lruvec->evictions[type]);

workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed MGLRU not working very well on certain workflows, which is
observed on some heavily stressed databases. That is when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
LRU starts from right) of file pages also larger than total memory.
All file pages are stuck on the oldest generation and getting
read-in then evicted permutably. Despite anon pages being idle,
they never get aged. PID controller didn't kickin until there are some
minor access pattern changes. And file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in memory and reduce the IO workload significantly.
So apply it to MGLRU as well. The updated refault-distance model
fits MGLRU well in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
Some adjustments are made to fit the logic better and to make the
refault distance contribute to page tiering and PID refault detection
in MGLRU:
- If a tier-0 page has a qualified refault distance, just promote
  it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault distance, mark it as
  active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
  refault distance and increase the PID-controlled refault rate
  of the updated tier, in the hope that similar pages will be protected
  next time upon eviction.
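
The first two rules above amount to a small decision table. A minimal sketch, assuming hypothetical names throughout (enum refault_action and refault_action_for() do not come from the patch), and taking "qualified" as an input rather than computing it:

```c
#include <assert.h>

/* Hypothetical outcome labels for a refaulting page. */
enum refault_action {
	REFAULT_NO_PROTECTION,		/* distance does not qualify */
	REFAULT_TO_SECOND_OLDEST_GEN,	/* tier 0: promote to a higher tier */
	REFAULT_TO_YOUNGEST_GEN		/* tier >= 1: mark active */
};

/*
 * Map (tier, qualified refault distance) to the action described in the
 * list above. The reference-count and PID bookkeeping from the third
 * rule is deliberately left out of this sketch.
 */
static enum refault_action refault_action_for(int tier, int distance_qualifies)
{
	if (!distance_qualifies)
		return REFAULT_NO_PROTECTION;
	if (tier == 0)
		return REFAULT_TO_SECOND_OLDEST_GEN;
	return REFAULT_TO_YOUNGEST_GEN;
}
```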
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking, and
workingset_restore_* now stands for all pages promoted for
any reason.
The following benchmark showed a 5x improvement. To simulate the
optimized workflow, I set up a 3-replica MongoDB cluster, each replica
in a different cgroup, using 5 GB of WiredTiger cache and 10 GB of
oplog, on a 32G VM with no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.
The test is done on an EPYC 7K62 with 32G RAM and a SATA SSD:
- Before (with ZRAM enabled; the result won't change whether
  any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
              Executed        Time (µs)       Rate
  STOCK_LEVEL 577             27584645283.7   0.02 txn/s
------------------------------------------------------------------
  TOTAL       577             27584645283.7   0.02 txn/s

$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           31849        6829         790          23       24229       24542
Swap:          31848           0       31848
- Patched (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
              Executed        Time (µs)       Rate
  STOCK_LEVEL 2542            27121571486.2   0.09 txn/s
------------------------------------------------------------------
  TOTAL       2542            27121571486.2   0.09 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           31849        7093         283           4       24472       24289
Swap:          31848        1652       30196
The performance is 5x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement. There is no regression on other tests so far,
and a performance gain is observed on file-page-heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
	eviction &= ~0UL >> (BITS_PER_LONG - bits);
2024-04-01 19:50:55 +08:00
	eviction <<= bucket_order;

	/*
	 * The unsigned subtraction here gives an accurate distance
	 * across non-resident age overflows in most cases. There is a
	 * special case: usually, shadow entries have a short lifetime
	 * and are either refaulted or reclaimed along with the inode
	 * before they get too old. But it is not impossible for the
	 * non-resident age to lap a shadow entry in the field, which
	 * can then result in a false small refault distance, leading
	 * to a false activation should this old entry actually
	 * refault again. However, earlier kernels used to deactivate
	 * unconditionally with *every* reclaim invocation for the
	 * longest time, so the occasional inappropriate activation
	 * leading to pressure on the active list is not a problem.
	 */
	return (refault - eviction) & (~0UL >> (BITS_PER_LONG - bits));
}

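The truncation and the masked unsigned subtraction can be exercised outside the kernel. The sketch below mirrors the arithmetic shown above on plain unsigned long values. pack_eviction() and refault_distance() are hypothetical names, BITS_PER_LONG is derived from the host's unsigned long, and bits is assumed to satisfy 0 < bits <= BITS_PER_LONG.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Bucket the counter, then keep only its low `bits` bits as a timestamp. */
static unsigned long pack_eviction(unsigned long counter, int bits,
				   int bucket_order)
{
	counter >>= bucket_order;
	counter &= ~0UL >> (BITS_PER_LONG - bits);
	return counter;
}

/*
 * Distance from a packed timestamp to the current counter, modulo 2^bits.
 * The subtraction is unsigned, so a counter that has wrapped past the
 * timestamp once still yields the true distance; a counter that has
 * lapped it yields a falsely small one, as the comment above warns.
 */
static unsigned long refault_distance(unsigned long now, unsigned long packed,
				      int bits, int bucket_order)
{
	packed &= ~0UL >> (BITS_PER_LONG - bits);
	packed <<= bucket_order;
	return (now - packed) & (~0UL >> (BITS_PER_LONG - bits));
}
```

With bits = 8 and bucket_order = 0, a counter of 506 packs to 250. A refault at counter 516 gives a distance of 10, and so does a refault at 772, one full lap of 256 later, which is the false small distance described in the comment.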
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
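
The interplay of max_seq and min_seq described above can be modelled with a toy ring of generations. This is only a sketch: struct toy_lrugen and the toy_* helpers are hypothetical, the values 4 and 2 for MAX_NR_GENS and MIN_NR_GENS are assumed for illustration, a single min_seq replaces the per-type ones, and "approaches MIN_NR_GENS" is read here as "has dropped to MIN_NR_GENS or fewer".

```c
#include <assert.h>

#define MAX_NR_GENS 4	/* assumed for illustration */
#define MIN_NR_GENS 2	/* assumed for illustration */

/* Hypothetical, heavily simplified view of one lruvec's generations. */
struct toy_lrugen {
	unsigned long max_seq;		/* youngest generation */
	unsigned long min_seq;		/* oldest generation */
	long nr_pages[MAX_NR_GENS];	/* pages per ring slot */
};

static int toy_gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

static int toy_nr_gens(const struct toy_lrugen *g)
{
	return (int)(g->max_seq - g->min_seq + 1);
}

/* Aging is wanted once the number of generations gets down to the minimum. */
static int toy_should_age(const struct toy_lrugen *g)
{
	return toy_nr_gens(g) <= MIN_NR_GENS;
}

/* Aging: produce a new youngest generation if the ring has a free slot. */
static int toy_inc_max_seq(struct toy_lrugen *g)
{
	if (toy_nr_gens(g) >= MAX_NR_GENS)
		return 0;
	g->max_seq++;
	return 1;
}

/* Eviction: retire empty oldest generations, keeping MIN_NR_GENS of them. */
static int toy_try_inc_min_seq(struct toy_lrugen *g)
{
	int advanced = 0;

	while (g->min_seq + MIN_NR_GENS <= g->max_seq &&
	       g->nr_pages[toy_gen_from_seq(g->min_seq)] == 0) {
		g->min_seq++;
		advanced++;
	}
	return advanced;
}
```

Starting from min_seq = 0, max_seq = 3 with the two oldest slots empty, eviction advances min_seq to 2, after which aging is wanted and max_seq can move to 4, reusing ring slot 0.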
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
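
The tier rule stated above, tier = order_base_2(N), is easy to tabulate. The following user-space sketch is illustrative only: order_base_2_sketch() and tier_from_fd_accesses() are hypothetical helpers that reproduce the rounding of the kernel's order_base_2() (ceil(log2(n)), with 0 and 1 both mapping to 0) as described in this message, not the code the patch adds.

```c
#include <assert.h>
#include <limits.h>

/* ceil(log2(n)), with n = 0 and n = 1 both giving 0. */
static int order_base_2_sketch(unsigned long n)
{
	int max_order = (int)(sizeof(n) * CHAR_BIT) - 1;
	int order = 0;

	while (order < max_order && (1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static int tier_from_fd_accesses(unsigned long n)
{
	return order_base_2_sketch(n);
}
```

Under this rule N = 0 and N = 1 both land in the first tier (tier 0), N = 2 in tier 1, N = 3 and N = 4 in tier 2, and N = 5 in tier 3, which matches the "(N=0,1) as a baseline" wording above.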
Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
                IOPS         BW
      5.19-rc1: 2673k        10.2GiB/s
      patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
                Ops/sec      KB/sec
      5.19-rc1: 1161501.04   45177.25
      patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  folio_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    ChromeOS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
#ifdef CONFIG_LRU_GEN

static void *lru_gen_eviction(struct folio *folio)
{
	int hist;
	unsigned long token;
	struct lruvec *lruvec;
mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
Patch series "mm: multi-gen LRU: memcg LRU", v3.
Overview
========
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).
Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.
Its memory bloat is a pointer to each lruvec and negligible to each
pglist_data. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not affect
the worst-case complexity O(n). Therefore, on average, it has a sublinear
complexity in contrast to the current linear complexity.
The basic structure of an memcg LRU can be understood by an analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations), i.e., the counterparts
to the active and the inactive;
2. The increment of max_seq triggers promotion, i.e., the counterpart
to activation;
3. Other events trigger similar operations, e.g., offlining an memcg
triggers demotion, i.e., the counterpart to deactivation.
In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
and reduces latency without affecting fairness over some time.
The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/
The following is a simple test to quickly verify its effectiveness.
Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.
Desired outcome:
1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
over mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
to 100%.
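
The two ratios in the desired outcome can be computed as follows. This is a sketch under stated assumptions: pgsteal_cv() and reclaim_ratio() are hypothetical helper names, the standard deviation is the population form (the message does not say which form was used), pgsteal and the requested amount are assumed to be in the same unit, and a small Newton iteration stands in for sqrt() so the block needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Newton's method square root, so that no libm linkage is needed. */
static double sqrt_newton(double x)
{
	double r;
	int i;

	if (x <= 0.0)
		return 0.0;
	r = x > 1.0 ? x : 1.0;
	for (i = 0; i < 100; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/* stddev(pgsteal) / mean(pgsteal); 0 means perfectly even reclaim. */
static double pgsteal_cv(const unsigned long *pgsteal, size_t n)
{
	double mean = 0.0, var = 0.0;
	size_t i;

	if (n == 0)
		return 0.0;
	for (i = 0; i < n; i++)
		mean += (double)pgsteal[i];
	mean /= (double)n;
	if (mean == 0.0)
		return 0.0;
	for (i = 0; i < n; i++) {
		double d = (double)pgsteal[i] - mean;
		var += d * d;
	}
	var /= (double)n;
	return sqrt_newton(var) / mean;
}

/* sum(pgsteal) / sum(requested); 1.0 means reclaim matched the request. */
static double reclaim_ratio(const unsigned long *pgsteal, size_t n,
			    unsigned long requested)
{
	double sum = 0.0;
	size_t i;

	if (requested == 0)
		return 0.0;
	for (i = 0; i < n; i++)
		sum += (double)pgsteal[i];
	return sum / (double)requested;
}
```

For example, four memcgs with identical pgsteal counts give a first ratio of 0, while two memcgs with counts 50 and 150 give 0.5.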
Actual outcome [1]:
                                   MGLRU off  MGLRU on
  stddev(pgsteal) / mean(pgsteal)  75%        20%
  sum(pgsteal) / sum(requested)    425%       95%

####################################################################
MEMCGS=128

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done

start() {
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

    fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done

sleep 600

for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################
[1]: This was obtained from running the above script (touches less
than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
hour.
This patch (of 8):
The new name lru_gen_folio will be more distinct from the coming
lru_gen_memcg.
Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-22 12:18:59 +08:00
	struct lru_gen_folio *lrugen;
2022-09-18 16:00:03 +08:00
	int type = folio_is_file_lru(folio);
	int delta = folio_nr_pages(folio);
	int refs = folio_lru_refs(folio);
	int tier = lru_tier_from_refs(refs);
	struct mem_cgroup *memcg = folio_memcg(folio);
	struct pglist_data *pgdat = folio_pgdat(folio);

2024-04-01 23:43:38 +08:00
	BUILD_BUG_ON(LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed that MGLRU does not work very well on certain workflows, as
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory and the access distance of
file pages (the left-shift time of a page before it gets activated,
considering that the LRU starts from the right) is also larger than
total memory. All file pages are stuck in the oldest generation, getting
read in and then evicted perpetually. Although the anon pages are idle,
they never get aged. The PID controller does not kick in until there are
some minor access pattern changes, and file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can hold part of the workingset
in memory and so reduce the IO workload significantly. So apply it to
MGLRU as well. The updated refault-distance model fits MGLRU well in
most cases, if we just consider the last two generations as the
inactive LRU and the first two generations as the active LRU.
Some adjustments are made to fit the logic better and to make the
refault distance contribute to the page tiering and PID refault
detection of MGLRU:
- If a tier-0 page has a qualified refault distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault distance, mark it as
active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
refault distance and increase the PID-controlled refault rate
of the updated tier, in the hope that similar pages will be
protected next time upon eviction.
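The first two rules above can be read as a small decision function. The
sketch below is illustrative only: the enum, the function name and the
"no change" fallback for pages without a qualified refault distance are
assumptions of this example, not code from the patch.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; the patch does not define this enum. */
enum toy_refault_action {
	TOY_NO_CHANGE,		/* assumption: refault distance not qualified */
	TOY_TO_SECOND_OLDEST,	/* tier 0: higher tier, second oldest gen */
	TOY_TO_YOUNGEST,	/* tier >= 1: mark active, youngest gen */
};

/* Map (tier, qualified refault distance) to the action described above. */
static inline enum toy_refault_action
toy_classify_refault(unsigned int tier, bool qualified)
{
	if (!qualified)
		return TOY_NO_CHANGE;

	return tier == 0 ? TOY_TO_SECOND_OLDEST : TOY_TO_YOUNGEST;
}
```

The third rule (bumping the reference count and the per-tier refault
rate) is bookkeeping on top of this decision and is left out of the
sketch.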
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking, and
workingset_restore_* now stands for all pages promoted for
any reason.
The following benchmark showed a 4-5x improvement. To simulate the
optimized workflow, I set up a 3-replica MongoDB cluster, each replica
in a different cgroup, using 5 GB of WiredTiger cache and 10 GB of
oplog, on a 32G VM with no limit set. The benchmark was run using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.
The test was done on an EPYC 7K62 with 32G RAM and a SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
The throughput is about 4.4x that of the unpatched kernel (2542 vs. 577
STOCK_LEVEL transactions), and the idle anon pages can now get swapped
out as expected. Testing with lower stress also shows an improvement.
No regression has been seen on other tests so far, and a performance
gain is observed on file-page-heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
hist = lru_hist_of_min_seq(lruvec, type);
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
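As a rough illustration of the aging trigger described above, the sketch
below counts the generations spanned by min_seq and max_seq and compares
the count against MIN_NR_GENS. The constant value and the helper names
are assumptions made for this example, and "approaches" is read here as
"at or below"; the real kernel logic is more involved.

```c
#include <stdbool.h>

#define MIN_NR_GENS 2UL	/* assumed value, not stated in the text above */

/* Number of generations currently spanned by [min_seq, max_seq]. */
static inline unsigned long toy_nr_gens(unsigned long max_seq,
					unsigned long min_seq)
{
	return max_seq - min_seq + 1;
}

/* One reading of "approaches MIN_NR_GENS": at or below the minimum. */
static inline bool toy_should_inc_max_seq(unsigned long max_seq,
					  unsigned long min_seq)
{
	return toy_nr_gens(max_seq, min_seq) <= MIN_NR_GENS;
}
```

With max_seq=6 and min_seq=5 there are two generations, so this sketch
would ask the aging to produce a new one; with max_seq=8 and min_seq=5
there are four, so it would not.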
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
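The min_seq rule above can be sketched with a toy structure that tracks
only page counts per generation slot. The structure, the constant values
and the lower bound that keeps MIN_NR_GENS generations alive are
assumptions of this example, not the kernel's data layout; a single page
type is modeled for brevity.

```c
#include <stdbool.h>

#define MIN_NR_GENS 2UL	/* assumed value */
#define MAX_NR_GENS 4UL	/* assumed value */

/* Toy stand-in for the per-lruvec generation state (one type only). */
struct toy_lrugen {
	unsigned long max_seq;
	unsigned long min_seq;
	unsigned long nr_pages[MAX_NR_GENS];	/* indexed by seq % MAX_NR_GENS */
};

/* Advance min_seq past empty oldest generations; true if it moved. */
static inline bool toy_try_inc_min_seq(struct toy_lrugen *g)
{
	bool moved = false;

	while (g->min_seq + MIN_NR_GENS <= g->max_seq &&
	       g->nr_pages[g->min_seq % MAX_NR_GENS] == 0) {
		g->min_seq++;
		moved = true;
	}

	return moved;
}
```

The modulo turns the monotonically increasing sequence numbers into slots
of a fixed-size ring, so an emptied slot can be reused by a future
youngest generation.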
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
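To make the tier mapping concrete, the sketch below computes
order_base_2(N) as the ceiling of log2(N), with 0 for N=0 and N=1, which
matches the "first tier (N=0,1)" wording above. The helper names are
hypothetical and the loop is a plain userspace stand-in for the kernel
macro.

```c
/* ceil(log2(n)), with 0 for n == 0 and n == 1. */
static inline unsigned int toy_order_base_2(unsigned int n)
{
	unsigned int order = 0;

	/* 64-bit shift so that the comparison cannot overflow. */
	while ((1ULL << order) < n)
		order++;

	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static inline unsigned int toy_tier_from_accesses(unsigned int n)
{
	return toy_order_base_2(n);
}
```

Under this mapping N=0 and N=1 land in tier 0, N=2 in tier 1, N=3 and N=4
in tier 2, and N=5 through N=8 in tier 3, so each additional tier covers
twice as many access counts as the previous one.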
Server benchmark results:
Single workload:
fio (buffered I/O): +[30, 32]%
IOPS BW
5.19-rc1: 2673k 10.2GiB/s
patch1-6: 3491k 13.3GiB/s
Single workload:
memcached (anon): -[4, 6]%
Ops/sec KB/sec
5.19-rc1: 1161501.04 45177.25
patch1-6: 1106168.46 43025.04
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.19-rc1
40.33% page_vma_mapped_walk (overhead)
21.80% lzo1x_1_do_compress (real work)
7.53% do_raw_spin_lock
3.95% _raw_spin_unlock_irq
2.52% vma_interval_tree_iter_next
2.37% folio_referenced_one
2.28% vma_interval_tree_subtree_search
1.97% anon_vma_interval_tree_iter_first
1.60% ptep_clear_flush
1.06% __zram_bvec_write
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
ChromeOS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
workingset, lru_gen: apply refault-distance based protection
2024-04-01 23:43:38 +08:00
token = max(refs - 1, 0);
token <<= LRU_GEN_EVICTION_BITS;
token |= lru_eviction(lruvec, type, delta,
		      LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
mm: multi-gen LRU: minimal implementation
2022-09-18 16:00:03 +08:00
atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
}
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
(and IO to be done) within a range of a file, allowing for more
frequent syncing when and where there is IO capacity, and batching
when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for the x86 architecture.
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. Users can choose between
mincore and cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, with five runs
each:
Using cachestat:
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s
Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
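The cost for mincore on such a file can be estimated from the size of
the residency vector alone, since mincore() reports one byte per page of
the queried range. The helper below is a hypothetical illustration; the
4096-byte page size and the reading of "2TB" as 2 TiB are assumptions.

```c
#include <stdint.h>

/* Bytes of residency vector mincore() fills for a range: one per page. */
static inline uint64_t toy_mincore_vec_bytes(uint64_t length,
					     uint64_t page_size)
{
	return (length + page_size - 1) / page_size;
}
```

For a 2 TiB range with 4 KiB pages this comes to 536870912 bytes, i.e.
512 MiB of per-page data that the kernel must write and userspace must
then aggregate, whereas cachestat returns a handful of counters.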
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows for a 64-bit length argument, even on 32-bit
architectures (which was previously not possible due to the limit on the
number of syscall arguments). Furthermore, it eliminates the need for
compatibility handling - every user can use the same ABI.
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
would test if an evicted page is recently evicted.
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
/*
 * Tests if the shadow entry is for a folio that was recently evicted.
2023-05-22 19:20:58 +08:00
 * Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
(and IO to be done) within a range of a file, allowing for more
frequent syncing when and where there is IO capacity, and batching
when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for x86 architecture.
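A minimal userspace sketch of invoking the syscall described above. The wrapper name query_page_cache(), the struct names cs_range and cs_stats, and the fallback syscall number 451 are assumptions made for illustration; the struct layouts mirror what I understand the upstream uapi definitions to be and should be checked against the installed kernel headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed syscall number; prefer the value from your kernel headers. */
#ifndef __NR_cachestat
#define __NR_cachestat 451
#endif

/* Local mirrors of the uapi structs (assumed layout, hypothetical names). */
struct cs_range {
	uint64_t off;	/* start of the byte range */
	uint64_t len;	/* length in bytes; 0 is taken to mean "to end of file" */
};

struct cs_stats {
	uint64_t nr_cache;		/* pages currently in the page cache */
	uint64_t nr_dirty;		/* dirty pages */
	uint64_t nr_writeback;		/* pages marked for writeback */
	uint64_t nr_evicted;		/* pages evicted from the page cache */
	uint64_t nr_recently_evicted;	/* evicted pages that are still "recent" */
};

/*
 * Hypothetical wrapper: returns 0 on success, or -1 with errno set
 * (for example ENOSYS on kernels that do not provide cachestat).
 */
static int query_page_cache(int fd, uint64_t off, uint64_t len,
			    struct cs_stats *out)
{
	struct cs_range range = { .off = off, .len = len };

	assert(out != NULL);
	return (int)syscall(__NR_cachestat, fd, &range, out, 0);
}
```

A caller would open the file read-only, pass the fd with off = 0 and len = 0 to cover the whole file, and read the counters from the output struct. No mmap of the file is needed, which is the main difference from the mincore() approach.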
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. Users can choose between
mincore or cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, each for five
runs:
Using cachestat:
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s
Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows a 64-bit length argument even on 32-bit architectures
(which was previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
 */
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed MGLRU not working very well on certain workflows, as
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
LRU starts from right) of file pages is also larger than total memory.
All file pages are stuck on the oldest generation and keep getting
read in and then evicted repeatedly. Despite anon pages being idle,
they never get aged. The PID controller doesn't kick in until there are
some minor access pattern changes, and file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.
So apply it for MGLRU as well. The updated refault-distance model
fits well for MGLRU in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
Some adjustments are made to fit the logic better, and to make the
refault distance contribute to the page tiering and PID refault detection
of MGLRU:
- If a tier-0 page has a qualified refault-distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault-distance, mark it as
active and send it to the youngest gen.
- Increase the reference of every page that has a qualified
refault-distance and increase the PID controlled refault rate
of the updated tier, in the hope that similar pages will be protected
next time upon eviction.
NOTE: This also changed the meaning of workingset_* fields in
/proc/vmstat, workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking,
workingset_restore_* now stands for all pages promoted by
any reason.
The following benchmark showed a more than 4x improvement. To simulate the
optimized workflow, I set up a 3-replica MongoDB cluster, each replica in a
different cgroup, using 5 GB of WiredTiger cache and 10 GB of oplog, on a
32G VM with no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run only the STOCK_LEVEL
query, to simulate slow queries and get a stable result.
The test is done on an EPYC 7K62 with 32G of RAM and a SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
The performance is about 4.4x better than before (2542 vs. 577 STOCK_LEVEL
transactions executed), and the idle anon pages can now get swapped out as
expected. Testing with lower stress also shows an improvement.
There is no regression on other tests so far, and a performance gain
is observed on file page heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
static inline bool lru_gen_test_recent(struct lruvec *lruvec, bool type,
				       unsigned long distance)
{
	int hist;
	unsigned long evicted = 0;
	struct lru_gen_folio *lrugen;

	lrugen = &lruvec->lrugen;
	hist = lru_hist_of_min_seq(lruvec, type);

	for (int tier = 0; tier < MAX_NR_TIERS; tier++)
		evicted += atomic_long_read(&lrugen->evicted[hist][type][tier]);

	return distance <= evicted;
}
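The recency test above can be sketched in plain userspace C, under stated assumptions: the kernel's per-lruvec atomic counters are replaced by an ordinary array, SKETCH_NR_TIERS stands in for MAX_NR_TIERS with an assumed value of 4, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_TIERS 4	/* stand-in for MAX_NR_TIERS; value assumed */

/*
 * Sum the per-tier eviction counters of the oldest generation, then treat
 * the shadow entry as "recent" if its refault distance does not exceed
 * that total.
 */
static bool sketch_test_recent(const unsigned long evicted_per_tier[SKETCH_NR_TIERS],
			       unsigned long distance)
{
	unsigned long evicted = 0;

	for (int tier = 0; tier < SKETCH_NR_TIERS; tier++)
		evicted += evicted_per_tier[tier];

	return distance <= evicted;
}

/* Small usage example: 10 + 5 + 3 + 2 = 20 evictions in total. */
static void sketch_recent_self_check(void)
{
	const unsigned long counters[SKETCH_NR_TIERS] = { 10, 5, 3, 2 };

	assert(sketch_test_recent(counters, 20));	/* distance == total */
	assert(!sketch_test_recent(counters, 21));	/* one past the total */
}
```

The comparison is inclusive, so a distance exactly equal to the number of evictions still counts as recent.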

enum lru_gen_refault_distance {
	DISTANCE_SHORT,
	DISTANCE_MID,
	DISTANCE_LONG,
	DISTANCE_NONE,
};

static inline int lru_gen_test_refault(struct lruvec *lruvec, bool file,
				       unsigned long distance, bool can_swap)
{
	unsigned long total;

	total = lruvec_page_state(lruvec, NR_ACTIVE_FILE) +
		lruvec_page_state(lruvec, NR_INACTIVE_FILE);

	if (can_swap)
		total += lruvec_page_state(lruvec, NR_ACTIVE_ANON) +
			 lruvec_page_state(lruvec, NR_INACTIVE_ANON);

	/* Imagine having an extra gen outside of available memory */
	if (distance <= total / MAX_NR_GENS)
		return DISTANCE_SHORT;
	if (distance <= total / MIN_NR_GENS)
		return DISTANCE_MID;
	if (distance <= total)
		return DISTANCE_LONG;
	return DISTANCE_NONE;
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or direct
table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for performance issues
diagnostic.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
(and IO to be done) within a range of a file, allowing for more
frequent syncing when and where there is IO capacity, and batching
when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also include a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for x86 architecture.
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. User can choose between
mincore or cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta's server machine, each for five
runs:
Using cachestat
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s
Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
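As a quick check of the figures quoted above, the ratios can be worked out
directly. This is only arithmetic on the reported timings; the helper name
slowdown() is made up for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* How many times longer slow_s is than fast_s (both in seconds). */
static double slowdown(double slow_s, double fast_s)
{
	assert(fast_s > 0.0);
	return slow_s / fast_s;
}

/* Print the two ratios derived from the timings reported above. */
static void print_comparison(void)
{
	/* Whole tree, median "real" time: mincore vs. cachestat. */
	printf("tree walk:   %.2fx\n", slowdown(102.352, 33.377));
	/* Single 2TB sparse file, "real" time: mincore vs. cachestat. */
	printf("sparse file: %.0fx\n", slowdown(37.510, 0.009));
}
```

The whole-tree walk is roughly three times slower with mincore, while the
sparse-file case differs by more than three orders of magnitude. The 37.5s
that mincore spends on the single file also exceeds the 33.4s median that
cachestat needs for the entire tree.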
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows for a 64-bit length argument, even on 32-bit architectures
(which was previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
|
|
|
}
|
|
|
|
|
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
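The sequence-number bookkeeping described above can be sketched as follows.
This is an illustrative model, not the kernel code: struct gen_window and
the helper names are hypothetical, the values of MIN_NR_GENS and
MAX_NR_GENS are assumed, and "approaches MIN_NR_GENS" is read here as "the
number of generations is no larger than MIN_NR_GENS".

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2UL	/* assumed value for this sketch */
#define MAX_NR_GENS 4UL	/* assumed value for this sketch */

/* Hypothetical model of the sliding window of generations for one lruvec. */
struct gen_window {
	unsigned long min_seq;	/* oldest generation, consumed by eviction */
	unsigned long max_seq;	/* youngest generation, produced by aging */
};

/* Index into lrugen->lists[] for a given sequence number. */
static unsigned long gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Number of generations currently in the window. */
static unsigned long nr_gens(const struct gen_window *w)
{
	assert(w->max_seq >= w->min_seq);
	return w->max_seq - w->min_seq + 1;
}

/* Aging is due once the window has shrunk to MIN_NR_GENS generations. */
static bool aging_due(const struct gen_window *w)
{
	return nr_gens(w) <= MIN_NR_GENS;
}

/* Aging step: produce a new youngest generation. */
static void inc_max_seq(struct gen_window *w)
{
	assert(nr_gens(w) < MAX_NR_GENS);
	w->max_seq++;
}
```

Because the list index wraps modulo MAX_NR_GENS, incrementing max_seq
demotes every existing generation by one position without touching any
list, which is why demotion caused by the increment needs no LRU list
operations.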
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
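The mapping from access count to tier can be illustrated with a small
sketch. order_base_2_sketch() reimplements the rounding-up base-2 logarithm
for this example only; MAX_NR_TIERS and the clamping to the highest tier
are assumptions, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_TIERS 4U	/* assumed value for this sketch */

/* Smallest order such that (1 << order) >= n; 0 for n = 0 and n = 1. */
static unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned int order = 0;

	while (order < sizeof(n) * 8 - 1 && (1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static unsigned int tier_from_accesses(unsigned long n)
{
	unsigned int tier = order_base_2_sketch(n);

	/* Assumption: counts beyond the last tier are clamped to it. */
	return tier < MAX_NR_TIERS ? tier : MAX_NR_TIERS - 1;
}

/* Only tiers above the first one (N > 1) are candidates for protection. */
static bool tier_may_be_protected(unsigned int tier)
{
	assert(tier < MAX_NR_TIERS);
	return tier > 0;
}
```

With this mapping, N=0 and N=1 both land in the first tier, N=2 in the
second, N=3 and N=4 in the third, and so on, so each further tier requires
roughly twice as many accesses as the previous one.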
Server benchmark results:
Single workload:
  fio (buffered I/O): +[30, 32]%
              IOPS         BW
    5.19-rc1: 2673k        10.2GiB/s
    patch1-6: 3491k        13.3GiB/s
Single workload:
  memcached (anon): -[4, 6]%
              Ops/sec      KB/sec
    5.19-rc1: 1161501.04   45177.25
    patch1-6: 1106168.46   43025.04
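The bracketed ranges can be cross-checked against the raw numbers with a
one-line relative-change formula. This is plain arithmetic on the figures
above; percent_change() is a name invented for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from before to after, in percent. */
static double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/* Print the relative changes for the fio and memcached results. */
static void print_changes(void)
{
	printf("fio IOPS:          %+.1f%%\n", percent_change(2673.0, 3491.0));
	printf("fio BW:            %+.1f%%\n", percent_change(10.2, 13.3));
	printf("memcached Ops/sec: %+.1f%%\n",
	       percent_change(1161501.04, 1106168.46));
	printf("memcached KB/sec:  %+.1f%%\n",
	       percent_change(45177.25, 43025.04));
}
```

Both fio columns come out slightly above +30%, and both memcached columns
slightly below -4.5%, which falls inside the reported [30, 32] and [4, 6]
ranges.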
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.19-rc1
40.33% page_vma_mapped_walk (overhead)
21.80% lzo1x_1_do_compress (real work)
7.53% do_raw_spin_lock
3.95% _raw_spin_unlock_irq
2.52% vma_interval_tree_iter_next
2.37% folio_referenced_one
2.28% vma_interval_tree_subtree_search
1.97% anon_vma_interval_tree_iter_first
1.60% ptep_clear_flush
1.06% __zram_bvec_write
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
ChromeOS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
|
|
|
static void lru_gen_refault(struct folio *folio, void *shadow)
|
|
|
|
{
|
2024-04-01 23:29:44 +08:00
|
|
|
int memcgid;
|
2023-05-24 04:59:21 +08:00
|
|
|
bool recent;
|
2022-09-18 16:00:03 +08:00
|
|
|
bool workingset;
|
|
|
|
unsigned long token;
|
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed MGLRU not working very well on certain workflows, which is
observed on some heavily stressed databases. That is when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
LRU starts from right) of file pages is also larger than total memory.
All file pages are stuck on the oldest generation and keep getting
read in and then evicted perpetually. Despite anon pages being idle,
they never get aged. The PID controller didn't kick in until there were
some minor access pattern changes. And file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.
So apply it for MGLRU as well. The updated refault-distance model
fits well for MGLRU in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
Some adjustments are made to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
of MGLRU:
- If a tier-0 page has a qualified refault distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault distance, mark it as
active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
refault distance and increase the PID-controlled refault rate
of the updated tier, in the hope that similar pages will be protected
next time upon eviction.
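The three rules above can be summarized as a small decision function. This
is a sketch of the rules as stated in this message, not the patch itself:
the enum, struct and function names are hypothetical, MAX_NR_TIERS is an
assumed value, and leaving an unqualified page in the oldest generation
with its reference count unchanged is an assumption about the default case.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_TIERS 4U	/* assumed value for this sketch */

/* Where a refaulting page is placed. */
enum target_gen { GEN_OLDEST, GEN_SECOND_OLDEST, GEN_YOUNGEST };

/* Outcome of handling one refault. */
struct refault_action {
	enum target_gen gen;
	bool mark_active;
	unsigned int refs;	/* reference count after the refault */
};

/*
 * qualified: whether the refault distance passed the check.
 * tier/refs: state recorded for the page when it was evicted.
 */
static struct refault_action decide_refault(unsigned int tier,
					    unsigned int refs,
					    bool qualified)
{
	/* Assumed default: stay in the oldest generation, refs unchanged. */
	struct refault_action act = { GEN_OLDEST, false, refs };

	assert(tier < MAX_NR_TIERS);
	if (!qualified)
		return act;

	/* Every qualified refault bumps the reference count. */
	act.refs = refs + 1;
	if (tier == 0) {
		/* Tier 0: promote to a higher tier, second oldest gen. */
		act.gen = GEN_SECOND_OLDEST;
	} else {
		/* Tier >= 1: mark active and send to the youngest gen. */
		act.gen = GEN_YOUNGEST;
		act.mark_active = true;
	}
	return act;
}
```

The per-tier refault accounting fed to the PID controller is left out of
this sketch, since it depends on state that is not described here.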
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking, and
workingset_restore_* now stands for all pages promoted for
any reason.
The following benchmark showed a 5x improvement. To simulate the optimized
workflow, I set up a 3-replica MongoDB cluster, each replica in a different
cgroup, using 5 GB of WiredTiger cache and 10 GB of oplog, on a 32G VM with
no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.
The test is done on an EPYC 7K62 with 32G RAM and a SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
The performance is 5x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement.
There is no regression on other tests so far, and a performance gain
is observed on file page heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
|
|
|
int hist, tier, refs;
|
2022-09-18 16:00:03 +08:00
|
|
|
struct lruvec *lruvec;
|
2024-04-01 23:43:38 +08:00
|
|
|
struct mem_cgroup *memcg;
|
2024-04-01 23:29:44 +08:00
|
|
|
struct pglist_data *pgdat;
|
mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
Patch series "mm: multi-gen LRU: memcg LRU", v3.
Overview
========
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).
Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.
Its memory bloat is a pointer to each lruvec and negligible to each
pglist_data. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not affect
the worst-case complexity O(n). Therefore, on average, it has a sublinear
complexity in contrast to the current linear complexity.
The basic structure of an memcg LRU can be understood by an analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations), i.e., the counterparts
to the active and the inactive;
2. The increment of max_seq triggers promotion, i.e., the counterpart
to activation;
3. Other events trigger similar operations, e.g., offlining an memcg
triggers demotion, i.e., the counterpart to deactivation.
In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
and reduces latency without affecting fairness over some time.
The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/
The following is a simple test to quickly verify its effectiveness.
Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.
Desired outcome:
1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
over mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
to 100%.
Actual outcome [1]:
                                     MGLRU off    MGLRU on
  stddev(pgsteal) / mean(pgsteal)    75%          20%
  sum(pgsteal) / sum(requested)      425%         95%
####################################################################
MEMCGS=128
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done
start() {
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs
    fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done
sleep 600
for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################
[1]: This was obtained from running the above script (touches less
than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
hour.
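The two ratios in the outcome table above can be computed from per-memcg
pgsteal counts roughly as follows. This is a minimal Python sketch, not the
tooling used for the test: the function name reclaim_metrics and the sample
numbers are hypothetical, the population standard deviation is an
assumption (the message does not say which estimator was used), and both
inputs are assumed to be in the same unit.

```python
from statistics import mean, pstdev


def reclaim_metrics(pgsteal, requested):
    """Return (stddev/mean of pgsteal, sum(pgsteal)/sum(requested)) in percent.

    pgsteal:   one reclaimed amount per memcg
    requested: amounts requested through memory.reclaim, same unit as pgsteal
    """
    # Fairness: 0% means every memcg had the same amount reclaimed.
    spread = 100.0 * pstdev(pgsteal) / mean(pgsteal)
    # Accuracy: 100% means exactly the requested amount was reclaimed.
    coverage = 100.0 * sum(pgsteal) / sum(requested)
    return spread, coverage


# Hypothetical numbers for illustration, not measurements from the test.
spread, coverage = reclaim_metrics([90, 100, 110, 100], [400])
print(f"stddev/mean = {spread:.1f}%, stolen/requested = {coverage:.1f}%")
```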
This patch (of 8):
The new name lru_gen_folio will be more distinct from the coming
lru_gen_memcg.
Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-22 12:18:59 +08:00
|
|
|
struct lru_gen_folio *lrugen;
|
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
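The interplay of max_seq, min_seq and the per-generation lists can be
pictured with a toy model. The Python sketch below is an illustration under
stated assumptions, not the kernel code: the class ToyLruGen and its methods
are hypothetical, the values MIN_NR_GENS = 2 and MAX_NR_GENS = 4 are assumed
for the example, "approaches" is simplified to "is at most", and the
feedback loop that chooses between anon and file is left out.

```python
from collections import deque

MIN_NR_GENS = 2  # assumed value for illustration
MAX_NR_GENS = 4  # assumed value for illustration


class ToyLruGen:
    """Toy model of one lruvec: a ring of lists indexed by seq % MAX_NR_GENS."""

    def __init__(self):
        self.min_seq = 0
        self.max_seq = MIN_NR_GENS - 1
        self.lists = [deque() for _ in range(MAX_NR_GENS)]

    def nr_gens(self):
        return self.max_seq - self.min_seq + 1

    def add(self, page):
        # New pages join the youngest generation in this toy model.
        self.lists[self.max_seq % MAX_NR_GENS].append(page)

    def age(self):
        # Aging produces a young generation once few generations remain.
        if self.nr_gens() <= MIN_NR_GENS:
            self.max_seq += 1
            return True
        return False

    def evict(self):
        # Eviction consumes the oldest generation; min_seq advances only
        # when that generation's list is empty and enough generations remain.
        while (not self.lists[self.min_seq % MAX_NR_GENS]
               and self.nr_gens() > MIN_NR_GENS):
            self.min_seq += 1
        oldest = self.lists[self.min_seq % MAX_NR_GENS]
        return oldest.popleft() if oldest else None


gen = ToyLruGen()
gen.add("a")
gen.age()
gen.add("b")
first = gen.evict()   # "a" sits in the older generation and goes first
stuck = gen.evict()   # "b" is protected until the next aging pass
gen.age()
second = gen.evict()
```

In this model a page in the youngest generations cannot be evicted until
aging has run again, which mirrors the producer/consumer relationship
between aging and eviction described above.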
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
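The mapping from access count to tier can be checked with a few lines of
Python. This is a sketch for illustration only: it assumes order_base_2(N)
behaves as ceil(log2(N)) with 0 and 1 both mapping to 0, which agrees with
the statement above that the first tier covers N=0,1; the helper name
tier_of is hypothetical.

```python
def order_base_2(n: int) -> int:
    """ceil(log2(n)) for n > 1, and 0 for n <= 1 (assumed semantics)."""
    return (n - 1).bit_length() if n > 1 else 0


def tier_of(nr_fd_accesses: int) -> int:
    """Tier of a page accessed nr_fd_accesses times through file descriptors."""
    return order_base_2(nr_fd_accesses)


# Tiers for N = 0..8 accesses through file descriptors.
tiers = [tier_of(n) for n in range(9)]
print(tiers)
```

Each additional tier covers twice as many access counts as the previous
one, so a small number of bits in folio->flags is enough to tell lightly
reused pages from heavily reused ones.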
Server benchmark results:
Single workload:
fio (buffered I/O): +[30, 32]%
IOPS BW
5.19-rc1: 2673k 10.2GiB/s
patch1-6: 3491k 13.3GiB/s
Single workload:
memcached (anon): -[4, 6]%
Ops/sec KB/sec
5.19-rc1: 1161501.04 45177.25
patch1-6: 1106168.46 43025.04
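As a quick sanity check, the relative changes implied by the numbers above
can be recomputed. The Python sketch below only does the arithmetic; the
helper name pct_change is hypothetical, and treating the bracketed ranges as
lower and upper bounds on the change is an assumption.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


fio_iops = pct_change(2673, 3491)                  # IOPS, in thousands
memcached_ops = pct_change(1161501.04, 1106168.46)  # Ops/sec

print(f"fio IOPS: {fio_iops:+.1f}%")
print(f"memcached Ops/sec: {memcached_ops:+.1f}%")
```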
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.19-rc1
40.33% page_vma_mapped_walk (overhead)
21.80% lzo1x_1_do_compress (real work)
7.53% do_raw_spin_lock
3.95% _raw_spin_unlock_irq
2.52% vma_interval_tree_iter_next
2.37% folio_referenced_one
2.28% vma_interval_tree_subtree_search
1.97% anon_vma_interval_tree_iter_first
1.60% ptep_clear_flush
1.06% __zram_bvec_write
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
ChromeOS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
|
|
|
int type = folio_is_file_lru(folio);
|
|
|
|
int delta = folio_nr_pages(folio);
|
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed that MGLRU does not work well on certain workloads, as
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
that the LRU starts from the right) of file pages is also larger than
total memory. All file pages are stuck on the oldest generation and are
repeatedly read in and then evicted. Despite anon pages being idle,
they never get aged. The PID controller does not kick in until there
are some minor access pattern changes, and file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in memory and reduce the I/O workload significantly.
So apply it to MGLRU as well. The updated refault-distance model
fits MGLRU well in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
Some adjustments are made to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
of MGLRU:
- If a tier-0 page has a qualified refault-distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault-distance, mark it as
active and send it to the youngest gen.
- Increase the reference of every page that has a qualified
refault-distance and increase the PID-controlled refault rate
of the updated tier, in the hope that similar pages will be protected
next time upon eviction.
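The first two rules above amount to a small decision table. The Python
sketch below restates them under stated assumptions and is not the patch's
code: the function name place_refaulted_page and its return convention are
hypothetical, generations are represented by sequence numbers with min_seq
as the oldest and max_seq as the youngest, and pages without a qualified
refault-distance return None because the description does not cover them.

```python
def place_refaulted_page(tier, qualified, min_seq, max_seq):
    """Return (target generation seq, mark_active) for a refaulted page.

    Returns None when the refault-distance is not qualified, since the
    description above does not say what happens in that case.
    """
    if not qualified:
        return None
    if tier == 0:
        # Tier-0 page: promoted to a higher tier, second oldest generation.
        return (min_seq + 1, False)
    # Tier >= 1 page: marked active, youngest generation.
    return (max_seq, True)


# Hypothetical sequence numbers for illustration.
cold_refault = place_refaulted_page(tier=0, qualified=True, min_seq=3, max_seq=6)
hot_refault = place_refaulted_page(tier=2, qualified=True, min_seq=3, max_seq=6)
```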
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking, and
workingset_restore_* now stands for all pages promoted for
any reason.
The following benchmark showed a roughly 4.4x improvement in completed
transactions. To simulate the workload being optimized, I set up a
3-replica MongoDB cluster, with each replica in a different cgroup,
using 5 GB of WiredTiger cache and 10 GB of oplog, on a 32G VM with
no limit set. The benchmark is run using
https://github.com/apavlo/py-tpcc.git, modified to run only the
STOCK_LEVEL query, to simulate slow queries and get a stable result.
The test is run on an EPYC 7K62 with 32G of RAM and a SATA SSD:
- Before (with ZRAM enabled; the result does not change whether or not
any kind of swap is on):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
                  Executed   Time (µs)       Rate
  STOCK_LEVEL     577        27584645283.7   0.02 txn/s
------------------------------------------------------------------
  TOTAL           577        27584645283.7   0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
        total   used   free    shared   buff/cache   available
Mem:    31849   6829   790     23       24229        24542
Swap:   31848   0      31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
                  Executed   Time (µs)       Rate
  STOCK_LEVEL     2542       27121571486.2   0.09 txn/s
------------------------------------------------------------------
  TOTAL           2542       27121571486.2   0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
        total   used   free    shared   buff/cache   available
Mem:    31849   7093   283     4        24472        24289
Swap:   31848   1652   30196
Throughput is about 4.4x higher than before (2542 vs. 577 transactions),
and the idle anon pages can now get swapped out as expected. Testing
with lower stress also shows an improvement.
No regression has been observed on other tests so far, and a performance
gain is observed on file-page-heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
|
|
|
int distance;
|
|
|
|
unsigned long refault_distance, protect_tier;
|
mm: multi-gen LRU: minimal implementation
2022-09-18 16:00:03 +08:00
|
|
|
|
2024-04-01 23:29:44 +08:00
|
|
|
unpack_shadow(shadow, &memcgid, &pgdat, &token, &workingset);
|
workingset, lru_gen: apply refault-distance based protection
2024-04-01 23:43:38 +08:00
|
|
|
memcg = try_get_flush_memcg(memcgid);
|
|
|
|
if (!memcg)
|
|
|
|
return;
|
|
|
|
|
|
|
|
lruvec = mem_cgroup_lruvec(memcg, pgdat);
|
2023-05-24 04:59:21 +08:00
|
|
|
if (lruvec != folio_lruvec(folio))
|
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
Server benchmark results:
Single workload:
fio (buffered I/O): +[30, 32]%
IOPS BW
5.19-rc1: 2673k 10.2GiB/s
patch1-6: 3491k 13.3GiB/s
Single workload:
memcached (anon): -[4, 6]%
Ops/sec KB/sec
5.19-rc1: 1161501.04 45177.25
patch1-6: 1106168.46 43025.04
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.19-rc1
40.33% page_vma_mapped_walk (overhead)
21.80% lzo1x_1_do_compress (real work)
7.53% do_raw_spin_lock
3.95% _raw_spin_unlock_irq
2.52% vma_interval_tree_iter_next
2.37% folio_referenced_one
2.28% vma_interval_tree_subtree_search
1.97% anon_vma_interval_tree_iter_first
1.60% ptep_clear_flush
1.06% __zram_bvec_write
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
ChromeOS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
|
|
|
goto unlock;
|
|
|
|
|
2023-05-24 04:59:21 +08:00
|
|
|
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
|
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed MGLRU not working very well on certain workflows, as
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
LRU starts from right) of file pages is also larger than total memory.
All file pages are stuck on the oldest generation, getting read in
and then evicted over and over. Despite anon pages being idle, they
never get aged. The PID controller doesn't kick in until there are
some minor access pattern changes, and file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.
So apply it for MGLRU as well. The updated refault-distance model
fits well for MGLRU in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as
the active LRU.
Some adjustments are made to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
of MGLRU:
- If a tier-0 page has a qualified refault-distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault-distance, mark it as
active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
refault-distance and increase the PID controlled refault rate
of the updated tier, in the hope that similar pages will be
protected next time upon eviction.
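The three rules above can be read as a small decision function. The
following is only a sketch under stated assumptions: every name in it
(decide_refault, struct refault_decision, the TARGET_* values) is
hypothetical and is not taken from the patch, and what happens to a page
without a qualified refault-distance is left as "unchanged" because the
message does not describe it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the patch's code. */
enum refault_target {
	TARGET_UNCHANGED,	/* no qualified distance: existing path decides */
	TARGET_SECOND_OLDEST,	/* tier-0 page with a qualified distance */
	TARGET_YOUNGEST,	/* tier >= 1 page with a qualified distance */
};

struct refault_decision {
	enum refault_target target;
	bool mark_active;	/* only set for tier >= 1 pages */
	bool bump_refs;		/* raise the page's reference count */
	bool bump_pid_refault;	/* count it in the tier's PID refault rate */
};

static struct refault_decision decide_refault(int tier, bool qualified)
{
	struct refault_decision d = { TARGET_UNCHANGED, false, false, false };

	assert(tier >= 0);
	if (!qualified)
		return d;

	/* Third rule: applies to every page with a qualified distance. */
	d.bump_refs = true;
	d.bump_pid_refault = true;

	if (tier == 0) {
		/* First rule: promote and send to the second oldest gen. */
		d.target = TARGET_SECOND_OLDEST;
	} else {
		/* Second rule: mark active and send to the youngest gen. */
		d.target = TARGET_YOUNGEST;
		d.mark_active = true;
	}
	return d;
}
```

For example, decide_refault(0, true) selects the second oldest generation
without marking the page active, while decide_refault(2, true) selects the
youngest generation and sets mark_active.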
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking, and
workingset_restore_* now stands for all pages promoted for
any reason.
The following benchmark showed a 5x improvement. To simulate the optimized
workflow, I set up a 3-replica MongoDB cluster, each replica in a different
cgroup, using 5 GB of WiredTiger cache and 10 GB of oplog, on a 32G VM with
no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate a slow query and get a stable result.
The test is done on an EPYC 7K62 with 32G RAM and a SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
The performance is 5x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement.
There is no regression on other tests so far, and a performance gain
is observed on file page heavy tasks.
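As a cross-check on the figures quoted above, the ratios can be recomputed
from the tpcc.py and /proc/vmstat output. This is only arithmetic on the
printed numbers; the helper names are made up for illustration. By executed
transaction count the gain works out to roughly 4.4x (2542 vs 577), which is
in line with the rounded rates of 0.09 vs 0.02 txn/s, and the share of file
refaults that were activated drops from nearly all of them to a bit over a
third.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two counters; the denominator must be non-zero. */
static double ratio(double num, double den)
{
	assert(den != 0.0);
	return num / den;
}

/* STOCK_LEVEL transactions executed: patched vs before. */
static double tpcc_speedup(void)
{
	return ratio(2542.0, 577.0);
}

/* workingset_activate_file / workingset_refault_file, before the patch. */
static double activate_share_before(void)
{
	return ratio(23487840.0, 23498953.0);
}

/* The same share with the patch applied. */
static double activate_share_after(void)
{
	return ratio(8565519.0, 22693601.0);
}

static void print_summary(void)
{
	printf("speedup by executed txns: %.2fx\n", tpcc_speedup());
	printf("activated share before:   %.2f%%\n",
	       100.0 * activate_share_before());
	printf("activated share after:    %.2f%%\n",
	       100.0 * activate_share_after());
}
```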
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
|
|
|
refault_distance = lru_distance(lruvec, type, token,
|
|
|
|
LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
|
2024-04-01 17:43:25 +08:00
|
|
|
workingset_refault_track(lruvec, distance);
|
workingset, lru_gen: apply refault-distance based protection
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
|
|
|
/* Check if the gen the page was evicted from still exist */
|
|
|
|
recent = lru_gen_test_recent(lruvec, type, refault_distance);
|
|
|
|
/* Check if the distance indicates a refault */
|
|
|
|
distance = lru_gen_test_refault(lruvec, type, refault_distance,
|
|
|
|
mem_cgroup_get_nr_swap_pages(memcg));
|
|
|
|
if (!recent && distance == DISTANCE_NONE)
|
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or direct
table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
(and IO to be done) within a range of a file, allowing for more
frequent syncing when and where there is IO capacity, and batching
when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for the x86 architecture.
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. Users can choose between
mincore or cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, five runs for
each option:
Using cachestat:
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s
Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows for a 64-bit length argument, even on 32-bit architectures
(which was previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
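A caller might use the interface roughly as follows. This is a sketch, not
part of the series: the struct layouts and field names mirror the UAPI as it
was eventually merged (a range struct with 64-bit off/len and a result
struct of page counters), the syscall number 451 is an assumption for
systems whose headers do not define __NR_cachestat, and query_page_cache()
and the cs_* struct names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451	/* assumed number; check your kernel headers */
#endif

/* Local mirrors of the UAPI structs, renamed to avoid header clashes. */
struct cs_range {
	uint64_t off;
	uint64_t len;		/* 0 means "up to the end of the file" */
};

struct cs_stats {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/* Returns 0 on success or a negative errno value on failure. */
static int query_page_cache(const char *path, struct cs_stats *out)
{
	struct cs_range range = { 0, 0 };	/* whole file */
	int fd, ret;

	assert(path && out);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* The 64-bit length travels inside the struct, not as an argument. */
	ret = (int)syscall(__NR_cachestat, fd, &range, out, 0);
	if (ret < 0)
		ret = -errno;	/* e.g. -ENOSYS on kernels without cachestat */

	close(fd);
	return ret;
}
```

Passing the range through a pointer is what lets 32-bit and 64-bit callers
share one ABI, as described above. On a kernel without the syscall the
helper returns -ENOSYS, so callers can fall back to mincore().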
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
|
|
|
goto unlock;
|
|
|
|
|
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
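The way the two sequence counters bound a sliding window of generations can
be sketched in user space. The model below is a simplified illustration, not
the kernel implementation: struct toy_lruvec and the toy_* functions are
hypothetical, the values of MAX_NR_GENS and MIN_NR_GENS are assumptions, a
single page counter stands in for lrugen->lists[], and the guard that keeps
at least MIN_NR_GENS generations during eviction is an assumption of this
sketch.

```c
#include <assert.h>

#define MAX_NR_GENS 4	/* assumed value for this sketch */
#define MIN_NR_GENS 2	/* assumed value for this sketch */

struct toy_lruvec {
	unsigned long min_seq;			/* oldest generation */
	unsigned long max_seq;			/* youngest generation */
	unsigned long nr_pages[MAX_NR_GENS];	/* indexed by seq % MAX_NR_GENS */
};

static unsigned long nr_gens(const struct toy_lruvec *v)
{
	return v->max_seq - v->min_seq + 1;
}

/* Aging: produce a new youngest generation when few generations remain. */
static int toy_age(struct toy_lruvec *v)
{
	assert(nr_gens(v) <= MAX_NR_GENS);
	if (nr_gens(v) > MIN_NR_GENS)
		return 0;
	v->max_seq++;
	v->nr_pages[v->max_seq % MAX_NR_GENS] = 0;
	return 1;
}

/* Eviction: retire the oldest generation once its list is empty. */
static int toy_evict_advance(struct toy_lruvec *v)
{
	assert(nr_gens(v) <= MAX_NR_GENS);
	if (nr_gens(v) <= MIN_NR_GENS)
		return 0;	/* sketch assumption: keep MIN_NR_GENS around */
	if (v->nr_pages[v->min_seq % MAX_NR_GENS] != 0)
		return 0;	/* still has pages to evict */
	v->min_seq++;
	return 1;
}
```

Because generations are indexed modulo MAX_NR_GENS, incrementing max_seq
reuses the slot that an earlier increment of min_seq retired, so neither
step has to move pages between arrays.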
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
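The tier mapping described above, tier = order_base_2(N), can be checked
with a user-space stand-in. The function below is a sketch that mimics the
rounding of the kernel's order_base_2() (the smallest order such that
2^order >= n, with 0 and 1 both mapping to 0); toy_order_base_2() and
tier_of_accesses() are hypothetical names, and any cap on the number of
tiers is left out.

```c
#include <assert.h>

/* Smallest order with (1 << order) >= n; 0 and 1 both give 0. */
static unsigned int toy_order_base_2(unsigned long n)
{
	unsigned int order = 0;

	assert(n <= (1UL << 31));	/* keeps the shift below in range */
	while ((1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static unsigned int tier_of_accesses(unsigned long n)
{
	return toy_order_base_2(n);
}
```

With this mapping, N=0 and N=1 land in the first tier (tier 0), N=2 in
tier 1, N=3 and N=4 in tier 2, and N=5 through N=8 in tier 3, which matches
the statement that the first tier covers N=0,1 and is the baseline for the
tiers with N>1.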
Server benchmark results:
Single workload:
fio (buffered I/O): +[30, 32]%
IOPS BW
5.19-rc1: 2673k 10.2GiB/s
patch1-6: 3491k 13.3GiB/s
Single workload:
memcached (anon): -[4, 6]%
Ops/sec KB/sec
5.19-rc1: 1161501.04 45177.25
patch1-6: 1106168.46 43025.04
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.19-rc1
40.33% page_vma_mapped_walk (overhead)
21.80% lzo1x_1_do_compress (real work)
7.53% do_raw_spin_lock
3.95% _raw_spin_unlock_irq
2.52% vma_interval_tree_iter_next
2.37% folio_referenced_one
2.28% vma_interval_tree_subtree_search
1.97% anon_vma_interval_tree_iter_first
1.60% ptep_clear_flush
1.06% __zram_bvec_write
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
ChromeOS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
|
|
|
/* see the comment in folio_lru_refs() */
|
workingset, lru_gen: apply refault-distance based protection
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
|
|
|
token >>= LRU_GEN_EVICTION_BITS;
|
mm: multi-gen LRU: minimal implementation
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
tier = lru_tier_from_refs(refs);
/*
 * Count the following two cases as stalls:
 * 1. For pages accessed through page tables, hotter pages pushed out
 *    hot pages which refaulted immediately.
 * 2. For pages accessed multiple times through file descriptors,
mm/mglru: fix underprotected page cache
commit 081488051d28d32569ebb7c7a23572778b2e7d57 upstream.
Unmapped folios accessed through file descriptors can be underprotected.
Those folios are added to the oldest generation based on:
1. The fact that they are less costly to reclaim (no need to walk the
rmap and flush the TLB) and have less impact on performance (don't
cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
client use cases like Android, its apps parse configuration files
and store the data in heap (anon); for server use cases like MySQL,
it reads from InnoDB files and holds the cached data for tables in
buffer pools (anon).
However, the oldest generation can be very short lived, and if so, it
doesn't provide the PID controller with enough time to respond to a surge
of refaults. (Note that the PID controller uses weighted refaults and
those from evicted generations only take a half of the whole weight.) In
other words, for a short lived generation, the moving average smooths out
the spike quickly.
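The halving effect described above can be illustrated with a small sketch. It assumes the controller folds each finished generation into a running average by halving the sum; carry_over() and decay_spike() are hypothetical names chosen for this illustration, not the kernel's functions.

```c
/*
 * Hypothetical carryover step: refaults from the generation that was just
 * evicted are added to the running average and the sum is halved, so older
 * history keeps only half of its previous weight.
 */
unsigned long carry_over(unsigned long avg_refaulted, unsigned long refaulted)
{
	return (avg_refaulted + refaulted) / 2;
}

/*
 * Record a spike of refaults in one generation, then roll over `rollovers`
 * further generations that see no refaults at all.
 */
unsigned long decay_spike(unsigned long spike, unsigned int rollovers)
{
	unsigned long avg = carry_over(0, spike);

	while (rollovers--)
		avg = carry_over(avg, 0);
	return avg;
}
```

With a spike of 1000 refaults, the average is 500 after the first rollover and 62 after three more quiet ones, which is why a short-lived oldest generation leaves the controller little time to react.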
To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
tracking range of tiers, i.e., five accesses through file
descriptors, move them to the second oldest generation to give them
more time to age. (Note that tiers are used by the PID controller
to statistically determine whether folios accessed multiple times
through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
that they are not too close to the tail. The effect of this is
similar to the above.
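Step 1 above can be read as a simple saturation check. The sketch below is an assumption-laden illustration, not the patch itself: TRACKED_ACCESSES and target_seq() are hypothetical names, the limit of four tracked accesses is inferred from the "five accesses" wording, and the separate tier-based decision made by the PID controller is left out.

```c
/* Assumption: tiers can tell apart at most four accesses through file descriptors. */
#define TRACKED_ACCESSES 4u

/*
 * Return the sequence number of the generation a folio should sit in.
 * Folios whose access count is beyond what tiers can track move from the
 * oldest generation (min_seq) to the second oldest one (min_seq + 1).
 */
unsigned long target_seq(unsigned long min_seq, unsigned int accesses)
{
	if (accesses > TRACKED_ACCESSES)
		return min_seq + 1;
	return min_seq;
}
```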
On Android, launching 55 apps sequentially:
Before After Change
workingset_refault_anon 25641024 25598972 0%
workingset_refault_file 115016834 106178438 -8%
Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08 14:14:04 +08:00
 *    they would have been protected by sort_folio().
 */
if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed that MGLRU does not work well on certain workloads, as
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory and the access distance of
file pages (the left-shift time of a page before it gets activated,
considering that the LRU starts from the right) is also larger than
total memory. All file pages are stuck on the oldest generation, being
read in and then evicted repeatedly. Anon pages never get aged even
though they are idle. The PID controller doesn't kick in until there
are some minor access pattern changes, and file pages are neither
promoted nor reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in memory and significantly reduce the IO workload.
So apply it to MGLRU as well. The updated refault-distance model
fits MGLRU well in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as
the active LRU.
Some adjustments are made to fit the logic better and to make the
refault distance contribute to the page tiering and PID refault
detection of MGLRU:
- If a tier-0 page has a qualified refault distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault distance, mark it as
active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
refault distance and increase the PID-controlled refault rate
of the updated tier, in the hope that similar pages will be
protected next time upon eviction.
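The first two rules in the list above amount to a small decision table, sketched below. The enum, its values and classify_refault() are hypothetical names used only for this illustration, and what makes a refault distance "qualified" is reduced to a boolean input because the text does not define it.

```c
#include <stdbool.h>

/* Hypothetical outcomes, named after the placements described in the list. */
enum refault_action {
	REFAULT_NO_PROTECTION,     /* refault distance is not qualified */
	REFAULT_SECOND_OLDEST_GEN, /* tier 0: promote the tier, second oldest gen */
	REFAULT_YOUNGEST_GEN,      /* tier >= 1: mark active, youngest gen */
};

/* Pick the placement for a refaulted page from its tier before eviction. */
enum refault_action classify_refault(unsigned int tier, bool distance_qualified)
{
	if (!distance_qualified)
		return REFAULT_NO_PROTECTION;
	if (tier == 0)
		return REFAULT_SECOND_OLDEST_GEN;
	return REFAULT_YOUNGEST_GEN;
}
```

The third rule (bumping the reference count and the refault rate of the updated tier) applies to both qualified cases and is not modeled here.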
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now counts the pages
reactivated or promoted by refault-distance checking, and
workingset_restore_* now counts all pages promoted for
any reason.
The following benchmark showed a more than 4x improvement. To simulate
the workload being optimized, I set up a 3-replica MongoDB cluster, with
each replica in a different cgroup, using 5 GB of WiredTiger cache and
10 GB of oplog, on a 32G VM with no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.
The test is done on an EPYC 7K62 with 32G RAM and a SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
Throughput is about 4.4x higher than before (2542 vs. 577 completed
transactions), and the idle anon pages can now get swapped out as
expected. Testing with lower stress also shows an improvement.
There is no regression on other tests so far, and a performance gain
is observed on file-page-heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
if (distance <= DISTANCE_SHORT) {
	/* Set ref bits and workingset (increase refs by one) */
	if (!lru_gen_in_fault())
		folio_set_active(folio);
	else
		set_mask_bits(&folio->flags, 0,
			      min_t(unsigned long, refs, BIT(LRU_REFS_WIDTH) - 1)
			      << LRU_REFS_PGOFF);
	folio_set_workingset(folio);
} else if (recent || distance <= DISTANCE_MID) {
	/*
	 * Beyond PID protection range, no point increasing refs
	 * for highest tier, but we can activate file page.
	 */
2024-04-11 11:00:12 +08:00
	set_mask_bits(&folio->flags, 0, (unsigned long)(refs - workingset) << LRU_REFS_PGOFF);
	folio_set_workingset(folio);
} else {
2024-04-11 11:00:12 +08:00
	set_mask_bits(&folio->flags, 0, 1UL << LRU_REFS_PGOFF);
}
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
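As a worked example of the tier rule above, the following sketch maps an access count N to tier order_base_2(N). It reimplements the rounding in plain C (the smallest order such that 2^order >= N, with N=0 and N=1 both giving 0) rather than using the kernel macro; the function names are hypothetical.

```c
/*
 * Smallest order such that (1UL << order) >= n, with 0 and 1 both mapping
 * to 0. This mirrors the rounding-up log2 that order_base_2() denotes.
 */
static unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned int order = 0;

	while (order < 8 * sizeof(n) - 1 && (1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static unsigned int tier_from_accesses(unsigned long n)
{
	return order_base_2_sketch(n);
}
```

Under this rule, N=0 and N=1 land in the first tier (tier 0), N=2 in tier 1, N=3 and N=4 in tier 2, and N=5 through N=8 in tier 3.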
Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
                IOPS         BW
      5.19-rc1: 2673k        10.2GiB/s
      patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
                Ops/sec      KB/sec
      5.19-rc1: 1161501.04   45177.25
      patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.19-rc1
40.33% page_vma_mapped_walk (overhead)
21.80% lzo1x_1_do_compress (real work)
7.53% do_raw_spin_lock
3.95% _raw_spin_unlock_irq
2.52% vma_interval_tree_iter_next
2.37% folio_referenced_one
2.28% vma_interval_tree_subtree_search
1.97% anon_vma_interval_tree_iter_first
1.60% ptep_clear_flush
1.06% __zram_bvec_write
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
ChromeOS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
}
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed that MGLRU does not work well for certain workloads, as
observed on some heavily stressed databases. This happens when the file
page working set exceeds total memory and the access distance of file
pages (the left-shift time of a page before it gets activated,
considering that the LRU starts from the right) is also larger than
total memory. All file pages are stuck in the oldest generation and are
repeatedly read in and then evicted. Although the anon pages are idle,
they never get aged. The PID controller does not kick in until there
are some minor access pattern changes, and file pages are neither
promoted nor reused.
Even though memory cannot hold the whole working set, refault-distance
based re-activation can keep part of the working set in memory and
reduce the IO workload significantly. So apply it to MGLRU as well. The
updated refault-distance model fits MGLRU well in most cases if we
simply treat the last two generations as the inactive LRU and the first
two generations as the active LRU.
Some adjustments are made to fit the logic better and to make the
refault distance contribute to page tiering and the PID refault
detection of MGLRU:
- If a tier-0 page has a qualified refault distance, just promote it
to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault distance, mark it as
active and send it to the youngest gen.
- Increase the reference of every page that has a qualified refault
distance and increase the PID-controlled refault rate of the updated
tier, in the hope that similar pages will be protected next time upon
eviction.
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now counts the pages reactivated
or promoted by refault distance checking, and workingset_restore_*
now counts all pages promoted for any reason.
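The three adjustment rules listed above can be summarized as a small decision function. This is a sketch of the rules as worded in this message, not the patch's actual code: the enum, the struct, and the boolean `qualified` (standing in for whatever the refault distance check decides) are all hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical destination of a refaulted page. */
enum target_gen { GEN_UNCHANGED, GEN_SECOND_OLDEST, GEN_YOUNGEST };

/* Hypothetical summary of what happens to one refaulted page. */
struct refault_action {
	enum target_gen gen;	/* generation the page is sent to */
	bool promote_tier;	/* tier-0 page moved to a higher tier */
	bool mark_active;	/* page marked as active */
	bool bump_refs;		/* reference increased, tier refault rate updated */
};

static struct refault_action classify_refault(unsigned int tier, bool qualified)
{
	struct refault_action act = { GEN_UNCHANGED, false, false, false };

	/* Pages without a qualified refault distance get no extra protection. */
	if (!qualified)
		return act;

	/* Every qualified page has its reference increased. */
	act.bump_refs = true;
	if (tier == 0) {
		/* Tier-0: promote to a higher tier, second oldest gen. */
		act.promote_tier = true;
		act.gen = GEN_SECOND_OLDEST;
	} else {
		/* Tier >= 1: mark as active, youngest gen. */
		act.mark_active = true;
		act.gen = GEN_YOUNGEST;
	}
	return act;
}
```

For example, `classify_refault(0, true)` yields a tier promotion into the second oldest generation, while `classify_refault(2, true)` yields an activation into the youngest generation.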
The following benchmark shows a more than 4x improvement (2542 vs. 577
STOCK_LEVEL transactions in about 900 seconds). To simulate the
targeted workload, I set up a 3-replica MongoDB cluster, each replica
in a different cgroup, using 5 GB of WiredTiger cache and 10 GB of
oplog, on a 32G VM with no limit set. The benchmark uses
https://github.com/apavlo/py-tpcc.git, modified to run only the
STOCK_LEVEL query, to simulate slow queries and get a stable result.
The test was run on an EPYC 7K62 with 32G RAM and a SATA SSD:
- Before (with ZRAM enabled; the result does not change whether or not
any kind of swap is on):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
Throughput is more than 4x higher than before (2542 vs. 577
transactions), and the idle anon pages can now get swapped out as
expected. Testing with lower stress also shows an improvement. There is
no regression on other tests so far, and a performance gain is observed
on file-page-heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
	lrugen = &lruvec->lrugen;
	hist = lru_hist_of_min_seq(lruvec, type);
	protect_tier = tier;

	/*
	 * Don't over-protect clean cache page (!tier page), if the page wasn't access
	 * for a while (refault distance > LRU / MAX_NR_GENS), there is no help keeping
	 * it in memory, bias higher tier instead.
	 */
	if (distance <= DISTANCE_SHORT && !tier) {
		/* The folio is referenced one more time in the shadow gen */
		folio_set_workingset(folio);
		protect_tier = lru_tier_from_refs(1);
		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
	}

	if (protect_tier == tier && recent) {
		atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
	} else {
		atomic_long_add(delta, &lrugen->avg_total[type][protect_tier]);
		atomic_long_add(delta, &lrugen->avg_refaulted[type][protect_tier]);
	}
mm: multi-gen LRU: minimal implementation
2022-09-18 16:00:03 +08:00
unlock:
workingset, lru_gen: apply refault-distance based protection
2024-04-01 23:43:38 +08:00
	mem_cgroup_put(memcg);
mm: multi-gen LRU: minimal implementation
2022-09-18 16:00:03 +08:00
}

#else /* !CONFIG_LRU_GEN */

static void *lru_gen_eviction(struct folio *folio)
{
	return NULL;
}

2024-04-01 23:29:44 +08:00
static bool lru_gen_test_recent(struct lruvec *lruvec, bool file,
				unsigned long token)
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
(and IO to be done) within a range of a file, allowing for more
frequent syncing when and where there is IO capacity, and batching
when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for the x86 architecture.
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. Users can choose between
mincore and cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, five runs
each:
Using cachestat:
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s
Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows for a 64-bit length argument, even on 32-bit architectures
(which was previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.
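To make the idea of a recency check concrete, the following is a simplified userspace model, not the kernel helper itself. It assumes that eviction stores a truncated age counter in the shadow entry and that a refault counts as recent when the distance between the current age and the stored stamp does not exceed the workingset size. The names and the 40-bit stamp width are hypothetical, and memcg handling is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical width of the eviction stamp kept in a shadow entry. */
#define SKETCH_EVICTION_BITS 40
#define SKETCH_EVICTION_MASK ((UINT64_C(1) << SKETCH_EVICTION_BITS) - 1)

/*
 * eviction_stamp: age counter recorded when the page was evicted
 *                 (already truncated to SKETCH_EVICTION_BITS).
 * age_now:        current value of the same counter.
 * workingset_size: number of pages the cache could have kept resident.
 *
 * The subtraction is done modulo 2^SKETCH_EVICTION_BITS so that a
 * wrapped counter still yields the right distance.
 */
static bool sketch_test_recent(uint64_t eviction_stamp, uint64_t age_now,
			       uint64_t workingset_size)
{
	uint64_t refault_distance =
		(age_now - eviction_stamp) & SKETCH_EVICTION_MASK;

	return refault_distance <= workingset_size;
}
```

Exposing the check as a side-effect-free predicate like this is what lets a reader such as cachestat classify a shadow entry without going through the full refault path.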
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
{
	return false;
}
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq. Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.
The eviction consumes old generations. Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.
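The interplay of the two sequence counters can be sketched as follows. This is a simplified model under stated assumptions: the constants 2 and 4 for MIN_NR_GENS and MAX_NR_GENS, the reading of "approaches" as "is less than or equal to", and the rule that eviction never drops below MIN_NR_GENS generations are all assumptions of this sketch, and every sketch_ name is hypothetical.

```c
#include <stdbool.h>

/* Assumed values for this sketch. */
#define SKETCH_MIN_NR_GENS 2UL
#define SKETCH_MAX_NR_GENS 4UL

struct sketch_lrugen {
	unsigned long max_seq;	/* youngest generation, advanced by the aging */
	unsigned long min_seq;	/* oldest generation, advanced by the eviction */
};

/* Index into the per-generation lists for a given sequence number. */
static unsigned long sketch_gen_from_seq(unsigned long seq)
{
	return seq % SKETCH_MAX_NR_GENS;
}

/* Number of generations currently in use. */
static unsigned long sketch_nr_gens(const struct sketch_lrugen *g)
{
	return g->max_seq - g->min_seq + 1;
}

/* The aging runs when the generation count is down to the minimum. */
static bool sketch_needs_aging(const struct sketch_lrugen *g)
{
	return sketch_nr_gens(g) <= SKETCH_MIN_NR_GENS;
}

/* The aging produces a new youngest generation. */
static void sketch_age(struct sketch_lrugen *g)
{
	g->max_seq++;
}

/*
 * The eviction retires the oldest generation once its list is empty,
 * as long as more than the minimum number of generations remain.
 */
static bool sketch_try_inc_min_seq(struct sketch_lrugen *g,
				   bool oldest_list_empty)
{
	if (!oldest_list_empty || sketch_nr_gens(g) <= SKETCH_MIN_NR_GENS)
		return false;
	g->min_seq++;
	return true;
}
```

Because the list index is the sequence number modulo the maximum generation count, creating or retiring a generation only moves a counter; no pages are relinked when max_seq or min_seq advances.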
The protection of pages accessed multiple times through file descriptors
takes place in the eviction path. Each generation is divided into
multiple tiers. A page accessed N times through file descriptors is in
tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline. The first tier
contains single-use unmapped clean pages, which are most likely the best
choices. In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so. This
approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
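The mapping from access count to tier described above can be illustrated with a small userspace sketch. The helper below reimplements order_base_2() as the smallest order whose power of two is at least N, with 0 for N = 0 or 1; the sketch_ names are hypothetical, and any upper bound on the number of tiers is left out.

```c
/*
 * Smallest order such that (1 << order) >= n; yields 0 for n = 0 and n = 1.
 * This mirrors the rounding-up behaviour of order_base_2().
 */
static unsigned int sketch_order_base_2(unsigned long n)
{
	unsigned int order = 0;
	const unsigned int max_order = sizeof(n) * 8 - 1;

	while (order < max_order && (1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static unsigned int sketch_tier_from_accesses(unsigned long n)
{
	return sketch_order_base_2(n);
}
```

With this mapping, N = 0 and N = 1 share the first tier, N = 2 lands in the second, N = 3 and 4 in the third, and N = 5 through 8 in the fourth, so each additional tier covers a range of access counts twice as wide as the previous one.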
Server benchmark results:
Single workload:
fio (buffered I/O): +[30, 32]%
IOPS BW
5.19-rc1: 2673k 10.2GiB/s
patch1-6: 3491k 13.3GiB/s
Single workload:
memcached (anon): -[4, 6]%
Ops/sec KB/sec
5.19-rc1: 1161501.04 45177.25
patch1-6: 1106168.46 43025.04
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.19-rc1
40.33% page_vma_mapped_walk (overhead)
21.80% lzo1x_1_do_compress (real work)
7.53% do_raw_spin_lock
3.95% _raw_spin_unlock_irq
2.52% vma_interval_tree_iter_next
2.37% folio_referenced_one
2.28% vma_interval_tree_subtree_search
1.97% anon_vma_interval_tree_iter_first
1.60% ptep_clear_flush
1.06% __zram_bvec_write
patch1-6
39.03% lzo1x_1_do_compress (real work)
18.47% page_vma_mapped_walk (overhead)
6.74% _raw_spin_unlock_irq
3.97% do_raw_spin_lock
2.49% ptep_clear_flush
2.48% anon_vma_interval_tree_iter_first
1.92% folio_referenced_one
1.88% __zram_bvec_write
1.48% memmove
1.31% vma_interval_tree_iter_next
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
ChromeOS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
static void lru_gen_refault(struct folio *folio, void *shadow)
{
}

#endif /* CONFIG_LRU_GEN */
2014-04-04 05:47:51 +08:00
/**
2021-12-24 05:39:05 +08:00
 * workingset_eviction - note the eviction of a folio from memory
mm: vmscan: detect file thrashing at the reclaim root
We use refault information to determine whether the cache workingset is
stable or transitioning, and dynamically adjust the inactive:active file
LRU ratio so as to maximize protection from one-off cache during stable
periods, and minimize IO during transitions.
With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect the
local LRU order in the cgroup in which they are occurring. As a result,
cache transitions can take longer in a cgrouped system as the active pages
of sibling cgroups aren't challenged when they should be.
[ Right now, this is somewhat theoretical, because the siblings, under
continued regular reclaim pressure, should eventually run out of
inactive pages - and since inactive:active *size* balancing is also
done on a cgroup-local level, we will challenge the active pages
eventually in most cases. But the next patch will move that relative
size enforcement to the reclaim root as well, and then this patch
here will be necessary to propagate refault pressure to siblings. ]
This patch moves refault detection to the root of reclaim. Instead of
remembering the cgroup owner of an evicted page, remember the cgroup that
caused the reclaim to happen. When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.
I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not just
the local order down inside C.
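One way to picture what "remembering the cgroup that caused the reclaim" means is a shadow value that packs the cgroup id next to the node and the eviction stamp. The sketch below is a simplified userspace model: the field widths (16 bits of cgroup id, 6 bits of node id, 1 workingset bit) and all sketch_ names are assumptions for illustration, not the kernel's actual layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field widths for this sketch. */
#define SKETCH_MEMCG_BITS	16
#define SKETCH_NODE_BITS	6
#define SKETCH_WS_BITS		1
#define SKETCH_EVICTION_BITS \
	(64 - SKETCH_MEMCG_BITS - SKETCH_NODE_BITS - SKETCH_WS_BITS)
#define SKETCH_EVICTION_MASK ((UINT64_C(1) << SKETCH_EVICTION_BITS) - 1)

struct sketch_shadow {
	unsigned int memcg_id;	/* id of the cgroup at the root of reclaim */
	unsigned int node;	/* NUMA node the folio was evicted from */
	uint64_t eviction;	/* truncated eviction stamp */
	bool workingset;	/* folio was part of the workingset */
};

/* Pack the fields into one 64-bit shadow value. */
static uint64_t sketch_pack_shadow(unsigned int memcg_id, unsigned int node,
				   uint64_t eviction, bool workingset)
{
	uint64_t v = eviction & SKETCH_EVICTION_MASK;

	v = (v << SKETCH_MEMCG_BITS) |
	    (memcg_id & ((1u << SKETCH_MEMCG_BITS) - 1));
	v = (v << SKETCH_NODE_BITS) | (node & ((1u << SKETCH_NODE_BITS) - 1));
	v = (v << SKETCH_WS_BITS) | (workingset ? 1 : 0);
	return v;
}

/* Reverse of sketch_pack_shadow(), used when the page refaults. */
static struct sketch_shadow sketch_unpack_shadow(uint64_t v)
{
	struct sketch_shadow s;

	s.workingset = v & 1;
	v >>= SKETCH_WS_BITS;
	s.node = (unsigned int)(v & ((1u << SKETCH_NODE_BITS) - 1));
	v >>= SKETCH_NODE_BITS;
	s.memcg_id = (unsigned int)(v & ((1u << SKETCH_MEMCG_BITS) - 1));
	v >>= SKETCH_MEMCG_BITS;
	s.eviction = v;
	return s;
}
```

In this model the only change the patch makes is which id goes into memcg_id at eviction time: the id of the reclaim root rather than the id of the page's owner, so that the refault is later charged against the LRU order that reclaim actually walked.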
[hannes@cmpxchg.org: use page_memcg() instead of another lookup]
Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 09:55:59 +08:00
 * @target_memcg: the cgroup that is causing the reclaim
2021-12-24 05:39:05 +08:00
 * @folio: the folio being evicted
2014-04-04 05:47:51 +08:00
 *
2021-12-24 05:39:05 +08:00
 * Return: a shadow entry to be stored in @folio->mapping->i_pages in place
 * of the evicted @folio so that a later refault can be detected.
2014-04-04 05:47:51 +08:00
 */
2021-12-24 05:39:05 +08:00
void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
2014-04-04 05:47:51 +08:00
{
2021-12-24 05:39:05 +08:00
	struct pglist_data *pgdat = folio_pgdat(folio);
2014-04-04 05:47:51 +08:00
	unsigned long eviction;
2016-03-16 05:57:16 +08:00
	struct lruvec *lruvec;
mm: vmscan: detect file thrashing at the reclaim root
2019-12-01 09:55:59 +08:00
	int memcgid;
2014-04-04 05:47:51 +08:00

2021-12-24 05:39:05 +08:00
	/* Folio is fully exclusive and pins folio's memory cgroup pointer */
	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
2016-03-16 05:57:16 +08:00
mm: multi-gen LRU: minimal implementation
2022-09-18 16:00:03 +08:00
	if (lru_gen_enabled())
		return lru_gen_eviction(folio);
mm: vmscan: detect file thrashing at the reclaim root
2019-12-01 09:55:59 +08:00
	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
	/* XXX: target_memcg can be NULL, go through lruvec */
	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
2024-04-01 19:50:55 +08:00
	eviction = lru_eviction(lruvec, folio_is_file_lru(folio),
				folio_nr_pages(folio), EVICTION_BITS, bucket_order);
2021-12-24 05:39:05 +08:00
	return pack_shadow(memcgid, pgdat, eviction,
				folio_test_workingset(folio));
2014-04-04 05:47:51 +08:00
}

/**
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
2023-05-03 09:36:06 +08:00
 * workingset_test_recent - tests if the shadow entry is for a folio that was
 * recently evicted. Also fills in @workingset with the value unpacked from
 * shadow.
 * @shadow: the shadow entry to be tested.
 * @file: whether the corresponding folio is from the file lru.
 * @workingset: where the workingset value unpacked from shadow should
 * be stored.
2024-04-01 17:43:25 +08:00
 * @tracking: whether to do workingset tracking or not
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or direct
table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
(and IO to be done) within a range of a file, allowing for more
frequent syncing when and where there is IO capacity, and batching
when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for the x86 architecture.
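As a rough userspace illustration of the interface described above, the sketch below wraps the raw syscall. The struct layouts, the field names, and the fallback syscall number 451 are assumptions taken from the interface as it was later merged upstream, not from this message; `query_cachestat` and the `*_sketch` types are hypothetical names, and a kernel without the syscall will simply fail with ENOSYS.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 451 is the number used by the merged upstream syscall. */
#ifndef __NR_cachestat
#define __NR_cachestat 451
#endif

/* Assumed layout of the byte range argument: offset and length. */
struct cachestat_range_sketch {
	uint64_t off;
	uint64_t len;	/* 0 is assumed to mean "until end of file" */
};

/* Assumed layout of the result: page counts within the range. */
struct cachestat_sketch {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Hypothetical wrapper: returns 0 on success, -1 with errno set on failure
 * (for example ENOSYS on kernels that lack the syscall, EBADF on a bad fd).
 */
static int query_cachestat(int fd, uint64_t off, uint64_t len,
			   struct cachestat_sketch *out)
{
	struct cachestat_range_sketch range = { off, len };

	return (int)syscall(__NR_cachestat, fd, &range, out, 0);
}
```

A caller would open the file read-only, pass the fd with the byte range of interest, and read the counters from the result only when the call returns 0.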
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. Users can choose between
mincore or cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, five runs
each:
Using cachestat:
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s
Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows for a 64-bit length argument, even on 32-bit architectures
(which was previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
 *
 * Return: true if the shadow is for a recently evicted folio; false otherwise.
2014-04-04 05:47:51 +08:00
 */
2024-04-01 17:43:25 +08:00
bool workingset_test_recent(void *shadow, bool file, bool *workingset, bool tracking)
2014-04-04 05:47:51 +08:00
{
mm: vmscan: detect file thrashing at the reclaim root
We use refault information to determine whether the cache workingset is
stable or transitioning, and dynamically adjust the inactive:active file
LRU ratio so as to maximize protection from one-off cache during stable
periods, and minimize IO during transitions.
With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect the
local LRU order in the cgroup in which they are occurring. As a result,
cache transitions can take longer in a cgrouped system as the active pages
of sibling cgroups aren't challenged when they should be.
[ Right now, this is somewhat theoretical, because the siblings, under
continued regular reclaim pressure, should eventually run out of
inactive pages - and since inactive:active *size* balancing is also
done on a cgroup-local level, we will challenge the active pages
eventually in most cases. But the next patch will move that relative
size enforcement to the reclaim root as well, and then this patch
here will be necessary to propagate refault pressure to siblings. ]
This patch moves refault detection to the root of reclaim. Instead of
remembering the cgroup owner of an evicted page, remember the cgroup that
caused the reclaim to happen. When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.
I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not just
the local order down inside C.
[hannes@cmpxchg.org: use page_memcg() instead of another lookup]
Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 09:55:59 +08:00
	struct mem_cgroup *eviction_memcg;
	struct lruvec *eviction_lruvec;
2024-04-01 19:50:55 +08:00
	unsigned long refault_distance;
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This basically removes workingset_activation and reduces calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance, in preparation for adapting refault-distance-based file page
protection to multi-gen LRU.
Currently, refault-distance re-activation for the active/inactive LRU
helps keep working set pages in memory. It works by estimating the refault
(re-access) distance of a page; if the distance is small enough, the page
is put on the active LRU instead of the inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true: an activated
page could be anywhere in the LRU list (through mark_page_accessed), so it
only left-shifts the pages on its right side.
Besides, one page can get activated and deactivated multiple times.
Multi-gen LRU doesn't fit this model well either: pages are aged
in generations and are promoted frequently between generations.
So instead we introduce a simpler idea here: just presume the evicted
pages are still in memory, each with a corresponding eviction timestamp
(nonresident_age) that is increased and recorded upon each eviction.
These timestamps logically form a "Shadow LRU", a read-only
imaginary LRU. Let `nonresident_age` be NA; then we have:
  Let SP = ((NA's reading @ current) - (NA's reading @ eviction))

                              +-memory available to cache-+
                              |                           |
    +-------------------------+===============+===========+
    | *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
    +-+-----------------------+===============+===========+
      |                       |
      +-----------------------+
      |         SP
    fault page          O -> Hole left by refaulted pages.
                             Entries are supposed to be removed
                             upon access but this is not a real
                             LRU so can't really update it.
                        * -> The page corresponding to SP
It can be easily seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workload could push
a page out of available memory. Since every evicted page was once at the
head of the INACTIVE list, the estimated minimum refault distance is:
SP + NR_INACTIVE
On refault, the page *may* stay in memory if it is put on the active LRU,
provided that:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
Then the page is worth getting re-activated to start from active LRU,
since the access distance is smaller than the total memory.
And since this is only an estimation based on several hypotheses, it
could break the ability of the LRU to distinguish a workingset from
caches; in extreme cases, activating on every refault will lead to
worse thrashing. So throttle this by two factors:
1. Notice that previously re-faulted pages may leave "holes" in the shadow
part of the LRU; that part is left unhandled on purpose to decrease the
re-activation rate for pages that have a large SP value (the larger
the SP value of a page, the more likely it is to be affected by such
holes).
2. When the active LRU is long enough, challenging active pages
by re-activating a one-time-access, previously evicted/inactive page
may not be a good idea, so throttle the re-activation when
NR_ACTIVE > NR_INACTIVE by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion (possibly with slightly
higher hotness) of the caches in memory instead of letting caches get
evicted one after another due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttle reactivation when NR_ACTIVE is high.
This side effect is actually similar to that of the previous algorithm, which
introduced such an effect by increasing nonresident_age in extra call
paths, throttling the re-activation when activation/reactivation is
happening massively.
Combining all of the above, we have the following simple rules:
Upon refault, if any of the following conditions is met, mark the page as active:
- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE
- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE
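The two rules above can be sketched as a small standalone C function. This is only an illustration of the rule as stated in this message, not the kernel code; `refault_should_activate` and its parameter names are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel function: decide whether a refaulting
 * page should be marked active under the rules described above.
 *
 * sp          - NA reading at refault minus NA reading at eviction
 * nr_active   - number of pages on the active LRU
 * nr_inactive - number of pages on the inactive LRU
 */
static bool refault_should_activate(unsigned long sp,
				    unsigned long nr_active,
				    unsigned long nr_inactive)
{
	/* Active LRU is low: compare SP against NR_ACTIVE. */
	if (nr_active < nr_inactive)
		return sp < nr_active;

	/* Active LRU is high: throttle by comparing against NR_INACTIVE. */
	return sp < nr_inactive;
}
```

Written this way, the rule is equivalent to checking SP against the smaller of NR_ACTIVE and NR_INACTIVE.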
Code-wise, this is simpler than before since there is no longer any need to
update lruvec workingset data when activating a page, and so far, a few
benchmarks show a similar or better result under memory pressure. The
performance should also be better when there is no memory pressure since some
memcg iteration and atomic operations are no longer needed.
When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.
Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (Average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (Average of 6 test runs):
fio: IOPS=5137.5k (+0.23%)
fio2: IOPS=7300.67k (+0.13%)
memcached: 57878.422 ops/s (+0.48%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (0.19% faster)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
	unsigned long inactive;
	unsigned long active;
2016-03-16 05:57:16 +08:00
	int memcgid;
2023-05-03 09:36:06 +08:00
	struct pglist_data *pgdat;
	unsigned long eviction;
2014-04-04 05:47:51 +08:00

2023-05-03 09:36:06 +08:00
	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
2016-03-16 05:57:10 +08:00

2016-03-16 05:57:16 +08:00
	/*
	 * Look up the memcg associated with the stored ID. It might
2021-04-29 22:27:16 +08:00
	 * have been deleted since the folio's eviction.
2016-03-16 05:57:16 +08:00
	 *
	 * Note that in rare events the ID could have been recycled
2021-04-29 22:27:16 +08:00
	 * for a new cgroup that refaults a shared folio. This is
2016-03-16 05:57:16 +08:00
	 * impossible to tell from the available data. However, this
	 * should be a rare and limited disturbance, and activations
	 * are always speculative anyway. Ultimately, it's the aging
	 * algorithm's job to shake out the minimum access frequency
	 * for the active cache.
	 *
	 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
	 * would be better if the root_mem_cgroup existed in all
	 * configurations instead.
	 */
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed MGLRU not working very well on certain workloads, which was
observed on some heavily stressed databases. That is, when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
LRU starts from right) of file pages is also larger than total memory.
All file pages are stuck on the oldest generation, getting
read in and then evicted one after another. Despite anon pages being idle,
they never get aged. The PID controller didn't kick in until there were some
minor access pattern changes. And file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.
So apply it for MGLRU as well. The updated refault-distance model
fits well for MGLRU in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as
the active LRU.
Some adjustments are made to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
of MGLRU:
- If a tier-0 page has a qualified refault distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault distance, mark it as
active and send it to the youngest gen.
- Increase the reference of every page that has a qualified
refault distance and increase the PID-controlled refault rate
of the updated tier, in the hope that similar pages will be protected
next time upon eviction.
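The first two placement rules above can be sketched as a standalone C function. This is a simplified illustration under stated assumptions, not the MGLRU code: the enum, the function name, and the PLACE_UNCHANGED case (used when the refault distance does not qualify, which this message does not describe) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical outcome of a refault under the rules above. */
enum refault_placement {
	PLACE_UNCHANGED,	/* distance not qualified: rules do not apply */
	PLACE_SECOND_OLDEST,	/* tier-0 page: promoted to a higher tier */
	PLACE_YOUNGEST_ACTIVE,	/* tier >= 1 page: marked active */
};

/*
 * Hypothetical helper: pick the generation a refaulting page is sent to,
 * given its tier and whether its refault distance qualified.
 */
static enum refault_placement mglru_refault_placement(unsigned int tier,
						      bool distance_qualified)
{
	if (!distance_qualified)
		return PLACE_UNCHANGED;

	return tier == 0 ? PLACE_SECOND_OLDEST : PLACE_YOUNGEST_ACTIVE;
}
```

The third rule (bumping the page reference and the PID-controlled refault rate of the updated tier) is accounting rather than placement, so it is left out of this sketch.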
NOTE: This also changed the meaning of workingset_* fields in
/proc/vmstat, workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking,
workingset_restore_* now stands for all pages promoted by
any reason.
The following benchmark showed a roughly 5x improvement. To simulate the
optimized workload, I set up a 3-replica mongodb cluster, each replica in a
different cgroup, using 5 GB of WiredTiger cache and 10 GB of oplog, on a 32G
VM with no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.
Test is done on an EPYC 7K62 with 32G RAM with SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
The performance is roughly 5x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement.
There is no regression on other tests so far, and a performance gain
is observed on file page heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
	eviction_memcg = try_get_flush_memcg(memcgid);
	if (!eviction_memcg)
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or direct
table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for performance issues
diagnostic.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
(and IO to be done) within a range of a file, allowing for more
frequent syncing when and where there is IO capacity, and batching
when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also include a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for x86 architecture.
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. User can choose between
mincore or cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta's server machine, each for five
runs:
Using cachestat
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
real 0m0.009s
user 0m0.000s
sys 0m0.009s
Using mincore:
real 0m37.510s
user 0m2.934s
sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows a 64-bit length argument, even on 32-bit architectures
(which was previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
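As a rough illustration of the interface described above, here is a minimal userspace sketch. It assumes the syscall number 451 (the value used on x86-64) when the libc headers do not define `__NR_cachestat`, and it mirrors the uapi structures locally under hypothetical names (`cs_range`, `cs_result`); the helper `query_cachestat()` is not part of the series. Passing both offset and length as 64-bit fields of one struct is what lets 32-bit and 64-bit callers share the same ABI.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451	/* assumption: value used on x86-64 */
#endif

/* Local mirrors of the uapi structs; the names here are hypothetical. */
struct cs_range {
	uint64_t off;	/* start of the byte range */
	uint64_t len;	/* length of the range in bytes, 64-bit everywhere */
};

struct cs_result {
	uint64_t nr_cache;		/* cached pages */
	uint64_t nr_dirty;		/* dirty pages */
	uint64_t nr_writeback;		/* pages marked for writeback */
	uint64_t nr_evicted;		/* evicted pages */
	uint64_t nr_recently_evicted;	/* recently evicted pages */
};

/* Returns 0 on success, or a negative errno (e.g. -ENOSYS on old kernels). */
static int query_cachestat(int fd, uint64_t off, uint64_t len,
			   struct cs_result *out)
{
	struct cs_range range = { .off = off, .len = len };

	if (syscall(__NR_cachestat, fd, &range, out, 0U) == 0)
		return 0;
	return -errno;
}
```

A caller would open the file read-only, call `query_cachestat(fd, 0, size, &res)` and read the counters from `res`; callers should be prepared for `-ENOSYS` on kernels that lack the syscall.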
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
|
|
|
return false;
|
2023-11-29 11:21:52 +08:00
|
|
|
|
mm: memcg: restore subtree stats flushing
Upstream: commit 7d7ef0a4686abe43cd76a141b340a348f45ecdf2
Conflicts: Skip the change in zswap.c because commit b5ba474f3f51 is
missing. This should be OK; a later backport will easily notice the
change of function params.
Backport-reason: mm: memcg: subtree stats flushing and thresholds
Stats flushing for memcg currently follows the following rules:
- Always flush the entire memcg hierarchy (i.e. flush the root).
- Only one flusher is allowed at a time. If someone else tries to flush
concurrently, they skip and return immediately.
- A periodic flusher flushes all the stats every 2 seconds.
This approach is followed because all flushes are serialized
by a global rstat spinlock. On the memcg side, flushing is invoked from
userspace reads as well as in-kernel flushers (e.g. reclaim, refault,
etc). This approach aims to avoid serializing all flushers on the global
lock, which can cause a significant performance hit under high
concurrency.
This approach has the following problems:
- Occasionally a userspace read of the stats of a non-root cgroup will
be too expensive as it has to flush the entire hierarchy [1].
- Sometimes the stats accuracy is compromised if there is an ongoing
flush, and we skip and return before the subtree of interest is
actually flushed, yielding stale stats (by up to 2s due to periodic
flushing). This is more visible when reading stats from userspace,
but can also affect in-kernel flushers.
The latter problem is particularly a concern when userspace reads stats
after an event occurs, but gets stats from before the event. Examples:
- When memory usage / pressure spikes, a userspace OOM handler may look
at the stats of different memcgs to select a victim based on various
heuristics (e.g. how much private memory will be freed by killing
this). Reading stale stats from before the usage spike in this case
may cause a wrongful OOM kill.
- A proactive reclaimer may read the stats after writing to
memory.reclaim to measure the success of the reclaim operation. Stale
stats from before reclaim may give a false negative.
- Reading the stats of a parent and a child memcg may be inconsistent
(child larger than parent), if the flush doesn't happen when the
parent is read, but happens when the child is read.
As for in-kernel flushers, they will occasionally get stale stats. No
regressions are currently known from this, but if there are regressions,
they would be very difficult to debug and link to the source of the
problem.
This patch aims to fix these problems by restoring subtree flushing, and
removing the unified/coalesced flushing logic that skips flushing if there
is an ongoing flush. This change would introduce a significant regression
with global stats flushing thresholds. With per-memcg stats flushing
thresholds, this seems to perform really well. The thresholds protect the
underlying lock from unnecessary contention.
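The interaction between subtree flushing and per-memcg thresholds can be pictured with a toy model. This is only a sketch under assumptions: the structure `toy_memcg`, the helpers `toy_update()` and `toy_flush()`, and the threshold value are all hypothetical and do not reflect the kernel's actual data structures. Updates are charged to a cgroup and all of its ancestors, and a flush of a given subtree is skipped when that subtree has accumulated too few updates to be worth taking the lock.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_FLUSH_THRESHOLD 64	/* hypothetical value */

struct toy_memcg {
	struct toy_memcg *parent;
	long pending;	/* updates in this subtree since its last flush */
	long flushes;	/* how many times this subtree was flushed */
};

/* Charge n stat updates to cg and every ancestor up to the root. */
static void toy_update(struct toy_memcg *cg, long n)
{
	for (; cg != NULL; cg = cg->parent)
		cg->pending += n;
}

/*
 * Flush only the subtree rooted at cg, and only if enough updates
 * have accumulated there. Returns true if a flush was performed.
 */
static bool toy_flush(struct toy_memcg *cg)
{
	if (cg->pending <= TOY_FLUSH_THRESHOLD)
		return false;	/* cheap path: no lock taken */
	cg->pending = 0;
	cg->flushes++;
	return true;
}
```

In this model a reader of a quiet cgroup returns immediately, while a reader of a busy cgroup flushes just that subtree rather than the whole hierarchy.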
This patch was tested in two ways to ensure the latency of flushing is
up to par, on a machine with 384 cpus:
- A synthetic test with 5000 concurrent workers in 500 cgroups doing
allocations and reclaim, as well as 1000 readers for memory.stat
(variation of [2]). No regressions were noticed in the total runtime.
Note that significant regressions in this test are observed with
global stats thresholds, but not with per-memcg thresholds.
- A synthetic stress test for concurrently reading memcg stats while
memory allocation/freeing workers are running in the background,
provided by Wei Xu [3]. With 250k threads reading the stats every
100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01%
of reads take more than 1ms, and no reads take more than 100ms.
[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
[akpm@linux-foundation.org: fix mm/zswap.c]
[yosryahmed@google.com: remove stats flushing mutex]
Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com
Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-11-29 11:21:53 +08:00
|
|
|
/*
 * Flush stats (and potentially sleep) outside the RCU read section.
 * XXX: With per-memcg flushing and thresholding, is ratelimiting
 * still needed here?
 */
mem_cgroup_flush_stats_ratelimited(eviction_memcg);
|
mm: vmscan: detect file thrashing at the reclaim root
We use refault information to determine whether the cache workingset is
stable or transitioning, and dynamically adjust the inactive:active file
LRU ratio so as to maximize protection from one-off cache during stable
periods, and minimize IO during transitions.
With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect the
local LRU order in the cgroup in which they are occurring. As a result,
cache transitions can take longer in a cgrouped system as the active pages
of sibling cgroups aren't challenged when they should be.
[ Right now, this is somewhat theoretical, because the siblings, under
continued regular reclaim pressure, should eventually run out of
inactive pages - and since inactive:active *size* balancing is also
done on a cgroup-local level, we will challenge the active pages
eventually in most cases. But the next patch will move that relative
size enforcement to the reclaim root as well, and then this patch
here will be necessary to propagate refault pressure to siblings. ]
This patch moves refault detection to the root of reclaim. Instead of
remembering the cgroup owner of an evicted page, remember the cgroup that
caused the reclaim to happen. When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.
I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not just
the local order down inside C.
[hannes@cmpxchg.org: use page_memcg() instead of another lookup]
Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 09:55:59 +08:00
|
|
|
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
|
2016-03-16 05:57:10 +08:00
|
|
|
|
2024-04-01 23:29:44 +08:00
|
|
|
if (lru_gen_enabled()) {
|
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed MGLRU not working very well on certain workflows, as
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
LRU starts from right) of file pages is also larger than total memory.
All file pages are stuck on the oldest generation, getting
read in and then evicted in turn. Despite anon pages being idle,
they never get aged. The PID controller doesn't kick in until there are
some minor access pattern changes. And file pages are not promoted
or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.
So apply it for MGLRU as well. The updated refault-distance model
fits well for MGLRU in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
Some adjustments are made to fit the logic better, and to make the
refault-distance contribute to page tiering and PID refault detection
of MGLRU:
- If a tier-0 page has a qualified refault-distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault-distance, mark it as
active and send it to the youngest gen.
- Increase the reference of every page that has a qualified
refault-distance and increase the PID controlled refault rate
of the updated tier, in the hope that similar pages will be protected
next time upon eviction.
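The placement rules listed above can be summarized as a small decision function. This is a sketch only: the enum, the struct and `toy_place_refault()` are hypothetical names, and leaving a page without a qualified refault-distance in the oldest generation is an assumption made for the illustration, not something stated here.

```c
#include <stdbool.h>

enum toy_gen {
	TOY_OLDEST_GEN,
	TOY_SECOND_OLDEST_GEN,
	TOY_YOUNGEST_GEN,
};

struct toy_decision {
	enum toy_gen gen;	/* generation the refaulted page is sent to */
	bool mark_active;	/* whether the page is marked active */
	bool bump_refs;		/* whether reference / refault rate is raised */
};

static struct toy_decision toy_place_refault(int tier, bool qualified)
{
	/* Assumption: unqualified pages get no special treatment. */
	struct toy_decision d = { TOY_OLDEST_GEN, false, false };

	if (!qualified)
		return d;

	d.bump_refs = true;		/* applies to every qualified page */
	if (tier == 0) {
		d.gen = TOY_SECOND_OLDEST_GEN;
	} else {
		d.gen = TOY_YOUNGEST_GEN;
		d.mark_active = true;
	}
	return d;
}
```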
NOTE: This also changed the meaning of workingset_* fields in
/proc/vmstat, workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking,
workingset_restore_* now stands for all pages promoted by
any reason.
The following benchmark showed a 5x improvement. To simulate the optimized
workflow, I set up a 3-replica MongoDB cluster, each replica in a different
cgroup, using 5 GB of WiredTiger cache and 10 GB of oplog, on a 32G VM with
no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate a slow query and get a stable result.
The test is done on an EPYC 7K62 with 32G RAM and a SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
The performance is 5x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement.
There is no regression on other tests so far, and a performance gain
is observed on file page heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
|
|
|
bool recent;
refault_distance = lru_distance(eviction_lruvec, file, eviction,
                                LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
recent = lru_gen_test_recent(eviction_lruvec, file, refault_distance);
|
2024-04-01 23:29:44 +08:00
|
|
|
mem_cgroup_put(eviction_memcg);
return recent;
}
|
|
|
|
|
2024-04-01 19:50:55 +08:00
|
|
|
refault_distance = lru_distance(eviction_lruvec, file,
                                eviction, EVICTION_BITS, bucket_order);
|
2016-03-16 05:57:10 +08:00
|
|
|
|
2024-04-01 17:43:25 +08:00
|
|
|
if (tracking)
        workingset_refault_track(eviction_lruvec, refault_distance);
|
|
|
|
|
mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on
system performance, as well as the ability to balance pressure more
intelligently between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.
Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
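To make the idea of storing the bit in the shadow entry concrete, here is a minimal packing sketch. The field widths (`TOY_NODE_BITS`, `TOY_MEMCG_BITS`) and the function names are made up for the illustration and are not the kernel's actual layout; the point is only that one extra low bit travels with the eviction counter and can be recovered at refault time.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NODE_BITS	6	/* hypothetical width */
#define TOY_MEMCG_BITS	16	/* hypothetical width */

/* Pack eviction counter, memcg id, node id and the workingset bit. */
static uint64_t toy_pack_shadow(unsigned int memcg_id, unsigned int node,
				uint64_t eviction, bool workingset)
{
	eviction = (eviction << TOY_MEMCG_BITS) | memcg_id;
	eviction = (eviction << TOY_NODE_BITS) | node;
	eviction = (eviction << 1) | (workingset ? 1 : 0);
	return eviction;
}

/* Reverse of toy_pack_shadow(): peel fields off from the low bits. */
static void toy_unpack_shadow(uint64_t shadow, unsigned int *memcg_id,
			      unsigned int *node, uint64_t *eviction,
			      bool *workingset)
{
	*workingset = shadow & 1;
	shadow >>= 1;
	*node = shadow & ((1u << TOY_NODE_BITS) - 1);
	shadow >>= TOY_NODE_BITS;
	*memcg_id = shadow & ((1u << TOY_MEMCG_BITS) - 1);
	shadow >>= TOY_MEMCG_BITS;
	*eviction = shadow;
}
```

On refault, a set workingset bit would classify the refault as thrashing, and a clear bit as a transition.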
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
|
|
|
/*
|
|
|
|
* Compare the distance to the existing workingset size. We
|
2020-06-04 07:02:43 +08:00
|
|
|
* don't activate pages that couldn't stay resident even if
|
2020-08-12 09:30:50 +08:00
|
|
|
* all the memory was available to the workingset. Whether
|
|
|
|
* workingset competition needs to consider anon or not depends
|
2023-04-13 16:34:49 +08:00
|
|
|
* on having free swap space.
|
2018-10-27 06:06:04 +08:00
|
|
|
*/
|
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based file page
protection for multi-gen LRU.
Currently, refault distance re-activation for the active/inactive LRU can
help keep working set pages in memory. It works by estimating the refault
(re-access) distance of a page; if it is small enough, the page is put
on the active LRU instead of the inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true: an activated
page could be anywhere in the LRU list (through mark_page_accessed), so it
only left-shifts the pages on its right side.
Besides, one page can get activated/deactivated multiple times.
And multi-gen LRU doesn't fit this model well: pages are aged
in generations, and are promoted frequently between generations.
So instead we introduce a simpler idea here: just presume the evicted
pages are still in memory, each with a corresponding eviction timestamp
(nonresident_age) that is increased and recorded upon each eviction.
These timestamps could logically form a "Shadow LRU", a read-only
imaginary LRU. Let `nonresident_age` be NA; then we have:
Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
                          +-memory available to cache-+
                          |                           |
+-------------------------+===============+===========+
| *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
+-+-----------------------+===============+===========+
  |                       |
  +-----------------------+
  |         SP
fault page          O -> Hole left by refaulted in pages.
                         Entries are supposed to be removed
                         upon access but this is not a real
                         LRU so can't really update it.
                    * -> The page corresponding to SP
It can easily be seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workflow could push
a page out of available memory. Since every evicted page was once at the
head of the INACTIVE list, the estimated minimum value of the refault
distance is:
SP + NR_INACTIVE
On refault, the page *may* get activated and stay in memory if we put
it on the active LRU when:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
Then the page is worth re-activating to start from the active LRU,
since the access distance is smaller than the total memory.
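The estimate above can be sketched in a few lines. The names and the counter width (`TOY_EVICTION_BITS`) are assumptions for the illustration; the sketch also assumes the reading stored at eviction was truncated to that width, so the subtraction is masked to survive counter wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_EVICTION_BITS 20	/* hypothetical width of the stored reading */

/* SP = (NA's reading now) - (NA's reading at eviction), modulo the width. */
static uint64_t toy_shadow_position(uint64_t na_now, uint64_t na_at_eviction)
{
	const uint64_t mask = (UINT64_C(1) << TOY_EVICTION_BITS) - 1;

	return (na_now - na_at_eviction) & mask;
}

/* The unsimplified check: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE. */
static bool toy_fits_in_memory(uint64_t sp, uint64_t nr_inactive,
			       uint64_t nr_active)
{
	return sp + nr_inactive < nr_inactive + nr_active;
}
```

Because NR_INACTIVE appears on both sides, `toy_fits_in_memory()` gives the same answer as the simplified comparison of SP against NR_ACTIVE.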
And since this is only an estimation based on several hypotheses, it
could break the ability of the LRU to distinguish a workingset from
caches. In extreme cases, letting every refault cause an activation will
lead to worse thrashing, so throttle this by two factors:
1. Previously refaulted pages may leave "holes" in the shadow
part of the LRU. That part is left unhandled on purpose to decrease
the re-activation rate for pages that have a large SP value (the larger
the SP value of a page, the more likely it is to be affected by such
holes).
2. When the active LRU is long enough, challenging active pages
by re-activating a previously evicted/inactive page that was accessed
only once may not be a good idea, so throttle the re-activation when
NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noting is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion (possibly with slightly
higher hotness) of the caches in memory instead of letting caches get
evicted in turn due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttle re-activation when NR_ACTIVE is high.
This side effect is actually similar to that of the previous algorithm,
which introduced such an effect by increasing nonresident_age in extra
call paths, throttling re-activation when activation/re-activation is
happening massively.
Combining all of the above, we have the following simple rules:
Upon refault, mark the page as active if the condition for the current case is met:
- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE
- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE
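The two rules above collapse into a comparison against the smaller of the two list sizes. A minimal sketch, with `toy_should_activate()` being a hypothetical name rather than a function from this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Active LRU low  (NR_ACTIVE <  NR_INACTIVE): compare SP with NR_ACTIVE.
 * Active LRU high (NR_ACTIVE >= NR_INACTIVE): compare SP with NR_INACTIVE.
 * Either way the limit is min(NR_ACTIVE, NR_INACTIVE).
 */
static bool toy_should_activate(uint64_t sp, uint64_t nr_active,
				uint64_t nr_inactive)
{
	uint64_t limit = nr_active < nr_inactive ? nr_active : nr_inactive;

	return sp < limit;
}
```

For example, with a long active list (100 pages) and a short inactive list (10 pages), a page with SP = 50 is not re-activated even though SP is below NR_ACTIVE, which is the throttling effect described above.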
Code-wise, this is simpler than before since there is no longer any need
to update lruvec workingset data when activating a page, and so far, a few
benchmarks show a similar or better result under memory pressure. The
performance should also be better when there is no memory pressure since
some memcg iteration and atomic operations are no longer needed.
When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.
Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (average of 6 test runs):
fio: IOPS=5137.5k (+2.3%)
fio2: IOPS=7300.67k (+1.3%)
memcached: 57878.422 ops/s (+4.8%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (+2.0%)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
|
|
|
active = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
inactive = lruvec_page_state(eviction_lruvec, NR_INACTIVE_FILE);
|
|
|
|
|
2023-01-05 06:29:44 +08:00
|
|
|
if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
|
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based file page
protection for multi-gen LRU.
Currently, refault distance re-activation for active/inactive can help
keep working set pages in memory, it works by estimating the refault
(re-access) distance of a page, if it's small enough, then put it
on active LRU instead of inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true, an activated
page could be anywhere in the LRU list (through mark_page_accessed), it
only left-shift the pages on its right side.
And besides, one page can get activate/deactivated for multiple times.
And multi-gen LRU doesn't fit with this model well, pages are getting
aged in generations, and getting promoted frequently between generations.
So instead we introduce a simpler idea here: Just presume the evicted
pages are still in memory, each has an corresponding eviction timestamp
(nonresistence_age) that is increased and recorded upon each eviction.
These timestamp could logically form a "Shadow LRU", a read-only
imaginary LRU. Let the `nonresistence_age` still be NA, then we have:
Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
+-memory available to cache-+
| |
+-------------------------+===============+===========+
| * shadows O O O | INACTIVE | ACTIVE |
+-+-----------------------+===============+===========+
| |
+-----------------------+
| SP
fault page O -> Hole left by refaulted in pages.
Entries are suppose to be removed
upon access but this is not a real
LRU so can't really update it.
* -> The page corresponding to SP
It can be easily seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workflow could push
a page out of available memory. Since all evicted page was once head
of INACTIVE list, the estimated minimum value of refault distance is:
SP + NR_INACTIVE
On refault, the page *may* get activated and stay in memory if we put
it to active LRU if:
SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
SP < NR_ACTIVE
Then the page is worth getting re-activated to start from active LRU,
since the access distance is smaller than the total memory.
And since this is only an estimation, based on several hypotheses, and
it could break the ability of LRU to distinguish a workingset out of
caches, in extreme cases all refault causing activation will lead to
worse thrashing, so throttle this by two factors:
1. Notice previously re-faulted in pages may leave "holes" on the shadow
part of LRU, that part is left unhandled on purpose to decrease
re-activate rate for pages that have a large SP value (the larger
SP value a page has, the more likely it will be affected by such
holes).
2. When the active LRU is long enough, chanllaging active pages
by re-activating a one-time access previously evicted/inactive page
may not be a good idea, so throttle the re-activation when
NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion (possible have slightly
higher hotness) of the caches in memory instead of letting caches get
evicted permutably due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttled reactivation when NR_ACTIVE is high.
This side effect is actually similar with the algoritm before, which
introduce such effect by increasing nonresistence_age in extra call
paths, trottled the re-activation when activition/reactivation is
massively happenning.
Combined all above, we have following simple rules:
Upon refault, if any of following conditions is met, mark page as active:
- If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
SP < NR_ACTIVE
- If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
SP < NR_INACTIVE
Code-wise, this is simpler than before since no longer need to do lruvec
workingset data update when activating a page, and so far, a few benchmarks
shows a similar or better result under memore pressure. The performance
should also be better when there is no memory pressure since some memcg
iteration and atomic operation is no longer needed.
When combined with multi-gen LRU (in later commits) it shows a measurable
performance gain for some workloads.
Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (average of 6 test runs):
fio: IOPS=5137.5k (+2.3%)
fio2: IOPS=7300.67k (+1.3%)
memcached: 57878.422 ops/s (+4.8%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (+2.0%)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
		active += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON);
		inactive += lruvec_page_state(eviction_lruvec, NR_INACTIVE_ANON);
2020-06-04 07:02:43 +08:00
	}
workingset: refactor LRU refault to expose refault recency check
Patch series "cachestat: a new syscall for page cache state of files",
v13.
There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or direct
  table queries based on the in-memory cache state of the index.
* Visibility into the writeback algorithm, for diagnosing performance
  issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page cache
  (and IO to be done) within a range of a file, allowing for more
  frequent syncing when and where there is IO capacity, and batching
  when there is not.
* Computing memory usage of large files/directory trees, analogous to
  the du tool for disk usage.
More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for the x86 architecture.
This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, issues) as mincore. Relevant
links:
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. Users can choose between
mincore or cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, five runs
each:
Using cachestat:
  real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
  user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
  sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
Using mincore:
  real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
  user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
  sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
I also ran both syscalls on a 2TB sparse file:
Using cachestat:
  real 0m0.009s
  user 0m0.000s
  sys 0m0.009s
Using mincore:
  real 0m37.510s
  user 0m2.934s
  sys 0m34.558s
Very large files like this are the pathological case for mincore. In
fact, to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.
Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.
The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows for a 64-bit length argument, even on 32-bit architectures
(which was previously not possible due to the limit on the number of
syscall arguments). Furthermore, it eliminates the need for compatibility
handling - every user can use the same ABI.
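A minimal userspace sketch of calling the syscall through syscall(2) may help make the argument layout concrete. The struct and function names prefixed with sketch_ are hypothetical local mirrors of the UAPI definitions, not names from this series; the fallback number 451 is the value assigned on x86 and in the generic syscall table, and other architectures may differ. Passing the range as a struct of two 64-bit fields is what lets 32-bit callers use a 64-bit length.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for older headers (x86 and generic syscall table value). */
#ifndef __NR_cachestat
#define __NR_cachestat 451
#endif

/* Hypothetical local mirror of the range argument: two 64-bit fields. */
struct sketch_cachestat_range {
	uint64_t off;	/* start offset in bytes */
	uint64_t len;	/* length in bytes; 0 queries up to end of file */
};

/* Hypothetical local mirror of the result struct, counts in pages. */
struct sketch_cachestat {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Query page cache state for a byte range of an open file.
 * Returns 0 on success, -1 with errno set on failure (for example
 * ENOSYS on kernels that do not provide the syscall).
 */
static int sketch_cachestat(int fd, uint64_t off, uint64_t len,
			    struct sketch_cachestat *out)
{
	struct sketch_cachestat_range range = { .off = off, .len = len };

	return (int)syscall(__NR_cachestat, fd, &range, out, 0u);
}
```

A caller would open the file read-only, call sketch_cachestat(fd, 0, 0, &cs) for the whole file, and treat an ENOSYS failure as "fall back to mincore".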
This patch (of 4):
In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.
[penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
2023-11-29 11:21:52 +08:00
	mem_cgroup_put(eviction_memcg);
emm: workingset: simplify and use a more intuitive model
Upstream: pending
This removes workingset_activation and reduces the number of calls to
workingset_age_nonresident.
The idea behind this change is a new way to calculate the refault
distance, which also prepares for adapting refault-distance-based file
page protection to multi-gen LRU.
Currently, refault-distance re-activation for the active/inactive LRU helps
keep working set pages in memory. It works by estimating the refault
(re-access) distance of a page; if the distance is small enough, the page
is put on the active LRU instead of the inactive LRU.
The estimation, as described in mm/workingset.c, is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
   that the LRU starts from the right).
2. Eviction of an inactive page will left-shift LRU pages.
Assumption 2 is correct, but assumption 1 is not always true: an activated
page could be anywhere in the LRU list (through mark_page_accessed), and it
only left-shifts the pages on its right side.
Besides, one page can get activated and deactivated multiple times.
Multi-gen LRU also doesn't fit this model well: pages are aged in
generations and are promoted frequently between generations.
So instead we introduce a simpler idea here: just presume the evicted
pages are still in memory, each with a corresponding eviction timestamp
(nonresident_age) that is increased and recorded upon each eviction.
These timestamps logically form a "Shadow LRU", a read-only imaginary LRU.
Let `nonresident_age` still be NA, then we have:
  Let SP = ((NA's reading @ current) - (NA's reading @ eviction))
                            +-memory available to cache-+
                            |                           |
  +-------------------------+===============+===========+
  | *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
  +-+-----------------------+===============+===========+
    |                       |
    +-----------------------+
    |         SP
  fault page            O -> Hole left by refaulted-in pages.
                             Entries are supposed to be removed
                             upon access but this is not a real
                             LRU so can't really update it.
                        * -> The page corresponding to SP
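The SP definition above can be sketched in C as a masked subtraction of two counter readings. This is only an illustration under stated assumptions: the function name and the 40-bit counter width are hypothetical, chosen to show that the subtraction still works after the eviction counter wraps around.

```c
#include <stdint.h>

/*
 * Hypothetical width of the eviction counter stored in a shadow entry.
 * The real width depends on how many bits remain after other fields
 * are packed into the entry.
 */
#define SKETCH_EVICTION_BITS 40
#define SKETCH_EVICTION_MASK ((UINT64_C(1) << SKETCH_EVICTION_BITS) - 1)

/*
 * SP = (NA's reading now) - (NA's reading at eviction).
 * Both readings are truncated to the counter width, so the result is
 * taken modulo 2^SKETCH_EVICTION_BITS to survive counter wraparound.
 */
static uint64_t sketch_shadow_position(uint64_t na_now,
				       uint64_t na_at_eviction)
{
	return (na_now - na_at_eviction) & SKETCH_EVICTION_MASK;
}
```

For example, a page evicted when NA read 400 and refaulting when NA reads 1000 has SP = 600: 600 other evictions happened in between.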
It can easily be seen that SP stands for the offset of a page in the
imaginary LRU, which is also how far the current workload could push
a page out of available memory. Since every evicted page was once at the
head of the INACTIVE list, the estimated minimum refault distance is:
  SP + NR_INACTIVE
On refault, the page *may* stay in memory if we put it on the active
LRU, provided that:
  SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
Which can be simplified to:
  SP < NR_ACTIVE
In that case the page is worth re-activating so that it starts from the
active LRU, since the access distance is smaller than the total memory.
Since this is only an estimation based on several hypotheses, it could
break the ability of the LRU to distinguish a workingset from caches. In
extreme cases, letting every refault cause an activation will lead to
worse thrashing, so throttle this by two factors:
1. Previously refaulted-in pages may leave "holes" in the shadow part of
   the LRU. That part is left unhandled on purpose to decrease the
   re-activation rate for pages that have a large SP value (the larger
   the SP value of a page, the more likely it is affected by such
   holes).
2. When the active LRU is long enough, challenging active pages by
   re-activating a previously evicted/inactive page that was accessed
   only once may not be a good idea, so throttle the re-activation when
   NR_ACTIVE > NR_INACTIVE by comparing with NR_INACTIVE instead.
Another effect of the refault activation throttling worth noticing is that,
when the cache size is larger than total memory and hotness is similar
among all cache pages, it can help hold a portion of the caches (possibly
with slightly higher hotness) in memory instead of letting caches get
evicted in turn due to the nature of LRU.
That's because the established workingset (active LRU) will tend to stay
since we throttle re-activation when NR_ACTIVE is high.
This side effect is actually similar to that of the previous algorithm,
which introduced it by increasing nonresident_age in extra call paths,
throttling re-activation when activation/re-activation happens massively.
Combining all of the above, we have the following simple rules:
Upon refault, mark the page as active if the condition matching the
current LRU state is met:
- If the active LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
  SP < NR_ACTIVE
- If the active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
  SP < NR_INACTIVE
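The two rules collapse into a single comparison against the smaller of the two lists, which matches the `refault_distance < min(active, inactive)` check in the code. A standalone sketch, with hypothetical function names and plain page counts standing in for the lruvec statistics:

```c
#include <stdbool.h>

/* Smaller of two page counts. */
static unsigned long sketch_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Decide whether a refaulting page should start on the active LRU.
 * sp is the shadow position; nr_active and nr_inactive are the LRU
 * sizes in pages.
 */
static bool sketch_should_activate(unsigned long sp,
				   unsigned long nr_active,
				   unsigned long nr_inactive)
{
	/*
	 * NR_ACTIVE <  NR_INACTIVE: compare against NR_ACTIVE.
	 * NR_ACTIVE >= NR_INACTIVE: compare against NR_INACTIVE.
	 * Both cases reduce to comparing against the smaller list.
	 */
	return sp < sketch_min(nr_active, nr_inactive);
}
```

With 100 active and 400 inactive pages, a page with SP = 50 is activated and one with SP = 150 is not; with 400 active and 200 inactive pages, the threshold becomes 200.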
Code-wise, this is simpler than before since there is no longer any need to
update lruvec workingset data when activating a page, and so far a few
benchmarks show similar or better results under memory pressure. Performance
should also be better when there is no memory pressure, since some memcg
iteration and atomic operations are no longer needed.
When combined with multi-gen LRU (in later commits), it shows a measurable
performance gain for some workloads.
Test results using the memtier and fio tests from commit ac35a4902374, scaled
down to fit my test environment, plus some other tests:
memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700):
memcached -u nobody -m 16384 -s /tmp/memcached.socket \
-a 0766 -t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
--key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
mysql (using oltp_read_only from sysbench, with 12G of buffer pool
in a 10G memcg):
sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
--tables=36 --table-size=2000000 --threads=12 --time=1800
kernel build test done with 3G memcg limit on an i7-9700.
Before (average of 6 test runs):
fio: IOPS=5125.5k
fio2: IOPS=7291.16k
memcached: 57600.926 ops/s
mysql: 6280.08 tps
kernel-build: 1817.13499 seconds
After (average of 6 test runs):
fio: IOPS=5137.5k (+2.3%)
fio2: IOPS=7300.67k (+1.3%)
memcached: 57878.422 ops/s (+4.8%)
mysql: 6312.06 tps (+0.5%)
kernel-build: 1813.66231 seconds (+2.0%)
Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00

	/*
	 * When there are already enough active pages, be less aggressive
	 * on reactivating pages: challenging a large set of established
	 * active pages with a one-time refaulted page may not be a good idea.
	 */
	return refault_distance < min(active, inactive);
workingset: refactor LRU refault to expose refault recency check
2023-05-03 09:36:06 +08:00
}

/**
 * workingset_refault - Evaluate the refault of a previously evicted folio.
 * @folio: The freshly allocated replacement folio.
 * @shadow: Shadow entry of the evicted folio.
 *
 * Calculates and evaluates the refault distance of the previously
 * evicted folio in the context of the node and the memcg whose memory
 * pressure caused the eviction.
 */
void workingset_refault(struct folio *folio, void *shadow)
{
	bool file = folio_is_file_lru(folio);
	struct pglist_data *pgdat;
	struct mem_cgroup *memcg;
	struct lruvec *lruvec;
	bool workingset;
	long nr;

	/*
	 * The activation decision for this folio is made at the level
	 * where the eviction occurred, as that is where the LRU order
	 * during folio reclaim is being determined.
	 *
	 * However, the cgroup that will own the folio is the one that
2023-11-29 11:21:52 +08:00
	 * is actually experiencing the refault event. Make sure the folio is
	 * locked to guarantee folio_memcg() stability throughout.
workingset: refactor LRU refault to expose refault recency check
2023-05-03 09:36:06 +08:00
	 */
2023-11-29 11:21:52 +08:00
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
2024-04-02 00:03:11 +08:00

	if (lru_gen_enabled()) {
		lru_gen_refault(folio, shadow);
		return;
	}

workingset: refactor LRU refault to expose refault recency check
2023-05-03 09:36:06 +08:00
	nr = folio_nr_pages(folio);
	memcg = folio_memcg(folio);
	pgdat = folio_pgdat(folio);
	lruvec = mem_cgroup_lruvec(memcg, pgdat);

	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);

2024-04-01 17:43:25 +08:00
	if (!workingset_test_recent(shadow, file, &workingset, true))
2023-11-29 11:21:52 +08:00
		return;
mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on
system performance, as well as the ability to balance pressure more
smartly between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.
Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
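Storing the bit in the shadow entry can be pictured as ordinary bit packing. The sketch below is an assumption-laden illustration, not the kernel's actual layout: it keeps only an eviction counter and a single "was active" bit in the lowest position, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: [ eviction counter | 1 workingset bit ]. */
#define SKETCH_WORKINGSET_SHIFT 1

/* Pack an eviction counter and the "was active" bit into one word. */
static uint64_t sketch_pack_shadow(uint64_t eviction, bool workingset)
{
	return (eviction << SKETCH_WORKINGSET_SHIFT) |
	       (workingset ? UINT64_C(1) : UINT64_C(0));
}

/* Recover both fields when the page refaults. */
static void sketch_unpack_shadow(uint64_t shadow, uint64_t *eviction,
				 bool *workingset)
{
	*workingset = (shadow & UINT64_C(1)) != 0;
	*eviction = shadow >> SKETCH_WORKINGSET_SHIFT;
}
```

On refault, a set bit would classify the event as thrashing of an established workingset, while a clear bit would point to a workingset transition.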
How many page->flags does this leave us with on 32-bit?
	20 bits are always page flags
	21 if you have an MMU
	23 with the zone bits for DMA, Normal, HighMem, Movable
	29 with the sparsemem section bits
	30 if PAE is enabled
	31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
2021-04-29 22:27:16 +08:00
	folio_set_active(folio);
	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
mm: workingset: tell cache transitions from workingset thrashing
2018-10-27 06:06:04 +08:00
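The commit message above describes storing one extra bit in the shadow entry so that a later refault can be classified. A minimal userspace sketch of that idea follows. All `sketch_` names and the field widths are hypothetical, chosen for illustration, and are not the kernel's own helpers; the real code also weighs refault distance, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field width; the kernel derives its own from the config. */
#define SKETCH_NODE_BITS 6u
#define SKETCH_NODE_MASK ((1u << SKETCH_NODE_BITS) - 1u)

/* Pack an eviction timestamp, a NUMA node id and the workingset bit. */
static uint64_t sketch_pack_shadow(uint64_t eviction, unsigned int node,
				   bool workingset)
{
	assert(node <= SKETCH_NODE_MASK);
	/* The timestamp must leave room for the fields shifted in below. */
	assert((eviction >> (64u - SKETCH_NODE_BITS - 1u)) == 0);

	eviction = (eviction << SKETCH_NODE_BITS) | node;
	eviction = (eviction << 1) | (workingset ? 1u : 0u);
	return eviction;
}

/* Reverse of sketch_pack_shadow(); fields come out in the opposite order. */
static void sketch_unpack_shadow(uint64_t shadow, uint64_t *eviction,
				 unsigned int *node, bool *workingset)
{
	*workingset = (shadow & 1u) != 0;
	shadow >>= 1;
	*node = (unsigned int)(shadow & SKETCH_NODE_MASK);
	*eviction = shadow >> SKETCH_NODE_BITS;
}

/*
 * Simplified classification: a refault of a page that had been active
 * before eviction is treated as thrashing, anything else as a transition.
 */
static bool sketch_refault_is_thrashing(uint64_t shadow)
{
	uint64_t eviction;
	unsigned int node;
	bool workingset;

	sketch_unpack_shadow(shadow, &eviction, &node, &workingset);
	return workingset;
}
```

Packing the flag into the lowest bit keeps the unpack path a plain mask and shift, at the cost of one bit of timestamp range.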
|
|
|
|
2021-04-29 22:27:16 +08:00
|
|
|
/* Folio was active prior to eviction */
|
2018-10-27 06:06:04 +08:00
|
|
|
if (workingset) {
|
2021-04-29 22:27:16 +08:00
|
|
|
folio_set_workingset(folio);
|
2022-11-02 01:53:26 +08:00
|
|
|
/*
 * XXX: Move to folio_add_lru() when it supports new vs
 * putback
 */
|
mm: vmscan: make rotations a secondary factor in balancing anon vs file
We noticed a 2% webserver throughput regression after upgrading from 5.6.
This could be tracked down to a shift in the anon/file reclaim balance
(confirmed with swappiness) that resulted in worse reclaim efficiency and
thus more kswapd activity for the same outcome.
The change that exposed the problem is aae466b0052e ("mm/swap: implement
workingset detection for anonymous LRU"). By qualifying swapins based on
their refault distance, it lowered the cost of anon reclaim in this
workload, in turn causing (much) more anon scanning than before. Scanning
the anon list is more expensive due to the higher ratio of mmapped pages
that may rotate during reclaim, and so the result was an increase in %sys
time.
Right now, rotations aren't considered a cost when balancing scan pressure
between LRUs. We can end up with very few file refaults putting all the
scan pressure on hot anon pages that are rotated en masse, don't get
reclaimed, and never push back on the file LRU again. We still only
reclaim file cache in that case, but we burn a lot of CPU rotating anon
pages. It's "fair" from an LRU age POV, but doesn't reflect the real cost
it imposes on the system.
Consider rotations as a secondary factor in balancing the LRUs. This
doesn't attempt to make a precise comparison between IO cost and CPU cost,
it just says: if reloads are about comparable between the lists, or
rotations are overwhelmingly different, adjust for CPU work.
This fixed the regression on our webservers. It has since been deployed
to the entire Meta fleet and hasn't caused any problems.
Link: https://lkml.kernel.org/r/20221013193113.726425-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-14 03:31:13 +08:00
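The commit message above argues that rotations should count as a secondary cost next to IO when balancing the LRUs. A rough sketch of such a weighting is shown below. The weight `SKETCH_IO_WEIGHT`, the struct, and both functions are assumptions made for illustration; the proportional split in `sketch_anon_scan_percent()` is a deliberate simplification, not the kernel's scan-balancing formula.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical weight: how many rotations one IO is taken to be worth. */
#define SKETCH_IO_WEIGHT 32ul

struct sketch_lru_cost {
	unsigned long anon_cost;
	unsigned long file_cost;
};

/* Charge reclaim cost to one list; IO dominates, rotations break ties. */
static void sketch_note_cost(struct sketch_lru_cost *c, int file,
			     unsigned long nr_io, unsigned long nr_rotated)
{
	unsigned long cost = nr_io * SKETCH_IO_WEIGHT + nr_rotated;

	assert(c != NULL);
	if (file)
		c->file_cost += cost;
	else
		c->anon_cost += cost;
}

/*
 * Share of scan pressure (in percent) directed at the anon list.
 * The costlier a list has been to reclaim, the less pressure it gets.
 */
static unsigned int sketch_anon_scan_percent(const struct sketch_lru_cost *c)
{
	unsigned long total = c->anon_cost + c->file_cost;

	if (total == 0)
		return 50;	/* no cost recorded yet: split evenly */
	return (unsigned int)(c->file_cost * 100 / total);
}
```

With this weighting, a handful of file refaults no longer outweighs hundreds of anon rotations, which is the imbalance the commit message describes.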
|
|
|
lru_note_cost_refault(folio);
|
2021-04-29 22:27:16 +08:00
|
|
|
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
|
2014-04-04 05:47:51 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
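Point 2 of the commit message above stores a shadow counter in the unused upper bits of the entry count. The sketch below illustrates that packing, assuming 64 slots per node as the message states. The `SKETCH_` macros and helper names are hypothetical and only meant to show the arithmetic.

```c
#include <assert.h>

#define SKETCH_MAP_SHIFT   6u				/* 64 slots per node */
#define SKETCH_COUNT_SHIFT (SKETCH_MAP_SHIFT + 1u)	/* holds 0..64 */
#define SKETCH_COUNT_MASK  ((1u << SKETCH_COUNT_SHIFT) - 1u)

/* Regular entries live in the low bits of the combined counter. */
static unsigned int sketch_entries(unsigned int count)
{
	return count & SKETCH_COUNT_MASK;
}

/* Shadow entries live in the bits above them. */
static unsigned int sketch_shadows(unsigned int count)
{
	return count >> SKETCH_COUNT_SHIFT;
}

static unsigned int sketch_add_entry(unsigned int count)
{
	assert(sketch_entries(count) < (1u << SKETCH_MAP_SHIFT));
	return count + 1u;
}

static unsigned int sketch_add_shadow(unsigned int count)
{
	assert(sketch_shadows(count) < (1u << SKETCH_MAP_SHIFT));
	return count + (1u << SKETCH_COUNT_SHIFT);
}

/* A node is a shadow-LRU candidate when it holds shadows and nothing else. */
static int sketch_only_shadows(unsigned int count)
{
	return sketch_entries(count) == 0 && sketch_shadows(count) > 0;
}
```

Because both counters share one word, the "only shadows left" check is a mask and a compare, with no extra field in the node.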
|
|
|
/*
 * Shadow entries reflect the share of the working set that does not
 * fit into memory, so their number depends on the access pattern of
 * the workload. In most cases, they will refault or get reclaimed
 * along with the inode, but a (malicious) workload that streams
 * through files with a total size several times that of available
 * memory, while preventing the inodes from being reclaimed, can
 * create excessive amounts of shadow nodes. To keep a lid on this,
 * track shadow nodes and reclaim them when they grow way past the
 * point where they would still be useful.
 */
|
|
|
|
|
2022-03-23 05:41:12 +08:00
|
|
|
struct list_lru shadow_nodes;
|
2016-12-13 08:43:52 +08:00
|
|
|
|
2017-11-25 03:24:59 +08:00
|
|
|
void workingset_update_node(struct xa_node *node)
|
2016-12-13 08:43:52 +08:00
|
|
|
{
|
2022-03-23 05:45:50 +08:00
|
|
|
struct address_space *mapping;
|
|
|
|
|
2016-12-13 08:43:52 +08:00
|
|
|
/*
|
|
|
|
* Track non-empty nodes that contain only shadow entries;
|
|
|
|
* unlink those that contain pages or are being freed.
|
|
|
|
*
|
|
|
|
* Avoid acquiring the list_lru lock when the nodes are
|
|
|
|
* already where they should be. The list_empty() test is safe
|
2018-04-11 07:36:56 +08:00
|
|
|
* as node->private_list is protected by the i_pages lock.
|
2016-12-13 08:43:52 +08:00
|
|
|
*/
|
2022-03-23 05:45:50 +08:00
|
|
|
mapping = container_of(node->array, struct address_space, i_pages);
|
|
|
|
lockdep_assert_held(&mapping->i_pages.xa_lock);
|
2018-10-27 06:06:39 +08:00
|
|
|
|
2017-11-09 22:23:56 +08:00
|
|
|
if (node->count && node->count == node->nr_values) {
|
2018-10-27 06:06:39 +08:00
|
|
|
if (list_empty(&node->private_list)) {
|
2016-12-13 08:43:52 +08:00
|
|
|
list_lru_add(&shadow_nodes, &node->private_list);
|
2020-12-15 11:07:04 +08:00
|
|
|
__inc_lruvec_kmem_state(node, WORKINGSET_NODES);
|
2018-10-27 06:06:39 +08:00
|
|
|
}
|
2016-12-13 08:43:52 +08:00
|
|
|
} else {
|
2018-10-27 06:06:39 +08:00
|
|
|
if (!list_empty(&node->private_list)) {
|
2016-12-13 08:43:52 +08:00
|
|
|
list_lru_del(&shadow_nodes, &node->private_list);
|
2020-12-15 11:07:04 +08:00
|
|
|
__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
|
2018-10-27 06:06:39 +08:00
|
|
|
}
|
2016-12-13 08:43:52 +08:00
|
|
|
}
|
|
|
|
}
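The tracking rule in `workingset_update_node()` above can be modelled in userspace to make the state transitions easier to follow. This is a simplified sketch: `struct sketch_node`, the `on_list` flag and the global counter are stand-ins invented here for the list membership and the `WORKINGSET_NODES` statistic, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_node {
	unsigned int count;	/* all entries in the node */
	unsigned int nr_values;	/* entries that are shadow values */
	bool on_list;		/* stand-in for !list_empty(&private_list) */
};

/* Stand-in for the WORKINGSET_NODES counter. */
static unsigned long sketch_tracked_nodes;

static void sketch_update_node(struct sketch_node *node)
{
	/* Track non-empty nodes that contain only shadow entries. */
	bool only_shadows = node->count && node->count == node->nr_values;

	/* Nothing to do when the node is already where it should be. */
	if (only_shadows == node->on_list)
		return;

	node->on_list = only_shadows;
	if (only_shadows) {
		sketch_tracked_nodes++;
	} else {
		assert(sketch_tracked_nodes > 0);
		sketch_tracked_nodes--;
	}
}
```

Calling the function repeatedly on an unchanged node is a no-op, which mirrors the early `list_empty()` checks that avoid taking the list lock.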
|
2014-04-04 05:47:56 +08:00
|
|
|
|
|
|
|
static unsigned long count_shadow_nodes(struct shrinker *shrinker,
					struct shrink_control *sc)
{
unsigned long max_nodes;
|
2016-12-13 08:43:52 +08:00
|
|
|
unsigned long nodes;
|
mm: workingset: don't drop refault information prematurely
Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
Overview
PSI reports the overall wallclock time in which the tasks in a system (or
cgroup) wait for (contended) hardware resources.
This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid; as well as anticipate major disruptions like OOM.
Real-world applications
We're using the data collected by PSI (and its previous incarnation,
memdelay) quite extensively at Facebook, and with several success stories.
One usecase is avoiding OOM hangs/livelocks. These happen
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables.
There is no situation where this ever makes sense in practice. We wrote a
<100 line POC python script to monitor memory pressure and kill stuff way
before such pathological thrashing leads to full system losses that would
require forcible hard resets.
We've since extended and deployed this code into other places to guarantee
latency and throughput SLAs, since they're usually violated way before the
kernel OOM killer would ever kick in.
It is available here: https://github.com/facebookincubator/oomd
Eventually we probably want to trigger the in-kernel OOM killer based on
extreme sustained pressure as well, so that Linux can avoid memory
livelocks - which technically aren't deadlocks, but are
indistinguishable from them to the user - out of the box. We'd continue using OOMD as
the first line of defense to ensure workload health and implement complex
kill policies that are beyond the scope of the kernel.
We also use PSI memory pressure for loadshedding. Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success. We switched it to PSI
and managed to anticipate and avoid OOM kills and lockups fairly reliably.
The reduction of OOM outages in the worker pool raised the pool's
aggregate productivity, and we were able to switch that service to smaller
machines.
Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as well as
to prevent multiple workloads on a machine from stepping on each other's
toes. We were not able to configure this properly without the pressure
metrics; we would see latency or bandwidth drops, but it would often be
hard to impossible to rootcause it post-mortem.
We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.
PSI has also received testing, feedback, and feature requests from Android
and EndlessOS for the purpose of low-latency OOM killing, to intervene in
pressure situations before the UI starts hanging.
How do you use this feature?
A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.
The cpu file contains one line:
some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
The averages give the percentage of walltime in which one or more tasks
are delayed on the runqueue while another task has the CPU. They're
recent averages over 10s, 1m, 5m windows, so you can tell short term
trends from long term ones, similarly to the load average.
The total= value gives the absolute stall time in microseconds. This
allows detecting latency spikes that might be too short to sway the
running averages. It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
future hardware).
What to make of this "some" metric? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable tasks
per CPU, the system is 100% overcommitted and the pressure average will
indicate as much. From a utilization perspective this is a great state of
course: no CPU cycles are being wasted, even when 50% of the threads were
to go idle (as most workloads do vary). From the perspective of the
individual job it's not great, however, and they would do better with more
resources. Depending on what your priority and options are, raised "some"
numbers may or may not require action.
The memory file contains two lines:
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
The some line is the same as for cpu, the time in which at least one task
is stalled on the resource. In the case of memory, this includes waiting
on swap-in, page cache refaults and page reclaim.
The full line, however, indicates time in which *nobody* is using the CPU
productively due to pressure: all non-idle tasks are waiting for memory in
one form or another. Significant time spent in there is a good trigger
for killing things, moving jobs to other machines, or dropping incoming
requests, since neither the jobs nor the machine overall are making too
much headway.
The io file is similar to memory. Because the block layer doesn't have a
concept of hardware contention right now (how much longer is my IO request
taking due to other tasks?), it reports CPU potential lost on all IO
delays, not just the potential lost due to competition.
FAQ
Q: How is PSI's CPU component different from the load average?
A: There are several quirks in the load average that make it hard to
impossible to tell how overcommitted the CPU really is.
1. The load average is reported as a raw number of active tasks.
You need to know how many CPUs there are in the system, how many
CPUs the workload is allowed to use, then think about what the
proportion between load and the number of CPUs means for the
tasks trying to run.
PSI reports the percentage of wallclock time in which tasks are
waiting for a CPU to run on. It doesn't matter how many CPUs are
present or usable. The number always tells the quality of life
of tasks in the system or in a particular cgroup.
2. The shortest averaging window is 1m, which is extremely coarse,
and it's sampled in 5s intervals. A *lot* can happen on a CPU in
5 seconds. This *may* be able to identify persistent long-term
trends and very clear and obvious overloads, but it's unusable
for latency spikes and more subtle overutilization.
PSI's shortest window is 10s. It also exports the cumulative
stall times (in microseconds) of synchronously recorded events.
3. On Linux, the load average for historical reasons includes all
TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
busy the system is, but on the flipside it doesn't distinguish
whether tasks are likely to contend over the CPU or IO - which
obviously requires very different interventions from a sys admin
or a job scheduler.
PSI reports independent metrics for CPU and IO. You can tell
which resource is making the tasks wait, but in conjunction
still see how overloaded the system is overall.
Q: What's the cost / performance impact of this feature?
A: PSI's primary cost is in the scheduler, in particular task wakeups
and sleeps.
I benchmarked this code using Facebook's two most scheduling
sensitive workloads: memcache and webserver. They handle a ton of
small requests - lots of wakeups and sleeps with little actual work
in between - so they tend to be canaries for scheduler regressions.
In the tests, the boxes were handling live traffic over the course
of several hours. Half the machines, the control, ran with
CONFIG_PSI=n.
For memcache I used eight machines total. They're 2-socket, 14
core, 56 thread boxes. The test runs for half the test period,
flips the test and control kernels on the hardware to rule out HW
factors, DC location etc., then runs the other half of the test.
For the webservers, I used 32 machines total. They're single
socket, 16 core, 32 thread machines.
During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
the first half and nopsi=77.52% psi=78.25% in the second half, so PSI
added between 0.7 and 0.9 percentage points to the CPU load, a
difference of about 1%.
UPDATE: I re-ran this test with the v3 version of this patch set
and the CPU utilization was equivalent between test and control.
UPDATE: v4 is on par with v3.
As far as end-to-end request latency from the client perspective
goes, we don't sample those finely enough to capture the requests
going to those particular machines during the test, but we know the
p50 turnaround time in this workload is 54us, and perf bench sched
pipe on those machines shows nopsi=5.232666 us/op and psi=5.587347
us/op, so this doesn't add much here either.
The profile for the pipe benchmark shows:
0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change
0.83% perf.real [kernel.vmlinux] [k] psi_group_change
0.82% perf.real [kernel.vmlinux] [k] psi_task_change
0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change
The webserver load is running inside 4 nested cgroup levels. The
CPU load with both nopsi and psi kernels was indistinguishable at
81%.
For comparison, we had to disable the cgroup cpu controller on the
webservers because it added 4 percentage points to the CPU% during
this same exact test.
Versions of this accounting code now run on 80% of our fleet. None
of our workloads have reported regressions during the rollout.
Daniel Drake said:
: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable. I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.
Suren said:
: Backported to 4.9 and retested on ARMv8 8 core system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate an overflow/underflow issues. Nicely
: done!
This patch (of 9):
If we keep just enough refault information to match the *current* page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out all
the cache. Once cache comes back, we won't see those refaults. They
might not be actionable for LRU aging, but we want to know about them for
measuring memory pressure.
[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
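The commit message above shows the line format of the pressure files, for example `some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722`. A small sketch of parsing one such line with `sscanf` follows. The struct and function names are hypothetical; only the line layout is taken from the message.

```c
#include <assert.h>
#include <stdio.h>

struct sketch_psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* percent of walltime stalled */
	unsigned long long total_us;	/* absolute stall time, microseconds */
};

/* Parse one line of a pressure file; returns 0 on success, -1 otherwise. */
static int sketch_parse_psi_line(const char *line, struct sketch_psi_line *out)
{
	int n;

	assert(line && out);
	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us);
	return n == 5 ? 0 : -1;
}
```

A monitor could read `/proc/pressure/memory` line by line, pass each line to this function, and derive its own averages from the deltas of `total_us` between two samples when the 10s/1m/5m windows are too coarse.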
|
|
|
unsigned long pages;
|
2014-04-04 05:47:56 +08:00
|
|
|
|
2016-12-13 08:43:52 +08:00
|
|
|
nodes = list_lru_shrink_count(&shadow_nodes, sc);
|
2021-02-25 04:08:06 +08:00
|
|
|
if (!nodes)
	return SHRINK_EMPTY;
|
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
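A minimal sketch of the bit packing described in points 1 and 2 above. It is illustrative only: the struct, the helper names, the 7-bit count field and the 4-bit height field are assumptions made for this example, not the layout the patch uses.

```c
/* Hypothetical layout: a node has 64 slots, so a regular-entry count of
 * 0..64 needs 7 bits; shadow entries are counted in the bits above that. */
#define DEMO_COUNT_SHIFT  7u
#define DEMO_COUNT_MASK   ((1u << DEMO_COUNT_SHIFT) - 1u)

/* Hypothetical: height in the low 4 bits, offset in parent above it. */
#define DEMO_HEIGHT_SHIFT 4u
#define DEMO_HEIGHT_MASK  ((1u << DEMO_HEIGHT_SHIFT) - 1u)

struct demo_node {
	unsigned int count; /* low bits: regular entries, upper bits: shadows */
	unsigned int path;  /* low bits: height, upper bits: offset in parent */
};

static unsigned int demo_node_pages(const struct demo_node *n)
{
	return n->count & DEMO_COUNT_MASK;
}

static unsigned int demo_node_shadows(const struct demo_node *n)
{
	return n->count >> DEMO_COUNT_SHIFT;
}

static void demo_node_pages_inc(struct demo_node *n)   { n->count += 1u; }
static void demo_node_pages_dec(struct demo_node *n)   { n->count -= 1u; }
static void demo_node_shadows_inc(struct demo_node *n) { n->count += 1u << DEMO_COUNT_SHIFT; }

/* A node qualifies for the shadow node list once it holds shadow
 * entries and no regular entries. */
static int demo_node_is_shadow_only(const struct demo_node *n)
{
	return demo_node_shadows(n) > 0 && demo_node_pages(n) == 0;
}

static void demo_node_set_path(struct demo_node *n, unsigned int height,
			       unsigned int offset)
{
	n->path = (height & DEMO_HEIGHT_MASK) | (offset << DEMO_HEIGHT_SHIFT);
}

static unsigned int demo_node_height(const struct demo_node *n)
{
	return n->path & DEMO_HEIGHT_MASK;
}

static unsigned int demo_node_offset(const struct demo_node *n)
{
	return n->path >> DEMO_HEIGHT_SHIFT;
}
```

Both counters share one unsigned int, so the check for "only shadow entries left" is a mask and a shift on a single word, and the node does not grow.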
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
	/*
2017-11-25 03:24:59 +08:00
	 * Approximate a reasonable limit for the nodes
2016-12-13 08:43:58 +08:00
	 * containing shadow entries. We don't need to keep more
	 * shadow entries than possible pages on the active list,
	 * since refault distances bigger than that are dismissed.
	 *
	 * The size of the active list converges toward 100% of
	 * overall page cache as memory grows, with only a tiny
	 * inactive list. Assume the total cache size for that.
	 *
	 * Nodes might be sparsely populated, with only one shadow
	 * entry in the extreme case. Obviously, we cannot keep one
	 * node for every eligible shadow entry, so compromise on a
	 * worst-case density of 1/8th. Below that, not all eligible
	 * refaults can be detected anymore.
2014-04-04 05:47:56 +08:00
	 *
2017-11-25 03:24:59 +08:00
	 * On 64-bit with 7 xa_nodes per page and 64 slots
2014-04-04 05:47:56 +08:00
	 * each, this will reclaim shadow entries when they consume
2016-12-13 08:43:58 +08:00
	 * ~1.8% of available memory:
2014-04-04 05:47:56 +08:00
	 *
2017-11-25 03:24:59 +08:00
	 * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
2014-04-04 05:47:56 +08:00
	 */
|
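The arithmetic in the comment above can be worked through as a small sketch. At a worst-case density of 1/8th, each node is assumed to carry at least 64 / 8 = 8 shadow entries, so tracking one shadow entry per cache page needs at most cache_pages / 8 nodes. With 7 nodes per page, those nodes occupy cache_pages / 56 pages, which is 8 / (7 * 64), or about 1.8%. The function and macro names below are hypothetical and exist only for this illustration; the constants are the 64-bit figures quoted in the comment.

```c
#include <stdio.h>

/* Figures quoted in the comment for 64-bit; names are illustrative. */
#define DEMO_SLOTS_PER_NODE  64UL /* slots in one node */
#define DEMO_NODES_PER_PAGE   7UL /* nodes that fit in one page */
#define DEMO_DENSITY_DIVISOR  8UL /* worst-case density of 1/8th */

/* Upper bound on shadow nodes worth keeping for a given cache size. */
static unsigned long demo_shadow_node_limit(unsigned long cache_pages)
{
	return cache_pages / (DEMO_SLOTS_PER_NODE / DEMO_DENSITY_DIVISOR);
}

/* Fraction of memory those nodes consume at the limit:
 * PAGE_SIZE / nodes_per_page / slots * 8 / PAGE_SIZE = 8 / (7 * 64). */
static double demo_shadow_node_memory_fraction(void)
{
	return (double)DEMO_DENSITY_DIVISOR /
	       (double)(DEMO_NODES_PER_PAGE * DEMO_SLOTS_PER_NODE);
}

static void demo_print_budget(unsigned long cache_pages)
{
	printf("cache pages: %lu, node limit: %lu, memory share: %.2f%%\n",
	       cache_pages, demo_shadow_node_limit(cache_pages),
	       100.0 * demo_shadow_node_memory_fraction());
}
```

For example, a cache of 1048576 pages (4 GiB with 4 KiB pages) gives a limit of 131072 nodes under these assumptions.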
mm: workingset: don't drop refault information prematurely
Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
Overview
PSI reports the overall wallclock time in which the tasks in a system (or
cgroup) wait for (contended) hardware resources.
This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid; as well as anticipate major disruptions like OOM.
Real-world applications
We're using the data collected by PSI (and its previous incarnation,
memdelay) quite extensively at Facebook, and with several success stories.
One usecase is avoiding OOM hangs/livelocks. The reason these happen is
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables.
There is no situation where this ever makes sense in practice. We wrote a
<100 line POC python script to monitor memory pressure and kill stuff way
before such pathological thrashing leads to full system losses that would
require forcible hard resets.
We've since extended and deployed this code into other places to guarantee
latency and throughput SLAs, since they're usually violated way before the
kernel OOM killer would ever kick in.
It is available here: https://github.com/facebookincubator/oomd
Eventually we probably want to trigger the in-kernel OOM killer based on
extreme sustained pressure as well, so that Linux can avoid memory
livelocks - which technically aren't deadlocks, but are indistinguishable
from them to the user - out of the box. We'd continue using OOMD as
the first line of defense to ensure workload health and implement complex
kill policies that are beyond the scope of the kernel.
We also use PSI memory pressure for loadshedding. Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success. We switched it to PSI
and managed to anticipate and avoid OOM kills and lockups fairly reliably.
The reduction of OOM outages in the worker pool raised the pool's
aggregate productivity, and we were able to switch that service to smaller
machines.
Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as well as
to prevent multiple workloads on a machine from stepping on each other's
toes. We were not able to configure this properly without the pressure
metrics; we would see latency or bandwidth drops, but it would often be
hard to impossible to rootcause it post-mortem.
We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.
PSI has also received testing, feedback, and feature requests from Android
and EndlessOS for the purpose of low-latency OOM killing, to intervene in
pressure situations before the UI starts hanging.
How do you use this feature?
A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.
The cpu file contains one line:
some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
The averages give the percentage of walltime in which one or more tasks
are delayed on the runqueue while another task has the CPU. They're
recent averages over 10s, 1m, 5m windows, so you can tell short term
trends from long term ones, similarly to the load average.
The total= value gives the absolute stall time in microseconds. This
allows detecting latency spikes that might be too short to sway the
running averages. It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
future hardware).
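As a rough illustration of the format described above, the sketch below parses one pressure line and derives a custom average from two samples of the total= counter. The struct and function names are hypothetical; only the line layout (some/full, avg10, avg60, avg300, total in microseconds) is taken from the text.

```c
#include <stdio.h>

/* One parsed line of a pressure file; the type name is illustrative. */
struct demo_psi_line {
	char kind[8];                /* "some" or "full" */
	double avg10, avg60, avg300; /* percent of walltime stalled */
	unsigned long long total_us; /* cumulative stall time, microseconds */
};

/* Parse e.g. "some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722".
 * Returns 0 on success, -1 if the line does not match the layout. */
static int demo_psi_parse_line(const char *line, struct demo_psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total_us);
	return n == 5 ? 0 : -1;
}

/* Custom averaging: percentage of a sampling window spent stalled,
 * computed from two readings of total= taken window_us apart.
 * Returns -1.0 for an empty window or a counter that went backwards. */
static double demo_psi_window_percent(unsigned long long total_before_us,
				      unsigned long long total_after_us,
				      unsigned long long window_us)
{
	if (window_us == 0 || total_after_us < total_before_us)
		return -1.0;
	return 100.0 * (double)(total_after_us - total_before_us) /
	       (double)window_us;
}
```

Sampling total= twice, 2 seconds apart, and seeing it grow by 500000 microseconds would correspond to 25% pressure over that window, a resolution the fixed 10s/1m/5m averages cannot provide.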
What to make of this "some" metric? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable tasks
per CPU, the system is 100% overcommitted and the pressure average will
indicate as much. From a utilization perspective this is a great state of
course: no CPU cycles are being wasted, even if 50% of the threads were
to go idle (as most workloads do vary). From the perspective of the
individual job it's not great, however, and they would do better with more
resources. Depending on what your priority and options are, raised "some"
numbers may or may not require action.
The memory file contains two lines:
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
The some line is the same as for cpu, the time in which at least one task
is stalled on the resource. In the case of memory, this includes waiting
on swap-in, page cache refaults and page reclaim.
The full line, however, indicates time in which *nobody* is using the CPU
productively due to pressure: all non-idle tasks are waiting for memory in
one form or another. Significant time spent in there is a good trigger
for killing things, moving jobs to other machines, or dropping incoming
requests, since neither the jobs nor the machine overall are making too
much headway.
The io file is similar to memory. Because the block layer doesn't have a
concept of hardware contention right now (how much longer is my IO request
taking due to other tasks?), it reports CPU potential lost on all IO
delays, not just the potential lost due to competition.
FAQ
Q: How is PSI's CPU component different from the load average?
A: There are several quirks in the load average that make it hard to
impossible to tell how overcommitted the CPU really is.
1. The load average is reported as a raw number of active tasks.
You need to know how many CPUs there are in the system, how many
CPUs the workload is allowed to use, then think about what the
proportion between load and the number of CPUs means for the
tasks trying to run.
PSI reports the percentage of wallclock time in which tasks are
waiting for a CPU to run on. It doesn't matter how many CPUs are
present or usable. The number always tells the quality of life
of tasks in the system or in a particular cgroup.
2. The shortest averaging window is 1m, which is extremely coarse,
and it's sampled in 5s intervals. A *lot* can happen on a CPU in
5 seconds. This *may* be able to identify persistent long-term
trends and very clear and obvious overloads, but it's unusable
for latency spikes and more subtle overutilization.
PSI's shortest window is 10s. It also exports the cumulative
stall times (in microseconds) of synchronously recorded events.
3. On Linux, the load average for historical reasons includes all
TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
busy the system is, but on the flipside it doesn't distinguish
whether tasks are likely to contend over the CPU or IO - which
obviously requires very different interventions from a sys admin
or a job scheduler.
PSI reports independent metrics for CPU and IO. You can tell
which resource is making the tasks wait, but in conjunction
still see how overloaded the system is overall.
Q: What's the cost / performance impact of this feature?
A: PSI's primary cost is in the scheduler, in particular task wakeups
and sleeps.
I benchmarked this code using Facebook's two most scheduling
sensitive workloads: memcache and webserver. They handle a ton of
small requests - lots of wakeups and sleeps with little actual work
in between - so they tend to be canaries for scheduler regressions.
In the tests, the boxes were handling live traffic over the course
of several hours. Half the machines, the control, ran with
CONFIG_PSI=n.
For memcache I used eight machines total. They're 2-socket, 14
core, 56 thread boxes. The test runs for half the test period,
flips the test and control kernels on the hardware to rule out HW
factors, DC location etc., then runs the other half of the test.
For the webservers, I used 32 machines total. They're single
socket, 16 core, 32 thread machines.
During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
the first half and nopsi=77.52% psi=78.25% in the second half, so
PSI added between 0.7 and 0.9 percentage points to the CPU load, a
difference of about 1%.
UPDATE: I re-ran this test with the v3 version of this patch set
and the CPU utilization was equivalent between test and control.
UPDATE: v4 is on par with v3.
As far as end-to-end request latency from the client perspective
goes, we don't sample those finely enough to capture the requests
going to those particular machines during the test, but we know the
p50 turnaround time in this workload is 54us, and perf bench sched
pipe on those machines shows nopsi=5.232666 us/op and psi=5.587347
us/op, so this doesn't add much here either.
The profile for the pipe benchmark shows:
0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change
0.83% perf.real [kernel.vmlinux] [k] psi_group_change
0.82% perf.real [kernel.vmlinux] [k] psi_task_change
0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change
The webserver load is running inside 4 nested cgroup levels. The
CPU load with both nopsi and psi kernels was indistinguishable at
81%.
For comparison, we had to disable the cgroup cpu controller on the
webservers because it added 4 percentage points to the CPU% during
this same exact test.
Versions of this accounting code now run on 80% of our fleet. None
of our workloads have reported regressions during the rollout.
Daniel Drake said:
: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable. I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.
Suren said:
: Backported to 4.9 and retested on ARMv8 8 core system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate overflow/underflow issues. Nicely
: done!
This patch (of 9):
If we keep just enough refault information to match the *current* page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out all
the cache. Once cache comes back, we won't see those refaults. They
might not be actionable for LRU aging, but we want to know about them for
measuring memory pressure.
[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
#ifdef CONFIG_MEMCG
2016-12-13 08:43:58 +08:00
	if (sc->memcg) {
mm: workingset: don't drop refault information prematurely
Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
Overview
PSI reports the overall wallclock time in which the tasks in a system (or
cgroup) wait for (contended) hardware resources.
This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid; as well as anticipate major disruptions like OOM.
Real-world applications
We're using the data collected by PSI (and its previous incarnation,
memdelay) quite extensively at Facebook, and with several success stories.
One usecase is avoiding OOM hangs/livelocks. The reason these happen is
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables.
There is no situation where this ever makes sense in practice. We wrote a
<100 line POC python script to monitor memory pressure and kill stuff way
before such pathological thrashing leads to full system losses that would
require forcible hard resets.
We've since extended and deployed this code into other places to guarantee
latency and throughput SLAs, since they're usually violated way before the
kernel OOM killer would ever kick in.
It is available here: https://github.com/facebookincubator/oomd
Eventually we probably want to trigger the in-kernel OOM killer based on
extreme sustained pressure as well, so that Linux can avoid memory
livelocks - which technically aren't deadlocks, but to the user
indistinguishable from them - out of the box. We'd continue using OOMD as
the first line of defense to ensure workload health and implement complex
kill policies that are beyond the scope of the kernel.
We also use PSI memory pressure for loadshedding. Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success. We switched it to PSI
and managed to anticipate and avoid OOM kills and lockups fairly reliably.
The reduction of OOM outages in the worker pool raised the pool's
aggregate productivity, and we were able to switch that service to smaller
machines.
Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as well as
to prevent multiple workloads on a machine from stepping on each others'
toes. We were not able to configure this properly without the pressure
metrics; we would see latency or bandwidth drops, but it would often be
hard to impossible to rootcause it post-mortem.
We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.
PSI has also received testing, feedback, and feature requests from Android
and EndlessOS for the purpose of low-latency OOM killing, to intervene in
pressure situations before the UI starts hanging.
How do you use this feature?
A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.
The cpu file contains one line:
some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
The averages give the percentage of walltime in which one or more tasks
are delayed on the runqueue while another task has the CPU. They're
recent averages over 10s, 1m, 5m windows, so you can tell short term
trends from long term ones, similarly to the load average.
The total= value gives the absolute stall time in microseconds. This
allows detecting latency spikes that might be too short to sway the
running averages. It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
future hardware).
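As a rough illustration of the custom averaging mentioned above, the sketch below parses one pressure line and derives a stall percentage from two total= samples. The function names and the second sample (50 ms of extra stall over 2 seconds) are hypothetical and chosen only for the example; a real monitor would read /proc/pressure/cpu, or a cgroup's cpu.pressure file, twice.

```python
def parse_psi_line(line):
    """Split a line such as 'some avg10=... total=...' into (kind, fields)."""
    kind, *pairs = line.split()
    fields = {}
    for pair in pairs:
        key, value = pair.split("=")
        # total= is a microsecond counter; the avg* fields are percentages.
        fields[key] = int(value) if key == "total" else float(value)
    return kind, fields


def stall_percent(total_before_us, total_after_us, interval_s):
    """Percentage of the sampling interval spent stalled, from two total= reads."""
    stalled_us = total_after_us - total_before_us
    return 100.0 * stalled_us / (interval_s * 1_000_000)


kind, fields = parse_psi_line(
    "some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722"
)
# Hypothetical second sample: 2 seconds later, 50 ms more stall time.
custom_avg = stall_percent(fields["total"], fields["total"] + 50_000, interval_s=2)
```

Sampling total= at an interval of your own choosing gives a window of any length, which is the point of exporting the raw counter alongside the fixed averages.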
What to make of this "some" metric? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable tasks
per CPU, the system is 100% overcommitted and the pressure average will
indicate as much. From a utilization perspective this is a great state of
course: no CPU cycles are being wasted, even if 50% of the threads were
to go idle (as most workloads do vary). From the perspective of the
individual job it's not great, however, and they would do better with more
resources. Depending on what your priority and options are, raised "some"
numbers may or may not require action.
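The two cases described above can be mimicked with a toy model. This is only a sketch under a strong simplifying assumption (time is divided into equal ticks and a tick counts as stalled whenever more tasks are runnable than there are CPUs); it is not how the kernel does its accounting, and the function name is made up for the example.

```python
def some_pressure_percent(runnable_per_tick, ncpus):
    """Share of ticks in which at least one runnable task had to wait for a CPU."""
    stalled = sum(1 for runnable in runnable_per_tick if runnable > ncpus)
    return 100.0 * stalled / len(runnable_per_tick)


# 4 CPUs with one runnable task each: 100% utilized, nobody waiting.
fully_utilized = some_pressure_percent([4, 4, 4, 4], ncpus=4)

# 4 CPUs with two runnable tasks each: somebody is waiting in every tick.
overcommitted = some_pressure_percent([8, 8, 8, 8], ncpus=4)

# A single tick with one extra runnable task out of four ticks.
brief_contention = some_pressure_percent([4, 5, 4, 4], ncpus=4)
```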
The memory file contains two lines:
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
The some line is the same as for cpu, the time in which at least one task
is stalled on the resource. In the case of memory, this includes waiting
on swap-in, page cache refaults and page reclaim.
The full line, however, indicates time in which *nobody* is using the CPU
productively due to pressure: all non-idle tasks are waiting for memory in
one form or another. Significant time spent in there is a good trigger
for killing things, moving jobs to other machines, or dropping incoming
requests, since neither the jobs nor the machine overall are making too
much headway.
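A minimal sketch of using the full line as a trigger, under stated assumptions: the 10% threshold on avg10 is hypothetical and would need tuning per workload, and the helper names are invented for the example. The sample text is the memory file shown above.

```python
def parse_pressure(text):
    """Parse a multi-line pressure file into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        result[kind] = {
            key: float(value) for key, value in (pair.split("=") for pair in pairs)
        }
    return result


def should_shed_load(pressure, full_avg10_limit=10.0):
    """True when the recent share of fully stalled time reaches the limit."""
    return pressure["full"]["avg10"] >= full_avg10_limit


sample = """\
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
"""
pressure = parse_pressure(sample)
shed = should_shed_load(pressure)
```

What to do once the check fires (kill something, migrate a job, drop requests) is a policy decision that the source leaves to the operator.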
The io file is similar to memory. Because the block layer doesn't have a
concept of hardware contention right now (how much longer is my IO request
taking due to other tasks?), it reports CPU potential lost on all IO
delays, not just the potential lost due to competition.
FAQ
Q: How is PSI's CPU component different from the load average?
A: There are several quirks in the load average that make it hard to
impossible to tell how overcommitted the CPU really is.
1. The load average is reported as a raw number of active tasks.
You need to know how many CPUs there are in the system, how many
CPUs the workload is allowed to use, then think about what the
proportion between load and the number of CPUs means for the
tasks trying to run.
PSI reports the percentage of wallclock time in which tasks are
waiting for a CPU to run on. It doesn't matter how many CPUs are
present or usable. The number always tells the quality of life
of tasks in the system or in a particular cgroup.
2. The shortest load average window is 1m, which is extremely coarse,
and it's sampled in 5s intervals. A *lot* can happen on a CPU in
5 seconds. This *may* be able to identify persistent long-term
trends and very clear and obvious overloads, but it's unusable
for latency spikes and more subtle overutilization.
PSI's shortest window is 10s. It also exports the cumulative
stall times (in microseconds) of synchronously recorded events.
3. On Linux, the load average for historical reasons includes all
TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
busy the system is, but on the flipside it doesn't distinguish
whether tasks are likely to contend over the CPU or IO - which
obviously requires very different interventions from a sys admin
or a job scheduler.
PSI reports independent metrics for CPU and IO. You can tell
which resource is making the tasks wait, but in conjunction
still see how overloaded the system is overall.
Q: What's the cost / performance impact of this feature?
A: PSI's primary cost is in the scheduler, in particular task wakeups
and sleeps.
I benchmarked this code using Facebook's two most scheduling
sensitive workloads: memcache and webserver. They handle a ton of
small requests - lots of wakeups and sleeps with little actual work
in between - so they tend to be canaries for scheduler regressions.
In the tests, the boxes were handling live traffic over the course
of several hours. Half the machines, the control, ran with
CONFIG_PSI=n.
For memcache I used eight machines total. They're 2-socket, 14
core, 56 thread boxes. The test runs for half the test period,
flips the test and control kernels on the hardware to rule out HW
factors, DC location etc., then runs the other half of the test.
For the webservers, I used 32 machines total. They're single
socket, 16 core, 32 thread machines.
During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
the first half and nopsi=77.52% psi=78.25% in the second half, so PSI
added between 0.7 and 0.9 percentage points to the CPU load, a relative
difference of about 1%.
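The arithmetic behind these figures can be checked with a few lines. The numbers are the ones quoted above; the assumption here is that the second pair of readings belongs to the second half of the test, and the dictionary keys are just labels for the example.

```python
halves = {
    "first": {"nopsi": 78.05, "psi": 78.98},
    "second": {"nopsi": 77.52, "psi": 78.25},
}


def overhead(nopsi, psi):
    """Return (added percentage points, relative increase in percent)."""
    points = psi - nopsi
    relative = 100.0 * points / nopsi
    return round(points, 2), round(relative, 2)


results = {name: overhead(**load) for name, load in halves.items()}
```

This yields 0.93 and 0.73 percentage points, or roughly 1.19% and 0.94% relative to the control, which is where "about 1%" comes from.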
UPDATE: I re-ran this test with the v3 version of this patch set
and the CPU utilization was equivalent between test and control.
UPDATE: v4 is on par with v3.
As far as end-to-end request latency from the client perspective
goes, we don't sample those finely enough to capture the requests
going to those particular machines during the test, but we know the
p50 turnaround time in this workload is 54us, and perf bench sched
pipe on those machines shows nopsi=5.232666 us/op and psi=5.587347
us/op, so this doesn't add much here either.
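To put the benchmark numbers in proportion, the sketch below computes the per-operation difference and compares it with the p50 turnaround. Setting one pipe operation against one request is only a rough scale comparison, since a request may involve more than one wakeup; the constant names are made up for the example.

```python
NOPSI_US_PER_OP = 5.232666
PSI_US_PER_OP = 5.587347
P50_TURNAROUND_US = 54.0

# Extra time per pipe operation with PSI enabled.
added_us = PSI_US_PER_OP - NOPSI_US_PER_OP

# The same difference relative to the benchmark itself and to the p50 latency.
relative_to_benchmark = 100.0 * added_us / NOPSI_US_PER_OP
relative_to_p50 = 100.0 * added_us / P50_TURNAROUND_US
```

The difference is about 0.35 us per operation: close to 7% of the microbenchmark, but well under 1% of a 54 us request.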
The profile for the pipe benchmark shows:
0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change
0.83% perf.real [kernel.vmlinux] [k] psi_group_change
0.82% perf.real [kernel.vmlinux] [k] psi_task_change
0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change
The webserver load is running inside 4 nested cgroup levels. The
CPU load with both nopsi and psi kernels was indistinguishable at
81%.
For comparison, we had to disable the cgroup cpu controller on the
webservers because it added 4 percentage points to the CPU% during
this same exact test.
Versions of this accounting code now run on 80% of our fleet. None
of our workloads have reported regressions during the rollout.
Daniel Drake said:
: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable. I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.
Suren said:
: Backported to 4.9 and retested on ARMv8 8 core system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate overflow/underflow issues. Nicely
: done!
This patch (of 9):
If we keep just enough refault information to match the *current* page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out all
the cache. Once cache comes back, we won't see those refaults. They
might not be actionable for LRU aging, but we want to know about them for
measuring memory pressure.
[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
|
|
|
struct lruvec *lruvec;
|
2019-05-14 08:18:05 +08:00
|
|
|
int i;
|
2018-10-27 06:05:59 +08:00
|
|
|
|
2024-06-17 11:07:27 +08:00
|
|
|
mem_cgroup_flush_stats_ratelimited(sc->memcg);
|
2019-12-01 09:55:34 +08:00
|
|
|
lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
|
2019-05-14 08:18:05 +08:00
|
|
|
for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
|
mm: memcontrol: make cgroup stats and events query API explicitly local
Patch series "mm: memcontrol: memory.stat cost & correctness".
The cgroup memory.stat file holds recursive statistics for the entire
subtree. The current implementation does this tree walk on-demand
whenever the file is read. This is giving us problems in production.
1. The cost of aggregating the statistics on-demand is high. A lot of
system service cgroups are mostly idle and their stats don't change
between reads, yet we always have to check them. There are also always
some lazily-dying cgroups sitting around that are pinned by a handful
of remaining page cache; the same applies to them.
In an application that periodically monitors memory.stat in our
fleet, we have seen the aggregation consume up to 5% CPU time.
2. When cgroups die and disappear from the cgroup tree, so do their
accumulated vm events. The result is that the event counters at
higher-level cgroups can go backwards and confuse some of our
automation, let alone people looking at the graphs over time.
To address both issues, this patch series changes the stat
implementation to spill counts upwards when the counters change.
The upward spilling is batched using the existing per-cpu cache. In a
sparse file stress test with 5 level cgroup nesting, the additional cost
of the flushing was negligible (a little under 1% of CPU at 100% CPU
utilization, compared to the 5% of reading memory.stat during regular
operation).
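The idea of spilling counts upwards in batches can be sketched as follows. This is a single-CPU model with hypothetical names: the pending field stands in for the per-cpu cache, and the batch size of 32 is an assumption made for illustration, not a value taken from the text above.

```python
class Cgroup:
    BATCH = 32  # hypothetical batch size, for illustration only

    def __init__(self, parent=None):
        self.parent = parent
        self.pending = 0    # stands in for the per-cpu cache
        self.local = 0      # flushed changes charged to this group itself
        self.recursive = 0  # flushed changes of this group and its subtree

    def mod_state(self, delta):
        """Record a change; spill upwards only once the batch is exceeded."""
        self.pending += delta
        if abs(self.pending) > self.BATCH:
            self.flush()

    def flush(self):
        """Push the pending delta to this group and every ancestor."""
        delta, self.pending = self.pending, 0
        self.local += delta
        node = self
        while node is not None:
            node.recursive += delta
            node = node.parent


root = Cgroup()
child = Cgroup(parent=root)
leaf = Cgroup(parent=child)
for _ in range(100):
    leaf.mod_state(1)

before_flush = root.recursive  # may lag by up to one batch
leaf.flush()
after_flush = root.recursive
```

Reading the recursive count at any level is then a single field access instead of a tree walk, at the price of being stale by at most one batch per CPU.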
This patch (of 4):
memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
currently returning the state of the local memcg or lruvec, not the
recursive state.
In practice there is a demand for both versions, although the callers
that want the recursive counts currently sum them up by hand.
By default, cgroups are considered recursive entities and generally we
expect more users of the recursive counters, with the local counts being
special cases. To reflect that in the name, add a _local suffix to the
current implementations.
The following patch will re-incarnate these functions with recursive
semantics, but with an O(1) implementation.
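For contrast, summing the local counts by hand, as the callers mentioned above do, looks roughly like the walk below. The Group class and function names are hypothetical stand-ins, not kernel APIs.

```python
class Group:
    def __init__(self, local_pages, children=()):
        self.local_pages = local_pages
        self.children = list(children)


def page_state_local(group):
    """State of this group only, ignoring descendants."""
    return group.local_pages


def page_state_recursive_by_hand(group):
    """Recursive state obtained by walking the whole subtree."""
    return page_state_local(group) + sum(
        page_state_recursive_by_hand(child) for child in group.children
    )


tree = Group(10, [Group(5, [Group(1)]), Group(7)])
local_only = page_state_local(tree)
whole_subtree = page_state_recursive_by_hand(tree)
```

The cost of this walk grows with the size of the subtree on every query, which is the cost an O(1) recursive counter avoids.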
[hannes@cmpxchg.org: fix bisection hole]
Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-15 06:47:06 +08:00
|
|
|
pages += lruvec_page_state_local(lruvec,
|
|
|
|
NR_LRU_BASE + i);
|
2020-08-07 14:20:39 +08:00
|
|
|
pages += lruvec_page_state_local(
|
|
|
|
lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
|
|
|
|
pages += lruvec_page_state_local(
|
|
|
|
lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
|
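The code fragments above appear to add up the sizes of the LRU lists and two slab counters into a single page count. A minimal sketch of that arithmetic follows, assuming the LRU counters are in pages, the slab counters (note the _B suffix) are in bytes, and pages are 4 KiB; the function name and sample values are hypothetical.

```python
PAGE_SHIFT = 12  # assumes 4 KiB pages


def count_pages(lru_pages, slab_reclaimable_bytes, slab_unreclaimable_bytes):
    """Sum per-list LRU page counts and byte-based slab counters, in pages."""
    pages = sum(lru_pages)
    # Byte counters are converted to whole pages by shifting, which rounds down.
    pages += slab_reclaimable_bytes >> PAGE_SHIFT
    pages += slab_unreclaimable_bytes >> PAGE_SHIFT
    return pages


example = count_pages([100, 200, 300, 400, 50], 8192, 4096 * 3 + 100)
```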
mm: workingset: don't drop refault information prematurely
Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
Overview
PSI reports the overall wallclock time in which the tasks in a system (or
cgroup) wait for (contended) hardware resources.
This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid; as well as anticipate major disruptions like OOM.
Real-world applications
We're using the data collected by PSI (and its previous incarnation,
memdelay) quite extensively at Facebook, and with several success stories.
One usecase is avoiding OOM hangs/livelocks. The reason these happen is
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables.
There is no situation where this ever makes sense in practice. We wrote a
<100 line POC python script to monitor memory pressure and kill stuff way
before such pathological thrashing leads to full system losses that would
require forcible hard resets.
We've since extended and deployed this code into other places to guarantee
latency and throughput SLAs, since they're usually violated way before the
kernel OOM killer would ever kick in.
It is available here: https://github.com/facebookincubator/oomd
Eventually we probably want to trigger the in-kernel OOM killer based on
extreme sustained pressure as well, so that Linux can avoid memory
livelocks - which technically aren't deadlocks, but to the user
indistinguishable from them - out of the box. We'd continue using OOMD as
the first line of defense to ensure workload health and implement complex
kill policies that are beyond the scope of the kernel.
We also use PSI memory pressure for loadshedding. Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success. We switched it to PSI
and managed to anticipate and avoid OOM kills and lockups fairly reliably.
The reduction of OOM outages in the worker pool raised the pool's
aggregate productivity, and we were able to switch that service to smaller
machines.
Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as well as
to prevent multiple workloads on a machine from stepping on each others'
toes. We were not able to configure this properly without the pressure
metrics; we would see latency or bandwidth drops, but it would often be
hard to impossible to rootcause it post-mortem.
We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.
PSI has also received testing, feedback, and feature requests from Android
and EndlessOS for the purpose of low-latency OOM killing, to intervene in
pressure situations before the UI starts hanging.
How do you use this feature?
A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.
The cpu file contains one line:
some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
The averages give the percentage of walltime in which one or more tasks
are delayed on the runqueue while another task has the CPU. They're
recent averages over 10s, 1m, 5m windows, so you can tell short term
trends from long term ones, similarly to the load average.
The total= value gives the absolute stall time in microseconds. This
allows detecting latency spikes that might be too short to sway the
running averages. It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
future hardware).
What to make of this "some" metric? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable tasks
per CPU, the system is 100% overcommitted and the pressure average will
indicate as much. From a utilization perspective this is a great state of
course: no CPU cycles are being wasted, even when 50% of the threads were
to go idle (as most workloads do vary). From the perspective of the
individual job it's not great, however, and they would do better with more
resources. Depending on what your priority and options are, raised "some"
numbers may or may not require action.
The memory file contains two lines:
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
The some line is the same as for cpu, the time in which at least one task
is stalled on the resource. In the case of memory, this includes waiting
on swap-in, page cache refaults and page reclaim.
The full line, however, indicates time in which *nobody* is using the CPU
productively due to pressure: all non-idle tasks are waiting for memory in
one form or another. Significant time spent in there is a good trigger
for killing things, moving jobs to other machines, or dropping incoming
requests, since neither the jobs nor the machine overall are making too
much headway.
The io file is similar to memory. Because the block layer doesn't have a
concept of hardware contention right now (how much longer is my IO request
taking due to other tasks?), it reports CPU potential lost on all IO
delays, not just the potential lost due to competition.
FAQ
Q: How is PSI's CPU component different from the load average?
A: There are several quirks in the load average that make it hard to
impossible to tell how overcommitted the CPU really is.
1. The load average is reported as a raw number of active tasks.
You need to know how many CPUs there are in the system, how many
CPUs the workload is allowed to use, then think about what the
proportion between load and the number of CPUs mean for the
tasks trying to run.
PSI reports the percentage of wallclock time in which tasks are
waiting for a CPU to run on. It doesn't matter how many CPUs are
present or usable. The number always tells the quality of life
of tasks in the system or in a particular cgroup.
2. The shortest averaging window is 1m, which is extremely coarse,
and it's sampled in 5s intervals. A *lot* can happen on a CPU in
5 seconds. This *may* be able to identify persistent long-term
trends and very clear and obvious overloads, but it's unusable
for latency spikes and more subtle overutilization.
PSI's shortest window is 10s. It also exports the cumulative
stall times (in microseconds) of synchronously recorded events.
3. On Linux, the load average for historical reasons includes all
TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
busy the system is, but on the flipside it doesn't distinguish
whether tasks are likely to contend over the CPU or IO - which
obviously requires very different interventions from a sys admin
or a job scheduler.
PSI reports independent metrics for CPU and IO. You can tell
which resource is making the tasks wait, but in conjunction
still see how overloaded the system is overall.
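The averaging windows and the cumulative stall time described above can be consumed by a monitoring tool roughly as follows. This is a minimal sketch, assuming a pressure file prints lines of the form "some avg10=... avg60=... avg300=... total=..."; the names psi_line, psi_parse_line and psi_window_pct are hypothetical helpers for illustration, not kernel interfaces.

```c
#include <stdio.h>

/* One parsed line of a pressure file (assumed layout, see lead-in). */
struct psi_line {
	char kind[8];                /* "some" or "full" */
	double avg10, avg60, avg300; /* % of wallclock time stalled per window */
	unsigned long long total_us; /* cumulative stall time in microseconds */
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total_us);

	return n == 5 ? 0 : -1;
}

/*
 * Stall percentage over a caller-chosen window, derived from two samples
 * of the cumulative total. This is how a tool could get a window shorter
 * than the 10s average.
 */
static double psi_window_pct(unsigned long long total_then_us,
			     unsigned long long total_now_us,
			     unsigned long long window_us)
{
	if (window_us == 0 || total_now_us < total_then_us)
		return 0.0;
	return 100.0 * (double)(total_now_us - total_then_us) /
	       (double)window_us;
}
```

Sampling the total twice and dividing by the elapsed time gives a pressure figure for any window the tool chooses, independent of the number of CPUs.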
Q: What's the cost / performance impact of this feature?
A: PSI's primary cost is in the scheduler, in particular task wakeups
and sleeps.
I benchmarked this code using Facebook's two most scheduling
sensitive workloads: memcache and webserver. They handle a ton of
small requests - lots of wakeups and sleeps with little actual work
in between - so they tend to be canaries for scheduler regressions.
In the tests, the boxes were handling live traffic over the course
of several hours. Half the machines, the control, ran with
CONFIG_PSI=n.
For memcache I used eight machines total. They're 2-socket, 14
core, 56 thread boxes. The test runs for half the test period,
flips the test and control kernels on the hardware to rule out HW
factors, DC location etc., then runs the other half of the test.
For the webservers, I used 32 machines total. They're single
socket, 16 core, 32 thread machines.
During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
the first half and nopsi=77.52% psi=78.25% in the second half, so
PSI added between 0.7 and 0.9 percentage points to the CPU load, a
difference of about 1%.
UPDATE: I re-ran this test with the v3 version of this patch set
and the CPU utilization was equivalent between test and control.
UPDATE: v4 is on par with v3.
As far as end-to-end request latency from the client perspective
goes, we don't sample those finely enough to capture the requests
going to those particular machines during the test, but we know the
p50 turnaround time in this workload is 54us, and perf bench sched
pipe on those machines shows nopsi=5.232666 us/op and psi=5.587347
us/op, so this doesn't add much here either.
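The overhead figures quoted above can be re-derived from the raw numbers. The following is only an arithmetic sketch using the values stated in the text; the helper names pp_delta and rel_change_pct are made up for illustration.

```c
/* Difference between two utilization figures, in percentage points. */
static double pp_delta(double base_pct, double test_pct)
{
	return test_pct - base_pct;
}

/* Relative change of test against base, in percent of base. */
static double rel_change_pct(double base, double test)
{
	return 100.0 * (test - base) / base;
}

/* Values taken from the text above. */
static const double memcache_nopsi_1 = 78.05, memcache_psi_1 = 78.98;
static const double memcache_nopsi_2 = 77.52, memcache_psi_2 = 78.25;
static const double pipe_nopsi_us = 5.232666, pipe_psi_us = 5.587347;
static const double p50_turnaround_us = 54.0;

/* Extra time per pipe op, as a share of the p50 request turnaround. */
static double pipe_cost_vs_p50_pct(void)
{
	return 100.0 * (pipe_psi_us - pipe_nopsi_us) / p50_turnaround_us;
}
```

With these inputs the two memcache halves differ by roughly 0.9 and 0.7 percentage points, and the added cost per pipe operation is well under one percent of the 54us p50 turnaround.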
The profile for the pipe benchmark shows:
0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change
0.83% perf.real [kernel.vmlinux] [k] psi_group_change
0.82% perf.real [kernel.vmlinux] [k] psi_task_change
0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change
The webserver load is running inside 4 nested cgroup levels. The
CPU load with both nopsi and psi kernels was indistinguishable at
81%.
For comparison, we had to disable the cgroup cpu controller on the
webservers because it added 4 percentage points to the CPU% during
this same exact test.
Versions of this accounting code now run on 80% of our fleet. None
of our workloads have reported regressions during the rollout.
Daniel Drake said:
: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable. I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.
Suren said:
: Backported to 4.9 and retested on ARMv8 8 core system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate overflow/underflow issues. Nicely
: done!
This patch (of 9):
If we keep just enough refault information to match the *current* page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out all
the cache. Once cache comes back, we won't see those refaults. They
might not be actionable for LRU aging, but we want to know about them for
measuring memory pressure.
[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
pages = node_present_pages(sc->nid);
|
|
|
|
|
2018-10-29 02:35:40 +08:00
|
|
|
max_nodes = pages >> (XA_CHUNK_SHIFT - 3);
|
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
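Point 2 above, keeping a shadow counter in the unused upper bits of the entry count, can be sketched in plain C as follows. This is an illustration under assumptions, not the kernel's code: the struct, macro and function names are hypothetical, and 64 slots per node is taken from the text.

```c
/* Assumed geometry: 64 slots per node, as stated in the text. */
#define SKETCH_MAP_SHIFT   6
#define SKETCH_COUNT_SHIFT (SKETCH_MAP_SHIFT + 1)  /* 7 bits hold 0..64 */
#define SKETCH_COUNT_MASK  ((1U << SKETCH_COUNT_SHIFT) - 1)

/* Hypothetical stand-in for a tree node; only the packed counter is shown. */
struct node_sketch {
	unsigned int count; /* low bits: regular entries, upper bits: shadows */
};

static unsigned int node_pages(const struct node_sketch *n)
{
	return n->count & SKETCH_COUNT_MASK;
}

static unsigned int node_shadows(const struct node_sketch *n)
{
	return n->count >> SKETCH_COUNT_SHIFT;
}

static void node_pages_inc(struct node_sketch *n)
{
	n->count++;
}

/* Reclaim replaces a page pointer with a shadow entry in the same slot. */
static void node_page_to_shadow(struct node_sketch *n)
{
	n->count--;                           /* one regular entry less */
	n->count += 1U << SKETCH_COUNT_SHIFT; /* one shadow entry more */
}

/* A node holding only shadow entries is a candidate for the shadow LRU. */
static int node_is_shadow_only(const struct node_sketch *n)
{
	return node_pages(n) == 0 && node_shadows(n) > 0;
}
```

Because both counters live in one word, the check for "only shadow entries left" needs no extra field in the node.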
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
|
|
|
|
2016-12-13 08:43:52 +08:00
|
|
|
if (nodes <= max_nodes)
|
2014-04-04 05:47:56 +08:00
|
|
|
return 0;
|
2016-12-13 08:43:52 +08:00
|
|
|
return nodes - max_nodes;
|
2014-04-04 05:47:56 +08:00
|
|
|
}
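The threshold logic visible in the blamed lines above (max_nodes derived from the node's pages, then "nodes - max_nodes" as the excess) can be tried out in userspace with the sketch below. It is a standalone approximation: the function name is hypothetical, and XA_CHUNK_SHIFT is assumed to be 6 here, which may differ by kernel configuration.

```c
/* Assumption: 64-slot nodes, i.e. a chunk shift of 6. */
#define SKETCH_XA_CHUNK_SHIFT 6

/*
 * Number of shadow nodes above the allowed limit for a NUMA node with
 * the given number of present pages. Mirrors the shape of the shown
 * lines: the limit is pages >> (shift - 3), i.e. pages / 8 for shift 6.
 */
static unsigned long shadow_nodes_over_limit(unsigned long nodes,
					     unsigned long pages)
{
	unsigned long max_nodes = pages >> (SKETCH_XA_CHUNK_SHIFT - 3);

	if (nodes <= max_nodes)
		return 0;
	return nodes - max_nodes;
}
```

Only the excess over the limit is reported, so the shrinker has nothing to do until shadow nodes exceed the allowance for that memory node.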
|
|
|
|
|
|
|
|
static enum lru_status shadow_lru_isolate(struct list_head *item,
|
2015-02-13 06:59:35 +08:00
|
|
|
struct list_lru_one *lru,
|
2014-04-04 05:47:56 +08:00
|
|
|
spinlock_t *lru_lock,
|
2017-11-25 03:24:59 +08:00
|
|
|
void *arg) __must_hold(lru_lock)
|
2014-04-04 05:47:56 +08:00
|
|
|
{
|
2017-11-25 03:24:59 +08:00
|
|
|
struct xa_node *node = container_of(item, struct xa_node, private_list);
|
2014-04-04 05:47:56 +08:00
|
|
|
struct address_space *mapping;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/*
|
2020-08-18 21:05:56 +08:00
|
|
|
* Page cache insertions and deletions synchronously maintain
|
2018-04-11 07:36:56 +08:00
|
|
|
* the shadow node LRU under the i_pages lock and the
|
2014-04-04 05:47:56 +08:00
|
|
|
* lru_lock. Because the page cache tree is emptied before
|
|
|
|
* the inode can be destroyed, holding the lru_lock pins any
|
2017-11-25 03:24:59 +08:00
|
|
|
* address_space that has nodes on the LRU.
|
2014-04-04 05:47:56 +08:00
|
|
|
*
|
2018-04-11 07:36:56 +08:00
|
|
|
* We can then safely transition to the i_pages lock to
|
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
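
As a rough illustration of point 2, the sketch below packs a regular-entry count and a shadow-entry count into one unsigned int. The 7-bit field width, the macro names, and the node_* helpers are assumptions made for this example only and are not taken from the kernel source.

```c
#include <assert.h>

/* Assumed layout: the low 7 bits hold the regular-entry count (0..64),
 * the remaining upper bits hold the shadow-entry count. */
#define NODE_COUNT_SHIFT 7u
#define NODE_COUNT_MASK  ((1u << NODE_COUNT_SHIFT) - 1u)
#define NODE_MAX_SLOTS   64u

/* Number of regular (page) entries stored in the packed word. */
static inline unsigned int node_regular(unsigned int packed)
{
	return packed & NODE_COUNT_MASK;
}

/* Number of shadow entries stored in the packed word. */
static inline unsigned int node_shadows(unsigned int packed)
{
	return packed >> NODE_COUNT_SHIFT;
}

/* Account for one new regular entry. */
static inline unsigned int node_add_regular(unsigned int packed)
{
	assert(node_regular(packed) + node_shadows(packed) < NODE_MAX_SLOTS);
	return packed + 1u;
}

/* Account for one new shadow entry. */
static inline unsigned int node_add_shadow(unsigned int packed)
{
	assert(node_regular(packed) + node_shadows(packed) < NODE_MAX_SLOTS);
	return packed + (1u << NODE_COUNT_SHIFT);
}

/* Reclaim replaced a page with a shadow entry: move one unit across. */
static inline unsigned int node_regular_to_shadow(unsigned int packed)
{
	assert(node_regular(packed) > 0);
	return packed - 1u + (1u << NODE_COUNT_SHIFT);
}

/* True when the node holds shadow entries and nothing else, which is
 * the condition for moving it to the shadow node list. */
static inline int node_is_shadow_only(unsigned int packed)
{
	return node_regular(packed) == 0 && node_shadows(packed) > 0;
}
```

Both counters live in storage the node already had, so the shadow-only check costs a mask and a shift rather than a walk over all 64 slots.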
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
|
|
|
* pin only the address_space of the particular node we want
|
|
|
|
* to reclaim, take the node off-LRU, and drop the lru_lock.
|
|
|
|
*/
|
|
|
|
|
2017-11-09 22:23:56 +08:00
|
|
|
mapping = container_of(node->array, struct address_space, i_pages);
|
2014-04-04 05:47:56 +08:00
|
|
|
|
|
|
|
/* Coming from the list, invert the lock order */
|
2018-04-11 07:36:56 +08:00
|
|
|
if (!xa_trylock(&mapping->i_pages)) {
|
2018-08-18 06:46:08 +08:00
|
|
|
spin_unlock_irq(lru_lock);
|
2014-04-04 05:47:56 +08:00
|
|
|
ret = LRU_RETRY;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2023-01-18 20:13:03 +08:00
|
|
|
/* For page cache we need to hold i_lock */
|
|
|
|
if (mapping->host != NULL) {
|
|
|
|
if (!spin_trylock(&mapping->host->i_lock)) {
|
|
|
|
xa_unlock(&mapping->i_pages);
|
|
|
|
spin_unlock_irq(lru_lock);
|
|
|
|
ret = LRU_RETRY;
|
|
|
|
goto out;
|
|
|
|
}
|
vfs: keep inodes with page cache off the inode shrinker LRU
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This caused
problems on highmem hosts: struct inode could fill lowmem zones
before the cache was getting reclaimed in the highmem zones.
To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree. It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.
Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults. We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior. We also use it to quantify and report workload health through
psi. The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.
The consequence of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.
To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
attached pages"")).
The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work. This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.
Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An otherwise
clean and unused inode then gets queued when the last cache entry
disappears. This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.
Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.
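
The shrinker code in this listing comes at the locks from the list side, so it uses trylock and retries instead of blocking. A user-space sketch of that pattern with POSIX mutexes might look like the following. The names struct owner, enum walk_status, and isolate_one are hypothetical, and pthread mutexes stand in for the kernel's spinlocks.

```c
#include <pthread.h>
#include <sched.h>

/* Hypothetical result codes, loosely modelled on a list walk callback. */
enum walk_status { WALK_REMOVED, WALK_RETRY };

/* Hypothetical object whose lock normally nests OUTSIDE the list lock. */
struct owner {
	pthread_mutex_t lock;
	int nr_items;
};

/*
 * Called with list_lock held and returns with list_lock held.
 * The usual order is owner->lock first, then list_lock. Coming from the
 * list we hold them the other way round, so blocking on owner->lock
 * could deadlock. Try it instead, and back off on contention.
 */
static enum walk_status isolate_one(pthread_mutex_t *list_lock,
				    struct owner *o)
{
	enum walk_status ret;

	if (pthread_mutex_trylock(&o->lock) != 0) {
		/* Contended: drop the list lock and ask the walker to retry. */
		pthread_mutex_unlock(list_lock);
		ret = WALK_RETRY;
		goto out;
	}

	/* Both locks held: the item would be unlinked from the list here. */
	pthread_mutex_unlock(list_lock);

	/* Only the owner lock is held for the actual teardown. */
	o->nr_items--;
	pthread_mutex_unlock(&o->lock);
	ret = WALK_REMOVED;
out:
	sched_yield();			/* give other threads a chance to run */
	pthread_mutex_lock(list_lock);	/* caller expects the list lock back */
	return ret;
}
```

Returning a retry status rather than spinning on the second lock keeps the caller in charge of when to restart the walk.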
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 10:31:24 +08:00
|
|
|
}
|
|
|
|
|
2015-02-13 06:59:35 +08:00
|
|
|
list_lru_isolate(lru, item);
|
2020-12-15 11:07:04 +08:00
|
|
|
__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
|
2018-10-27 06:06:39 +08:00
|
|
|
|
2014-04-04 05:47:56 +08:00
|
|
|
spin_unlock(lru_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The nodes should only contain one or more shadow entries,
|
|
|
|
* no pages, so we expect to be able to remove them all and
|
|
|
|
* delete and free the empty node afterwards.
|
|
|
|
*/
|
2017-11-09 22:23:56 +08:00
|
|
|
if (WARN_ON_ONCE(!node->nr_values))
|
2016-12-13 08:43:38 +08:00
|
|
|
goto out_invalid;
|
2017-11-09 22:23:56 +08:00
|
|
|
if (WARN_ON_ONCE(node->count != node->nr_values))
|
2016-12-13 08:43:38 +08:00
|
|
|
goto out_invalid;
|
2020-08-18 21:05:56 +08:00
|
|
|
xa_delete_node(node, workingset_update_node);
|
2020-12-15 11:07:04 +08:00
|
|
|
__inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
|
2014-04-04 05:47:56 +08:00
|
|
|
|
2016-12-13 08:43:38 +08:00
|
|
|
out_invalid:
|
2018-08-18 06:46:08 +08:00
|
|
|
xa_unlock_irq(&mapping->i_pages);
|
2023-01-18 20:13:03 +08:00
|
|
|
if (mapping->host != NULL) {
|
|
|
|
if (mapping_shrinkable(mapping))
|
|
|
|
inode_add_lru(mapping->host);
|
|
|
|
spin_unlock(&mapping->host->i_lock);
|
|
|
|
}
|
2014-04-04 05:47:56 +08:00
|
|
|
ret = LRU_REMOVED_RETRY;
|
|
|
|
out:
|
|
|
|
cond_resched();
|
2018-08-18 06:46:08 +08:00
|
|
|
spin_lock_irq(lru_lock);
|
2014-04-04 05:47:56 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
|
|
|
|
struct shrink_control *sc)
|
|
|
|
{
|
2018-04-11 07:36:56 +08:00
|
|
|
/* list_lru lock nests inside the IRQ-safe i_pages lock */
|
2018-08-18 06:49:55 +08:00
|
|
|
return list_lru_shrink_walk_irq(&shadow_nodes, sc, shadow_lru_isolate,
|
|
|
|
NULL);
|
2014-04-04 05:47:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct shrinker workingset_shadow_shrinker = {
|
|
|
|
.count_objects = count_shadow_nodes,
|
|
|
|
.scan_objects = scan_shadow_nodes,
|
mm: zero-seek shrinkers
The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.
Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.
This is a bogus setting. A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.
In most cases, the behavior in practice is still fine. Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems. But there are scenarios where
that's not true.
Our database workloads suffer from two of those. For one, their file
workingset is several times bigger than available memory, which has the
kernel aggressively create shadow page cache entries for the non-resident
parts of it. The workingset code does tell the VM that most of these are
expendable, but the VM ends up balancing them 2:1 to cache pages as per
the seeks setting. This is a huge waste of memory.
These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.
This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches. This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.
It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.
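
A minimal user-space sketch of how such a setting could feed into the scan target, assuming the shrinker core scales the number of freeable objects by the reclaim priority and divides by seeks. The helper name sketch_scan_target() and the standalone types are hypothetical and not kernel API; the constants are illustrative only.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel API: how many of `freeable` objects a
 * shrinker would be asked to scan at a given reclaim priority.
 * A lower priority value means higher memory pressure.
 * Assumes 0 <= priority < 64.
 */
static uint64_t sketch_scan_target(uint64_t freeable, int priority,
                                   unsigned int seeks)
{
    uint64_t delta;

    if (seeks) {
        /* IO-backed cache: pressure scales with priority and 1/seeks. */
        delta = freeable >> priority;
        delta *= 4;
        delta /= seeks;
    } else {
        /*
         * Zero-seek cache: objects cost no IO to recreate, so trim
         * half of them per pass regardless of the priority.
         */
        delta = freeable / 2;
    }
    return delta;
}
```

With 4096 freeable objects at priority 12, a seeks value of 2 yields a target of 2 objects in this sketch, while seeks = 0 yields 2048.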
Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Domas Mituzas <dmituzas@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:42 +08:00
|
|
|
.seeks = 0, /* ->count reports only fully expendable nodes */
|
2016-03-18 05:18:42 +08:00
|
|
|
.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
|
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
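
Points 1 and 2 above can be sketched as plain bit packing. This is an illustration under stated assumptions, not the actual radix tree node layout: the struct, the sk_* helpers, and the 4-bit height field are all hypothetical; only the 64-entry maximum comes from the text.

```c
/* 64 slots per node, as stated in the text. */
#define SK_MAP_SHIFT     6
/* Assumption: 4 low bits are enough for the node height. */
#define SK_HEIGHT_SHIFT  4
#define SK_HEIGHT_MASK   ((1u << SK_HEIGHT_SHIFT) - 1)
/* 7 low bits hold a regular entry count of 0..64. */
#define SK_COUNT_SHIFT   (SK_MAP_SHIFT + 1)
#define SK_COUNT_MASK    ((1u << SK_COUNT_SHIFT) - 1)

struct sk_node {
    unsigned int path;   /* low bits: height, upper bits: offset in parent */
    unsigned int count;  /* low bits: regular entries, upper bits: shadows */
};

static void sk_set_path(struct sk_node *n, unsigned int height,
                        unsigned int offset)
{
    n->path = (offset << SK_HEIGHT_SHIFT) | (height & SK_HEIGHT_MASK);
}

static unsigned int sk_height(const struct sk_node *n)
{
    return n->path & SK_HEIGHT_MASK;
}

static unsigned int sk_offset(const struct sk_node *n)
{
    return n->path >> SK_HEIGHT_SHIFT;
}

static void sk_shadow_inc(struct sk_node *n)
{
    n->count += 1u << SK_COUNT_SHIFT;
}

static unsigned int sk_regular(const struct sk_node *n)
{
    return n->count & SK_COUNT_MASK;
}

static unsigned int sk_shadows(const struct sk_node *n)
{
    return n->count >> SK_COUNT_SHIFT;
}

/* A node qualifies for the shadow LRU once it holds only shadow entries. */
static int sk_only_shadows(const struct sk_node *n)
{
    return sk_shadows(n) > 0 && sk_regular(n) == 0;
}
```

Because both counters share one word, a single comparison of the full count against zero still tells whether the node is completely empty.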
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Our list_lru->lock is IRQ-safe as it nests inside the IRQ-safe
|
2018-04-11 07:36:56 +08:00
|
|
|
* i_pages lock.
|
mm: keep page cache radix tree nodes in check
2014-04-04 05:47:56 +08:00
|
|
|
*/
|
|
|
|
static struct lock_class_key shadow_nodes_key;
|
|
|
|
|
|
|
|
static int __init workingset_init(void)
|
|
|
|
{
|
mm: workingset: eviction buckets for bigmem/lowbit machines
For per-cgroup thrash detection, we need to store the memcg ID inside
the radix tree cookie as well. However, on 32 bit that doesn't leave
enough bits for the eviction timestamp to cover the necessary range of
recently evicted pages. The radix tree entry would look like this:
[ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]
12 bits means 4096 pages, means 16M worth of recently evicted pages.
But refaults are actionable up to distances covering half of memory. To
not miss refaults, we have to stretch out the range at the cost of how
precisely we can tell when a page was evicted. This way we can shave
off lower bits from the eviction timestamp until the necessary range is
covered. E.g. grouping evictions into 1M buckets (256 pages) will
stretch the longest representable refault distance to 4G.
This patch implements eviction buckets that are automatically sized
according to the available bits and the necessary refault range, in
preparation for per-cgroup thrash detection.
The maximum actionable distance is currently half of memory, but to
support memory hotplug of up to 200% of boot-time memory, we size the
buckets to cover double the distance. Beyond that, thrashing won't be
detectable anymore.
During boot, the kernel will print out the exact parameters, like so:
[ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6
In this example, there are 12 radix entry bits available for the
eviction timestamp, to cover a maximum distance of 2^18 pages (this is a
1G machine). Consequently, evictions must be grouped into buckets of
2^6 pages, or 256K.
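
The arithmetic in this example can be reproduced with a small user-space sketch. sk_fls_long() is a portable stand-in written for this illustration (it is not the kernel's fls_long()), and sk_bucket_order() is a hypothetical helper; both assume a total page count of at least 1.

```c
/*
 * Stand-in for a "find last set" helper: returns the 1-based index of
 * the highest set bit, or 0 when x is 0.
 */
static unsigned int sk_fls_long(unsigned long x)
{
    unsigned int bits = 0;

    while (x) {
        bits++;
        x >>= 1;
    }
    return bits;
}

/*
 * Hypothetical helper: number of low timestamp bits to drop so that
 * `timestamp_bits` bits can still cover `total_pages` pages.
 * Assumes total_pages >= 1.
 */
static unsigned int sk_bucket_order(unsigned long total_pages,
                                    unsigned int timestamp_bits)
{
    unsigned int max_order = sk_fls_long(total_pages - 1);

    if (max_order > timestamp_bits)
        return max_order - timestamp_bits;
    return 0;
}
```

For the boot message above, sk_bucket_order(1UL << 18, 12) gives 6, i.e. buckets of 2^6 pages. Covering 2^20 pages (4G with 4K pages) with the same 12 bits gives 8, which matches the 256-page buckets mentioned earlier.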
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
|
|
|
unsigned int max_order;
|
mm: keep page cache radix tree nodes in check
2014-04-04 05:47:56 +08:00
|
|
|
int ret;
|
|
|
|
|
mm: workingset: eviction buckets for bigmem/lowbit machines
2016-03-16 05:57:13 +08:00
|
|
|
BUILD_BUG_ON(BITS_PER_LONG < EVICTION_SHIFT);
|
|
|
|
/*
|
|
|
|
* Calculate the eviction bucket size to cover the longest
|
|
|
|
* actionable refault distance, which is currently half of
|
|
|
|
* memory (totalram_pages/2). However, memory hotplug may add
|
|
|
|
* some more pages at runtime, so keep working with up to
|
|
|
|
* double the initial memory by using totalram_pages as-is.
|
|
|
|
*/
|
2018-12-28 16:34:29 +08:00
|
|
|
max_order = fls_long(totalram_pages() - 1);
|
2023-10-08 10:37:17 +08:00
|
|
|
if (max_order > EVICTION_BITS)
|
|
|
|
bucket_order = max_order - EVICTION_BITS;
|
2016-07-15 03:07:41 +08:00
|
|
|
pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
|
2023-10-08 10:37:17 +08:00
|
|
|
EVICTION_BITS, max_order, bucket_order);
|
workingset, lru_gen: apply refault-distance based protection
Upstream: pending
I noticed MGLRU not working very well on certain workflows, as
observed on some heavily stressed databases. This happens when the file
page workingset size exceeds total memory, and the access distance
(the left-shift time of a page before it gets activated, considering
the LRU starts from the right) of file pages is also larger than total
memory. All file pages are stuck on the oldest generation and keep
getting read in and then evicted repeatedly. Despite anon pages being
idle, they never get aged. The PID controller doesn't kick in until
there are some minor access pattern changes, and file pages are not
promoted or reused.
Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.
So apply it for MGLRU as well. The updated refault-distance model
fits well for MGLRU in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
Some adjustments are made to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
of MGLRU:
- If a tier-0 page has a qualified refault-distance, just promote
it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault-distance, mark it as
active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
refault-distance and increase the PID controlled refault rate
of the updated tier, in the hope that similar pages will be protected
next time upon eviction.
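
The three rules above can be read as a small decision function. The following is a toy model of the rules as stated in this message only, not the patch's code: the enum, the struct, and sk_on_refault() are hypothetical names, and how a refault distance "qualifies" is left to the caller.

```c
#include <stdbool.h>

/* Hypothetical labels for where a refaulting page is placed. */
enum sk_target_gen {
    SK_GEN_UNCHANGED,
    SK_GEN_SECOND_OLDEST,
    SK_GEN_YOUNGEST,
};

struct sk_decision {
    enum sk_target_gen gen;  /* generation the page is sent to */
    bool mark_active;        /* whether the page is marked active */
    bool bump_refs;          /* whether the reference count and the
                              * tier's refault rate are increased */
};

/* Toy model: decide what happens to a refaulting page of a given tier. */
static struct sk_decision sk_on_refault(unsigned int tier,
                                        bool distance_qualifies)
{
    struct sk_decision d = { SK_GEN_UNCHANGED, false, false };

    if (!distance_qualifies)
        return d;

    /* Every page with a qualified refault distance gets the bump. */
    d.bump_refs = true;

    if (tier == 0) {
        /* Tier 0: promote to a higher tier, second oldest gen. */
        d.gen = SK_GEN_SECOND_OLDEST;
    } else {
        /* Tier >= 1: mark active, send to the youngest gen. */
        d.gen = SK_GEN_YOUNGEST;
        d.mark_active = true;
    }
    return d;
}
```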
NOTE: This also changes the meaning of the workingset_* fields in
/proc/vmstat: workingset_activate_* now stands for the pages
reactivated or promoted by refault distance checking, and
workingset_restore_* now stands for all pages promoted for
any reason.
The following benchmark showed a 5x improvement. To simulate the optimized
workflow, I set up a 3-replica MongoDB cluster, each replica in a different
cgroup, using 5G of WiredTiger cache and 10G of oplog, on a 32G VM with
no limit set. The benchmark is done using
https://github.com/apavlo/py-tpcc.git, modified to run the STOCK_LEVEL
query only, to simulate slow queries and get a stable result.
The test is done on an EPYC 7K62 with 32G RAM and a SATA SSD:
- Before (with ZRAM enabled, the result won't change whether
any kind of swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 577 27584645283.7 0.02 txn/s
------------------------------------------------------------------
TOTAL 577 27584645283.7 0.02 txn/s
$ cat /proc/vmstat | grep workingset
workingset_nodes 47860
workingset_refault_anon 0
workingset_refault_file 23498953
workingset_activate_anon 0
workingset_activate_file 23487840
workingset_restore_anon 0
workingset_restore_file 18553646
workingset_nodereclaim 768
$ free -m
total used free shared buff/cache available
Mem: 31849 6829 790 23 24229 24542
Swap: 31848 0 31848
- Patched: (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
Executed Time (µs) Rate
STOCK_LEVEL 2542 27121571486.2 0.09 txn/s
------------------------------------------------------------------
TOTAL 2542 27121571486.2 0.09 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 70358
workingset_refault_anon 16853
workingset_refault_file 22693601
workingset_activate_anon 10099
workingset_activate_file 8565519
workingset_restore_anon 10127
workingset_restore_file 8566053
workingset_nodereclaim 9801
$ free -m
total used free shared buff/cache available
Mem: 31849 7093 283 4 24472 24289
Swap: 31848 1652 30196
The performance is 5x better than before, and the idle anon pages
can now get swapped out as expected. Testing with lower stress also
shows an improvement.
There is no regression on other tests so far, and a performance gain
is observed on file page heavy tasks.
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
|
|
|
#ifdef CONFIG_LRU_GEN
|
|
|
|
if (max_order > LRU_GEN_EVICTION_BITS)
|
|
|
|
lru_gen_bucket_order = max_order - LRU_GEN_EVICTION_BITS;
|
|
|
|
pr_info("workingset: lru_gen_timestamp_bits=%d lru_gen_bucket_order=%u\n",
|
|
|
|
LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
|
|
|
|
#endif
|
mm: workingset: eviction buckets for bigmem/lowbit machines
2016-03-16 05:57:13 +08:00
|
|
|
|
2022-06-01 11:22:24 +08:00
|
|
|
ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
|
mm: keep page cache radix tree nodes in check
2014-04-04 05:47:56 +08:00
|
|
|
if (ret)
|
|
|
|
goto err;
|
2018-08-18 06:47:50 +08:00
|
|
|
ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key,
|
|
|
|
&workingset_shadow_shrinker);
|
mm: keep page cache radix tree nodes in check
2014-04-04 05:47:56 +08:00
|
|
|
if (ret)
|
|
|
|
goto err_list_lru;
|
2018-08-18 06:47:41 +08:00
|
|
|
register_shrinker_prepared(&workingset_shadow_shrinker);
|
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
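The bit packing described in points 1 and 2 can be illustrated with a small standalone sketch. This is not the kernel's code: `struct demo_node`, the `node_*` helpers, and the 4-bit height field are hypothetical names and sizes chosen for illustration. Only the 64-entry limit comes from the text above. The sketch assumes the low bits of `count` hold page entries and the upper bits hold shadow entries.

```c
#include <assert.h>

/* Assumption: 64 slots per node, as stated in the commit message. */
#define NODE_MAP_SHIFT 6u
#define NODE_MAP_SIZE  (1u << NODE_MAP_SHIFT)
/* The value 64 itself needs 7 bits, so the low field is one bit wider. */
#define COUNT_SHIFT    (NODE_MAP_SHIFT + 1u)
#define COUNT_MASK     ((1u << COUNT_SHIFT) - 1u)
/* Assumption for this sketch only: the tree height always fits in 4 bits. */
#define HEIGHT_SHIFT   4u
#define HEIGHT_MASK    ((1u << HEIGHT_SHIFT) - 1u)

/* Hypothetical stand-in for the two packed members of a tree node. */
struct demo_node {
    unsigned int path;  /* low bits: height, upper bits: offset in parent */
    unsigned int count; /* low bits: page entries, upper bits: shadow entries */
};

static void node_set_path(struct demo_node *n, unsigned int height,
                          unsigned int offset)
{
    assert(height <= HEIGHT_MASK && offset < NODE_MAP_SIZE);
    n->path = height | (offset << HEIGHT_SHIFT);
}

static unsigned int node_height(const struct demo_node *n)
{
    return n->path & HEIGHT_MASK;
}

static unsigned int node_offset(const struct demo_node *n)
{
    return n->path >> HEIGHT_SHIFT;
}

static unsigned int node_pages(const struct demo_node *n)
{
    return n->count & COUNT_MASK;
}

static unsigned int node_shadows(const struct demo_node *n)
{
    return n->count >> COUNT_SHIFT;
}

/* A page is inserted into an empty slot. */
static void node_add_page(struct demo_node *n)
{
    assert(node_pages(n) + node_shadows(n) < NODE_MAP_SIZE);
    n->count++;
}

/* Reclaim replaces a page with a shadow entry in the same slot. */
static void node_evict_page(struct demo_node *n)
{
    assert(node_pages(n) > 0);
    n->count--;
    n->count += 1u << COUNT_SHIFT;
}

/* True when the node holds only shadow entries and could go on the LRU. */
static int node_is_shadow_only(const struct demo_node *n)
{
    return node_pages(n) == 0 && node_shadows(n) > 0;
}
```

Both counters share one `unsigned int`, so checking whether a node has become shadow-only takes a single load plus a mask and a shift, and the node does not grow.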
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
	return 0;
err_list_lru:
2018-08-18 06:47:41 +08:00
	free_prealloced_shrinker(&workingset_shadow_shrinker);
2014-04-04 05:47:56 +08:00
err:
	return ret;
}
module_init(workingset_init);