OpenCloudOS-Kernel/mm/workingset.c

978 lines
32 KiB
C
Raw Normal View History

License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
/*
* Workingset detection
*
* Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
*/
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
#include <linux/writeback.h>
#include <linux/shmem_fs.h>
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
#include <linux/pagemap.h>
#include <linux/atomic.h>
#include <linux/module.h>
#include <linux/swap.h>
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
#include <linux/dax.h>
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
#include <linux/fs.h>
#include <linux/mm.h>
/*
* Double CLOCK lists
*
* Per node, two clock lists are maintained for file pages: the
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
* inactive and the active list. Freshly faulted pages start out at
* the head of the inactive list and page reclaim scans pages from the
* tail. Pages that are accessed multiple times on the inactive list
* are promoted to the active list, to protect them from reclaim,
* whereas active pages are demoted to the inactive list when the
* active list grows too big.
*
* fault ------------------------+
* |
* +--------------+ | +-------------+
* reclaim <- | inactive | <-+-- demotion | active | <--+
* +--------------+ +-------------+ |
* | |
* +-------------- promotion ------------------+
*
*
* Access frequency and refault distance
*
* A workload is thrashing when its pages are frequently used but they
* are evicted from the inactive list every time before another access
* would have promoted them to the active list.
*
* In cases where the average access distance between thrashing pages
* is bigger than the size of memory there is nothing that can be
* done - the thrashing set could never fit into memory under any
* circumstance.
*
* However, the average access distance could be bigger than the
* inactive list, yet smaller than the size of memory. In this case,
* the set could fit into memory if it weren't for the currently
* active pages - which may be used more, hopefully less frequently:
*
* +-memory available to cache-+
* | |
* +-inactive------+-active----+
* a b | c d e f g h i | J K L M N |
* +---------------+-----------+
*
* It is prohibitively expensive to accurately track access frequency
* of pages. But a reasonable approximation can be made to measure
* thrashing on the inactive list, after which refaulting pages can be
* activated optimistically to compete with the existing active pages.
*
* For such approximation, we introduce a counter `eviction` (E)
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* here. This counter increases each time a page is evicted, and each evicted
* page will have a shadow that stores the counter reading at the eviction
* time as a timestamp. So when an evicted page was faulted again, we have:
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
* Let SP = ((E's reading @ current) - (E's reading @ eviction))
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* +-memory available to cache-+
* | |
* +-------------------------+===============+===========+
* | * shadows O O O | INACTIVE | ACTIVE |
* +-+-----------------------+===============+===========+
* | |
* +-----------------------+
* | SP
* fault page O -> Hole left by previously faulted in pages
* * -> The page corresponding to SP
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* Here SP can stands for how far the current workflow could push a page
* out of available memory. Since all evicted page was once head of
* INACTIVE list, the page could have such an access distance of:
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* SP + NR_INACTIVE
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* So if:
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* Which can be simplified to:
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* SP < NR_ACTIVE
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* Then the page is worth getting re-activated to start from ACTIVE part,
* since the access distance is shorter than total memory to make it stay.
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* And since this is only an estimation, based on several hypotheses, and
* it could break the ability of LRU to distinguish a workingset out of
* caches, so throttle this by two factors:
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* 1. Notice that re-faulted in pages may leave "holes" on the shadow
* part of LRU, that part is left unhandled on purpose to decrease
* re-activate rate for pages that have a large SP value (the larger
* SP value a page have, the more likely it will be affected by such
* holes).
* 2. When the ACTIVE part of LRU is long enough, challenging ACTIVE pages
* by re-activating a one-time faulted previously INACTIVE page may not
* be a good idea, so throttle the re-activation when ACTIVE > INACTIVE
* by comparing with INACTIVE instead.
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* Combined all above, we have:
* Upon refault, if any of the following conditions is met, mark the page
* as active:
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* - If ACTIVE LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
* SP < NR_ACTIVE
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
* - If ACTIVE LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
* SP < NR_INACTIVE
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
* Refaulting inactive pages
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
* All that is known about the active list is that the pages have been
* accessed more than once in the past. This means that at any given
* time there is actually a good chance that pages on the active list
* are no longer in active use.
*
* So when a refault distance of (R - E) is observed and there are at
* least (R - E) pages in the userspace workingset, the refaulting page
* is activated optimistically in the hope that (R - E) pages are actually
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
* used less frequently than the refaulting page - or even not used at
* all anymore.
*
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
* That means if inactive cache is refaulting with a suitable refault
* distance, we assume the cache workingset is transitioning and put
* pressure on the current workingset.
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
*
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
* If this is wrong and demotion kicks in, the pages which are truly
* used more frequently will be reactivated while the less frequently
* used once will be evicted from memory.
*
* But if this is right, the stale pages will be pushed out of memory
* and the used pages get to stay in cache.
*
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
* Refaulting active pages
*
* If on the other hand the refaulting pages have recently been
* deactivated, it means that the active list is no longer protecting
* actively used cache from reclaim. The cache is NOT transitioning to
* a different workingset; the existing workingset is thrashing in the
* space allocated to the page cache.
*
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
* Implementation
*
mm: workingset: age nonresident information alongside anonymous pages Patch series "fix for "mm: balance LRU lists based on relative thrashing" patchset" This patchset fixes some problems of the patchset, "mm: balance LRU lists based on relative thrashing", which is now merged on the mainline. Patch "mm: workingset: let cache workingset challenge anon fix" is the result of discussion with Johannes. See following link. http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org And, the other two are minor things which are found when I try to rebase my patchset. This patch (of 3): After ("mm: workingset: let cache workingset challenge anon fix"), we compare refault distances to active_file + anon. But age of the non-resident information is only driven by the file LRU. As a result, we may overestimate the recency of any incoming refaults and activate them too eagerly, causing unnecessary LRU churn in certain situations. Make anon aging drive nonresident age as well to address that. Link: http://lkml.kernel.org/r/1592288204-27734-1-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1592288204-27734-2-git-send-email-iamjoonsoo.kim@lge.com Fixes: 34e58cac6d8f2a ("mm: workingset: let cache workingset challenge anon") Reported-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 11:30:31 +08:00
* For each node's LRU lists, a counter for inactive evictions and
* activations is maintained (node->evictions).
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
* On eviction, a snapshot of this counter (along with some bits to
* identify the node) is stored in the now empty page cache
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
* slot of the evicted page. This is called a shadow entry.
*
* On cache misses for which there are shadow entries, an eligible
* refault distance will immediately activate the refaulting page.
*/
#define WORKINGSET_SHIFT 1
#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
WORKINGSET_SHIFT + NODES_SHIFT + \
MEM_CGROUP_ID_SHIFT)
#define EVICTION_BITS (BITS_PER_LONG - (EVICTION_SHIFT))
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
#define LRU_GEN_EVICTION_BITS (EVICTION_BITS - LRU_REFS_WIDTH)
mm: workingset: eviction buckets for bigmem/lowbit machines For per-cgroup thrash detection, we need to store the memcg ID inside the radix tree cookie as well. However, on 32 bit that doesn't leave enough bits for the eviction timestamp to cover the necessary range of recently evicted pages. The radix tree entry would look like this: [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ] 12 bits means 4096 pages, means 16M worth of recently evicted pages. But refaults are actionable up to distances covering half of memory. To not miss refaults, we have to stretch out the range at the cost of how precisely we can tell when a page was evicted. This way we can shave off lower bits from the eviction timestamp until the necessary range is covered. E.g. grouping evictions into 1M buckets (256 pages) will stretch the longest representable refault distance to 4G. This patch implements eviction buckets that are automatically sized according to the available bits and the necessary refault range, in preparation for per-cgroup thrash detection. The maximum actionable distance is currently half of memory, but to support memory hotplug of up to 200% of boot-time memory, we size the buckets to cover double the distance. Beyond that, thrashing won't be detectable anymore. During boot, the kernel will print out the exact parameters, like so: [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6 In this example, there are 12 radix entry bits available for the eviction timestamp, to cover a maximum distance of 2^18 pages (this is a 1G machine). Consequently, evictions must be grouped into buckets of 2^6 pages, or 256K. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
/*
* Eviction timestamps need to be able to cover the full range of
* actionable refaults. However, bits are tight in the xarray
mm: workingset: eviction buckets for bigmem/lowbit machines For per-cgroup thrash detection, we need to store the memcg ID inside the radix tree cookie as well. However, on 32 bit that doesn't leave enough bits for the eviction timestamp to cover the necessary range of recently evicted pages. The radix tree entry would look like this: [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ] 12 bits means 4096 pages, means 16M worth of recently evicted pages. But refaults are actionable up to distances covering half of memory. To not miss refaults, we have to stretch out the range at the cost of how precisely we can tell when a page was evicted. This way we can shave off lower bits from the eviction timestamp until the necessary range is covered. E.g. grouping evictions into 1M buckets (256 pages) will stretch the longest representable refault distance to 4G. This patch implements eviction buckets that are automatically sized according to the available bits and the necessary refault range, in preparation for per-cgroup thrash detection. The maximum actionable distance is currently half of memory, but to support memory hotplug of up to 200% of boot-time memory, we size the buckets to cover double the distance. Beyond that, thrashing won't be detectable anymore. During boot, the kernel will print out the exact parameters, like so: [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6 In this example, there are 12 radix entry bits available for the eviction timestamp, to cover a maximum distance of 2^18 pages (this is a 1G machine). Consequently, evictions must be grouped into buckets of 2^6 pages, or 256K. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
* entry, and after storing the identifier for the lruvec there might
* not be enough left to represent every single actionable refault. In
* that case, we have to sacrifice granularity for distance, and group
* evictions into coarser buckets by shaving off lower timestamp bits.
*/
static unsigned int bucket_order __read_mostly;
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
static unsigned int lru_gen_bucket_order __read_mostly;
mm: workingset: eviction buckets for bigmem/lowbit machines For per-cgroup thrash detection, we need to store the memcg ID inside the radix tree cookie as well. However, on 32 bit that doesn't leave enough bits for the eviction timestamp to cover the necessary range of recently evicted pages. The radix tree entry would look like this: [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ] 12 bits means 4096 pages, means 16M worth of recently evicted pages. But refaults are actionable up to distances covering half of memory. To not miss refaults, we have to stretch out the range at the cost of how precisely we can tell when a page was evicted. This way we can shave off lower bits from the eviction timestamp until the necessary range is covered. E.g. grouping evictions into 1M buckets (256 pages) will stretch the longest representable refault distance to 4G. This patch implements eviction buckets that are automatically sized according to the available bits and the necessary refault range, in preparation for per-cgroup thrash detection. The maximum actionable distance is currently half of memory, but to support memory hotplug of up to 200% of boot-time memory, we size the buckets to cover double the distance. Beyond that, thrashing won't be detectable anymore. During boot, the kernel will print out the exact parameters, like so: [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6 In this example, there are 12 radix entry bits available for the eviction timestamp, to cover a maximum distance of 2^18 pages (this is a 1G machine). Consequently, evictions must be grouped into buckets of 2^6 pages, or 256K. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
bool workingset)
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
{
eviction &= EVICTION_MASK;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
eviction = (eviction << WORKINGSET_SHIFT) | workingset;
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
return xa_mk_value(eviction);
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
}
static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
unsigned long *evictionp, bool *workingsetp)
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
{
unsigned long entry = xa_to_value(shadow);
int memcgid, nid;
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
bool workingset;
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
workingset = entry & ((1UL << WORKINGSET_SHIFT) - 1);
entry >>= WORKINGSET_SHIFT;
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
nid = entry & ((1UL << NODES_SHIFT) - 1);
entry >>= NODES_SHIFT;
memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
entry >>= MEM_CGROUP_ID_SHIFT;
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*memcgidp = memcgid;
*pgdat = NODE_DATA(nid);
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
*evictionp = entry;
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
*workingsetp = workingset;
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
}
#ifdef CONFIG_EMM_WORKINGSET_TRACKING
static void workingset_eviction_file(struct lruvec *lruvec, unsigned long nr_pages)
{
do {
atomic_long_add(nr_pages, &lruvec->evicted_file);
} while ((lruvec = parent_lruvec(lruvec)));
}
/*
* If a page is evicted and never come back, either this page is really cold or it
* is deleted on disk.
*
* For cold page, it could take up all of memory until kswapd start to shrink it.
* For deleted page, the shadow will be gone too, so no refault.
*
* If a page comes back before it's shadow is released, that's a refault, which means
* file page reclaim have gone over-aggressive and that page would not have been evicted
* if all the page, include it self, stayed in memory.
*/
static void workingset_refault_track(struct lruvec *lruvec, unsigned long refault_distance)
{
do {
/*
* Not taking any lock, for better performance, may lead to some
* event got lost, but it's just a rough estimation anyway.
*/
WRITE_ONCE(lruvec->refault_count, READ_ONCE(lruvec->refault_count) + 1);
WRITE_ONCE(lruvec->total_distance, READ_ONCE(lruvec->total_distance) + refault_distance);
} while ((lruvec = parent_lruvec(lruvec)));
}
#else
static void workingset_eviction_file(struct lruvec *lruvec, unsigned long nr_pages)
{
}
static void workingset_refault_track(struct lruvec *lruvec, unsigned long refault_distance)
{
}
#endif
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
static inline struct mem_cgroup *try_get_flush_memcg(int memcgid)
{
struct mem_cgroup *memcg;
/*
* Look up the memcg associated with the stored ID. It might
* have been deleted since the folio's eviction.
*
* Note that in rare events the ID could have been recycled
* for a new cgroup that refaults a shared folio. This is
* impossible to tell from the available data. However, this
* should be a rare and limited disturbance, and activations
* are always speculative anyway. Ultimately, it's the aging
* algorithm's job to shake out the minimum access frequency
* for the active cache.
*
* XXX: On !CONFIG_MEMCG, this will always return NULL; it
* would be better if the root_mem_cgroup existed in all
* configurations instead.
*/
rcu_read_lock();
memcg = mem_cgroup_from_id(memcgid);
if (!mem_cgroup_disabled() &&
(!memcg || !mem_cgroup_tryget(memcg))) {
rcu_read_unlock();
return NULL;
}
rcu_read_unlock();
/*
* Flush stats (and potentially sleep) outside the RCU read section.
* XXX: With per-memcg flushing and thresholding, is ratelimiting
* still needed here?
*/
mem_cgroup_flush_stats_ratelimited(memcg);
return memcg;
}
/**
* lru_eviction - age non-resident entries as LRU ages
*
* As in-memory pages are aged, non-resident pages need to be aged as
* well, in order for the refault distances later on to be comparable
* to the in-memory dimensions. This function allows reclaim and LRU
* operations to drive the non-resident aging along in parallel.
*/
static inline unsigned long lru_eviction(struct lruvec *lruvec, int type,
int nr_pages, int bits, int bucket_order)
{
unsigned long eviction;
if (type)
workingset_eviction_file(lruvec, nr_pages);
/*
* Reclaiming a cgroup means reclaiming all its children in a
* round-robin fashion. That means that each cgroup has an LRU
* order that is composed of the LRU orders of its child
* cgroups; and every page has an LRU position not just in the
* cgroup that owns it, but in all of that group's ancestors.
*
* So when the physical inactive list of a leaf cgroup ages,
* the virtual inactive lists of all its parents, including
* the root cgroup's, age as well.
*/
eviction = atomic_long_fetch_add_relaxed(nr_pages, &lruvec->evictions[type]);
while ((lruvec = parent_lruvec(lruvec)))
atomic_long_add(nr_pages, &lruvec->evictions[type]);
/* Truncate the timestamp to fit in limited bits */
eviction >>= bucket_order;
eviction &= ~0UL >> (BITS_PER_LONG - bits);
return eviction;
}
/*
* lru_distance - calculate the refault distance based on non-resident age
*/
static inline unsigned long lru_distance(struct lruvec *lruvec, int type,
unsigned long eviction, int bits,
int bucket_order)
{
unsigned long refault = atomic_long_read(&lruvec->evictions[type]);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
eviction &= ~0UL >> (BITS_PER_LONG - bits);
eviction <<= bucket_order;
/*
* The unsigned subtraction here gives an accurate distance
* across non-resident age overflows in most cases. There is a
* special case: usually, shadow entries have a short lifetime
* and are either refaulted or reclaimed along with the inode
* before they get too old. But it is not impossible for the
* non-resident age to lap a shadow entry in the field, which
* can then result in a false small refault distance, leading
* to a false activation should this old entry actually
* refault again. However, earlier kernels used to deactivate
* unconditionally with *every* reclaim invocation for the
* longest time, so the occasional inappropriate activation
* leading to pressure on the active list is not a problem.
*/
return (refault - eviction) & (~0UL >> (BITS_PER_LONG - bits));
}
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
#ifdef CONFIG_LRU_GEN
static void *lru_gen_eviction(struct folio *folio)
{
int hist;
unsigned long token;
struct lruvec *lruvec;
mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio Patch series "mm: multi-gen LRU: memcg LRU", v3. Overview ======== An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, since each node and memcg combination has an LRU of folios (see mem_cgroup_lruvec()). Its goal is to improve the scalability of global reclaim, which is critical to system-wide memory overcommit in data centers. Note that memcg reclaim is currently out of scope. Its memory bloat is a pointer to each lruvec and negligible to each pglist_data. In terms of traversing memcgs during global reclaim, it improves the best-case complexity from O(n) to O(1) and does not affect the worst-case complexity O(n). Therefore, on average, it has a sublinear complexity in contrast to the current linear complexity. The basic structure of an memcg LRU can be understood by an analogy to the active/inactive LRU (of folios): 1. It has the young and the old (generations), i.e., the counterparts to the active and the inactive; 2. The increment of max_seq triggers promotion, i.e., the counterpart to activation; 3. Other events trigger similar operations, e.g., offlining an memcg triggers demotion, i.e., the counterpart to deactivation. In terms of global reclaim, it has two distinct features: 1. Sharding, which allows each thread to start at a random memcg (in the old generation) and improves parallelism; 2. Eventual fairness, which allows direct reclaim to bail out at will and reduces latency without affecting fairness over some time. The commit message in patch 6 details the workflow: https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/ The following is a simple test to quickly verify its effectiveness. Test design: 1. Create multiple memcgs. 2. Each memcg contains a job (fio). 3. All jobs access the same amount of memory randomly. 4. The system does not experience global memory pressure. 5. Periodically write to the root memory.reclaim. Desired outcome: 1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal) over mean(pgsteal) is close to 0%. 2. The total pgsteal is close to the total requested through memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close to 100%. Actual outcome [1]: MGLRU off MGLRU on stddev(pgsteal) / mean(pgsteal) 75% 20% sum(pgsteal) / sum(requested) 425% 95% #################################################################### MEMCGS=128 for ((memcg = 0; memcg < $MEMCGS; memcg++)); do mkdir /sys/fs/cgroup/memcg$memcg done start() { echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \ --filename=/dev/zero --size=1920M --rw=randrw \ --rate=64m,64m --random_distribution=random \ --fadvise_hint=0 --time_based --runtime=10h \ --group_reporting --minimal } for ((memcg = 0; memcg < $MEMCGS; memcg++)); do start & done sleep 600 for ((i = 0; i < 600; i++)); do echo 256m >/sys/fs/cgroup/memory.reclaim sleep 6 done for ((memcg = 0; memcg < $MEMCGS; memcg++)); do grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat done #################################################################### [1]: This was obtained from running the above script (touches less than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an hour. This patch (of 8): The new name lru_gen_folio will be more distinct from the coming lru_gen_memcg. Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-22 12:18:59 +08:00
struct lru_gen_folio *lrugen;
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
int type = folio_is_file_lru(folio);
int delta = folio_nr_pages(folio);
int refs = folio_lru_refs(folio);
int tier = lru_tier_from_refs(refs);
struct mem_cgroup *memcg = folio_memcg(folio);
struct pglist_data *pgdat = folio_pgdat(folio);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
BUILD_BUG_ON(LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
hist = lru_hist_of_min_seq(lruvec, type);
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
token = max(refs - 1, 0);
token <<= LRU_GEN_EVICTION_BITS;
token |= lru_eviction(lruvec, type, delta,
LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
}
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
/*
* Tests if the shadow entry is for a folio that was recently evicted.
* Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
*/
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
static bool inline lru_gen_test_recent(struct lruvec *lruvec, bool type,
unsigned long distance)
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
{
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
int hist;
unsigned long evicted = 0;
struct lru_gen_folio *lrugen;
lrugen = &lruvec->lrugen;
hist = lru_hist_of_min_seq(lruvec, type);
for (int tier = 0; tier < MAX_NR_TIERS; tier++)
evicted += atomic_long_read(&lrugen->evicted[hist][type][tier]);
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
return distance <= evicted;
}
enum lru_gen_refault_distance {
DISTANCE_SHORT,
DISTANCE_MID,
DISTANCE_LONG,
DISTANCE_NONE,
};
static inline int lru_gen_test_refault(struct lruvec *lruvec, bool file,
unsigned long distance, bool can_swap)
{
unsigned long total;
total = lruvec_page_state(lruvec, NR_ACTIVE_FILE) +
lruvec_page_state(lruvec, NR_INACTIVE_FILE);
if (can_swap)
total += lruvec_page_state(lruvec, NR_ACTIVE_ANON) +
lruvec_page_state(lruvec, NR_INACTIVE_ANON);
/* Imagine having an extra gen outside of available memory */
if (distance <= total / MAX_NR_GENS)
return DISTANCE_SHORT;
if (distance <= total / MIN_NR_GENS)
return DISTANCE_MID;
if (distance <= total)
return DISTANCE_LONG;
return DISTANCE_NONE;
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
}
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
static void lru_gen_refault(struct folio *folio, void *shadow)
{
int memcgid;
bool recent;
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
bool workingset;
unsigned long token;
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
int hist, tier, refs;
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
struct lruvec *lruvec;
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
struct mem_cgroup *memcg;
struct pglist_data *pgdat;
mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio Patch series "mm: multi-gen LRU: memcg LRU", v3. Overview ======== An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, since each node and memcg combination has an LRU of folios (see mem_cgroup_lruvec()). Its goal is to improve the scalability of global reclaim, which is critical to system-wide memory overcommit in data centers. Note that memcg reclaim is currently out of scope. Its memory bloat is a pointer to each lruvec and negligible to each pglist_data. In terms of traversing memcgs during global reclaim, it improves the best-case complexity from O(n) to O(1) and does not affect the worst-case complexity O(n). Therefore, on average, it has a sublinear complexity in contrast to the current linear complexity. The basic structure of an memcg LRU can be understood by an analogy to the active/inactive LRU (of folios): 1. It has the young and the old (generations), i.e., the counterparts to the active and the inactive; 2. The increment of max_seq triggers promotion, i.e., the counterpart to activation; 3. Other events trigger similar operations, e.g., offlining an memcg triggers demotion, i.e., the counterpart to deactivation. In terms of global reclaim, it has two distinct features: 1. Sharding, which allows each thread to start at a random memcg (in the old generation) and improves parallelism; 2. Eventual fairness, which allows direct reclaim to bail out at will and reduces latency without affecting fairness over some time. The commit message in patch 6 details the workflow: https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/ The following is a simple test to quickly verify its effectiveness. Test design: 1. Create multiple memcgs. 2. Each memcg contains a job (fio). 3. All jobs access the same amount of memory randomly. 4. The system does not experience global memory pressure. 5. Periodically write to the root memory.reclaim. Desired outcome: 1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal) over mean(pgsteal) is close to 0%. 2. The total pgsteal is close to the total requested through memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close to 100%. Actual outcome [1]: MGLRU off MGLRU on stddev(pgsteal) / mean(pgsteal) 75% 20% sum(pgsteal) / sum(requested) 425% 95% #################################################################### MEMCGS=128 for ((memcg = 0; memcg < $MEMCGS; memcg++)); do mkdir /sys/fs/cgroup/memcg$memcg done start() { echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \ --filename=/dev/zero --size=1920M --rw=randrw \ --rate=64m,64m --random_distribution=random \ --fadvise_hint=0 --time_based --runtime=10h \ --group_reporting --minimal } for ((memcg = 0; memcg < $MEMCGS; memcg++)); do start & done sleep 600 for ((i = 0; i < 600; i++)); do echo 256m >/sys/fs/cgroup/memory.reclaim sleep 6 done for ((memcg = 0; memcg < $MEMCGS; memcg++)); do grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat done #################################################################### [1]: This was obtained from running the above script (touches less than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an hour. This patch (of 8): The new name lru_gen_folio will be more distinct from the coming lru_gen_memcg. Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-22 12:18:59 +08:00
struct lru_gen_folio *lrugen;
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
int type = folio_is_file_lru(folio);
int delta = folio_nr_pages(folio);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
int distance;
unsigned long refault_distance, protect_tier;
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
unpack_shadow(shadow, &memcgid, &pgdat, &token, &workingset);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
memcg = try_get_flush_memcg(memcgid);
if (!memcg)
return;
lruvec = mem_cgroup_lruvec(memcg, pgdat);
if (lruvec != folio_lruvec(folio))
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
goto unlock;
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
refault_distance = lru_distance(lruvec, type, token,
LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
workingset_refault_track(lruvec, distance);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
/* Check if the gen the page was evicted from still exist */
recent = lru_gen_test_recent(lruvec, type, refault_distance);
/* Check if the distance indicates a refault */
distance = lru_gen_test_refault(lruvec, type, refault_distance,
mem_cgroup_get_nr_swap_pages(memcg));
if (!recent && distance == DISTANCE_NONE)
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
goto unlock;
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
/* see the comment in folio_lru_refs() */
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
token >>= LRU_GEN_EVICTION_BITS;
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
tier = lru_tier_from_refs(refs);
/*
* Count the following two cases as stalls:
* 1. For pages accessed through page tables, hotter pages pushed out
* hot pages which refaulted immediately.
* 2. For pages accessed multiple times through file descriptors,
mm/mglru: fix underprotected page cache commit 081488051d28d32569ebb7c7a23572778b2e7d57 upstream. Unmapped folios accessed through file descriptors can be underprotected. Those folios are added to the oldest generation based on: 1. The fact that they are less costly to reclaim (no need to walk the rmap and flush the TLB) and have less impact on performance (don't cause major PFs and can be non-blocking if needed again). 2. The observation that they are likely to be single-use. E.g., for client use cases like Android, its apps parse configuration files and store the data in heap (anon); for server use cases like MySQL, it reads from InnoDB files and holds the cached data for tables in buffer pools (anon). However, the oldest generation can be very short lived, and if so, it doesn't provide the PID controller with enough time to respond to a surge of refaults. (Note that the PID controller uses weighted refaults and those from evicted generations only take a half of the whole weight.) In other words, for a short lived generation, the moving average smooths out the spike quickly. To fix the problem: 1. For folios that are already on LRU, if they can be beyond the tracking range of tiers, i.e., five accesses through file descriptors, move them to the second oldest generation to give them more time to age. (Note that tiers are used by the PID controller to statistically determine whether folios accessed multiple times through file descriptors are worth protecting.) 2. When adding unmapped folios to LRU, adjust the placement of them so that they are not too close to the tail. The effect of this is similar to the above. On Android, launching 55 apps sequentially: Before After Change workingset_refault_anon 25641024 25598972 0% workingset_refault_file 115016834 106178438 -8% Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08 14:14:04 +08:00
* they would have been protected by sort_folio().
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
*/
mm/mglru: fix underprotected page cache commit 081488051d28d32569ebb7c7a23572778b2e7d57 upstream. Unmapped folios accessed through file descriptors can be underprotected. Those folios are added to the oldest generation based on: 1. The fact that they are less costly to reclaim (no need to walk the rmap and flush the TLB) and have less impact on performance (don't cause major PFs and can be non-blocking if needed again). 2. The observation that they are likely to be single-use. E.g., for client use cases like Android, its apps parse configuration files and store the data in heap (anon); for server use cases like MySQL, it reads from InnoDB files and holds the cached data for tables in buffer pools (anon). However, the oldest generation can be very short lived, and if so, it doesn't provide the PID controller with enough time to respond to a surge of refaults. (Note that the PID controller uses weighted refaults and those from evicted generations only take a half of the whole weight.) In other words, for a short lived generation, the moving average smooths out the spike quickly. To fix the problem: 1. For folios that are already on LRU, if they can be beyond the tracking range of tiers, i.e., five accesses through file descriptors, move them to the second oldest generation to give them more time to age. (Note that tiers are used by the PID controller to statistically determine whether folios accessed multiple times through file descriptors are worth protecting.) 2. When adding unmapped folios to LRU, adjust the placement of them so that they are not too close to the tail. The effect of this is similar to the above. On Android, launching 55 apps sequentially: Before After Change workingset_refault_anon 25641024 25598972 0% workingset_refault_file 115016834 106178438 -8% Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08 14:14:04 +08:00
if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
if (distance <= DISTANCE_SHORT) {
/* Set ref bits and workingset (increase refs by one) */
if (!lru_gen_in_fault())
folio_set_active(folio);
else
set_mask_bits(&folio->flags, 0,
min_t(unsigned long, refs, BIT(LRU_REFS_WIDTH) - 1)
<< LRU_REFS_PGOFF);
folio_set_workingset(folio);
} else if (recent || distance <= DISTANCE_MID) {
/*
* Beyound PID protection range, no point increasing refs
* for highest tier, but we can activate file page.
*/
set_mask_bits(&folio->flags, 0, (unsigned long)(refs - workingset) << LRU_REFS_PGOFF);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
folio_set_workingset(folio);
} else {
set_mask_bits(&folio->flags, 0, 1UL << LRU_REFS_PGOFF);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
}
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
}
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
lrugen = &lruvec->lrugen;
hist = lru_hist_of_min_seq(lruvec, type);
protect_tier = tier;
/*
* Don't over-protect clean cache page (!tier page), if the page wasn't access
* for a while (refault distance > LRU / MAX_NR_GENS), there is no help keeping
* it in memory, bias higher tier instead.
*/
if (distance <= DISTANCE_SHORT && !tier) {
/* The folio is referenced one more time in the shadow gen */
folio_set_workingset(folio);
protect_tier = lru_tier_from_refs(1);
mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
}
if (protect_tier == tier && recent) {
atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
} else {
atomic_long_add(delta, &lrugen->avg_total[type][protect_tier]);
atomic_long_add(delta, &lrugen->avg_refaulted[type][protect_tier]);
}
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
unlock:
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
mem_cgroup_put(memcg);
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
}
#else /* !CONFIG_LRU_GEN */
static void *lru_gen_eviction(struct folio *folio)
{
return NULL;
}
static bool lru_gen_test_recent(struct lruvec *lruvec, bool file,
unsigned long token)
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
{
return false;
}
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
static void lru_gen_refault(struct folio *folio, void *shadow)
{
}
#endif /* CONFIG_LRU_GEN */
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
/**
* workingset_eviction - note the eviction of a folio from memory
mm: vmscan: detect file thrashing at the reclaim root We use refault information to determine whether the cache workingset is stable or transitioning, and dynamically adjust the inactive:active file LRU ratio so as to maximize protection from one-off cache during stable periods, and minimize IO during transitions. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, refaults only affect the local LRU order in the cgroup in which they are occuring. As a result, cache transitions can take longer in a cgrouped system as the active pages of sibling cgroups aren't challenged when they should be. [ Right now, this is somewhat theoretical, because the siblings, under continued regular reclaim pressure, should eventually run out of inactive pages - and since inactive:active *size* balancing is also done on a cgroup-local level, we will challenge the active pages eventually in most cases. But the next patch will move that relative size enforcement to the reclaim root as well, and then this patch here will be necessary to propagate refault pressure to siblings. ] This patch moves refault detection to the root of reclaim. Instead of remembering the cgroup owner of an evicted page, remember the cgroup that caused the reclaim to happen. When refaults later occur, they'll correctly influence the cross-cgroup LRU order that reclaim follows. I.e. if global reclaim kicked out pages in some subgroup A/B/C, the refault of those pages will challenge the global LRU order, and not just the local order down inside C. [hannes@cmpxchg.org: use page_memcg() instead of another lookup] Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 09:55:59 +08:00
* @target_memcg: the cgroup that is causing the reclaim
* @folio: the folio being evicted
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*
* Return: a shadow entry to be stored in @folio->mapping->i_pages in place
* of the evicted @folio so that a later refault can be detected.
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*/
void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
{
struct pglist_data *pgdat = folio_pgdat(folio);
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
unsigned long eviction;
struct lruvec *lruvec;
mm: vmscan: detect file thrashing at the reclaim root We use refault information to determine whether the cache workingset is stable or transitioning, and dynamically adjust the inactive:active file LRU ratio so as to maximize protection from one-off cache during stable periods, and minimize IO during transitions. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, refaults only affect the local LRU order in the cgroup in which they are occuring. As a result, cache transitions can take longer in a cgrouped system as the active pages of sibling cgroups aren't challenged when they should be. [ Right now, this is somewhat theoretical, because the siblings, under continued regular reclaim pressure, should eventually run out of inactive pages - and since inactive:active *size* balancing is also done on a cgroup-local level, we will challenge the active pages eventually in most cases. But the next patch will move that relative size enforcement to the reclaim root as well, and then this patch here will be necessary to propagate refault pressure to siblings. ] This patch moves refault detection to the root of reclaim. Instead of remembering the cgroup owner of an evicted page, remember the cgroup that caused the reclaim to happen. When refaults later occur, they'll correctly influence the cross-cgroup LRU order that reclaim follows. I.e. if global reclaim kicked out pages in some subgroup A/B/C, the refault of those pages will challenge the global LRU order, and not just the local order down inside C. [hannes@cmpxchg.org: use page_memcg() instead of another lookup] Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 09:55:59 +08:00
int memcgid;
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
/* Folio is fully exclusive and pins folio's memory cgroup pointer */
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[30, 32]% IOPS BW 5.19-rc1: 2673k 10.2GiB/s patch1-6: 3491k 13.3GiB/s Single workload: memcached (anon): -[4, 6]% Ops/sec KB/sec 5.19-rc1: 1161501.04 45177.25 patch1-6: 1106168.46 43025.04 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.19-rc1 40.33% page_vma_mapped_walk (overhead) 21.80% lzo1x_1_do_compress (real work) 7.53% do_raw_spin_lock 3.95% _raw_spin_unlock_irq 2.52% vma_interval_tree_iter_next 2.37% folio_referenced_one 2.28% vma_interval_tree_subtree_search 1.97% anon_vma_interval_tree_iter_first 1.60% ptep_clear_flush 1.06% __zram_bvec_write patch1-6 39.03% lzo1x_1_do_compress (real work) 18.47% page_vma_mapped_walk (overhead) 6.74% _raw_spin_unlock_irq 3.97% do_raw_spin_lock 2.49% ptep_clear_flush 2.48% anon_vma_interval_tree_iter_first 1.92% folio_referenced_one 1.88% __zram_bvec_write 1.48% memmove 1.31% vma_interval_tree_iter_next Configurations: CPU: single Snapdragon 7c Mem: total 4G ChromeOS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 16:00:03 +08:00
if (lru_gen_enabled())
return lru_gen_eviction(folio);
mm: vmscan: detect file thrashing at the reclaim root We use refault information to determine whether the cache workingset is stable or transitioning, and dynamically adjust the inactive:active file LRU ratio so as to maximize protection from one-off cache during stable periods, and minimize IO during transitions. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, refaults only affect the local LRU order in the cgroup in which they are occuring. As a result, cache transitions can take longer in a cgrouped system as the active pages of sibling cgroups aren't challenged when they should be. [ Right now, this is somewhat theoretical, because the siblings, under continued regular reclaim pressure, should eventually run out of inactive pages - and since inactive:active *size* balancing is also done on a cgroup-local level, we will challenge the active pages eventually in most cases. But the next patch will move that relative size enforcement to the reclaim root as well, and then this patch here will be necessary to propagate refault pressure to siblings. ] This patch moves refault detection to the root of reclaim. Instead of remembering the cgroup owner of an evicted page, remember the cgroup that caused the reclaim to happen. When refaults later occur, they'll correctly influence the cross-cgroup LRU order that reclaim follows. I.e. if global reclaim kicked out pages in some subgroup A/B/C, the refault of those pages will challenge the global LRU order, and not just the local order down inside C. [hannes@cmpxchg.org: use page_memcg() instead of another lookup] Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 09:55:59 +08:00
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
/* XXX: target_memcg can be NULL, go through lruvec */
memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
eviction = lru_eviction(lruvec, folio_is_file_lru(folio),
folio_nr_pages(folio), EVICTION_BITS, bucket_order);
return pack_shadow(memcgid, pgdat, eviction,
folio_test_workingset(folio));
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
}
/**
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
* workingset_test_recent - tests if the shadow entry is for a folio that was
* recently evicted. Also fills in @workingset with the value unpacked from
* shadow.
* @shadow: the shadow entry to be tested.
* @file: whether the corresponding folio is from the file lru.
* @workingset: where the workingset value unpacked from shadow should
* be stored.
* @tracking: whether do workingset tracking or not
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
*
* Return: true if the shadow is for a recently evicted folio; false otherwise.
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
*/
bool workingset_test_recent(void *shadow, bool file, bool *workingset, bool tracking)
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
{
mm: vmscan: detect file thrashing at the reclaim root We use refault information to determine whether the cache workingset is stable or transitioning, and dynamically adjust the inactive:active file LRU ratio so as to maximize protection from one-off cache during stable periods, and minimize IO during transitions. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, refaults only affect the local LRU order in the cgroup in which they are occuring. As a result, cache transitions can take longer in a cgrouped system as the active pages of sibling cgroups aren't challenged when they should be. [ Right now, this is somewhat theoretical, because the siblings, under continued regular reclaim pressure, should eventually run out of inactive pages - and since inactive:active *size* balancing is also done on a cgroup-local level, we will challenge the active pages eventually in most cases. But the next patch will move that relative size enforcement to the reclaim root as well, and then this patch here will be necessary to propagate refault pressure to siblings. ] This patch moves refault detection to the root of reclaim. Instead of remembering the cgroup owner of an evicted page, remember the cgroup that caused the reclaim to happen. When refaults later occur, they'll correctly influence the cross-cgroup LRU order that reclaim follows. I.e. if global reclaim kicked out pages in some subgroup A/B/C, the refault of those pages will challenge the global LRU order, and not just the local order down inside C. [hannes@cmpxchg.org: use page_memcg() instead of another lookup] Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 09:55:59 +08:00
struct mem_cgroup *eviction_memcg;
struct lruvec *eviction_lruvec;
unsigned long refault_distance;
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
unsigned long inactive;
unsigned long active;
int memcgid;
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
struct pglist_data *pgdat;
unsigned long eviction;
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
/*
* Look up the memcg associated with the stored ID. It might
* have been deleted since the folio's eviction.
*
* Note that in rare events the ID could have been recycled
* for a new cgroup that refaults a shared folio. This is
* impossible to tell from the available data. However, this
* should be a rare and limited disturbance, and activations
* are always speculative anyway. Ultimately, it's the aging
* algorithm's job to shake out the minimum access frequency
* for the active cache.
*
* XXX: On !CONFIG_MEMCG, this will always return NULL; it
* would be better if the root_mem_cgroup existed in all
* configurations instead.
*/
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
eviction_memcg = try_get_flush_memcg(memcgid);
if (!eviction_memcg)
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
return false;
mm: workingset: move the stats flush into workingset_test_recent() Upstream: commit b006847222623ac3cda8589d15379eac86a2bcb7 Conflicts: none Backport-reason: mm: memcg: subtree stats flushing and thresholds The workingset code flushes the stats in workingset_refault() to get accurate stats of the eviction memcg. In preparation for more scoped flushed and passing the eviction memcg to the flush call, move the call to workingset_test_recent() where we have a pointer to the eviction memcg. The flush call is sleepable, and cannot be made in an rcu read section. Hence, minimize the rcu read section by also moving it into workingset_test_recent(). Furthermore, instead of holding the rcu read lock throughout workingset_test_recent(), only hold it briefly to get a ref on the eviction memcg. This allows us to make the flush call after we get the eviction memcg. As for workingset_refault(), nothing else there appears to be protected by rcu. The memcg of the faulted folio (which is not necessarily the same as the eviction memcg) is protected by the folio lock, which is held from all callsites. Add a VM_BUG_ON() to make sure this doesn't change from under us. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kairui Song <kasong@tencent.com>
2023-11-29 11:21:52 +08:00
mm: memcg: restore subtree stats flushing Upstream: commit 7d7ef0a4686abe43cd76a141b340a348f45ecdf2 Conflicts: Skip change in zswap.c, due to missing of b5ba474f3f51, should be OK, later backport will easily notice the change of function params. Backport-reason: mm: memcg: subtree stats flushing and thresholds Stats flushing for memcg currently follows the following rules: - Always flush the entire memcg hierarchy (i.e. flush the root). - Only one flusher is allowed at a time. If someone else tries to flush concurrently, they skip and return immediately. - A periodic flusher flushes all the stats every 2 seconds. The reason this approach is followed is because all flushes are serialized by a global rstat spinlock. On the memcg side, flushing is invoked from userspace reads as well as in-kernel flushers (e.g. reclaim, refault, etc). This approach aims to avoid serializing all flushers on the global lock, which can cause a significant performance hit under high concurrency. This approach has the following problems: - Occasionally a userspace read of the stats of a non-root cgroup will be too expensive as it has to flush the entire hierarchy [1]. - Sometimes the stats accuracy are compromised if there is an ongoing flush, and we skip and return before the subtree of interest is actually flushed, yielding stale stats (by up to 2s due to periodic flushing). This is more visible when reading stats from userspace, but can also affect in-kernel flushers. The latter problem is particulary a concern when userspace reads stats after an event occurs, but gets stats from before the event. Examples: - When memory usage / pressure spikes, a userspace OOM handler may look at the stats of different memcgs to select a victim based on various heuristics (e.g. how much private memory will be freed by killing this). Reading stale stats from before the usage spike in this case may cause a wrongful OOM kill. - A proactive reclaimer may read the stats after writing to memory.reclaim to measure the success of the reclaim operation. Stale stats from before reclaim may give a false negative. - Reading the stats of a parent and a child memcg may be inconsistent (child larger than parent), if the flush doesn't happen when the parent is read, but happens when the child is read. As for in-kernel flushers, they will occasionally get stale stats. No regressions are currently known from this, but if there are regressions, they would be very difficult to debug and link to the source of the problem. This patch aims to fix these problems by restoring subtree flushing, and removing the unified/coalesced flushing logic that skips flushing if there is an ongoing flush. This change would introduce a significant regression with global stats flushing thresholds. With per-memcg stats flushing thresholds, this seems to perform really well. The thresholds protect the underlying lock from unnecessary contention. This patch was tested in two ways to ensure the latency of flushing is up to par, on a machine with 384 cpus: - A synthetic test with 5000 concurrent workers in 500 cgroups doing allocations and reclaim, as well as 1000 readers for memory.stat (variation of [2]). No regressions were noticed in the total runtime. Note that significant regressions in this test are observed with global stats thresholds, but not with per-memcg thresholds. - A synthetic stress test for concurrently reading memcg stats while memory allocation/freeing workers are running in the background, provided by Wei Xu [3]. With 250k threads reading the stats every 100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01% of reads take more than 1ms, and no reads take more than 100ms. [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/ [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/ [akpm@linux-foundation.org: fix mm/zswap.c] [yosryahmed@google.com: remove stats flushing mutex] Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kairui Song <kasong@tencent.com>
2023-11-29 11:21:53 +08:00
/*
* Flush stats (and potentially sleep) outside the RCU read section.
* XXX: With per-memcg flushing and thresholding, is ratelimiting
* still needed here?
*/
mem_cgroup_flush_stats_ratelimited(eviction_memcg);
mm: vmscan: detect file thrashing at the reclaim root We use refault information to determine whether the cache workingset is stable or transitioning, and dynamically adjust the inactive:active file LRU ratio so as to maximize protection from one-off cache during stable periods, and minimize IO during transitions. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, refaults only affect the local LRU order in the cgroup in which they are occuring. As a result, cache transitions can take longer in a cgrouped system as the active pages of sibling cgroups aren't challenged when they should be. [ Right now, this is somewhat theoretical, because the siblings, under continued regular reclaim pressure, should eventually run out of inactive pages - and since inactive:active *size* balancing is also done on a cgroup-local level, we will challenge the active pages eventually in most cases. But the next patch will move that relative size enforcement to the reclaim root as well, and then this patch here will be necessary to propagate refault pressure to siblings. ] This patch moves refault detection to the root of reclaim. Instead of remembering the cgroup owner of an evicted page, remember the cgroup that caused the reclaim to happen. When refaults later occur, they'll correctly influence the cross-cgroup LRU order that reclaim follows. I.e. if global reclaim kicked out pages in some subgroup A/B/C, the refault of those pages will challenge the global LRU order, and not just the local order down inside C. [hannes@cmpxchg.org: use page_memcg() instead of another lookup] Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 09:55:59 +08:00
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
if (lru_gen_enabled()) {
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
bool recent;
refault_distance = lru_distance(eviction_lruvec, file, eviction,
LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
recent = lru_gen_test_recent(eviction_lruvec, file, refault_distance);
mem_cgroup_put(eviction_memcg);
return recent;
}
refault_distance = lru_distance(eviction_lruvec, file,
eviction, EVICTION_BITS, bucket_order);
if (tracking)
workingset_refault_track(eviction_lruvec, refault_distance);
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
/*
* Compare the distance to the existing workingset size. We
2020-06-04 07:02:43 +08:00
* don't activate pages that couldn't stay resident even if
* all the memory was available to the workingset. Whether
* workingset competition needs to consider anon or not depends
* on having free swap space.
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
*/
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
active = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
inactive = lruvec_page_state(eviction_lruvec, NR_INACTIVE_FILE);
workingset: fix confusion around eviction vs refault container Refault decisions are made based on the lruvec where the page was evicted, as that determined its LRU order while it was alive. Stats and workingset aging must then occur on the lruvec of the new page, as that's the node and cgroup that experience the refault and that's the lruvec whose nonresident info ages out by a new resident page. Those lruvecs could be different when a page is shared between cgroups, or the refaulting page is allocated on a different node. There are currently two mix-ups: 1. When swap is available, the resident anon set must be considered when comparing the refault distance. The comparison is made against the right anon set, but the check for swap is not. When pages get evicted from a cgroup with swap, and refault in one without, this can incorrectly consider a hot refault as cold - and vice versa. Fix that by using the eviction cgroup for the swap check. 2. The stats and workingset age are updated against the wrong lruvec altogether: the right cgroup but the wrong NUMA node. When a page refaults on a different NUMA node, this will have confusing stats and distort the workingset age on a different lruvec - again possibly resulting in hot/cold misclassifications down the line. Fix the swap check and the refault pgdat to address both concerns. This was found during code review. It hasn't caused notable issues in production, suggesting that those refault-migrations are relatively rare in practice. Link: https://lkml.kernel.org/r/20230104222944.2380117-1-nphamcs@gmail.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Co-developed-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-05 06:29:44 +08:00
if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
active += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON);
inactive += lruvec_page_state(eviction_lruvec, NR_INACTIVE_ANON);
2020-06-04 07:02:43 +08:00
}
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
mm: workingset: move the stats flush into workingset_test_recent() Upstream: commit b006847222623ac3cda8589d15379eac86a2bcb7 Conflicts: none Backport-reason: mm: memcg: subtree stats flushing and thresholds The workingset code flushes the stats in workingset_refault() to get accurate stats of the eviction memcg. In preparation for more scoped flushed and passing the eviction memcg to the flush call, move the call to workingset_test_recent() where we have a pointer to the eviction memcg. The flush call is sleepable, and cannot be made in an rcu read section. Hence, minimize the rcu read section by also moving it into workingset_test_recent(). Furthermore, instead of holding the rcu read lock throughout workingset_test_recent(), only hold it briefly to get a ref on the eviction memcg. This allows us to make the flush call after we get the eviction memcg. As for workingset_refault(), nothing else there appears to be protected by rcu. The memcg of the faulted folio (which is not necessarily the same as the eviction memcg) is protected by the folio lock, which is held from all callsites. Add a VM_BUG_ON() to make sure this doesn't change from under us. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kairui Song <kasong@tencent.com>
2023-11-29 11:21:52 +08:00
mem_cgroup_put(eviction_memcg);
emm: workingset: simplify and use a more intuitive model Upstream: pending This basically removed workingset_activation and reduced calls to workingset_age_nonresident. The idea behind this change is a new way to calculate the refault distance and prepare for adapting refault distance based file page protection for multi-gen LRU. Currently, refault distance re-activation for active/inactive can help keep working set pages in memory, it works by estimating the refault (re-access) distance of a page, if it's small enough, then put it on active LRU instead of inactive LRU. The estimation, as described in mm/workingset.c, is based on two assumptions: 1. Activation of an inactive page will left-shift LRU pages (considering LRU starts from right). 2. Eviction of an inactive page will left-shift LRU pages. Assumption 2 is correct, but assumption 1 is not always true, an activated page could be anywhere in the LRU list (through mark_page_accessed), it only left-shift the pages on its right side. And besides, one page can get activate/deactivated for multiple times. And multi-gen LRU doesn't fit with this model well, pages are getting aged in generations, and getting promoted frequently between generations. So instead we introduce a simpler idea here: Just presume the evicted pages are still in memory, each has an corresponding eviction timestamp (nonresistence_age) that is increased and recorded upon each eviction. These timestamp could logically form a "Shadow LRU", a read-only imaginary LRU. Let the `nonresistence_age` still be NA, then we have: Let SP = ((NA's reading @ current) - (NA's reading @ eviction)) +-memory available to cache-+ | | +-------------------------+===============+===========+ | * shadows O O O | INACTIVE | ACTIVE | +-+-----------------------+===============+===========+ | | +-----------------------+ | SP fault page O -> Hole left by refaulted in pages. Entries are suppose to be removed upon access but this is not a real LRU so can't really update it. * -> The page corresponding to SP It can be easily seen that SP stands for the offset of a page in the imaginary LRU, which is also how far the current workflow could push a page out of available memory. Since all evicted page was once head of INACTIVE list, the estimated minimum value of refault distance is: SP + NR_INACTIVE On refault, the page *may* get activated and stay in memory if we put it to active LRU if: SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE Which can be simplified to: SP < NR_ACTIVE Then the page is worth getting re-activated to start from active LRU, since the access distance is smaller than the total memory. And since this is only an estimation, based on several hypotheses, and it could break the ability of LRU to distinguish a workingset out of caches, in extreme cases all refault causing activation will lead to worse thrashing, so throttle this by two factors: 1. Notice previously re-faulted in pages may leave "holes" on the shadow part of LRU, that part is left unhandled on purpose to decrease re-activate rate for pages that have a large SP value (the larger SP value a page has, the more likely it will be affected by such holes). 2. When the active LRU is long enough, chanllaging active pages by re-activating a one-time access previously evicted/inactive page may not be a good idea, so throttle the re-activation when NR_ACTIVE > NR_INACTIVE, by comparing with NR_INACTIVE instead. Another effect of the refault activation throttling worth noticing is that, when the cache size is larger than total memory and hotness is similar among all cache pages, it can help hold a portion (possible have slightly higher hotness) of the caches in memory instead of letting caches get evicted permutably due to the nature of LRU. That's because the established workingset (active LRU) will tend to stay since we throttled reactivation when NR_ACTIVE is high. This side effect is actually similar with the algoritm before, which introduce such effect by increasing nonresistence_age in extra call paths, trottled the re-activation when activition/reactivation is massively happenning. Combined all above, we have following simple rules: Upon refault, if any of following conditions is met, mark page as active: - If active LRU is low (NR_ACTIVE < NR_INACTIVE), check if: SP < NR_ACTIVE - If active LRU is high (NR_ACTIVE >= NR_INACTIVE), check if: SP < NR_INACTIVE Code-wise, this is simpler than before since no longer need to do lruvec workingset data update when activating a page, and so far, a few benchmarks shows a similar or better result under memore pressure. The performance should also be better when there is no memory pressure since some memcg iteration and atomic operation is no longer needed. When combined with multi-gen LRU (in later commits) it shows a measurable performance gain for some workloads. Using memtier and fio test from commit ac35a4902374 but scaled down to fit in my test environment, and some other test results: memtier test (with 16G ramdisk as swap and 4G memcg limit on an i7-9700): memcached -u nobody -m 16384 -s /tmp/memcached.socket \ -a 0766 -t 12 -B binary & memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\ --key-minimum=1 --key-maximum=32000000 --key-pattern=P:P -c 1 \ -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6 fio test 1 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=5m --runtime=5m --group_reporting fio test 2 (with 16G ramdisk on 28G VM on an i7-9700): fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=zipf:1.2 --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a 10G memcg): sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \ --tables=36 --table-size=2000000 --threads=12 --time=1800 kernel build test done with 3G memcg limit on an i7-9700. Before (Average of 6 test run): fio: IOPS=5125.5k fio2: IOPS=7291.16k memcached: 57600.926 ops/s mysql: 6280.08 tps kernel-build: 1817.13499 seconds After (Average of 6 test run): fio: IOPS=5137.5k (+2.3%) fio2: IOPS=7300.67k (+1.3%) memcached: 57878.422 ops/s (+4.8%) mysql: 6312.06 tps (+0.5%) kernel-build: 1813.66231 seconds (+2.0%) Signed-off-by: Kairui Song <kasong@tencent.com>
2023-12-15 10:45:43 +08:00
/*
* When there are already enough active pages, be less aggressive
* on reactivating pages, challenge an large set of established
* active pages with one time refaulted page may not be a good idea.
*/
return refault_distance < min(active, inactive);
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
}
/**
* workingset_refault - Evaluate the refault of a previously evicted folio.
* @folio: The freshly allocated replacement folio.
* @shadow: Shadow entry of the evicted folio.
*
* Calculates and evaluates the refault distance of the previously
* evicted folio in the context of the node and the memcg whose memory
* pressure caused the eviction.
*/
void workingset_refault(struct folio *folio, void *shadow)
{
bool file = folio_is_file_lru(folio);
struct pglist_data *pgdat;
struct mem_cgroup *memcg;
struct lruvec *lruvec;
bool workingset;
long nr;
/*
* The activation decision for this folio is made at the level
* where the eviction occurred, as that is where the LRU order
* during folio reclaim is being determined.
*
* However, the cgroup that will own the folio is the one that
mm: workingset: move the stats flush into workingset_test_recent() Upstream: commit b006847222623ac3cda8589d15379eac86a2bcb7 Conflicts: none Backport-reason: mm: memcg: subtree stats flushing and thresholds The workingset code flushes the stats in workingset_refault() to get accurate stats of the eviction memcg. In preparation for more scoped flushed and passing the eviction memcg to the flush call, move the call to workingset_test_recent() where we have a pointer to the eviction memcg. The flush call is sleepable, and cannot be made in an rcu read section. Hence, minimize the rcu read section by also moving it into workingset_test_recent(). Furthermore, instead of holding the rcu read lock throughout workingset_test_recent(), only hold it briefly to get a ref on the eviction memcg. This allows us to make the flush call after we get the eviction memcg. As for workingset_refault(), nothing else there appears to be protected by rcu. The memcg of the faulted folio (which is not necessarily the same as the eviction memcg) is protected by the folio lock, which is held from all callsites. Add a VM_BUG_ON() to make sure this doesn't change from under us. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kairui Song <kasong@tencent.com>
2023-11-29 11:21:52 +08:00
* is actually experiencing the refault event. Make sure the folio is
* locked to guarantee folio_memcg() stability throughout.
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
*/
mm: workingset: move the stats flush into workingset_test_recent() Upstream: commit b006847222623ac3cda8589d15379eac86a2bcb7 Conflicts: none Backport-reason: mm: memcg: subtree stats flushing and thresholds The workingset code flushes the stats in workingset_refault() to get accurate stats of the eviction memcg. In preparation for more scoped flushed and passing the eviction memcg to the flush call, move the call to workingset_test_recent() where we have a pointer to the eviction memcg. The flush call is sleepable, and cannot be made in an rcu read section. Hence, minimize the rcu read section by also moving it into workingset_test_recent(). Furthermore, instead of holding the rcu read lock throughout workingset_test_recent(), only hold it briefly to get a ref on the eviction memcg. This allows us to make the flush call after we get the eviction memcg. As for workingset_refault(), nothing else there appears to be protected by rcu. The memcg of the faulted folio (which is not necessarily the same as the eviction memcg) is protected by the folio lock, which is held from all callsites. Add a VM_BUG_ON() to make sure this doesn't change from under us. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kairui Song <kasong@tencent.com>
2023-11-29 11:21:52 +08:00
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
if (lru_gen_enabled()) {
lru_gen_refault(folio, shadow);
return;
}
workingset: refactor LRU refault to expose refault recency check Patch series "cachestat: a new syscall for page cache state of files", v13. There is currently no good way to query the page cache statistics of large files and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really does not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in this thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages etc.) of a file, in a specified range of bytes. It also include a selftest suite that tests some typical usage. Currently, the syscall is only wired in for x86 architecture. This interface is inspired by past discussion and concerns with fincore, which has a similar design (and as a result, issues) as mincore. Relevant links: https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html I have also developed a small tool that computes the memory usage of files and directories, analogous to the du utility. User can choose between mincore or cachestat (with cachestat exporting more information than mincore). To compare the performance of these two options, I benchmarked the tool on the root directory of a Meta's server machine, each for five runs: Using cachestat real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602 user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742 sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689 Using mincore: real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059 user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162 sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046 I also ran both syscalls on a 2TB sparse file: Using cachestat: real 0m0.009s user 0m0.000s sys 0m0.009s Using mincore: real 0m37.510s user 0m2.934s sys 0m34.558s Very large files like this are the pathological case for mincore. In fact, to compute the stats for a single 2TB file, mincore takes as long as cachestat takes to compute the stats for the entire tree! This could easily happen inadvertently when we run it on subdirectories. Mincore is clearly not suitable for a general-purpose command line tool. Regarding security concerns, cachestat() should not pose any additional issues. The caller already has read permission to the file itself (since they need an fd to that file to call cachestat). This means that the caller can access the underlying data in its entirety, which is a much greater source of information (and as a result, a much greater security risk) than the cache status itself. The latest API change (in v13 of the patch series) is suggested by Jens Axboe. It allows for 64-bit length argument, even on 32-bit architecture (which is previously not possible due to the limit on the number of syscall arguments). Furthermore, it eliminates the need for compatibility handling - every user can use the same ABI. This patch (of 4): In preparation for computing recently evicted pages in cachestat, refactor workingset_refault and lru_gen_refault to expose a helper function that would test if an evicted page is recently evicted. [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()] Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:06 +08:00
nr = folio_nr_pages(folio);
memcg = folio_memcg(folio);
pgdat = folio_pgdat(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset, true))
mm: workingset: move the stats flush into workingset_test_recent() Upstream: commit b006847222623ac3cda8589d15379eac86a2bcb7 Conflicts: none Backport-reason: mm: memcg: subtree stats flushing and thresholds The workingset code flushes the stats in workingset_refault() to get accurate stats of the eviction memcg. In preparation for more scoped flushed and passing the eviction memcg to the flush call, move the call to workingset_test_recent() where we have a pointer to the eviction memcg. The flush call is sleepable, and cannot be made in an rcu read section. Hence, minimize the rcu read section by also moving it into workingset_test_recent(). Furthermore, instead of holding the rcu read lock throughout workingset_test_recent(), only hold it briefly to get a ref on the eviction memcg. This allows us to make the flush call after we get the eviction memcg. As for workingset_refault(), nothing else there appears to be protected by rcu. The memcg of the faulted folio (which is not necessarily the same as the eviction memcg) is protected by the folio lock, which is held from all callsites. Add a VM_BUG_ON() to make sure this doesn't change from under us. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kairui Song <kasong@tencent.com>
2023-11-29 11:21:52 +08:00
return;
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
folio_set_active(folio);
mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
/* Folio was active prior to eviction */
mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:04 +08:00
if (workingset) {
folio_set_workingset(folio);
/*
* XXX: Move to folio_add_lru() when it supports new vs
* putback
*/
mm: vmscan: make rotations a secondary factor in balancing anon vs file We noticed a 2% webserver throughput regression after upgrading from 5.6. This could be tracked down to a shift in the anon/file reclaim balance (confirmed with swappiness) that resulted in worse reclaim efficiency and thus more kswapd activity for the same outcome. The change that exposed the problem is aae466b0052e ("mm/swap: implement workingset detection for anonymous LRU"). By qualifying swapins based on their refault distance, it lowered the cost of anon reclaim in this workload, in turn causing (much) more anon scanning than before. Scanning the anon list is more expensive due to the higher ratio of mmapped pages that may rotate during reclaim, and so the result was an increase in %sys time. Right now, rotations aren't considered a cost when balancing scan pressure between LRUs. We can end up with very few file refaults putting all the scan pressure on hot anon pages that are rotated en masse, don't get reclaimed, and never push back on the file LRU again. We still only reclaim file cache in that case, but we burn a lot CPU rotating anon pages. It's "fair" from an LRU age POV, but doesn't reflect the real cost it imposes on the system. Consider rotations as a secondary factor in balancing the LRUs. This doesn't attempt to make a precise comparison between IO cost and CPU cost, it just says: if reloads are about comparable between the lists, or rotations are overwhelmingly different, adjust for CPU work. This fixed the regression on our webservers. It has since been deployed to the entire Meta fleet and hasn't caused any problems. Link: https://lkml.kernel.org/r/20221013193113.726425-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-14 03:31:13 +08:00
lru_note_cost_refault(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
mm: thrash detection-based file cache sizing The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:51 +08:00
}
}
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
/*
* Shadow entries reflect the share of the working set that does not
* fit into memory, so their number depends on the access pattern of
* the workload. In most cases, they will refault or get reclaimed
* along with the inode, but a (malicious) workload that streams
* through files with a total size several times that of available
* memory, while preventing the inodes from being reclaimed, can
* create excessive amounts of shadow nodes. To keep a lid on this,
* track shadow nodes and reclaim them when they grow way past the
* point where they would still be useful.
*/
struct list_lru shadow_nodes;
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
void workingset_update_node(struct xa_node *node)
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
{
struct address_space *mapping;
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
/*
* Track non-empty nodes that contain only shadow entries;
* unlink those that contain pages or are being freed.
*
* Avoid acquiring the list_lru lock when the nodes are
* already where they should be. The list_empty() test is safe
* as node->private_list is protected by the i_pages lock.
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
*/
mapping = container_of(node->array, struct address_space, i_pages);
lockdep_assert_held(&mapping->i_pages.xa_lock);
if (node->count && node->count == node->nr_values) {
if (list_empty(&node->private_list)) {
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
list_lru_add(&shadow_nodes, &node->private_list);
__inc_lruvec_kmem_state(node, WORKINGSET_NODES);
}
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
} else {
if (!list_empty(&node->private_list)) {
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
list_lru_del(&shadow_nodes, &node->private_list);
__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
}
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
}
}
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
static unsigned long count_shadow_nodes(struct shrinker *shrinker,
struct shrink_control *sc)
{
unsigned long max_nodes;
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
unsigned long nodes;
mm: workingset: don't drop refault information prematurely Patch series "psi: pressure stall information for CPU, memory, and IO", v4. Overview PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources. This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid; as well as anticipate major disruptions like OOM. Real-world applications We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories. One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets. We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in. It is available here: https://github.com/facebookincubator/oomd Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel. We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines. Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem. We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling. PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging. How do you use this feature? A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide. The cpu file contains one line: some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722 The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average. The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware). What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action. The memory file contains two lines: some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828 full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258 The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim. The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway. The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition. FAQ Q: How is PSI's CPU component different from the load average? A: There are several quirks in the load average that make it hard to impossible to tell how overcommitted the CPU really is. 1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run. PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup. 2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A *lot* can happen on a CPU in 5 seconds. This *may* be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization. PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events. 3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler. PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall. Q: What's the cost / performance impact of this feature? A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps. I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions. In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n. For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test. For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines. During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25%, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%. UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control. UPDATE: v4 is on par with v3. As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either. The profile for the pipe benchmark shows: 0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change 0.83% perf.real [kernel.vmlinux] [k] psi_group_change 0.82% perf.real [kernel.vmlinux] [k] psi_task_change 0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%. For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test. Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout. Daniel Drake said: : I just retested the latest version at : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results : are great. : : Test setup: : Endless OS : GeminiLake N4200 low end laptop : 2GB RAM : swap (and zram swap) disabled : : Baseline test: open a handful of large-ish apps and several website : tabs in Google Chrome. : : Results: after a couple of minutes, system is excessively thrashing, mouse : cursor can barely be moved, UI is not responding to mouse clicks, so it's : impractical to recover from this situation as an ordinary user : : Add my simple killer: : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd : : Results: when the thrashing causes the UI to become sluggish, the killer : steps in and kills something (usually a chrome tab), and the system : remains usable. I repeatedly opened more apps and more websites over a 15 : minute period but I wasn't able to get the system to a point of UI : unresponsiveness. Suren said: : Backported to 4.9 and retested on ARMv8 8 code system running Android. : Signals behave as expected reacting to memory pressure, no jumps in : "total" counters that would indicate an overflow/underflow issues. Nicely : done! This patch (of 9): If we keep just enough refault information to match the *current* page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure. [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters] Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <jweiner@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Christopher Lameter <cl@linux.com> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
unsigned long pages;
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
nodes = list_lru_shrink_count(&shadow_nodes, sc);
if (!nodes)
return SHRINK_EMPTY;
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
/*
* Approximate a reasonable limit for the nodes
* containing shadow entries. We don't need to keep more
* shadow entries than possible pages on the active list,
* since refault distances bigger than that are dismissed.
*
* The size of the active list converges toward 100% of
* overall page cache as memory grows, with only a tiny
* inactive list. Assume the total cache size for that.
*
* Nodes might be sparsely populated, with only one shadow
* entry in the extreme case. Obviously, we cannot keep one
* node for every eligible shadow entry, so compromise on a
* worst-case density of 1/8th. Below that, not all eligible
* refaults can be detected anymore.
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
*
* On 64-bit with 7 xa_nodes per page and 64 slots
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
* each, this will reclaim shadow entries when they consume
* ~1.8% of available memory:
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
*
* PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
*/
mm: workingset: don't drop refault information prematurely Patch series "psi: pressure stall information for CPU, memory, and IO", v4. Overview PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources. This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid; as well as anticipate major disruptions like OOM. Real-world applications We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories. One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets. We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in. It is available here: https://github.com/facebookincubator/oomd Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel. We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines. Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem. We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling. PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging. How do you use this feature? A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide. The cpu file contains one line: some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722 The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average. The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware). What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action. The memory file contains two lines: some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828 full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258 The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim. The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway. The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition. FAQ Q: How is PSI's CPU component different from the load average? A: There are several quirks in the load average that make it hard to impossible to tell how overcommitted the CPU really is. 1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run. PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup. 2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A *lot* can happen on a CPU in 5 seconds. This *may* be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization. PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events. 3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler. PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall. Q: What's the cost / performance impact of this feature? A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps. I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions. In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n. For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test. For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines. During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25%, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%. UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control. UPDATE: v4 is on par with v3. As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either. The profile for the pipe benchmark shows: 0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change 0.83% perf.real [kernel.vmlinux] [k] psi_group_change 0.82% perf.real [kernel.vmlinux] [k] psi_task_change 0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%. For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test. Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout. Daniel Drake said: : I just retested the latest version at : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results : are great. : : Test setup: : Endless OS : GeminiLake N4200 low end laptop : 2GB RAM : swap (and zram swap) disabled : : Baseline test: open a handful of large-ish apps and several website : tabs in Google Chrome. : : Results: after a couple of minutes, system is excessively thrashing, mouse : cursor can barely be moved, UI is not responding to mouse clicks, so it's : impractical to recover from this situation as an ordinary user : : Add my simple killer: : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd : : Results: when the thrashing causes the UI to become sluggish, the killer : steps in and kills something (usually a chrome tab), and the system : remains usable. I repeatedly opened more apps and more websites over a 15 : minute period but I wasn't able to get the system to a point of UI : unresponsiveness. Suren said: : Backported to 4.9 and retested on ARMv8 8 code system running Android. : Signals behave as expected reacting to memory pressure, no jumps in : "total" counters that would indicate an overflow/underflow issues. Nicely : done! This patch (of 9): If we keep just enough refault information to match the *current* page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure. [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters] Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <jweiner@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Christopher Lameter <cl@linux.com> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
#ifdef CONFIG_MEMCG
if (sc->memcg) {
mm: workingset: don't drop refault information prematurely Patch series "psi: pressure stall information for CPU, memory, and IO", v4. Overview PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources. This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid; as well as anticipate major disruptions like OOM. Real-world applications We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories. One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets. We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in. It is available here: https://github.com/facebookincubator/oomd Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel. We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines. Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem. We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling. PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging. How do you use this feature? A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide. The cpu file contains one line: some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722 The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average. The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware). What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action. The memory file contains two lines: some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828 full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258 The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim. The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway. The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition. FAQ Q: How is PSI's CPU component different from the load average? A: There are several quirks in the load average that make it hard to impossible to tell how overcommitted the CPU really is. 1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run. PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup. 2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A *lot* can happen on a CPU in 5 seconds. This *may* be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization. PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events. 3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler. PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall. Q: What's the cost / performance impact of this feature? A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps. I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions. In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n. For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test. For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines. During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25%, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%. UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control. UPDATE: v4 is on par with v3. As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either. The profile for the pipe benchmark shows: 0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change 0.83% perf.real [kernel.vmlinux] [k] psi_group_change 0.82% perf.real [kernel.vmlinux] [k] psi_task_change 0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%. For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test. Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout. Daniel Drake said: : I just retested the latest version at : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results : are great. : : Test setup: : Endless OS : GeminiLake N4200 low end laptop : 2GB RAM : swap (and zram swap) disabled : : Baseline test: open a handful of large-ish apps and several website : tabs in Google Chrome. : : Results: after a couple of minutes, system is excessively thrashing, mouse : cursor can barely be moved, UI is not responding to mouse clicks, so it's : impractical to recover from this situation as an ordinary user : : Add my simple killer: : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd : : Results: when the thrashing causes the UI to become sluggish, the killer : steps in and kills something (usually a chrome tab), and the system : remains usable. I repeatedly opened more apps and more websites over a 15 : minute period but I wasn't able to get the system to a point of UI : unresponsiveness. Suren said: : Backported to 4.9 and retested on ARMv8 8 code system running Android. : Signals behave as expected reacting to memory pressure, no jumps in : "total" counters that would indicate an overflow/underflow issues. Nicely : done! This patch (of 9): If we keep just enough refault information to match the *current* page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure. [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters] Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <jweiner@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Christopher Lameter <cl@linux.com> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
struct lruvec *lruvec;
int i;
mm: workingset: don't drop refault information prematurely Patch series "psi: pressure stall information for CPU, memory, and IO", v4. Overview PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources. This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid; as well as anticipate major disruptions like OOM. Real-world applications We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories. One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets. We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in. It is available here: https://github.com/facebookincubator/oomd Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel. We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines. Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem. We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling. PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging. How do you use this feature? A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide. The cpu file contains one line: some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722 The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average. The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware). What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action. The memory file contains two lines: some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828 full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258 The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim. The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway. The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition. FAQ Q: How is PSI's CPU component different from the load average? A: There are several quirks in the load average that make it hard to impossible to tell how overcommitted the CPU really is. 1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run. PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup. 2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A *lot* can happen on a CPU in 5 seconds. This *may* be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization. PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events. 3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler. PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall. Q: What's the cost / performance impact of this feature? A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps. I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions. In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n. For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test. For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines. During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25%, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%. UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control. UPDATE: v4 is on par with v3. As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either. The profile for the pipe benchmark shows: 0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change 0.83% perf.real [kernel.vmlinux] [k] psi_group_change 0.82% perf.real [kernel.vmlinux] [k] psi_task_change 0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%. For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test. Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout. Daniel Drake said: : I just retested the latest version at : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results : are great. : : Test setup: : Endless OS : GeminiLake N4200 low end laptop : 2GB RAM : swap (and zram swap) disabled : : Baseline test: open a handful of large-ish apps and several website : tabs in Google Chrome. : : Results: after a couple of minutes, system is excessively thrashing, mouse : cursor can barely be moved, UI is not responding to mouse clicks, so it's : impractical to recover from this situation as an ordinary user : : Add my simple killer: : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd : : Results: when the thrashing causes the UI to become sluggish, the killer : steps in and kills something (usually a chrome tab), and the system : remains usable. I repeatedly opened more apps and more websites over a 15 : minute period but I wasn't able to get the system to a point of UI : unresponsiveness. Suren said: : Backported to 4.9 and retested on ARMv8 8 code system running Android. : Signals behave as expected reacting to memory pressure, no jumps in : "total" counters that would indicate an overflow/underflow issues. Nicely : done! This patch (of 9): If we keep just enough refault information to match the *current* page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure. [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters] Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <jweiner@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Christopher Lameter <cl@linux.com> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
mem_cgroup_flush_stats_ratelimited(sc->memcg);
lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
mm: memcontrol: make cgroup stats and events query API explicitly local Patch series "mm: memcontrol: memory.stat cost & correctness". The cgroup memory.stat file holds recursive statistics for the entire subtree. The current implementation does this tree walk on-demand whenever the file is read. This is giving us problems in production. 1. The cost of aggregating the statistics on-demand is high. A lot of system service cgroups are mostly idle and their stats don't change between reads, yet we always have to check them. There are also always some lazily-dying cgroups sitting around that are pinned by a handful of remaining page cache; the same applies to them. In an application that periodically monitors memory.stat in our fleet, we have seen the aggregation consume up to 5% CPU time. 2. When cgroups die and disappear from the cgroup tree, so do their accumulated vm events. The result is that the event counters at higher-level cgroups can go backwards and confuse some of our automation, let alone people looking at the graphs over time. To address both issues, this patch series changes the stat implementation to spill counts upwards when the counters change. The upward spilling is batched using the existing per-cpu cache. In a sparse file stress test with 5 level cgroup nesting, the additional cost of the flushing was negligible (a little under 1% of CPU at 100% CPU utilization, compared to the 5% of reading memory.stat during regular operation). This patch (of 4): memcg_page_state(), lruvec_page_state(), memcg_sum_events() are currently returning the state of the local memcg or lruvec, not the recursive state. In practice there is a demand for both versions, although the callers that want the recursive counts currently sum them up by hand. Per default, cgroups are considered recursive entities and generally we expect more users of the recursive counters, with the local counts being special cases. To reflect that in the name, add a _local suffix to the current implementations. The following patch will re-incarnate these functions with recursive semantics, but with an O(1) implementation. [hannes@cmpxchg.org: fix bisection hole] Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-15 06:47:06 +08:00
pages += lruvec_page_state_local(lruvec,
NR_LRU_BASE + i);
pages += lruvec_page_state_local(
lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
pages += lruvec_page_state_local(
lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
mm: workingset: don't drop refault information prematurely Patch series "psi: pressure stall information for CPU, memory, and IO", v4. Overview PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources. This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid; as well as anticipate major disruptions like OOM. Real-world applications We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories. One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets. We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in. It is available here: https://github.com/facebookincubator/oomd Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel. We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines. Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem. We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling. PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging. How do you use this feature? A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide. The cpu file contains one line: some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722 The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average. The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware). What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action. The memory file contains two lines: some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828 full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258 The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim. The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway. The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition. FAQ Q: How is PSI's CPU component different from the load average? A: There are several quirks in the load average that make it hard to impossible to tell how overcommitted the CPU really is. 1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run. PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup. 2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A *lot* can happen on a CPU in 5 seconds. This *may* be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization. PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events. 3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler. PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall. Q: What's the cost / performance impact of this feature? A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps. I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions. In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n. For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test. For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines. During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25%, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%. UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control. UPDATE: v4 is on par with v3. As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either. The profile for the pipe benchmark shows: 0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change 0.83% perf.real [kernel.vmlinux] [k] psi_group_change 0.82% perf.real [kernel.vmlinux] [k] psi_task_change 0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%. For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test. Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout. Daniel Drake said: : I just retested the latest version at : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results : are great. : : Test setup: : Endless OS : GeminiLake N4200 low end laptop : 2GB RAM : swap (and zram swap) disabled : : Baseline test: open a handful of large-ish apps and several website : tabs in Google Chrome. : : Results: after a couple of minutes, system is excessively thrashing, mouse : cursor can barely be moved, UI is not responding to mouse clicks, so it's : impractical to recover from this situation as an ordinary user : : Add my simple killer: : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd : : Results: when the thrashing causes the UI to become sluggish, the killer : steps in and kills something (usually a chrome tab), and the system : remains usable. I repeatedly opened more apps and more websites over a 15 : minute period but I wasn't able to get the system to a point of UI : unresponsiveness. Suren said: : Backported to 4.9 and retested on ARMv8 8 code system running Android. : Signals behave as expected reacting to memory pressure, no jumps in : "total" counters that would indicate an overflow/underflow issues. Nicely : done! This patch (of 9): If we keep just enough refault information to match the *current* page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure. [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters] Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <jweiner@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Christopher Lameter <cl@linux.com> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:59 +08:00
} else
#endif
pages = node_present_pages(sc->nid);
Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax Pull XArray conversion from Matthew Wilcox: "The XArray provides an improved interface to the radix tree data structure, providing locking as part of the API, specifying GFP flags at allocation time, eliminating preloading, less re-walking the tree, more efficient iterations and not exposing RCU-protected pointers to its users. This patch set 1. Introduces the XArray implementation 2. Converts the pagecache to use it 3. Converts memremap to use it The page cache is the most complex and important user of the radix tree, so converting it was most important. Converting the memremap code removes the only other user of the multiorder code, which allows us to remove the radix tree code that supported it. I have 40+ followup patches to convert many other users of the radix tree over to the XArray, but I'd like to get this part in first. The other conversions haven't been in linux-next and aren't suitable for applying yet, but you can see them in the xarray-conv branch if you're interested" * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits) radix tree: Remove multiorder support radix tree test: Convert multiorder tests to XArray radix tree tests: Convert item_delete_rcu to XArray radix tree tests: Convert item_kill_tree to XArray radix tree tests: Move item_insert_order radix tree test suite: Remove multiorder benchmarking radix tree test suite: Remove __item_insert memremap: Convert to XArray xarray: Add range store functionality xarray: Move multiorder_check to in-kernel tests xarray: Move multiorder_shrink to kernel tests xarray: Move multiorder account test in-kernel radix tree test suite: Convert iteration test to XArray radix tree test suite: Convert tag_tagged_items to XArray radix tree: Remove radix_tree_clear_tags radix tree: Remove radix_tree_maybe_preload_order radix tree: Remove split/join code radix tree: Remove radix_tree_update_node_t page cache: Finish XArray conversion dax: Convert page fault handlers to XArray ...
2018-10-29 02:35:40 +08:00
max_nodes = pages >> (XA_CHUNK_SHIFT - 3);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
if (nodes <= max_nodes)
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
return 0;
mm: workingset: move shadow entry tracking to radix tree exceptional tracking Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 08:43:52 +08:00
return nodes - max_nodes;
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
}
static enum lru_status shadow_lru_isolate(struct list_head *item,
list_lru: add helpers to isolate items Currently, the isolate callback passed to the list_lru_walk family of functions is supposed to just delete an item from the list upon returning LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by __list_lru_walk_one after the callback returns. Since the callback is allowed to drop the lock after removing an item (it has to return LRU_REMOVED_RETRY then), the nr_items can be less than the actual number of elements on the list even if we check them under the lock. This makes it difficult to move items from one list_lru_one to another, which is required for per-memcg list_lru reparenting - we can't just splice the lists, we have to move entries one by one. This patch therefore introduces helpers that must be used by callback functions to isolate items instead of raw list_del/list_move. These are list_lru_isolate and list_lru_isolate_move. They not only remove the entry from the list, but also fix the nr_items counter, making sure nr_items always reflects the actual number of elements on the list if checked under the appropriate lock. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 06:59:35 +08:00
struct list_lru_one *lru,
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
spinlock_t *lru_lock,
void *arg) __must_hold(lru_lock)
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
{
struct xa_node *node = container_of(item, struct xa_node, private_list);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
struct address_space *mapping;
int ret;
/*
* Page cache insertions and deletions synchronously maintain
* the shadow node LRU under the i_pages lock and the
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
* lru_lock. Because the page cache tree is emptied before
* the inode can be destroyed, holding the lru_lock pins any
* address_space that has nodes on the LRU.
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
*
* We can then safely transition to the i_pages lock to
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
* pin only the address_space of the particular node we want
* to reclaim, take the node off-LRU, and drop the lru_lock.
*/
mapping = container_of(node->array, struct address_space, i_pages);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
/* Coming from the list, invert the lock order */
if (!xa_trylock(&mapping->i_pages)) {
spin_unlock_irq(lru_lock);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
ret = LRU_RETRY;
goto out;
}
/* For page cache we need to hold i_lock */
if (mapping->host != NULL) {
if (!spin_trylock(&mapping->host->i_lock)) {
xa_unlock(&mapping->i_pages);
spin_unlock_irq(lru_lock);
ret = LRU_RETRY;
goto out;
}
vfs: keep inodes with page cache off the inode shrinker LRU Historically (pre-2.5), the inode shrinker used to reclaim only empty inodes and skip over those that still contained page cache. This caused problems on highmem hosts: struct inode could put fill lowmem zones before the cache was getting reclaimed in the highmem zones. To address this, the inode shrinker started to strip page cache to facilitate reclaiming lowmem. However, this comes with its own set of problems: the shrinkers may drop actively used page cache just because the inodes are not currently open or dirty - think working with a large git tree. It further doesn't respect cgroup memory protection settings and can cause priority inversions between containers. Nowadays, the page cache also holds non-resident info for evicted cache pages in order to detect refaults. We've come to rely heavily on this data inside reclaim for protecting the cache workingset and driving swap behavior. We also use it to quantify and report workload health through psi. The latter in turn is used for fleet health monitoring, as well as driving automated memory sizing of workloads and containers, proactive reclaim and memory offloading schemes. The consequences of dropping page cache prematurely is that we're seeing subtle and not-so-subtle failures in all of the above-mentioned scenarios, with the workload generally entering unexpected thrashing states while losing the ability to reliably detect it. To fix this on non-highmem systems at least, going back to rotating inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 ("mm: don't reclaim inodes with many attached pages")) and failed (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many attached pages"")). The issue is mostly that shrinker pools attract pressure based on their size, and when objects get skipped the shrinkers remember this as deferred reclaim work. This accumulates excessive pressure on the remaining inodes, and we can quickly eat into heavily used ones, or dirty ones that require IO to reclaim, when there potentially is plenty of cold, clean cache around still. Instead, this patch keeps populated inodes off the inode LRU in the first place - just like an open file or dirty state would. An otherwise clean and unused inode then gets queued when the last cache entry disappears. This solves the problem without reintroducing the reclaim issues, and generally is a bit more scalable than having to wade through potentially hundreds of thousands of busy inodes. Locking is a bit tricky because the locks protecting the inode state (i_lock) and the inode LRU (lru_list.lock) don't nest inside the irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are serialized through i_lock, taken before the i_pages lock, to make sure depopulated inodes are queued reliably. Additions may race with deletions, but we'll check again in the shrinker. If additions race with the shrinker itself, we're protected by the i_lock: if find_inode() or iput() win, the shrinker will bail on the elevated i_count or I_REFERENCED; if the shrinker wins and goes ahead with the inode, it will set I_FREEING and inhibit further igets(), which will cause the other side to create a new instance of the inode instead. Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 10:31:24 +08:00
}
list_lru: add helpers to isolate items Currently, the isolate callback passed to the list_lru_walk family of functions is supposed to just delete an item from the list upon returning LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by __list_lru_walk_one after the callback returns. Since the callback is allowed to drop the lock after removing an item (it has to return LRU_REMOVED_RETRY then), the nr_items can be less than the actual number of elements on the list even if we check them under the lock. This makes it difficult to move items from one list_lru_one to another, which is required for per-memcg list_lru reparenting - we can't just splice the lists, we have to move entries one by one. This patch therefore introduces helpers that must be used by callback functions to isolate items instead of raw list_del/list_move. These are list_lru_isolate and list_lru_isolate_move. They not only remove the entry from the list, but also fix the nr_items counter, making sure nr_items always reflects the actual number of elements on the list if checked under the appropriate lock. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 06:59:35 +08:00
list_lru_isolate(lru, item);
__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
spin_unlock(lru_lock);
/*
* The nodes should only contain one or more shadow entries,
* no pages, so we expect to be able to remove them all and
* delete and free the empty node afterwards.
*/
if (WARN_ON_ONCE(!node->nr_values))
goto out_invalid;
if (WARN_ON_ONCE(node->count != node->nr_values))
goto out_invalid;
xa_delete_node(node, workingset_update_node);
__inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
out_invalid:
xa_unlock_irq(&mapping->i_pages);
if (mapping->host != NULL) {
if (mapping_shrinkable(mapping))
inode_add_lru(mapping->host);
spin_unlock(&mapping->host->i_lock);
}
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
ret = LRU_REMOVED_RETRY;
out:
cond_resched();
spin_lock_irq(lru_lock);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
return ret;
}
static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
struct shrink_control *sc)
{
/* list_lru lock nests inside the IRQ-safe i_pages lock */
return list_lru_shrink_walk_irq(&shadow_nodes, sc, shadow_lru_isolate,
NULL);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
}
static struct shrinker workingset_shadow_shrinker = {
.count_objects = count_shadow_nodes,
.scan_objects = scan_shadow_nodes,
mm: zero-seek shrinkers The page cache and most shrinkable slab caches hold data that has been read from disk, but there are some caches that only cache CPU work, such as the dentry and inode caches of procfs and sysfs, as well as the subset of radix tree nodes that track non-resident page cache. Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for the shrinker's seeks setting tells the reclaim algorithm that for every two page cache pages scanned it should scan one slab object. This is a bogus setting. A virtual inode that required no IO to create is not twice as valuable as a page cache page; shadow cache entries with eviction distances beyond the size of memory aren't either. In most cases, the behavior in practice is still fine. Such virtual caches don't tend to grow and assert themselves aggressively, and usually get picked up before they cause problems. But there are scenarios where that's not true. Our database workloads suffer from two of those. For one, their file workingset is several times bigger than available memory, which has the kernel aggressively create shadow page cache entries for the non-resident parts of it. The workingset code does tell the VM that most of these are expendable, but the VM ends up balancing them 2:1 to cache pages as per the seeks setting. This is a huge waste of memory. These workloads also deal with tens of thousands of open files and use /proc for introspection, which ends up growing the proc_inode_cache to absurdly large sizes - again at the cost of valuable cache space, which isn't a reasonable trade-off, given that proc inodes can be re-created without involving the disk. This patch implements a "zero-seek" setting for shrinkers that results in a target ratio of 0:1 between their objects and IO-backed caches. This allows such virtual caches to grow when memory is available (they do cache/avoid CPU work after all), but effectively disables them as soon as IO-backed objects are under pressure. It then switches the shrinkers for procfs and sysfs metadata, as well as excess page cache shadow nodes, to the new zero-seek setting. Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Domas Mituzas <dmituzas@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:06:42 +08:00
.seeks = 0, /* ->count reports only fully expendable nodes */
.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
};
/*
* Our list_lru->lock is IRQ-safe as it nests inside the IRQ-safe
* i_pages lock.
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
*/
static struct lock_class_key shadow_nodes_key;
static int __init workingset_init(void)
{
mm: workingset: eviction buckets for bigmem/lowbit machines For per-cgroup thrash detection, we need to store the memcg ID inside the radix tree cookie as well. However, on 32 bit that doesn't leave enough bits for the eviction timestamp to cover the necessary range of recently evicted pages. The radix tree entry would look like this: [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ] 12 bits means 4096 pages, means 16M worth of recently evicted pages. But refaults are actionable up to distances covering half of memory. To not miss refaults, we have to stretch out the range at the cost of how precisely we can tell when a page was evicted. This way we can shave off lower bits from the eviction timestamp until the necessary range is covered. E.g. grouping evictions into 1M buckets (256 pages) will stretch the longest representable refault distance to 4G. This patch implements eviction buckets that are automatically sized according to the available bits and the necessary refault range, in preparation for per-cgroup thrash detection. The maximum actionable distance is currently half of memory, but to support memory hotplug of up to 200% of boot-time memory, we size the buckets to cover double the distance. Beyond that, thrashing won't be detectable anymore. During boot, the kernel will print out the exact parameters, like so: [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6 In this example, there are 12 radix entry bits available for the eviction timestamp, to cover a maximum distance of 2^18 pages (this is a 1G machine). Consequently, evictions must be grouped into buckets of 2^6 pages, or 256K. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
unsigned int max_order;
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
int ret;
mm: workingset: eviction buckets for bigmem/lowbit machines For per-cgroup thrash detection, we need to store the memcg ID inside the radix tree cookie as well. However, on 32 bit that doesn't leave enough bits for the eviction timestamp to cover the necessary range of recently evicted pages. The radix tree entry would look like this: [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ] 12 bits means 4096 pages, means 16M worth of recently evicted pages. But refaults are actionable up to distances covering half of memory. To not miss refaults, we have to stretch out the range at the cost of how precisely we can tell when a page was evicted. This way we can shave off lower bits from the eviction timestamp until the necessary range is covered. E.g. grouping evictions into 1M buckets (256 pages) will stretch the longest representable refault distance to 4G. This patch implements eviction buckets that are automatically sized according to the available bits and the necessary refault range, in preparation for per-cgroup thrash detection. The maximum actionable distance is currently half of memory, but to support memory hotplug of up to 200% of boot-time memory, we size the buckets to cover double the distance. Beyond that, thrashing won't be detectable anymore. During boot, the kernel will print out the exact parameters, like so: [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6 In this example, there are 12 radix entry bits available for the eviction timestamp, to cover a maximum distance of 2^18 pages (this is a 1G machine). Consequently, evictions must be grouped into buckets of 2^6 pages, or 256K. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
BUILD_BUG_ON(BITS_PER_LONG < EVICTION_SHIFT);
/*
* Calculate the eviction bucket size to cover the longest
* actionable refault distance, which is currently half of
* memory (totalram_pages/2). However, memory hotplug may add
* some more pages at runtime, so keep working with up to
* double the initial memory by using totalram_pages as-is.
*/
max_order = fls_long(totalram_pages() - 1);
if (max_order > EVICTION_BITS)
bucket_order = max_order - EVICTION_BITS;
pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
EVICTION_BITS, max_order, bucket_order);
workingset, lru_gen: apply refault-distance based protection Upstream: pending I noticed MGLRU not working very well on certain workflows, which is observed on some heavily stressed databases. That is when the file page workingset size exceeds total memory, and the access distance (the left-shift time of a page before it gets activated, considering LRU starts from right) of file pages also larger than total memory. All file pages are stuck on the oldest generation and getting read-in then evicted permutably. Despite anon pages being idle, they never get aged. PID controller didn't kickin until there are some minor access pattern changes. And file pages are not promoted or reused. Even though the memory can't cover the whole workingset, the refault-distance based re-activation can help hold part of the workingset in-memory to help reduce the IO workload significantly. So apply it for MGLRU as well. The updated refault-distance model fits well for MGLRU in most cases, if we just consider the last two generation as the inactive LRU and the first two generations as active LRU. Some adjustment is done to fit the logic better, also make the refault-distance contributed to page tiering and PID refault detection of MGLRU: - If a tier-0 page have a qualified refault-distance, just promote it to higher tier, send it to second oldest gen. - If a tier >= 1 page have a qualified refault-distance, mark it as active and send it to youngest gen. - Increase the reference of every page that have a qualified refault-distance and increase the PID countroled refault rate of the updated tier, in hope similar paged will be protected next time upon eviction. NOTE: This also changed the meaning of workingset_* fields in /proc/vmstat, workingset_activate_* now stands for the pages reactivated or promoted by refault distance checking, workingset_restore_* now stands for all pages promoted by any reason. Following benchmark showed 5x improvement. To simulate the optimized workflow, I setup a 3-replicated mongodb cluster, each in a different cgroup, using 5 gb of wiretiger cache and 10g of oplog, on a 32G VM with no limit set. The benchmark is done using https://github.com/apavlo/py-tpcc.git, modified to run STOCK_LEVEL query only, for simulating slow query and get a stable result. Test is done on an EPYC 7K62 with 32G RAM with SATA SSD: - Before (with ZRAM enabled, the result won't change whether any kind of swap is on or not): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 919 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 577 27584645283.7 0.02 txn/s ------------------------------------------------------------------ TOTAL 577 27584645283.7 0.02 txn/s $ cat /proc/vmstat | grep workingset workingset_nodes 47860 workingset_refault_anon 0 workingset_refault_file 23498953 workingset_activate_anon 0 workingset_activate_file 23487840 workingset_restore_anon 0 workingset_restore_file 18553646 workingset_nodereclaim 768 $ free -m total used free shared buff/cache available Mem: 31849 6829 790 23 24229 24542 Swap: 31848 0 31848 - Patched: (with ZRAM enabled): $ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30 ================================================================== Execution Results after 905 seconds ------------------------------------------------------------------ Executed Time (µs) Rate STOCK_LEVEL 2542 27121571486.2 0.09 txn/s ------------------------------------------------------------------ TOTAL 2542 27121571486.2 0.09 txn/s $ cat /proc/vmstat | grep working workingset_nodes 70358 workingset_refault_anon 16853 workingset_refault_file 22693601 workingset_activate_anon 10099 workingset_activate_file 8565519 workingset_restore_anon 10127 workingset_restore_file 8566053 workingset_nodereclaim 9801 $ free -m total used free shared buff/cache available Mem: 31849 7093 283 4 24472 24289 Swap: 31848 1652 30196 The performance is 5x times better than before, and the idle anon pages now can get swapped out as expected. The result is also better with lower test stress, testing with lower stress also shows a improvement. There is no regression on other tests so far, and a performance gain is observed on file page heavy tasks. Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-01 23:43:38 +08:00
#ifdef CONFIG_LRU_GEN
if (max_order > LRU_GEN_EVICTION_BITS)
lru_gen_bucket_order = max_order - LRU_GEN_EVICTION_BITS;
pr_info("workingset: lru_gen_timestamp_bits=%d lru_gen_bucket_order=%u\n",
LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
#endif
mm: workingset: eviction buckets for bigmem/lowbit machines For per-cgroup thrash detection, we need to store the memcg ID inside the radix tree cookie as well. However, on 32 bit that doesn't leave enough bits for the eviction timestamp to cover the necessary range of recently evicted pages. The radix tree entry would look like this: [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ] 12 bits means 4096 pages, means 16M worth of recently evicted pages. But refaults are actionable up to distances covering half of memory. To not miss refaults, we have to stretch out the range at the cost of how precisely we can tell when a page was evicted. This way we can shave off lower bits from the eviction timestamp until the necessary range is covered. E.g. grouping evictions into 1M buckets (256 pages) will stretch the longest representable refault distance to 4G. This patch implements eviction buckets that are automatically sized according to the available bits and the necessary refault range, in preparation for per-cgroup thrash detection. The maximum actionable distance is currently half of memory, but to support memory hotplug of up to 200% of boot-time memory, we size the buckets to cover double the distance. Beyond that, thrashing won't be detectable anymore. During boot, the kernel will print out the exact parameters, like so: [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6 In this example, there are 12 radix entry bits available for the eviction timestamp, to cover a maximum distance of 2^18 pages (this is a 1G machine). Consequently, evictions must be grouped into buckets of 2^6 pages, or 256K. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:57:13 +08:00
mm: shrinkers: provide shrinkers with names Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-01 11:22:24 +08:00
ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
if (ret)
goto err;
fs: propagate shrinker::id to list_lru Add list_lru::shrinker_id field and populate it by registered shrinker id. This will be used to set correct bit in memcg shrinkers map by lru code in next patches, after there appeared the first related to memcg element in list_lru. Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:47:50 +08:00
ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key,
&workingset_shadow_shrinker);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
if (ret)
goto err_list_lru;
mm/workingset.c: refactor workingset_init() Use prealloc_shrinker()/register_shrinker_prepared() instead of register_shrinker(). This will be used in next patch. [ktkhai@virtuozzo.com: v9] Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:47:41 +08:00
register_shrinker_prepared(&workingset_shadow_shrinker);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
return 0;
err_list_lru:
mm/workingset.c: refactor workingset_init() Use prealloc_shrinker()/register_shrinker_prepared() instead of register_shrinker(). This will be used in next patch. [ktkhai@virtuozzo.com: v9] Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:47:41 +08:00
free_prealloced_shrinker(&workingset_shadow_shrinker);
mm: keep page cache radix tree nodes in check Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 05:47:56 +08:00
err:
return ret;
}
module_init(workingset_init);