License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2010-08-10 08:19:16 +08:00
|
|
|
#undef TRACE_SYSTEM
|
|
|
|
#define TRACE_SYSTEM vmscan
|
|
|
|
|
|
|
|
#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
|
|
|
|
#define _TRACE_VMSCAN_H
|
|
|
|
|
|
|
|
#include <linux/types.h>
|
|
|
|
#include <linux/tracepoint.h>
|
2011-06-16 06:08:14 +08:00
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/memcontrol.h>
|
mm, tracing: unify mm flags handling in tracepoints and printk
In tracepoints, it's possible to print gfp flags in a human-friendly
format through a macro show_gfp_flags(), which defines a translation
array and passes is to __print_flags(). Since the following patch will
introduce support for gfp flags printing in printk(), it would be nice
to reuse the array. This is not straightforward, since __print_flags()
can't simply reference an array defined in a .c file such as mm/debug.c
- it has to be a macro to allow the macro magic to communicate the
format to userspace tools such as trace-cmd.
The solution is to create a macro __def_gfpflag_names which is used both
in show_gfp_flags(), and to define the gfpflag_names[] array in
mm/debug.c.
On the other hand, mm/debug.c also defines translation tables for page
flags and vma flags, and desire was expressed (but not implemented in
this series) to use these also from tracepoints. Thus, this patch also
renames the events/gfpflags.h file to events/mmflags.h and moves the
table definitions there, using the same macro approach as for gfpflags.
This allows translating all three kinds of mm-specific flags both in
tracepoints and printk.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16 05:55:52 +08:00
|
|
|
#include <trace/events/mmflags.h>
|
2010-08-10 08:19:16 +08:00
|
|
|
|
2010-08-10 08:19:18 +08:00
|
|
|
#define RECLAIM_WB_ANON 0x0001u
|
|
|
|
#define RECLAIM_WB_FILE 0x0002u
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
#define RECLAIM_WB_MIXED 0x0010u
|
2012-05-30 06:06:19 +08:00
|
|
|
#define RECLAIM_WB_SYNC 0x0004u /* Unused, all reclaim async */
|
2010-08-10 08:19:18 +08:00
|
|
|
#define RECLAIM_WB_ASYNC 0x0008u
|
2017-02-23 07:44:33 +08:00
|
|
|
#define RECLAIM_WB_LRU (RECLAIM_WB_ANON|RECLAIM_WB_FILE)
|
2010-08-10 08:19:18 +08:00
|
|
|
|
|
|
|
#define show_reclaim_flags(flags) \
|
|
|
|
(flags) ? __print_flags(flags, "|", \
|
|
|
|
{RECLAIM_WB_ANON, "RECLAIM_WB_ANON"}, \
|
|
|
|
{RECLAIM_WB_FILE, "RECLAIM_WB_FILE"}, \
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
{RECLAIM_WB_MIXED, "RECLAIM_WB_MIXED"}, \
|
2010-08-10 08:19:18 +08:00
|
|
|
{RECLAIM_WB_SYNC, "RECLAIM_WB_SYNC"}, \
|
|
|
|
{RECLAIM_WB_ASYNC, "RECLAIM_WB_ASYNC"} \
|
|
|
|
) : "RECLAIM_WB_NONE"
|
|
|
|
|
mm/vmscan: throttle reclaim until some writeback completes if congested
Patch series "Remove dependency on congestion_wait in mm/", v5.
This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but
congestion_wait has been broken for a long time [1].
Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).
This series replaces the "congestion" throttling with 3 different types.
- If there are too many dirty/writeback pages, sleep until a timeout or
enough pages get cleaned
- If too many pages are isolated, sleep until enough isolated pages are
either reclaimed or put back on the LRU
- If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.
This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.
stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.
- One "anon latency" worker creates small mappings with mmap() and
times how long it takes to fault the mapping reading it 4K at a time
- X file writers which is fio randomly writing X files where the total
size of the files add up to the allowed dirty_ratio. fio is allowed
to run for a warmup period to allow some file-backed pages to
accumulate. The duration of the warmup is based on the best-case
linear write speed of the storage.
- Y file readers which is fio randomly reading small files
- Z anon memory hogs which continually map (100-dirty_ratio)% of memory
- Total estimated WSS = (100+dirty_ration) percentage of memory
X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.
The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.
The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.
The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.
Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.
First the results of the "anon latency" latency
stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r4
Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%)
Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%)
Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%)
Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%)
Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%)
Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%)
Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%)
Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%)
Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%)
Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%)
Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%)
Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%)
Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%)
Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%)
Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%)
Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%)
Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%)
Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%)
Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%)
Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%)
Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%)
Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%)
Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%)
Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%)
For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.
It is expected that this will be very machine dependant. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory
Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%*
Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%*
Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%)
Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%)
Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%)
The overall system CPU usage and elapsed time is as follows
5.15.0-rc3 5.15.0-rc3
vanilla mm-reclaimcongest-v5r4
Duration User 6989.03 983.42
Duration System 7308.12 799.68
Duration Elapsed 2277.67 2092.98
The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.
The high-level /proc/vmstats show
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned 1056608451.00 503594991.00
Ops Kswapd pages scanned 109795048.00 147289810.00
Ops Kswapd pages reclaimed 63269243.00 31036005.00
Ops Direct pages reclaimed 10803973.00 6328887.00
Ops Kswapd efficiency % 57.62 21.07
Ops Kswapd velocity 48204.98 57572.86
Ops Direct efficiency % 1.02 1.26
Ops Direct velocity 463898.83 196845.97
Kswapd scanned less pages but the detailed pattern is different. The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity. Direct reclaim scanning is reduced by
52% due to stalling.
The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.
Ops Percentage direct scans 90.59 77.37
For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling
Ops Page writes by reclaim 2613590.00 1687131.00
Page writes from reclaim context are reduced.
Ops Page writes anon 2932752.00 1917048.00
And there is less swapping.
Ops Page reclaim immediate 996248528.00 107664764.00
The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.
Ops Slabs scanned 164284.00 153608.00
Slab scan activity is similar.
ftrace was used to gather stall activity
Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
The fast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).
1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.
Bottom line: Vanilla will throttle but it's not effective.
Patch series
------------
Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU
1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry. For direct
reclaim, the number of times stalled for each reason were
6624 reason=VMSCAN_THROTTLE_ISOLATED
93246 reason=VMSCAN_THROTTLE_NOPROGRESS
96934 reason=VMSCAN_THROTTLE_WRITEBACK
The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward. A relatively small number were due to too many pages isolated
from the LRU by parallel threads
For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
Most did not stall at all. A small number reached the timeout.
For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map
1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.
While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.
For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.
Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.
This patch (of 5):
Page reclaim throttles on wait_iff_congested under the following
conditions:
- kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.
- Direct reclaim will stall if all dirty pages are backed by congested
inodes.
wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleeps until the
timeout expires.
[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]
Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 04:42:25 +08:00
|
|
|
#define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
|
2021-11-06 04:42:29 +08:00
|
|
|
#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED)
|
2021-11-06 04:42:32 +08:00
|
|
|
#define _VMSCAN_THROTTLE_NOPROGRESS (1 << VMSCAN_THROTTLE_NOPROGRESS)
|
mm: vmscan: Reduce throttling due to a failure to make progress
Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
problems due to reclaim throttling for excessive lengths of time. In
Alexey's case, a memory hog that should go OOM quickly stalls for
several minutes before stalling. In Mike and Darrick's cases, a small
memcg environment stalled excessively even though the system had enough
memory overall.
Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is
being made") introduced the problem although commit a19594ca4a8b
("mm/vmscan: increase the timeout if page reclaim is not making
progress") made it worse. Systems at or near an OOM state that cannot
be recovered must reach OOM quickly and memcg should kill tasks if a
memcg is near OOM.
To address this, only stall for the first zone in the zonelist, reduce
the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if
the scan control nr_reclaimed is 0, kswapd is still active and there
were excessive pages pending for writeback. If kswapd has stopped
reclaiming due to excessive failures, do not stall at all so that OOM
triggers relatively quickly. Similarly, if an LRU is simply congested,
only lightly throttle similar to NOPROGRESS.
Alexey's original case was the most straight forward
for i in {1..3}; do tail /dev/zero; done
On vanilla 5.16-rc1, this test stalled heavily, after the patch the test
completes in a few seconds similar to 5.15.
Alexey's second test case added watching a youtube video while tail runs
10 times. On 5.15, playback only jitters slightly, 5.16-rc1 stalls a
lot with lots of frames missing and numerous audio glitches. With this
patch applies, the video plays similarly to 5.15.
[lkp@intel.com: Fix W=1 build warning]
Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de
Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv
Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lore.kernel.org/r/20211202150614.22440-1-mgorman@techsingularity.net
Link: https://linux-regtracking.leemhuis.info/regzbot/regression/20211124011954.7cab9bb4@mail.inbox.lv/
Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv>
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Tracked-by: Thorsten Leemhuis <regressions@leemhuis.info>
Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-02 23:06:14 +08:00
|
|
|
#define _VMSCAN_THROTTLE_CONGESTED (1 << VMSCAN_THROTTLE_CONGESTED)
|
mm/vmscan: throttle reclaim until some writeback completes if congested
Patch series "Remove dependency on congestion_wait in mm/", v5.
This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but
congestion_wait has been broken for a long time [1].
Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).
This series replaces the "congestion" throttling with 3 different types.
- If there are too many dirty/writeback pages, sleep until a timeout or
enough pages get cleaned
- If too many pages are isolated, sleep until enough isolated pages are
either reclaimed or put back on the LRU
- If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.
This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.
stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.
- One "anon latency" worker creates small mappings with mmap() and
times how long it takes to fault the mapping reading it 4K at a time
- X file writers which is fio randomly writing X files where the total
size of the files add up to the allowed dirty_ratio. fio is allowed
to run for a warmup period to allow some file-backed pages to
accumulate. The duration of the warmup is based on the best-case
linear write speed of the storage.
- Y file readers which is fio randomly reading small files
- Z anon memory hogs which continually map (100-dirty_ratio)% of memory
- Total estimated WSS = (100+dirty_ration) percentage of memory
X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.
The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.
The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.
The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.
Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.
First the results of the "anon latency" latency
stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r4
Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%)
Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%)
Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%)
Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%)
Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%)
Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%)
Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%)
Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%)
Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%)
Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%)
Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%)
Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%)
Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%)
Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%)
Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%)
Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%)
Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%)
Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%)
Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%)
Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%)
Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%)
Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%)
Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%)
Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%)
For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.
It is expected that this will be very machine dependant. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory
Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%*
Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%*
Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%)
Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%)
Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%)
The overall system CPU usage and elapsed time is as follows
5.15.0-rc3 5.15.0-rc3
vanilla mm-reclaimcongest-v5r4
Duration User 6989.03 983.42
Duration System 7308.12 799.68
Duration Elapsed 2277.67 2092.98
The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.
The high-level /proc/vmstats show
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned 1056608451.00 503594991.00
Ops Kswapd pages scanned 109795048.00 147289810.00
Ops Kswapd pages reclaimed 63269243.00 31036005.00
Ops Direct pages reclaimed 10803973.00 6328887.00
Ops Kswapd efficiency % 57.62 21.07
Ops Kswapd velocity 48204.98 57572.86
Ops Direct efficiency % 1.02 1.26
Ops Direct velocity 463898.83 196845.97
Kswapd scanned less pages but the detailed pattern is different. The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity. Direct reclaim scanning is reduced by
52% due to stalling.
The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.
Ops Percentage direct scans 90.59 77.37
For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling
Ops Page writes by reclaim 2613590.00 1687131.00
Page writes from reclaim context are reduced.
Ops Page writes anon 2932752.00 1917048.00
And there is less swapping.
Ops Page reclaim immediate 996248528.00 107664764.00
The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.
Ops Slabs scanned 164284.00 153608.00
Slab scan activity is similar.
ftrace was used to gather stall activity
Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
The fast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).
1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.
Bottom line: Vanilla will throttle but it's not effective.
Patch series
------------
Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU
1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry. For direct
reclaim, the number of times stalled for each reason were
6624 reason=VMSCAN_THROTTLE_ISOLATED
93246 reason=VMSCAN_THROTTLE_NOPROGRESS
96934 reason=VMSCAN_THROTTLE_WRITEBACK
The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward. A relatively small number were due to too many pages isolated
from the LRU by parallel threads
For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
Most did not stall at all. A small number reached the timeout.
For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map
1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.
While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.
For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.
Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.
This patch (of 5):
Page reclaim throttles on wait_iff_congested under the following
conditions:
- kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.
- Direct reclaim will stall if all dirty pages are backed by congested
inodes.
wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleeps until the
timeout expires.
[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]
Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 04:42:25 +08:00
|
|
|
|
|
|
|
#define show_throttle_flags(flags) \
|
|
|
|
(flags) ? __print_flags(flags, "|", \
|
2021-11-06 04:42:29 +08:00
|
|
|
{_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
|
2021-11-06 04:42:32 +08:00
|
|
|
{_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \
|
mm: vmscan: Reduce throttling due to a failure to make progress
Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
problems due to reclaim throttling for excessive lengths of time. In
Alexey's case, a memory hog that should go OOM quickly stalls for
several minutes before stalling. In Mike and Darrick's cases, a small
memcg environment stalled excessively even though the system had enough
memory overall.
Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is
being made") introduced the problem although commit a19594ca4a8b
("mm/vmscan: increase the timeout if page reclaim is not making
progress") made it worse. Systems at or near an OOM state that cannot
be recovered must reach OOM quickly and memcg should kill tasks if a
memcg is near OOM.
To address this, only stall for the first zone in the zonelist, reduce
the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if
the scan control nr_reclaimed is 0, kswapd is still active and there
were excessive pages pending for writeback. If kswapd has stopped
reclaiming due to excessive failures, do not stall at all so that OOM
triggers relatively quickly. Similarly, if an LRU is simply congested,
only lightly throttle similar to NOPROGRESS.
Alexey's original case was the most straight forward
for i in {1..3}; do tail /dev/zero; done
On vanilla 5.16-rc1, this test stalled heavily, after the patch the test
completes in a few seconds similar to 5.15.
Alexey's second test case added watching a youtube video while tail runs
10 times. On 5.15, playback only jitters slightly, 5.16-rc1 stalls a
lot with lots of frames missing and numerous audio glitches. With this
patch applies, the video plays similarly to 5.15.
[lkp@intel.com: Fix W=1 build warning]
Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de
Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv
Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lore.kernel.org/r/20211202150614.22440-1-mgorman@techsingularity.net
Link: https://linux-regtracking.leemhuis.info/regzbot/regression/20211124011954.7cab9bb4@mail.inbox.lv/
Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv>
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Tracked-by: Thorsten Leemhuis <regressions@leemhuis.info>
Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-02 23:06:14 +08:00
|
|
|
{_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"}, \
|
|
|
|
{_VMSCAN_THROTTLE_CONGESTED, "VMSCAN_THROTTLE_CONGESTED"} \
|
mm/vmscan: throttle reclaim until some writeback completes if congested
Patch series "Remove dependency on congestion_wait in mm/", v5.
This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but
congestion_wait has been broken for a long time [1].
Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).
This series replaces the "congestion" throttling with 3 different types.
- If there are too many dirty/writeback pages, sleep until a timeout or
enough pages get cleaned
- If too many pages are isolated, sleep until enough isolated pages are
either reclaimed or put back on the LRU
- If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.
This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.
stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.
- One "anon latency" worker creates small mappings with mmap() and
times how long it takes to fault the mapping reading it 4K at a time
- X file writers which is fio randomly writing X files where the total
size of the files add up to the allowed dirty_ratio. fio is allowed
to run for a warmup period to allow some file-backed pages to
accumulate. The duration of the warmup is based on the best-case
linear write speed of the storage.
- Y file readers which is fio randomly reading small files
- Z anon memory hogs which continually map (100-dirty_ratio)% of memory
- Total estimated WSS = (100+dirty_ration) percentage of memory
X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.
The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.
The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.
The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.
Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.
First the results of the "anon latency" latency
stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r4
Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%)
Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%)
Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%)
Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%)
Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%)
Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%)
Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%)
Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%)
Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%)
Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%)
Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%)
Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%)
Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%)
Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%)
Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%)
Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%)
Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%)
Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%)
Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%)
Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%)
Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%)
Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%)
Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%)
Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%)
For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.
It is expected that this will be very machine dependant. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory
Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%*
Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%*
Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%)
Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%)
Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%)
The overall system CPU usage and elapsed time is as follows
5.15.0-rc3 5.15.0-rc3
vanilla mm-reclaimcongest-v5r4
Duration User 6989.03 983.42
Duration System 7308.12 799.68
Duration Elapsed 2277.67 2092.98
The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.
The high-level /proc/vmstats show
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned 1056608451.00 503594991.00
Ops Kswapd pages scanned 109795048.00 147289810.00
Ops Kswapd pages reclaimed 63269243.00 31036005.00
Ops Direct pages reclaimed 10803973.00 6328887.00
Ops Kswapd efficiency % 57.62 21.07
Ops Kswapd velocity 48204.98 57572.86
Ops Direct efficiency % 1.02 1.26
Ops Direct velocity 463898.83 196845.97
Kswapd scanned less pages but the detailed pattern is different. The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity. Direct reclaim scanning is reduced by
52% due to stalling.
The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.
Ops Percentage direct scans 90.59 77.37
For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling
Ops Page writes by reclaim 2613590.00 1687131.00
Page writes from reclaim context are reduced.
Ops Page writes anon 2932752.00 1917048.00
And there is less swapping.
Ops Page reclaim immediate 996248528.00 107664764.00
The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.
Ops Slabs scanned 164284.00 153608.00
Slab scan activity is similar.
ftrace was used to gather stall activity
Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
The fast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).
1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.
Bottom line: Vanilla will throttle but it's not effective.
Patch series
------------
Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU
1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry. For direct
reclaim, the number of times stalled for each reason were
6624 reason=VMSCAN_THROTTLE_ISOLATED
93246 reason=VMSCAN_THROTTLE_NOPROGRESS
96934 reason=VMSCAN_THROTTLE_WRITEBACK
The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward. A relatively small number were due to too many pages isolated
from the LRU by parallel threads
For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
Most did not stall at all. A small number reached the timeout.
For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map
1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.
While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.
For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.
Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.
This patch (of 5):
Page reclaim throttles on wait_iff_congested under the following
conditions:
- kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.
- Direct reclaim will stall if all dirty pages are backed by congested
inodes.
wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleeps until the
timeout expires.
[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]
Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 04:42:25 +08:00
|
|
|
) : "VMSCAN_THROTTLE_NONE"
|
|
|
|
|
|
|
|
|
2019-05-14 08:23:08 +08:00
|
|
|
#define trace_reclaim_flags(file) ( \
|
|
|
|
(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
|
2012-05-30 06:06:19 +08:00
|
|
|
(RECLAIM_WB_ASYNC) \
|
2010-08-10 08:19:18 +08:00
|
|
|
)
|
|
|
|
|
2010-08-10 08:19:16 +08:00
|
|
|
TRACE_EVENT(mm_vmscan_kswapd_sleep,
|
|
|
|
|
|
|
|
TP_PROTO(int nid),
|
|
|
|
|
|
|
|
TP_ARGS(nid),
|
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field( int, nid )
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->nid = nid;
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_printk("nid=%d", __entry->nid)
|
|
|
|
);
|
|
|
|
|
|
|
|
TRACE_EVENT(mm_vmscan_kswapd_wake,
|
|
|
|
|
2016-07-29 06:46:47 +08:00
|
|
|
TP_PROTO(int nid, int zid, int order),
|
2010-08-10 08:19:16 +08:00
|
|
|
|
2016-07-29 06:46:47 +08:00
|
|
|
TP_ARGS(nid, zid, order),
|
2010-08-10 08:19:16 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field( int, nid )
|
2016-07-29 06:46:47 +08:00
|
|
|
__field( int, zid )
|
2010-08-10 08:19:16 +08:00
|
|
|
__field( int, order )
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->nid = nid;
|
2016-07-29 06:46:47 +08:00
|
|
|
__entry->zid = zid;
|
2010-08-10 08:19:16 +08:00
|
|
|
__entry->order = order;
|
|
|
|
),
|
|
|
|
|
2019-05-14 08:16:34 +08:00
|
|
|
TP_printk("nid=%d order=%d",
|
|
|
|
__entry->nid,
|
|
|
|
__entry->order)
|
2010-08-10 08:19:16 +08:00
|
|
|
);
|
|
|
|
|
|
|
|
TRACE_EVENT(mm_vmscan_wakeup_kswapd,
|
|
|
|
|
2018-04-06 07:25:16 +08:00
|
|
|
TP_PROTO(int nid, int zid, int order, gfp_t gfp_flags),
|
2010-08-10 08:19:16 +08:00
|
|
|
|
2018-04-06 07:25:16 +08:00
|
|
|
TP_ARGS(nid, zid, order, gfp_flags),
|
2010-08-10 08:19:16 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
2018-04-06 07:25:16 +08:00
|
|
|
__field( int, nid )
|
|
|
|
__field( int, zid )
|
|
|
|
__field( int, order )
|
2022-05-13 11:23:08 +08:00
|
|
|
__field( unsigned long, gfp_flags )
|
2010-08-10 08:19:16 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->nid = nid;
|
|
|
|
__entry->zid = zid;
|
|
|
|
__entry->order = order;
|
2022-05-13 11:23:08 +08:00
|
|
|
__entry->gfp_flags = (__force unsigned long)gfp_flags;
|
2010-08-10 08:19:16 +08:00
|
|
|
),
|
|
|
|
|
2019-05-14 08:16:34 +08:00
|
|
|
TP_printk("nid=%d order=%d gfp_flags=%s",
|
2010-08-10 08:19:16 +08:00
|
|
|
__entry->nid,
|
2018-04-06 07:25:16 +08:00
|
|
|
__entry->order,
|
|
|
|
show_gfp_flags(__entry->gfp_flags))
|
2010-08-10 08:19:16 +08:00
|
|
|
);
|
|
|
|
|
2010-08-10 08:19:55 +08:00
|
|
|
DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
|
2010-08-10 08:19:16 +08:00
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_PROTO(int order, gfp_t gfp_flags),
|
2010-08-10 08:19:16 +08:00
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_ARGS(order, gfp_flags),
|
2010-08-10 08:19:16 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field( int, order )
|
2022-05-13 11:23:08 +08:00
|
|
|
__field( unsigned long, gfp_flags )
|
2010-08-10 08:19:16 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->order = order;
|
2022-05-13 11:23:08 +08:00
|
|
|
__entry->gfp_flags = (__force unsigned long)gfp_flags;
|
2010-08-10 08:19:16 +08:00
|
|
|
),
|
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_printk("order=%d gfp_flags=%s",
|
2010-08-10 08:19:16 +08:00
|
|
|
__entry->order,
|
2019-05-14 08:19:14 +08:00
|
|
|
show_gfp_flags(__entry->gfp_flags))
|
2010-08-10 08:19:16 +08:00
|
|
|
);
|
|
|
|
|
2010-08-10 08:19:55 +08:00
|
|
|
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_begin,
|
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_PROTO(int order, gfp_t gfp_flags),
|
2010-08-10 08:19:55 +08:00
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_ARGS(order, gfp_flags)
|
2010-08-10 08:19:55 +08:00
|
|
|
);
|
|
|
|
|
2017-10-13 06:46:32 +08:00
|
|
|
#ifdef CONFIG_MEMCG
|
2010-08-10 08:19:56 +08:00
|
|
|
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
|
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_PROTO(int order, gfp_t gfp_flags),
|
2010-08-10 08:19:56 +08:00
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_ARGS(order, gfp_flags)
|
2010-08-10 08:19:56 +08:00
|
|
|
);
|
|
|
|
|
|
|
|
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
|
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_PROTO(int order, gfp_t gfp_flags),
|
2010-08-10 08:19:56 +08:00
|
|
|
|
2019-05-14 08:19:14 +08:00
|
|
|
TP_ARGS(order, gfp_flags)
|
2010-08-10 08:19:56 +08:00
|
|
|
);
|
2017-10-13 06:46:32 +08:00
|
|
|
#endif /* CONFIG_MEMCG */
|
2010-08-10 08:19:55 +08:00
|
|
|
|
|
|
|
DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
|
2010-08-10 08:19:16 +08:00
|
|
|
|
|
|
|
TP_PROTO(unsigned long nr_reclaimed),
|
|
|
|
|
|
|
|
TP_ARGS(nr_reclaimed),
|
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field( unsigned long, nr_reclaimed )
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->nr_reclaimed = nr_reclaimed;
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
|
|
|
|
);
|
|
|
|
|
2010-08-10 08:19:55 +08:00
|
|
|
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_direct_reclaim_end,
|
|
|
|
|
|
|
|
TP_PROTO(unsigned long nr_reclaimed),
|
|
|
|
|
|
|
|
TP_ARGS(nr_reclaimed)
|
|
|
|
);
|
|
|
|
|
2017-10-13 06:46:32 +08:00
|
|
|
#ifdef CONFIG_MEMCG
|
2010-08-10 08:19:56 +08:00
|
|
|
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_reclaim_end,
|
|
|
|
|
|
|
|
TP_PROTO(unsigned long nr_reclaimed),
|
|
|
|
|
|
|
|
TP_ARGS(nr_reclaimed)
|
|
|
|
);
|
|
|
|
|
|
|
|
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_reclaim_end,
|
|
|
|
|
|
|
|
TP_PROTO(unsigned long nr_reclaimed),
|
|
|
|
|
|
|
|
TP_ARGS(nr_reclaimed)
|
|
|
|
);
|
2017-10-13 06:46:32 +08:00
|
|
|
#endif /* CONFIG_MEMCG */
|
2010-08-10 08:19:56 +08:00
|
|
|
|
2011-07-08 12:14:34 +08:00
|
|
|
TRACE_EVENT(mm_shrink_slab_start,
|
|
|
|
TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
|
mm: use sc->priority for slab shrink targets
Previously we were using the ratio of the number of lru pages scanned to
the number of eligible lru pages to determine the number of slab objects
to scan. The problem with this is that these two things have nothing to
do with each other, so in slab heavy work loads where there is little to
no page cache we can end up with the pages scanned being a very low
number. This means that we reclaim next to no slab pages and waste a
lot of time reclaiming small amounts of space.
Consider the following scenario, where we have the following values and
the rest of the memory usage is in slab
Active: 58840 kB
Inactive: 46860 kB
Every time we do a get_scan_count() we do this
scan = size >> sc->priority
where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
through reclaim would result in a scan target of 2 pages to 11715 total
inactive pages, and 3 pages to 14710 total active pages. This is a
really really small target for a system that is entirely slab pages.
And this is super optimistic, this assumes we even get to scan these
pages. We don't increment sc->nr_scanned unless we 1) isolate the page,
which assumes it's not in use, and 2) can lock the page. Under pressure
these numbers could probably go down, I'm sure there's some random pages
from daemons that aren't actually in use, so the targets get even
smaller.
Instead use sc->priority in the same way we use it to determine scan
amounts for the lru's. This generally equates to pages. Consider the
following
slab_pages = (nr_objects * object_size) / PAGE_SIZE
What we would like to do is
scan = slab_pages >> sc->priority
but we don't know the number of slab pages each shrinker controls, only
the objects. However say that theoretically we knew how many pages a
shrinker controlled, we'd still have to convert this to objects, which
would look like the following
scan = shrinker_pages >> sc->priority
scan_objects = (PAGE_SIZE / object_size) * scan
or written another way
scan_objects = (shrinker_pages >> sc->priority) *
(PAGE_SIZE / object_size)
which can thus be written
scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
sc->priority
which is just
scan_objects = nr_objects >> sc->priority
We don't need to know exactly how many pages each shrinker represents,
it's objects are all the information we need. Making this change allows
us to place an appropriate amount of pressure on the shrinker pools for
their relative size.
Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Dave Chinner <david@fromorbit.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 08:16:26 +08:00
|
|
|
long nr_objects_to_shrink, unsigned long cache_items,
|
|
|
|
unsigned long long delta, unsigned long total_scan,
|
|
|
|
int priority),
|
2011-07-08 12:14:34 +08:00
|
|
|
|
mm: use sc->priority for slab shrink targets
Previously we were using the ratio of the number of lru pages scanned to
the number of eligible lru pages to determine the number of slab objects
to scan. The problem with this is that these two things have nothing to
do with each other, so in slab heavy work loads where there is little to
no page cache we can end up with the pages scanned being a very low
number. This means that we reclaim next to no slab pages and waste a
lot of time reclaiming small amounts of space.
Consider the following scenario, where we have the following values and
the rest of the memory usage is in slab
Active: 58840 kB
Inactive: 46860 kB
Every time we do a get_scan_count() we do this
scan = size >> sc->priority
where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
through reclaim would result in a scan target of 2 pages to 11715 total
inactive pages, and 3 pages to 14710 total active pages. This is a
really really small target for a system that is entirely slab pages.
And this is super optimistic, this assumes we even get to scan these
pages. We don't increment sc->nr_scanned unless we 1) isolate the page,
which assumes it's not in use, and 2) can lock the page. Under pressure
these numbers could probably go down, I'm sure there's some random pages
from daemons that aren't actually in use, so the targets get even
smaller.
Instead use sc->priority in the same way we use it to determine scan
amounts for the lru's. This generally equates to pages. Consider the
following
slab_pages = (nr_objects * object_size) / PAGE_SIZE
What we would like to do is
scan = slab_pages >> sc->priority
but we don't know the number of slab pages each shrinker controls, only
the objects. However say that theoretically we knew how many pages a
shrinker controlled, we'd still have to convert this to objects, which
would look like the following
scan = shrinker_pages >> sc->priority
scan_objects = (PAGE_SIZE / object_size) * scan
or written another way
scan_objects = (shrinker_pages >> sc->priority) *
(PAGE_SIZE / object_size)
which can thus be written
scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
sc->priority
which is just
scan_objects = nr_objects >> sc->priority
We don't need to know exactly how many pages each shrinker represents,
it's objects are all the information we need. Making this change allows
us to place an appropriate amount of pressure on the shrinker pools for
their relative size.
Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Dave Chinner <david@fromorbit.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 08:16:26 +08:00
|
|
|
TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan,
|
|
|
|
priority),
|
2011-07-08 12:14:34 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field(struct shrinker *, shr)
|
|
|
|
__field(void *, shrink)
|
2014-06-05 07:08:07 +08:00
|
|
|
__field(int, nid)
|
2011-07-08 12:14:34 +08:00
|
|
|
__field(long, nr_objects_to_shrink)
|
2022-05-13 11:23:08 +08:00
|
|
|
__field(unsigned long, gfp_flags)
|
2011-07-08 12:14:34 +08:00
|
|
|
__field(unsigned long, cache_items)
|
|
|
|
__field(unsigned long long, delta)
|
|
|
|
__field(unsigned long, total_scan)
|
mm: use sc->priority for slab shrink targets
Previously we were using the ratio of the number of lru pages scanned to
the number of eligible lru pages to determine the number of slab objects
to scan. The problem with this is that these two things have nothing to
do with each other, so in slab heavy work loads where there is little to
no page cache we can end up with the pages scanned being a very low
number. This means that we reclaim next to no slab pages and waste a
lot of time reclaiming small amounts of space.
Consider the following scenario, where we have the following values and
the rest of the memory usage is in slab
Active: 58840 kB
Inactive: 46860 kB
Every time we do a get_scan_count() we do this
scan = size >> sc->priority
where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
through reclaim would result in a scan target of 2 pages to 11715 total
inactive pages, and 3 pages to 14710 total active pages. This is a
really really small target for a system that is entirely slab pages.
And this is super optimistic, this assumes we even get to scan these
pages. We don't increment sc->nr_scanned unless we 1) isolate the page,
which assumes it's not in use, and 2) can lock the page. Under pressure
these numbers could probably go down, I'm sure there's some random pages
from daemons that aren't actually in use, so the targets get even
smaller.
Instead use sc->priority in the same way we use it to determine scan
amounts for the lru's. This generally equates to pages. Consider the
following
slab_pages = (nr_objects * object_size) / PAGE_SIZE
What we would like to do is
scan = slab_pages >> sc->priority
but we don't know the number of slab pages each shrinker controls, only
the objects. However say that theoretically we knew how many pages a
shrinker controlled, we'd still have to convert this to objects, which
would look like the following
scan = shrinker_pages >> sc->priority
scan_objects = (PAGE_SIZE / object_size) * scan
or written another way
scan_objects = (shrinker_pages >> sc->priority) *
(PAGE_SIZE / object_size)
which can thus be written
scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
sc->priority
which is just
scan_objects = nr_objects >> sc->priority
We don't need to know exactly how many pages each shrinker represents,
it's objects are all the information we need. Making this change allows
us to place an appropriate amount of pressure on the shrinker pools for
their relative size.
Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Dave Chinner <david@fromorbit.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 08:16:26 +08:00
|
|
|
__field(int, priority)
|
2011-07-08 12:14:34 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->shr = shr;
|
2013-08-28 08:18:16 +08:00
|
|
|
__entry->shrink = shr->scan_objects;
|
2014-06-05 07:08:07 +08:00
|
|
|
__entry->nid = sc->nid;
|
2011-07-08 12:14:34 +08:00
|
|
|
__entry->nr_objects_to_shrink = nr_objects_to_shrink;
|
2022-05-13 11:23:08 +08:00
|
|
|
__entry->gfp_flags = (__force unsigned long)sc->gfp_mask;
|
2011-07-08 12:14:34 +08:00
|
|
|
__entry->cache_items = cache_items;
|
|
|
|
__entry->delta = delta;
|
|
|
|
__entry->total_scan = total_scan;
|
mm: use sc->priority for slab shrink targets
Previously we were using the ratio of the number of lru pages scanned to
the number of eligible lru pages to determine the number of slab objects
to scan. The problem with this is that these two things have nothing to
do with each other, so in slab heavy work loads where there is little to
no page cache we can end up with the pages scanned being a very low
number. This means that we reclaim next to no slab pages and waste a
lot of time reclaiming small amounts of space.
Consider the following scenario, where we have the following values and
the rest of the memory usage is in slab
Active: 58840 kB
Inactive: 46860 kB
Every time we do a get_scan_count() we do this
scan = size >> sc->priority
where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
through reclaim would result in a scan target of 2 pages to 11715 total
inactive pages, and 3 pages to 14710 total active pages. This is a
really really small target for a system that is entirely slab pages.
And this is super optimistic, this assumes we even get to scan these
pages. We don't increment sc->nr_scanned unless we 1) isolate the page,
which assumes it's not in use, and 2) can lock the page. Under pressure
these numbers could probably go down, I'm sure there's some random pages
from daemons that aren't actually in use, so the targets get even
smaller.
Instead use sc->priority in the same way we use it to determine scan
amounts for the lru's. This generally equates to pages. Consider the
following
slab_pages = (nr_objects * object_size) / PAGE_SIZE
What we would like to do is
scan = slab_pages >> sc->priority
but we don't know the number of slab pages each shrinker controls, only
the objects. However say that theoretically we knew how many pages a
shrinker controlled, we'd still have to convert this to objects, which
would look like the following
scan = shrinker_pages >> sc->priority
scan_objects = (PAGE_SIZE / object_size) * scan
or written another way
scan_objects = (shrinker_pages >> sc->priority) *
(PAGE_SIZE / object_size)
which can thus be written
scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
sc->priority
which is just
scan_objects = nr_objects >> sc->priority
We don't need to know exactly how many pages each shrinker represents,
it's objects are all the information we need. Making this change allows
us to place an appropriate amount of pressure on the shrinker pools for
their relative size.
Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Dave Chinner <david@fromorbit.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 08:16:26 +08:00
|
|
|
__entry->priority = priority;
|
2011-07-08 12:14:34 +08:00
|
|
|
),
|
|
|
|
|
2019-03-26 03:32:28 +08:00
|
|
|
TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
|
2011-07-08 12:14:34 +08:00
|
|
|
__entry->shrink,
|
|
|
|
__entry->shr,
|
2014-06-05 07:08:07 +08:00
|
|
|
__entry->nid,
|
2011-07-08 12:14:34 +08:00
|
|
|
__entry->nr_objects_to_shrink,
|
|
|
|
show_gfp_flags(__entry->gfp_flags),
|
|
|
|
__entry->cache_items,
|
|
|
|
__entry->delta,
|
mm: use sc->priority for slab shrink targets
Previously we were using the ratio of the number of lru pages scanned to
the number of eligible lru pages to determine the number of slab objects
to scan. The problem with this is that these two things have nothing to
do with each other, so in slab heavy work loads where there is little to
no page cache we can end up with the pages scanned being a very low
number. This means that we reclaim next to no slab pages and waste a
lot of time reclaiming small amounts of space.
Consider the following scenario, where we have the following values and
the rest of the memory usage is in slab
Active: 58840 kB
Inactive: 46860 kB
Every time we do a get_scan_count() we do this
scan = size >> sc->priority
where sc->priority starts at DEF_PRIORITY, which is 12. The first loop
through reclaim would result in a scan target of 2 pages to 11715 total
inactive pages, and 3 pages to 14710 total active pages. This is a
really really small target for a system that is entirely slab pages.
And this is super optimistic, this assumes we even get to scan these
pages. We don't increment sc->nr_scanned unless we 1) isolate the page,
which assumes it's not in use, and 2) can lock the page. Under pressure
these numbers could probably go down, I'm sure there's some random pages
from daemons that aren't actually in use, so the targets get even
smaller.
Instead use sc->priority in the same way we use it to determine scan
amounts for the lru's. This generally equates to pages. Consider the
following
slab_pages = (nr_objects * object_size) / PAGE_SIZE
What we would like to do is
scan = slab_pages >> sc->priority
but we don't know the number of slab pages each shrinker controls, only
the objects. However say that theoretically we knew how many pages a
shrinker controlled, we'd still have to convert this to objects, which
would look like the following
scan = shrinker_pages >> sc->priority
scan_objects = (PAGE_SIZE / object_size) * scan
or written another way
scan_objects = (shrinker_pages >> sc->priority) *
(PAGE_SIZE / object_size)
which can thus be written
scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
sc->priority
which is just
scan_objects = nr_objects >> sc->priority
We don't need to know exactly how many pages each shrinker represents,
it's objects are all the information we need. Making this change allows
us to place an appropriate amount of pressure on the shrinker pools for
their relative size.
Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Dave Chinner <david@fromorbit.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 08:16:26 +08:00
|
|
|
__entry->total_scan,
|
|
|
|
__entry->priority)
|
2011-07-08 12:14:34 +08:00
|
|
|
);
|
|
|
|
|
|
|
|
TRACE_EVENT(mm_shrink_slab_end,
|
2014-06-05 07:08:07 +08:00
|
|
|
TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval,
|
2014-06-05 07:08:06 +08:00
|
|
|
long unused_scan_cnt, long new_scan_cnt, long total_scan),
|
2011-07-08 12:14:34 +08:00
|
|
|
|
2014-06-05 07:08:07 +08:00
|
|
|
TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt,
|
|
|
|
total_scan),
|
2011-07-08 12:14:34 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field(struct shrinker *, shr)
|
2014-06-05 07:08:07 +08:00
|
|
|
__field(int, nid)
|
2011-07-08 12:14:34 +08:00
|
|
|
__field(void *, shrink)
|
|
|
|
__field(long, unused_scan)
|
|
|
|
__field(long, new_scan)
|
|
|
|
__field(int, retval)
|
|
|
|
__field(long, total_scan)
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->shr = shr;
|
2014-06-05 07:08:07 +08:00
|
|
|
__entry->nid = nid;
|
2013-08-28 08:18:16 +08:00
|
|
|
__entry->shrink = shr->scan_objects;
|
2011-07-08 12:14:34 +08:00
|
|
|
__entry->unused_scan = unused_scan_cnt;
|
|
|
|
__entry->new_scan = new_scan_cnt;
|
|
|
|
__entry->retval = shrinker_retval;
|
2014-06-05 07:08:06 +08:00
|
|
|
__entry->total_scan = total_scan;
|
2011-07-08 12:14:34 +08:00
|
|
|
),
|
|
|
|
|
2019-03-26 03:32:28 +08:00
|
|
|
TP_printk("%pS %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
|
2011-07-08 12:14:34 +08:00
|
|
|
__entry->shrink,
|
|
|
|
__entry->shr,
|
2014-06-05 07:08:07 +08:00
|
|
|
__entry->nid,
|
2011-07-08 12:14:34 +08:00
|
|
|
__entry->unused_scan,
|
|
|
|
__entry->new_scan,
|
|
|
|
__entry->total_scan,
|
|
|
|
__entry->retval)
|
|
|
|
);
|
2010-08-10 08:19:56 +08:00
|
|
|
|
2017-02-23 07:44:15 +08:00
|
|
|
TRACE_EVENT(mm_vmscan_lru_isolate,
|
2020-06-04 06:59:01 +08:00
|
|
|
TP_PROTO(int highest_zoneidx,
|
2016-07-29 06:46:47 +08:00
|
|
|
int order,
|
2010-08-10 08:19:17 +08:00
|
|
|
unsigned long nr_requested,
|
|
|
|
unsigned long nr_scanned,
|
2017-02-23 07:44:21 +08:00
|
|
|
unsigned long nr_skipped,
|
2010-08-10 08:19:17 +08:00
|
|
|
unsigned long nr_taken,
|
2012-01-13 09:19:20 +08:00
|
|
|
isolate_mode_t isolate_mode,
|
2017-02-23 07:44:24 +08:00
|
|
|
int lru),
|
2010-08-10 08:19:17 +08:00
|
|
|
|
2020-06-04 06:59:01 +08:00
|
|
|
TP_ARGS(highest_zoneidx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, lru),
|
2010-08-10 08:19:17 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
2020-06-04 06:59:01 +08:00
|
|
|
__field(int, highest_zoneidx)
|
2010-08-10 08:19:17 +08:00
|
|
|
__field(int, order)
|
|
|
|
__field(unsigned long, nr_requested)
|
|
|
|
__field(unsigned long, nr_scanned)
|
2017-02-23 07:44:21 +08:00
|
|
|
__field(unsigned long, nr_skipped)
|
2010-08-10 08:19:17 +08:00
|
|
|
__field(unsigned long, nr_taken)
|
2022-05-11 17:46:53 +08:00
|
|
|
__field(unsigned int, isolate_mode)
|
2017-02-23 07:44:24 +08:00
|
|
|
__field(int, lru)
|
2010-08-10 08:19:17 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
2020-06-04 06:59:01 +08:00
|
|
|
__entry->highest_zoneidx = highest_zoneidx;
|
2010-08-10 08:19:17 +08:00
|
|
|
__entry->order = order;
|
|
|
|
__entry->nr_requested = nr_requested;
|
|
|
|
__entry->nr_scanned = nr_scanned;
|
2017-02-23 07:44:21 +08:00
|
|
|
__entry->nr_skipped = nr_skipped;
|
2010-08-10 08:19:17 +08:00
|
|
|
__entry->nr_taken = nr_taken;
|
2022-05-11 17:46:53 +08:00
|
|
|
__entry->isolate_mode = (__force unsigned int)isolate_mode;
|
2017-02-23 07:44:24 +08:00
|
|
|
__entry->lru = lru;
|
2010-08-10 08:19:17 +08:00
|
|
|
),
|
|
|
|
|
2020-06-04 06:59:01 +08:00
|
|
|
/*
|
|
|
|
* classzone is previous name of the highest_zoneidx.
|
|
|
|
* Reason not to change it is the ABI requirement of the tracepoint.
|
|
|
|
*/
|
2017-02-23 07:44:24 +08:00
|
|
|
TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu lru=%s",
|
2010-08-10 08:19:17 +08:00
|
|
|
__entry->isolate_mode,
|
2020-06-04 06:59:01 +08:00
|
|
|
__entry->highest_zoneidx,
|
2010-08-10 08:19:17 +08:00
|
|
|
__entry->order,
|
|
|
|
__entry->nr_requested,
|
|
|
|
__entry->nr_scanned,
|
2017-02-23 07:44:21 +08:00
|
|
|
__entry->nr_skipped,
|
2010-08-10 08:19:17 +08:00
|
|
|
__entry->nr_taken,
|
2017-02-23 07:44:24 +08:00
|
|
|
__print_symbolic(__entry->lru, LRU_NAMES))
|
2010-08-10 08:19:17 +08:00
|
|
|
);
|
|
|
|
|
2022-01-18 12:35:57 +08:00
|
|
|
TRACE_EVENT(mm_vmscan_write_folio,
|
2010-08-10 08:19:18 +08:00
|
|
|
|
2022-01-18 12:35:57 +08:00
|
|
|
TP_PROTO(struct folio *folio),
|
2010-08-10 08:19:18 +08:00
|
|
|
|
2022-01-18 12:35:57 +08:00
|
|
|
TP_ARGS(folio),
|
2010-08-10 08:19:18 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
2015-04-06 13:36:09 +08:00
|
|
|
__field(unsigned long, pfn)
|
2010-08-10 08:19:18 +08:00
|
|
|
__field(int, reclaim_flags)
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
2022-01-18 12:35:57 +08:00
|
|
|
__entry->pfn = folio_pfn(folio);
|
2019-05-14 08:23:08 +08:00
|
|
|
__entry->reclaim_flags = trace_reclaim_flags(
|
2022-01-18 12:35:57 +08:00
|
|
|
folio_is_file_lru(folio));
|
2010-08-10 08:19:18 +08:00
|
|
|
),
|
|
|
|
|
2021-06-29 10:40:08 +08:00
|
|
|
TP_printk("page=%p pfn=0x%lx flags=%s",
|
2015-04-06 13:36:09 +08:00
|
|
|
pfn_to_page(__entry->pfn),
|
|
|
|
__entry->pfn,
|
2010-08-10 08:19:18 +08:00
|
|
|
show_reclaim_flags(__entry->reclaim_flags))
|
|
|
|
);
|
|
|
|
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
|
|
|
|
|
2016-07-29 06:45:31 +08:00
|
|
|
TP_PROTO(int nid,
|
2016-01-15 07:18:48 +08:00
|
|
|
unsigned long nr_scanned, unsigned long nr_reclaimed,
|
mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).
Before this patch, the code to call the trace event is this:
0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10
After the patch:
0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax
Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 07:28:07 +08:00
|
|
|
struct reclaim_stat *stat, int priority, int file),
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
|
mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).
Before this patch, the code to call the trace event is this:
0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10
After the patch:
0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax
Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 07:28:07 +08:00
|
|
|
TP_ARGS(nid, nr_scanned, nr_reclaimed, stat, priority, file),
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field(int, nid)
|
|
|
|
__field(unsigned long, nr_scanned)
|
|
|
|
__field(unsigned long, nr_reclaimed)
|
2017-02-23 07:44:30 +08:00
|
|
|
__field(unsigned long, nr_dirty)
|
|
|
|
__field(unsigned long, nr_writeback)
|
|
|
|
__field(unsigned long, nr_congested)
|
|
|
|
__field(unsigned long, nr_immediate)
|
2019-05-14 08:16:51 +08:00
|
|
|
__field(unsigned int, nr_activate0)
|
|
|
|
__field(unsigned int, nr_activate1)
|
2017-02-23 07:44:30 +08:00
|
|
|
__field(unsigned long, nr_ref_keep)
|
|
|
|
__field(unsigned long, nr_unmap_fail)
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
__field(int, priority)
|
|
|
|
__field(int, reclaim_flags)
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
2016-07-29 06:45:31 +08:00
|
|
|
__entry->nid = nid;
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
__entry->nr_scanned = nr_scanned;
|
|
|
|
__entry->nr_reclaimed = nr_reclaimed;
|
mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).
Before this patch, the code to call the trace event is this:
0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10
After the patch:
0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax
Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 07:28:07 +08:00
|
|
|
__entry->nr_dirty = stat->nr_dirty;
|
|
|
|
__entry->nr_writeback = stat->nr_writeback;
|
|
|
|
__entry->nr_congested = stat->nr_congested;
|
|
|
|
__entry->nr_immediate = stat->nr_immediate;
|
2019-05-14 08:16:51 +08:00
|
|
|
__entry->nr_activate0 = stat->nr_activate[0];
|
|
|
|
__entry->nr_activate1 = stat->nr_activate[1];
|
mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).
Before this patch, the code to call the trace event is this:
0f 83 aa fe ff ff jae ffffffff811e6261 <shrink_inactive_list+0x1e1>
48 8b 45 a0 mov -0x60(%rbp),%rax
45 8b 64 24 20 mov 0x20(%r12),%r12d
44 8b 6d d4 mov -0x2c(%rbp),%r13d
8b 4d d0 mov -0x30(%rbp),%ecx
44 8b 75 cc mov -0x34(%rbp),%r14d
44 8b 7d c8 mov -0x38(%rbp),%r15d
48 89 45 90 mov %rax,-0x70(%rbp)
8b 83 b8 fe ff ff mov -0x148(%rbx),%eax
8b 55 c0 mov -0x40(%rbp),%edx
8b 7d c4 mov -0x3c(%rbp),%edi
8b 75 b8 mov -0x48(%rbp),%esi
89 45 80 mov %eax,-0x80(%rbp)
65 ff 05 e4 f7 e2 7e incl %gs:0x7ee2f7e4(%rip) # 15bd0 <__preempt_count>
48 8b 05 75 5b 13 01 mov 0x1135b75(%rip),%rax # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
48 85 c0 test %rax,%rax
74 72 je ffffffff811e646a <shrink_inactive_list+0x3ea>
48 89 c3 mov %rax,%rbx
4c 8b 10 mov (%rax),%r10
89 f8 mov %edi,%eax
48 89 85 68 ff ff ff mov %rax,-0x98(%rbp)
89 f0 mov %esi,%eax
48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
89 c8 mov %ecx,%eax
48 89 85 78 ff ff ff mov %rax,-0x88(%rbp)
89 d0 mov %edx,%eax
48 89 85 70 ff ff ff mov %rax,-0x90(%rbp)
8b 45 8c mov -0x74(%rbp),%eax
48 8b 7b 08 mov 0x8(%rbx),%rdi
48 83 c3 18 add $0x18,%rbx
50 push %rax
41 54 push %r12
41 55 push %r13
ff b5 78 ff ff ff pushq -0x88(%rbp)
41 56 push %r14
41 57 push %r15
ff b5 70 ff ff ff pushq -0x90(%rbp)
4c 8b 8d 68 ff ff ff mov -0x98(%rbp),%r9
4c 8b 85 60 ff ff ff mov -0xa0(%rbp),%r8
48 8b 4d 98 mov -0x68(%rbp),%rcx
48 8b 55 90 mov -0x70(%rbp),%rdx
8b 75 80 mov -0x80(%rbp),%esi
41 ff d2 callq *%r10
After the patch:
0f 83 a8 fe ff ff jae ffffffff811e626d <shrink_inactive_list+0x1cd>
8b 9b b8 fe ff ff mov -0x148(%rbx),%ebx
45 8b 64 24 20 mov 0x20(%r12),%r12d
4c 8b 6d a0 mov -0x60(%rbp),%r13
65 ff 05 f5 f7 e2 7e incl %gs:0x7ee2f7f5(%rip) # 15bd0 <__preempt_count>
4c 8b 35 86 5b 13 01 mov 0x1135b86(%rip),%r14 # ffffffff8231bf68 <__tracepoint_mm_vmscan_lru_shrink_inactive+0x28>
4d 85 f6 test %r14,%r14
74 2a je ffffffff811e6411 <shrink_inactive_list+0x371>
49 8b 06 mov (%r14),%rax
8b 4d 8c mov -0x74(%rbp),%ecx
49 8b 7e 08 mov 0x8(%r14),%rdi
49 83 c6 18 add $0x18,%r14
4c 89 ea mov %r13,%rdx
45 89 e1 mov %r12d,%r9d
4c 8d 45 b8 lea -0x48(%rbp),%r8
89 de mov %ebx,%esi
51 push %rcx
48 8b 4d 98 mov -0x68(%rbp),%rcx
ff d0 callq *%rax
Link: http://lkml.kernel.org/r/2559d7cb-ec60-1200-2362-04fa34fd02bb@fb.com
Link: http://lkml.kernel.org/r/20180322121003.4177af15@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reported-by: Alexei Starovoitov <ast@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 07:28:07 +08:00
|
|
|
__entry->nr_ref_keep = stat->nr_ref_keep;
|
|
|
|
__entry->nr_unmap_fail = stat->nr_unmap_fail;
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
__entry->priority = priority;
|
2019-05-14 08:23:08 +08:00
|
|
|
__entry->reclaim_flags = trace_reclaim_flags(file);
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
),
|
|
|
|
|
2019-05-14 08:16:51 +08:00
|
|
|
TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate_anon=%d nr_activate_file=%d nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d flags=%s",
|
2016-07-29 06:45:31 +08:00
|
|
|
__entry->nid,
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
__entry->nr_scanned, __entry->nr_reclaimed,
|
2017-02-23 07:44:30 +08:00
|
|
|
__entry->nr_dirty, __entry->nr_writeback,
|
|
|
|
__entry->nr_congested, __entry->nr_immediate,
|
2019-05-14 08:16:51 +08:00
|
|
|
__entry->nr_activate0, __entry->nr_activate1,
|
|
|
|
__entry->nr_ref_keep, __entry->nr_unmap_fail,
|
|
|
|
__entry->priority,
|
tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 05:21:40 +08:00
|
|
|
show_reclaim_flags(__entry->reclaim_flags))
|
|
|
|
);
|
|
|
|
|
2017-02-23 07:44:18 +08:00
|
|
|
TRACE_EVENT(mm_vmscan_lru_shrink_active,
|
|
|
|
|
|
|
|
TP_PROTO(int nid, unsigned long nr_taken,
|
|
|
|
unsigned long nr_active, unsigned long nr_deactivated,
|
|
|
|
unsigned long nr_referenced, int priority, int file),
|
|
|
|
|
|
|
|
TP_ARGS(nid, nr_taken, nr_active, nr_deactivated, nr_referenced, priority, file),
|
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field(int, nid)
|
|
|
|
__field(unsigned long, nr_taken)
|
|
|
|
__field(unsigned long, nr_active)
|
|
|
|
__field(unsigned long, nr_deactivated)
|
|
|
|
__field(unsigned long, nr_referenced)
|
|
|
|
__field(int, priority)
|
|
|
|
__field(int, reclaim_flags)
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->nid = nid;
|
|
|
|
__entry->nr_taken = nr_taken;
|
|
|
|
__entry->nr_active = nr_active;
|
|
|
|
__entry->nr_deactivated = nr_deactivated;
|
|
|
|
__entry->nr_referenced = nr_referenced;
|
|
|
|
__entry->priority = priority;
|
2019-05-14 08:23:08 +08:00
|
|
|
__entry->reclaim_flags = trace_reclaim_flags(file);
|
2017-02-23 07:44:18 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_printk("nid=%d nr_taken=%ld nr_active=%ld nr_deactivated=%ld nr_referenced=%ld priority=%d flags=%s",
|
|
|
|
__entry->nid,
|
|
|
|
__entry->nr_taken,
|
|
|
|
__entry->nr_active, __entry->nr_deactivated, __entry->nr_referenced,
|
|
|
|
__entry->priority,
|
|
|
|
show_reclaim_flags(__entry->reclaim_flags))
|
|
|
|
);
|
|
|
|
|
2019-05-14 08:17:53 +08:00
|
|
|
TRACE_EVENT(mm_vmscan_node_reclaim_begin,
|
|
|
|
|
|
|
|
TP_PROTO(int nid, int order, gfp_t gfp_flags),
|
|
|
|
|
|
|
|
TP_ARGS(nid, order, gfp_flags),
|
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field(int, nid)
|
|
|
|
__field(int, order)
|
2022-05-13 11:23:08 +08:00
|
|
|
__field(unsigned long, gfp_flags)
|
2019-05-14 08:17:53 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->nid = nid;
|
|
|
|
__entry->order = order;
|
2022-05-13 11:23:08 +08:00
|
|
|
__entry->gfp_flags = (__force unsigned long)gfp_flags;
|
2019-05-14 08:17:53 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_printk("nid=%d order=%d gfp_flags=%s",
|
|
|
|
__entry->nid,
|
|
|
|
__entry->order,
|
|
|
|
show_gfp_flags(__entry->gfp_flags))
|
|
|
|
);
|
|
|
|
|
|
|
|
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end,
|
|
|
|
|
|
|
|
TP_PROTO(unsigned long nr_reclaimed),
|
|
|
|
|
|
|
|
TP_ARGS(nr_reclaimed)
|
|
|
|
);
|
|
|
|
|
mm/vmscan: throttle reclaim until some writeback completes if congested
Patch series "Remove dependency on congestion_wait in mm/", v5.
This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested. It's not a clever implementation but
congestion_wait has been broken for a long time [1].
Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).
This series replaces the "congestion" throttling with 3 different types.
- If there are too many dirty/writeback pages, sleep until a timeout or
enough pages get cleaned
- If too many pages are isolated, sleep until enough isolated pages are
either reclaimed or put back on the LRU
- If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.
This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.
stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.
- One "anon latency" worker creates small mappings with mmap() and
times how long it takes to fault the mapping reading it 4K at a time
- X file writers which is fio randomly writing X files where the total
size of the files add up to the allowed dirty_ratio. fio is allowed
to run for a warmup period to allow some file-backed pages to
accumulate. The duration of the warmup is based on the best-case
linear write speed of the storage.
- Y file readers which is fio randomly reading small files
- Z anon memory hogs which continually map (100-dirty_ratio)% of memory
- Total estimated WSS = (100+dirty_ration) percentage of memory
X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.
The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero
readers.
The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.
The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.
Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.
First the results of the "anon latency" latency
stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r4
Amean mmap-4 31.4003 ( 0.00%) 2661.0198 (-8374.52%)
Amean mmap-7 38.1641 ( 0.00%) 149.2891 (-291.18%)
Amean mmap-12 60.0981 ( 0.00%) 187.8105 (-212.51%)
Amean mmap-21 161.2699 ( 0.00%) 213.9107 ( -32.64%)
Amean mmap-30 174.5589 ( 0.00%) 377.7548 (-116.41%)
Amean mmap-48 8106.8160 ( 0.00%) 1070.5616 ( 86.79%)
Stddev mmap-4 41.3455 ( 0.00%) 27573.9676 (-66591.66%)
Stddev mmap-7 53.5556 ( 0.00%) 4608.5860 (-8505.23%)
Stddev mmap-12 171.3897 ( 0.00%) 5559.4542 (-3143.75%)
Stddev mmap-21 1506.6752 ( 0.00%) 5746.2507 (-281.39%)
Stddev mmap-30 557.5806 ( 0.00%) 7678.1624 (-1277.05%)
Stddev mmap-48 61681.5718 ( 0.00%) 14507.2830 ( 76.48%)
Max-90 mmap-4 31.4243 ( 0.00%) 83.1457 (-164.59%)
Max-90 mmap-7 41.0410 ( 0.00%) 41.0720 ( -0.08%)
Max-90 mmap-12 66.5255 ( 0.00%) 53.9073 ( 18.97%)
Max-90 mmap-21 146.7479 ( 0.00%) 105.9540 ( 27.80%)
Max-90 mmap-30 193.9513 ( 0.00%) 64.3067 ( 66.84%)
Max-90 mmap-48 277.9137 ( 0.00%) 591.0594 (-112.68%)
Max mmap-4 1913.8009 ( 0.00%) 299623.9695 (-15555.96%)
Max mmap-7 2423.9665 ( 0.00%) 204453.1708 (-8334.65%)
Max mmap-12 6845.6573 ( 0.00%) 221090.3366 (-3129.64%)
Max mmap-21 56278.6508 ( 0.00%) 213877.3496 (-280.03%)
Max mmap-30 19716.2990 ( 0.00%) 216287.6229 (-997.00%)
Max mmap-48 477923.9400 ( 0.00%) 245414.8238 ( 48.65%)
For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers. Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.
It is expected that this will be very machine dependant. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For example,
these are the latencies on a single-socket machine that had more memory
Amean mmap-4 42.2287 ( 0.00%) 49.6838 * -17.65%*
Amean mmap-7 216.4326 ( 0.00%) 47.4451 * 78.08%*
Amean mmap-12 2412.0588 ( 0.00%) 51.7497 ( 97.85%)
Amean mmap-21 5546.2548 ( 0.00%) 51.8862 ( 99.06%)
Amean mmap-30 1085.3121 ( 0.00%) 72.1004 ( 93.36%)
The overall system CPU usage and elapsed time is as follows
5.15.0-rc3 5.15.0-rc3
vanilla mm-reclaimcongest-v5r4
Duration User 6989.03 983.42
Duration System 7308.12 799.68
Duration Elapsed 2277.67 2092.98
The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.
The high-level /proc/vmstats show
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned 1056608451.00 503594991.00
Ops Kswapd pages scanned 109795048.00 147289810.00
Ops Kswapd pages reclaimed 63269243.00 31036005.00
Ops Direct pages reclaimed 10803973.00 6328887.00
Ops Kswapd efficiency % 57.62 21.07
Ops Kswapd velocity 48204.98 57572.86
Ops Direct efficiency % 1.02 1.26
Ops Direct velocity 463898.83 196845.97
Kswapd scanned less pages but the detailed pattern is different. The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity. Direct reclaim scanning is reduced by
52% due to stalling.
The pattern for stealing pages is also slightly different. Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes. The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.
Ops Percentage direct scans 90.59 77.37
For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling
Ops Page writes by reclaim 2613590.00 1687131.00
Page writes from reclaim context are reduced.
Ops Page writes anon 2932752.00 1917048.00
And there is less swapping.
Ops Page reclaim immediate 996248528.00 107664764.00
The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.
Ops Slabs scanned 164284.00 153608.00
Slab scan activity is similar.
ftrace was used to gather stall activity
Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
The fast majority of wait_iff_congested calls do not stall at all. What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).
1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.
Bottom line: Vanilla will throttle but it's not effective.
Patch series
------------
Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU
1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry. For direct
reclaim, the number of times stalled for each reason were
6624 reason=VMSCAN_THROTTLE_ISOLATED
93246 reason=VMSCAN_THROTTLE_NOPROGRESS
96934 reason=VMSCAN_THROTTLE_WRITEBACK
The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward. A relatively small number were due to too many pages isolated
from the LRU by parallel threads
For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
Most did not stall at all. A small number reached the timeout.
For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map
1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
The full timeout is often hit but a large number also do not stall at
all. The remainder slept a little allowing other reclaim tasks to make
progress.
While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.
For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all. This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.
Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls. There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage. A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.
This patch (of 5):
Page reclaim throttles on wait_iff_congested under the following
conditions:
- kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.
- Direct reclaim will stall if all dirty pages are backed by congested
inodes.
wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started. If
enough pages belonging to the node are written back then the throttled
tasks will wake early. If not, the throttled tasks sleeps until the
timeout expires.
[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]
Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 04:42:25 +08:00
|
|
|
TRACE_EVENT(mm_vmscan_throttled,
|
|
|
|
|
|
|
|
TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason),
|
|
|
|
|
|
|
|
TP_ARGS(nid, usec_timeout, usec_delayed, reason),
|
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field(int, nid)
|
|
|
|
__field(int, usec_timeout)
|
|
|
|
__field(int, usec_delayed)
|
|
|
|
__field(int, reason)
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
__entry->nid = nid;
|
|
|
|
__entry->usec_timeout = usec_timeout;
|
|
|
|
__entry->usec_delayed = usec_delayed;
|
|
|
|
__entry->reason = 1U << reason;
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_printk("nid=%d usec_timeout=%d usect_delayed=%d reason=%s",
|
|
|
|
__entry->nid,
|
|
|
|
__entry->usec_timeout,
|
|
|
|
__entry->usec_delayed,
|
|
|
|
show_throttle_flags(__entry->reason))
|
|
|
|
);
|
2010-08-10 08:19:16 +08:00
|
|
|
#endif /* _TRACE_VMSCAN_H */
|
|
|
|
|
|
|
|
/* This part must be outside protection */
|
|
|
|
#include <trace/define_trace.h>
|