This makes DAMON apply schemes to regions having higher priority first,
if it cannot apply schemes to all regions due to the quotas.
The prioritization function should be implemented in the monitoring
primitives. Those would commonly calculate the priority of the region
using attributes of regions, namely 'size', 'nr_accesses', and 'age'.
For example, some primitive would calculate the priority of each region
using a weighted sum of 'nr_accesses' and 'age' of the region.
The optimal weights would depend on give environments, so this makes
those customizable. Nevertheless, the score calculation functions are
only encouraged to respect the weights, not mandated.
Link: https://lkml.kernel.org/r/20211019150731.16699-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The size quota feature of DAMOS is useful for IO resource-critical
systems, but not so intuitive for CPU time-critical systems. Systems
using zram or zswap-like swap device would be examples.
To provide another intuitive ways for such systems, this implements
time-based quota for DAMON-based Operation Schemes. If the quota is
set, DAMOS tries to use only up to the user-defined quota of CPU time
within a given time window.
Link: https://lkml.kernel.org/r/20211019150731.16699-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If DAMOS has stopped applying action in the middle of a group of memory
regions due to its size quota, it starts the work again from the
beginning of the address space in the next charge window. If there is a
huge memory region at the beginning of the address space and it fulfills
the scheme's target data access pattern always, the action will applied
to only the region.
This mitigates the case by skipping memory regions that charged in
current charge window at the beginning of next charge window.
Link: https://lkml.kernel.org/r/20211019150731.16699-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There could be arbitrarily large memory regions fulfilling the target
data access pattern of a DAMON-based operation scheme. In the case,
applying the action of the scheme could incur too high overhead. To
provide an intuitive way for avoiding it, this implements a feature
called size quota. If the quota is set, DAMON tries to apply the action
only up to the given amount of memory regions within a given time
window.
Link: https://lkml.kernel.org/r/20211019150731.16699-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduction
============
This patchset 1) makes the engine for general data access
pattern-oriented memory management (DAMOS) be more useful for production
environments, and 2) implements a static kernel module for lightweight
proactive reclamation using the engine.
Proactive Reclamation
---------------------
On general memory over-committed systems, proactively reclaiming cold
pages helps saving memory and reducing latency spikes that incurred by
the direct reclaim or the CPU consumption of kswapd, while incurring
only minimal performance degradation[2].
A Free Pages Reporting[8] based memory over-commit virtualization system
would be one more specific use case. In the system, the guest VMs
reports their free memory to host, and the host reallocates the reported
memory to other guests. As a result, the system's memory utilization
can be maximized. However, the guests could be not so memory-frugal,
because some kernel subsystems and user-space applications are designed
to use as much memory as available. Then, guests would report only
small amount of free memory to host, results in poor memory utilization.
Running the proactive reclamation in such guests could help mitigating
this problem.
Google has also implemented this idea and using it in their data center.
They further proposed upstreaming it in LSFMM'19, and "the general
consensus was that, while this sort of proactive reclaim would be useful
for a number of users, the cost of this particular solution was too high
to consider merging it upstream"[3]. The cost mainly comes from the
coldness tracking. Roughly speaking, the implementation periodically
scans the 'Accessed' bit of each page. For the reason, the overhead
linearly increases as the size of the memory and the scanning frequency
grows. As a result, Google is known to dedicating one CPU for the work.
That's a reasonable option to someone like Google, but it wouldn't be so
to some others.
DAMON and DAMOS: An engine for data access pattern-oriented memory management
-----------------------------------------------------------------------------
DAMON[4] is a framework for general data access monitoring. Its
adaptive monitoring overhead control feature minimizes its monitoring
overhead. It also let the upper-bound of the overhead be configurable
by clients, regardless of the size of the monitoring target memory.
While monitoring 70 GiB memory of a production system every 5
milliseconds, it consumes less than 1% single CPU time. For this, it
could sacrify some of the quality of the monitoring results.
Nevertheless, the lower-bound of the quality is configurable, and it
uses a best-effort algorithm for better quality. Our test results[5]
show the quality is practical enough. From the production system
monitoring, we were able to find a 4 KiB region in the 70 GiB memory
that shows highest access frequency.
We normally don't monitor the data access pattern just for fun but to
improve something like memory management. Proactive reclamation is one
such usage. For such general cases, DAMON provides a feature called
DAMon-based Operation Schemes (DAMOS)[6]. It makes DAMON an engine for
general data access pattern oriented memory management. Using this,
clients can ask DAMON to find memory regions of specific data access
pattern and apply some memory management action (e.g., page out, move to
head of the LRU list, use huge page, ...). We call the request
'scheme'.
Proactive Reclamation on top of DAMON/DAMOS
-------------------------------------------
Therefore, by using DAMON for the cold pages detection, the proactive
reclamation's monitoring overhead issue can be solved. Actually, we
previously implemented a version of proactive reclamation using DAMOS
and achieved noticeable improvements with our evaluation setup[5].
Nevertheless, it more for a proof-of-concept, rather than production
uses. It supports only virtual address spaces of processes, and require
additional tuning efforts for given workloads and the hardware. For the
tuning, we introduced a simple auto-tuning user space tool[8]. Google
is also known to using a ML-based similar approach for their fleets[2].
But, making it just works with intuitive knobs in the kernel would be
helpful for general users.
To this end, this patchset improves DAMOS to be ready for such
production usages, and implements another version of the proactive
reclamation, namely DAMON_RECLAIM, on top of it.
DAMOS Improvements: Aggressiveness Control, Prioritization, and Watermarks
--------------------------------------------------------------------------
First of all, the current version of DAMOS supports only virtual address
spaces. This patchset makes it supports the physical address space for
the page out action.
Next major problem of the current version of DAMOS is the lack of the
aggressiveness control, which can results in arbitrary overhead. For
example, if huge memory regions having the data access pattern of
interest are found, applying the requested action to all of the regions
could incur significant overhead. It can be controlled by tuning the
target data access pattern with manual or automated approaches[2,7].
But, some people would prefer the kernel to just work with only
intuitive tuning or default values.
For such cases, this patchset implements a safeguard, namely time/size
quota. Using this, the clients can specify up to how much time can be
used for applying the action, and/or up to how much memory regions the
action can be applied within a user-specified time duration. A followup
question is, to which memory regions should the action applied within
the limits? We implement a simple regions prioritization mechanism for
each action and make DAMOS to apply the action to high priority regions
first. It also allows clients tune the prioritization mechanism to use
different weights for size, access frequency, and age of memory regions.
This means we could use not only LRU but also LFU or some fancy
algorithms like CAR[9] with lightweight overhead.
Though DAMON is lightweight, someone would want to remove even the cold
pages monitoring overhead when it is unnecessary. Currently, it should
manually turned on and off by clients, but some clients would simply
want to turn it on and off based on some metrics like free memory ratio
or memory fragmentation. For such cases, this patchset implements a
watermarks-based automatic activation feature. It allows the clients
configure the metric of their interest, and three watermarks of the
metric. If the metric is higher than the high watermark or lower than
the low watermark, the scheme is deactivated. If the metric is lower
than the mid watermark but higher than the low watermark, the scheme is
activated.
DAMON-based Reclaim
-------------------
Using the improved version of DAMOS, this patchset implements a static
kernel module called 'damon_reclaim'. It finds memory regions that
didn't accessed for specific time duration and page out. Consuming too
much CPU for the paging out operations, or doing pageout too frequently
can be critical for systems configuring their swap devices with
software-defined in-memory block devices like zram/zswap or total number
of writes limited devices like SSDs, respectively. To avoid the
problems, the time/size quotas can be configured. Under the quotas, it
pages out memory regions that didn't accessed longer first. Also, to
remove the monitoring overhead under peaceful situation, and to fall
back to the LRU-list based page granularity reclamation when it doesn't
make progress, the three watermarks based activation mechanism is used,
with the free memory ratio as the watermark metric.
For convenient configurations, it provides several module parameters.
Using these, sysadmins can enable/disable it, and tune its parameters
including the coldness identification time threshold, the time/size
quotas and the three watermarks.
Evaluation
==========
In short, DAMON_RECLAIM with 50ms/s time quota and regions
prioritization on v5.15-rc5 Linux kernel with ZRAM swap device achieves
38.58% memory saving with only 1.94% runtime overhead. For this,
DAMON_RECLAIM consumes only 4.97% of single CPU time.
Setup
-----
We evaluate DAMON_RECLAIM to show how each of the DAMOS improvements
make effect. For this, we measure DAMON_RECLAIM's CPU consumption,
entire system memory footprint, total number of major page faults, and
runtime of 24 realistic workloads in PARSEC3 and SPLASH-2X benchmark
suites on my QEMU/KVM based virtual machine. The virtual machine runs
on an i3.metal AWS instance, has 130GiB memory, and runs a linux kernel
built on latest -mm tree[1] plus this patchset. It also utilizes a 4
GiB ZRAM swap device. We repeats the measurement 5 times and use
averages.
[1] https://github.com/hnaz/linux-mm/tree/v5.15-rc5-mmots-2021-10-13-19-55
Detailed Results
----------------
The results are summarized in the below table.
With coldness identification threshold of 5 seconds, DAMON_RECLAIM
without the time quota-based speed limit achieves 47.21% memory saving,
but incur 4.59% runtime slowdown to the workloads on average. For this,
DAMON_RECLAIM consumes about 11.28% single CPU time.
Applying time quotas of 200ms/s, 50ms/s, and 10ms/s without the regions
prioritization reduces the slowdown to 4.89%, 2.65%, and 1.5%,
respectively. Time quota of 200ms/s (20%) makes no real change compared
to the quota unapplied version, because the quota unapplied version
consumes only 11.28% CPU time. DAMON_RECLAIM's CPU utilization also
similarly reduced: 11.24%, 5.51%, and 2.01% of single CPU time. That
is, the overhead is proportional to the speed limit. Nevertheless, it
also reduces the memory saving because it becomes less aggressive. In
detail, the three variants show 48.76%, 37.83%, and 7.85% memory saving,
respectively.
Applying the regions prioritization (page out regions that not accessed
longer first within the time quota) further reduces the performance
degradation. Runtime slowdowns and total number of major page faults
increase has been 4.89%/218,690% -> 4.39%/166,136% (200ms/s),
2.65%/111,886% -> 1.94%/59,053% (50ms/s), and 1.5%/34,973.40% ->
2.08%/8,781.75% (10ms/s). The runtime under 10ms/s time quota has
increased with prioritization, but apparently that's under the margin of
error.
time quota prioritization memory_saving cpu_util slowdown pgmajfaults overhead
N N 47.21% 11.28% 4.59% 194,802%
200ms/s N 48.76% 11.24% 4.89% 218,690%
50ms/s N 37.83% 5.51% 2.65% 111,886%
10ms/s N 7.85% 2.01% 1.5% 34,793.40%
200ms/s Y 50.08% 10.38% 4.39% 166,136%
50ms/s Y 38.58% 4.97% 1.94% 59,053%
10ms/s Y 3.63% 1.73% 2.08% 8,781.75%
Baseline and Complete Git Trees
===============================
The patches are based on the latest -mm tree
(v5.15-rc5-mmots-2021-10-13-19-55). You can also clone the complete git tree
from:
$ git clone git://github.com/sjp38/linux -b damon_reclaim/patches/v1
The web is also available:
https://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git/tag/?h=damon_reclaim/patches/v1
Sequence Of Patches
===================
The first patch makes DAMOS support the physical address space for the
page out action. Following five patches (patches 2-6) implement the
time/size quotas. Next four patches (patches 7-10) implement the memory
regions prioritization within the limit. Then, three following patches
(patches 11-13) implement the watermarks-based schemes activation.
Finally, the last two patches (patches 14-15) implement and document the
DAMON-based reclamation using the advanced DAMOS.
[1] https://www.kernel.org/doc/html/v5.15-rc1/vm/damon/index.html
[2] https://research.google/pubs/pub48551/
[3] https://lwn.net/Articles/787611/
[4] https://damonitor.github.io
[5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html
[6] https://lore.kernel.org/linux-mm/20211001125604.29660-1-sj@kernel.org/
[7] https://github.com/awslabs/damoos
[8] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
[9] https://www.usenix.org/conference/fast-04/car-clock-adaptive-replacement
This patch (of 15):
This makes the DAMON primitives for physical address space support the
pageout action for DAMON-based Operation Schemes. With this commit,
hence, users can easily implement system-level data access-aware
reclamations using DAMOS.
[sj@kernel.org: fix missing-prototype build warning]
Link: https://lkml.kernel.org/r/20211025064220.13904-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211019150731.16699-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211019150731.16699-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This implements the monitoring primitives for the physical memory
address space. Internally, it uses the PTE Accessed bit, similar to
that of the virtual address spaces monitoring primitives. It supports
only user memory pages, as idle pages tracking does. If the monitoring
target physical memory address range contains non-user memory pages,
access check of the pages will do nothing but simply treat the pages as
not accessed.
Link: https://lkml.kernel.org/r/20211012205711.29216-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To tune the DAMON-based operation schemes, knowing how many and how
large regions are affected by each of the schemes will be helful. Those
stats could be used for not only the tuning, but also monitoring of the
working set size and the number of regions, if the scheme does not
change the program behavior too much.
For the reason, this implements the statistics for the schemes. The
total number and size of the regions that each scheme is applied are
exported to users via '->stat_count' and '->stat_sz' of 'struct damos'.
Admins can also check the number by reading 'schemes' debugfs file. The
last two integers now represents the stats. To allow collecting the
stats without changing the program behavior, this also adds new scheme
action, 'DAMOS_STAT'. Note that 'DAMOS_STAT' is not only making no
memory operation actions, but also does not reset the age of regions.
Link: https://lkml.kernel.org/r/20211001125604.29660-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes DAMON's default primitives for virtual address spaces to
support DAMON-based Operation Schemes (DAMOS) by implementing actions
application functions and registering it to the monitoring context. The
implementation simply links 'madvise()' for related DAMOS actions. That
is, 'madvise(MADV_WILLNEED)' is called for 'WILLNEED' DAMOS action and
similar for other actions ('COLD', 'PAGEOUT', 'HUGEPAGE', 'NOHUGEPAGE').
So, the kernel space DAMON users can now use the DAMON-based
optimizations with only small amount of code.
Link: https://lkml.kernel.org/r/20211001125604.29660-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In many cases, users might use DAMON for simple data access aware memory
management optimizations such as applying an operation scheme to a
memory region of a specific size having a specific access frequency for
a specific time. For example, "page out a memory region larger than 100
MiB but having a low access frequency more than 10 minutes", or "Use THP
for a memory region larger than 2 MiB having a high access frequency for
more than 2 seconds".
Most simple form of the solution would be doing offline data access
pattern profiling using DAMON and modifying the application source code
or system configuration based on the profiling results. Or, developing
a daemon constructed with two modules (one for access monitoring and the
other for applying memory management actions via mlock(), madvise(),
sysctl, etc) is imaginable.
To avoid users spending their time for implementation of such simple
data access monitoring-based operation schemes, this makes DAMON to
handle such schemes directly. With this change, users can simply
specify their desired schemes to DAMON. Then, DAMON will automatically
apply the schemes to the user-specified target processes.
Each of the schemes is composed with conditions for filtering of the
target memory regions and desired memory management action for the
target. Specifically, the format is::
<min/max size> <min/max access frequency> <min/max age> <action>
The filtering conditions are size of memory region, number of accesses
to the region monitored by DAMON, and the age of the region. The age of
region is incremented periodically but reset when its addresses or
access frequency has significantly changed or the action of a scheme was
applied. For the action, current implementation supports a few of
madvise()-like hints, ``WILLNEED``, ``COLD``, ``PAGEOUT``, ``HUGEPAGE``,
and ``NOHUGEPAGE``.
Because DAMON supports various address spaces and application of the
actions to a monitoring target region is dependent to the type of the
target address space, the application code should be implemented by each
primitives and registered to the framework. Note that this only
implements the framework part. Following commit will implement the
action applications for virtual address spaces primitives.
Link: https://lkml.kernel.org/r/20211001125604.29660-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Implement Data Access Monitoring-based Memory Operation Schemes".
Introduction
============
DAMON[1] can be used as a primitive for data access aware memory
management optimizations. For that, users who want such optimizations
should run DAMON, read the monitoring results, analyze it, plan a new
memory management scheme, and apply the new scheme by themselves. Such
efforts will be inevitable for some complicated optimizations.
However, in many other cases, the users would simply want the system to
apply a memory management action to a memory region of a specific size
having a specific access frequency for a specific time. For example,
"page out a memory region larger than 100 MiB keeping only rare accesses
more than 2 minutes", or "Do not use THP for a memory region larger than
2 MiB rarely accessed for more than 1 seconds".
To make the works easier and non-redundant, this patchset implements a
new feature of DAMON, which is called Data Access Monitoring-based
Operation Schemes (DAMOS). Using the feature, users can describe the
normal schemes in a simple way and ask DAMON to execute those on its
own.
[1] https://damonitor.github.io
Evaluations
===========
DAMOS is accurate and useful for memory management optimizations. An
experimental DAMON-based operation scheme for THP, 'ethp', removes
76.15% of THP memory overheads while preserving 51.25% of THP speedup.
Another experimental DAMON-based 'proactive reclamation' implementation,
'prcl', reduces 93.38% of residential sets and 23.63% of system memory
footprint while incurring only 1.22% runtime overhead in the best case
(parsec3/freqmine).
NOTE that the experimental THP optimization and proactive reclamation
are not for production but only for proof of concepts.
Please refer to the showcase web site's evaluation document[1] for
detailed evaluation setup and results.
[1] https://damonitor.github.io/doc/html/v34/vm/damon/eval.html
Long-term Support Trees
-----------------------
For people who want to test DAMON but using LTS kernels, there are
another couple of trees based on two latest LTS kernels respectively and
containing the 'damon/master' backports.
- For v5.4.y: https://git.kernel.org/sj/h/damon/for-v5.4.y
- For v5.10.y: https://git.kernel.org/sj/h/damon/for-v5.10.y
Sequence Of Patches
===================
The 1st patch accounts age of each region. The 2nd patch implements the
core of the DAMON-based operation schemes feature. The 3rd patch makes
the default monitoring primitives for virtual address spaces to support
the schemes. From this point, the kernel space users can use DAMOS.
The 4th patch exports the feature to the user space via the debugfs
interface. The 5th patch implements schemes statistics feature for
easier tuning of the schemes and runtime access pattern analysis, and
the 6th patch adds selftests for these changes. Finally, the 7th patch
documents this new feature.
This patch (of 7):
DAMON can be used for data access pattern aware memory management
optimizations. For that, users should run DAMON, read the monitoring
results, analyze it, plan a new memory management scheme, and apply the
new scheme by themselves. It would not be too hard, but still require
some level of effort. For complicated cases, this effort is inevitable.
That said, in many cases, users would simply want to apply an actions to
a memory region of a specific size having a specific access frequency
for a specific time. For example, "page out a memory region larger than
100 MiB but having a low access frequency more than 10 minutes", or "Use
THP for a memory region larger than 2 MiB having a high access frequency
for more than 2 seconds".
For such optimizations, users will need to first account the age of each
region themselves. To reduce such efforts, this implements a simple age
account of each region in DAMON. For each aggregation step, DAMON
compares the access frequency with that from last aggregation and reset
the age of the region if the change is significant. Else, the age is
incremented. Also, in case of the merge of regions, the region
size-weighted average of the ages is set as the age of merged new
region.
Link: https://lkml.kernel.org/r/20211001125604.29660-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211001125604.29660-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: David Rienjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A few Kernel-doc comments in 'damon.h' are broken. This fixes them.
Link: https://lkml.kernel.org/r/20210917123958.3819-5-sj@kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DAMON is designed to be used by kernel space code such as the memory
management subsystems, and therefore it provides only kernel space API.
That said, letting the user space control DAMON could provide some
benefits to them. For example, it will allow user space to analyze their
specific workloads and make their own special optimizations.
For such cases, this commit implements a simple DAMON application kernel
module, namely 'damon-dbgfs', which merely wraps the DAMON api and exports
those to the user space via the debugfs.
'damon-dbgfs' exports three files, ``attrs``, ``target_ids``, and
``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
Attributes
----------
Users can read and write the ``sampling interval``, ``aggregation
interval``, ``regions update interval``, and min/max number of monitoring
target regions by reading from and writing to the ``attrs`` file. For
example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10,
1000 and check it again::
# cd <debugfs>/damon
# echo 5000 100000 1000000 10 1000 > attrs
# cat attrs
5000 100000 1000000 10 1000
Target IDs
----------
Some types of address spaces supports multiple monitoring target. For
example, the virtual memory address spaces monitoring can have multiple
processes as the monitoring targets. Users can set the targets by writing
relevant id values of the targets to, and get the ids of the current
targets by reading from the ``target_ids`` file. In case of the virtual
address spaces monitoring, the values should be pids of the monitoring
target processes. For example, below commands set processes having pids
42 and 4242 as the monitoring targets and check it again::
# cd <debugfs>/damon
# echo 42 4242 > target_ids
# cat target_ids
42 4242
Note that setting the target ids doesn't start the monitoring.
Turning On/Off
--------------
Setting the files as described above doesn't incur effect unless you
explicitly start the monitoring. You can start, stop, and check the
current status of the monitoring by writing to and reading from the
``monitor_on`` file. Writing ``on`` to the file starts the monitoring of
the targets with the attributes. Writing ``off`` to the file stops those.
DAMON also stops if every targets are invalidated (in case of the virtual
memory monitoring, target processes are invalidated when terminated).
Below example commands turn on, off, and check the status of DAMON::
# cd <debugfs>/damon
# echo on > monitor_on
# echo off > monitor_on
# cat monitor_on
off
Please note that you cannot write to the above-mentioned debugfs files
while the monitoring is turned on. If you write to the files while DAMON
is running, an error code such as ``-EBUSY`` will be returned.
[akpm@linux-foundation.org: remove unneeded "alloc failed" printks]
[akpm@linux-foundation.org: replace macro with static inline]
Link: https://lkml.kernel.org/r/20210716081449.22187-8-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit introduces a reference implementation of the address space
specific low level primitives for the virtual address space, so that users
of DAMON can easily monitor the data accesses on virtual address spaces of
specific processes by simply configuring the implementation to be used by
DAMON.
The low level primitives for the fundamental access monitoring are defined
in two parts:
1. Identification of the monitoring target address range for the address
space.
2. Access check of specific address range in the target space.
The reference implementation for the virtual address space does the works
as below.
PTE Accessed-bit Based Access Check
-----------------------------------
The implementation uses PTE Accessed-bit for basic access checks. That
is, it clears the bit for the next sampling target page and checks whether
it is set again after one sampling period. This could disturb the reclaim
logic. DAMON uses ``PG_idle`` and ``PG_young`` page flags to solve the
conflict, as Idle page tracking does.
VMA-based Target Address Range Construction
-------------------------------------------
Only small parts in the super-huge virtual address space of the processes
are mapped to physical memory and accessed. Thus, tracking the unmapped
address regions is just wasteful. However, because DAMON can deal with
some level of noise using the adaptive regions adjustment mechanism,
tracking every mapping is not strictly required but could even incur a
high overhead in some cases. That said, too huge unmapped areas inside
the monitoring target should be removed to not take the time for the
adaptive mechanism.
For the reason, this implementation converts the complex mappings to three
distinct regions that cover every mapped area of the address space. Also,
the two gaps between the three regions are the two biggest unmapped areas
in the given address space. The two biggest unmapped areas would be the
gap between the heap and the uppermost mmap()-ed region, and the gap
between the lowermost mmap()-ed region and the stack in most of the cases.
Because these gaps are exceptionally huge in usual address spaces,
excluding these will be sufficient to make a reasonable trade-off. Below
shows this in detail::
<heap>
<BIG UNMAPPED REGION 1>
<uppermost mmap()-ed region>
(small mmap()-ed regions and munmap()-ed regions)
<lowermost mmap()-ed region>
<BIG UNMAPPED REGION 2>
<stack>
[akpm@linux-foundation.org: mm/damon/vaddr.c needs highmem.h for kunmap_atomic()]
[sjpark@amazon.de: remove unnecessary PAGE_EXTENSION setup]
Link: https://lkml.kernel.org/r/20210806095153.6444-2-sj38.park@gmail.com
[sjpark@amazon.de: safely walk page table]
Link: https://lkml.kernel.org/r/20210831161800.29419-1-sj38.park@gmail.com
Link: https://lkml.kernel.org/r/20210716081449.22187-6-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Even somehow the initial monitoring target regions are well constructed to
fulfill the assumption (pages in same region have similar access
frequencies), the data access pattern can be dynamically changed. This
will result in low monitoring quality. To keep the assumption as much as
possible, DAMON adaptively merges and splits each region based on their
access frequency.
For each ``aggregation interval``, it compares the access frequencies of
adjacent regions and merges those if the frequency difference is small.
Then, after it reports and clears the aggregated access frequency of each
region, it splits each region into two or three regions if the total
number of regions will not exceed the user-specified maximum number of
regions after the split.
In this way, DAMON provides its best-effort quality and minimal overhead
while keeping the upper-bound overhead that users set.
Link: https://lkml.kernel.org/r/20210716081449.22187-4-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To avoid the unbounded increase of the overhead, DAMON groups adjacent
pages that are assumed to have the same access frequencies into a
region. As long as the assumption (pages in a region have the same
access frequencies) is kept, only one page in the region is required to
be checked. Thus, for each ``sampling interval``,
1. the 'prepare_access_checks' primitive picks one page in each region,
2. waits for one ``sampling interval``,
3. checks whether the page is accessed meanwhile, and
4. increases the access count of the region if so.
Therefore, the monitoring overhead is controllable by adjusting the
number of regions. DAMON allows both the underlying primitives and user
callbacks to adjust regions for the trade-off. In other words, this
commit makes DAMON to use not only time-based sampling but also
space-based sampling.
This scheme, however, cannot preserve the quality of the output if the
assumption is not guaranteed. Next commit will address this problem.
Link: https://lkml.kernel.org/r/20210716081449.22187-3-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Introduce Data Access MONitor (DAMON)", v34.
Introduction
============
DAMON is a data access monitoring framework for the Linux kernel. The
core mechanisms of DAMON called 'region based sampling' and 'adaptive
regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
patchset for the detail) make it
- accurate (The monitored information is useful for DRAM level memory
management. It might not appropriate for Cache-level accuracy,
though.),
- light-weight (The monitoring overhead is low enough to be applied
online while making no impact on the performance of the target
workloads.), and
- scalable (the upper-bound of the instrumentation overhead is
controllable regardless of the size of target workloads.).
Using this framework, therefore, several memory management mechanisms such
as reclamation and THP can be optimized to aware real data access
patterns. Experimental access pattern aware memory management
optimization works that incurring high instrumentation overhead will be
able to have another try.
Though DAMON is for kernel subsystems, it can be easily exposed to the
user space by writing a DAMON-wrapper kernel subsystem. Then, user space
users who have some special workloads will be able to write personalized
tools or applications for deeper understanding and specialized
optimizations of their systems.
DAMON is also merged in two public Amazon Linux kernel trees that based on
v5.4.y[1] and v5.10.y[2].
[1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
The userspace tool[1] is available, released under GPLv2, and actively
being maintained. I am also planning to implement another basic user
interface in perf[2]. Also, the basic test suite for DAMON is available
under GPLv2[3].
[1] https://github.com/awslabs/damo
[2] https://lore.kernel.org/linux-mm/20210107120729.22328-1-sjpark@amazon.com/
[3] https://github.com/awslabs/damon-tests
Long-term Plan
--------------
DAMON is a part of a project called Data Access-aware Operating System
(DAOS). As the name implies, I want to improve the performance and
efficiency of systems using fine-grained data access patterns. The
optimizations are for both kernel and user spaces. I will therefore
modify or create kernel subsystems, export some of those to user space and
implement user space library / tools. Below shows the layers and
components for the project.
---------------------------------------------------------------------------
Primitives: PTE Accessed bit, PG_idle, rmap, (Intel CMT), ...
Framework: DAMON
Features: DAMOS, virtual addr, physical addr, ...
Applications: DAMON-debugfs, (DARC), ...
^^^^^^^^^^^^^^^^^^^^^^^ KERNEL SPACE ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raw Interface: debugfs, (sysfs), (damonfs), tracepoints, (sys_damon), ...
vvvvvvvvvvvvvvvvvvvvvvv USER SPACE vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Library: (libdamon), ...
Tools: DAMO, (perf), ...
---------------------------------------------------------------------------
The components in parentheses or marked as '...' are not implemented yet
but in the future plan. IOW, those are the TODO tasks of DAOS project.
For more detail, please refer to the plans:
https://lore.kernel.org/linux-mm/20201202082731.24828-1-sjpark@amazon.com/
Evaluations
===========
We evaluated DAMON's overhead, monitoring quality and usefulness using 24
realistic workloads on my QEMU/KVM based virtual machine running a kernel
that v24 DAMON patchset is applied.
DAMON is lightweight. It increases system memory usage by 0.39% and slows
target workloads down by 1.16%.
DAMON is accurate and useful for memory management optimizations. An
experimental DAMON-based operation scheme for THP, namely 'ethp', removes
76.15% of THP memory overheads while preserving 51.25% of THP speedup.
Another experimental DAMON-based 'proactive reclamation' implementation,
'prcl', reduces 93.38% of residential sets and 23.63% of system memory
footprint while incurring only 1.22% runtime overhead in the best case
(parsec3/freqmine).
NOTE that the experimental THP optimization and proactive reclamation are
not for production but only for proof of concepts.
Please refer to the official document[1] or "Documentation/admin-guide/mm:
Add a document for DAMON" patch in this patchset for detailed evaluation
setup and results.
[1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html
Real-world User Story
=====================
In summary, DAMON has used on production systems and proved its usefulness.
DAMON as a profiler
-------------------
We analyzed characteristics of a large scale production systems of our
customers using DAMON. The systems utilize 70GB DRAM and 36 CPUs. From
this, we were able to find interesting things below.
There were obviously different access pattern under idle workload and
active workload. Under the idle workload, it accessed large memory
regions with low frequency, while the active workload accessed small
memory regions with high freuqnecy.
DAMON found a 7GB memory region that showing obviously high access
frequency under the active workload. We believe this is the
performance-effective working set and need to be protected.
There was a 4KB memory region that showing highest access frequency under
not only active but also idle workloads. We think this must be a hottest
code section like thing that should never be paged out.
For this analysis, DAMON used only 0.3-1% of single CPU time. Because we
used recording-based analysis, it consumed about 3-12 MB of disk space per
20 minutes. This is only small amount of disk space, but we can further
reduce the disk usage by using non-recording-based DAMON features. I'd
like to argue that only DAMON can do such detailed analysis (finding 4KB
highest region in 70GB memory) with the light overhead.
DAMON as a system optimization tool
-----------------------------------
We also found below potential performance problems on the systems and made
DAMON-based solutions.
The system doesn't want to make the workload suffer from the page
reclamation and thus it utilizes enough DRAM but no swap device. However,
we found the system is actively reclaiming file-backed pages, because the
system has intensive file IO. The file IO turned out to be not
performance critical for the workload, but the customer wanted to ensure
performance critical file-backed pages like code section to not mistakenly
be evicted.
Using direct IO should or `mlock()` would be a straightforward solution,
but modifying the user space code is not easy for the customer.
Alternatively, we could use DAMON-based operation scheme[1]. By using it,
we can ask DAMON to track access frequency of each region and make
'process_madvise(MADV_WILLNEED)[2]' call for regions having specific size
and access frequency for a time interval.
We also found the system is having high number of TLB misses. We tried
'always' THP enabled policy and it greatly reduced TLB misses, but the
page reclamation also been more frequent due to the THP internal
fragmentation caused memory bloat. We could try another DAMON-based
operation scheme that applies 'MADV_HUGEPAGE' to memory regions having
>=2MB size and high access frequency, while applying 'MADV_NOHUGEPAGE' to
regions having <2MB size and low access frequency.
We do not own the systems so we only reported the analysis results and
possible optimization solutions to the customers. The customers satisfied
about the analysis results and promised to try the optimization guides.
[1] https://lore.kernel.org/linux-mm/20201006123931.5847-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-api/20200622192900.22757-4-minchan@kernel.org/
Comparison with Idle Page Tracking
==================================
Idle Page Tracking allows users to set and read idleness of pages using a
bitmap file which represents each page with each bit of the file. One
recommended usage of it is working set size detection. Users can do that
by
1. find PFN of each page for workloads in interest,
2. set all the pages as idle by doing writes to the bitmap file,
3. wait until the workload accesses its working set, and
4. read the idleness of the pages again and count pages became not idle.
NOTE: While Idle Page Tracking is for user space users, DAMON is primarily
designed for kernel subsystems though it can easily exposed to the user
space. Hence, this section only assumes such user space use of DAMON.
For what use cases Idle Page Tracking would be better?
------------------------------------------------------
1. Flexible usecases other than hotness monitoring.
Because Idle Page Tracking allows users to control the primitive (Page
idleness) by themselves, Idle Page Tracking users can do anything they
want. Meanwhile, DAMON is primarily designed to monitor the hotness of
each memory region. For this, DAMON asks users to provide sampling
interval and aggregation interval. For the reason, there could be some
use case that using Idle Page Tracking is simpler.
2. Physical memory monitoring.
Idle Page Tracking receives PFN range as input, so natively supports
physical memory monitoring.
DAMON is designed to be extensible for multiple address spaces and use
cases by implementing and using primitives for the given use case.
Therefore, by theory, DAMON has no limitation in the type of target
address space as long as primitives for the given address space exists.
However, the default primitives introduced by this patchset supports only
virtual address spaces.
Therefore, for physical memory monitoring, you should implement your own
primitives and use it, or simply use Idle Page Tracking.
Nonetheless, RFC patchsets[1] for the physical memory address space
primitives is already available. It also supports user memory same to
Idle Page Tracking.
[1] https://lore.kernel.org/linux-mm/20200831104730.28970-1-sjpark@amazon.com/
For what use cases DAMON is better?
-----------------------------------
1. Hotness Monitoring.
Idle Page Tracking let users know only if a page frame is accessed or not.
For hotness check, the user should write more code and use more memory.
DAMON do that by itself.
2. Low Monitoring Overhead
DAMON receives user's monitoring request with one step and then provide
the results. So, roughly speaking, DAMON require only O(1) user/kernel
context switches.
In case of Idle Page Tracking, however, because the interface receives
contiguous page frames, the number of user/kernel context switches
increases as the monitoring target becomes complex and huge. As a result,
the context switch overhead could be not negligible.
Moreover, DAMON is born to handle with the monitoring overhead. Because
the core mechanism is pure logical, Idle Page Tracking users might be able
to implement the mechanism on their own, but it would be time consuming
and the user/kernel context switching will still more frequent than that
of DAMON. Also, the kernel subsystems cannot use the logic in this case.
3. Page granularity working set size detection.
Until v22 of this patchset, this was categorized as the thing Idle Page
Tracking could do better, because DAMON basically maintains additional
metadata for each of the monitoring target regions. So, in the page
granularity working set size detection use case, DAMON would incur (number
of monitoring target pages * size of metadata) memory overhead. Size of
the single metadata item is about 54 bytes, so assuming 4KB pages, about
1.3% of monitoring target pages will be additionally used.
All essential metadata for Idle Page Tracking are embedded in 'struct
page' and page table entries. Therefore, in this use case, only one
counter variable for working set size accounting is required if Idle Page
Tracking is used.
There are more details to consider, but roughly speaking, this is true in
most cases.
However, the situation changed from v23. Now DAMON supports arbitrary
types of monitoring targets, which don't use the metadata. Using that,
DAMON can do the working set size detection with no additional space
overhead but less user-kernel context switch. A first draft for the
implementation of monitoring primitives for this usage is available in a
DAMON development tree[1]. An RFC patchset for it based on this patchset
will also be available soon.
Since v24, the arbitrary type support is dropped from this patchset
because this patchset doesn't introduce real use of the type. You can
still get it from the DAMON development tree[2], though.
[1] https://github.com/sjp38/linux/tree/damon/pgidle_hack
[2] https://github.com/sjp38/linux/tree/damon/master
4. More future usecases
While Idle Page Tracking has tight coupling with base primitives (PG_Idle
and page table Accessed bits), DAMON is designed to be extensible for many
use cases and address spaces. If you need some special address type or
want to use special h/w access check primitives, you can write your own
primitives for that and configure DAMON to use those. Therefore, if your
use case could be changed a lot in future, using DAMON could be better.
Can I use both Idle Page Tracking and DAMON?
--------------------------------------------
Yes, though using them concurrently for overlapping memory regions could
result in interference to each other. Nevertheless, such use case would
be rare or makes no sense at all. Even in the case, the noise would bot
be really significant. So, you can choose whatever you want depending on
the characteristics of your use cases.
More Information
================
We prepared a showcase web site[1] that you can get more information.
There are
- the official documentations[2],
- the heatmap format dynamic access pattern of various realistic workloads for
heap area[3], mmap()-ed area[4], and stack[5] area,
- the dynamic working set size distribution[6] and chronological working set
size changes[7], and
- the latest performance test results[8].
[1] https://damonitor.github.io/_index
[2] https://damonitor.github.io/doc/html/latest-damon
[3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.0.png.html
[4] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
[5] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.2.png.html
[6] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
[7] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
[8] https://damonitor.github.io/test/result/perf/latest/html/index.html
Baseline and Complete Git Trees
===============================
The patches are based on the latest -mm tree, specifically
v5.14-rc1-mmots-2021-07-15-18-47 of https://github.com/hnaz/linux-mm. You can
also clone the complete git tree:
$ git clone git://github.com/sjp38/linux -b damon/patches/v34
The web is also available:
https://github.com/sjp38/linux/releases/tag/damon/patches/v34
Development Trees
-----------------
There are a couple of trees for entire DAMON patchset series and features
for future release.
- For latest release: https://github.com/sjp38/linux/tree/damon/master
- For next release: https://github.com/sjp38/linux/tree/damon/next
Long-term Support Trees
-----------------------
For people who want to test DAMON but using LTS kernels, there are another
couple of trees based on two latest LTS kernels respectively and
containing the 'damon/master' backports.
- For v5.4.y: https://github.com/sjp38/linux/tree/damon/for-v5.4.y
- For v5.10.y: https://github.com/sjp38/linux/tree/damon/for-v5.10.y
Amazon Linux Kernel Trees
-------------------------
DAMON is also merged in two public Amazon Linux kernel trees that based on
v5.4.y[1] and v5.10.y[2].
[1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
Git Tree for Diff of Patches
============================
For easy review of diff between different versions of each patch, I
prepared a git tree containing all versions of the DAMON patchset series:
https://github.com/sjp38/damon-patches
You can clone it and use 'diff' for easy review of changes between
different versions of the patchset. For example:
$ git clone https://github.com/sjp38/damon-patches && cd damon-patches
$ diff -u damon/v33 damon/v34
Sequence Of Patches
===================
First three patches implement the core logics of DAMON. The 1st patch
introduces basic sampling based hotness monitoring for arbitrary types of
targets. Following two patches implement the core mechanisms for control
of overhead and accuracy, namely regions based sampling (patch 2) and
adaptive regions adjustment (patch 3).
Now the essential parts of DAMON is complete, but it cannot work unless
someone provides monitoring primitives for a specific use case. The
following two patches make it just work for virtual address spaces
monitoring. The 4th patch makes 'PG_idle' can be used by DAMON and the
5th patch implements the virtual memory address space specific monitoring
primitives using page table Accessed bits and the 'PG_idle' page flag.
Now DAMON just works for virtual address space monitoring via the kernel
space api. To let the user space users can use DAMON, following four
patches add interfaces for them. The 6th patch adds a tracepoint for
monitoring results. The 7th patch implements a DAMON application kernel
module, namely damon-dbgfs, that simply wraps DAMON and exposes DAMON
interface to the user space via the debugfs interface. The 8th patch
further exports pid of monitoring thread (kdamond) to user space for
easier cpu usage accounting, and the 9th patch makes the debugfs interface
to support multiple contexts.
Three patches for maintainability follows. The 10th patch adds
documentations for both the user space and the kernel space. The 11th
patch provides unit tests (based on the kunit) while the 12th patch adds
user space tests (based on the kselftest).
Finally, the last patch (13th) updates the MAINTAINERS file.
This patch (of 13):
DAMON is a data access monitoring framework for the Linux kernel. The
core mechanisms of DAMON make it
- accurate (the monitoring output is useful enough for DRAM level
performance-centric memory management; It might be inappropriate for
CPU cache levels, though),
- light-weight (the monitoring overhead is normally low enough to be
applied online), and
- scalable (the upper-bound of the overhead is in constant range
regardless of the size of target workloads).
Using this framework, hence, we can easily write efficient kernel space
data access monitoring applications. For example, the kernel's memory
management mechanisms can make advanced decisions using this.
Experimental data access aware optimization works that incurring high
access monitoring overhead could again be implemented on top of this.
Due to its simple and flexible interface, providing user space interface
would be also easy. Then, user space users who have some special
workloads can write personalized applications for better understanding and
optimizations of their workloads and systems.
===
Nevertheless, this commit is defining and implementing only basic access
check part without the overhead-accuracy handling core logic. The basic
access check is as below.
The output of DAMON says what memory regions are how frequently accessed
for a given duration. The resolution of the access frequency is
controlled by setting ``sampling interval`` and ``aggregation interval``.
In detail, DAMON checks access to each page per ``sampling interval`` and
aggregates the results. In other words, counts the number of the accesses
to each region. After each ``aggregation interval`` passes, DAMON calls
callback functions that previously registered by users so that users can
read the aggregated results and then clears the results. This can be
described in below simple pseudo-code::
init()
while monitoring_on:
for page in monitoring_target:
if accessed(page):
nr_accesses[page] += 1
if time() % aggregation_interval == 0:
for callback in user_registered_callbacks:
callback(monitoring_target, nr_accesses)
for page in monitoring_target:
nr_accesses[page] = 0
if time() % update_interval == 0:
update()
sleep(sampling interval)
The target regions constructed at the beginning of the monitoring and
updated after each ``regions_update_interval``, because the target regions
could be dynamically changed (e.g., mmap() or memory hotplug). The
monitoring overhead of this mechanism will arbitrarily increase as the
size of the target workload grows.
The basic monitoring primitives for actual access check and dynamic target
regions construction aren't in the core part of DAMON. Instead, it allows
users to implement their own primitives that are optimized for their use
case and configure DAMON to use those. In other words, users cannot use
current version of DAMON without some additional works.
Following commits will implement the core mechanisms for the
overhead-accuracy control and default primitives implementations.
Link: https://lkml.kernel.org/r/20210716081449.22187-1-sj38.park@gmail.com
Link: https://lkml.kernel.org/r/20210716081449.22187-2-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>