2017-11-01 22:08:43 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
|
2012-10-13 17:46:48 +08:00
|
|
|
#ifndef _UAPI_LINUX_MMAN_H
|
|
|
|
#define _UAPI_LINUX_MMAN_H
|
|
|
|
|
|
|
|
#include <asm/mman.h>
|
2017-09-07 07:23:29 +08:00
|
|
|
#include <asm-generic/hugetlb_encode.h>
|
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really doesn not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for performance issues
diagnostic.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that is previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. User should pass 0 (i.e no flag specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_args points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:07 +08:00
|
|
|
#include <linux/types.h>
|
2012-10-13 17:46:48 +08:00
|
|
|
|
mm/mremap: add MREMAP_DONTUNMAP to mremap()
When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set,
the source mapping will not be removed. The remap operation will be
performed as it would have been normally by moving over the page tables to
the new mapping. The old vma will have any locked flags cleared, have no
pagetables, and any userfaultfds that were watching that range will
continue watching it.
For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
the mremap() call to fail. Because MREMAP_DONTUNMAP always results in
moving a VMA you MUST use the MREMAP_MAYMOVE flag, it's not possible to
resize a VMA while also moving with MREMAP_DONTUNMAP so old_len must
always be equal to the new_len otherwise it will return -EINVAL.
We hope to use this in Chrome OS where with userfaultfd we could write an
anonymous mapping to disk without having to STOP the process or worry
about VMA permission changes.
This feature also has a use case in Android, Lokesh Gidra has said that
"As part of using userfaultfd for GC, We'll have to move the physical
pages of the java heap to a separate location. For this purpose mremap
will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
heap, its virtual mapping will be removed as well. Therefore, we'll
require performing mmap immediately after. This is not only time
consuming but also opens a time window where a native thread may call mmap
and reserve the java heap's address range for its own usage. This flag
solves the problem."
[bgeffon@google.com: v6]
Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com
[bgeffon@google.com: v7]
Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 12:09:17 +08:00
|
|
|
#define MREMAP_MAYMOVE 1
|
|
|
|
#define MREMAP_FIXED 2
|
|
|
|
#define MREMAP_DONTUNMAP 4
|
2012-10-13 17:46:48 +08:00
|
|
|
|
|
|
|
#define OVERCOMMIT_GUESS 0
|
|
|
|
#define OVERCOMMIT_ALWAYS 1
|
|
|
|
#define OVERCOMMIT_NEVER 2
|
|
|
|
|
2019-02-08 14:02:55 +08:00
|
|
|
#define MAP_SHARED 0x01 /* Share changes */
|
|
|
|
#define MAP_PRIVATE 0x02 /* Changes are private */
|
|
|
|
#define MAP_SHARED_VALIDATE 0x03 /* share + validate extension flags */
|
|
|
|
|
2017-09-07 07:23:29 +08:00
|
|
|
/*
|
|
|
|
* Huge page size encoding when MAP_HUGETLB is specified, and a huge page
|
|
|
|
* size other than the default is desired. See hugetlb_encode.h.
|
|
|
|
* All known huge page size encodings are provided here. It is the
|
|
|
|
* responsibility of the application to know which sizes are supported on
|
|
|
|
* the running system. See mmap(2) man page for details.
|
|
|
|
*/
|
|
|
|
#define MAP_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT
|
|
|
|
#define MAP_HUGE_MASK HUGETLB_FLAG_ENCODE_MASK
|
|
|
|
|
2020-08-31 16:30:44 +08:00
|
|
|
#define MAP_HUGE_16KB HUGETLB_FLAG_ENCODE_16KB
|
2017-09-07 07:23:29 +08:00
|
|
|
#define MAP_HUGE_64KB HUGETLB_FLAG_ENCODE_64KB
|
|
|
|
#define MAP_HUGE_512KB HUGETLB_FLAG_ENCODE_512KB
|
|
|
|
#define MAP_HUGE_1MB HUGETLB_FLAG_ENCODE_1MB
|
|
|
|
#define MAP_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB
|
|
|
|
#define MAP_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB
|
|
|
|
#define MAP_HUGE_16MB HUGETLB_FLAG_ENCODE_16MB
|
2018-10-06 06:51:54 +08:00
|
|
|
#define MAP_HUGE_32MB HUGETLB_FLAG_ENCODE_32MB
|
2017-09-07 07:23:29 +08:00
|
|
|
#define MAP_HUGE_256MB HUGETLB_FLAG_ENCODE_256MB
|
2018-10-06 06:51:54 +08:00
|
|
|
#define MAP_HUGE_512MB HUGETLB_FLAG_ENCODE_512MB
|
2017-09-07 07:23:29 +08:00
|
|
|
#define MAP_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB
|
|
|
|
#define MAP_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB
|
|
|
|
#define MAP_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB
|
|
|
|
|
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really doesn not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for performance issues
diagnostic.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that is previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. User should pass 0 (i.e no flag specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_args points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 09:36:07 +08:00
|
|
|
struct cachestat_range {
|
|
|
|
__u64 off;
|
|
|
|
__u64 len;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct cachestat {
|
|
|
|
__u64 nr_cache;
|
|
|
|
__u64 nr_dirty;
|
|
|
|
__u64 nr_writeback;
|
|
|
|
__u64 nr_evicted;
|
|
|
|
__u64 nr_recently_evicted;
|
|
|
|
};
|
|
|
|
|
2012-10-13 17:46:48 +08:00
|
|
|
#endif /* _UAPI_LINUX_MMAN_H */
|