License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
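For illustration only, the tag added by this series becomes the first line of
each file, in the comment style that file type expects (a generic sketch, not
taken from a specific patched file):

        // SPDX-License-Identifier: GPL-2.0     <- C source (.c) files
        /* SPDX-License-Identifier: GPL-2.0 */  <- headers and other files that avoid C99-style comments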
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
// SPDX-License-Identifier: GPL-2.0
/*
 *	mm/mremap.c
 *
 *	(C) Copyright 1996 Linus Torvalds
 *
 *	Address space accounting code	<alan@lxorguk.ukuu.org.uk>
 *	(C) Copyright 2002 Red Hat Inc, All Rights Reserved
 */

#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/hugetlb.h>
#include <linux/shm.h>
#include <linux/ksm.h>
#include <linux/mman.h>
#include <linux/swap.h>
#include <linux/capability.h>
#include <linux/fs.h>
#include <linux/swapops.h>
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
mmu-notifiers: core
With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM, where a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page being freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it spares the code that tears down the
mappings of the secondary MMU from implementing logic like tlb_gather in
zap_page_range, which would require many IPIs to flush other cpu tlbs for
each fixed number of sptes unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.
This allows mapping any page pointed to by any pte (and in turn visible in the
primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence the need to
schedule in XPMEM code to send the invalidate to the remote node, while there
is no need to schedule in kvm/gru as it's an immediate event like invalidating
a primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increase in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too (a minimal user-space model of this counter
discipline is sketched right after point 2 below).
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows compiling a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
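To make the counting in point 1 concrete, here is a minimal user-space model of
that discipline (a sketch with invented names, not the kernel's mmu_notifier
API; the real callbacks also receive the mm and the address range):

        #include <stdatomic.h>
        #include <stdbool.h>

        /* > 0 while at least one range_begin..range_end pair is in flight */
        static atomic_int range_invalidations;

        static void secondary_mmu_range_begin(void)
        {
                atomic_fetch_add(&range_invalidations, 1);
                /* ...drop all secondary ptes/tlb entries covering the range here... */
        }

        static void secondary_mmu_range_end(void)
        {
                atomic_fetch_sub(&range_invalidations, 1);
        }

        /*
         * Secondary MMU fault path: a page returned by get_user_pages() while
         * an invalidation is in flight may be freed right after range_end, so
         * the fault handler must not install an spte/tlb entry for it and
         * should retry the fault instead.
         */
        static bool secondary_mmu_may_map(void)
        {
                return atomic_load(&range_invalidations) == 0;
        }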
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_register
is used when a driver starts up, a failure can be gracefully handled. Here
is an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver starts up, other allocations are required anyway and
-ENOMEM failure paths exist already.
struct kvm *kvm_arch_create_vm(void)
{
	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;
	if (!kvm)
		return ERR_PTR(-ENOMEM);
	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
	return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
the kernel from compiling after these changes on non-x86 archs (x86 didn't need
them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
#include <linux/mmu_notifier.h>
#include <linux/uaccess.h>
#include <linux/userfaultfd_k.h>
#include <linux/mempolicy.h>

#include <asm/cacheflush.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>

#include "internal.h"

static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;

	pgd = pgd_offset(mm, addr);
	if (pgd_none_or_clear_bad(pgd))
		return NULL;

	p4d = p4d_offset(pgd, addr);
	if (p4d_none_or_clear_bad(p4d))
		return NULL;

	pud = pud_offset(p4d, addr);
	if (pud_none_or_clear_bad(pud))
		return NULL;

	return pud;
}

static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
	pud_t *pud;
	pmd_t *pmd;

	pud = get_old_pud(mm, addr);
	if (!pud)
		return NULL;

	pmd = pmd_offset(pud, addr);
thp: mremap support and TLB optimization
This adds THP support to mremap (decreases the number of split_huge_page()
calls).
Here are also some benchmarks with a proggy like this:
===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (5UL*1024*1024*1024)

int main()
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;

	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);

	gettimeofday(&oldstamp, NULL);
	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	gettimeofday(&newstamp, NULL);
	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
	printf("usec %ld\n", diffsec);
	if (p == MAP_FAILED || p4 != p3)
	//if (p == MAP_FAILED)
		perror("mremap"), exit(1);
	if (memcmp(p4, p2, SIZE))
		printf("mremap bug\n"), exit(1);
	printf("ok\n");
	return 0;
}
===
THP on
Performance counter stats for './largepage13' (3 runs):
69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)
7.314454164 seconds time elapsed ( +- 0.023% )
THP off
Performance counter stats for './largepage13' (3 runs):
1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)
9.693655243 seconds time elapsed ( +- 0.069% )
grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
Measuring only the mremap time (so excluding the 3 long memsets and the
final long 10GB memory-accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess with a threaded program that sends more IPIs on large SMP systems it'd
create an even larger difference.
All debug options are off except DEBUG_VM to avoid skewing the
results.
The only problem with native 2M mremap, as it happens above, is that both the
source and destination address must be 2M aligned or the hugepmd can't be
moved without a split, but that is a hardware limitation.
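As a small illustration of that limitation, the decision boils down to an
alignment check along these lines (a sketch with a hypothetical helper name and
an assumed 2M huge page size, not code from this patch):

        #include <stdbool.h>
        #include <stdint.h>

        #define HUGEPMD_SIZE (2UL * 1024 * 1024)	/* assumed 2M huge pmd size */

        /* A huge pmd can only be moved whole when both the source and the
         * destination address sit on a 2M boundary; otherwise the mapping has
         * to be split and copied at pte granularity. */
        static bool can_move_hugepmd_whole(uintptr_t old_addr, uintptr_t new_addr)
        {
                return ((old_addr | new_addr) & (HUGEPMD_SIZE - 1)) == 0;
        }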
[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
	if (pmd_none(*pmd))
		return NULL;

	return pmd;
}

static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
			    unsigned long addr)
{
	pgd_t *pgd;
	p4d_t *p4d;

	pgd = pgd_offset(mm, addr);
	p4d = p4d_alloc(mm, pgd, addr);
	if (!p4d)
		return NULL;

	return pud_alloc(mm, p4d, addr);
}

static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
			    unsigned long addr)
{
	pud_t *pud;
	pmd_t *pmd;

	pud = alloc_new_pud(mm, vma, addr);
	if (!pud)
		return NULL;

	pmd = pmd_alloc(mm, pud, addr);
	if (!pmd)
		return NULL;

	VM_BUG_ON(pmd_trans_huge(*pmd));

	return pmd;
}

static void take_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->vm_file)
		i_mmap_lock_write(vma->vm_file->f_mapping);
	if (vma->anon_vma)
		anon_vma_lock_write(vma->anon_vma);
}

static void drop_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->anon_vma)
		anon_vma_unlock_write(vma->anon_vma);
	if (vma->vm_file)
		i_mmap_unlock_write(vma->vm_file->f_mapping);
}

static pte_t move_soft_dirty_pte(pte_t pte)
{
	/*
	 * Set soft dirty bit so we can notice
	 * in userspace the ptes were moved.
	 */
#ifdef CONFIG_MEM_SOFT_DIRTY
	if (pte_present(pte))
		pte = pte_mksoft_dirty(pte);
	else if (is_swap_pte(pte))
		pte = pte_swp_mksoft_dirty(pte);
#endif
	return pte;
}

static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
		unsigned long old_addr, unsigned long old_end,
		struct vm_area_struct *new_vma, pmd_t *new_pmd,
		unsigned long new_addr, bool need_rmap_locks)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t *old_pte, *new_pte, pte;
[PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
	spinlock_t *old_ptl, *new_ptl;
mremap: fix race between mremap() and page cleaning
Prior to 3.15, there was a race between zap_pte_range() and
page_mkclean() where writes to a page could be lost. Dave Hansen
discovered by inspection that there is a similar race between
move_ptes() and page_mkclean().
We've been able to reproduce the issue by enlarging the race window with
a msleep(), but have not been able to hit it without modifying the code.
So, we think it's a real issue, but is difficult or impossible to hit in
practice.
The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
this patch is to fix the race between page_mkclean() and mremap().
Here is one possible way to hit the race: suppose a process mmapped a
file with READ | WRITE and SHARED, it has two threads and they are bound
to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
1 did a write to addr X so that CPU1 now has a writable TLB for addr X
on it. Thread 2 starts mremaping from addr X to Y while thread 1
cleaned the page and then did another write to the old addr X again.
The 2nd write from thread 1 could succeed but the value will get lost.
thread 1 thread 2
(bound to CPU1) (bound to CPU2)
1: write 1 to addr X to get a
writeable TLB on this CPU
2: mremap starts
3: move_ptes emptied PTE for addr X
and setup new PTE for addr Y and
then dropped PTL for X and Y
4: page laundering for N by doing
fadvise FADV_DONTNEED. When done,
pageframe N is deemed clean.
5: *write 2 to addr X
6: tlb flush for addr X
7: munmap (Y, pagesize) to make the
page unmapped
8: fadvise with FADV_DONTNEED again
to kick the page off the pagecache
9: pread the page from file to verify
the value. If 1 is there, it means
we have lost the written 2.
*the write may or may not cause a segmentation fault; it depends on
whether the TLB entry is still on the CPU.
Please note that this is only one specific way the race could
occur; it doesn't mean that the race can only occur in exactly the above
config, e.g. more than 2 threads could be involved and fadvise() could
be done in another thread, etc.
For anonymous pages, the race is between mremap() and page reclaim:
THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
PMD gets unmapped/split/paged out before the tlb flush happened for
the old huge PMD in move_page_tables() and we could still write data to
it. Normal anonymous pages are in a similar situation.
To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and,
if any is found, do the flush before dropping the PTL. If we did the flush for
every move_ptes()/move_huge_pmd() call then we do not need to do the
flush in move_page_tables() for the whole range. But if we didn't, we
still need to do the whole range flush.
Alternatively, we can track which part of the range is flushed in
move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
range in move_page_tables(). But that would require multiple tlb
flushes for the different sub-ranges and should be less efficient than
the single whole range flush.
KBuild test on my Sandybridge desktop doesn't show any noticeable change.
v4.9-rc4:
real 5m14.048s
user 32m19.800s
sys 4m50.320s
With this commit:
real 5m13.888s
user 32m19.330s
sys 4m51.200s
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
	bool force_flush = false;
	unsigned long len = old_end - old_addr;
	int err = 0;

	/*
	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
	 * locks to ensure that rmap will always observe either the old or the
	 * new ptes. This is the easiest way to avoid races with
	 * truncate_pagecache(), page migration, etc...
	 *
	 * When need_rmap_locks is false, we use other ways to avoid
	 * such races:
	 *
	 * - During exec() shift_arg_pages(), we use a specially tagged vma
	 *   which rmap call sites look for using vma_is_temporary_stack().
	 *
	 * - During mremap(), new_vma is often known to be placed after vma
	 *   in rmap traversal order. This ensures rmap will always observe
	 *   either the old pte, or the new pte, or both (the page table locks
	 *   serialize access to individual ptes, but only rmap traversal
	 *   order guarantees that we won't miss both the old and new ptes).
	 */
	if (need_rmap_locks)
		take_rmap_locks(vma);

	/*
	 * We don't have to worry about the ordering of src and dst
	 * pte locks because exclusive mmap_lock prevents deadlock.
	 */
	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
	if (!old_pte) {
		err = -EAGAIN;
		goto out;
	}
	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
	if (!new_pte) {
		pte_unmap_unlock(old_pte, old_ptl);
		err = -EAGAIN;
		goto out;
	}
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
	flush_tlb_batched_pending(vma->vm_mm);
	arch_enter_lazy_mmu_mode();

	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
		if (pte_none(ptep_get(old_pte)))
			continue;

		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		/*
		 * If we are remapping a valid PTE, make sure
		 * to flush TLB before we drop the PTL for the
		 * PTE.
		 *
		 * NOTE! Both old and new PTL matter: the old one
		 * for racing with page_mkclean(), the new one to
		 * make sure the physical page stays valid until
		 * the TLB entry for the old mapping has been
		 * flushed.
		 */
		if (pte_present(pte))
			force_flush = true;
		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
		pte = move_soft_dirty_pte(pte);
		set_pte_at(mm, new_addr, new_pte, pte);
	}

	arch_leave_lazy_mmu_mode();
	if (force_flush)
		flush_tlb_range(vma, old_end - len, old_end);
	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	pte_unmap(new_pte - 1);
	pte_unmap_unlock(old_pte - 1, old_ptl);
out:
	if (need_rmap_locks)
		drop_rmap_locks(vma);
	return err;
}

#ifndef arch_supports_page_table_move
#define arch_supports_page_table_move arch_supports_page_table_move
static inline bool arch_supports_page_table_move(void)
{
	return IS_ENABLED(CONFIG_HAVE_MOVE_PMD) ||
		IS_ENABLED(CONFIG_HAVE_MOVE_PUD);
}
#endif

#ifdef CONFIG_HAVE_MOVE_PMD
static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
		unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
{
	spinlock_t *old_ptl, *new_ptl;
	struct mm_struct *mm = vma->vm_mm;
mm/mremap: fix move_normal_pmd/retract_page_tables race
commit 6fa1066fc5d00cb9f1b0e83b7ff6ef98d26ba2aa upstream.
In mremap(), move_page_tables() looks at the type of the PMD entry and the
specified address range to figure out by which method the next chunk of
page table entries should be moved.
At that point, the mmap_lock is held in write mode, but no rmap locks are
held yet. For PMD entries that point to page tables and are fully covered
by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called,
which first takes rmap locks, then does move_normal_pmd().
move_normal_pmd() takes the necessary page table locks at source and
destination, then moves an entire page table from the source to the
destination.
The problem is: The rmap locks, which protect against concurrent page
table removal by retract_page_tables() in the THP code, are only taken
after the PMD entry has been read and it has been decided how to move it.
So we can race as follows (with two processes that have mappings of the
same tmpfs file that is stored on a tmpfs mount with huge=advise); note
that process A accesses page tables through the MM while process B does it
through the file rmap:
process A                      process B
=========                      =========
mremap
  mremap_to
    move_vma
      move_page_tables
        get_old_pmd
        alloc_new_pmd
                      *** PREEMPT ***
                               madvise(MADV_COLLAPSE)
                                 do_madvise
                                   madvise_walk_vmas
                                     madvise_vma_behavior
                                       madvise_collapse
                                         hpage_collapse_scan_file
                                           collapse_file
                                             retract_page_tables
                                               i_mmap_lock_read(mapping)
                                               pmdp_collapse_flush
                                               i_mmap_unlock_read(mapping)
        move_pgt_entry(NORMAL_PMD, ...)
          take_rmap_locks
          move_normal_pmd
          drop_rmap_locks
When this happens, move_normal_pmd() can end up creating bogus PMD entries
in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`. The effect
depends on arch-specific and machine-specific details; on x86, you can end
up with physical page 0 mapped as a page table, which is likely
exploitable for user->kernel privilege escalation.
Fix the race by letting process B recheck that the PMD still points to a
page table after the rmap locks have been taken. Otherwise, we bail and
let the caller fall back to the PTE-level copying path, which will then
bail immediately at the pmd_none() check.
Bug reachability: Reaching this bug requires that you can create
shmem/file THP mappings - anonymous THP uses different code that doesn't
zap stuff under rmap locks. File THP is gated on an experimental config
flag (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need
shmem THP to hit this bug. As far as I know, getting shmem THP normally
requires that you can mount your own tmpfs with the right mount flags,
which would require creating your own user+mount namespace; though I don't
know if some distros maybe enable shmem THP by default or something like
that.
Bug impact: This issue can likely be used for user->kernel privilege
escalation when it is reachable.
Link: https://lkml.kernel.org/r/20241007-move_normal_pmd-vs-collapse-fix-2-v1-1-5ead9631f2ea@google.com
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Closes: https://project-zero.issues.chromium.org/371047675
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-08 05:42:04 +08:00
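For orientation, the process-B column in the diagram above is entered from userspace through MADV_COLLAPSE. A minimal, hedged sketch of that entry point follows; it only shows how retract_page_tables() becomes reachable, it is not a reproducer, and the MADV_COLLAPSE fallback definition is an assumption for older headers:

#define _GNU_SOURCE
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* assumed fallback for headers that predate it */
#endif

/* Ask the kernel to collapse a THP-eligible, PMD-aligned mapping; on a
 * file/shmem mapping this walks into retract_page_tables() as shown above. */
static int try_collapse(void *addr, size_t len)
{
        return madvise(addr, len, MADV_COLLAPSE);   /* 0 on success, -1 + errno otherwise */
}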
|
|
|
bool res = false;
|
2019-01-04 07:28:38 +08:00
|
|
|
pmd_t pmd;
|
|
|
|
|
2021-07-08 09:10:18 +08:00
|
|
|
if (!arch_supports_page_table_move())
|
|
|
|
return false;
|
2019-01-04 07:28:38 +08:00
|
|
|
/*
|
|
|
|
* The destination pmd shouldn't be established, free_pgtables()
|
2020-07-14 02:37:39 +08:00
|
|
|
* should have released it.
|
|
|
|
*
|
|
|
|
* However, there's a case during execve() where we use mremap
|
|
|
|
* to move the initial stack, and in that case the target area
|
|
|
|
* may overlap the source area (always moving down).
|
|
|
|
*
|
|
|
|
* If everything is PMD-aligned, that works fine, as moving
|
|
|
|
* each pmd down will clear the source pmd. But if we first
|
|
|
|
* have a few 4kB-only pages that get moved down, and then
|
|
|
|
* hit the "now the rest is PMD-aligned, let's do everything
|
|
|
|
* one pmd at a time", we will still have the old (now empty
|
|
|
|
* of any 4kB pages, but still there) PMD in the page table
|
|
|
|
* tree.
|
|
|
|
*
|
|
|
|
* Warn on it once - because we really should try to figure
|
|
|
|
* out how to do this better - but then say "I won't move
|
|
|
|
* this pmd".
|
|
|
|
*
|
|
|
|
* One alternative might be to just unmap the target pmd at
|
|
|
|
* this point, and verify that it really is empty. We'll see.
|
2019-01-04 07:28:38 +08:00
|
|
|
*/
|
2020-07-14 02:37:39 +08:00
|
|
|
if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
|
2019-01-04 07:28:38 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't have to worry about the ordering of src and dst
|
2020-06-09 12:33:54 +08:00
|
|
|
* ptlocks because exclusive mmap_lock prevents deadlock.
|
2019-01-04 07:28:38 +08:00
|
|
|
*/
|
|
|
|
old_ptl = pmd_lock(vma->vm_mm, old_pmd);
|
|
|
|
new_ptl = pmd_lockptr(mm, new_pmd);
|
|
|
|
if (new_ptl != old_ptl)
|
|
|
|
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
|
|
|
|
|
|
|
|
pmd = *old_pmd;
|
2024-10-08 05:42:04 +08:00
|
|
|
|
|
|
|
/* Racing with collapse? */
|
|
|
|
if (unlikely(!pmd_present(pmd) || pmd_leaf(pmd)))
|
|
|
|
goto out_unlock;
|
|
|
|
/* Clear the pmd */
|
2019-01-04 07:28:38 +08:00
|
|
|
pmd_clear(old_pmd);
|
2024-10-08 05:42:04 +08:00
|
|
|
res = true;
|
2019-01-04 07:28:38 +08:00
|
|
|
|
|
|
|
VM_BUG_ON(!pmd_none(*new_pmd));
|
|
|
|
|
2021-07-08 09:10:12 +08:00
|
|
|
pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
|
2019-01-04 07:28:38 +08:00
|
|
|
flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
|
2024-10-08 05:42:04 +08:00
|
|
|
out_unlock:
|
2019-01-04 07:28:38 +08:00
|
|
|
if (new_ptl != old_ptl)
|
|
|
|
spin_unlock(new_ptl);
|
|
|
|
spin_unlock(old_ptl);
|
|
|
|
|
2024-10-08 05:42:04 +08:00
|
|
|
return res;
|
2019-01-04 07:28:38 +08:00
|
|
|
}
|
2020-12-15 11:07:30 +08:00
|
|
|
#else
|
|
|
|
static inline bool move_normal_pmd(struct vm_area_struct *vma,
|
|
|
|
unsigned long old_addr, unsigned long new_addr, pmd_t *old_pmd,
|
|
|
|
pmd_t *new_pmd)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2021-07-08 09:10:09 +08:00
|
|
|
#if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_HAVE_MOVE_PUD)
|
2020-12-15 11:07:30 +08:00
|
|
|
static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
|
|
|
|
unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
|
|
|
|
{
|
|
|
|
spinlock_t *old_ptl, *new_ptl;
|
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
|
|
|
pud_t pud;
|
|
|
|
|
2021-07-08 09:10:18 +08:00
|
|
|
if (!arch_supports_page_table_move())
|
|
|
|
return false;
|
2020-12-15 11:07:30 +08:00
|
|
|
/*
|
|
|
|
* The destination pud shouldn't be established, free_pgtables()
|
|
|
|
* should have released it.
|
|
|
|
*/
|
|
|
|
if (WARN_ON_ONCE(!pud_none(*new_pud)))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't have to worry about the ordering of src and dst
|
|
|
|
* ptlocks because exclusive mmap_lock prevents deadlock.
|
|
|
|
*/
|
|
|
|
old_ptl = pud_lock(vma->vm_mm, old_pud);
|
|
|
|
new_ptl = pud_lockptr(mm, new_pud);
|
|
|
|
if (new_ptl != old_ptl)
|
|
|
|
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
|
|
|
|
|
|
|
|
/* Clear the pud */
|
|
|
|
pud = *old_pud;
|
|
|
|
pud_clear(old_pud);
|
|
|
|
|
|
|
|
VM_BUG_ON(!pud_none(*new_pud));
|
|
|
|
|
2021-07-08 09:10:12 +08:00
|
|
|
pud_populate(mm, new_pud, pud_pgtable(pud));
|
2020-12-15 11:07:30 +08:00
|
|
|
flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
|
|
|
|
if (new_ptl != old_ptl)
|
|
|
|
spin_unlock(new_ptl);
|
|
|
|
spin_unlock(old_ptl);
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline bool move_normal_pud(struct vm_area_struct *vma,
|
|
|
|
unsigned long old_addr, unsigned long new_addr, pud_t *old_pud,
|
|
|
|
pud_t *new_pud)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2019-01-04 07:28:38 +08:00
|
|
|
#endif
|
|
|
|
|
2023-07-25 03:07:52 +08:00
|
|
|
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
|
2021-07-08 09:10:06 +08:00
|
|
|
static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
|
|
|
|
unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
|
|
|
|
{
|
|
|
|
spinlock_t *old_ptl, *new_ptl;
|
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
|
|
|
pud_t pud;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The destination pud shouldn't be established, free_pgtables()
|
|
|
|
* should have released it.
|
|
|
|
*/
|
|
|
|
if (WARN_ON_ONCE(!pud_none(*new_pud)))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't have to worry about the ordering of src and dst
|
|
|
|
* ptlocks because exclusive mmap_lock prevents deadlock.
|
|
|
|
*/
|
|
|
|
old_ptl = pud_lock(vma->vm_mm, old_pud);
|
|
|
|
new_ptl = pud_lockptr(mm, new_pud);
|
|
|
|
if (new_ptl != old_ptl)
|
|
|
|
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
|
|
|
|
|
|
|
|
/* Clear the pud */
|
|
|
|
pud = *old_pud;
|
|
|
|
pud_clear(old_pud);
|
|
|
|
|
|
|
|
VM_BUG_ON(!pud_none(*new_pud));
|
|
|
|
|
|
|
|
/* Set the new pud */
|
|
|
|
/* mark soft_dirty when we add pud level soft dirty support */
|
|
|
|
set_pud_at(mm, new_addr, new_pud, pud);
|
|
|
|
flush_pud_tlb_range(vma, old_addr, old_addr + HPAGE_PUD_SIZE);
|
|
|
|
if (new_ptl != old_ptl)
|
|
|
|
spin_unlock(new_ptl);
|
|
|
|
spin_unlock(old_ptl);
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
|
|
|
|
unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
|
|
|
|
{
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
return false;
|
|
|
|
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2020-12-15 11:07:30 +08:00
|
|
|
enum pgt_entry {
|
|
|
|
NORMAL_PMD,
|
|
|
|
HPAGE_PMD,
|
|
|
|
NORMAL_PUD,
|
2021-07-08 09:10:06 +08:00
|
|
|
HPAGE_PUD,
|
2020-12-15 11:07:30 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns an extent of the corresponding size for the pgt_entry specified if
|
|
|
|
* valid. Else returns a smaller extent bounded by the end of the source and
|
|
|
|
* destination pgt_entry.
|
|
|
|
*/
|
2021-02-10 05:42:10 +08:00
|
|
|
static __always_inline unsigned long get_extent(enum pgt_entry entry,
|
|
|
|
unsigned long old_addr, unsigned long old_end,
|
|
|
|
unsigned long new_addr)
|
2020-12-15 11:07:30 +08:00
|
|
|
{
|
|
|
|
unsigned long next, extent, mask, size;
|
|
|
|
|
|
|
|
switch (entry) {
|
|
|
|
case HPAGE_PMD:
|
|
|
|
case NORMAL_PMD:
|
|
|
|
mask = PMD_MASK;
|
|
|
|
size = PMD_SIZE;
|
|
|
|
break;
|
2021-07-08 09:10:06 +08:00
|
|
|
case HPAGE_PUD:
|
2020-12-15 11:07:30 +08:00
|
|
|
case NORMAL_PUD:
|
|
|
|
mask = PUD_MASK;
|
|
|
|
size = PUD_SIZE;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
BUILD_BUG();
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
next = (old_addr + size) & mask;
|
|
|
|
/* even if next overflowed, extent below will be ok */
|
2020-12-30 07:14:40 +08:00
|
|
|
extent = next - old_addr;
|
|
|
|
if (extent > old_end - old_addr)
|
|
|
|
extent = old_end - old_addr;
|
2020-12-15 11:07:30 +08:00
|
|
|
next = (new_addr + size) & mask;
|
|
|
|
if (extent > next - new_addr)
|
|
|
|
extent = next - new_addr;
|
|
|
|
return extent;
|
|
|
|
}
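A small, self-contained sketch of the same arithmetic may help; the PMD_SIZE/PMD_MASK values below are assumptions for a 4K-page x86-64 build, and the function is a userspace re-statement of get_extent() for the PMD case only:

#include <stdio.h>

#define PMD_SIZE        (1UL << 21)             /* 2 MiB, assumed */
#define PMD_MASK        (~(PMD_SIZE - 1))

static unsigned long pmd_extent(unsigned long old_addr, unsigned long old_end,
                                unsigned long new_addr)
{
        unsigned long next, extent;

        next = (old_addr + PMD_SIZE) & PMD_MASK;   /* next PMD boundary after old_addr */
        extent = next - old_addr;
        if (extent > old_end - old_addr)           /* don't run past the source range */
                extent = old_end - old_addr;
        next = (new_addr + PMD_SIZE) & PMD_MASK;   /* nor past the destination boundary */
        if (extent > next - new_addr)
                extent = next - new_addr;
        return extent;
}

int main(void)
{
        /* Source starts 4 KiB past a 2 MiB boundary, destination is aligned:
         * only 2 MiB - 4 KiB (0x1ff000) can be moved before a boundary is hit. */
        printf("%#lx\n", pmd_extent(0x201000UL, 0x600000UL, 0x800000UL));
        return 0;
}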
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempts to speedup the move by moving entry at the level corresponding to
|
|
|
|
* pgt_entry. Returns true if the move was successful, else false.
|
|
|
|
*/
|
|
|
|
static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
|
|
|
|
unsigned long old_addr, unsigned long new_addr,
|
|
|
|
void *old_entry, void *new_entry, bool need_rmap_locks)
|
|
|
|
{
|
|
|
|
bool moved = false;
|
|
|
|
|
|
|
|
/* See comment in move_ptes() */
|
|
|
|
if (need_rmap_locks)
|
|
|
|
take_rmap_locks(vma);
|
|
|
|
|
|
|
|
switch (entry) {
|
|
|
|
case NORMAL_PMD:
|
|
|
|
moved = move_normal_pmd(vma, old_addr, new_addr, old_entry,
|
|
|
|
new_entry);
|
|
|
|
break;
|
|
|
|
case NORMAL_PUD:
|
|
|
|
moved = move_normal_pud(vma, old_addr, new_addr, old_entry,
|
|
|
|
new_entry);
|
|
|
|
break;
|
|
|
|
case HPAGE_PMD:
|
|
|
|
moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
|
|
|
|
move_huge_pmd(vma, old_addr, new_addr, old_entry,
|
|
|
|
new_entry);
|
|
|
|
break;
|
2021-07-08 09:10:06 +08:00
|
|
|
case HPAGE_PUD:
|
|
|
|
moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
|
|
|
|
move_huge_pud(vma, old_addr, new_addr, old_entry,
|
|
|
|
new_entry);
|
|
|
|
break;
|
|
|
|
|
2020-12-15 11:07:30 +08:00
|
|
|
default:
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (need_rmap_locks)
|
|
|
|
drop_rmap_locks(vma);
|
|
|
|
|
|
|
|
return moved;
|
|
|
|
}
|
|
|
|
|
2007-07-19 16:48:16 +08:00
|
|
|
unsigned long move_page_tables(struct vm_area_struct *vma,
|
2005-04-17 06:20:36 +08:00
|
|
|
unsigned long old_addr, struct vm_area_struct *new_vma,
|
2012-10-09 07:31:50 +08:00
|
|
|
unsigned long new_addr, unsigned long len,
|
|
|
|
bool need_rmap_locks)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2020-12-15 11:07:30 +08:00
|
|
|
unsigned long extent, old_end;
|
2018-12-28 16:38:09 +08:00
|
|
|
struct mmu_notifier_range range;
|
2005-10-30 09:16:00 +08:00
|
|
|
pmd_t *old_pmd, *new_pmd;
|
2021-07-08 09:10:06 +08:00
|
|
|
pud_t *old_pud, *new_pud;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2022-04-09 04:09:04 +08:00
|
|
|
if (!len)
|
|
|
|
return 0;
|
|
|
|
|
2005-10-30 09:16:00 +08:00
|
|
|
old_end = old_addr + len;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-11-06 04:41:40 +08:00
|
|
|
if (is_vm_hugetlb_page(vma))
|
|
|
|
return move_hugetlb_page_tables(vma, new_vma, old_addr,
|
|
|
|
new_addr, len);
|
|
|
|
|
2022-05-10 09:20:52 +08:00
|
|
|
flush_cache_range(vma, old_addr, old_end);
|
2023-01-10 10:57:22 +08:00
|
|
|
mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
|
mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table updates can happen for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).
Users of the mmu notifier API track changes to the CPU page table and take
specific action for them. The current API only provides the range of virtual
addresses affected by the change, not why the change is happening.
This patchset does the initial mechanical conversion of all the places that
call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is known (most invalidations happen against
a given vma). Passing down the vma allows users of the mmu notifier to
inspect the new vma page protection.
MMU_NOTIFY_UNMAP is always the safe default, as users of the mmu notifier
should assume that everything in the range is going away when that event
happens. A later patch converts the mm call paths to more appropriate
events for each call.
This is done as two patches so that no call site is forgotten, especially
as it uses the following coccinelle patch:
%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%
Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place
Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 08:20:49 +08:00
|
|
|
old_addr, old_end);
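Concretely, per the coccinelle rules above, a range-init call site such as the one in move_page_tables() changed roughly as follows at the time of this patch (the call shown just above reflects a later signature change that dropped the vma argument again):

/* before: only the mm and the affected virtual address range */
mmu_notifier_range_init(&range, vma->vm_mm, old_addr, old_end);

/* after this patch: the event type, flags and vma are passed as well */
mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
                        old_addr, old_end);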
|
2018-12-28 16:38:09 +08:00
|
|
|
mmu_notifier_invalidate_range_start(&range);
|
2011-11-01 08:08:26 +08:00
|
|
|
|
2005-10-30 09:16:00 +08:00
|
|
|
for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
|
2005-04-17 06:20:36 +08:00
|
|
|
cond_resched();
|
2020-12-15 11:07:30 +08:00
|
|
|
/*
|
|
|
|
* If extent is PUD-sized try to speed up the move by moving at the
|
|
|
|
* PUD level if possible.
|
|
|
|
*/
|
|
|
|
extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
|
|
|
|
|
2021-07-08 09:10:06 +08:00
|
|
|
old_pud = get_old_pud(vma->vm_mm, old_addr);
|
|
|
|
if (!old_pud)
|
|
|
|
continue;
|
|
|
|
new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
|
|
|
|
if (!new_pud)
|
|
|
|
break;
|
|
|
|
if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
|
|
|
|
if (extent == HPAGE_PUD_SIZE) {
|
|
|
|
move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
|
|
|
|
old_pud, new_pud, need_rmap_locks);
|
|
|
|
/* We ignore and continue on error? */
|
2020-12-15 11:07:30 +08:00
|
|
|
continue;
|
2021-07-08 09:10:06 +08:00
|
|
|
}
|
|
|
|
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
|
|
|
|
|
2020-12-15 11:07:30 +08:00
|
|
|
if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
|
2021-07-08 09:10:15 +08:00
|
|
|
old_pud, new_pud, true))
|
2020-12-15 11:07:30 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
|
2005-10-30 09:16:00 +08:00
|
|
|
old_pmd = get_old_pmd(vma->vm_mm, old_addr);
|
|
|
|
if (!old_pmd)
|
|
|
|
continue;
|
2011-01-14 07:46:43 +08:00
|
|
|
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
|
2005-10-30 09:16:00 +08:00
|
|
|
if (!new_pmd)
|
|
|
|
break;
|
2023-06-09 09:32:47 +08:00
|
|
|
again:
|
2020-12-15 11:07:30 +08:00
|
|
|
if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
|
|
|
|
pmd_devmap(*old_pmd)) {
|
|
|
|
if (extent == HPAGE_PMD_SIZE &&
|
|
|
|
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
|
|
|
|
old_pmd, new_pmd, need_rmap_locks))
|
|
|
|
continue;
|
2016-01-16 08:53:39 +08:00
|
|
|
split_huge_pmd(vma, old_pmd, old_addr);
|
2020-12-15 11:07:30 +08:00
|
|
|
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
|
|
|
|
extent == PMD_SIZE) {
|
2019-01-04 07:28:38 +08:00
|
|
|
/*
|
|
|
|
* If the extent is PMD-sized, try to speed the move by
|
|
|
|
* moving at the PMD level if possible.
|
|
|
|
*/
|
2020-12-15 11:07:30 +08:00
|
|
|
if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
|
2021-07-08 09:10:15 +08:00
|
|
|
old_pmd, new_pmd, true))
|
2019-01-04 07:28:38 +08:00
|
|
|
continue;
|
thp: mremap support and TLB optimization
This adds THP support to mremap (decreases the number of split_huge_page()
calls).
Here are also some benchmarks with a proggy like this:
===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (5UL*1024*1024*1024)
int main()
{
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
                perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
                perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
                perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);

        gettimeofday(&oldstamp, NULL);
        p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        gettimeofday(&newstamp, NULL);
        diffsec = newstamp.tv_sec - oldstamp.tv_sec;
        diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
        printf("usec %ld\n", diffsec);
        if (p == MAP_FAILED || p4 != p3)
        //if (p == MAP_FAILED)
                perror("mremap"), exit(1);
        if (memcmp(p4, p2, SIZE))
                printf("mremap bug\n"), exit(1);
        printf("ok\n");
        return 0;
}
===
THP on
Performance counter stats for './largepage13' (3 runs):
69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)
7.314454164 seconds time elapsed ( +- 0.023% )
THP off
Performance counter stats for './largepage13' (3 runs):
1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)
9.693655243 seconds time elapsed ( +- 0.069% )
grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess with a threaded program that sends more IPIs on a large SMP system
it'd create an even larger difference.
All debug options are off except DEBUG_VM to avoid skewing the results.
The only problem with native 2M mremap, as it happens above, is that both
the source and destination addresses must be 2M aligned, or the huge pmd
can't be moved without a split; but that is a hardware limitation.
[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-01 08:08:30 +08:00
|
|
|
}
|
2023-06-09 09:32:47 +08:00
|
|
|
if (pmd_none(*old_pmd))
|
|
|
|
continue;
|
mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap".
This series speeds up the mremap(2) syscall by copying page tables at
the PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work. Also we find that there is no point in passing the 'address' to
pte_alloc since its unused. This patch therefore removes this argument
tree-wide resulting in a nice negative diff as well. Also ensuring
along the way that the enabled architectures do not do anything funky
with the 'address' argument that goes unnoticed by the optimization.
Build and boot tested on x86-64. Build tested on arm64. The config
enablement patch for arm64 will be posted in the future after more
testing.
The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.
// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.
virtual patch
@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@
fn(...
- , T2 E2
)
{ ... }
@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)
@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)
@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
fn(...
-, E2
)
@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@
(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)
Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 07:28:34 +08:00
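Applied to a call site like the one just below, the script's effect is roughly the following (the 'before' form, with the trailing address argument, is reconstructed from the rule and should be read as approximate):

/* before: the unused address was still threaded through */
if (pte_alloc(new_vma->vm_mm, new_pmd, new_addr))
        break;

/* after: the address argument is removed tree-wide */
if (pte_alloc(new_vma->vm_mm, new_pmd))
        break;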
|
|
|
if (pte_alloc(new_vma->vm_mm, new_pmd))
|
2011-11-01 08:08:30 +08:00
|
|
|
break;
|
2023-06-09 09:32:47 +08:00
|
|
|
if (move_ptes(vma, old_pmd, old_addr, old_addr + extent,
|
|
|
|
new_vma, new_pmd, new_addr, need_rmap_locks) < 0)
|
|
|
|
goto again;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2011-11-01 08:08:26 +08:00
|
|
|
|
2018-12-28 16:38:09 +08:00
|
|
|
mmu_notifier_invalidate_range_end(&range);
|
2005-10-30 09:16:00 +08:00
|
|
|
|
|
|
|
return len + old_addr - old_end; /* how much done */
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long move_vma(struct vm_area_struct *vma,
|
|
|
|
unsigned long old_addr, unsigned long old_len,
|
2017-02-23 07:42:34 +08:00
|
|
|
unsigned long new_len, unsigned long new_addr,
|
mm/mremap: add MREMAP_DONTUNMAP to mremap()
When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set,
the source mapping will not be removed. The remap operation will be
performed as it would have been normally by moving over the page tables to
the new mapping. The old vma will have any locked flags cleared, have no
pagetables, and any userfaultfds that were watching that range will
continue watching it.
For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
the mremap() call to fail. Because MREMAP_DONTUNMAP always results in
moving a VMA, you MUST use the MREMAP_MAYMOVE flag. It's not possible to
resize a VMA while also moving it with MREMAP_DONTUNMAP, so old_len must
always be equal to new_len; otherwise the call will return -EINVAL.
We hope to use this in Chrome OS where with userfaultfd we could write an
anonymous mapping to disk without having to STOP the process or worry
about VMA permission changes.
This feature also has a use case in Android, Lokesh Gidra has said that
"As part of using userfaultfd for GC, We'll have to move the physical
pages of the java heap to a separate location. For this purpose mremap
will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
heap, its virtual mapping will be removed as well. Therefore, we'll
require performing mmap immediately after. This is not only time
consuming but also opens a time window where a native thread may call mmap
and reserve the java heap's address range for its own usage. This flag
solves the problem."
[bgeffon@google.com: v6]
Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com
[bgeffon@google.com: v7]
Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 12:09:17 +08:00
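A minimal userspace sketch of the new flag follows (the MREMAP_DONTUNMAP fallback definition is an assumption for pre-5.7 headers; note old_len == new_len and MREMAP_MAYMOVE, as required above):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4      /* assumed fallback for older headers */
#endif

int main(void)
{
        size_t len = 2 * 1024 * 1024;
        void *old, *new;

        /* anonymous private mapping: the kind MREMAP_DONTUNMAP accepts per the text above */
        old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (old == MAP_FAILED)
                return 1;

        new = mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
        if (new == MAP_FAILED)
                return 1;

        /* 'old' stays mapped but now has no page tables; the pages moved to 'new' */
        printf("moved %p -> %p\n", old, new);
        return 0;
}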
|
|
|
bool *locked, unsigned long flags,
|
|
|
|
struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2021-11-06 04:39:06 +08:00
|
|
|
long to_account = new_len - old_len;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
|
|
|
struct vm_area_struct *new_vma;
|
|
|
|
unsigned long vm_flags = vma->vm_flags;
|
|
|
|
unsigned long new_pgoff;
|
|
|
|
unsigned long moved_len;
|
2023-01-21 00:26:39 +08:00
|
|
|
unsigned long account_start = 0;
|
|
|
|
unsigned long account_end = 0;
|
[PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 09:16:18 +08:00
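A tiny sketch of such a collector, reading the VmPeak/VmHWM lines this patch adds to /proc/<pid>/status (shown here for the current process):

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                /* VmPeak: peak total_vm; VmHWM: peak (high-water-mark) rss */
                if (!strncmp(line, "VmPeak:", 7) || !strncmp(line, "VmHWM:", 6))
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}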
|
|
|
unsigned long hiwater_vm;
|
2020-12-15 11:08:21 +08:00
|
|
|
int err = 0;
|
2012-10-09 07:31:50 +08:00
|
|
|
bool need_rmap_locks;
|
2023-01-21 00:26:39 +08:00
|
|
|
struct vma_iterator vmi;
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We'd prefer to avoid failure later on in do_munmap:
|
|
|
|
* which may split one vma into three before unmapping.
|
|
|
|
*/
|
|
|
|
if (mm->map_count >= sysctl_max_map_count - 3)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2021-11-06 04:39:06 +08:00
|
|
|
if (unlikely(flags & MREMAP_DONTUNMAP))
|
|
|
|
to_account = new_len;
|
|
|
|
|
2020-12-15 11:08:21 +08:00
|
|
|
if (vma->vm_ops && vma->vm_ops->may_split) {
|
|
|
|
if (vma->vm_start != old_addr)
|
|
|
|
err = vma->vm_ops->may_split(vma, old_addr);
|
|
|
|
if (!err && vma->vm_end != old_addr + old_len)
|
|
|
|
err = vma->vm_ops->may_split(vma, old_addr + old_len);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2009-09-22 08:02:05 +08:00
|
|
|
/*
|
|
|
|
* Advise KSM to break any KSM pages in the area to be moved:
|
|
|
|
* it would be confusing if they were to turn up at the new
|
|
|
|
* location, where they happen to coincide with different KSM
|
|
|
|
* pages recently unmapped. But leave vma->vm_flags as it was,
|
|
|
|
* so KSM can come around to merge on vma and new_vma afterwards.
|
|
|
|
*/
|
2009-09-22 08:02:28 +08:00
|
|
|
err = ksm_madvise(vma, old_addr, old_addr + old_len,
|
|
|
|
MADV_UNMERGEABLE, &vm_flags);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2009-09-22 08:02:05 +08:00
|
|
|
|
2021-11-06 04:39:06 +08:00
|
|
|
if (vm_flags & VM_ACCOUNT) {
|
|
|
|
if (security_vm_enough_memory_mm(mm, to_account >> PAGE_SHIFT))
|
2020-12-15 11:08:09 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2023-02-28 01:36:16 +08:00
|
|
|
vma_start_write(vma);
|
2005-04-17 06:20:36 +08:00
|
|
|
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
|
2012-10-09 07:31:50 +08:00
|
|
|
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
|
|
|
|
&need_rmap_locks);
|
2020-12-15 11:08:09 +08:00
|
|
|
if (!new_vma) {
|
2021-11-06 04:39:06 +08:00
|
|
|
if (vm_flags & VM_ACCOUNT)
|
|
|
|
vm_unacct_memory(to_account >> PAGE_SHIFT);
|
2005-04-17 06:20:36 +08:00
|
|
|
return -ENOMEM;
|
2020-12-15 11:08:09 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2012-10-09 07:31:50 +08:00
|
|
|
moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
|
|
|
|
need_rmap_locks);
|
2005-04-17 06:20:36 +08:00
|
|
|
if (moved_len < old_len) {
|
2015-09-05 06:48:01 +08:00
|
|
|
err = -ENOMEM;
|
2015-09-05 06:48:04 +08:00
|
|
|
} else if (vma->vm_ops && vma->vm_ops->mremap) {
|
2021-04-30 13:57:48 +08:00
|
|
|
err = vma->vm_ops->mremap(new_vma);
|
2015-09-05 06:48:01 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (unlikely(err)) {
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* On error, move entries back from new area to old,
|
|
|
|
* which will succeed since page tables still there,
|
|
|
|
* and then proceed to unmap new area instead of old.
|
|
|
|
*/
|
2012-10-09 07:31:50 +08:00
|
|
|
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
|
|
|
|
true);
|
2005-04-17 06:20:36 +08:00
|
|
|
vma = new_vma;
|
|
|
|
old_len = new_len;
|
|
|
|
old_addr = new_addr;
|
2015-09-05 06:48:01 +08:00
|
|
|
new_addr = err;
|
2015-06-25 07:56:19 +08:00
|
|
|
} else {
|
2017-02-23 07:42:34 +08:00
|
|
|
mremap_userfaultfd_prep(new_vma, uf);
|
2015-04-07 05:48:54 +08:00
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-11-06 04:41:40 +08:00
|
|
|
if (is_vm_hugetlb_page(vma)) {
|
|
|
|
clear_vma_resv_huge_pages(vma);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* Conceal VM_ACCOUNT so old reservation is not undone */
|
2020-12-15 11:08:09 +08:00
|
|
|
if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) {
|
2023-01-27 03:37:49 +08:00
|
|
|
vm_flags_clear(vma, VM_ACCOUNT);
|
2023-01-21 00:26:39 +08:00
|
|
|
if (vma->vm_start < old_addr)
|
|
|
|
account_start = vma->vm_start;
|
|
|
|
if (vma->vm_end > old_addr + old_len)
|
|
|
|
account_end = vma->vm_end;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2005-05-17 12:53:18 +08:00
|
|
|
/*
|
2005-10-30 09:16:18 +08:00
|
|
|
* If we failed to move page tables we still do total_vm increment
|
|
|
|
* since do_munmap() will decrement it by old_len == new_len.
|
|
|
|
*
|
|
|
|
* Since total_vm is about to be raised artificially high for a
|
|
|
|
* moment, we need to restore high watermark afterwards: if stats
|
|
|
|
* are taken meanwhile, total_vm and hiwater_vm appear too high.
|
|
|
|
* If this were a serious issue, we'd add a flag to do_munmap().
|
2005-05-17 12:53:18 +08:00
|
|
|
*/
|
[PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 09:16:18 +08:00
|
|
|
hiwater_vm = mm->hiwater_vm;
|
2016-01-15 07:22:07 +08:00
|
|
|
vm_stat_account(mm, vma->vm_flags, new_len >> PAGE_SHIFT);
|
2005-05-17 12:53:18 +08:00
|
|
|
|
2015-12-23 08:54:23 +08:00
|
|
|
/* Tell pfnmap has moved from this vma */
|
|
|
|
if (unlikely(vma->vm_flags & VM_PFNMAP))
|
2023-02-17 10:56:15 +08:00
|
|
|
untrack_pfn_clear(vma);
|
2015-12-23 08:54:23 +08:00
|
|
|
|
mm/mremap: add MREMAP_DONTUNMAP to mremap()
When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set,
the source mapping will not be removed. The remap operation will be
performed as it would have been normally by moving over the page tables to
the new mapping. The old vma will have any locked flags cleared, have no
pagetables, and any userfaultfds that were watching that range will
continue watching it.
For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
the mremap() call to fail. Because MREMAP_DONTUNMAP always results in
moving a VMA you MUST use the MREMAP_MAYMOVE flag, it's not possible to
resize a VMA while also moving with MREMAP_DONTUNMAP so old_len must
always be equal to the new_len otherwise it will return -EINVAL.
We hope to use this in Chrome OS where with userfaultfd we could write an
anonymous mapping to disk without having to STOP the process or worry
about VMA permission changes.
This feature also has a use case in Android, Lokesh Gidra has said that
"As part of using userfaultfd for GC, We'll have to move the physical
pages of the java heap to a separate location. For this purpose mremap
will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
heap, its virtual mapping will be removed as well. Therefore, we'll
require performing mmap immediately after. This is not only time
consuming but also opens a time window where a native thread may call mmap
and reserve the java heap's address range for its own usage. This flag
solves the problem."
[bgeffon@google.com: v6]
Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com
[bgeffon@google.com: v7]
Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 12:09:17 +08:00
|
|
|
	if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
		/* We always clear VM_LOCKED[ONFAULT] on the old vma */
		vm_flags_clear(vma, VM_LOCKED_MASK);

		/*
		 * anon_vma links of the old vma are no longer needed after its
		 * page table has been moved.
		 */
		if (new_vma != vma && vma->vm_start == old_addr &&
			vma->vm_end == (old_addr + old_len))
			unlink_anon_vmas(vma);

		/* Because we won't unmap we don't need to touch locked_vm */
		return new_addr;
	}

	vma_iter_init(&vmi, mm, old_addr);
	if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {
		/* OOM: unable to split vma, just get accounts right */
		if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
			vm_acct_memory(old_len >> PAGE_SHIFT);
		account_start = account_end = 0;
	}

	if (vm_flags & VM_LOCKED) {
		mm->locked_vm += new_len >> PAGE_SHIFT;
		*locked = true;
	}

	mm->hiwater_vm = hiwater_vm;

	/* Restore VM_ACCOUNT if one or two pieces of vma left */
	if (account_start) {
		vma = vma_prev(&vmi);
		vm_flags_set(vma, VM_ACCOUNT);
	}

	if (account_end) {
		vma = vma_next(&vmi);
		vm_flags_set(vma, VM_ACCOUNT);
	}

	return new_addr;
}

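/*
 * Hedged, illustrative userspace sketch (not part of this file): one way a
 * caller might exercise the MREMAP_DONTUNMAP branch handled in move_vma()
 * above. The pages and page tables move to the new address while the old
 * mapping stays in place, now empty. dontunmap_demo() and its parameters
 * are made up for illustration; flag availability depends on uapi headers.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4	/* value from include/uapi/linux/mman.h */
#endif

static int dontunmap_demo(size_t len)
{
	unsigned char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (old == MAP_FAILED)
		return -1;
	memset(old, 0xaa, len);

	/* old_len must equal new_len, and MREMAP_MAYMOVE is mandatory */
	unsigned char *newp = mremap(old, len, len,
				     MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
	if (newp == MAP_FAILED)
		return -1;

	/* old is still mapped but empty; the data is visible through newp */
	printf("old=%p new=%p first byte=%#x\n",
	       (void *)old, (void *)newp, newp[0]);
	return 0;
}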
static struct vm_area_struct *vma_to_resize(unsigned long addr,
	unsigned long old_len, unsigned long new_len, unsigned long flags)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	unsigned long pgoff;

	vma = vma_lookup(mm, addr);
	if (!vma)
		return ERR_PTR(-EFAULT);

	/*
	 * !old_len is a special case where an attempt is made to 'duplicate'
	 * a mapping. This makes no sense for private mappings as it will
	 * instead create a fresh/new mapping unrelated to the original. This
	 * is contrary to the basic idea of mremap which creates new mappings
	 * based on the original. There are no known use cases for this
	 * behavior. As a result, fail such attempts.
	 */
	if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) {
		pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n", current->comm, current->pid);
		return ERR_PTR(-EINVAL);
	}

	if ((flags & MREMAP_DONTUNMAP) &&
			(vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
		return ERR_PTR(-EINVAL);

	/* We can't remap across vm area boundaries */
	if (old_len > vma->vm_end - addr)
		return ERR_PTR(-EFAULT);

	if (new_len == old_len)
		return vma;

	/* Need to be careful about a growing mapping */
	pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
	pgoff += vma->vm_pgoff;
	if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
		return ERR_PTR(-EINVAL);

	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
		return ERR_PTR(-EFAULT);

	if (!mlock_future_ok(mm, vma->vm_flags, new_len - old_len))
		return ERR_PTR(-EAGAIN);

	if (!may_expand_vm(mm, vma->vm_flags,
				(new_len - old_len) >> PAGE_SHIFT))
		return ERR_PTR(-ENOMEM);

	return vma;
}

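/*
 * Hedged, illustrative userspace sketch (not part of this file): the
 * !old_len special case that vma_to_resize() rejects above. With
 * old_len == 0, mremap() duplicates a mapping, which only makes sense for
 * shared mappings; for a private one the kernel returns EINVAL.
 * zero_old_len_demo() is a made-up name; error handling is trimmed.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

static void zero_old_len_demo(size_t len)
{
	void *shr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	void *prv = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Succeeds: a second mapping of the same shared pages */
	void *dup_ok = mremap(shr, 0, len, MREMAP_MAYMOVE);
	printf("shared dup: %p\n", dup_ok);

	/* Fails: duplicating private pages is unsupported (EINVAL) */
	void *dup_bad = mremap(prv, 0, len, MREMAP_MAYMOVE);
	printf("private dup: %p errno=%d\n", dup_bad,
	       dup_bad == MAP_FAILED ? errno : 0);
}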
static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
		unsigned long new_addr, unsigned long new_len, bool *locked,
		unsigned long flags, struct vm_userfaultfd_ctx *uf,
		struct list_head *uf_unmap_early,
		struct list_head *uf_unmap)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	unsigned long ret = -EINVAL;
	unsigned long map_flags = 0;

	if (offset_in_page(new_addr))
		goto out;

	if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
		goto out;

	/* Ensure the old/new locations do not overlap */
	if (addr + old_len > new_addr && new_addr + new_len > addr)
		goto out;

	/*
	 * move_vma() needs us to stay 4 maps below the threshold, otherwise
	 * it will bail out at the very beginning.
	 * That is a problem if we have already unmapped the regions here
	 * (new_addr and old_addr), because userspace will not know the
	 * state of the vmas after it gets -ENOMEM.
	 * So, to avoid such a scenario we can pre-compute whether the whole
	 * operation has a high chance of succeeding map-wise.
	 * The worst case is when both vmas (new_addr and old_addr) get
	 * split in 3 before being unmapped.
	 * That means 2 more maps (1 for each) on top of the ones we already
	 * hold. Check whether current map count plus 2 still leaves us 4 maps
	 * below the threshold, otherwise return -ENOMEM here to be safe.
	 */
	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
		return -ENOMEM;

	if (flags & MREMAP_FIXED) {
		ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
		if (ret)
			goto out;
	}

	if (old_len > new_len) {
		ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);
		if (ret)
			goto out;
		old_len = new_len;
	}

	vma = vma_to_resize(addr, old_len, new_len, flags);
	if (IS_ERR(vma)) {
		ret = PTR_ERR(vma);
		goto out;
	}

	/* MREMAP_DONTUNMAP expands by old_len since old_len == new_len */
	if (flags & MREMAP_DONTUNMAP &&
		!may_expand_vm(mm, vma->vm_flags, old_len >> PAGE_SHIFT)) {
		ret = -ENOMEM;
		goto out;
	}

	if (flags & MREMAP_FIXED)
		map_flags |= MAP_FIXED;

	if (vma->vm_flags & VM_MAYSHARE)
		map_flags |= MAP_SHARED;

	ret = get_unmapped_area(vma->vm_file, new_addr, new_len, vma->vm_pgoff +
				((addr - vma->vm_start) >> PAGE_SHIFT),
				map_flags);
	if (IS_ERR_VALUE(ret))
		goto out;

	/* We got a new mapping */
	if (!(flags & MREMAP_FIXED))
		new_addr = ret;

	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, flags, uf,
		       uf_unmap);

out:
	return ret;
}

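/*
 * Hedged, illustrative userspace sketch (not part of this file): the
 * MREMAP_FIXED path that mremap_to() implements above. The destination
 * range is unmapped first, and the old and new ranges must not overlap.
 * fixed_move_demo() is a made-up helper name.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

static void *fixed_move_demo(void *old, size_t len, void *dst)
{
	/* MREMAP_FIXED must be combined with MREMAP_MAYMOVE; dst is page aligned */
	void *newp = mremap(old, len, len,
			    MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (newp == MAP_FAILED)
		perror("mremap(MREMAP_FIXED)");
	return newp;	/* equals dst on success */
}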
static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
{
	unsigned long end = vma->vm_end + delta;

	if (end < vma->vm_end) /* overflow */
		return 0;
	if (find_vma_intersection(vma->vm_mm, vma->vm_end, end))
		return 0;
	if (get_unmapped_area(NULL, vma->vm_start, end - vma->vm_start,
			      0, MAP_FIXED) & ~PAGE_MASK)
		return 0;
	return 1;
}

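/*
 * Hedged, illustrative userspace sketch (not part of this file): the
 * in-place expansion that vma_expandable() above gatekeeps. Growing a
 * mapping without MREMAP_MAYMOVE only succeeds when no other mapping sits
 * directly after the current one; otherwise mremap() fails with ENOMEM.
 * grow_in_place_demo() is a made-up helper name.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

static void *grow_in_place_demo(void *addr, size_t old_len, size_t new_len)
{
	/* flags == 0: the start address must stay put */
	void *p = mremap(addr, old_len, new_len, 0);
	if (p == MAP_FAILED)
		fprintf(stderr, "in-place grow failed: errno=%d\n", errno);
	return p;
}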
/*
|
|
|
|
* Expand (or shrink) an existing mapping, potentially moving it at the
|
|
|
|
* same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
|
|
|
|
*
|
|
|
|
* MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
|
|
|
|
* This option implies MREMAP_MAYMOVE.
|
|
|
|
*/
|
2012-05-30 23:32:04 +08:00
|
|
|
SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
|
|
|
|
unsigned long, new_len, unsigned long, flags,
|
|
|
|
unsigned long, new_addr)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2005-10-30 09:16:16 +08:00
|
|
|
struct mm_struct *mm = current->mm;
|
2005-04-17 06:20:36 +08:00
|
|
|
struct vm_area_struct *vma;
|
|
|
|
unsigned long ret = -EINVAL;
|
2013-02-23 08:32:41 +08:00
|
|
|
bool locked = false;
|
2017-02-23 07:42:34 +08:00
|
|
|
struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
|
2017-08-03 04:31:55 +08:00
|
|
|
LIST_HEAD(uf_unmap_early);
|
2017-02-25 06:58:22 +08:00
|
|
|
LIST_HEAD(uf_unmap);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-03-25 19:13:46 +08:00
|
|
|
/*
|
|
|
|
* There is a deliberate asymmetry here: we strip the pointer tag
|
|
|
|
* from the old address but leave the new address alone. This is
|
|
|
|
* for consistency with mmap(), where we prevent the creation of
|
|
|
|
* aliasing mappings in userspace by leaving the tag bits of the
|
|
|
|
* mapping address intact. A non-zero tag will cause the subsequent
|
|
|
|
* range checks to reject the address as invalid.
|
|
|
|
*
|
2023-06-12 20:12:24 +08:00
|
|
|
* See Documentation/arch/arm64/tagged-address-abi.rst for more
|
|
|
|
* information.
|
2020-03-25 19:13:46 +08:00
|
|
|
*/
|
mm: untag user pointers passed to memory syscalls
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory
syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
mremap, msync, munlock, move_pages.
The mmap and mremap syscalls do not currently accept tagged addresses.
Architectures may interpret the tag as a background colour for the
corresponding vma.
Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
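
As an illustration only, a hypothetical arm64 userspace sketch of what this series permits: after opting in to the tagged address ABI with prctl(), a pointer whose top byte is non-zero can be passed as the old/source address of the syscalls listed above (madvise() is used here); a tagged destination address for mmap()/mremap() would still be rejected. The prctl constants are defined locally in case older headers lack them.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_TAGGED_ADDR_ENABLE   (1UL << 0)
#endif

int main(void)
{
        size_t len = 4096;

        /* Opt in to the arm64 tagged address ABI; fails on other arches. */
        if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0))
                return 1;

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Put an arbitrary tag (0x2a) into the top byte of the address. */
        void *tagged = (void *)((uintptr_t)p | (0x2aUL << 56));

        /* The kernel untags the source address, so this call now succeeds. */
        printf("madvise(tagged) = %d\n", madvise(tagged, len, MADV_DONTNEED));
        return 0;
}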

        addr = untagged_addr(addr);

        if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
                return ret;

        if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
                return ret;

        /*
         * MREMAP_DONTUNMAP is always a move and it does not allow resizing
         * in the process.
         */
        if (flags & MREMAP_DONTUNMAP &&
                        (!(flags & MREMAP_MAYMOVE) || old_len != new_len))
                return ret;

        if (offset_in_page(addr))
                return ret;

        old_len = PAGE_ALIGN(old_len);
        new_len = PAGE_ALIGN(new_len);

        /*
         * We allow a zero old-len as a special case
         * for DOS-emu "duplicate shm area" thing. But
         * a zero new-len is nonsensical.
         */
        if (!new_len)
                return ret;

        if (mmap_write_lock_killable(current->mm))
                return -EINTR;
        vma = vma_lookup(mm, addr);
        if (!vma) {
                ret = -EFAULT;
                goto out;
        }

        if (is_vm_hugetlb_page(vma)) {
                struct hstate *h __maybe_unused = hstate_vma(vma);

                old_len = ALIGN(old_len, huge_page_size(h));
                new_len = ALIGN(new_len, huge_page_size(h));

                /* addrs must be huge page aligned */
                if (addr & ~huge_page_mask(h))
                        goto out;
                if (new_addr & ~huge_page_mask(h))
                        goto out;

                /*
                 * Don't allow remap expansion, because the underlying hugetlb
                 * reservation is not yet capable to handle split reservation.
                 */
                if (new_len > old_len)
                        goto out;
        }

        if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP)) {
                ret = mremap_to(addr, old_len, new_addr, new_len,
                                &locked, flags, &uf, &uf_unmap_early,
                                &uf_unmap);
                goto out;
        }

        /*
         * Always allow a shrinking remap: that just unmaps
         * the unnecessary pages..
         * do_vmi_munmap does all the needed commit accounting, and
         * unlocks the mmap_lock if so directed.
         */
        if (old_len >= new_len) {
                VMA_ITERATOR(vmi, mm, addr + new_len);

                if (old_len == new_len) {
                        ret = addr;
                        goto out;
                }

                ret = do_vmi_munmap(&vmi, mm, addr + new_len, old_len - new_len,
                                    &uf_unmap, true);
                if (ret)
                        goto out;

                ret = addr;
                goto out_unlocked;
        }

        /*
         * Ok, we need to grow..
         */
        vma = vma_to_resize(addr, old_len, new_len, flags);
        if (IS_ERR(vma)) {
                ret = PTR_ERR(vma);
                goto out;
        }

        /* old_len exactly to the end of the area..
         */
        if (old_len == vma->vm_end - addr) {
                /* can we just expand the current mapping? */
                if (vma_expandable(vma, new_len - old_len)) {
                        long pages = (new_len - old_len) >> PAGE_SHIFT;
                        unsigned long extension_start = addr + old_len;
                        unsigned long extension_end = addr + new_len;
                        pgoff_t extension_pgoff = vma->vm_pgoff +
                                ((extension_start - vma->vm_start) >> PAGE_SHIFT);
                        VMA_ITERATOR(vmi, mm, extension_start);

                        if (vma->vm_flags & VM_ACCOUNT) {
                                if (security_vm_enough_memory_mm(mm, pages)) {
                                        ret = -ENOMEM;
                                        goto out;
                                }
                        }

                        /*
                         * Function vma_merge() is called on the extension we
                         * are adding to the already existing vma, vma_merge()
                         * will merge this extension with the already existing
                         * vma (expand operation itself) and possibly also with
                         * the next vma if it becomes adjacent to the expanded
                         * vma and otherwise compatible.
                         */
                        vma = vma_merge(&vmi, mm, vma, extension_start,
                                extension_end, vma->vm_flags, vma->anon_vma,
                                vma->vm_file, extension_pgoff, vma_policy(vma),
                                vma->vm_userfaultfd_ctx, anon_vma_name(vma));
                        if (!vma) {
                                vm_unacct_memory(pages);

mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

                                ret = -ENOMEM;
                                goto out;
                        }

                        vm_stat_account(mm, vma->vm_flags, pages);
                        if (vma->vm_flags & VM_LOCKED) {
                                mm->locked_vm += pages;
                                locked = true;
                                new_addr = addr;
                        }
                        ret = addr;
                        goto out;
                }
        }

        /*
         * We weren't able to just expand or shrink the area,
         * we need to create a new one and move it..
         */
        ret = -ENOMEM;
        if (flags & MREMAP_MAYMOVE) {
                unsigned long map_flags = 0;
                if (vma->vm_flags & VM_MAYSHARE)
                        map_flags |= MAP_SHARED;

                new_addr = get_unmapped_area(vma->vm_file, 0, new_len,
                                        vma->vm_pgoff +
                                        ((addr - vma->vm_start) >> PAGE_SHIFT),
                                        map_flags);
                if (IS_ERR_VALUE(new_addr)) {
                        ret = new_addr;
                        goto out;
                }

                ret = move_vma(vma, addr, old_len, new_len, new_addr,
                               &locked, flags, &uf, &uf_unmap);
        }
out:
        if (offset_in_page(ret))
                locked = false;
        mmap_write_unlock(current->mm);
        if (locked && new_len > old_len)
                mm_populate(new_addr + old_len, new_len - old_len);
out_unlocked:
        userfaultfd_unmap_complete(mm, &uf_unmap_early);
        mremap_userfaultfd_complete(&uf, addr, ret, old_len);
        userfaultfd_unmap_complete(mm, &uf_unmap);
        return ret;
}
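
To close the listing, a minimal userspace sketch (assuming glibc's mremap() wrapper, a 4 KiB page size and an anonymous private mapping) that exercises the paths above: a shrinking remap that only unmaps the tail, and a growing remap that is expanded in place when the adjacent range is free or moved when MREMAP_MAYMOVE allows it.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t page = 4096;     /* assumed page size; use sysconf() in real code */
        char *p = mmap(NULL, 8 * page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Shrink: just unmaps the tail of the range, never moves. */
        char *q = mremap(p, 8 * page, 4 * page, 0);
        if (q == MAP_FAILED)
                return 1;

        /* Grow: expanded in place if the next range is free, else moved. */
        char *r = mremap(q, 4 * page, 64 * page, MREMAP_MAYMOVE);
        if (r == MAP_FAILED)
                return 1;

        printf("start %p, after shrink %p, after grow %p\n",
               (void *)p, (void *)q, (void *)r);
        return 0;
}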