2019-06-04 16:11:32 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* fs/userfaultfd.c
|
|
|
|
*
|
|
|
|
* Copyright (C) 2007 Davide Libenzi <davidel@xmailserver.org>
|
|
|
|
* Copyright (C) 2008-2009 Red Hat, Inc.
|
|
|
|
* Copyright (C) 2015 Red Hat, Inc.
|
|
|
|
*
|
|
|
|
* Some part derived from fs/eventfd.c (anon inode setup) and
|
|
|
|
* mm/ksm.c (mm hashing).
|
|
|
|
*/
|
|
|
|
|
2017-02-23 07:42:21 +08:00
|
|
|
#include <linux/list.h>
|
2015-09-05 06:46:31 +08:00
|
|
|
#include <linux/hashtable.h>
|
2017-02-03 02:15:33 +08:00
|
|
|
#include <linux/sched/signal.h>
|
2017-02-09 01:51:29 +08:00
|
|
|
#include <linux/sched/mm.h>
|
2015-09-05 06:46:31 +08:00
|
|
|
#include <linux/mm.h>
|
2022-01-15 06:06:07 +08:00
|
|
|
#include <linux/mm_inline.h>
|
2021-05-05 09:33:13 +08:00
|
|
|
#include <linux/mmu_notifier.h>
|
2015-09-05 06:46:31 +08:00
|
|
|
#include <linux/poll.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/seq_file.h>
|
|
|
|
#include <linux/file.h>
|
|
|
|
#include <linux/bug.h>
|
|
|
|
#include <linux/anon_inodes.h>
|
|
|
|
#include <linux/syscalls.h>
|
|
|
|
#include <linux/userfaultfd_k.h>
|
|
|
|
#include <linux/mempolicy.h>
|
|
|
|
#include <linux/ioctl.h>
|
|
|
|
#include <linux/security.h>
|
2017-02-23 07:43:04 +08:00
|
|
|
#include <linux/hugetlb.h>
|
2022-05-13 11:22:52 +08:00
|
|
|
#include <linux/swapops.h>
|
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
[axelrasmussen@google.com: Handle misc_register() failure properly]
Link: https://lkml.kernel.org/r/20220819205201.658693-3-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220808175614.3885028-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-09 01:56:11 +08:00
|
|
|
#include <linux/miscdevice.h>
|
2015-09-05 06:46:31 +08:00
|
|
|
|
userfaultfd: add user-mode only option to unprivileged_userfaultfd sysctl knob
With this change, when the knob is set to 0, it allows unprivileged users
to call userfaultfd, like when it is set to 1, but with the restriction
that page faults from only user-mode can be handled. In this mode, an
unprivileged user (without SYS_CAP_PTRACE capability) must pass
UFFD_USER_MODE_ONLY to userfaultd or the API will fail with EPERM.
This enables administrators to reduce the likelihood that an attacker with
access to userfaultfd can delay faulting kernel code to widen timing
windows for other exploits.
The default value of this knob is changed to 0. This is required for
correct functioning of pipe mutex. However, this will fail postcopy live
migration, which will be unnoticeable to the VM guests. To avoid this,
set 'vm.userfault = 1' in /sys/sysctl.conf.
The main reason this change is desirable as in the short term is that the
Android userland will behave as with the sysctl set to zero. So without
this commit, any Linux binary using userfaultfd to manage its memory would
behave differently if run within the Android userland. For more details,
refer to Andrea's reply [1].
[1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/
Link: https://lkml.kernel.org/r/20201120030411.2690816-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: <calin@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 11:13:54 +08:00
|
|
|
int sysctl_unprivileged_userfaultfd __read_mostly;
|
userfaultfd/sysctl: add vm.unprivileged_userfaultfd
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 08:16:41 +08:00
|
|
|
|
2015-09-05 06:46:48 +08:00
|
|
|
static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Start with fault_pending_wqh and fault_wqh so they're more likely
|
|
|
|
* to be in the same cacheline.
|
2019-07-05 06:14:39 +08:00
|
|
|
*
|
|
|
|
* Locking order:
|
|
|
|
* fd_wqh.lock
|
|
|
|
* fault_pending_wqh.lock
|
|
|
|
* fault_wqh.lock
|
|
|
|
* event_wqh.lock
|
|
|
|
*
|
|
|
|
* To avoid deadlocks, IRQs must be disabled when taking any of the above locks,
|
|
|
|
* since fd_wqh.lock is taken by aio_poll() while it's holding a lock that's
|
|
|
|
* also taken in IRQ context.
|
2015-09-05 06:46:48 +08:00
|
|
|
*/
|
2015-09-05 06:46:31 +08:00
|
|
|
struct userfaultfd_ctx {
|
2015-09-05 06:46:44 +08:00
|
|
|
/* waitqueue head for the pending (i.e. not read) userfaults */
|
|
|
|
wait_queue_head_t fault_pending_wqh;
|
|
|
|
/* waitqueue head for the userfaults */
|
2015-09-05 06:46:31 +08:00
|
|
|
wait_queue_head_t fault_wqh;
|
|
|
|
/* waitqueue head for the pseudo fd to wakeup poll/read */
|
|
|
|
wait_queue_head_t fd_wqh;
|
2017-02-23 07:42:21 +08:00
|
|
|
/* waitqueue head for events */
|
|
|
|
wait_queue_head_t event_wqh;
|
2015-09-05 06:47:23 +08:00
|
|
|
/* a refile sequence protected by fault_pending_wqh lock */
|
2020-07-20 23:55:28 +08:00
|
|
|
seqcount_spinlock_t refile_seq;
|
2015-09-05 06:46:48 +08:00
|
|
|
/* pseudo fd refcounting */
|
2018-12-28 16:34:43 +08:00
|
|
|
refcount_t refcount;
|
2015-09-05 06:46:31 +08:00
|
|
|
/* userfaultfd syscall flags */
|
|
|
|
unsigned int flags;
|
2017-02-23 07:42:21 +08:00
|
|
|
/* features requested from the userspace */
|
|
|
|
unsigned int features;
|
2015-09-05 06:46:31 +08:00
|
|
|
/* released */
|
|
|
|
bool released;
|
userfaultfd: prevent non-cooperative events vs mcopy_atomic races
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 08:09:25 +08:00
|
|
|
/* memory mappings are changing because of non-cooperative event */
|
2021-09-03 05:58:56 +08:00
|
|
|
atomic_t mmap_changing;
|
2015-09-05 06:46:31 +08:00
|
|
|
/* mm with one ore more vmas attached to this userfaultfd_ctx */
|
|
|
|
struct mm_struct *mm;
|
|
|
|
};
|
|
|
|
|
2017-02-23 07:42:27 +08:00
|
|
|
struct userfaultfd_fork_ctx {
|
|
|
|
struct userfaultfd_ctx *orig;
|
|
|
|
struct userfaultfd_ctx *new;
|
|
|
|
struct list_head list;
|
|
|
|
};
|
|
|
|
|
2017-02-25 06:58:22 +08:00
|
|
|
struct userfaultfd_unmap_ctx {
|
|
|
|
struct userfaultfd_ctx *ctx;
|
|
|
|
unsigned long start;
|
|
|
|
unsigned long end;
|
|
|
|
struct list_head list;
|
|
|
|
};
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
struct userfaultfd_wait_queue {
|
2015-09-05 06:46:37 +08:00
|
|
|
struct uffd_msg msg;
|
2017-06-20 18:06:13 +08:00
|
|
|
wait_queue_entry_t wq;
|
2015-09-05 06:46:31 +08:00
|
|
|
struct userfaultfd_ctx *ctx;
|
userfaultfd: fix SIGBUS resulting from false rwsem wakeups
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.
This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.
Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.
It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.
Debug code collecting the stack trace of the wakeup showed this:
$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
save_stack_trace+0x2b/0x50
try_to_wake_up+0x2a6/0x580
wake_up_q+0x32/0x70
rwsem_wake+0xe0/0x120
call_rwsem_wake+0x1b/0x30
up_write+0x3b/0x40
vm_mmap_pgoff+0x9c/0xc0
SyS_mmap_pgoff+0x1a9/0x240
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xbd
0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xb8/0x112
handle_userfault+0x572/0x650
handle_mm_fault+0x12cb/0x1520
__do_page_fault+0x175/0x500
trace_do_page_fault+0x61/0x270
do_async_page_fault+0x19/0x90
async_page_fault+0x25/0x30
This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).
This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().
This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.
We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.
False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).
Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25 07:17:59 +08:00
|
|
|
bool waken;
|
2015-09-05 06:46:31 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
struct userfaultfd_wake_range {
|
|
|
|
unsigned long start;
|
|
|
|
unsigned long len;
|
|
|
|
};
|
|
|
|
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
/* internal indication that UFFD_API ioctl was successfully executed */
|
|
|
|
#define UFFD_FEATURE_INITIALIZED (1u << 31)
|
|
|
|
|
|
|
|
static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
|
|
|
|
{
|
|
|
|
return ctx->features & UFFD_FEATURE_INITIALIZED;
|
|
|
|
}
|
|
|
|
|
mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Currently, we don't enable writenotify when enabling userfaultfd-wp on a
shared writable mapping (for now only shmem and hugetlb). The consequence
is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we
only add permissions (e.g., mkwrite) but not remove permissions (e.g.,
wrprotect). For example, when enabling softdirty tracking, we enable
writenotify. With uffd-wp on shared mappings, that changed. More details
on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for a uffd-wp
PTEs/PMDs and manually write-protect PTEs/PMDs, which is error prone.
Prone to such issues is any code that uses vma->vm_page_prot to set PTE
permissions: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped
write-protected as default and we will only allow selected PTEs that are
definitely safe to be mapped without write-protection (see
can_change_pte_writable()) to be writable. In the future, we might want
to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more
locations, for example, also when removing uffd-wp protection.
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting
in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs
writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even
without NUMA hinting (which currently doesn't seem to be applicable to
shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On
such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page
tables to enforce the new default protection for the PTEs: we know that
they cannot be uffd-wp'ed yet, because that can only happen after enabling
uffd-wp for the VMA in general.
Also note that this makes mprotect() on ranges with uffd-wp'ed PTEs not
accidentally set the write bit -- which would result in uffd-wp not
triggering on later write access. This commit makes uffd-wp on shmem
behave just like uffd-wp on anonymous memory in that regard, even though,
mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Link: https://lkml.kernel.org/r/20221209080912.7968-1-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09 16:09:12 +08:00
|
|
|
static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
|
|
|
|
vm_flags_t flags)
|
|
|
|
{
|
|
|
|
const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
|
|
|
|
|
|
|
|
vma->vm_flags = flags;
|
|
|
|
/*
|
|
|
|
* For shared mappings, we want to enable writenotify while
|
|
|
|
* userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
|
|
|
|
* recalculate vma->vm_page_prot whenever userfaultfd-wp changes.
|
|
|
|
*/
|
|
|
|
if ((vma->vm_flags & VM_SHARED) && uffd_wp_changed)
|
|
|
|
vma_set_page_prot(vma);
|
|
|
|
}
|
|
|
|
|
2017-06-20 18:06:13 +08:00
|
|
|
static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
|
2015-09-05 06:46:31 +08:00
|
|
|
int wake_flags, void *key)
|
|
|
|
{
|
|
|
|
struct userfaultfd_wake_range *range = key;
|
|
|
|
int ret;
|
|
|
|
struct userfaultfd_wait_queue *uwq;
|
|
|
|
unsigned long start, len;
|
|
|
|
|
|
|
|
uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
|
|
|
|
ret = 0;
|
|
|
|
/* len == 0 means wake all */
|
|
|
|
start = range->start;
|
|
|
|
len = range->len;
|
2015-09-05 06:46:37 +08:00
|
|
|
if (len && (start > uwq->msg.arg.pagefault.address ||
|
|
|
|
start + len <= uwq->msg.arg.pagefault.address))
|
2015-09-05 06:46:31 +08:00
|
|
|
goto out;
|
userfaultfd: fix SIGBUS resulting from false rwsem wakeups
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.
This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.
Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.
It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.
Debug code collecting the stack trace of the wakeup showed this:
$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
save_stack_trace+0x2b/0x50
try_to_wake_up+0x2a6/0x580
wake_up_q+0x32/0x70
rwsem_wake+0xe0/0x120
call_rwsem_wake+0x1b/0x30
up_write+0x3b/0x40
vm_mmap_pgoff+0x9c/0xc0
SyS_mmap_pgoff+0x1a9/0x240
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xbd
0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xb8/0x112
handle_userfault+0x572/0x650
handle_mm_fault+0x12cb/0x1520
__do_page_fault+0x175/0x500
trace_do_page_fault+0x61/0x270
do_async_page_fault+0x19/0x90
async_page_fault+0x25/0x30
This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).
This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().
This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.
We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.
False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).
Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25 07:17:59 +08:00
|
|
|
WRITE_ONCE(uwq->waken, true);
|
|
|
|
/*
|
2017-06-07 23:51:27 +08:00
|
|
|
* The Program-Order guarantees provided by the scheduler
|
|
|
|
* ensure uwq->waken is visible before the task is woken.
|
userfaultfd: fix SIGBUS resulting from false rwsem wakeups
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.
This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.
Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.
It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.
Debug code collecting the stack trace of the wakeup showed this:
$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
save_stack_trace+0x2b/0x50
try_to_wake_up+0x2a6/0x580
wake_up_q+0x32/0x70
rwsem_wake+0xe0/0x120
call_rwsem_wake+0x1b/0x30
up_write+0x3b/0x40
vm_mmap_pgoff+0x9c/0xc0
SyS_mmap_pgoff+0x1a9/0x240
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xbd
0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xb8/0x112
handle_userfault+0x572/0x650
handle_mm_fault+0x12cb/0x1520
__do_page_fault+0x175/0x500
trace_do_page_fault+0x61/0x270
do_async_page_fault+0x19/0x90
async_page_fault+0x25/0x30
This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).
This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().
This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.
We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.
False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).
Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25 07:17:59 +08:00
|
|
|
*/
|
2015-09-05 06:46:31 +08:00
|
|
|
ret = wake_up_state(wq->private, mode);
|
2017-06-07 23:51:27 +08:00
|
|
|
if (ret) {
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* Wake only once, autoremove behavior.
|
|
|
|
*
|
2017-06-07 23:51:27 +08:00
|
|
|
* After the effect of list_del_init is visible to the other
|
|
|
|
* CPUs, the waitqueue may disappear from under us, see the
|
|
|
|
* !list_empty_careful() in handle_userfault().
|
|
|
|
*
|
|
|
|
* try_to_wake_up() has an implicit smp_mb(), and the
|
|
|
|
* wq->private is read before calling the extern function
|
|
|
|
* "wake_up_state" (which in turns calls try_to_wake_up).
|
2015-09-05 06:46:31 +08:00
|
|
|
*/
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
list_del_init(&wq->entry);
|
2017-06-07 23:51:27 +08:00
|
|
|
}
|
2015-09-05 06:46:31 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
|
|
|
|
* context.
|
|
|
|
* @ctx: [in] Pointer to the userfaultfd context.
|
|
|
|
*/
|
|
|
|
static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
|
|
|
|
{
|
2018-12-28 16:34:43 +08:00
|
|
|
refcount_inc(&ctx->refcount);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
|
|
|
|
* context.
|
|
|
|
* @ctx: [in] Pointer to userfaultfd context.
|
|
|
|
*
|
|
|
|
* The userfaultfd context reference must have been previously acquired either
|
|
|
|
* with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
|
|
|
|
*/
|
|
|
|
static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
|
|
|
|
{
|
2018-12-28 16:34:43 +08:00
|
|
|
if (refcount_dec_and_test(&ctx->refcount)) {
|
2015-09-05 06:46:31 +08:00
|
|
|
VM_BUG_ON(spin_is_locked(&ctx->fault_pending_wqh.lock));
|
|
|
|
VM_BUG_ON(waitqueue_active(&ctx->fault_pending_wqh));
|
|
|
|
VM_BUG_ON(spin_is_locked(&ctx->fault_wqh.lock));
|
|
|
|
VM_BUG_ON(waitqueue_active(&ctx->fault_wqh));
|
2017-02-23 07:42:21 +08:00
|
|
|
VM_BUG_ON(spin_is_locked(&ctx->event_wqh.lock));
|
|
|
|
VM_BUG_ON(waitqueue_active(&ctx->event_wqh));
|
2015-09-05 06:46:31 +08:00
|
|
|
VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
|
|
|
|
VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
|
2016-05-21 07:58:36 +08:00
|
|
|
mmdrop(ctx->mm);
|
2015-09-05 06:46:48 +08:00
|
|
|
kmem_cache_free(userfaultfd_ctx_cachep, ctx);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:37 +08:00
|
|
|
static inline void msg_init(struct uffd_msg *msg)
|
2015-09-05 06:46:31 +08:00
|
|
|
{
|
2015-09-05 06:46:37 +08:00
|
|
|
BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);
|
|
|
|
/*
|
|
|
|
* Must use memset to zero out the paddings or kernel data is
|
|
|
|
* leaked to userland.
|
|
|
|
*/
|
|
|
|
memset(msg, 0, sizeof(struct uffd_msg));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct uffd_msg userfault_msg(unsigned long address,
|
2022-07-12 00:59:06 +08:00
|
|
|
unsigned long real_address,
|
2015-09-05 06:46:37 +08:00
|
|
|
unsigned int flags,
|
2017-09-07 07:23:56 +08:00
|
|
|
unsigned long reason,
|
|
|
|
unsigned int features)
|
2015-09-05 06:46:37 +08:00
|
|
|
{
|
|
|
|
struct uffd_msg msg;
|
2022-07-12 00:59:06 +08:00
|
|
|
|
2015-09-05 06:46:37 +08:00
|
|
|
msg_init(&msg);
|
|
|
|
msg.event = UFFD_EVENT_PAGEFAULT;
|
userfaultfd: provide unmasked address on page-fault
Userfaultfd is supposed to provide the full address (i.e., unmasked) of
the faulting access back to userspace. However, that is not the case for
quite some time.
Even running "userfaultfd_demo" from the userfaultfd man page provides the
wrong output (and contradicts the man page). Notice that
"UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
not the first read address (0x7fc5e30b300f).
Address returned by mmap() = 0x7fc5e30b3000
fault_handler_thread():
poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
(uffdio_copy.copy returned 4096)
Read address 0x7fc5e30b300f in main(): A
Read address 0x7fc5e30b340f in main(): A
Read address 0x7fc5e30b380f in main(): A
Read address 0x7fc5e30b3c0f in main(): A
The exact address is useful for various reasons and specifically for
prefetching decisions. If it is known that the memory is populated by
certain objects whose size is not page-aligned, then based on the faulting
address, the uffd-monitor can decide whether to prefetch and prefault the
adjacent page.
This bug has been for quite some time in the kernel: since commit
1a29d85eb0f1 ("mm: use vmf->address instead of of vmf->virtual_address")
vmf->virtual_address"), which dates back to 2016. A concern has been
raised that existing userspace application might rely on the old/wrong
behavior in which the address is masked. Therefore, it was suggested to
provide the masked address unless the user explicitly asks for the exact
address.
Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct
userfaultfd to provide the exact address. Add a new "real_address" field
to vmf to hold the unmasked address. Provide the address to userspace
accordingly.
Initialize real_address in various code-paths to be consistent with
address, even when it is not used, to be on the safe side.
[namit@vmware.com: initialize real_address on all code paths, per Jan]
Link: https://lkml.kernel.org/r/20220226022655.350562-1-namit@vmware.com
[akpm@linux-foundation.org: fix typo in comment, per Jan]
Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23 05:45:32 +08:00
|
|
|
|
2022-07-12 00:59:06 +08:00
|
|
|
msg.arg.pagefault.address = (features & UFFD_FEATURE_EXACT_ADDRESS) ?
|
|
|
|
real_address : address;
|
|
|
|
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
/*
|
|
|
|
* These flags indicate why the userfault occurred:
|
|
|
|
* - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
|
|
|
|
* - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
|
|
|
|
* - Neither of these flags being set indicates a MISSING fault.
|
|
|
|
*
|
|
|
|
* Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
|
|
|
|
* fault. Otherwise, it was a read fault.
|
|
|
|
*/
|
2015-09-05 06:46:31 +08:00
|
|
|
if (flags & FAULT_FLAG_WRITE)
|
2015-09-05 06:46:37 +08:00
|
|
|
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
|
2015-09-05 06:46:31 +08:00
|
|
|
if (reason & VM_UFFD_WP)
|
2015-09-05 06:46:37 +08:00
|
|
|
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
if (reason & VM_UFFD_MINOR)
|
|
|
|
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
|
2017-09-07 07:23:56 +08:00
|
|
|
if (features & UFFD_FEATURE_THREAD_ID)
|
2017-09-07 07:23:59 +08:00
|
|
|
msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
|
2015-09-05 06:46:37 +08:00
|
|
|
return msg;
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
|
2017-02-23 07:43:10 +08:00
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
|
|
|
/*
|
|
|
|
* Same functionality as userfaultfd_must_wait below with modifications for
|
|
|
|
* hugepmd ranges.
|
|
|
|
*/
|
|
|
|
static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
|
2017-07-07 06:39:42 +08:00
|
|
|
struct vm_area_struct *vma,
|
2017-02-23 07:43:10 +08:00
|
|
|
unsigned long address,
|
|
|
|
unsigned long flags,
|
|
|
|
unsigned long reason)
|
|
|
|
{
|
2018-07-04 08:02:39 +08:00
|
|
|
pte_t *ptep, pte;
|
2017-02-23 07:43:10 +08:00
|
|
|
bool ret = true;
|
|
|
|
|
2022-12-16 23:52:29 +08:00
|
|
|
mmap_assert_locked(ctx->mm);
|
2018-07-04 08:02:39 +08:00
|
|
|
|
2022-12-16 23:52:29 +08:00
|
|
|
ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
|
2018-07-04 08:02:39 +08:00
|
|
|
if (!ptep)
|
2017-02-23 07:43:10 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = false;
|
2018-07-04 08:02:39 +08:00
|
|
|
pte = huge_ptep_get(ptep);
|
2017-02-23 07:43:10 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Lockless access: we're in a wait_event so it's ok if it
|
2022-05-13 11:22:52 +08:00
|
|
|
* changes under us. PTE markers should be handled the same as none
|
|
|
|
* ptes here.
|
2017-02-23 07:43:10 +08:00
|
|
|
*/
|
2022-05-13 11:22:52 +08:00
|
|
|
if (huge_pte_none_mostly(pte))
|
2017-02-23 07:43:10 +08:00
|
|
|
ret = true;
|
2018-07-04 08:02:39 +08:00
|
|
|
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
|
2017-02-23 07:43:10 +08:00
|
|
|
ret = true;
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
|
2017-07-07 06:39:42 +08:00
|
|
|
struct vm_area_struct *vma,
|
2017-02-23 07:43:10 +08:00
|
|
|
unsigned long address,
|
|
|
|
unsigned long flags,
|
|
|
|
unsigned long reason)
|
|
|
|
{
|
|
|
|
return false; /* should never get here */
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HUGETLB_PAGE */
|
|
|
|
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
/*
|
|
|
|
* Verify the pagetables are still not ok after having reigstered into
|
|
|
|
* the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
|
|
|
|
* userfault that has already been resolved, if userfaultfd_read and
|
|
|
|
* UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different
|
|
|
|
* threads.
|
|
|
|
*/
|
|
|
|
static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
|
|
|
|
unsigned long address,
|
|
|
|
unsigned long flags,
|
|
|
|
unsigned long reason)
|
|
|
|
{
|
|
|
|
struct mm_struct *mm = ctx->mm;
|
|
|
|
pgd_t *pgd;
|
2017-03-09 22:24:07 +08:00
|
|
|
p4d_t *p4d;
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
pud_t *pud;
|
|
|
|
pmd_t *pmd, _pmd;
|
|
|
|
pte_t *pte;
|
|
|
|
bool ret = true;
|
|
|
|
|
2020-06-09 12:33:44 +08:00
|
|
|
mmap_assert_locked(mm);
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
|
|
|
|
pgd = pgd_offset(mm, address);
|
|
|
|
if (!pgd_present(*pgd))
|
|
|
|
goto out;
|
2017-03-09 22:24:07 +08:00
|
|
|
p4d = p4d_offset(pgd, address);
|
|
|
|
if (!p4d_present(*p4d))
|
|
|
|
goto out;
|
|
|
|
pud = pud_offset(p4d, address);
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
if (!pud_present(*pud))
|
|
|
|
goto out;
|
|
|
|
pmd = pmd_offset(pud, address);
|
|
|
|
/*
|
|
|
|
* READ_ONCE must function as a barrier with narrower scope
|
|
|
|
* and it must be equivalent to:
|
|
|
|
* _pmd = *pmd; barrier();
|
|
|
|
*
|
|
|
|
* This is to deal with the instability (as in
|
|
|
|
* pmd_trans_unstable) of the pmd.
|
|
|
|
*/
|
|
|
|
_pmd = READ_ONCE(*pmd);
|
mm, userfaultfd, THP: avoid waiting when PMD under THP migration
If THP migration is enabled, for a VMA handled by userfaultfd, consider
the following situation,
do_page_fault()
__do_huge_pmd_anonymous_page()
handle_userfault()
userfault_msg()
/* a huge page is allocated and mapped at fault address */
/* the huge page is under migration, leaves migration entry
in page table */
userfaultfd_must_wait()
/* return true because !pmd_present() */
/* may wait in loop until fatal signal */
That is, it may be possible for userfaultfd_must_wait() encounters a PMD
entry which is !pmd_none() && !pmd_present(). In the current
implementation, we will wait for such PMD entries, which may cause
unnecessary waiting, and potential soft lockup.
This is fixed via avoiding to wait when !pmd_none() && !pmd_present(),
only wait when pmd_none().
This may be not a problem in practice, because userfaultfd_must_wait()
is always called with mm->mmap_sem read-locked. mremap() will
write-lock mm->mmap_sem. And UFFDIO_COPY doesn't support to copy THP
mapping. But the change introduced still makes the code more correct,
and makes the PMD and PTE code more consistent.
Link: http://lkml.kernel.org/r/20171207011752.3292-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.UK>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 08:17:32 +08:00
|
|
|
if (pmd_none(_pmd))
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = false;
|
mm, userfaultfd, THP: avoid waiting when PMD under THP migration
If THP migration is enabled, for a VMA handled by userfaultfd, consider
the following situation,
do_page_fault()
__do_huge_pmd_anonymous_page()
handle_userfault()
userfault_msg()
/* a huge page is allocated and mapped at fault address */
/* the huge page is under migration, leaves migration entry
in page table */
userfaultfd_must_wait()
/* return true because !pmd_present() */
/* may wait in loop until fatal signal */
That is, it may be possible for userfaultfd_must_wait() encounters a PMD
entry which is !pmd_none() && !pmd_present(). In the current
implementation, we will wait for such PMD entries, which may cause
unnecessary waiting, and potential soft lockup.
This is fixed via avoiding to wait when !pmd_none() && !pmd_present(),
only wait when pmd_none().
This may be not a problem in practice, because userfaultfd_must_wait()
is always called with mm->mmap_sem read-locked. mremap() will
write-lock mm->mmap_sem. And UFFDIO_COPY doesn't support to copy THP
mapping. But the change introduced still makes the code more correct,
and makes the PMD and PTE code more consistent.
Link: http://lkml.kernel.org/r/20171207011752.3292-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.UK>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 08:17:32 +08:00
|
|
|
if (!pmd_present(_pmd))
|
|
|
|
goto out;
|
|
|
|
|
2020-04-07 11:06:12 +08:00
|
|
|
if (pmd_trans_huge(_pmd)) {
|
|
|
|
if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
|
|
|
|
ret = true;
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
goto out;
|
2020-04-07 11:06:12 +08:00
|
|
|
}
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
|
|
|
|
* and use the standard pte_offset_map() instead of parsing _pmd.
|
|
|
|
*/
|
|
|
|
pte = pte_offset_map(pmd, address);
|
|
|
|
/*
|
|
|
|
* Lockless access: we're in a wait_event so it's ok if it
|
2022-05-13 11:22:52 +08:00
|
|
|
* changes under us. PTE markers should be handled the same as none
|
|
|
|
* ptes here.
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
*/
|
2022-05-13 11:22:52 +08:00
|
|
|
if (pte_none_mostly(*pte))
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
ret = true;
|
2020-04-07 11:06:12 +08:00
|
|
|
if (!pte_write(*pte) && (reason & VM_UFFD_WP))
|
|
|
|
ret = true;
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
pte_unmap(pte);
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-06-11 16:28:17 +08:00
|
|
|
static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
|
2020-04-02 12:09:00 +08:00
|
|
|
{
|
|
|
|
if (flags & FAULT_FLAG_INTERRUPTIBLE)
|
|
|
|
return TASK_INTERRUPTIBLE;
|
|
|
|
|
|
|
|
if (flags & FAULT_FLAG_KILLABLE)
|
|
|
|
return TASK_KILLABLE;
|
|
|
|
|
|
|
|
return TASK_UNINTERRUPTIBLE;
|
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* The locking rules involved in returning VM_FAULT_RETRY depending on
|
|
|
|
* FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
|
|
|
|
* FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
|
|
|
|
* recommendation in __lock_page_or_retry is not an understatement.
|
|
|
|
*
|
2020-06-09 12:33:54 +08:00
|
|
|
* If FAULT_FLAG_ALLOW_RETRY is set, the mmap_lock must be released
|
2015-09-05 06:46:31 +08:00
|
|
|
* before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
|
|
|
|
* not set.
|
|
|
|
*
|
|
|
|
* If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
|
|
|
|
* set, VM_FAULT_RETRY can still be returned if and only if there are
|
2020-06-09 12:33:54 +08:00
|
|
|
* fatal_signal_pending()s, and the mmap_lock must be released before
|
2015-09-05 06:46:31 +08:00
|
|
|
* returning it.
|
|
|
|
*/
|
2018-08-24 08:01:36 +08:00
|
|
|
vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
|
2015-09-05 06:46:31 +08:00
|
|
|
{
|
2022-12-16 23:52:17 +08:00
|
|
|
struct vm_area_struct *vma = vmf->vma;
|
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
2015-09-05 06:46:31 +08:00
|
|
|
struct userfaultfd_ctx *ctx;
|
|
|
|
struct userfaultfd_wait_queue uwq;
|
2018-08-24 08:01:36 +08:00
|
|
|
vm_fault_t ret = VM_FAULT_SIGBUS;
|
2020-04-02 12:09:00 +08:00
|
|
|
bool must_wait;
|
2021-06-11 16:28:17 +08:00
|
|
|
unsigned int blocking_state;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2017-06-17 05:02:37 +08:00
|
|
|
/*
|
|
|
|
* We don't do userfault handling for the final child pid update.
|
|
|
|
*
|
|
|
|
* We also don't do userfault handling during
|
|
|
|
* coredumping. hugetlbfs has the special
|
|
|
|
* follow_hugetlb_page() to skip missing pages in the
|
|
|
|
* FOLL_DUMP case, anon memory also checks for FOLL_DUMP with
|
|
|
|
* the no_page_table() helper in follow_page_mask(), but the
|
|
|
|
* shmem_vm_ops->fault method is invoked even during
|
2020-06-09 12:33:54 +08:00
|
|
|
* coredumping without mmap_lock and it ends up here.
|
2017-06-17 05:02:37 +08:00
|
|
|
*/
|
|
|
|
if (current->flags & (PF_EXITING|PF_DUMPCORE))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
2020-06-09 12:33:54 +08:00
|
|
|
* Coredumping runs without mmap_lock so we can only check that
|
|
|
|
* the mmap_lock is held, if PF_DUMPCORE was not set.
|
2017-06-17 05:02:37 +08:00
|
|
|
*/
|
2020-06-09 12:33:44 +08:00
|
|
|
mmap_assert_locked(mm);
|
2017-06-17 05:02:37 +08:00
|
|
|
|
2022-12-16 23:52:17 +08:00
|
|
|
ctx = vma->vm_userfaultfd_ctx.ctx;
|
2015-09-05 06:46:31 +08:00
|
|
|
if (!ctx)
|
2015-09-05 06:46:41 +08:00
|
|
|
goto out;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
BUG_ON(ctx->mm != mm);
|
|
|
|
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
/* Any unrecognized flag is a bug. */
|
|
|
|
VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
|
|
|
|
/* 0 or > 1 flags set is a bug; we expect exactly 1. */
|
|
|
|
VM_BUG_ON(!reason || (reason & (reason - 1)));
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2017-09-07 07:23:39 +08:00
|
|
|
if (ctx->features & UFFD_FEATURE_SIGBUS)
|
|
|
|
goto out;
|
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
[axelrasmussen@google.com: Handle misc_register() failure properly]
Link: https://lkml.kernel.org/r/20220819205201.658693-3-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220808175614.3885028-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-09 01:56:11 +08:00
|
|
|
if (!(vmf->flags & FAULT_FLAG_USER) && (ctx->flags & UFFD_USER_MODE_ONLY))
|
userfaultfd: add UFFD_USER_MODE_ONLY
Patch series "Control over userfaultfd kernel-fault handling", v6.
This patch series is split from [1]. The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.
It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel. For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3]. Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.
This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object. It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.
The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.
[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
This patch (of 2):
userfaultfd handles page faults from both user and kernel code. Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.
A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.
Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <calin@google.com>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 11:13:49 +08:00
|
|
|
goto out;
|
2017-09-07 07:23:39 +08:00
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* If it's already released don't get it. This avoids to loop
|
|
|
|
* in __get_user_pages if userfaultfd_release waits on the
|
2020-06-09 12:33:54 +08:00
|
|
|
* caller of handle_userfault to release the mmap_lock.
|
2015-09-05 06:46:31 +08:00
|
|
|
*/
|
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 05:07:29 +08:00
|
|
|
if (unlikely(READ_ONCE(ctx->released))) {
|
2017-09-09 07:12:42 +08:00
|
|
|
/*
|
|
|
|
* Don't return VM_FAULT_SIGBUS in this case, so a non
|
|
|
|
* cooperative manager can close the uffd after the
|
|
|
|
* last UFFDIO_COPY, without risking to trigger an
|
|
|
|
* involuntary SIGBUS if the process was starting the
|
|
|
|
* userfaultfd while the userfaultfd was still armed
|
|
|
|
* (but after the last UFFDIO_COPY). If the uffd
|
|
|
|
* wasn't already closed when the userfault reached
|
|
|
|
* this point, that would normally be solved by
|
|
|
|
* userfaultfd_must_wait returning 'false'.
|
|
|
|
*
|
|
|
|
* If we were to return VM_FAULT_SIGBUS here, the non
|
|
|
|
* cooperative manager would be instead forced to
|
|
|
|
* always call UFFDIO_UNREGISTER before it can safely
|
|
|
|
* close the uffd.
|
|
|
|
*/
|
|
|
|
ret = VM_FAULT_NOPAGE;
|
2015-09-05 06:46:41 +08:00
|
|
|
goto out;
|
2017-09-09 07:12:42 +08:00
|
|
|
}
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check that we can return VM_FAULT_RETRY.
|
|
|
|
*
|
|
|
|
* NOTE: it should become possible to return VM_FAULT_RETRY
|
|
|
|
* even if FAULT_FLAG_TRIED is set without leading to gup()
|
|
|
|
* -EBUSY failures, if the userfaultfd is to be extended for
|
|
|
|
* VM_UFFD_WP tracking and we intend to arm the userfault
|
|
|
|
* without first stopping userland access to the memory. For
|
|
|
|
* VM_UFFD_MISSING userfaults this is enough for now.
|
|
|
|
*/
|
2016-12-15 07:06:58 +08:00
|
|
|
if (unlikely(!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))) {
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* Validate the invariant that nowait must allow retry
|
|
|
|
* to be sure not to return SIGBUS erroneously on
|
|
|
|
* nowait invocations.
|
|
|
|
*/
|
2016-12-15 07:06:58 +08:00
|
|
|
BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
|
2015-09-05 06:46:31 +08:00
|
|
|
#ifdef CONFIG_DEBUG_VM
|
|
|
|
if (printk_ratelimit()) {
|
|
|
|
printk(KERN_WARNING
|
2016-12-15 07:06:58 +08:00
|
|
|
"FAULT_FLAG_ALLOW_RETRY missing %x\n",
|
|
|
|
vmf->flags);
|
2015-09-05 06:46:31 +08:00
|
|
|
dump_stack();
|
|
|
|
}
|
|
|
|
#endif
|
2015-09-05 06:46:41 +08:00
|
|
|
goto out;
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Handle nowait, not much to do other than tell it to retry
|
|
|
|
* and wait.
|
|
|
|
*/
|
2015-09-05 06:46:41 +08:00
|
|
|
ret = VM_FAULT_RETRY;
|
2016-12-15 07:06:58 +08:00
|
|
|
if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
|
2015-09-05 06:46:41 +08:00
|
|
|
goto out;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2020-06-09 12:33:54 +08:00
|
|
|
/* take the reference before dropping the mmap_lock */
|
2015-09-05 06:46:31 +08:00
|
|
|
userfaultfd_ctx_get(ctx);
|
|
|
|
|
|
|
|
init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
|
|
|
|
uwq.wq.private = current;
|
2022-07-12 00:59:06 +08:00
|
|
|
uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
|
|
|
|
reason, ctx->features);
|
2015-09-05 06:46:31 +08:00
|
|
|
uwq.ctx = ctx;
|
userfaultfd: fix SIGBUS resulting from false rwsem wakeups
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.
This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.
Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.
It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.
Debug code collecting the stack trace of the wakeup showed this:
$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
save_stack_trace+0x2b/0x50
try_to_wake_up+0x2a6/0x580
wake_up_q+0x32/0x70
rwsem_wake+0xe0/0x120
call_rwsem_wake+0x1b/0x30
up_write+0x3b/0x40
vm_mmap_pgoff+0x9c/0xc0
SyS_mmap_pgoff+0x1a9/0x240
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xbd
0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xb8/0x112
handle_userfault+0x572/0x650
handle_mm_fault+0x12cb/0x1520
__do_page_fault+0x175/0x500
trace_do_page_fault+0x61/0x270
do_async_page_fault+0x19/0x90
async_page_fault+0x25/0x30
This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).
This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().
This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.
We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.
False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).
Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25 07:17:59 +08:00
|
|
|
uwq.waken = false;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2020-04-02 12:09:00 +08:00
|
|
|
blocking_state = userfaultfd_get_blocking_state(vmf->flags);
|
2015-09-05 06:47:18 +08:00
|
|
|
|
2022-12-16 23:52:17 +08:00
|
|
|
/*
|
|
|
|
* Take the vma lock now, in order to safely call
|
|
|
|
* userfaultfd_huge_must_wait() later. Since acquiring the
|
|
|
|
* (sleepable) vma lock can modify the current task state, that
|
|
|
|
* must be before explicitly calling set_current_state().
|
|
|
|
*/
|
|
|
|
if (is_vm_hugetlb_page(vma))
|
|
|
|
hugetlb_vma_lock_read(vma);
|
|
|
|
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_lock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* After the __add_wait_queue the uwq is visible to userland
|
|
|
|
* through poll/read().
|
|
|
|
*/
|
2015-09-05 06:46:44 +08:00
|
|
|
__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
|
|
|
|
/*
|
|
|
|
* The smp_mb() after __set_current_state prevents the reads
|
|
|
|
* following the spin_unlock to happen before the list_add in
|
|
|
|
* __add_wait_queue.
|
|
|
|
*/
|
userfaultfd: fix SIGBUS resulting from false rwsem wakeups
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.
This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.
Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.
It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.
Debug code collecting the stack trace of the wakeup showed this:
$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
save_stack_trace+0x2b/0x50
try_to_wake_up+0x2a6/0x580
wake_up_q+0x32/0x70
rwsem_wake+0xe0/0x120
call_rwsem_wake+0x1b/0x30
up_write+0x3b/0x40
vm_mmap_pgoff+0x9c/0xc0
SyS_mmap_pgoff+0x1a9/0x240
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xbd
0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xb8/0x112
handle_userfault+0x572/0x650
handle_mm_fault+0x12cb/0x1520
__do_page_fault+0x175/0x500
trace_do_page_fault+0x61/0x270
do_async_page_fault+0x19/0x90
async_page_fault+0x25/0x30
This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).
This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().
This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.
We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.
False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).
Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25 07:17:59 +08:00
|
|
|
set_current_state(blocking_state);
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2022-12-16 23:52:17 +08:00
|
|
|
if (!is_vm_hugetlb_page(vma))
|
2017-02-23 07:43:10 +08:00
|
|
|
must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
|
|
|
|
reason);
|
|
|
|
else
|
2022-12-16 23:52:17 +08:00
|
|
|
must_wait = userfaultfd_huge_must_wait(ctx, vma,
|
2017-07-07 06:39:42 +08:00
|
|
|
vmf->address,
|
2017-02-23 07:43:10 +08:00
|
|
|
vmf->flags, reason);
|
2022-12-16 23:52:17 +08:00
|
|
|
if (is_vm_hugetlb_page(vma))
|
|
|
|
hugetlb_vma_unlock_read(vma);
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm);
|
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.
Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.
Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.
Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.
Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
/* check that no userfault raced with UFFDIO_COPY */
if (old_state == MISSING && tmp_state == REQUESTED)
UFFDIO_WAKE from background thread
And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
vcpu background_thr userfault_thr
----- ----- -----
vcpu0 handle_mm_fault()
postcopy_place_page
read old_state -> MISSING
UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked
poll() -> POLLIN
read() -> 0x7fb76a139000
if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
UFFDIO_WAKE from userfault thread
This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:46:51 +08:00
|
|
|
|
2020-08-03 01:42:31 +08:00
|
|
|
if (likely(must_wait && !READ_ONCE(ctx->released))) {
|
2018-02-12 06:34:03 +08:00
|
|
|
wake_up_poll(&ctx->fd_wqh, EPOLLIN);
|
2015-09-05 06:46:31 +08:00
|
|
|
schedule();
|
2015-09-05 06:46:41 +08:00
|
|
|
}
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2015-09-05 06:46:41 +08:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2015-09-05 06:46:44 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Here we race with the list_del; list_add in
|
|
|
|
* userfaultfd_ctx_read(), however because we don't ever run
|
|
|
|
* list_del_init() to refile across the two lists, the prev
|
|
|
|
* and next pointers will never point to self. list_add also
|
|
|
|
* would never let any of the two pointers to point to
|
|
|
|
* self. So list_empty_careful won't risk to see both pointers
|
|
|
|
* pointing to self at any time during the list refile. The
|
|
|
|
* only case where list_del_init() is called is the full
|
|
|
|
* removal in the wake function and there we don't re-list_add
|
|
|
|
* and it's fine not to block on the spinlock. The uwq on this
|
|
|
|
* kernel stack can be released after the list_del_init.
|
|
|
|
*/
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
if (!list_empty_careful(&uwq.wq.entry)) {
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_lock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:44 +08:00
|
|
|
/*
|
|
|
|
* No need of list_del_init(), the uwq on the stack
|
|
|
|
* will be freed shortly anyway.
|
|
|
|
*/
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
list_del(&uwq.wq.entry);
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ctx may go away after this if the userfault pseudo fd is
|
|
|
|
* already released.
|
|
|
|
*/
|
|
|
|
userfaultfd_ctx_put(ctx);
|
|
|
|
|
2015-09-05 06:46:41 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
|
2017-03-10 08:16:54 +08:00
|
|
|
static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
|
|
|
|
struct userfaultfd_wait_queue *ewq)
|
2017-02-23 07:42:21 +08:00
|
|
|
{
|
2018-01-05 08:18:09 +08:00
|
|
|
struct userfaultfd_ctx *release_new_ctx;
|
|
|
|
|
2017-03-10 08:16:52 +08:00
|
|
|
if (WARN_ON_ONCE(current->flags & PF_EXITING))
|
|
|
|
goto out;
|
2017-02-23 07:42:21 +08:00
|
|
|
|
|
|
|
ewq->ctx = ctx;
|
|
|
|
init_waitqueue_entry(&ewq->wq, current);
|
2018-01-05 08:18:09 +08:00
|
|
|
release_new_ctx = NULL;
|
2017-02-23 07:42:21 +08:00
|
|
|
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_lock_irq(&ctx->event_wqh.lock);
|
2017-02-23 07:42:21 +08:00
|
|
|
/*
|
|
|
|
* After the __add_wait_queue the uwq is visible to userland
|
|
|
|
* through poll/read().
|
|
|
|
*/
|
|
|
|
__add_wait_queue(&ctx->event_wqh, &ewq->wq);
|
|
|
|
for (;;) {
|
|
|
|
set_current_state(TASK_KILLABLE);
|
|
|
|
if (ewq->msg.event == 0)
|
|
|
|
break;
|
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 05:07:29 +08:00
|
|
|
if (READ_ONCE(ctx->released) ||
|
2017-02-23 07:42:21 +08:00
|
|
|
fatal_signal_pending(current)) {
|
2017-10-04 07:15:38 +08:00
|
|
|
/*
|
|
|
|
* &ewq->wq may be queued in fork_event, but
|
|
|
|
* __remove_wait_queue ignores the head
|
|
|
|
* parameter. It would be a problem if it
|
|
|
|
* didn't.
|
|
|
|
*/
|
2017-02-23 07:42:21 +08:00
|
|
|
__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
|
2017-03-10 08:17:09 +08:00
|
|
|
if (ewq->msg.event == UFFD_EVENT_FORK) {
|
|
|
|
struct userfaultfd_ctx *new;
|
|
|
|
|
|
|
|
new = (struct userfaultfd_ctx *)
|
|
|
|
(unsigned long)
|
|
|
|
ewq->msg.arg.reserved.reserved1;
|
2018-01-05 08:18:09 +08:00
|
|
|
release_new_ctx = new;
|
2017-03-10 08:17:09 +08:00
|
|
|
}
|
2017-02-23 07:42:21 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_unlock_irq(&ctx->event_wqh.lock);
|
2017-02-23 07:42:21 +08:00
|
|
|
|
2018-02-12 06:34:03 +08:00
|
|
|
wake_up_poll(&ctx->fd_wqh, EPOLLIN);
|
2017-02-23 07:42:21 +08:00
|
|
|
schedule();
|
|
|
|
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_lock_irq(&ctx->event_wqh.lock);
|
2017-02-23 07:42:21 +08:00
|
|
|
}
|
|
|
|
__set_current_state(TASK_RUNNING);
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_unlock_irq(&ctx->event_wqh.lock);
|
2017-02-23 07:42:21 +08:00
|
|
|
|
2018-01-05 08:18:09 +08:00
|
|
|
if (release_new_ctx) {
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
struct mm_struct *mm = release_new_ctx->mm;
|
2022-09-07 03:48:57 +08:00
|
|
|
VMA_ITERATOR(vmi, mm, 0);
|
2018-01-05 08:18:09 +08:00
|
|
|
|
|
|
|
/* the various vma->vm_userfaultfd_ctx still points to it */
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_lock(mm);
|
2022-09-07 03:48:57 +08:00
|
|
|
for_each_vma(vmi, vma) {
|
2018-08-03 06:36:09 +08:00
|
|
|
if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
|
2018-01-05 08:18:09 +08:00
|
|
|
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
|
mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Currently, we don't enable writenotify when enabling userfaultfd-wp on a
shared writable mapping (for now only shmem and hugetlb). The consequence
is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we
only add permissions (e.g., mkwrite) but not remove permissions (e.g.,
wrprotect). For example, when enabling softdirty tracking, we enable
writenotify. With uffd-wp on shared mappings, that changed. More details
on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for a uffd-wp
PTEs/PMDs and manually write-protect PTEs/PMDs, which is error prone.
Prone to such issues is any code that uses vma->vm_page_prot to set PTE
permissions: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped
write-protected as default and we will only allow selected PTEs that are
definitely safe to be mapped without write-protection (see
can_change_pte_writable()) to be writable. In the future, we might want
to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more
locations, for example, also when removing uffd-wp protection.
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting
in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs
writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even
without NUMA hinting (which currently doesn't seem to be applicable to
shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On
such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page
tables to enforce the new default protection for the PTEs: we know that
they cannot be uffd-wp'ed yet, because that can only happen after enabling
uffd-wp for the VMA in general.
Also note that this makes mprotect() on ranges with uffd-wp'ed PTEs not
accidentally set the write bit -- which would result in uffd-wp not
triggering on later write access. This commit makes uffd-wp on shmem
behave just like uffd-wp on anonymous memory in that regard, even though,
mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Link: https://lkml.kernel.org/r/20221209080912.7968-1-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09 16:09:12 +08:00
|
|
|
userfaultfd_set_vm_flags(vma,
|
|
|
|
vma->vm_flags & ~__VM_UFFD_FLAGS);
|
2018-08-03 06:36:09 +08:00
|
|
|
}
|
2022-09-07 03:48:57 +08:00
|
|
|
}
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_unlock(mm);
|
2018-01-05 08:18:09 +08:00
|
|
|
|
|
|
|
userfaultfd_ctx_put(release_new_ctx);
|
|
|
|
}
|
|
|
|
|
2017-02-23 07:42:21 +08:00
|
|
|
/*
|
|
|
|
* ctx may go away after this if the userfault pseudo fd is
|
|
|
|
* already released.
|
|
|
|
*/
|
2017-03-10 08:16:52 +08:00
|
|
|
out:
|
2021-09-03 05:58:56 +08:00
|
|
|
atomic_dec(&ctx->mmap_changing);
|
|
|
|
VM_BUG_ON(atomic_read(&ctx->mmap_changing) < 0);
|
2017-02-23 07:42:21 +08:00
|
|
|
userfaultfd_ctx_put(ctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
|
|
|
|
struct userfaultfd_wait_queue *ewq)
|
|
|
|
{
|
|
|
|
ewq->msg.event = 0;
|
|
|
|
wake_up_locked(&ctx->event_wqh);
|
|
|
|
__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
|
|
|
|
}
|
|
|
|
|
2017-02-23 07:42:27 +08:00
|
|
|
int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
|
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx = NULL, *octx;
|
|
|
|
struct userfaultfd_fork_ctx *fctx;
|
|
|
|
|
|
|
|
octx = vma->vm_userfaultfd_ctx.ctx;
|
|
|
|
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
|
|
|
|
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
|
mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Currently, we don't enable writenotify when enabling userfaultfd-wp on a
shared writable mapping (for now only shmem and hugetlb). The consequence
is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we
only add permissions (e.g., mkwrite) but not remove permissions (e.g.,
wrprotect). For example, when enabling softdirty tracking, we enable
writenotify. With uffd-wp on shared mappings, that changed. More details
on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for a uffd-wp
PTEs/PMDs and manually write-protect PTEs/PMDs, which is error prone.
Prone to such issues is any code that uses vma->vm_page_prot to set PTE
permissions: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped
write-protected as default and we will only allow selected PTEs that are
definitely safe to be mapped without write-protection (see
can_change_pte_writable()) to be writable. In the future, we might want
to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more
locations, for example, also when removing uffd-wp protection.
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting
in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs
writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even
without NUMA hinting (which currently doesn't seem to be applicable to
shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On
such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page
tables to enforce the new default protection for the PTEs: we know that
they cannot be uffd-wp'ed yet, because that can only happen after enabling
uffd-wp for the VMA in general.
Also note that this makes mprotect() on ranges with uffd-wp'ed PTEs not
accidentally set the write bit -- which would result in uffd-wp not
triggering on later write access. This commit makes uffd-wp on shmem
behave just like uffd-wp on anonymous memory in that regard, even though,
mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Link: https://lkml.kernel.org/r/20221209080912.7968-1-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09 16:09:12 +08:00
|
|
|
userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
|
2017-02-23 07:42:27 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry(fctx, fcs, list)
|
|
|
|
if (fctx->orig == octx) {
|
|
|
|
ctx = fctx->new;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!ctx) {
|
|
|
|
fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
|
|
|
|
if (!fctx)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
|
|
|
|
if (!ctx) {
|
|
|
|
kfree(fctx);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2018-12-28 16:34:43 +08:00
|
|
|
refcount_set(&ctx->refcount, 1);
|
2017-02-23 07:42:27 +08:00
|
|
|
ctx->flags = octx->flags;
|
|
|
|
ctx->features = octx->features;
|
|
|
|
ctx->released = false;
|
2021-09-03 05:58:56 +08:00
|
|
|
atomic_set(&ctx->mmap_changing, 0);
|
2017-02-23 07:42:27 +08:00
|
|
|
ctx->mm = vma->vm_mm;
|
2017-11-16 09:36:56 +08:00
|
|
|
mmgrab(ctx->mm);
|
2017-02-23 07:42:27 +08:00
|
|
|
|
|
|
|
userfaultfd_ctx_get(octx);
|
2021-09-03 05:58:56 +08:00
|
|
|
atomic_inc(&octx->mmap_changing);
|
2017-02-23 07:42:27 +08:00
|
|
|
fctx->orig = octx;
|
|
|
|
fctx->new = ctx;
|
|
|
|
list_add_tail(&fctx->list, fcs);
|
|
|
|
}
|
|
|
|
|
|
|
|
vma->vm_userfaultfd_ctx.ctx = ctx;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-03-10 08:16:54 +08:00
|
|
|
static void dup_fctx(struct userfaultfd_fork_ctx *fctx)
|
2017-02-23 07:42:27 +08:00
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx = fctx->orig;
|
|
|
|
struct userfaultfd_wait_queue ewq;
|
|
|
|
|
|
|
|
msg_init(&ewq.msg);
|
|
|
|
|
|
|
|
ewq.msg.event = UFFD_EVENT_FORK;
|
|
|
|
ewq.msg.arg.reserved.reserved1 = (unsigned long)fctx->new;
|
|
|
|
|
2017-03-10 08:16:54 +08:00
|
|
|
userfaultfd_event_wait_completion(ctx, &ewq);
|
2017-02-23 07:42:27 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void dup_userfaultfd_complete(struct list_head *fcs)
|
|
|
|
{
|
|
|
|
struct userfaultfd_fork_ctx *fctx, *n;
|
|
|
|
|
|
|
|
list_for_each_entry_safe(fctx, n, fcs, list) {
|
2017-03-10 08:16:54 +08:00
|
|
|
dup_fctx(fctx);
|
2017-02-23 07:42:27 +08:00
|
|
|
list_del(&fctx->list);
|
|
|
|
kfree(fctx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-02-23 07:42:34 +08:00
|
|
|
void mremap_userfaultfd_prep(struct vm_area_struct *vma,
|
|
|
|
struct vm_userfaultfd_ctx *vm_ctx)
|
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx;
|
|
|
|
|
|
|
|
ctx = vma->vm_userfaultfd_ctx.ctx;
|
2018-12-28 16:38:47 +08:00
|
|
|
|
|
|
|
if (!ctx)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (ctx->features & UFFD_FEATURE_EVENT_REMAP) {
|
2017-02-23 07:42:34 +08:00
|
|
|
vm_ctx->ctx = ctx;
|
|
|
|
userfaultfd_ctx_get(ctx);
|
2021-09-03 05:58:56 +08:00
|
|
|
atomic_inc(&ctx->mmap_changing);
|
2018-12-28 16:38:47 +08:00
|
|
|
} else {
|
|
|
|
/* Drop uffd context if remap feature not enabled */
|
|
|
|
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
|
mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Currently, we don't enable writenotify when enabling userfaultfd-wp on a
shared writable mapping (for now only shmem and hugetlb). The consequence
is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we
only add permissions (e.g., mkwrite) but not remove permissions (e.g.,
wrprotect). For example, when enabling softdirty tracking, we enable
writenotify. With uffd-wp on shared mappings, that changed. More details
on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for a uffd-wp
PTEs/PMDs and manually write-protect PTEs/PMDs, which is error prone.
Prone to such issues is any code that uses vma->vm_page_prot to set PTE
permissions: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped
write-protected as default and we will only allow selected PTEs that are
definitely safe to be mapped without write-protection (see
can_change_pte_writable()) to be writable. In the future, we might want
to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more
locations, for example, also when removing uffd-wp protection.
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting
in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs
writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even
without NUMA hinting (which currently doesn't seem to be applicable to
shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On
such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page
tables to enforce the new default protection for the PTEs: we know that
they cannot be uffd-wp'ed yet, because that can only happen after enabling
uffd-wp for the VMA in general.
Also note that this makes mprotect() on ranges with uffd-wp'ed PTEs not
accidentally set the write bit -- which would result in uffd-wp not
triggering on later write access. This commit makes uffd-wp on shmem
behave just like uffd-wp on anonymous memory in that regard, even though,
mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Link: https://lkml.kernel.org/r/20221209080912.7968-1-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09 16:09:12 +08:00
|
|
|
userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
|
2017-02-23 07:42:34 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-02-23 07:42:37 +08:00
|
|
|
void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *vm_ctx,
|
2017-02-23 07:42:34 +08:00
|
|
|
unsigned long from, unsigned long to,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
2017-02-23 07:42:37 +08:00
|
|
|
struct userfaultfd_ctx *ctx = vm_ctx->ctx;
|
2017-02-23 07:42:34 +08:00
|
|
|
struct userfaultfd_wait_queue ewq;
|
|
|
|
|
|
|
|
if (!ctx)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (to & ~PAGE_MASK) {
|
|
|
|
userfaultfd_ctx_put(ctx);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
msg_init(&ewq.msg);
|
|
|
|
|
|
|
|
ewq.msg.event = UFFD_EVENT_REMAP;
|
|
|
|
ewq.msg.arg.remap.from = from;
|
|
|
|
ewq.msg.arg.remap.to = to;
|
|
|
|
ewq.msg.arg.remap.len = len;
|
|
|
|
|
|
|
|
userfaultfd_event_wait_completion(ctx, &ewq);
|
|
|
|
}
|
|
|
|
|
2017-03-10 08:17:11 +08:00
|
|
|
bool userfaultfd_remove(struct vm_area_struct *vma,
|
2017-02-25 06:56:02 +08:00
|
|
|
unsigned long start, unsigned long end)
|
2017-02-23 07:42:40 +08:00
|
|
|
{
|
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
|
|
|
struct userfaultfd_ctx *ctx;
|
|
|
|
struct userfaultfd_wait_queue ewq;
|
|
|
|
|
|
|
|
ctx = vma->vm_userfaultfd_ctx.ctx;
|
2017-02-25 06:56:02 +08:00
|
|
|
if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_REMOVE))
|
2017-03-10 08:17:11 +08:00
|
|
|
return true;
|
2017-02-23 07:42:40 +08:00
|
|
|
|
|
|
|
userfaultfd_ctx_get(ctx);
|
2021-09-03 05:58:56 +08:00
|
|
|
atomic_inc(&ctx->mmap_changing);
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_read_unlock(mm);
|
2017-02-23 07:42:40 +08:00
|
|
|
|
|
|
|
msg_init(&ewq.msg);
|
|
|
|
|
2017-02-25 06:56:02 +08:00
|
|
|
ewq.msg.event = UFFD_EVENT_REMOVE;
|
|
|
|
ewq.msg.arg.remove.start = start;
|
|
|
|
ewq.msg.arg.remove.end = end;
|
2017-02-23 07:42:40 +08:00
|
|
|
|
|
|
|
userfaultfd_event_wait_completion(ctx, &ewq);
|
|
|
|
|
2017-03-10 08:17:11 +08:00
|
|
|
return false;
|
2017-02-23 07:42:40 +08:00
|
|
|
}
|
|
|
|
|
2017-02-25 06:58:22 +08:00
|
|
|
static bool has_unmap_ctx(struct userfaultfd_ctx *ctx, struct list_head *unmaps,
|
|
|
|
unsigned long start, unsigned long end)
|
|
|
|
{
|
|
|
|
struct userfaultfd_unmap_ctx *unmap_ctx;
|
|
|
|
|
|
|
|
list_for_each_entry(unmap_ctx, unmaps, list)
|
|
|
|
if (unmap_ctx->ctx == ctx && unmap_ctx->start == start &&
|
|
|
|
unmap_ctx->end == end)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2022-09-07 03:48:57 +08:00
|
|
|
int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
|
|
|
|
unsigned long end, struct list_head *unmaps)
|
2017-02-25 06:58:22 +08:00
|
|
|
{
|
2022-09-07 03:48:57 +08:00
|
|
|
VMA_ITERATOR(vmi, mm, start);
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
|
|
|
|
for_each_vma_range(vmi, vma, end) {
|
2017-02-25 06:58:22 +08:00
|
|
|
struct userfaultfd_unmap_ctx *unmap_ctx;
|
|
|
|
struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
|
|
|
|
|
|
|
|
if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_UNMAP) ||
|
|
|
|
has_unmap_ctx(ctx, unmaps, start, end))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
unmap_ctx = kzalloc(sizeof(*unmap_ctx), GFP_KERNEL);
|
|
|
|
if (!unmap_ctx)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
userfaultfd_ctx_get(ctx);
|
2021-09-03 05:58:56 +08:00
|
|
|
atomic_inc(&ctx->mmap_changing);
|
2017-02-25 06:58:22 +08:00
|
|
|
unmap_ctx->ctx = ctx;
|
|
|
|
unmap_ctx->start = start;
|
|
|
|
unmap_ctx->end = end;
|
|
|
|
list_add_tail(&unmap_ctx->list, unmaps);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
void userfaultfd_unmap_complete(struct mm_struct *mm, struct list_head *uf)
|
|
|
|
{
|
|
|
|
struct userfaultfd_unmap_ctx *ctx, *n;
|
|
|
|
struct userfaultfd_wait_queue ewq;
|
|
|
|
|
|
|
|
list_for_each_entry_safe(ctx, n, uf, list) {
|
|
|
|
msg_init(&ewq.msg);
|
|
|
|
|
|
|
|
ewq.msg.event = UFFD_EVENT_UNMAP;
|
|
|
|
ewq.msg.arg.remove.start = ctx->start;
|
|
|
|
ewq.msg.arg.remove.end = ctx->end;
|
|
|
|
|
|
|
|
userfaultfd_event_wait_completion(ctx->ctx, &ewq);
|
|
|
|
|
|
|
|
list_del(&ctx->list);
|
|
|
|
kfree(ctx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
static int userfaultfd_release(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx = file->private_data;
|
|
|
|
struct mm_struct *mm = ctx->mm;
|
|
|
|
struct vm_area_struct *vma, *prev;
|
|
|
|
/* len == 0 means wake all */
|
|
|
|
struct userfaultfd_wake_range range = { .len = 0, };
|
|
|
|
unsigned long new_flags;
|
2022-09-07 03:48:57 +08:00
|
|
|
MA_STATE(mas, &mm->mm_mt, 0, 0);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 05:07:29 +08:00
|
|
|
WRITE_ONCE(ctx->released, true);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2016-05-21 07:58:36 +08:00
|
|
|
if (!mmget_not_zero(mm))
|
|
|
|
goto wakeup;
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* Flush page faults out of all CPUs. NOTE: all page faults
|
|
|
|
* must be retried without returning VM_FAULT_SIGBUS if
|
|
|
|
* userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
|
2020-06-09 12:33:54 +08:00
|
|
|
* changes while handle_userfault released the mmap_lock. So
|
2015-09-05 06:46:31 +08:00
|
|
|
* it's critical that released is set to true (above), before
|
2020-06-09 12:33:54 +08:00
|
|
|
* taking the mmap_lock for writing.
|
2015-09-05 06:46:31 +08:00
|
|
|
*/
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_lock(mm);
|
2015-09-05 06:46:31 +08:00
|
|
|
prev = NULL;
|
2022-09-07 03:48:57 +08:00
|
|
|
mas_for_each(&mas, vma, ULONG_MAX) {
|
2015-09-05 06:46:31 +08:00
|
|
|
cond_resched();
|
|
|
|
BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
!!(vma->vm_flags & __VM_UFFD_FLAGS));
|
2015-09-05 06:46:31 +08:00
|
|
|
if (vma->vm_userfaultfd_ctx.ctx != ctx) {
|
|
|
|
prev = vma;
|
|
|
|
continue;
|
|
|
|
}
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
|
2020-10-16 11:13:00 +08:00
|
|
|
prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
|
|
|
|
new_flags, vma->anon_vma,
|
|
|
|
vma->vm_file, vma->vm_pgoff,
|
|
|
|
vma_policy(vma),
|
2022-03-05 12:28:51 +08:00
|
|
|
NULL_VM_UFFD_CTX, anon_vma_name(vma));
|
2022-09-07 03:48:57 +08:00
|
|
|
if (prev) {
|
|
|
|
mas_pause(&mas);
|
2020-10-16 11:13:00 +08:00
|
|
|
vma = prev;
|
2022-09-07 03:48:57 +08:00
|
|
|
} else {
|
2020-10-16 11:13:00 +08:00
|
|
|
prev = vma;
|
2022-09-07 03:48:57 +08:00
|
|
|
}
|
|
|
|
|
mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Currently, we don't enable writenotify when enabling userfaultfd-wp on a
shared writable mapping (for now only shmem and hugetlb). The consequence
is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we
only add permissions (e.g., mkwrite) but not remove permissions (e.g.,
wrprotect). For example, when enabling softdirty tracking, we enable
writenotify. With uffd-wp on shared mappings, that changed. More details
on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for a uffd-wp
PTEs/PMDs and manually write-protect PTEs/PMDs, which is error prone.
Prone to such issues is any code that uses vma->vm_page_prot to set PTE
permissions: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped
write-protected as default and we will only allow selected PTEs that are
definitely safe to be mapped without write-protection (see
can_change_pte_writable()) to be writable. In the future, we might want
to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more
locations, for example, also when removing uffd-wp protection.
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting
in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs
writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even
without NUMA hinting (which currently doesn't seem to be applicable to
shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On
such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page
tables to enforce the new default protection for the PTEs: we know that
they cannot be uffd-wp'ed yet, because that can only happen after enabling
uffd-wp for the VMA in general.
Also note that this makes mprotect() on ranges with uffd-wp'ed PTEs not
accidentally set the write bit -- which would result in uffd-wp not
triggering on later write access. This commit makes uffd-wp on shmem
behave just like uffd-wp on anonymous memory in that regard, even though,
mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Link: https://lkml.kernel.org/r/20221209080912.7968-1-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09 16:09:12 +08:00
|
|
|
userfaultfd_set_vm_flags(vma, new_flags);
|
2015-09-05 06:46:31 +08:00
|
|
|
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
|
|
|
|
}
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_unlock(mm);
|
2016-05-21 07:58:36 +08:00
|
|
|
mmput(mm);
|
|
|
|
wakeup:
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
2015-09-05 06:46:44 +08:00
|
|
|
* After no new page faults can wait on this fault_*wqh, flush
|
2015-09-05 06:46:31 +08:00
|
|
|
* the last page faults that may have been already waiting on
|
2015-09-05 06:46:44 +08:00
|
|
|
* the fault_*wqh.
|
2015-09-05 06:46:31 +08:00
|
|
|
*/
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_lock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-23 05:58:49 +08:00
|
|
|
__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, &range);
|
2018-08-22 12:56:30 +08:00
|
|
|
__wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2017-08-03 04:32:24 +08:00
|
|
|
/* Flush pending events that may still wait on event_wqh */
|
|
|
|
wake_up_all(&ctx->event_wqh);
|
|
|
|
|
2018-02-12 06:34:03 +08:00
|
|
|
wake_up_poll(&ctx->fd_wqh, EPOLLHUP);
|
2015-09-05 06:46:31 +08:00
|
|
|
userfaultfd_ctx_put(ctx);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:44 +08:00
|
|
|
/* fault_pending_wqh.lock must be hold by the caller */
|
2017-02-23 07:42:18 +08:00
|
|
|
static inline struct userfaultfd_wait_queue *find_userfault_in(
|
|
|
|
wait_queue_head_t *wqh)
|
2015-09-05 06:46:31 +08:00
|
|
|
{
|
2017-06-20 18:06:13 +08:00
|
|
|
wait_queue_entry_t *wq;
|
2015-09-05 06:46:44 +08:00
|
|
|
struct userfaultfd_wait_queue *uwq;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2018-10-05 14:45:44 +08:00
|
|
|
lockdep_assert_held(&wqh->lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2015-09-05 06:46:44 +08:00
|
|
|
uwq = NULL;
|
2017-02-23 07:42:18 +08:00
|
|
|
if (!waitqueue_active(wqh))
|
2015-09-05 06:46:44 +08:00
|
|
|
goto out;
|
|
|
|
/* walk in reverse to provide FIFO behavior to read userfaults */
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
wq = list_last_entry(&wqh->head, typeof(*wq), entry);
|
2015-09-05 06:46:44 +08:00
|
|
|
uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
|
|
|
|
out:
|
|
|
|
return uwq;
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
2017-02-23 07:42:18 +08:00
|
|
|
|
|
|
|
static inline struct userfaultfd_wait_queue *find_userfault(
|
|
|
|
struct userfaultfd_ctx *ctx)
|
|
|
|
{
|
|
|
|
return find_userfault_in(&ctx->fault_pending_wqh);
|
|
|
|
}
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2017-02-23 07:42:21 +08:00
|
|
|
static inline struct userfaultfd_wait_queue *find_userfault_evt(
|
|
|
|
struct userfaultfd_ctx *ctx)
|
|
|
|
{
|
|
|
|
return find_userfault_in(&ctx->event_wqh);
|
|
|
|
}
|
|
|
|
|
2017-07-03 13:02:18 +08:00
|
|
|
static __poll_t userfaultfd_poll(struct file *file, poll_table *wait)
|
2015-09-05 06:46:31 +08:00
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx = file->private_data;
|
2017-07-03 13:02:18 +08:00
|
|
|
__poll_t ret;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
poll_wait(file, &ctx->fd_wqh, wait);
|
|
|
|
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
if (!userfaultfd_is_initialized(ctx))
|
2018-02-12 06:34:03 +08:00
|
|
|
return EPOLLERR;
|
2017-02-23 07:42:21 +08:00
|
|
|
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
/*
|
|
|
|
* poll() never guarantees that read won't block.
|
|
|
|
* userfaults can be waken before they're read().
|
|
|
|
*/
|
|
|
|
if (unlikely(!(file->f_flags & O_NONBLOCK)))
|
2018-02-12 06:34:03 +08:00
|
|
|
return EPOLLERR;
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
/*
|
|
|
|
* lockless access to see if there are pending faults
|
|
|
|
* __pollwait last action is the add_wait_queue but
|
|
|
|
* the spin_unlock would allow the waitqueue_active to
|
|
|
|
* pass above the actual list_add inside
|
|
|
|
* add_wait_queue critical section. So use a full
|
|
|
|
* memory barrier to serialize the list_add write of
|
|
|
|
* add_wait_queue() with the waitqueue_active read
|
|
|
|
* below.
|
|
|
|
*/
|
|
|
|
ret = 0;
|
|
|
|
smp_mb();
|
|
|
|
if (waitqueue_active(&ctx->fault_pending_wqh))
|
|
|
|
ret = EPOLLIN;
|
|
|
|
else if (waitqueue_active(&ctx->event_wqh))
|
|
|
|
ret = EPOLLIN;
|
|
|
|
|
|
|
|
return ret;
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
|
2017-02-23 07:42:27 +08:00
|
|
|
static const struct file_operations userfaultfd_fops;
|
|
|
|
|
2021-01-09 06:22:23 +08:00
|
|
|
static int resolve_userfault_fork(struct userfaultfd_ctx *new,
|
|
|
|
struct inode *inode,
|
2017-02-23 07:42:27 +08:00
|
|
|
struct uffd_msg *msg)
|
|
|
|
{
|
|
|
|
int fd;
|
|
|
|
|
2021-01-09 06:22:23 +08:00
|
|
|
fd = anon_inode_getfd_secure("[userfaultfd]", &userfaultfd_fops, new,
|
2022-07-08 17:34:51 +08:00
|
|
|
O_RDONLY | (new->flags & UFFD_SHARED_FCNTL_FLAGS), inode);
|
2017-02-23 07:42:27 +08:00
|
|
|
if (fd < 0)
|
|
|
|
return fd;
|
|
|
|
|
|
|
|
msg->arg.reserved.reserved1 = 0;
|
|
|
|
msg->arg.fork.ufd = fd;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
|
2021-01-09 06:22:23 +08:00
|
|
|
struct uffd_msg *msg, struct inode *inode)
|
2015-09-05 06:46:31 +08:00
|
|
|
{
|
|
|
|
ssize_t ret;
|
|
|
|
DECLARE_WAITQUEUE(wait, current);
|
2015-09-05 06:46:44 +08:00
|
|
|
struct userfaultfd_wait_queue *uwq;
|
2017-02-23 07:42:27 +08:00
|
|
|
/*
|
|
|
|
* Handling fork event requires sleeping operations, so
|
|
|
|
* we drop the event_wqh lock, then do these ops, then
|
|
|
|
* lock it back and wake up the waiter. While the lock is
|
|
|
|
* dropped the ewq may go away so we keep track of it
|
|
|
|
* carefully.
|
|
|
|
*/
|
|
|
|
LIST_HEAD(fork_event);
|
|
|
|
struct userfaultfd_ctx *fork_nctx = NULL;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2015-09-05 06:46:44 +08:00
|
|
|
/* always take the fd_wqh lock before the fault_pending_wqh lock */
|
2018-10-27 06:02:19 +08:00
|
|
|
spin_lock_irq(&ctx->fd_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
__add_wait_queue(&ctx->fd_wqh, &wait);
|
|
|
|
for (;;) {
|
|
|
|
set_current_state(TASK_INTERRUPTIBLE);
|
2015-09-05 06:46:44 +08:00
|
|
|
spin_lock(&ctx->fault_pending_wqh.lock);
|
|
|
|
uwq = find_userfault(ctx);
|
|
|
|
if (uwq) {
|
2015-09-05 06:47:23 +08:00
|
|
|
/*
|
|
|
|
* Use a seqcount to repeat the lockless check
|
|
|
|
* in wake_userfault() to avoid missing
|
|
|
|
* wakeups because during the refile both
|
|
|
|
* waitqueue could become empty if this is the
|
|
|
|
* only userfault.
|
|
|
|
*/
|
|
|
|
write_seqcount_begin(&ctx->refile_seq);
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
2015-09-05 06:46:44 +08:00
|
|
|
* The fault_pending_wqh.lock prevents the uwq
|
|
|
|
* to disappear from under us.
|
|
|
|
*
|
|
|
|
* Refile this userfault from
|
|
|
|
* fault_pending_wqh to fault_wqh, it's not
|
|
|
|
* pending anymore after we read it.
|
|
|
|
*
|
|
|
|
* Use list_del() by hand (as
|
|
|
|
* userfaultfd_wake_function also uses
|
|
|
|
* list_del_init() by hand) to be sure nobody
|
|
|
|
* changes __remove_wait_queue() to use
|
|
|
|
* list_del_init() in turn breaking the
|
|
|
|
* !list_empty_careful() check in
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
* handle_userfault(). The uwq->wq.head list
|
2015-09-05 06:46:44 +08:00
|
|
|
* must never be empty at any time during the
|
|
|
|
* refile, or the waitqueue could disappear
|
|
|
|
* from under us. The "wait_queue_head_t"
|
|
|
|
* parameter of __remove_wait_queue() is unused
|
|
|
|
* anyway.
|
2015-09-05 06:46:31 +08:00
|
|
|
*/
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
list_del(&uwq->wq.entry);
|
2018-08-22 12:56:30 +08:00
|
|
|
add_wait_queue(&ctx->fault_wqh, &uwq->wq);
|
2015-09-05 06:46:44 +08:00
|
|
|
|
2015-09-05 06:47:23 +08:00
|
|
|
write_seqcount_end(&ctx->refile_seq);
|
|
|
|
|
2015-09-05 06:46:37 +08:00
|
|
|
/* careful to always initialize msg if ret == 0 */
|
|
|
|
*msg = uwq->msg;
|
2015-09-05 06:46:44 +08:00
|
|
|
spin_unlock(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
2015-09-05 06:46:44 +08:00
|
|
|
spin_unlock(&ctx->fault_pending_wqh.lock);
|
2017-02-23 07:42:21 +08:00
|
|
|
|
|
|
|
spin_lock(&ctx->event_wqh.lock);
|
|
|
|
uwq = find_userfault_evt(ctx);
|
|
|
|
if (uwq) {
|
|
|
|
*msg = uwq->msg;
|
|
|
|
|
2017-02-23 07:42:27 +08:00
|
|
|
if (uwq->msg.event == UFFD_EVENT_FORK) {
|
|
|
|
fork_nctx = (struct userfaultfd_ctx *)
|
|
|
|
(unsigned long)
|
|
|
|
uwq->msg.arg.reserved.reserved1;
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
list_move(&uwq->wq.entry, &fork_event);
|
2017-10-04 07:15:38 +08:00
|
|
|
/*
|
|
|
|
* fork_nctx can be freed as soon as
|
|
|
|
* we drop the lock, unless we take a
|
|
|
|
* reference on it.
|
|
|
|
*/
|
|
|
|
userfaultfd_ctx_get(fork_nctx);
|
2017-02-23 07:42:27 +08:00
|
|
|
spin_unlock(&ctx->event_wqh.lock);
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2017-02-23 07:42:21 +08:00
|
|
|
userfaultfd_event_complete(ctx, uwq);
|
|
|
|
spin_unlock(&ctx->event_wqh.lock);
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&ctx->event_wqh.lock);
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
if (signal_pending(current)) {
|
|
|
|
ret = -ERESTARTSYS;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (no_wait) {
|
|
|
|
ret = -EAGAIN;
|
|
|
|
break;
|
|
|
|
}
|
2018-10-27 06:02:19 +08:00
|
|
|
spin_unlock_irq(&ctx->fd_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
schedule();
|
2018-10-27 06:02:19 +08:00
|
|
|
spin_lock_irq(&ctx->fd_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
__remove_wait_queue(&ctx->fd_wqh, &wait);
|
|
|
|
__set_current_state(TASK_RUNNING);
|
2018-10-27 06:02:19 +08:00
|
|
|
spin_unlock_irq(&ctx->fd_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2017-02-23 07:42:27 +08:00
|
|
|
if (!ret && msg->event == UFFD_EVENT_FORK) {
|
2021-01-09 06:22:23 +08:00
|
|
|
ret = resolve_userfault_fork(fork_nctx, inode, msg);
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_lock_irq(&ctx->event_wqh.lock);
|
2017-10-04 07:15:38 +08:00
|
|
|
if (!list_empty(&fork_event)) {
|
|
|
|
/*
|
|
|
|
* The fork thread didn't abort, so we can
|
|
|
|
* drop the temporary refcount.
|
|
|
|
*/
|
|
|
|
userfaultfd_ctx_put(fork_nctx);
|
|
|
|
|
|
|
|
uwq = list_first_entry(&fork_event,
|
|
|
|
typeof(*uwq),
|
|
|
|
wq.entry);
|
|
|
|
/*
|
|
|
|
* If fork_event list wasn't empty and in turn
|
|
|
|
* the event wasn't already released by fork
|
|
|
|
* (the event is allocated on fork kernel
|
|
|
|
* stack), put the event back to its place in
|
|
|
|
* the event_wq. fork_event head will be freed
|
|
|
|
* as soon as we return so the event cannot
|
|
|
|
* stay queued there no matter the current
|
|
|
|
* "ret" value.
|
|
|
|
*/
|
|
|
|
list_del(&uwq->wq.entry);
|
|
|
|
__add_wait_queue(&ctx->event_wqh, &uwq->wq);
|
2017-02-23 07:42:27 +08:00
|
|
|
|
2017-10-04 07:15:38 +08:00
|
|
|
/*
|
|
|
|
* Leave the event in the waitqueue and report
|
|
|
|
* error to userland if we failed to resolve
|
|
|
|
* the userfault fork.
|
|
|
|
*/
|
|
|
|
if (likely(!ret))
|
2017-02-23 07:42:27 +08:00
|
|
|
userfaultfd_event_complete(ctx, uwq);
|
2017-10-04 07:15:38 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Here the fork thread aborted and the
|
|
|
|
* refcount from the fork thread on fork_nctx
|
|
|
|
* has already been released. We still hold
|
|
|
|
* the reference we took before releasing the
|
|
|
|
* lock above. If resolve_userfault_fork
|
|
|
|
* failed we've to drop it because the
|
|
|
|
* fork_nctx has to be freed in such case. If
|
|
|
|
* it succeeded we'll hold it because the new
|
|
|
|
* uffd references it.
|
|
|
|
*/
|
|
|
|
if (ret)
|
|
|
|
userfaultfd_ctx_put(fork_nctx);
|
2017-02-23 07:42:27 +08:00
|
|
|
}
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_unlock_irq(&ctx->event_wqh.lock);
|
2017-02-23 07:42:27 +08:00
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t userfaultfd_read(struct file *file, char __user *buf,
|
|
|
|
size_t count, loff_t *ppos)
|
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx = file->private_data;
|
|
|
|
ssize_t _ret, ret = 0;
|
2015-09-05 06:46:37 +08:00
|
|
|
struct uffd_msg msg;
|
2015-09-05 06:46:31 +08:00
|
|
|
int no_wait = file->f_flags & O_NONBLOCK;
|
2021-01-09 06:22:23 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
if (!userfaultfd_is_initialized(ctx))
|
2015-09-05 06:46:31 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
for (;;) {
|
2015-09-05 06:46:37 +08:00
|
|
|
if (count < sizeof(msg))
|
2015-09-05 06:46:31 +08:00
|
|
|
return ret ? ret : -EINVAL;
|
2021-01-09 06:22:23 +08:00
|
|
|
_ret = userfaultfd_ctx_read(ctx, no_wait, &msg, inode);
|
2015-09-05 06:46:31 +08:00
|
|
|
if (_ret < 0)
|
|
|
|
return ret ? ret : _ret;
|
2015-09-05 06:46:37 +08:00
|
|
|
if (copy_to_user((__u64 __user *) buf, &msg, sizeof(msg)))
|
2015-09-05 06:46:31 +08:00
|
|
|
return ret ? ret : -EFAULT;
|
2015-09-05 06:46:37 +08:00
|
|
|
ret += sizeof(msg);
|
|
|
|
buf += sizeof(msg);
|
|
|
|
count -= sizeof(msg);
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* Allow to read more than one fault at time but only
|
|
|
|
* block if waiting for the very first one.
|
|
|
|
*/
|
|
|
|
no_wait = O_NONBLOCK;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __wake_userfault(struct userfaultfd_ctx *ctx,
|
|
|
|
struct userfaultfd_wake_range *range)
|
|
|
|
{
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_lock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
/* wake all in the range and autoremove */
|
2015-09-05 06:46:44 +08:00
|
|
|
if (waitqueue_active(&ctx->fault_pending_wqh))
|
2015-09-23 05:58:49 +08:00
|
|
|
__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
|
2015-09-05 06:46:44 +08:00
|
|
|
range);
|
|
|
|
if (waitqueue_active(&ctx->fault_wqh))
|
2018-08-22 12:56:30 +08:00
|
|
|
__wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, range);
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
|
|
|
|
struct userfaultfd_wake_range *range)
|
|
|
|
{
|
2015-09-05 06:47:23 +08:00
|
|
|
unsigned seq;
|
|
|
|
bool need_wakeup;
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* To be sure waitqueue_active() is not reordered by the CPU
|
|
|
|
* before the pagetable update, use an explicit SMP memory
|
2020-06-09 12:33:51 +08:00
|
|
|
* barrier here. PT lock release or mmap_read_unlock(mm) still
|
2015-09-05 06:46:31 +08:00
|
|
|
* have release semantics that can allow the
|
|
|
|
* waitqueue_active() to be reordered before the pte update.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Use waitqueue_active because it's very frequent to
|
|
|
|
* change the address space atomically even if there are no
|
|
|
|
* userfaults yet. So we take the spinlock only when we're
|
|
|
|
* sure we've userfaults to wake.
|
|
|
|
*/
|
2015-09-05 06:47:23 +08:00
|
|
|
do {
|
|
|
|
seq = read_seqcount_begin(&ctx->refile_seq);
|
|
|
|
need_wakeup = waitqueue_active(&ctx->fault_pending_wqh) ||
|
|
|
|
waitqueue_active(&ctx->fault_wqh);
|
|
|
|
cond_resched();
|
|
|
|
} while (read_seqcount_retry(&ctx->refile_seq, seq));
|
|
|
|
if (need_wakeup)
|
2015-09-05 06:46:31 +08:00
|
|
|
__wake_userfault(ctx, range);
|
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline int validate_range(struct mm_struct *mm,
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
__u64 start, __u64 len)
|
2015-09-05 06:46:31 +08:00
|
|
|
{
|
|
|
|
__u64 task_size = mm->task_size;
|
|
|
|
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
if (start & ~PAGE_MASK)
|
2015-09-05 06:46:31 +08:00
|
|
|
return -EINVAL;
|
|
|
|
if (len & ~PAGE_MASK)
|
|
|
|
return -EINVAL;
|
|
|
|
if (!len)
|
|
|
|
return -EINVAL;
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
if (start < mmap_min_addr)
|
2015-09-05 06:46:31 +08:00
|
|
|
return -EINVAL;
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
if (start >= task_size)
|
2015-09-05 06:46:31 +08:00
|
|
|
return -EINVAL;
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
if (len > task_size - start)
|
2015-09-05 06:46:31 +08:00
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int userfaultfd_register(struct userfaultfd_ctx *ctx,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
struct mm_struct *mm = ctx->mm;
|
|
|
|
struct vm_area_struct *vma, *prev, *cur;
|
|
|
|
int ret;
|
|
|
|
struct uffdio_register uffdio_register;
|
|
|
|
struct uffdio_register __user *user_uffdio_register;
|
|
|
|
unsigned long vm_flags, new_flags;
|
|
|
|
bool found;
|
2017-09-07 07:23:12 +08:00
|
|
|
bool basic_ioctls;
|
2015-09-05 06:46:31 +08:00
|
|
|
unsigned long start, end, vma_end;
|
2022-09-07 03:48:57 +08:00
|
|
|
MA_STATE(mas, &mm->mm_mt, 0, 0);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
user_uffdio_register = (struct uffdio_register __user *) arg;
|
|
|
|
|
|
|
|
ret = -EFAULT;
|
|
|
|
if (copy_from_user(&uffdio_register, user_uffdio_register,
|
|
|
|
sizeof(uffdio_register)-sizeof(__u64)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (!uffdio_register.mode)
|
|
|
|
goto out;
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
if (uffdio_register.mode & ~UFFD_API_REGISTER_MODES)
|
2015-09-05 06:46:31 +08:00
|
|
|
goto out;
|
|
|
|
vm_flags = 0;
|
|
|
|
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
|
|
|
|
vm_flags |= VM_UFFD_MISSING;
|
2021-07-01 09:49:06 +08:00
|
|
|
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
|
|
|
|
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
|
|
|
|
goto out;
|
|
|
|
#endif
|
2015-09-05 06:46:31 +08:00
|
|
|
vm_flags |= VM_UFFD_WP;
|
2021-07-01 09:49:06 +08:00
|
|
|
}
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR) {
|
|
|
|
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
|
|
|
|
goto out;
|
|
|
|
#endif
|
|
|
|
vm_flags |= VM_UFFD_MINOR;
|
|
|
|
}
|
2015-09-05 06:46:31 +08:00
|
|
|
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
ret = validate_range(mm, uffdio_register.range.start,
|
2015-09-05 06:46:31 +08:00
|
|
|
uffdio_register.range.len);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
start = uffdio_register.range.start;
|
|
|
|
end = start + uffdio_register.range.len;
|
|
|
|
|
2016-05-21 07:58:36 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
if (!mmget_not_zero(mm))
|
|
|
|
goto out;
|
|
|
|
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_lock(mm);
|
2022-09-07 03:48:57 +08:00
|
|
|
mas_set(&mas, start);
|
|
|
|
vma = mas_find(&mas, ULONG_MAX);
|
2015-09-05 06:46:31 +08:00
|
|
|
if (!vma)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
/* check that there's at least one vma in the range */
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (vma->vm_start >= end)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2017-02-23 07:43:04 +08:00
|
|
|
/*
|
|
|
|
* If the first vma contains huge pages, make sure start address
|
|
|
|
* is aligned to huge page size.
|
|
|
|
*/
|
|
|
|
if (is_vm_hugetlb_page(vma)) {
|
|
|
|
unsigned long vma_hpagesize = vma_kernel_pagesize(vma);
|
|
|
|
|
|
|
|
if (start & (vma_hpagesize - 1))
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* Search for not compatible vmas.
|
|
|
|
*/
|
|
|
|
found = false;
|
2017-09-07 07:23:12 +08:00
|
|
|
basic_ioctls = false;
|
2022-09-07 03:48:57 +08:00
|
|
|
for (cur = vma; cur; cur = mas_next(&mas, end - 1)) {
|
2015-09-05 06:46:31 +08:00
|
|
|
cond_resched();
|
|
|
|
|
|
|
|
BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
!!(cur->vm_flags & __VM_UFFD_FLAGS));
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
/* check not compatible vmas */
|
|
|
|
ret = -EINVAL;
|
2020-04-07 11:06:12 +08:00
|
|
|
if (!vma_can_userfault(cur, vm_flags))
|
2015-09-05 06:46:31 +08:00
|
|
|
goto out_unlock;
|
2018-12-01 06:09:32 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* UFFDIO_COPY will fill file holes even without
|
|
|
|
* PROT_WRITE. This check enforces that if this is a
|
|
|
|
* MAP_SHARED, the process has write permission to the backing
|
|
|
|
* file. If VM_MAYWRITE is set it also enforces that on a
|
|
|
|
* MAP_SHARED vma: there is no F_WRITE_SEAL and no further
|
|
|
|
* F_WRITE_SEAL can be taken until the vma is destroyed.
|
|
|
|
*/
|
|
|
|
ret = -EPERM;
|
|
|
|
if (unlikely(!(cur->vm_flags & VM_MAYWRITE)))
|
|
|
|
goto out_unlock;
|
|
|
|
|
2017-02-23 07:43:04 +08:00
|
|
|
/*
|
|
|
|
* If this vma contains ending address, and huge pages
|
|
|
|
* check alignment.
|
|
|
|
*/
|
|
|
|
if (is_vm_hugetlb_page(cur) && end <= cur->vm_end &&
|
|
|
|
end > cur->vm_start) {
|
|
|
|
unsigned long vma_hpagesize = vma_kernel_pagesize(cur);
|
|
|
|
|
|
|
|
ret = -EINVAL;
|
|
|
|
|
|
|
|
if (end & (vma_hpagesize - 1))
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2020-04-07 11:06:12 +08:00
|
|
|
if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_MAYWRITE))
|
|
|
|
goto out_unlock;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check that this vma isn't already owned by a
|
|
|
|
* different userfaultfd. We can't allow more than one
|
|
|
|
* userfaultfd to own a single vma simultaneously or we
|
|
|
|
* wouldn't know which one to deliver the userfaults to.
|
|
|
|
*/
|
|
|
|
ret = -EBUSY;
|
|
|
|
if (cur->vm_userfaultfd_ctx.ctx &&
|
|
|
|
cur->vm_userfaultfd_ctx.ctx != ctx)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2017-02-23 07:43:04 +08:00
|
|
|
/*
|
|
|
|
* Note vmas containing huge pages
|
|
|
|
*/
|
2017-09-07 07:23:12 +08:00
|
|
|
if (is_vm_hugetlb_page(cur))
|
|
|
|
basic_ioctls = true;
|
2017-02-23 07:43:04 +08:00
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
found = true;
|
|
|
|
}
|
|
|
|
BUG_ON(!found);
|
|
|
|
|
2022-09-07 03:48:57 +08:00
|
|
|
mas_set(&mas, start);
|
|
|
|
prev = mas_prev(&mas, 0);
|
|
|
|
if (prev != vma)
|
|
|
|
mas_next(&mas, ULONG_MAX);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
do {
|
|
|
|
cond_resched();
|
|
|
|
|
2020-04-07 11:06:12 +08:00
|
|
|
BUG_ON(!vma_can_userfault(vma, vm_flags));
|
2015-09-05 06:46:31 +08:00
|
|
|
BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
|
|
|
|
vma->vm_userfaultfd_ctx.ctx != ctx);
|
2018-12-01 06:09:32 +08:00
|
|
|
WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Nothing to do: this vma is already registered into this
|
|
|
|
* userfaultfd and with the right tracking mode too.
|
|
|
|
*/
|
|
|
|
if (vma->vm_userfaultfd_ctx.ctx == ctx &&
|
|
|
|
(vma->vm_flags & vm_flags) == vm_flags)
|
|
|
|
goto skip;
|
|
|
|
|
|
|
|
if (vma->vm_start > start)
|
|
|
|
start = vma->vm_start;
|
|
|
|
vma_end = min(end, vma->vm_end);
|
|
|
|
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
|
2015-09-05 06:46:31 +08:00
|
|
|
prev = vma_merge(mm, prev, start, vma_end, new_flags,
|
|
|
|
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
|
|
|
|
vma_policy(vma),
|
mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 06:05:59 +08:00
|
|
|
((struct vm_userfaultfd_ctx){ ctx }),
|
2022-03-05 12:28:51 +08:00
|
|
|
anon_vma_name(vma));
|
2015-09-05 06:46:31 +08:00
|
|
|
if (prev) {
|
2022-09-07 03:48:57 +08:00
|
|
|
/* vma_merge() invalidated the mas */
|
|
|
|
mas_pause(&mas);
|
2015-09-05 06:46:31 +08:00
|
|
|
vma = prev;
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
if (vma->vm_start < start) {
|
|
|
|
ret = split_vma(mm, vma, start, 1);
|
|
|
|
if (ret)
|
|
|
|
break;
|
2022-09-07 03:48:57 +08:00
|
|
|
/* split_vma() invalidated the mas */
|
|
|
|
mas_pause(&mas);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
if (vma->vm_end > end) {
|
|
|
|
ret = split_vma(mm, vma, end, 0);
|
|
|
|
if (ret)
|
|
|
|
break;
|
2022-09-07 03:48:57 +08:00
|
|
|
/* split_vma() invalidated the mas */
|
|
|
|
mas_pause(&mas);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
next:
|
|
|
|
/*
|
|
|
|
* In the vma_merge() successful mprotect-like case 8:
|
|
|
|
* the next vma was merged into the current one and
|
|
|
|
* the current one has not been updated yet.
|
|
|
|
*/
|
mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Currently, we don't enable writenotify when enabling userfaultfd-wp on a
shared writable mapping (for now only shmem and hugetlb). The consequence
is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we
only add permissions (e.g., mkwrite) but not remove permissions (e.g.,
wrprotect). For example, when enabling softdirty tracking, we enable
writenotify. With uffd-wp on shared mappings, that changed. More details
on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for a uffd-wp
PTEs/PMDs and manually write-protect PTEs/PMDs, which is error prone.
Prone to such issues is any code that uses vma->vm_page_prot to set PTE
permissions: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped
write-protected as default and we will only allow selected PTEs that are
definitely safe to be mapped without write-protection (see
can_change_pte_writable()) to be writable. In the future, we might want
to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more
locations, for example, also when removing uffd-wp protection.
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting
in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs
writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even
without NUMA hinting (which currently doesn't seem to be applicable to
shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On
such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page
tables to enforce the new default protection for the PTEs: we know that
they cannot be uffd-wp'ed yet, because that can only happen after enabling
uffd-wp for the VMA in general.
Also note that this makes mprotect() on ranges with uffd-wp'ed PTEs not
accidentally set the write bit -- which would result in uffd-wp not
triggering on later write access. This commit makes uffd-wp on shmem
behave just like uffd-wp on anonymous memory in that regard, even though,
mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Link: https://lkml.kernel.org/r/20221209080912.7968-1-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09 16:09:12 +08:00
|
|
|
userfaultfd_set_vm_flags(vma, new_flags);
|
2015-09-05 06:46:31 +08:00
|
|
|
vma->vm_userfaultfd_ctx.ctx = ctx;
|
|
|
|
|
2021-05-05 09:33:13 +08:00
|
|
|
if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
|
|
|
|
hugetlb_unshare_all_pmds(vma);
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
skip:
|
|
|
|
prev = vma;
|
|
|
|
start = vma->vm_end;
|
2022-09-07 03:48:57 +08:00
|
|
|
vma = mas_next(&mas, end - 1);
|
|
|
|
} while (vma);
|
2015-09-05 06:46:31 +08:00
|
|
|
out_unlock:
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_unlock(mm);
|
2016-05-21 07:58:36 +08:00
|
|
|
mmput(mm);
|
2015-09-05 06:46:31 +08:00
|
|
|
if (!ret) {
|
2020-04-07 11:06:29 +08:00
|
|
|
__u64 ioctls_out;
|
|
|
|
|
|
|
|
ioctls_out = basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
|
|
|
|
UFFD_API_RANGE_IOCTLS;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Declare the WP ioctl only if the WP mode is
|
|
|
|
* specified and all checks passed with the range
|
|
|
|
*/
|
|
|
|
if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
|
|
|
|
ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);
|
|
|
|
|
userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It
might change the contents of the page using its second non-UFFD mapping,
or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:49 +08:00
|
|
|
/* CONTINUE ioctl is only supported for MINOR ranges. */
|
|
|
|
if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR))
|
|
|
|
ioctls_out &= ~((__u64)1 << _UFFDIO_CONTINUE);
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* Now that we scanned all vmas we can already tell
|
|
|
|
* userland which ioctls methods are guaranteed to
|
|
|
|
* succeed on this range.
|
|
|
|
*/
|
2020-04-07 11:06:29 +08:00
|
|
|
if (put_user(ioctls_out, &user_uffdio_register->ioctls))
|
2015-09-05 06:46:31 +08:00
|
|
|
ret = -EFAULT;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
struct mm_struct *mm = ctx->mm;
|
|
|
|
struct vm_area_struct *vma, *prev, *cur;
|
|
|
|
int ret;
|
|
|
|
struct uffdio_range uffdio_unregister;
|
|
|
|
unsigned long new_flags;
|
|
|
|
bool found;
|
|
|
|
unsigned long start, end, vma_end;
|
|
|
|
const void __user *buf = (void __user *)arg;
|
2022-09-07 03:48:57 +08:00
|
|
|
MA_STATE(mas, &mm->mm_mt, 0, 0);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
ret = -EFAULT;
|
|
|
|
if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
|
|
|
|
goto out;
|
|
|
|
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
ret = validate_range(mm, uffdio_unregister.start,
|
2015-09-05 06:46:31 +08:00
|
|
|
uffdio_unregister.len);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
start = uffdio_unregister.start;
|
|
|
|
end = start + uffdio_unregister.len;
|
|
|
|
|
2016-05-21 07:58:36 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
if (!mmget_not_zero(mm))
|
|
|
|
goto out;
|
|
|
|
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_lock(mm);
|
2022-09-07 03:48:57 +08:00
|
|
|
mas_set(&mas, start);
|
|
|
|
vma = mas_find(&mas, ULONG_MAX);
|
2015-09-05 06:46:31 +08:00
|
|
|
if (!vma)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
/* check that there's at least one vma in the range */
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (vma->vm_start >= end)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2017-02-23 07:43:04 +08:00
|
|
|
/*
|
|
|
|
* If the first vma contains huge pages, make sure start address
|
|
|
|
* is aligned to huge page size.
|
|
|
|
*/
|
|
|
|
if (is_vm_hugetlb_page(vma)) {
|
|
|
|
unsigned long vma_hpagesize = vma_kernel_pagesize(vma);
|
|
|
|
|
|
|
|
if (start & (vma_hpagesize - 1))
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* Search for not compatible vmas.
|
|
|
|
*/
|
|
|
|
found = false;
|
|
|
|
ret = -EINVAL;
|
2022-09-07 03:48:57 +08:00
|
|
|
for (cur = vma; cur; cur = mas_next(&mas, end - 1)) {
|
2015-09-05 06:46:31 +08:00
|
|
|
cond_resched();
|
|
|
|
|
|
|
|
BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
!!(cur->vm_flags & __VM_UFFD_FLAGS));
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check not compatible vmas, not strictly required
|
|
|
|
* here as not compatible vmas cannot have an
|
|
|
|
* userfaultfd_ctx registered on them, but this
|
|
|
|
* provides for more strict behavior to notice
|
|
|
|
* unregistration errors.
|
|
|
|
*/
|
2020-04-07 11:06:12 +08:00
|
|
|
if (!vma_can_userfault(cur, cur->vm_flags))
|
2015-09-05 06:46:31 +08:00
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
found = true;
|
|
|
|
}
|
|
|
|
BUG_ON(!found);
|
|
|
|
|
2022-09-07 03:48:57 +08:00
|
|
|
mas_set(&mas, start);
|
|
|
|
prev = mas_prev(&mas, 0);
|
|
|
|
if (prev != vma)
|
|
|
|
mas_next(&mas, ULONG_MAX);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
do {
|
|
|
|
cond_resched();
|
|
|
|
|
2020-04-07 11:06:12 +08:00
|
|
|
BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Nothing to do: this vma is already registered into this
|
|
|
|
* userfaultfd and with the right tracking mode too.
|
|
|
|
*/
|
|
|
|
if (!vma->vm_userfaultfd_ctx.ctx)
|
|
|
|
goto skip;
|
|
|
|
|
2018-12-15 06:17:17 +08:00
|
|
|
WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
if (vma->vm_start > start)
|
|
|
|
start = vma->vm_start;
|
|
|
|
vma_end = min(end, vma->vm_end);
|
|
|
|
|
2017-02-23 07:42:46 +08:00
|
|
|
if (userfaultfd_missing(vma)) {
|
|
|
|
/*
|
|
|
|
* Wake any concurrent pending userfault while
|
|
|
|
* we unregister, so they will not hang
|
|
|
|
* permanently and it avoids userland to call
|
|
|
|
* UFFDIO_WAKE explicitly.
|
|
|
|
*/
|
|
|
|
struct userfaultfd_wake_range range;
|
|
|
|
range.start = start;
|
|
|
|
range.len = vma_end - start;
|
|
|
|
wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
|
|
|
|
}
|
|
|
|
|
mm/uffd: reset write protection when unregister with wp-mode
The motivation of this patch comes from a recent report and patchfix from
David Hildenbrand on hugetlb shared handling of wr-protected page [1].
With the reproducer provided in commit message of [1], one can leverage
the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
not only the attacker process, but also the whole system.
The lazy-reset mechanism of uffd-wp was used to make unregister faster,
meanwhile it has an assumption that any leftover pgtable entries should
only affect the process on its own, so not only the user should be aware
of anything it does, but also it should not affect outside of the process.
But it seems that this is not true, and it can also be utilized to make
some exploit easier.
So far there's no clue showing that the lazy-reset is important to any
userfaultfd users because normally the unregister will only happen once
for a specific range of memory of the lifecycle of the process.
Considering all above, what this patch proposes is to do explicit pte
resets when unregister an uffd region with wr-protect mode enabled.
It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
right before ioctl(UFFDIO_UNREGISTER) for the user. So potentially it'll
make the unregister slower. From that pov it's a very slight abi change,
but hopefully nothing should break with this change either.
Regarding to the change itself - core of uffd write [un]protect operation
is moved into a separate function (uffd_wp_range()) and it is reused in
the unregister code path.
Note that the new function will not check for anything, e.g. ranges or
memory types, because they should have been checked during the previous
UFFDIO_REGISTER or it should have failed already. It also doesn't check
mmap_changing because we're with mmap write lock held anyway.
I added a Fixes upon introducing of uffd-wp shmem+hugetlbfs because that's
the only issue reported so far and that's the commit David's reproducer
will start working (v5.19+). But the whole idea actually applies to not
only file memories but also anonymous. It's just that we don't need to
fix anonymous prior to v5.19- because there's no known way to exploit.
IOW, this patch can also fix the issue reported in [1] as the patch 2 does.
[1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-12 04:13:40 +08:00
|
|
|
/* Reset ptes for the whole vma range if wr-protected */
|
|
|
|
if (userfaultfd_wp(vma))
|
|
|
|
uffd_wp_range(mm, vma, start, vma_end - start, false);
|
|
|
|
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
|
2015-09-05 06:46:31 +08:00
|
|
|
prev = vma_merge(mm, prev, start, vma_end, new_flags,
|
|
|
|
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
|
|
|
|
vma_policy(vma),
|
2022-03-05 12:28:51 +08:00
|
|
|
NULL_VM_UFFD_CTX, anon_vma_name(vma));
|
2015-09-05 06:46:31 +08:00
|
|
|
if (prev) {
|
|
|
|
vma = prev;
|
2022-11-08 04:11:42 +08:00
|
|
|
mas_pause(&mas);
|
2015-09-05 06:46:31 +08:00
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
if (vma->vm_start < start) {
|
|
|
|
ret = split_vma(mm, vma, start, 1);
|
|
|
|
if (ret)
|
|
|
|
break;
|
2022-11-08 04:11:42 +08:00
|
|
|
mas_pause(&mas);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
if (vma->vm_end > end) {
|
|
|
|
ret = split_vma(mm, vma, end, 0);
|
|
|
|
if (ret)
|
|
|
|
break;
|
2022-11-08 04:11:42 +08:00
|
|
|
mas_pause(&mas);
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
next:
|
|
|
|
/*
|
|
|
|
* In the vma_merge() successful mprotect-like case 8:
|
|
|
|
* the next vma was merged into the current one and
|
|
|
|
* the current one has not been updated yet.
|
|
|
|
*/
|
mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Currently, we don't enable writenotify when enabling userfaultfd-wp on a
shared writable mapping (for now only shmem and hugetlb). The consequence
is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).
So far, vma->vm_page_prot is assumed to be a safe default, meaning that we
only add permissions (e.g., mkwrite) but not remove permissions (e.g.,
wrprotect). For example, when enabling softdirty tracking, we enable
writenotify. With uffd-wp on shared mappings, that changed. More details
on vma->vm_page_prot semantics were summarized in [1].
This is problematic for uffd-wp: we'd have to manually check for a uffd-wp
PTEs/PMDs and manually write-protect PTEs/PMDs, which is error prone.
Prone to such issues is any code that uses vma->vm_page_prot to set PTE
permissions: primarily pte_modify() and mk_pte().
Instead, let's enable writenotify such that PTEs/PMDs/... will be mapped
write-protected as default and we will only allow selected PTEs that are
definitely safe to be mapped without write-protection (see
can_change_pte_writable()) to be writable. In the future, we might want
to enable write-bit recovery -- e.g., can_change_pte_writable() -- at more
locations, for example, also when removing uffd-wp protection.
This fixes two known cases:
(a) remove_migration_pte() mapping uffd-wp'ed PTEs writable, resulting
in uffd-wp not triggering on write access.
(b) do_numa_page() / do_huge_pmd_numa_page() mapping uffd-wp'ed PTEs/PMDs
writable, resulting in uffd-wp not triggering on write access.
Note that do_numa_page() / do_huge_pmd_numa_page() can be reached even
without NUMA hinting (which currently doesn't seem to be applicable to
shmem), for example, by using uffd-wp with a PROT_WRITE shmem VMA. On
such a VMA, userfaultfd-wp is currently non-functional.
Note that when enabling userfaultfd-wp, there is no need to walk page
tables to enforce the new default protection for the PTEs: we know that
they cannot be uffd-wp'ed yet, because that can only happen after enabling
uffd-wp for the VMA in general.
Also note that this makes mprotect() on ranges with uffd-wp'ed PTEs not
accidentally set the write bit -- which would result in uffd-wp not
triggering on later write access. This commit makes uffd-wp on shmem
behave just like uffd-wp on anonymous memory in that regard, even though,
mixing mprotect with uffd-wp is controversial.
[1] https://lkml.kernel.org/r/92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com
Link: https://lkml.kernel.org/r/20221209080912.7968-1-david@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Debugged-by: Peter Xu <peterx@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09 16:09:12 +08:00
|
|
|
userfaultfd_set_vm_flags(vma, new_flags);
|
2015-09-05 06:46:31 +08:00
|
|
|
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
|
|
|
|
|
|
|
|
skip:
|
|
|
|
prev = vma;
|
|
|
|
start = vma->vm_end;
|
2022-09-07 03:48:57 +08:00
|
|
|
vma = mas_next(&mas, end - 1);
|
|
|
|
} while (vma);
|
2015-09-05 06:46:31 +08:00
|
|
|
out_unlock:
|
2020-06-09 12:33:25 +08:00
|
|
|
mmap_write_unlock(mm);
|
2016-05-21 07:58:36 +08:00
|
|
|
mmput(mm);
|
2015-09-05 06:46:31 +08:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-09-05 06:46:41 +08:00
|
|
|
* userfaultfd_wake may be used in combination with the
|
|
|
|
* UFFDIO_*_MODE_DONTWAKE to wakeup userfaults in batches.
|
2015-09-05 06:46:31 +08:00
|
|
|
*/
|
|
|
|
static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct uffdio_range uffdio_wake;
|
|
|
|
struct userfaultfd_wake_range range;
|
|
|
|
const void __user *buf = (void __user *)arg;
|
|
|
|
|
|
|
|
ret = -EFAULT;
|
|
|
|
if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
|
|
|
|
goto out;
|
|
|
|
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
|
2015-09-05 06:46:31 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
range.start = uffdio_wake.start;
|
|
|
|
range.len = uffdio_wake.len;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* len == 0 means wake all and we don't want to wake all here,
|
|
|
|
* so check it again to be sure.
|
|
|
|
*/
|
|
|
|
VM_BUG_ON(!range.len);
|
|
|
|
|
|
|
|
wake_userfault(ctx, &range);
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2015-09-05 06:47:11 +08:00
|
|
|
static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
__s64 ret;
|
|
|
|
struct uffdio_copy uffdio_copy;
|
|
|
|
struct uffdio_copy __user *user_uffdio_copy;
|
|
|
|
struct userfaultfd_wake_range range;
|
|
|
|
|
|
|
|
user_uffdio_copy = (struct uffdio_copy __user *) arg;
|
|
|
|
|
userfaultfd: prevent non-cooperative events vs mcopy_atomic races
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 08:09:25 +08:00
|
|
|
ret = -EAGAIN;
|
2021-09-03 05:58:56 +08:00
|
|
|
if (atomic_read(&ctx->mmap_changing))
|
userfaultfd: prevent non-cooperative events vs mcopy_atomic races
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 08:09:25 +08:00
|
|
|
goto out;
|
|
|
|
|
2015-09-05 06:47:11 +08:00
|
|
|
ret = -EFAULT;
|
|
|
|
if (copy_from_user(&uffdio_copy, user_uffdio_copy,
|
|
|
|
/* don't copy "copy" last field */
|
|
|
|
sizeof(uffdio_copy)-sizeof(__s64)))
|
|
|
|
goto out;
|
|
|
|
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
|
2015-09-05 06:47:11 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
/*
|
|
|
|
* double check for wraparound just in case. copy_from_user()
|
|
|
|
* will later check uffdio_copy.src + uffdio_copy.len to fit
|
|
|
|
* in the userland range.
|
|
|
|
*/
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
|
|
|
|
goto out;
|
2020-04-07 11:05:41 +08:00
|
|
|
if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
|
2015-09-05 06:47:11 +08:00
|
|
|
goto out;
|
2016-05-21 07:58:36 +08:00
|
|
|
if (mmget_not_zero(ctx->mm)) {
|
|
|
|
ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
|
2020-04-07 11:05:41 +08:00
|
|
|
uffdio_copy.len, &ctx->mmap_changing,
|
|
|
|
uffdio_copy.mode);
|
2016-05-21 07:58:36 +08:00
|
|
|
mmput(ctx->mm);
|
2017-02-25 06:58:31 +08:00
|
|
|
} else {
|
2017-08-11 06:24:32 +08:00
|
|
|
return -ESRCH;
|
2016-05-21 07:58:36 +08:00
|
|
|
}
|
2015-09-05 06:47:11 +08:00
|
|
|
if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
|
|
|
|
return -EFAULT;
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
BUG_ON(!ret);
|
|
|
|
/* len == 0 would wake all */
|
|
|
|
range.len = ret;
|
|
|
|
if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
|
|
|
|
range.start = uffdio_copy.dst;
|
|
|
|
wake_userfault(ctx, &range);
|
|
|
|
}
|
|
|
|
ret = range.len == uffdio_copy.len ? 0 : -EAGAIN;
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
__s64 ret;
|
|
|
|
struct uffdio_zeropage uffdio_zeropage;
|
|
|
|
struct uffdio_zeropage __user *user_uffdio_zeropage;
|
|
|
|
struct userfaultfd_wake_range range;
|
|
|
|
|
|
|
|
user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
|
|
|
|
|
userfaultfd: prevent non-cooperative events vs mcopy_atomic races
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 08:09:25 +08:00
|
|
|
ret = -EAGAIN;
|
2021-09-03 05:58:56 +08:00
|
|
|
if (atomic_read(&ctx->mmap_changing))
|
userfaultfd: prevent non-cooperative events vs mcopy_atomic races
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 08:09:25 +08:00
|
|
|
goto out;
|
|
|
|
|
2015-09-05 06:47:11 +08:00
|
|
|
ret = -EFAULT;
|
|
|
|
if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
|
|
|
|
/* don't copy "zeropage" last field */
|
|
|
|
sizeof(uffdio_zeropage)-sizeof(__s64)))
|
|
|
|
goto out;
|
|
|
|
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
|
2015-09-05 06:47:11 +08:00
|
|
|
uffdio_zeropage.range.len);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
|
|
|
|
goto out;
|
|
|
|
|
2016-05-21 07:58:36 +08:00
|
|
|
if (mmget_not_zero(ctx->mm)) {
|
|
|
|
ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
|
userfaultfd: prevent non-cooperative events vs mcopy_atomic races
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 08:09:25 +08:00
|
|
|
uffdio_zeropage.range.len,
|
|
|
|
&ctx->mmap_changing);
|
2016-05-21 07:58:36 +08:00
|
|
|
mmput(ctx->mm);
|
2017-08-03 04:32:15 +08:00
|
|
|
} else {
|
2017-08-11 06:24:32 +08:00
|
|
|
return -ESRCH;
|
2016-05-21 07:58:36 +08:00
|
|
|
}
|
2015-09-05 06:47:11 +08:00
|
|
|
if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
|
|
|
|
return -EFAULT;
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
/* len == 0 would wake all */
|
|
|
|
BUG_ON(!ret);
|
|
|
|
range.len = ret;
|
|
|
|
if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
|
|
|
|
range.start = uffdio_zeropage.range.start;
|
|
|
|
wake_userfault(ctx, &range);
|
|
|
|
}
|
|
|
|
ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-04-07 11:06:12 +08:00
|
|
|
static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct uffdio_writeprotect uffdio_wp;
|
|
|
|
struct uffdio_writeprotect __user *user_uffdio_wp;
|
|
|
|
struct userfaultfd_wake_range range;
|
2020-04-07 11:06:20 +08:00
|
|
|
bool mode_wp, mode_dontwake;
|
2020-04-07 11:06:12 +08:00
|
|
|
|
2021-09-03 05:58:56 +08:00
|
|
|
if (atomic_read(&ctx->mmap_changing))
|
2020-04-07 11:06:12 +08:00
|
|
|
return -EAGAIN;
|
|
|
|
|
|
|
|
user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
|
|
|
|
|
|
|
|
if (copy_from_user(&uffdio_wp, user_uffdio_wp,
|
|
|
|
sizeof(struct uffdio_writeprotect)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
ret = validate_range(ctx->mm, uffdio_wp.range.start,
|
2020-04-07 11:06:12 +08:00
|
|
|
uffdio_wp.range.len);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
|
|
|
|
UFFDIO_WRITEPROTECT_MODE_WP))
|
|
|
|
return -EINVAL;
|
2020-04-07 11:06:20 +08:00
|
|
|
|
|
|
|
mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
|
|
|
|
mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
|
|
|
|
|
|
|
|
if (mode_wp && mode_dontwake)
|
2020-04-07 11:06:12 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2021-10-19 06:15:25 +08:00
|
|
|
if (mmget_not_zero(ctx->mm)) {
|
|
|
|
ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
|
|
|
|
uffdio_wp.range.len, mode_wp,
|
|
|
|
&ctx->mmap_changing);
|
|
|
|
mmput(ctx->mm);
|
|
|
|
} else {
|
|
|
|
return -ESRCH;
|
|
|
|
}
|
|
|
|
|
2020-04-07 11:06:12 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2020-04-07 11:06:20 +08:00
|
|
|
if (!mode_wp && !mode_dontwake) {
|
2020-04-07 11:06:12 +08:00
|
|
|
range.start = uffdio_wp.range.start;
|
|
|
|
range.len = uffdio_wp.range.len;
|
|
|
|
wake_userfault(ctx, &range);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It
might change the contents of the page using its second non-UFFD mapping,
or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:49 +08:00
|
|
|
static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
|
|
|
|
{
|
|
|
|
__s64 ret;
|
|
|
|
struct uffdio_continue uffdio_continue;
|
|
|
|
struct uffdio_continue __user *user_uffdio_continue;
|
|
|
|
struct userfaultfd_wake_range range;
|
|
|
|
|
|
|
|
user_uffdio_continue = (struct uffdio_continue __user *)arg;
|
|
|
|
|
|
|
|
ret = -EAGAIN;
|
2021-09-03 05:58:56 +08:00
|
|
|
if (atomic_read(&ctx->mmap_changing))
|
userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It
might change the contents of the page using its second non-UFFD mapping,
or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:49 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = -EFAULT;
|
|
|
|
if (copy_from_user(&uffdio_continue, user_uffdio_continue,
|
|
|
|
/* don't copy the output fields */
|
|
|
|
sizeof(uffdio_continue) - (sizeof(__s64))))
|
|
|
|
goto out;
|
|
|
|
|
userfaultfd: do not untag user pointers
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-24 06:50:01 +08:00
|
|
|
ret = validate_range(ctx->mm, uffdio_continue.range.start,
|
userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It
might change the contents of the page using its second non-UFFD mapping,
or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:49 +08:00
|
|
|
uffdio_continue.range.len);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = -EINVAL;
|
|
|
|
/* double check for wraparound just in case. */
|
|
|
|
if (uffdio_continue.range.start + uffdio_continue.range.len <=
|
|
|
|
uffdio_continue.range.start) {
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (mmget_not_zero(ctx->mm)) {
|
|
|
|
ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
|
|
|
|
uffdio_continue.range.len,
|
|
|
|
&ctx->mmap_changing);
|
|
|
|
mmput(ctx->mm);
|
|
|
|
} else {
|
|
|
|
return -ESRCH;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (unlikely(put_user(ret, &user_uffdio_continue->mapped)))
|
|
|
|
return -EFAULT;
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* len == 0 would wake all */
|
|
|
|
BUG_ON(!ret);
|
|
|
|
range.len = ret;
|
|
|
|
if (!(uffdio_continue.mode & UFFDIO_CONTINUE_MODE_DONTWAKE)) {
|
|
|
|
range.start = uffdio_continue.range.start;
|
|
|
|
wake_userfault(ctx, &range);
|
|
|
|
}
|
|
|
|
ret = range.len == uffdio_continue.range.len ? 0 : -EAGAIN;
|
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-02-23 07:42:21 +08:00
|
|
|
static inline unsigned int uffd_ctx_features(__u64 user_features)
|
|
|
|
{
|
|
|
|
/*
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
* For the current set of features the bits just coincide. Set
|
|
|
|
* UFFD_FEATURE_INITIALIZED to mark the features as enabled.
|
2017-02-23 07:42:21 +08:00
|
|
|
*/
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
return (unsigned int)user_features | UFFD_FEATURE_INITIALIZED;
|
2017-02-23 07:42:21 +08:00
|
|
|
}
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
/*
|
|
|
|
* userland asks for a certain API version and we return which bits
|
|
|
|
* and ioctl commands are implemented in this kernel for such API
|
|
|
|
* version or -EINVAL if unknown.
|
|
|
|
*/
|
|
|
|
static int userfaultfd_api(struct userfaultfd_ctx *ctx,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
struct uffdio_api uffdio_api;
|
|
|
|
void __user *buf = (void __user *)arg;
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
unsigned int ctx_features;
|
2015-09-05 06:46:31 +08:00
|
|
|
int ret;
|
2017-02-23 07:42:24 +08:00
|
|
|
__u64 features;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
ret = -EFAULT;
|
2015-09-05 06:46:37 +08:00
|
|
|
if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api)))
|
2015-09-05 06:46:31 +08:00
|
|
|
goto out;
|
userfaultfd: don't fail on unrecognized features
The basic interaction for setting up a userfaultfd is, userspace issues
a UFFDIO_API ioctl, and passes in a set of zero or more feature flags,
indicating the features they would prefer to use.
Of course, different kernels may support different sets of features
(depending on kernel version, kconfig options, architecture, etc).
Userspace's expectations may also not match: perhaps it was built
against newer kernel headers, which defined some features the kernel
it's running on doesn't support.
Currently, if userspace passes in a flag we don't recognize, the
initialization fails and we return -EINVAL. This isn't great, though.
Userspace doesn't have an obvious way to react to this; sure, one of the
features I asked for was unavailable, but which one? The only option it
has is to turn off things "at random" and hope something works.
Instead, modify UFFDIO_API to just ignore any unrecognized feature
flags. The interaction is now that the initialization will succeed, and
as always we return the *subset* of feature flags that can actually be
used back to userspace.
Now userspace has an obvious way to react: it checks if any flags it
asked for are missing. If so, it can conclude this kernel doesn't
support those, and it can either resign itself to not using them, or
fail with an error on its own, or whatever else.
Link: https://lkml.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-23 04:15:13 +08:00
|
|
|
/* Ignore unsupported features (userspace built against newer kernel) */
|
|
|
|
features = uffdio_api.features & UFFD_API_FEATURES;
|
2019-12-01 09:58:01 +08:00
|
|
|
ret = -EPERM;
|
|
|
|
if ((features & UFFD_FEATURE_EVENT_FORK) && !capable(CAP_SYS_PTRACE))
|
|
|
|
goto err_out;
|
2017-02-23 07:42:24 +08:00
|
|
|
/* report all available features and ioctls to userland */
|
|
|
|
uffdio_api.features = UFFD_API_FEATURES;
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
|
2021-07-01 09:49:27 +08:00
|
|
|
uffdio_api.features &=
|
|
|
|
~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
|
2021-07-01 09:49:06 +08:00
|
|
|
#endif
|
|
|
|
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
|
|
|
|
uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
|
2022-05-13 11:22:56 +08:00
|
|
|
#endif
|
|
|
|
#ifndef CONFIG_PTE_MARKER_UFFD_WP
|
|
|
|
uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
|
userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.
Overview
========
This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults. By "minor" fault, I mean the following
situation:
Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.
We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...). In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
Use Case
========
Consider the use case of VM live migration (e.g. under QEMU/KVM):
1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".
2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.
3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.
4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".
We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
Interaction with Existing APIs
==============================
Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the
existing API interacts with the new feature:
UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.
UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated. This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).
- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
Future Work
===========
This series only supports hugetlbfs. I have a second series in flight to
support shmem as well, extending the functionality. This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.
This patch (of 6):
This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:
Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault. As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.
This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered. In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.
This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].
However, doing it this was requires we allocate a VM_* flag for the new
registration mode. On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.
[1] https://lore.kernel.org/patchwork/patch/1380226/
[peterx@redhat.com: fix minor fault page leak]
Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:36 +08:00
|
|
|
#endif
|
2015-09-05 06:46:31 +08:00
|
|
|
uffdio_api.ioctls = UFFD_API_IOCTLS;
|
|
|
|
ret = -EFAULT;
|
|
|
|
if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
|
|
|
|
goto out;
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
|
2017-02-23 07:42:24 +08:00
|
|
|
/* only enable the requested features for this uffd context */
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
ctx_features = uffd_ctx_features(features);
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (cmpxchg(&ctx->features, 0, ctx_features) != 0)
|
|
|
|
goto err_out;
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
ret = 0;
|
|
|
|
out:
|
|
|
|
return ret;
|
2019-12-01 09:58:01 +08:00
|
|
|
err_out:
|
|
|
|
memset(&uffdio_api, 0, sizeof(uffdio_api));
|
|
|
|
if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out;
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static long userfaultfd_ioctl(struct file *file, unsigned cmd,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
int ret = -EINVAL;
|
|
|
|
struct userfaultfd_ctx *ctx = file->private_data;
|
|
|
|
|
userfaultfd: prevent concurrent API initialization
userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.
However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state. Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.
Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.
To avoid races, it is arguably best to get rid of ctx->state. Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.
Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 05:58:59 +08:00
|
|
|
if (cmd != UFFDIO_API && !userfaultfd_is_initialized(ctx))
|
2015-09-05 06:47:15 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2015-09-05 06:46:31 +08:00
|
|
|
switch(cmd) {
|
|
|
|
case UFFDIO_API:
|
|
|
|
ret = userfaultfd_api(ctx, arg);
|
|
|
|
break;
|
|
|
|
case UFFDIO_REGISTER:
|
|
|
|
ret = userfaultfd_register(ctx, arg);
|
|
|
|
break;
|
|
|
|
case UFFDIO_UNREGISTER:
|
|
|
|
ret = userfaultfd_unregister(ctx, arg);
|
|
|
|
break;
|
|
|
|
case UFFDIO_WAKE:
|
|
|
|
ret = userfaultfd_wake(ctx, arg);
|
|
|
|
break;
|
2015-09-05 06:47:11 +08:00
|
|
|
case UFFDIO_COPY:
|
|
|
|
ret = userfaultfd_copy(ctx, arg);
|
|
|
|
break;
|
|
|
|
case UFFDIO_ZEROPAGE:
|
|
|
|
ret = userfaultfd_zeropage(ctx, arg);
|
|
|
|
break;
|
2020-04-07 11:06:12 +08:00
|
|
|
case UFFDIO_WRITEPROTECT:
|
|
|
|
ret = userfaultfd_writeprotect(ctx, arg);
|
|
|
|
break;
|
userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It
might change the contents of the page using its second non-UFFD mapping,
or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 09:35:49 +08:00
|
|
|
case UFFDIO_CONTINUE:
|
|
|
|
ret = userfaultfd_continue(ctx, arg);
|
|
|
|
break;
|
2015-09-05 06:46:31 +08:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_PROC_FS
|
|
|
|
static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
|
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx = f->private_data;
|
2017-06-20 18:06:13 +08:00
|
|
|
wait_queue_entry_t *wq;
|
2015-09-05 06:46:31 +08:00
|
|
|
unsigned long pending = 0, total = 0;
|
|
|
|
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_lock_irq(&ctx->fault_pending_wqh.lock);
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
list_for_each_entry(wq, &ctx->fault_pending_wqh.head, entry) {
|
2015-09-05 06:46:44 +08:00
|
|
|
pending++;
|
|
|
|
total++;
|
|
|
|
}
|
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 18:06:46 +08:00
|
|
|
list_for_each_entry(wq, &ctx->fault_wqh.head, entry) {
|
2015-09-05 06:46:31 +08:00
|
|
|
total++;
|
|
|
|
}
|
2019-07-05 06:14:39 +08:00
|
|
|
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If more protocols will be added, there will be all shown
|
|
|
|
* separated by a space. Like this:
|
|
|
|
* protocols: aa:... bb:...
|
|
|
|
*/
|
|
|
|
seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
|
2017-04-08 07:04:42 +08:00
|
|
|
pending, total, UFFD_API, ctx->features,
|
2015-09-05 06:46:31 +08:00
|
|
|
UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static const struct file_operations userfaultfd_fops = {
|
|
|
|
#ifdef CONFIG_PROC_FS
|
|
|
|
.show_fdinfo = userfaultfd_show_fdinfo,
|
|
|
|
#endif
|
|
|
|
.release = userfaultfd_release,
|
|
|
|
.poll = userfaultfd_poll,
|
|
|
|
.read = userfaultfd_read,
|
|
|
|
.unlocked_ioctl = userfaultfd_ioctl,
|
2018-09-12 03:59:08 +08:00
|
|
|
.compat_ioctl = compat_ptr_ioctl,
|
2015-09-05 06:46:31 +08:00
|
|
|
.llseek = noop_llseek,
|
|
|
|
};
|
|
|
|
|
2015-09-05 06:46:48 +08:00
|
|
|
static void init_once_userfaultfd_ctx(void *mem)
|
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx = (struct userfaultfd_ctx *) mem;
|
|
|
|
|
|
|
|
init_waitqueue_head(&ctx->fault_pending_wqh);
|
|
|
|
init_waitqueue_head(&ctx->fault_wqh);
|
2017-02-23 07:42:21 +08:00
|
|
|
init_waitqueue_head(&ctx->event_wqh);
|
2015-09-05 06:46:48 +08:00
|
|
|
init_waitqueue_head(&ctx->fd_wqh);
|
2020-07-20 23:55:28 +08:00
|
|
|
seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
|
2015-09-05 06:46:48 +08:00
|
|
|
}
|
|
|
|
|
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
[axelrasmussen@google.com: Handle misc_register() failure properly]
Link: https://lkml.kernel.org/r/20220819205201.658693-3-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220808175614.3885028-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-09 01:56:11 +08:00
|
|
|
static int new_userfaultfd(int flags)
|
2015-09-05 06:46:31 +08:00
|
|
|
{
|
|
|
|
struct userfaultfd_ctx *ctx;
|
2018-02-01 08:19:48 +08:00
|
|
|
int fd;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
|
|
|
BUG_ON(!current->mm);
|
|
|
|
|
|
|
|
/* Check the UFFD_* constants for consistency. */
|
userfaultfd: add UFFD_USER_MODE_ONLY
Patch series "Control over userfaultfd kernel-fault handling", v6.
This patch series is split from [1]. The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.
It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel. For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3]. Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.
This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object. It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.
The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.
[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
This patch (of 2):
userfaultfd handles page faults from both user and kernel code. Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.
A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.
Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <calin@google.com>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 11:13:49 +08:00
|
|
|
BUILD_BUG_ON(UFFD_USER_MODE_ONLY & UFFD_SHARED_FCNTL_FLAGS);
|
2015-09-05 06:46:31 +08:00
|
|
|
BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
|
|
|
|
BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
|
|
|
|
|
userfaultfd: add UFFD_USER_MODE_ONLY
Patch series "Control over userfaultfd kernel-fault handling", v6.
This patch series is split from [1]. The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.
It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel. For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3]. Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.
This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object. It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.
The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.
[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
This patch (of 2):
userfaultfd handles page faults from both user and kernel code. Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.
A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.
Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <calin@google.com>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 11:13:49 +08:00
|
|
|
if (flags & ~(UFFD_SHARED_FCNTL_FLAGS | UFFD_USER_MODE_ONLY))
|
2018-02-01 08:19:48 +08:00
|
|
|
return -EINVAL;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2015-09-05 06:46:48 +08:00
|
|
|
ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
|
2015-09-05 06:46:31 +08:00
|
|
|
if (!ctx)
|
2018-02-01 08:19:48 +08:00
|
|
|
return -ENOMEM;
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2018-12-28 16:34:43 +08:00
|
|
|
refcount_set(&ctx->refcount, 1);
|
2015-09-05 06:46:31 +08:00
|
|
|
ctx->flags = flags;
|
2017-02-23 07:42:21 +08:00
|
|
|
ctx->features = 0;
|
2015-09-05 06:46:31 +08:00
|
|
|
ctx->released = false;
|
2021-09-03 05:58:56 +08:00
|
|
|
atomic_set(&ctx->mmap_changing, 0);
|
2015-09-05 06:46:31 +08:00
|
|
|
ctx->mm = current->mm;
|
|
|
|
/* prevent the mm struct to be freed */
|
2017-02-28 06:30:07 +08:00
|
|
|
mmgrab(ctx->mm);
|
2015-09-05 06:46:31 +08:00
|
|
|
|
2021-01-09 06:22:23 +08:00
|
|
|
fd = anon_inode_getfd_secure("[userfaultfd]", &userfaultfd_fops, ctx,
|
2022-07-08 17:34:51 +08:00
|
|
|
O_RDONLY | (flags & UFFD_SHARED_FCNTL_FLAGS), NULL);
|
2018-02-01 08:19:48 +08:00
|
|
|
if (fd < 0) {
|
2016-05-21 07:58:36 +08:00
|
|
|
mmdrop(ctx->mm);
|
2015-09-05 06:46:48 +08:00
|
|
|
kmem_cache_free(userfaultfd_ctx_cachep, ctx);
|
2015-09-18 07:01:54 +08:00
|
|
|
}
|
2015-09-05 06:46:31 +08:00
|
|
|
return fd;
|
|
|
|
}
|
2015-09-05 06:46:48 +08:00
|
|
|
|
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
[axelrasmussen@google.com: Handle misc_register() failure properly]
Link: https://lkml.kernel.org/r/20220819205201.658693-3-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220808175614.3885028-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-09 01:56:11 +08:00
|
|
|
static inline bool userfaultfd_syscall_allowed(int flags)
|
|
|
|
{
|
|
|
|
/* Userspace-only page faults are always allowed */
|
|
|
|
if (flags & UFFD_USER_MODE_ONLY)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The user is requesting a userfaultfd which can handle kernel faults.
|
|
|
|
* Privileged users are always allowed to do this.
|
|
|
|
*/
|
|
|
|
if (capable(CAP_SYS_PTRACE))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/* Otherwise, access to kernel fault handling is sysctl controlled. */
|
|
|
|
return sysctl_unprivileged_userfaultfd;
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCALL_DEFINE1(userfaultfd, int, flags)
|
|
|
|
{
|
|
|
|
if (!userfaultfd_syscall_allowed(flags))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
return new_userfaultfd(flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static long userfaultfd_dev_ioctl(struct file *file, unsigned int cmd, unsigned long flags)
|
|
|
|
{
|
|
|
|
if (cmd != USERFAULTFD_IOC_NEW)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return new_userfaultfd(flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct file_operations userfaultfd_dev_fops = {
|
|
|
|
.unlocked_ioctl = userfaultfd_dev_ioctl,
|
|
|
|
.compat_ioctl = userfaultfd_dev_ioctl,
|
|
|
|
.owner = THIS_MODULE,
|
|
|
|
.llseek = noop_llseek,
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct miscdevice userfaultfd_misc = {
|
|
|
|
.minor = MISC_DYNAMIC_MINOR,
|
|
|
|
.name = "userfaultfd",
|
|
|
|
.fops = &userfaultfd_dev_fops
|
|
|
|
};
|
|
|
|
|
2015-09-05 06:46:48 +08:00
|
|
|
static int __init userfaultfd_init(void)
|
|
|
|
{
|
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
[axelrasmussen@google.com: Handle misc_register() failure properly]
Link: https://lkml.kernel.org/r/20220819205201.658693-3-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220808175614.3885028-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-09 01:56:11 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = misc_register(&userfaultfd_misc);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2015-09-05 06:46:48 +08:00
|
|
|
userfaultfd_ctx_cachep = kmem_cache_create("userfaultfd_ctx_cache",
|
|
|
|
sizeof(struct userfaultfd_ctx),
|
|
|
|
0,
|
|
|
|
SLAB_HWCACHE_ALIGN|SLAB_PANIC,
|
|
|
|
init_once_userfaultfd_ctx);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
__initcall(userfaultfd_init);
|