OpenCloudOS-Kernel/fs
Andrea Arcangeli 15a77c6fe4 userfaultfd: fix SIGBUS resulting from false rwsem wakeups
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() even though the UFFDIO_COPY had not completed.

This seems to be caused by rwsem waking the thread blocked in
handle_userfault(), and we can't run up_read() before the wait_event
sequence is complete.

Keeping the wait_event sequence identical to the first one would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.

It seems simpler to wait for the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.
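
The idea of waiting for a targeted wakeup can be illustrated with a
minimal userspace sketch. This is an analogy built on POSIX threads, not
the kernel patch itself: struct fault_waiter, wait_for_targeted_wakeup(),
wake() and run_demo() are hypothetical names, and a condition variable
stands in for the kernel wait queue. The waiter loops until its own
"waken" flag is set or the descriptor is released, so a wakeup that
changes no state simply sends it back to sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for one blocked fault (not kernel code). */
struct fault_waiter {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool waken;     /* set only by the targeted wakeup */
	bool released;  /* set when the descriptor is closed */
	bool resolved;  /* result reported by the waiting thread */
};

/* Block until the targeted wakeup or a release; ignore anything else. */
static bool wait_for_targeted_wakeup(struct fault_waiter *w)
{
	bool resolved;

	pthread_mutex_lock(&w->lock);
	while (!w->waken && !w->released)
		pthread_cond_wait(&w->cond, &w->lock); /* may return falsely */
	resolved = w->waken;
	pthread_mutex_unlock(&w->lock);
	return resolved;
}

/* Wake the waiter, optionally setting one of its flags first. */
static void wake(struct fault_waiter *w, bool set_waken, bool set_released)
{
	pthread_mutex_lock(&w->lock);
	w->waken = w->waken || set_waken;
	w->released = w->released || set_released;
	pthread_cond_broadcast(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

static void *waiter_thread(void *arg)
{
	struct fault_waiter *w = arg;

	w->resolved = wait_for_targeted_wakeup(w);
	return NULL;
}

/*
 * Send `false_wakeups` wakeups that change no state, then either the
 * targeted wakeup or a release. Returns 1 if the fault was resolved,
 * 0 if the wait ended because of the release.
 */
int run_demo(int false_wakeups, bool release_instead)
{
	struct fault_waiter w = { .waken = false, .released = false };
	pthread_t tid;
	int i, err;

	pthread_mutex_init(&w.lock, NULL);
	pthread_cond_init(&w.cond, NULL);
	err = pthread_create(&tid, NULL, waiter_thread, &w);
	assert(err == 0);

	for (i = 0; i < false_wakeups; i++)
		wake(&w, false, false);
	wake(&w, !release_instead, release_instead);

	pthread_join(tid, NULL);
	pthread_mutex_destroy(&w.lock);
	pthread_cond_destroy(&w.cond);
	return w.resolved ? 1 : 0;
}
```

In this sketch run_demo(3, false) returns 1 however the false wakeups
are timed, because only the flag written by the targeted wakeup ends the
loop. The sketch leaves out signal handling, which the kernel path also
has to honor.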

Debug code collecting the stack trace of the wakeup showed this:

  $ ./userfaultfd 100 99999
  nr_pages: 25600, nr_pages_per_cpu: 800
  bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
  bounces: 99997, mode: rnd ver poll, Bus error (core dumped)

    save_stack_trace+0x2b/0x50
    try_to_wake_up+0x2a6/0x580
    wake_up_q+0x32/0x70
    rwsem_wake+0xe0/0x120
    call_rwsem_wake+0x1b/0x30
    up_write+0x3b/0x40
    vm_mmap_pgoff+0x9c/0xc0
    SyS_mmap_pgoff+0x1a9/0x240
    SyS_mmap+0x22/0x30
    entry_SYSCALL_64_fastpath+0x1f/0xbd
    0xffffffffffffffff
    FAULT_FLAG_ALLOW_RETRY missing 70
  CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
    dump_stack+0xb8/0x112
    handle_userfault+0x572/0x650
    handle_mm_fault+0x12cb/0x1520
    __do_page_fault+0x175/0x500
    trace_do_page_fault+0x61/0x270
    do_async_page_fault+0x19/0x90
    async_page_fault+0x25/0x30

This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
earlier than expected (which results in a graceful SIGBUS at the next
attempt).

This was reproduced only with >=32 CPUs because the loop that starts the
threads with clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().

This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates
compared to real apps.

We'll have to allow "one more" VM_FAULT_RETRY for the WP support, and a
patch floating around that provides it also hid this problem, but in
reality it only succeeds at hiding the problem.

False wakeups could still happen again the second time
handle_userfault() is invoked, even if the race condition is so rare
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness; the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely.  With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
to return it infinitely many times to avoid the SIGBUS).

Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24 16:26:14 -08:00
9p Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
adfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
affs vfs: remove ".readlink = generic_readlink" assignments 2016-12-09 16:45:04 +01:00
afs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
autofs4 Merge uncontroversial parts of branch 'readlink' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2016-12-17 19:16:12 -08:00
befs befs: add NFS export support 2016-12-22 11:25:24 +00:00
bfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
btrfs Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2017-01-13 17:40:22 -08:00
cachefiles Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
ceph ceph: fix bad endianness handling in parse_reply_info_extra 2017-01-18 17:58:45 +01:00
cifs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
coda vfs: remove ".readlink = generic_readlink" assignments 2016-12-09 16:45:04 +01:00
configfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
cramfs
crypto fscrypt: fix renaming and linking special files 2016-12-31 00:47:05 -05:00
debugfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
devpts Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
dlm ktime: Get rid of ktime_equal() 2016-12-25 17:21:23 +01:00
ecryptfs vfs: remove ".readlink = generic_readlink" assignments 2016-12-09 16:45:04 +01:00
efivarfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
efs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
exofs exofs: don't mess with simple_write_{begin,end} 2016-12-10 14:25:19 -05:00
exportfs exportfs: be careful to only return expected errors. 2016-10-06 09:07:44 -04:00
ext2 dax: fix build warnings with FS_DAX and !FS_IOMAP 2017-01-24 16:26:14 -08:00
ext4 dax: fix build warnings with FS_DAX and !FS_IOMAP 2017-01-24 16:26:14 -08:00
f2fs block: Rename blk_queue_zone_size and bdev_zone_size 2017-01-12 07:58:32 -07:00
fat Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
freevxfs
fscache
fuse fuse: fix time_to_jiffies nsec sanity check 2017-01-13 17:20:47 +01:00
gfs2 ktime: Cleanup ktime_set() usage 2016-12-25 17:21:22 +01:00
hfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
hfsplus Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
hostfs vfs: remove ".readlink = generic_readlink" assignments 2016-12-09 16:45:04 +01:00
hpfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
hugetlbfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
isofs Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block 2016-12-13 10:19:16 -08:00
jbd2 Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
jffs2 vfs: remove ".readlink = generic_readlink" assignments 2016-12-09 16:45:04 +01:00
jfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
kernfs Merge uncontroversial parts of branch 'readlink' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2016-12-17 19:16:12 -08:00
lockd netns: make struct pernet_operations::id unsigned int 2016-11-18 10:59:15 -05:00
minix vfs: remove ".readlink = generic_readlink" assignments 2016-12-09 16:45:04 +01:00
ncpfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
nfs NFSv4: Fix client recovery when server reboots multiple times 2017-01-13 13:31:32 -05:00
nfs_common netns: make struct pernet_operations::id unsigned int 2016-11-18 10:59:15 -05:00
nfsd nfsd: fix supported attributes for acl & labels 2017-01-12 15:55:51 -05:00
nilfs2 Merge uncontroversial parts of branch 'readlink' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2016-12-17 19:16:12 -08:00
nls
notify Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit 2017-01-05 23:06:06 -08:00
ntfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
ocfs2 ocfs2: fix crash caused by stale lvb with fsdlm plugin 2017-01-10 18:31:54 -08:00
omfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
openpromfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
orangefs Merge uncontroversial parts of branch 'readlink' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2016-12-17 19:16:12 -08:00
overlayfs ovl: fix possible use after free on redirect dir lookup 2017-01-18 15:19:54 +01:00
proc sysctl: Drop reference added by grab_header in proc_sys_readdir 2017-01-10 13:34:57 +13:00
pstore Improvements and fixes to pstore subsystem: 2016-12-13 09:16:11 -08:00
qnx4
qnx6
quota Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2016-12-19 08:23:53 -08:00
ramfs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
reiserfs Merge uncontroversial parts of branch 'readlink' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2016-12-17 19:16:12 -08:00
romfs
squashfs Merge uncontroversial parts of branch 'readlink' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs 2016-12-17 19:16:12 -08:00
sysfs Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2016-10-14 12:18:50 -07:00
sysv vfs: remove ".readlink = generic_readlink" assignments 2016-12-09 16:45:04 +01:00
tracefs fs: Replace CURRENT_TIME with current_time() for inode timestamps 2016-09-27 21:06:21 -04:00
ubifs ubifs: Fix journal replay wrt. xattr nodes 2017-01-17 14:35:58 +01:00
udf block,fs: untangle fs.h and blk_types.h 2016-11-01 09:43:26 -06:00
ufs Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
xfs xfs: fix xfs_mode_to_ftype() prototype 2017-01-18 12:39:21 -08:00
Kconfig dax: fix build warnings with FS_DAX and !FS_IOMAP 2017-01-24 16:26:14 -08:00
Kconfig.binfmt docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
Makefile logfs: remove from tree 2016-12-14 23:48:11 -05:00
aio.c aio: fix lock dep warning 2017-01-14 19:31:40 -05:00
anon_inodes.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
attr.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
bad_inode.c bad_inode: add missing i_op initializers 2016-12-09 11:57:43 +01:00
binfmt_aout.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
binfmt_elf.c coredump: Ensure proper size of sparse core files 2017-01-14 19:32:40 -05:00
binfmt_elf_fdpic.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
binfmt_em86.c fs/binfmt_em86.c: fix incompatible pointer type 2016-08-02 19:35:15 -04:00
binfmt_flat.c binfmt_flat: allow compressed flat binary format to work on MMU systems 2016-07-28 13:29:12 +10:00
binfmt_misc.c fs: Replace current_fs_time() with current_time() 2016-09-27 21:06:22 -04:00
binfmt_script.c
block_dev.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2017-01-04 09:03:37 -08:00
buffer.c clean_bdev_aliases: Prevent cleaning blocks that are not in block range 2017-01-02 09:35:14 -07:00
char_dev.c dax: define a unified inode/address_space for device-dax mappings 2016-08-23 22:58:51 -07:00
compat.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
compat_binfmt_elf.c
compat_ioctl.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
coredump.c coredump: Ensure proper size of sparse core files 2017-01-14 19:32:40 -05:00
dax.c dax: fix build warnings with FS_DAX and !FS_IOMAP 2017-01-24 16:26:14 -08:00
dcache.c mnt: Protect the mountpoint hashtable with mount_lock 2017-01-10 13:34:43 +13:00
dcookies.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
direct-io.c do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned 2017-01-10 13:29:54 -07:00
drop_caches.c
eventfd.c
eventpoll.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
exec.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
fcntl.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
fhandle.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
file.c fs/file: more unsigned file descriptors 2016-09-27 18:47:38 -04:00
file_table.c constify alloc_file() 2016-12-05 19:01:16 -05:00
filesystems.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
fs-writeback.c fs/fs-writeback.c: remove redundant if check 2016-12-12 18:55:08 -08:00
fs_pin.c
fs_struct.c
inode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 20:16:43 -07:00
internal.h Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-12-17 18:44:00 -08:00
ioctl.c vfs: call vfs_clone_file_range() under freeze protection 2016-12-16 11:02:54 +01:00
iomap.c xfs: updates for 4.10-rc1 2016-12-14 21:35:31 -08:00
libfs.c libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount 2017-01-10 13:34:55 +13:00
locks.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
mbcache.c mbcache: document that "find" functions only return reusable entries 2016-12-03 15:55:01 -05:00
mount.h vfs: add path_is_mountpoint() helper 2016-12-03 20:51:35 -05:00
mpage.c fs: Add helper to clean bdev aliases under a bh and use it 2016-11-04 14:34:47 -06:00
namei.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
namespace.c mnt: Protect the mountpoint hashtable with mount_lock 2017-01-10 13:34:43 +13:00
no-block.c
nsfs.c net: add an ioctl to get a socket network namespace 2016-10-31 10:56:36 -04:00
open.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
pipe.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
pnode.c reorganize do_make_slave() 2016-12-16 16:30:49 -05:00
pnode.h mnt: Add a per mount namespace limit on the number of mounts 2016-09-30 12:46:48 -05:00
posix_acl.c tmpfs: clear S_ISGID when setting posix ACLs 2017-01-10 01:29:48 -05:00
proc_namespace.c
read_write.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
readdir.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
select.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
seq_file.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
signalfd.c
splice.c splice: reinstate SIGPIPE/EPIPE handling 2016-12-21 10:59:34 -08:00
stack.c
stat.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
statfs.c vfs: misc struct path constification 2016-12-05 19:03:49 -05:00
super.c quota: Remove dqonoff_mutex 2016-11-30 08:38:07 +01:00
sync.c
timerfd.c ktime: Cleanup ktime_set() usage 2016-12-25 17:21:22 +01:00
userfaultfd.c userfaultfd: fix SIGBUS resulting from false rwsem wakeups 2017-01-24 16:26:14 -08:00
utimes.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
xattr.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00