Expand the userfaultfd_register/unregister routines to allow VM_HUGETLB
vmas. huge page alignment checking is performed after a VM_HUGETLB vma
is encountered.
Also, since there is no UFFDIO_ZEROPAGE support for huge pages do not
return that as a valid ioctl method for huge page ranges.
Link: http://lkml.kernel.org/r/20161216144821.5183-22-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userfaults may still happen after the userfaultfd monitor thread
received a UFFD_EVENT_MADVDONTNEED until UFFDIO_UNREGISTER is run.
Wake any pending userfault within UFFDIO_UNREGISTER protected by the
mmap_sem for writing, so they will not be reported to userland leading
to UFFDIO_COPY returning -EINVAL (as the range was already unregistered)
and they will not hang permanently either.
Link: http://lkml.kernel.org/r/20161216144821.5183-16-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the page is punched out of the address space the uffd reader should
know this and zeromap the respective area in case of the #PF event.
Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Optimize the mremap_userfaultfd_complete() interface to pass only the
vm_userfaultfd_ctx pointer through the stack as a microoptimization.
Link: http://lkml.kernel.org/r/20161216144821.5183-13-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The event denotes that an area [start:end] moves to different location.
Length change isn't reported as "new" addresses, if they appear on the
uffd reader side they will not contain any data and the latter can just
zeromap them.
Waiting for the event ACK is also done outside of mmap sem, as for fork
event.
Link: http://lkml.kernel.org/r/20161216144821.5183-12-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit d2005e3f41 ("userfaultfd: don't pin the user memory in
userfaultfd_file_create()") userfaultfd uses mm_count rather than
mm_users to pin mm_struct.
Make dup_userfaultfd consistent with this behaviour
Link: http://lkml.kernel.org/r/20161216144821.5183-11-aarcange@redhat.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the mm with uffd-ed vmas fork()-s the respective vmas notify their
uffds with the event which contains a descriptor with new uffd. This
new descriptor can then be used to get events from the child and
populate its mm with data. Note, that there can be different uffd-s
controlling different vmas within one mm, so first we should collect all
those uffds (and ctx-s) in a list and then notify them all one by one
but only once per fork().
The context is created at fork() time but the descriptor, file struct
and anon inode object is created at event read time. So some trickery
is added to the userfaultfd_ctx_read() to handle the ctx queues' locking
vs file creation.
Another thing worth noticing is that the task that fork()-s waits for
the uffd event to get processed WITHOUT the mmap sem.
[aarcange@redhat.com: build warning fix]
Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com
Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will allow userland to probe all features available in the kernel.
It will however only enable the requested features in the open userfaultfd
context.
Link: http://lkml.kernel.org/r/20161216144821.5183-8-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The custom events are queued in ctx->event_wqh not to disturb the
fast-path-ed PF queue-wait-wakeup functions.
The events to be generated (other than PF-s) are requested in UFFD_API
ioctl with the uffd_api.features bits. Those, known by the kernel, are
then turned on and reported back to the user-space.
Link: http://lkml.kernel.org/r/20161216144821.5183-7-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I will need one to lookup for userfaultfd_wait_queue-s in different
wait queue
Link: http://lkml.kernel.org/r/20161216144821.5183-6-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup the vma->vm_ops usage.
Side note: it would be more robust if vma_is_anonymous() would also
check that vm_flags hasn't VM_PFNMAP set.
Link: http://lkml.kernel.org/r/20161216144821.5183-5-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid BUG_ON()s and only WARN instead. This is just a cleanup, it can't
make any runtime difference. This BUG_ON has never triggered and cannot
trigger.
Link: http://lkml.kernel.org/r/20161216144821.5183-4-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minor comment correction.
Link: http://lkml.kernel.org/r/20161216144821.5183-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
posix_acl_update_mode() could possibly clear 'acl', if so we leak the
memory pointed by 'acl'. Save this pointer before calling
posix_acl_update_mode() and release the memory if 'acl' really gets
cleared.
Link: http://lkml.kernel.org/r/1486678332-2430-1-git-send-email-xiyou.wangcong@gmail.com
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: Mark Salyzyn <salyzyn@android.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
results in a deadlock, as the author "Tariq Saeed" realized shortly
after the patch was merged. The discussion happened here
https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html
The reason why taking cluster inode lock at vfs entry points opens up a
self deadlock window, is explained in the previous patch of this series.
So far, we have seen two different code paths that have this issue.
1. do_sys_open
may_open
inode_permission
ocfs2_permission
ocfs2_inode_lock() <=== take PR
generic_permission
get_acl
ocfs2_iop_get_acl
ocfs2_inode_lock() <=== take PR
2. fchmod|fchmodat
chmod_common
notify_change
ocfs2_setattr <=== take EX
posix_acl_chmod
get_acl
ocfs2_iop_get_acl <=== take PR
ocfs2_iop_set_acl <=== take EX
Fixes them by adding the tracking logic (in the previous patch) for these
funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
ocfs2_setattr().
Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are in the situation that we have to avoid recursive cluster locking,
but there is no way to check if a cluster lock has been taken by a precess
already.
Mostly, we can avoid recursive locking by writing code carefully.
However, we found that it's very hard to handle the routines that are
invoked directly by vfs code. For instance:
const struct inode_operations ocfs2_file_iops = {
.permission = ocfs2_permission,
.get_acl = ocfs2_iop_get_acl,
.set_acl = ocfs2_iop_set_acl,
};
Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
do_sys_open
may_open
inode_permission
ocfs2_permission
ocfs2_inode_lock() <=== first time
generic_permission
get_acl
ocfs2_iop_get_acl
ocfs2_inode_lock() <=== recursive one
A deadlock will occur if a remote EX request comes in between two of
ocfs2_inode_lock(). Briefly describe how the deadlock is formed:
On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of
the remote EX lock request. Another hand, the recursive cluster lock
(the second one) will be blocked in in __ocfs2_cluster_lock() because of
OCFS2_LOCK_BLOCKED. But, the downconvert never complete, why? because
there is no chance for the first cluster lock on this node to be
unlocked - we block ourselves in the code path.
The idea to fix this issue is mostly taken from gfs2 code.
1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track
of the processes' pid who has taken the cluster lock of this lock
resource;
2. introduce a new flag for ocfs2_inode_lock_full:
OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for
us if we've got cluster lock.
3. export a helper: ocfs2_is_locked_by_me() is used to check if we have
got the cluster lock in the upper code path.
The tracking logic should be used by some of the ocfs2 vfs's callbacks,
to solve the recursive locking issue cuased by the fact that vfs
routines can call into each other.
The performance penalty of processing the holder list should only be
seen at a few cases where the tracking logic is used, such as get/set
acl.
You may ask what if the first time we got a PR lock, and the second time
we want a EX lock? fortunately, this case never happens in the real
world, as far as I can see, including permission check,
(get|set)_(acl|attr), and the gfs2 code also do so.
[sfr@canb.auug.org.au remove some inlines]
Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct. Remove the
additional parameter and simplify pmd_fault() and friends.
Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.
[dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems. Create a tracepoint header for FS DAX and add the first DAX
tracepoints to the PMD fault handler. This allows the tracing for DAX to
be done in the same way as the filesystem tracing so that developers can
look at them together and get a coherent idea of what the system is doing.
I added both an entry and exit tracepoint because future patches will add
tracepoints to child functions of dax_iomap_pmd_fault() like
dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages
to be wrapped by the parent function tracepoints so the code flow is more
easily understood. Having entry and exit tracepoints for faults also
allows us to easily see what filesystems functions were called during the
fault. These filesystem functions get executed via iomap_begin() and
iomap_end() calls, for example, and will have their own tracepoints.
For PMD faults we primarily want to understand the type of mapping, the
fault flags, the faulting address and whether it fell back to 4k faults.
If it fell back to 4k faults the tracepoints should let us understand why.
I named the new tracepoint header file "fs_dax.h" to allow for device DAX
to have its own separate tracing header in the same directory at some
point.
Here is an example output for these events from a successful PMD fault:
big-1441 [005] .... 32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003
big-1441 [005] .... 32.582776: dax_pmd_fault: dev 259:0 ino 0x1003
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400
big-1441 [005] .... 32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE
Link: http://lkml.kernel.org/r/1484085142-2297-3-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the "small" driver core patches for 4.11-rc1.
Not much here, some firmware documentation and self-test updates, a
debugfs code formatting issue, and a new feature for call_usermodehelper
to make it more robust on systems that want to lock it down in a more
secure way.
All of these have been linux-next for a while now with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWK2jKg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymCEACgozYuqZZ/TUGW0P3xVNi7fbfUWCEAn3nYExrc
XgevqeYOSKp2We6X/2JX
=aZ+5
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "small" driver core patches for 4.11-rc1.
Not much here, some firmware documentation and self-test updates, a
debugfs code formatting issue, and a new feature for call_usermodehelper
to make it more robust on systems that want to lock it down in a more
secure way.
All of these have been linux-next for a while now with no reported
issues"
* tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: handle null pointers while printing node name and path
Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()
Make static usermode helper binaries constant
kmod: make usermodehelper path a const string
firmware: revamp firmware documentation
selftests: firmware: send expected errors to /dev/null
selftests: firmware: only modprobe if driver is missing
platform: Print the resource range if device failed to claim
kref: prefer atomic_inc_not_zero to atomic_add_unless
debugfs: improve formatting of debugfs_real_fops()
Pull networking updates from David Miller:
"Highlights:
1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini
Varadhan.
2) Simplify classifier state on sk_buff in order to shrink it a bit.
From Willem de Bruijn.
3) Introduce SIPHASH and it's usage for secure sequence numbers and
syncookies. From Jason A. Donenfeld.
4) Reduce CPU usage for ICMP replies we are going to limit or
suppress, from Jesper Dangaard Brouer.
5) Introduce Shared Memory Communications socket layer, from Ursula
Braun.
6) Add RACK loss detection and allow it to actually trigger fast
recovery instead of just assisting after other algorithms have
triggered it. From Yuchung Cheng.
7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot.
8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert.
9) Export MPLS packet stats via netlink, from Robert Shearman.
10) Significantly improve inet port bind conflict handling, especially
when an application is restarted and changes it's setting of
reuseport. From Josef Bacik.
11) Implement TX batching in vhost_net, from Jason Wang.
12) Extend the dummy device so that VF (virtual function) features,
such as configuration, can be more easily tested. From Phil
Sutter.
13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric
Dumazet.
14) Add new bpf MAP, implementing a longest prefix match trie. From
Daniel Mack.
15) Packet sample offloading support in mlxsw driver, from Yotam Gigi.
16) Add new aquantia driver, from David VomLehn.
17) Add bpf tracepoints, from Daniel Borkmann.
18) Add support for port mirroring to b53 and bcm_sf2 drivers, from
Florian Fainelli.
19) Remove custom busy polling in many drivers, it is done in the core
networking since 4.5 times. From Eric Dumazet.
20) Support XDP adjust_head in virtio_net, from John Fastabend.
21) Fix several major holes in neighbour entry confirmation, from
Julian Anastasov.
22) Add XDP support to bnxt_en driver, from Michael Chan.
23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan.
24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi.
25) Support GRO in IPSEC protocols, from Steffen Klassert"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits)
Revert "ath10k: Search SMBIOS for OEM board file extension"
net: socket: fix recvmmsg not returning error from sock_error
bnxt_en: use eth_hw_addr_random()
bpf: fix unlocking of jited image when module ronx not set
arch: add ARCH_HAS_SET_MEMORY config
net: napi_watchdog() can use napi_schedule_irqoff()
tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
net/hsr: use eth_hw_addr_random()
net: mvpp2: enable building on 64-bit platforms
net: mvpp2: switch to build_skb() in the RX path
net: mvpp2: simplify MVPP2_PRS_RI_* definitions
net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT
net: mvpp2: remove unused register definitions
net: mvpp2: simplify mvpp2_bm_bufs_add()
net: mvpp2: drop useless fields in mvpp2_bm_pool and related code
net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'
net: mvpp2: release reference to txq_cpu[] entry after unmapping
net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()
net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()
net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set
...
Pull security layer updates from James Morris:
"Highlights:
- major AppArmor update: policy namespaces & lots of fixes
- add /sys/kernel/security/lsm node for easy detection of loaded LSMs
- SELinux cgroupfs labeling support
- SELinux context mounts on tmpfs, ramfs, devpts within user
namespaces
- improved TPM 2.0 support"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (117 commits)
tpm: declare tpm2_get_pcr_allocation() as static
tpm: Fix expected number of response bytes of TPM1.2 PCR Extend
tpm xen: drop unneeded chip variable
tpm: fix misspelled "facilitate" in module parameter description
tpm_tis: fix the error handling of init_tis()
KEYS: Use memzero_explicit() for secret data
KEYS: Fix an error code in request_master_key()
sign-file: fix build error in sign-file.c with libressl
selinux: allow changing labels for cgroupfs
selinux: fix off-by-one in setprocattr
tpm: silence an array overflow warning
tpm: fix the type of owned field in cap_t
tpm: add securityfs support for TPM 2.0 firmware event log
tpm: enhance read_log_of() to support Physical TPM event log
tpm: enhance TPM 2.0 PCR extend to support multiple banks
tpm: implement TPM 2.0 capability to get active PCR banks
tpm: fix RC value check in tpm2_seal_trusted
tpm_tis: fix iTPM probe via probe_itpm() function
tpm: Begin the process to deprecate user_read_timer
tpm: remove tpm_read_index and tpm_write_index from tpm.h
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJYqeb8AAoJEPfTWPspceCmB3UP/3UtcPrzEm8w2cxB9MaWhZN3
J+jiwlO4vaqhm2HVzQtoJqfaqRlud/iDx5cIXE2S7FnIM54ZKs3CANbKu8X+b1zm
eJije3zMI8A8qyftigbz6a/Y2kWE4ZqFEc9WU5CWawfTl3ImCVUi8+F5X0wOLU/h
r50zAQOEyURH4G5usNl9q0olF6FonJ82AcYm1iJ0QP2wYWZRJauC0rRn8IT93tyK
bZPHnGKdkd7km8yi3zr2GNWOfuZZuA0HWAaF4qfrHPZQ883gITFAUIlFb1f+2TNl
DkQzRrBB2wPWPnlbfb9KejMkvL94hflzsLb5rHt835DyVXFRyjxsgyAI8A+LPGSz
vqZ3rsbWj6H4F9z2CkZ+T+AP/ZSWDNjwc0RXPm9HYdR5CDeTxIUVvnFQ44YNsmTv
Xd5BKrUJ2oKegAxQG6zcuFx23p8JzhT70l+mNrMdtyeKnDD9FRdDvhKG9AHeTipn
o/DnGivhS3UMQoQ7D68KOO+kuhLDeo7my5XGsnjzMO/iHqg++7IP2HyYYs/Ba4qZ
cYaCtSDQW71Zt0vsqa6dvPuXBveu4h8Qh8R7uAGjSGS9IAFFb4Cab2tiUdISE6PE
YnMWzY+G6pT8imlLVOL5/QFuo2Q4pUsaL0AHpXMCN9TZnQtbqXa8eqwnKnQ0m2KN
7ut0IYYEPaYUX5xFn1K6
=z7AL
-----END PGP SIGNATURE-----
Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
- blk-mq scheduling framework from me and Omar, with a port of the
deadline scheduler for this framework. A port of BFQ from Paolo is in
the works, and should be ready for 4.12.
- Various fixups and improvements to the above scheduling framework
from Omar, Paolo, Bart, me, others.
- Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
This allows us to export more information that helps debug hangs or
performance issues, without cluttering or abusing the sysfs API.
- Fixes for the sbitmap code, the scalable bitmap code that was
migrated from blk-mq, from Omar.
- Removal of the BLOCK_PC support in struct request, and refactoring of
carrying SCSI payloads in the block layer. This cleans up the code
nicely, and enables us to kill the SCSI specific parts of struct
request, shrinking it down nicely. From Christoph mainly, with help
from Hannes.
- Support for ranged discard requests and discard merging, also from
Christoph.
- Support for OPAL in the block layer, and for NVMe as well. Mainly
from Scott Bauer, with fixes/updates from various others folks.
- Error code fixup for gdrom from Christophe.
- cciss pci irq allocation cleanup from Christoph.
- Making the cdrom device operations read only, from Kees Cook.
- Fixes for duplicate bdi registrations and bdi/queue life time
problems from Jan and Dan.
- Set of fixes and updates for lightnvm, from Matias and Javier.
- A few fixes for nbd from Josef, using idr to name devices and a
workqueue deadlock fix on receive. Also marks Josef as the current
maintainer of nbd.
- Fix from Josef, overwriting queue settings when the number of
hardware queues is updated for a blk-mq device.
- NVMe fix from Keith, ensuring that we don't repeatedly mark and IO
aborted, if we didn't end up aborting it.
- SG gap merging fix from Ming Lei for block.
- Loop fix also from Ming, fixing a race and crash between setting loop
status and IO.
- Two block race fixes from Tahsin, fixing request list iteration and
fixing a race between device registration and udev device add
notifiations.
- Double free fix from cgroup writeback, from Tejun.
- Another double free fix in blkcg, from Hou Tao.
- Partition overflow fix for EFI from Alden Tondettar.
* tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
nvme: Check for Security send/recv support before issuing commands.
block/sed-opal: allocate struct opal_dev dynamically
block/sed-opal: tone down not supported warnings
block: don't defer flushes on blk-mq + scheduling
blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
blk-mq: don't special case flush inserts for blk-mq-sched
blk-mq-sched: don't add flushes to the head of requeue queue
blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
block: do not allow updates through sysfs until registration completes
lightnvm: set default lun range when no luns are specified
lightnvm: fix off-by-one error on target initialization
Maintainers: Modify SED list from nvme to block
Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
uapi: sed-opal fix IOW for activate lsp to use correct struct
cdrom: Make device operations read-only
elevator: fix loading wrong elevator type for blk-mq devices
cciss: switch to pci_irq_alloc_vectors
block/loop: fix race between I/O and set_status
blk-mq-sched: don't hold queue_lock when calling exit_icq
block: set make_request_fn manually in blk_mq_update_nr_hw_queues
...
1. Andy Price submitted a patch to make gfs2_write_full_page a
static function.
2. Dan Carpenter submitted a patch to fix a ERR_PTR thinko.
I've also got a few patches, three of which fix bugs related to
deleting very large files, which cause GFS2 to run out of
journal space:
3. The first one prevents GFS2 delete operation from requesting too
much journal space.
4. The second one fixes a problem whereby GFS2 can hang because it
wasn't taking journal space demand into its calculations.
5. The third one wakes up IO waiters when a flush is done to restart
processes stuck waiting for journal space to become available.
The other three patches are a performance improvement related to
spin_lock contention between multiple writers:
6. The "tr_touched" variable was switched to a flag to be more atomic
and eliminate the possibility of some races.
7. Function meta_lo_add was moved inline with its only caller to make
the code more readable and efficient.
8. Contention on the gfs2_log_lock spinlock was greatly reduced by
avoiding the lock altogether in cases where we don't really need
it: buffers that already appear in the appropriate metadata list
for the journal. Many thanks to Steve Whitehouse for the ideas and
principles behind these patches.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYrEEEAAoJENeLYdPf93o7bjoIAIqPG/EAzi+idgMWDPQa9Eit
53dPy16snkrbWwtaK6spSWlH6bGYuHeanXORYon9bvtVjKYaa4NQclGihN2IE6uB
O8zT+MGwP45LDhNplVJpumaOALZ9ZDqQSe+3tHeNK3FhNirLyiIjSqrHt/7Yi1qi
fPLlT4Jx0TBo5rhvEGa7Yg01WhWVtnmVSMqJXj/7ZtC50s1aPyDUikdNIDfDCN2X
LxfKGDXuk6p63VQ6JKqYSBVATCR0/bbKfkuk/kBUTYLoHoapImxB8d0HgIdsh1Mv
9PlbZnnNW8k5oapuhVxjl0T5G0JsQgCkPb/wlte+ryOCjBoc2L2fCUV5qc0QxWc=
=xQyl
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Robert Peterson:
"We've got eight GFS2 patches for this merge window:
- Andy Price submitted a patch to make gfs2_write_full_page a static
function.
- Dan Carpenter submitted a patch to fix a ERR_PTR thinko.
Three patches fix bugs related to deleting very large files, which
cause GFS2 to run out of journal space:
- The first one prevents GFS2 delete operation from requesting too
much journal space.
- The second one fixes a problem whereby GFS2 can hang because it
wasn't taking journal space demand into its calculations.
- The third one wakes up IO waiters when a flush is done to restart
processes stuck waiting for journal space to become available.
The final three patches are a performance improvement related to
spin_lock contention between multiple writers:
- The "tr_touched" variable was switched to a flag to be more atomic
and eliminate the possibility of some races.
- Function meta_lo_add was moved inline with its only caller to make
the code more readable and efficient.
- Contention on the gfs2_log_lock spinlock was greatly reduced by
avoiding the lock altogether in cases where we don't really need
it: buffers that already appear in the appropriate metadata list
for the journal. Many thanks to Steve Whitehouse for the ideas and
principles behind these patches"
* tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Make gfs2_write_full_page static
GFS2: Reduce contention on gfs2_log_lock
GFS2: Inline function meta_lo_add
GFS2: Switch tr_touched to flag in transaction
GFS2: Wake up io waiters whenever a flush is done
GFS2: Made logd daemon take into account log demand
GFS2: Limit number of transaction blocks requested for truncates
GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next
Pull UDF fixes and cleanups from Jan Kara:
"Several small UDF fixes and cleanups and a small cleanup of fanotify
code"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: simplify the code of fanotify_merge
udf: simplify udf_ioctl()
udf: fix ioctl errors
udf: allow implicit blocksize specification during mount
udf: check partition reference in udf_read_inode()
udf: atomically read inode size
udf: merge module informations in super.c
udf: remove next_epos from udf_update_extent_cache()
udf: Factor out trimming of crtime
udf: remove empty condition
udf: remove unneeded line break
udf: merge bh free
udf: use pointer for kernel_long_ad argument
udf: use __packed instead of __attribute__ ((packed))
udf: Make stat on symlink report symlink length as st_size
fs/udf: make #ifdef UDF_PREALLOCATE unconditional
fs: udf: Replace CURRENT_TIME with current_time()
Pull CIFS/SMB3 updates from Steve French:
"Includes support for a critical SMB3 security feature: per-share
encryption from Pavel, and a cleanup from Jean Delvare.
Will have another cifs/smb3 merge next week"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Allow to switch on encryption with seal mount option
CIFS: Add capability to decrypt big read responses
CIFS: Decrypt and process small encrypted packets
CIFS: Add copy into pages callback for a read operation
CIFS: Add mid handle callback
CIFS: Add transform header handling callbacks
CIFS: Encrypt SMB3 requests before sending
CIFS: Enable encryption during session setup phase
CIFS: Add capability to transform requests before sending
CIFS: Separate RFC1001 length processing for SMB2 read
CIFS: Separate SMB2 sync header processing
CIFS: Send RFC1001 length in a separate iov
CIFS: Make send_cancel take rqst as argument
CIFS: Make SendReceive2() takes resp iov
CIFS: Separate SMB2 header structure
CIFS: Fix splice read for non-cached files
cifs: Add soft dependencies
cifs: Only select the required crypto modules
cifs: Simplify SMB2 and SMB311 dependencies
primarily used for testing, but which can be useful on production
systems when a scratch volume is being destroyed and the data on it
doesn't need to be saved. This found (and we fixed) a number of bugs
with ext4's recovery to corrupted file system --- the bugs increased
the amount of data that could be potentially lost, and in the case of
the inline data feature, could cause the kernel to BUG.
Also included are a number of other bug fixes, including in ext4's
fscrypt, DAX, inline data support.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlirXesACgkQ8vlZVpUN
gaMOzQf8Ct6uPatV+m855oR4dAbZr2+lY4A4C+vHDzBtSMkPRyLX8cuo8XcwfTIm
vPVyDnL6EPyhXPxxfItu+92wAq1m5mVpKo57d0Ft5lw0rHxNtJTgVSRzsQ7VDRjj
5qMHW2K7Bk7EjzTeW3SF8/3+hqpzkAvRtNCntcomk5h08+cWMC8JSnn1kqw+naIn
EcbrC72GZb8JUELogVXC2vU58lp50SSBdr3l005jqKc5BvljMvdJ0Izn/3RVyU7u
q7vtynhe2ScFcHe/UzL1QgmQOy32tJpbS0NHalW47aw3Ynmn4cSX0YhhT9FDjRNQ
VOOfo1m1sAg166x0E+Nn7FeghTSSyA==
=cPIf
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"For this cycle we add support for the shutdown ioctl, which is
primarily used for testing, but which can be useful on production
systems when a scratch volume is being destroyed and the data on it
doesn't need to be saved.
This found (and we fixed) a number of bugs with ext4's recovery to
corrupted file system --- the bugs increased the amount of data that
could be potentially lost, and in the case of the inline data feature,
could cause the kernel to BUG.
Also included are a number of other bug fixes, including in ext4's
fscrypt, DAX, inline data support"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN
ext4: fix fencepost in s_first_meta_bg validation
ext4: don't BUG when truncating encrypted inodes on the orphan list
ext4: do not use stripe_width if it is not set
ext4: fix stripe-unaligned allocations
dax: assert that i_rwsem is held exclusive for writes
ext4: fix DAX write locking
ext4: add EXT4_IOC_GOINGDOWN ioctl
ext4: add shutdown bit and check for it
ext4: rename s_resize_flags to s_ext4_flags
ext4: return EROFS if device is r/o and journal replay is needed
ext4: preserve the needs_recovery flag when the journal is aborted
jbd2: don't leak modified metadata buffers on an aborted journal
ext4: fix inline data error paths
ext4: move halfmd4 into hash.c directly
ext4: fix use-after-iput when fscrypt contexts are inconsistent
jbd2: fix use after free in kjournald2()
ext4: fix data corruption in data=journal mode
ext4: trim allocation requests to group size
ext4: replace BUG_ON with WARN_ON in mb_find_extent()
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlirP6wACgkQ8vlZVpUN
gaMwpQgApR67CxzlstxYjZpWPAqC8McJ2FBDX+mCOle5Vkc1WQDklwr0oCfQThTj
eDSFRhNfIvyPh0DJ589PxBCsWOqN5h6Si7hD5ZinomVNI+IL89OytaU5EV2OpWaW
iKdJgO9Tm8U7LuY6FOIoVdX57kUXVdkWoj61rC056B1SNhnNiVeofi7lYDM8Ix4q
IGSQ9W24iQKmCk4hCwgObhJBRK9RnlOH0GLUmpMaS+jnfnj/uePwdxWEFsPuCOob
8acAJ49lr55kjIw79E0BAyWxhEZ2aiArHk8PaWynT/DyNq3ftcapPlpftoeba8vo
glBJRX70QxPvt0iHEp0ykfExkhWhFA==
=Joki
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Various cleanups for the file system encryption feature"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: constify struct fscrypt_operations
fscrypt: properly declare on-stack completion
fscrypt: split supp and notsupp declarations into their own headers
fscrypt: remove redundant assignment of res
fscrypt: make fscrypt_operations.key_prefix a string
fscrypt: remove unused 'mode' member of fscrypt_ctx
ext4: don't allow encrypted operations without keys
fscrypt: make test_dummy_encryption require a keyring key
fscrypt: factor out bio specific functions
fscrypt: pass up error codes from ->get_context()
fscrypt: remove user-triggerable warning messages
fscrypt: use EEXIST when file already uses different policy
fscrypt: use ENOTDIR when setting encryption policy on nondirectory
fscrypt: use ENOKEY when file cannot be created w/o key
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this (fairly busy) cycle were:
- There was a class of scheduler bugs related to forgetting to update
the rq-clock timestamp which can cause weird and hard to debug
problems, so there's a new debug facility for this: which uncovered
a whole lot of bugs which convinced us that we want to keep the
debug facility.
(Peter Zijlstra, Matt Fleming)
- Various cputime related updates: eliminate cputime and use u64
nanoseconds directly, simplify and improve the arch interfaces,
implement delayed accounting more widely, etc. - (Frederic
Weisbecker)
- Move code around for better structure plus cleanups (Ingo Molnar)
- Move IO schedule accounting deeper into the scheduler plus related
changes to improve the situation (Tejun Heo)
- ... plus a round of sched/rt and sched/deadline fixes, plus other
fixes, updats and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
sched/core: Remove unlikely() annotation from sched_move_task()
sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
sched/topology: Split out scheduler topology code from core.c into topology.c
sched/core: Remove unnecessary #include headers
sched/rq_clock: Consolidate the ordering of the rq_clock methods
delayacct: Include <uapi/linux/taskstats.h>
sched/core: Clean up comments
sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
sched/clock: Add dummy clear_sched_clock_stable() stub function
sched/cputime: Remove generic asm headers
sched/cputime: Remove unused nsec_to_cputime()
s390, sched/cputime: Remove unused cputime definitions
powerpc, sched/cputime: Remove unused cputime definitions
s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
ia64, sched/cputime: Remove unused cputime definitions
ia64: Convert vtime to use nsec units directly
ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
sched/cputime: Remove jiffies based cputime
sched/cputime, vtime: Return nsecs instead of cputime_t to account
sched/cputime: Complete nsec conversion of tick based accounting
...
It's very likely the file system independent ioctl name will be
FS_IOC_SHUTDOWN, so let's use the same name for the ext4 ioctl name.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull timer updates from Thomas Gleixner:
"Nothing exciting, just the usual pile of fixes, updates and cleanups:
- A bunch of clocksource driver updates
- Removal of CONFIG_TIMER_STATS and the related /proc file
- More posix timer slim down work
- A scalability enhancement in the tick broadcast code
- Math cleanups"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
hrtimer: Catch invalid clockids again
math64, tile: Fix build failure
clocksource/drivers/arm_arch_timer:: Mark cyclecounter __ro_after_init
timerfd: Protect the might cancel mechanism proper
timer_list: Remove useless cast when printing
time: Remove CONFIG_TIMER_STATS
clocksource/drivers/arm_arch_timer: Work around Hisilicon erratum 161010101
clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure
clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter
clocksource/drivers/arm_arch_timer: Add dt binding for hisilicon-161010101 erratum
clocksource/drivers/ostm: Add renesas-ostm timer driver
clocksource/drivers/ostm: Document renesas-ostm timer DT bindings
clocksource/drivers/tcb_clksrc: Use 32 bit tcb as sched_clock
clocksource/drivers/gemini: Add driver for the Cortina Gemini
clocksource: add DT bindings for Cortina Gemini
clockevents: Add a clkevt-of mechanism like clksrc-of
tick/broadcast: Reduce lock cacheline contention
timers: Omit POSIX timer stuff from task_struct when disabled
x86/timer: Make delay() work during early bootup
delay: Add explanation of udelay() inaccuracy
...
The function glock_hash_walk walks the rhashtable by hand. This
is broken because if it catches the hash table in the middle of
a rehash, then it will miss entries.
This patch replaces the manual walk by using the rhashtable walk
interface.
Fixes: 88ffbf3e03 ("GFS2: Use resizable hash table for glocks")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flags (PIPE_BUF_FLAG_PACKET, PIPE_BUF_FLAG_GIFT) could remain on the
unused part of the pipe ring buffer. Previously splice_to_pipe() left
the flags value alone, which could result in incorrect behavior.
Uninitialized flags appears to have been there from the introduction of
the splice syscall.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 2.6.17+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a potential race between fuse_dev_do_write()
and request_wait_answer() contexts as shown below:
TASK 1:
__fuse_request_send():
|--spin_lock(&fiq->waitq.lock);
|--queue_request();
|--spin_unlock(&fiq->waitq.lock);
|--request_wait_answer():
|--if (test_bit(FR_SENT, &req->flags))
<gets pre-empted after it is validated true>
TASK 2:
fuse_dev_do_write():
|--clears bit FR_SENT,
|--request_end():
|--sets bit FR_FINISHED
|--spin_lock(&fiq->waitq.lock);
|--list_del_init(&req->intr_entry);
|--spin_unlock(&fiq->waitq.lock);
|--fuse_put_request();
|--queue_interrupt();
<request gets queued to interrupts list>
|--wake_up_locked(&fiq->waitq);
|--wait_event_freezable();
<as FR_FINISHED is set, it returns and then
the caller frees this request>
Now, the next fuse_dev_do_read(), see interrupts list is not empty
and then calls fuse_read_interrupt() which tries to access the request
which is already free'd and gets the below crash:
[11432.401266] Unable to handle kernel paging request at virtual address
6b6b6b6b6b6b6b6b
...
[11432.418518] Kernel BUG at ffffff80083720e0
[11432.456168] PC is at __list_del_entry+0x6c/0xc4
[11432.463573] LR is at fuse_dev_do_read+0x1ac/0x474
...
[11432.679999] [<ffffff80083720e0>] __list_del_entry+0x6c/0xc4
[11432.687794] [<ffffff80082c65e0>] fuse_dev_do_read+0x1ac/0x474
[11432.693180] [<ffffff80082c6b14>] fuse_dev_read+0x6c/0x78
[11432.699082] [<ffffff80081d5638>] __vfs_read+0xc0/0xe8
[11432.704459] [<ffffff80081d5efc>] vfs_read+0x90/0x108
[11432.709406] [<ffffff80081d67f0>] SyS_read+0x58/0x94
As FR_FINISHED bit is set before deleting the intr_entry with input
queue lock in request completion path, do the testing of this flag and
queueing atomically with the same lock in queue_interrupt().
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: fd22d62ed0 ("fuse: no fc->lock for iqueue parts")
Cc: <stable@vger.kernel.org> # 4.2+
Fix a BUG when the kernel tries to mount a file system constructed as
follows:
echo foo > foo.txt
mke2fs -Fq -t ext4 -O encrypt foo.img 100
debugfs -w foo.img << EOF
write foo.txt a
set_inode_field a i_flags 0x80800
set_super_value s_last_orphan 12
quit
EOF
root@kvm-xfstests:~# mount -o loop foo.img /mnt
[ 160.238770] ------------[ cut here ]------------
[ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
[ 160.240106] invalid opcode: 0000 [#1] SMP
[ 160.240106] Modules linked in:
[ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227
[ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[ 160.240106] task: f4518000 task.stack: f47b6000
[ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
[ 160.240106] EFLAGS: 00010246 CPU: 0
[ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
[ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
[ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
[ 160.240106] Call Trace:
[ 160.240106] ext4_truncate+0x1e9/0x3e5
[ 160.240106] ext4_fill_super+0x286f/0x2b1e
[ 160.240106] ? set_blocksize+0x2e/0x7e
[ 160.240106] mount_bdev+0x114/0x15f
[ 160.240106] ext4_mount+0x15/0x17
[ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d
[ 160.240106] mount_fs+0x58/0x115
[ 160.240106] vfs_kern_mount+0x4b/0xae
[ 160.240106] do_mount+0x671/0x8c3
[ 160.240106] ? _copy_from_user+0x70/0x83
[ 160.240106] ? strndup_user+0x31/0x46
[ 160.240106] SyS_mount+0x57/0x7b
[ 160.240106] do_int80_syscall_32+0x4f/0x61
[ 160.240106] entry_INT80_32+0x2f/0x2f
[ 160.240106] EIP: 0xb76b919e
[ 160.240106] EFLAGS: 00000246 CPU: 0
[ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
[ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
[ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
[ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
[ 160.317241] ---[ end trace d6a773a375c810a5 ]---
The problem is that when the kernel tries to truncate an inode in
ext4_truncate(), it tries to clear any on-disk data beyond i_size.
Without the encryption key, it can't do that, and so it triggers a
BUG.
E2fsck does *not* provide this service, and in practice most file
systems have their orphan list processed by e2fsck, so to avoid
crashing, this patch skips this step if we don't have access to the
encryption key (which is the case when processing the orphan list; in
all other cases, we will have the encryption key, or the kernel
wouldn't have allowed the file to be opened).
An open question is whether the fact that e2fsck isn't clearing the
bytes beyond i_size causing problems --- and if we've lived with it
not doing it for so long, can we drop this from the kernel replay of
the orphan list in all cases (not just when we don't have the key for
encrypted inodes).
Addresses-Google-Bug: #35209576
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWJ3sKfSw1s6N8H32AQLYMw//ViDsOuAu/NwJXW86Xzgh/pRtjOC1481j
AaonMTY1c5WbXN8cFEthst6GERL4BBVV3goF758SJZgnTJSrT4HwYFE1J2Tf0Pc+
Z8UTUoaAhvDl+wVWlKCJ+XaZb5i6+FpnPpnm8F6noQ/Lv3F1LpTWqdybBquGhZ07
K+izTo2isXBhplRVPHfyXyExgpHHHXQpKIW2FVYD/OJ1+/WMUIuep7e98fyO7lcI
5ROA29YzdjdRJP9AZbqd8uFcJA4tBcwSP5yt6sUGOW52ncYYUCVuK0Q9H9LvFMJG
WQm3NhXCt4sTGyHsghcpetp2Jccz+Tk/547HJRqbsb6XznxROENUx4L5hS1YthFM
WbB24rmV5fWchNW5hkGTPOyapjdZujoAlr6JzXawvcx2u9Xdyf/1nO47gCvQBdtg
eyrYpNpKp/CUzrMa1tl8pFK/DbPWCPeDnuZOfygvoCVJaTa15Q59KPItxHrO/p5i
Vj3ExFdVWnDyLmol3U4ffENAuVvSy2xaQhmLVJt6u/npltGD6oq45XFia2naMw5L
ShXTmQh5/dspORrNsVEHh0M6ZVCSlpPQwLyAoIP1+D14+5V035ebzgOOounOhsOa
Zf+mvPfp5yq1BJqsBWTx8qoC+jrSYMYIiRi030+xgu6pXe7t7xPIxHOTjS2X8RCZ
ERlnE3qonXY=
=ol7H
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20170210' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
afs: Use system UUID generation
There is now a general function for generating a UUID and AFS should make
use of it. It's also been recommended to me that I switch to using random
rather than time plus MAC address-based UUIDs which this function does.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of needing additional checks in callers for unallocated przs,
perform the check in the walker, which gives us a more universal way to
handle the situation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull btrfs fixes from Chris Mason:
"This has two last minute fixes. The highest priority here is a
regression fix for the decompression code, but we also fixed up a
problem with the 32-bit compat ioctls.
The decompression bug could hand back the wrong data on big reads when
zlib was used. I have a larger cleanup to make the math here less
error prone, but at this stage in the release Omar's patch is the best
choice"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix btrfs_decompress_buf2page()
btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls
If btrfs_decompress_buf2page() is handed a bio with its page in the
middle of the working buffer, then we adjust the offset into the working
buffer. After we copy into the bio, we advance the iterator by the
number of bytes we copied. Then, we have some logic to handle the case
of discontiguous pages and adjust the offset into the working buffer
again. However, if we didn't advance the bio to a new page, we may enter
this case in error, essentially repeating the adjustment that we already
made when we entered the function. The end result is bogus data in the
bio.
Previously, we only checked for this case when we advanced to a new
page, but the conversion to bio iterators changed that. This restores
the old, correct behavior.
A case I saw when testing with zlib was:
buf_start = 42769
total_out = 46865
working_bytes = total_out - buf_start = 4096
start_byte = 45056
The condition (total_out > start_byte && buf_start < start_byte) is
true, so we adjust the offset:
buf_offset = start_byte - buf_start = 2287
working_bytes -= buf_offset = 1809
current_buf_start = buf_start = 42769
Then, we copy
bytes = min(bvec.bv_len, PAGE_SIZE - buf_offset, working_bytes) = 1809
buf_offset += bytes = 4096
working_bytes -= bytes = 0
current_buf_start += bytes = 44578
After bio_advance(), we are still in the same page, so start_byte is the
same. Then, we check (total_out > start_byte && current_buf_start < start_byte),
which is true! So, we adjust the values again:
buf_offset = start_byte - buf_start = 2287
working_bytes = total_out - start_byte = 1809
current_buf_start = buf_start + buf_offset = 45056
But note that working_bytes was already zero before this, so we should
have stopped copying.
Fixes: 974b1adc3b ("btrfs: use bio iterators for the decompression handlers")
Reported-by: Pat Erley <pat-lkml@erley.org>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
fixable, but at least one of the fixes is a little ugly. The original
bug has always been there, so we can wait another week or two to get
this right.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYncSKAAoJECebzXlCjuG+txsP/iqR4qsw9BOpLDb0Jf3sRpcj
FzNaHC3a8yIsQsgEBqaoslg1ZUIWzbcudxR4G1X0ieeE00uFLNs4/nYDBlQlYAKi
bXtUcHyIqpU9eCDfe+BdrNDyfoMizQW+OoYVoLzXWR8jEzEe3HWGOtSv8Y1McIwo
N75lTYfzYtM3xrUjCaQMY5BgwmOhXZMUh0bkyTGKNv0V/HY2zRrDWbNShsZnh3M+
ywhoDGHKCo1fMM8Gs4fQU4V5z6RXVqi7kNjoAdAjyxsPsBc0U7ggjNfq0mlal0FT
tgGoa+KHuCyMAfbZ5gQVuqetuZXiBnc5CDyxg7idClY/aDzmwPO5dIHXoWUwCrHR
Y/zJRQ0h0jfZK8Mis/T2c9ueYPkSERLHY9Pol8g35FOlkJhR2kc30wQM33Ya0hsk
Hp3qPHWrVoL3AbxuPHkY3aF/YrCIOsHiqdTvxctzrPI9V+rI0Zqx0uro/yOZ/17C
T38O9usblHa5KGT0rFVQDBIuZaF6lNWTV+gCQNFFu+Xxl01i82CQohmckYvV/iC7
zWnMBh7lTXwCmzw6/lWI7d277QdkfJkMYlmsOFPrkV1qxZ92KFJyqas1NFAt0Gg2
V8RmBABd6JSqmA0RWQxNh9iy9Ot53HkugTDMZKgBsBIIiI88DcY2hkhanpRhUXVn
/3uBqGHbWw4wwQaeZiRe
=1kRc
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux
Pull nfsd revert from Bruce Fields:
"This patch turned out to have a couple problems. The problems are
fixable, but at least one of the fixes is a little ugly. The original
bug has always been there, so we can wait another week or two to get
this right"
* tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux:
nfsd: Revert "nfsd: special case truncates some more"