If our layoutreturn returns an NFS4ERR_OLD_STATEID, then try to
update the stateid and retry.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If our layoutreturn on close operation returns an NFS4ERR_OLD_STATEID,
then try to update the stateid and retry. We know that there should
be no further LAYOUTGET requests being launched.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the stateid is no longer recognised on the server, either due to a
restart, or due to a competing CLOSE call, then we do not have to
retry. Any open contexts that triggered a reopen of the file, will
also act as triggers for any CLOSE for the updated stateids.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we're racing with an OPEN, then retry the operation instead of
declaring it a success.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
[Andrew W Elble: Fix a typo in nfs4_refresh_open_stateid]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
On successful rename, the "old_dentry" is retained and is attached to
the "new_dir", so we need to call nfs_set_verifier() accordingly.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the server that does not implement NFSv4.1 persistent session
semantics reboots while we are performing an exclusive create,
then the return value of NFS4ERR_DELAY when we replay the open
during the grace period causes us to lose the verifier.
When the grace period expires, and we present a new verifier,
the server will then correctly reply NFS4ERR_EXIST.
This commit ensures that we always present the same verifier when
replaying the OPEN.
Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ben Coddington has noted the following race between OPEN and CLOSE
on a single client.
Process 1 Process 2 Server
========= ========= ======
1) OPEN file
2) OPEN file
3) Process OPEN (1) seqid=1
4) Process OPEN (2) seqid=2
5) Reply OPEN (2)
6) Receive reply (2)
7) new stateid, seqid=2
8) CLOSE file, using
stateid w/ seqid=2
9) Reply OPEN (1)
10( Process CLOSE (8)
11) Reply CLOSE (8)
12) Forget stateid
file closed
13) Receive reply (7)
14) Forget stateid
file closed.
15) Receive reply (1).
16) New stateid seqid=1
is really the same
stateid that was
closed.
IOW: the reply to the first OPEN is delayed. Since "Process 2" does
not wait before closing the file, and it does not cache the closed
stateid, then when the delayed reply is finally received, it is treated
as setting up a new stateid by the client.
The fix is to ensure that the client processes the OPEN and CLOSE calls
in the same order in which the server processed them.
This commit ensures that we examine the seqid of the stateid
returned by OPEN. If it is a new stateid, we assume the seqid
must be equal to the value 1, and that each state transition
increments the seqid value by 1 (See RFC7530, Section 9.1.4.2,
and RFC5661, Section 8.2.2).
If the tracker sees that an OPEN returns with a seqid that is greater
than the cached seqid + 1, then it bumps a flag to ensure that the
caller waits for the RPCs carrying the missing seqids to complete.
Note that there can still be pathologies where the server crashes before
it can even send us the missing seqids. Since the OPEN call is still
holding a slot when it waits here, that could cause the recovery to
stall forever. To avoid that, we time out after a 5 second wait.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There isn't an obvious way to acquire and release the RCU lock during a
tracepoint, so we can't use the rpc_peeraddr2str() function here.
Instead, rely on the client's cl_hostname, which should have similar
enough information without needing an rcu_dereference().
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Cc: stable@vger.kernel.org # v3.12
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pull overlayfs updates from Miklos Szeredi:
- Report constant st_ino values across copy-up even if underlying
layers are on different filesystems, but using different st_dev
values for each layer.
Ideally we'd report the same st_dev across the overlay, and it's
possible to do for filesystems that use only 32bits for st_ino by
unifying the inum space. It would be nice if it wasn't a choice of 32
or 64, rather filesystems could report their current maximum (that
could change on resize, so it wouldn't be set in stone).
- miscellaneus fixes and a cleanup of ovl_fill_super(), that was long
overdue.
- created a path_put_init() helper that clears out the pointers after
putting the ref.
I think this could be useful elsewhere, so added it to <linux/path.h>
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (30 commits)
ovl: remove unneeded arg from ovl_verify_origin()
ovl: Put upperdentry if ovl_check_origin() fails
ovl: rename ufs to ofs
ovl: clean up getting lower layers
ovl: clean up workdir creation
ovl: clean up getting upper layer
ovl: move ovl_get_workdir() and ovl_get_lower_layers()
ovl: reduce the number of arguments for ovl_workdir_create()
ovl: change order of setup in ovl_fill_super()
ovl: factor out ovl_free_fs() helper
ovl: grab reference to workbasedir early
ovl: split out ovl_get_indexdir() from ovl_fill_super()
ovl: split out ovl_get_lower_layers() from ovl_fill_super()
ovl: split out ovl_get_workdir() from ovl_fill_super()
ovl: split out ovl_get_upper() from ovl_fill_super()
ovl: split out ovl_get_lowerstack() from ovl_fill_super()
ovl: split out ovl_get_workpath() from ovl_fill_super()
ovl: split out ovl_get_upperpath() from ovl_fill_super()
ovl: use path_put_init() in error paths for ovl_fill_super()
vfs: add path_put_init()
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaDuoWAAoJEAAOaEEZVoIVXEQP/jQYoU9hgvEj8j3ZIgi56SDJ
pR45w2zcJz2/uU43DEKyShyLgsuoBbJQ3l/gGBH/tl+xGm9NzB0gatoEu9GmKNYz
/IN6/vUFnoIAUyD+iMZbpmsYKIkz0z2YJo261IfspAwIft/cvHJnYYGQrP9YXg9F
c7bdDuANTKocdQigc4BQyOe3OfIBGfTwJhuakO+1yuZmGOVNyxEcdYbMM8FiTfc8
+62kvQQ3t7WMqSbM8M0QdGcYQjG0EwcVAuV7COurLJIva7hUkVel32MVUjoFcf28
BnRu2ztFJCubm1HA85twlJDtpeXbcMqrUl/CcwRMpwDaePd5GVB1h5iKqbZ51BZ1
fWT2STmt+8hY2B5eiXoYEaG3B7ZRr+r0oroxqOxpiZ/m4AVeouF+gPGv+NV5zgvD
NGWC0MdklIJ4xaC99NEeP6kBhz0M74VKymFCTeHkVg9m4TqDepNvitKed0qagw19
uw8seei7TOTm4o117+l55NHmyfTHXFO4U0WLTJyeZcoEnUs0rOcHeqyy0RwCBMrK
W2fJtdBLFr+tBIIrID4TnPhhYtSvIPjz+FpiRDobqhgvMva/PIvLGTWK4unrgIjG
ZQ7YGnwWda8GjqKhgZacn/BSXyJzOAF9hJp0mz2ORaOxaMarEV55duiZufCvGuZw
uUQWRCKuQX7Oi05i9jXp
=fCeF
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking update from Jeff Layton:
"A couple of fixes for a patch that went into v4.14, and the bug report
just came in a few days ago.. It passes my (minimal) testing, and has
been in linux-next for a few days now.
I also would like to get my address changed in MAINTAINERS to clear
that hurdle"
* tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall
fcntl: don't leak fd reference when fixup_compat_flock fails
MAINTAINERS: s/jlayton@poochiereds.net/jlayton@kernel.org/
Pull cramfs updates from Al Viro:
"Nicolas Pitre's cramfs work"
* 'work.cramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
cramfs: rehabilitate it
cramfs: add mmap support
cramfs: implement uncompressed and arbitrary data block positioning
cramfs: direct memory access support
Pull misc vfs updates from Al Viro:
"Assorted stuff, really no common topic here"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: grab the lock instead of blocking in __fd_install during resizing
vfs: stop clearing close on exec when closing a fd
include/linux/fs.h: fix comment about struct address_space
fs: make fiemap work from compat_ioctl
coda: fix 'kernel memory exposure attempt' in fsync
pstore: remove unneeded unlikely()
vfs: remove unneeded unlikely()
stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
make vfs_ustat() static
do_handle_open() should be static
elf_fdpic: fix unused variable warning
fold destroy_super() into __put_super()
new helper: destroy_unused_super()
fix address space warnings in ipc/
acct.h: get rid of detritus
Pull iov_iter updates from Al Viro:
- bio_{map,copy}_user_iov() series; those are cleanups - fixes from the
same pile went into mainline (and stable) in late September.
- fs/iomap.c iov_iter-related fixes
- new primitive - iov_iter_for_each_range(), which applies a function
to kernel-mapped segments of an iov_iter.
Usable for kvec and bvec ones, the latter does kmap()/kunmap() around
the callback. _Not_ usable for iovec- or pipe-backed iov_iter; the
latter is not hard to fix if the need ever appears, the former is by
design.
Another related primitive will have to wait for the next cycle - it
passes page + offset + size instead of pointer + size, and that one
will be usable for everything _except_ kvec. Unfortunately, that one
didn't get exposure in -next yet, so...
- a bit more lustre iov_iter work, including a use case for
iov_iter_for_each_range() (checksum calculation)
- vhost/scsi leak fix in failure exit
- misc cleanups and detritectomy...
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
iomap_dio_actor(): fix iov_iter bugs
switch ksocknal_lib_recv_...() to use of iov_iter_for_each_range()
lustre: switch struct ksock_conn to iov_iter
vhost/scsi: switch to iov_iter_get_pages()
fix a page leak in vhost_scsi_iov_to_sgl() error recovery
new primitive: iov_iter_for_each_range()
lnet_return_rx_credits_locked: don't abuse list_entry
xen: don't open-code iov_iter_kvec()
orangefs: remove detritus from struct orangefs_kiocb_s
kill iov_shorten()
bio_alloc_map_data(): do bmd->iter setup right there
bio_copy_user_iov(): saner bio size calculation
bio_map_user_iov(): get rid of copying iov_iter
bio_copy_from_iter(): get rid of copying iov_iter
move more stuff down into bio_copy_user_iov()
blk_rq_map_user_iov(): move iov_iter_advance() down
bio_map_user_iov(): get rid of the iov_for_each()
bio_map_user_iov(): move alignment check into the main loop
don't rely upon subsequent bio_add_pc_page() calls failing
... and with iov_iter_get_pages_alloc() it becomes even simpler
...
Pull compat and uaccess updates from Al Viro:
- {get,put}_compat_sigset() series
- assorted compat ioctl stuff
- more set_fs() elimination
- a few more timespec64 conversions
- several removals of pointless access_ok() in places where it was
followed only by non-__ variants of primitives
* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
coredump: call do_unlinkat directly instead of sys_unlink
fs: expose do_unlinkat for built-in callers
ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
ipmi: get rid of pointless access_ok()
pi433: sanitize ioctl
cxlflash: get rid of pointless access_ok()
mtdchar: get rid of pointless access_ok()
r128: switch compat ioctls to drm_ioctl_kernel()
selection: get rid of field-by-field copyin
VT_RESIZEX: get rid of field-by-field copyin
i2c compat ioctls: move to ->compat_ioctl()
sched_rr_get_interval(): move compat to native, get rid of set_fs()
mips: switch to {get,put}_compat_sigset()
sparc: switch to {get,put}_compat_sigset()
s390: switch to {get,put}_compat_sigset()
ppc: switch to {get,put}_compat_sigset()
parisc: switch to {get,put}_compat_sigset()
get_compat_sigset()
get rid of {get,put}_compat_itimerspec()
io_getevents: Use timespec64 to represent timeouts
...
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable nfs_client.cl_count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable nfs_lock_context.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable nfs4_lock_state.ls_count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable nfs_cache_defer_req.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable nfs4_ff_layout_mirror.ref is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable pnfs_layout_hdr.plh_refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable nfs4_pnfs_ds.ds_count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the previous request on a slot was interrupted before it was
processed by the server, then our slot sequence number may be out of whack,
and so we try the next operation using the old sequence number.
The problem with this, is that not all servers check to see that the
client is replaying the same operations as previously when they decide
to go to the replay cache, and so instead of the expected error of
NFS4ERR_SEQ_FALSE_RETRY, we get a replay of the old reply, which could
(if the operations match up) be mistaken by the client for a new reply.
To fix this, we attempt to send a COMPOUND containing only the SEQUENCE op
in order to resync our slot sequence number.
Cc: Olga Kornievskaia <olga.kornievskaia@gmail.com>
[olga.kornievskaia@gmail.com: fix an Oops]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
'userspace flush' of persistent memory updates via filesystem-dax
mappings. It arranges for any filesystem metadata updates that may be
required to satisfy a write fault to also be flushed ("on disk") before
the kernel returns to userspace from the fault handler. Effectively
every write-fault that dirties metadata completes an fsync() before
returning from the fault handler. The new MAP_SHARED_VALIDATE mapping
type guarantees that the MAP_SYNC flag is validated as supported by the
filesystem's ->mmap() file operation.
* Add support for the standard ACPI 6.2 label access methods that
replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods. This
enables interoperability with environments that only implement the
standardized methods.
* Add support for the ACPI 6.2 NVDIMM media error injection methods.
* Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for latch
last shutdown status, firmware update, SMART error injection, and
SMART alarm threshold control.
* Cleanup physical address information disclosures to be root-only.
* Fix revalidation of the DIMM "locked label area" status to support
dynamic unlock of the label area.
* Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
(system-physical-address) command and error injection commands.
Acknowledgements that came after the commits were pushed to -next:
957ac8c421 dax: fix PMD faults on zero-length files
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
a39e596baa xfs: support for synchronous DAX faults
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
7b565c9f96 xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaDfvcAAoJEB7SkWpmfYgCk7sP/2qJhBH+VTTdg2osDnhAdAhI
co/AGEmsHFlUCMBb/Ek7UnMAmhBYiJU2q4ywPsNFBpusXpMlqNy5Iwo7k4/wQHE/
SJcIM0g4zg0ViFuUhwV+C2T0R5UzFR8JLd9EYWj/YS6aJpurtotm5l4UStaM0Hzo
AhxSXJLrBDuqCpbOxbctfiGEmdRL7aRfBEAARTNRKBn/iXxJUcYHlp62rtXQS+t4
I6LC/URCWTNTTMGmzW6TRsgSD9WMfd19xKcGzN3qL6ee0KFccxN4ctFqHA/sFGOh
iYLeR0XJUjJxyp+PkWGteXPVZL0Kj3bD/lSTG+Co5bm/ra8a/sh3TSFfgFyoBZD1
EqMN8Ryf80hGp3FabeH2Iw2SviYPZpHSWgjddjxLD0RA6OmpzINc+Wm8eqApjMME
sbZDTOijiab4QMQ0XamF4GuDHyQtawv5Y/w2Ehhl1tmiqW+5tKhsKqxkQt+/V3Yt
RTVSRe2Pkway66b+cD64IdQ6L2tyonPnmi5IzgkKOhlOEGomy+4/U2Jt2bMbhzq6
ymszKmXp2XI8P06wU8sHrIUeXO5I9qoKn/fZA73Eb8aIzgJe3tBE/5+Ab7RG6HB9
1OVfcMWoXU1gNgNktTs63X1Lsg4aW9kt/K4fPHHcqUcaliEJpJTlAbg9GLF2buoW
nQ+0fTRgMRihE3ZA0Fs3
=h2vZ
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and dax updates from Dan Williams:
"Save for a few late fixes, all of these commits have shipped in -next
releases since before the merge window opened, and 0day has given a
build success notification.
The ext4 touches came from Jan, and the xfs touches have Darrick's
reviewed-by. An xfstest for the MAP_SYNC feature has been through
a few round of reviews and is on track to be merged.
- Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
'userspace flush' of persistent memory updates via filesystem-dax
mappings. It arranges for any filesystem metadata updates that may
be required to satisfy a write fault to also be flushed ("on disk")
before the kernel returns to userspace from the fault handler.
Effectively every write-fault that dirties metadata completes an
fsync() before returning from the fault handler. The new
MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
is validated as supported by the filesystem's ->mmap() file
operation.
- Add support for the standard ACPI 6.2 label access methods that
replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
This enables interoperability with environments that only implement
the standardized methods.
- Add support for the ACPI 6.2 NVDIMM media error injection methods.
- Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
latch last shutdown status, firmware update, SMART error injection,
and SMART alarm threshold control.
- Cleanup physical address information disclosures to be root-only.
- Fix revalidation of the DIMM "locked label area" status to support
dynamic unlock of the label area.
- Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
(system-physical-address) command and error injection commands.
Acknowledgements that came after the commits were pushed to -next:
- 957ac8c421 ("dax: fix PMD faults on zero-length files"):
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
- a39e596baa ("xfs: support for synchronous DAX faults") and
7b565c9f96 ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>"
* tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
acpi, nfit: add 'Enable Latch System Shutdown Status' command support
dax: fix general protection fault in dax_alloc_inode
dax: fix PMD faults on zero-length files
dax: stop requiring a live device for dax_flush()
brd: remove dax support
dax: quiet bdev_dax_supported()
fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
tools/testing/nvdimm: unit test clear-error commands
acpi, nfit: validate commands against the device type
tools/testing/nvdimm: stricter bounds checking for error injection commands
xfs: support for synchronous DAX faults
xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
ext4: Support for synchronous DAX faults
ext4: Simplify error handling in ext4_dax_huge_fault()
dax: Implement dax_finish_sync_fault()
dax, iomap: Add support for synchronous faults
mm: Define MAP_SYNC and VM_SYNC flags
dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
dax: Allow dax_iomap_fault() to return pfn
dax: Fix comment describing dax_iomap_fault()
...
Fix the AFS file locking whereby the use of the big kernel lock (which
could be slept with) was replaced by a spinlock (which couldn't). The
problem is that the AFS code was doing stuff inside the critical section
that might call schedule(), so this is a broken transformation.
Fix this by the following means:
(1) Use a state machine with a proper state that can only be changed under
the spinlock rather than using a collection of bit flags.
(2) Cache the key used for the lock and the lock type in the afs_vnode
struct so that the manager work function doesn't have to refer to a
file_lock struct that's been dequeued. This makes signal handling
safer.
(4) Move the unlock from afs_do_unlk() to afs_fl_release_private() which
means that unlock is achieved in other circumstances too.
(5) Unlock the file on the server before taking the next conflicting lock.
Also change:
(1) Check the permits on a file before actually trying the lock.
(2) fsync the file before effecting an explicit unlock operation. We
don't fsync if the lock is erased otherwise as we might not be in a
context where we can actually do that.
Further fixes:
(1) Fixed-fileserver address rotation is made to work. It's only used by
the locking functions, so couldn't be tested before.
Fixes: 72f98e7255 ("locks: turn lock_flocks into a spinlock")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: jlayton@redhat.com
Pull ARM updates from Russell King:
- add support for ELF fdpic binaries on both MMU and noMMU platforms
- linker script cleanups
- support for compressed .data section for XIP images
- discard memblock arrays when possible
- various cleanups
- atomic DMA pool updates
- better diagnostics of missing/corrupt device tree
- export information to allow userspace kexec tool to place images more
inteligently, so that the device tree isn't overwritten by the
booting kernel
- make early_printk more efficient on semihosted systems
- noMMU cleanups
- SA1111 PCMCIA update in preparation for further cleanups
* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: (38 commits)
ARM: 8719/1: NOMMU: work around maybe-uninitialized warning
ARM: 8717/2: debug printch/printascii: translate '\n' to "\r\n" not "\n\r"
ARM: 8713/1: NOMMU: Support MPU in XIP configuration
ARM: 8712/1: NOMMU: Use more MPU regions to cover memory
ARM: 8711/1: V7M: Add support for MPU to M-class
ARM: 8710/1: Kconfig: Kill CONFIG_VECTORS_BASE
ARM: 8709/1: NOMMU: Disallow MPU for XIP
ARM: 8708/1: NOMMU: Rework MPU to be mostly done in C
ARM: 8707/1: NOMMU: Update MPU accessors to use cp15 helpers
ARM: 8706/1: NOMMU: Move out MPU setup in separate module
ARM: 8702/1: head-common.S: Clear lr before jumping to start_kernel()
ARM: 8705/1: early_printk: use printascii() rather than printch()
ARM: 8703/1: debug.S: move hexbuf to a writable section
ARM: add additional table to compressed kernel
ARM: decompressor: fix BSS size calculation
pcmcia: sa1111: remove special sa1111 mmio accessors
pcmcia: sa1111: use sa1111_get_irq() to obtain IRQ resources
ARM: better diagnostics with missing/corrupt dtb
ARM: 8699/1: dma-mapping: Remove init_dma_coherent_pool_size()
ARM: 8698/1: dma-mapping: Mark atomic_pool as __ro_after_init
..
In this round, we introduce sysfile-based quota support which is required
for Android by default. In addition, we allow that users are able to reserve
some blocks in runtime to mitigate performance drops in low free space.
Enhancement
- assign proper data segments according to write_hints given by user
- issue cache_flush on dirty devices only among multiple devices
- exploit cp_error flag and add more faults to enhance fault injection test
- conduct more readaheads during f2fs_readdir
- add a range for discard commands
Bug fix
- fix zero stat->st_blocks when inline_data is set
- drop crypto key and free stale memory pointer while evict_inode is failing
- fix some corner cases in free space and segment management
- fix wrong last_disk_size
This series includes lots of clean-ups and code enhancement in terms of xattr
operations, discard/flush command control. In addition, it adds versatile
debugfs entries to monitor f2fs status.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAloNCPAACgkQQBSofoJI
UNLYmg/8DbDp/mTXqJ0AURo84Z4OQUOTRxYkWazx4ct2WPZp2+5HCWDDoM8AAtUn
1J6/t7cU3osjos+zWvpUREZq1SPbp5m0h818HBFFJ/YMBPXucdQcd6wpepniOR5J
5uKauVd7jd2pbAAL7hKyr+iBSLrJl816wsq34Ml8y8zkDSJe4wO5YsGDqzqyKf4N
8nxMavUgerb14I/qXPb3ljlYlfaNNRlCT649QGCG78gx5hPeiUtUJ2l5DKV2xPe7
v+5lZO93FFwW1siGy+Atq+nqQJyUkeiOYGPR1NPx9tfmaPO58iOIXLirfblKASZY
HXJigVf50fQQBtwdBFL8ICSop6zV6gCKkNGZCHLzcYFWWL2TQwCIP3/iJdj9Wy+j
+YUYyN0dyl2mmNEDZjRNX1V+QBW1k+msmvBCb0fT1GJTQAyRfA4XfBDyg94cpWQ1
9YivNywuzG8YtghY7gYU3lCfT2OG19nXCSdz4qYUb5SSwoeGtLahLxMV4mlil4Tg
dOa8CPLFhJnCqB9ivI4L6SennBr+gNgL26SeZ3PF+B5KimYOTZxbenrll1kTi1xp
uCU6UR1xJS0W7Cjk8sCIu5hXkJMJwPJ0hcVeTgsxMkujLGvSSRCGb2hmOeILfwRZ
N4aGn+kVmwwgKaKjD/F4CY4b3yJLdTKMjjl74u5YaMQWe4Bq4qU=
=c49T
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we introduce sysfile-based quota support which is
required for Android by default. In addition, we allow that users are
able to reserve some blocks in runtime to mitigate performance drops
in low free space.
Enhancements:
- assign proper data segments according to write_hints given by user
- issue cache_flush on dirty devices only among multiple devices
- exploit cp_error flag and add more faults to enhance fault
injection test
- conduct more readaheads during f2fs_readdir
- add a range for discard commands
Bug fixes:
- fix zero stat->st_blocks when inline_data is set
- drop crypto key and free stale memory pointer while evict_inode is
failing
- fix some corner cases in free space and segment management
- fix wrong last_disk_size
This series includes lots of clean-ups and code enhancement in terms
of xattr operations, discard/flush command control. In addition, it
adds versatile debugfs entries to monitor f2fs status"
* tag 'f2fs-for-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (75 commits)
f2fs: deny accessing encryption policy if encryption is off
f2fs: inject fault in inc_valid_node_count
f2fs: fix to clear FI_NO_PREALLOC
f2fs: expose quota information in debugfs
f2fs: separate nat entry mem alloc from nat_tree_lock
f2fs: validate before set/clear free nat bitmap
f2fs: avoid opened loop codes in __add_ino_entry
f2fs: apply write hints to select the type of segments for buffered write
f2fs: introduce scan_curseg_cache for cleanup
f2fs: optimize the way of traversing free_nid_bitmap
f2fs: keep scanning until enough free nids are acquired
f2fs: trace checkpoint reason in fsync()
f2fs: keep isize once block is reserved cross EOF
f2fs: avoid race in between GC and block exchange
f2fs: save a multiplication for last_nid calculation
f2fs: fix summary info corruption
f2fs: remove dead code in update_meta_page
f2fs: remove unneeded semicolon
f2fs: don't bother with inode->i_version
f2fs: check curseg space before foreground GC
...
Be consistent about using uint32_t/uint8_t instead of u32/u8. This is
more so that we don't have to maintain /those/ types in xfsprogs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWgm9V/Sw1s6N8H32AQK5mQ//QGUDZLXsUPCtq0XJq0V+r4MUjNp9tCZR
htiuNrEkHSyPpYgCcQ2Aqdl9kndwVXcE7lWT99mp/a0zwNAsp9GOGVhCXUd5R86G
XlrBuUYVvBJk18tDsUNWdjRQ0gMHgQSlEnEbsaGiU1bVrpXatI9hL8qoeO78Iy7+
eaJUQLCuCVJq7qMQGhC0hg338vmHVeYhnViXIxq+HFjsMmR9IVanuK+sQr6NSJxS
F6RkPxBUPWkRVMHmxTLWj/XSHZwtwu+Mnc/UFYsAPLKEbY0cIohsI8EgfE8U7geU
yRVnu3MIOXUXUrZizj9SwVYWdJfneRlINqMbHIO8QXMKR38tnQ0C2/7bgBsXiNPv
YdiAyeqL4nM+JthV/rgA3hWgupwBlSb4ubclTphDNxMs5MBIUIK3XUt9GOXDDUZz
2FT/FdrphM2UORaI2AEOi4Q0/nHdin+3rld8fjV0Ree/TPNXwcrOmvy8yGnxFCEp
5b7YLwKrffZGnnS965dhZlnFR6hjndmzFgHdyRrJwc80hXi1Q/+W4F19MoYkkoVK
G/gLvD3FbmygmFnjCik9TjUrro6vQxo56H/TuWgHTvYriNGH+D/D7EGUwg4GiXZZ
+7vrNw660uXmZiu9i0YacCRyD8lvm7QpmWLb+uHwzfsBE1+C8UetyQ+egSWVdWJO
KwPspygWXD4=
=3vy0
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"kAFS filesystem driver overhaul.
The major points of the overhaul are:
(1) Preliminary groundwork is laid for supporting network-namespacing
of kAFS. The remainder of the namespacing work requires some way
to pass namespace information to submounts triggered by an
automount. This requires something like the mount overhaul that's
in progress.
(2) sockaddr_rxrpc is used in preference to in_addr for holding
addresses internally and add support for talking to the YFS VL
server. With this, kAFS can do everything over IPv6 as well as
IPv4 if it's talking to servers that support it.
(3) Callback handling is overhauled to be generally passive rather
than active. 'Callbacks' are promises by the server to tell us
about data and metadata changes. Callbacks are now checked when
we next touch an inode rather than actively going and looking for
it where possible.
(4) File access permit caching is overhauled to store the caching
information per-inode rather than per-directory, shared over
subordinate files. Whilst older AFS servers only allow ACLs on
directories (shared to the files in that directory), newer AFS
servers break that restriction.
To improve memory usage and to make it easier to do mass-key
removal, permit combinations are cached and shared.
(5) Cell database management is overhauled to allow lighter locks to
be used and to make cell records autonomous state machines that
look after getting their own DNS records and cleaning themselves
up, in particular preventing races in acquiring and relinquishing
the fscache token for the cell.
(6) Volume caching is overhauled. The afs_vlocation record is got rid
of to simplify things and the superblock is now keyed on the cell
and the numeric volume ID only. The volume record is tied to a
superblock and normal superblock management is used to mediate
the lifetime of the volume fscache token.
(7) File server record caching is overhauled to make server records
independent of cells and volumes. A server can be in multiple
cells (in such a case, the administrator must make sure that the
VL services for all cells correctly reflect the volumes shared
between those cells).
Server records are now indexed using the UUID of the server
rather than the address since a server can have multiple
addresses.
(8) File server rotation is overhauled to handle VMOVED, VBUSY (and
similar), VOFFLINE and VNOVOL indications and to handle rotation
both of servers and addresses of those servers. The rotation will
also wait and retry if the server says it is busy.
(9) Data writeback is overhauled. Each inode no longer stores a list
of modified sections tagged with the key that authorised it in
favour of noting the modified region of a page in page->private
and storing a list of keys that made modifications in the inode.
This simplifies things and allows other keys to be used to
actually write to the server if a key that made a modification
becomes useless.
(10) Writable mmap() is implemented. This allows a kernel to be build
entirely on AFS.
Note that Pre AFS-3.4 servers are no longer supported, though this can
be added back if necessary (AFS-3.4 was released in 1998)"
* tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
afs: Protect call->state changes against signals
afs: Trace page dirty/clean
afs: Implement shared-writeable mmap
afs: Get rid of the afs_writeback record
afs: Introduce a file-private data record
afs: Use a dynamic port if 7001 is in use
afs: Fix directory read/modify race
afs: Trace the sending of pages
afs: Trace the initiation and completion of client calls
afs: Fix documentation on # vs % prefix in mount source specification
afs: Fix total-length calculation for multiple-page send
afs: Only progress call state at end of Tx phase from rxrpc callback
afs: Make use of the YFS service upgrade to fully support IPv6
afs: Overhaul volume and server record caching and fileserver rotation
afs: Move server rotation code into its own file
afs: Add an address list concept
afs: Overhaul cell database management
afs: Overhaul permit caching
afs: Overhaul the callback handling
afs: Rename struct afs_call server member to cm_server
...
Here is the set of driver core / debugfs patches for 4.15-rc1.
Not many here, mostly all are debugfs fixes to resolve some
long-reported problems with files going away with references to them in
userspace. There's also some SPDX cleanups for the debugfs code, as
well as a few other minor driver core changes for issues reported by
people.
All of these have been in linux-next for a week or more with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWg2NCA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymUNgCfYq434CFh+YtwITBNYdqkFYFf0ZAAn3qfhh2+
M3rmZzwk2MKBvNQ2npvt
=/8+Y
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core / debugfs patches for 4.15-rc1.
Not many here, mostly all are debugfs fixes to resolve some
long-reported problems with files going away with references to them
in userspace. There's also some SPDX cleanups for the debugfs code, as
well as a few other minor driver core changes for issues reported by
people.
All of these have been in linux-next for a week or more with no
reported issues"
* tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: Fix device link deferred probe
debugfs: Remove redundant license text
debugfs: add SPDX identifiers to all debugfs files
debugfs: defer debugfs_fsdata allocation to first usage
debugfs: call debugfs_real_fops() only after debugfs_file_get()
debugfs: purge obsolete SRCU based removal protection
IB/hfi1: convert to debugfs_file_get() and -put()
debugfs: convert to debugfs_file_get() and -put()
debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
debugfs: implement per-file removal protection
debugfs: add support for more elaborate ->d_fsdata
driver core: Move device_links_purge() after bus_remove_device()
arch_topology: Fix section miss match warning due to free_raw_capacity()
driver-core: pr_err() strings should end with newlines
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2 updates
- almost all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
memory hotplug: fix comments when adding section
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
mm: simplify nodemask printing
mm,oom_reaper: remove pointless kthread_run() error check
mm/page_ext.c: check if page_ext is not prepared
writeback: remove unused function parameter
mm: do not rely on preempt_count in print_vma_addr
mm, sparse: do not swamp log with huge vmemmap allocation failures
mm/hmm: remove redundant variable align_end
mm/list_lru.c: mark expected switch fall-through
mm/shmem.c: mark expected switch fall-through
mm/page_alloc.c: broken deferred calculation
mm: don't warn about allocations which stall for too long
fs: fuse: account fuse_inode slab memory as reclaimable
mm, page_alloc: fix potential false positive in __zone_watermark_ok
mm: mlock: remove lru_add_drain_all()
mm, sysctl: make NUMA stats configurable
shmem: convert shmem_init_inodecache() to void
Unify migrate_pages and move_pages access checks
mm, pagevec: rename pagevec drained field
...
Fuse inodes are currently included in the unreclaimable slab counts -
SUnreclaim in /proc/meminfo, slab_unreclaimable in /proc/vmstat and the
per-cgroup memory.stat. But they are reclaimable just like other
filesystems' inodes, and /proc/sys/vm/drop_caches frees them easily.
Mark the slab cache reclaimable.
Link: http://lkml.kernel.org/r/20171102202727.12539-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of. Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact. Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.
This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache. Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop. It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.
The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path. A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.
Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All callers of release_pages claim the pages being released are cache
hot. As no one cares about the hotness of pages being released to the
allocator, just ditch the parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot. As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During truncation, the mapping has already been checked for shmem and
dax so it's known that workingset_update_node is required.
This patch avoids the checks on mapping for each page being truncated.
In all other cases, a lookup helper is used to determine if
workingset_update_node() needs to be called. The one danger is that the
API is slightly harder to use as calling workingset_update_node directly
without checking for dax or shmem mappings could lead to surprises.
However, the API rarely needs to be used and hopefully the comment is
enough to give people the hint.
sparsetruncate (tiny)
4.14.0-rc4 4.14.0-rc4
oneirq-v1r1 pickhelper-v1r1
Min Time 141.00 ( 0.00%) 140.00 ( 0.71%)
1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%)
2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%)
3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%)
Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%)
Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%)
Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%)
Max Time 230.00 ( 0.00%) 205.00 ( 10.87%)
Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%)
Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%)
Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%)
Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%)
Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%)
Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%)
Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%)
Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%)
Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%)
As you'd expect, the gain is marginal but it can be detected. The
differences in bonnie are all within the noise which is not surprising
given the impact on the microbenchmark.
radix_tree_update_node_t is a callback for some radix operations that
optionally passes in a private field. The only user of the callback is
workingset_update_node and as it no longer requires a mapping, the
private field is removed.
Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1508132478-7738-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kmemcheck: kill kmemcheck", v2.
As discussed at LSF/MM, kill kmemcheck.
KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow). KASan is already upstream.
We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).
The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.
Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.
This patch (of 4):
Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The allocations from filp cache can be directly triggered by userspace
applications. A buggy application can consume a significant amount of
unaccounted system memory. Though we have not noticed such buggy
applications in our production but upon close inspection, we found that
a lot of machines spend very significant amount of memory on these
caches.
One way to limit allocations from filp cache is to set system level
limit of maximum number of open files. However this limit is shared
between different users on the system and one user can hog this
resource. To cater that, we can charge filp to kmemcg and set the
maximum limit very high and let the memory limit of each user limit the
number of files they can open and indirectly limiting their allocations
from filp cache.
One side effect of this change is that it will allow _sysctl() to return
ENOMEM and the man page of _sysctl() does not specify that. However the
man page also discourages to use _sysctl() at all.
Link: http://lkml.kernel.org/r/20171011190359.34926-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we account page tables separately for each page table level,
but that's redundant -- we only make use of total memory allocated to
page tables for oom_badness calculation. We also provide the
information to userspace, but it has dubious value there too.
This patch switches page table accounting to single counter.
mm->pgtables_bytes is now used to account all page table levels. We use
bytes, because page table size for different levels of page table tree
may be different.
The change has user-visible effect: we don't have VmPMD and VmPUD
reported in /proc/[pid]/status. Not sure if anybody uses them. (As
alternative, we can always report 0 kB for them.)
OOM-killer report is also slightly changed: we now report pgtables_bytes
instead of nr_ptes, nr_pmd, nr_puds.
Apart from reducing number of counters per-mm, the benefit is that we
now calculate oom_badness() more correctly for machines which have
different size of page tables depending on level or where page tables
are less than a page in size.
The only downside can be debuggability because we do not know which page
table level could leak. But I do not remember many bugs that would be
caught by separate counters so I wouldn't lose sleep over this.
[akpm@linux-foundation.org: fix mm/huge_memory.c]
Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
[kirill.shutemov@linux.intel.com: fix build]
Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
and nr_pud.
The patch also makes nr_ptes accounting dependent onto CONFIG_MMU. Page
table accounting doesn't make sense if you don't have page tables.
It's preparation for consolidation of page-table counters in mm_struct.
Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On a machine with 5-level paging support a process can allocate
significant amount of memory and stay unnoticed by oom-killer and memory
cgroup. The trick is to allocate a lot of PUD page tables. We don't
account PUD page tables, only PMD and PTE.
We already addressed the same issue for PMD page tables, see commit
dc6c9a35b6 ("mm: account pmd page tables to the process").
Introduction of 5-level paging brings the same issue for PUD page
tables.
The patch expands accounting to PUD level.
[kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
[heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wdata_alloc_and_fillpages() needlessly iterates calls to
find_get_pages_tag(). Also it wants only pages from given range. Make
it use find_get_pages_range_tag().
Link: http://lkml.kernel.org/r/20171009151359.31984-17-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Suggested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use find_get_pages_range_tag() in afs_writepages_region() as we are
interested only in pages from given range. Remove unnecessary code
after this conversion.
Link: http://lkml.kernel.org/r/20171009151359.31984-16-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.
Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new function for looking up pages since nr_pages argument from
pagevec_lookup_range_tag() is going away.
Link: http://lkml.kernel.org/r/20171009151359.31984-14-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in nilfs_lookup_dirty_data_buffers().
Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and
remove unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-10-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in gfs2_write_cache_jdata(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-9-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__get_first_dirty_index() wants to lookup only the first dirty page
after given index. There's no point in using pagevec_lookup_tag() for
that. Just use find_get_pages_tag() directly.
Link: http://lkml.kernel.org/r/20171009151359.31984-8-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In several places we want to iterate over all tagged pages in a mapping.
However the code was apparently copied from places that iterate only
over a limited range and thus it checks for index <= end, optimizes the
case where we are coming close to range end which is all pointless when
end == ULONG_MAX. So just remove this dead code.
[akpm@linux-foundation.org: fix warnings]
Link: http://lkml.kernel.org/r/20171009151359.31984-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in f2fs_write_cache_pages(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in ext4_writepages(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in ceph_writepages_start(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch only affects users of mmu_notifier->invalidate_range callback
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
and it is an optimization for those users. Everyone else is unaffected
by it.
When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (notify version of *_clear_flush helpers do call the
mmu_notifier_invalidate_range). But that notification is not necessary
in all cases.
This patch removes almost all cases where it is useless to have a call
to mmu_notifier_invalidate_range before
mmu_notifier_invalidate_range_end. It also adds documentation in all
those cases explaining why.
Below is a more in depth analysis of why this is fine to do this:
For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when
device use thing like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space). There is only 2 cases
when you need to notify those secondary TLB while holding page table
lock when clearing a pte/pmd:
A) page backing address is free before mmu_notifier_invalidate_range_end
B) a page table entry is updated to point to a new page (COW, write fault
on zero page, __replace_page(), ...)
Case A is obvious you do not want to take the risk for the device to write
to a page that might now be used by something completely different.
Case B is more subtle. For correctness it requires the following sequence
to happen:
- take page table lock
- clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
- set page table entry to point to new page
If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value then you can break memory model like C11 or C++11 for
the device.
Consider the following scenario (device use a feature similar to ATS/
PASID):
Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
assume they are write protected for COW (other case of B apply too).
[Time N] -----------------------------------------------------------------
CPU-thread-0 {try to write to addrA}
CPU-thread-1 {try to write to addrB}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA and populate device TLB}
DEV-thread-2 {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0 {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1 {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {write to addrA which is a write to new page}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {}
CPU-thread-3 {write to addrB which is a write to new page}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA from old page}
DEV-thread-2 {read addrB from new page}
So here because at time N+2 the clear page table entry was not pair with a
notification to invalidate the secondary TLB, the device see the new value
for addrB before seing the new value for addrA. This break total memory
ordering for the device.
When changing a pte to write protect or to point to a new write protected
page with same content (KSM) it is ok to delay invalidate_range callback
to mmu_notifier_invalidate_range_end() outside the page table lock. This
is true even if the thread doing page table update is preempted right
after releasing page table lock before calling
mmu_notifier_invalidate_range_end
Thanks to Andrea for thinking of a problematic scenario for COW.
[jglisse@redhat.com: v2]
Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to have a local return code set with -EINVAL when both
the conditions following it return error codes appropriately. Just
remove the redundant one.
Link: http://lkml.kernel.org/r/20170929145444.17611-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
etc).
SLAB is bloated temporarily by switching to "unsigned long", but only
temporarily.
Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When dlm_add_migration_mle returns -EEXIST, previously input mle will
not be initialized. So we can't use its associated dlm object. And we
truly don't need this mle for already launched migration progress, since
oldmle has taken this role.
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED7AA61@H3CMLB14-EX.srv.huawei-3com.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The subsystem.su_mutex is required while accessing the item->ci_parent,
otherwise, NULL pointer dereference to the item->ci_parent will be
triggered in the following situation:
add node delete node
sys_write
vfs_write
configfs_write_file
o2nm_node_store
o2nm_node_local_write
do_rmdir
vfs_rmdir
configfs_rmdir
mutex_lock(&subsys->su_mutex);
unlink_obj
item->ci_group = NULL;
item->ci_parent = NULL;
to_o2nm_cluster_from_node
node->nd_item.ci_parent->ci_parent
BUG since of NULL pointer dereference to nd_item.ci_parent
Moreover, the o2nm_cluster also should be protected by the
subsystem.su_mutex.
[alex.chen@huawei.com: v2]
Link: http://lkml.kernel.org/r/59EEAA69.9080703@huawei.com
Link: http://lkml.kernel.org/r/59E9B36A.10700@huawei.com
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ip_alloc_sem should be taken in ocfs2_get_block() when reading file in
DIRECT mode to prevent concurrent access to extent tree with
ocfs2_dio_end_io_write(), which may cause BUGON in the following
situation:
read file 'A' end_io of writing file 'A'
vfs_read
__vfs_read
ocfs2_file_read_iter
generic_file_read_iter
ocfs2_direct_IO
__blockdev_direct_IO
do_blockdev_direct_IO
do_direct_IO
get_more_blocks
ocfs2_get_block
ocfs2_extent_map_get_blocks
ocfs2_get_clusters
ocfs2_get_clusters_nocache()
ocfs2_search_extent_list
return the index of record which
contains the v_cluster, that is
v_cluster > rec[i]->e_cpos.
ocfs2_dio_end_io
ocfs2_dio_end_io_write
down_write(&oi->ip_alloc_sem);
ocfs2_mark_extent_written
ocfs2_change_extent_flag
ocfs2_split_extent
...
--> modify the rec[i]->e_cpos, resulting
in v_cluster < rec[i]->e_cpos.
BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos))
[alex.chen@huawei.com: v3]
Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com
Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com
Fixes: c15471f795 ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Gang He <ghe@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
we should wait dio requests to finish before inode lock in
ocfs2_setattr(), otherwise the following deadlock will happen:
process 1 process 2 process 3
truncate file 'A' end_io of writing file 'A' receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
ocfs2_inode_lock_full
inode_dio_wait
__inode_dio_wait
-->waiting for all dio
requests finish
dlm_proxy_ast_handler
dlm_do_local_bast
ocfs2_blocking_ast
ocfs2_generic_handle_bast
set OCFS2_LOCK_BLOCKED flag
dio_end_io
dio_bio_end_aio
dio_complete
ocfs2_dio_end_io
ocfs2_dio_end_io_write
ocfs2_inode_lock
__ocfs2_cluster_lock
ocfs2_wait_for_mask
-->waiting for OCFS2_LOCK_BLOCKED
flag to be cleared, that is waiting
for 'process 1' unlocking the inode lock
inode_dio_end
-->here dec the i_dio_count, but will never
be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/59C5D7D6.9050106@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a node dies, other live nodes have to choose a new master for an
existed lock resource mastered by the dead node.
As for ocfs2/dlm implementation, this is done by function -
dlm_move_lockres_to_recovery_list which marks those lock rsources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list, no master will be
choosed after dlm recovery accomplishment since no lock resource can be
found through ::resource list.
What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for lock
resources mastered a dead node, it will break up synchronization among
nodes.
So invoke dlm_move_lockres_to_recovery_list again.
Fixs: 'commit ee8f7fcbe6 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")'
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-EX.srv.huawei-3com.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reported-by: Vitaly Mayatskih <v.mayatskih@gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/59E064BB.8000005@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
destroy_workqueue() will do flushing work for us.
Link: http://lkml.kernel.org/r/59E06476.3090502@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Summary of modules changes for the 4.15 merge window:
- Treewide module_param_call() cleanup, fix up set/get function
prototype mismatches, from Kees Cook
- Minor code cleanups
Signed-off-by: Jessica Yu <jeyu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJaDCyzAAoJEMBFfjjOO8FyaYQP/AwHBy6XmwwVlWDP4BqIF6hL
Vhy3ccVLYEORvePv68tWSRPUz5n6+1Ebqanmwtkw6i8l+KwxY2SfkZql09cARc33
2iBE4bHF98iWQmnJbF6me80fedY9n5bZJNMQKEF9VozJWwTMOTQFTCfmyJRDBmk9
iidQj6M3idbSUOYIJjvc40VGx5NyQWSr+FFfqsz1rU5iLGRGEvA3I2/CDT0oTuV6
D4MmFxzE2Tv/vIMa2GzKJ1LGScuUfSjf93Lq9Kk0cG36qWao8l930CaXyVdE9WJv
bkUzpf3QYv/rDX6QbAGA0cada13zd+dfBr8YhchclEAfJ+GDLjMEDu04NEmI6KUT
5lP0Xw0xYNZQI7bkdxDMhsj5jaz/HJpXCjPCtZBnSEKiL4OPXVMe+pBHoCJ2/yFN
6M716XpWYgUviUOdiE+chczB5p3z4FA6u2ykaM4Tlk0btZuHGxjcSWwvcIdlPmjm
kY4AfDV6K0bfEBVguWPJicvrkx44atqT5nWbbPhDwTSavtsuRJLb3GCsHedx7K8h
ZO47lCQFAWCtrycK1HYw+oupNC3hYWQ0SR42XRdGhL1bq26C+1sei1QhfqSgA9PQ
7CwWH4UTOL9fhtrzSqZngYOh9sjQNFNefqQHcecNzcEjK2vjrgQZvRNWZKHSwaFs
fbGX8juZWP4ypbK+irTB
=c8vb
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
"Summary of modules changes for the 4.15 merge window:
- treewide module_param_call() cleanup, fix up set/get function
prototype mismatches, from Kees Cook
- minor code cleanups"
* tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: Do not paper over type mismatches in module_param_call()
treewide: Fix function prototypes for module_param_call()
module: Prepare to convert all module_param_call() prototypes
kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
Pull networking updates from David Miller:
"Highlights:
1) Maintain the TCP retransmit queue using an rbtree, with 1GB
windows at 100Gb this really has become necessary. From Eric
Dumazet.
2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.
3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
Lunn.
4) Add meter action support to openvswitch, from Andy Zhou.
5) Add a data meta pointer for BPF accessible packets, from Daniel
Borkmann.
6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.
7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.
8) More work to move the RTNL mutex down, from Florian Westphal.
9) Add 'bpftool' utility, to help with bpf program introspection.
From Jakub Kicinski.
10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
Dangaard Brouer.
11) Support 'blocks' of transformations in the packet scheduler which
can span multiple network devices, from Jiri Pirko.
12) TC flower offload support in cxgb4, from Kumar Sanghvi.
13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
Leitner.
14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.
15) Add RED qdisc offloadability, and use it in mlxsw driver. From
Nogah Frankel.
16) eBPF based device controller for cgroup v2, from Roman Gushchin.
17) Add some fundamental tracepoints for TCP, from Song Liu.
18) Remove garbage collection from ipv6 route layer, this is a
significant accomplishment. From Wei Wang.
19) Add multicast route offload support to mlxsw, from Yotam Gigi"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
tcp: highest_sack fix
geneve: fix fill_info when link down
bpf: fix lockdep splat
net: cdc_ncm: GetNtbFormat endian fix
openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
netem: remove unnecessary 64 bit modulus
netem: use 64 bit divide by rate
tcp: Namespace-ify sysctl_tcp_default_congestion_control
net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
ipv6: set all.accept_dad to 0 by default
uapi: fix linux/tls.h userspace compilation error
usbnet: ipheth: prevent TX queue timeouts when device not ready
vhost_net: conditionally enable tx polling
uapi: fix linux/rxrpc.h userspace compilation errors
net: stmmac: fix LPI transitioning for dwmac4
atm: horizon: Fix irq release error
net-sysfs: trigger netlink notification on ifalias change via sysfs
openvswitch: Using kfree_rcu() to simplify the code
openvswitch: Make local function ovs_nsh_key_attr_size() static
openvswitch: Fix return value check in ovs_meter_cmd_features()
...
Plenty of acronym soup here:
- Initial support for the Scalable Vector Extension (SVE)
- Improved handling for SError interrupts (required to handle RAS events)
- Enable GCC support for 128-bit integer types
- Remove kernel text addresses from backtraces and register dumps
- Use of WFE to implement long delay()s
- ACPI IORT updates from Lorenzo Pieralisi
- Perf PMU driver for the Statistical Profiling Extension (SPE)
- Perf PMU driver for Hisilicon's system PMUs
- Misc cleanups and non-critical fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJaCcLqAAoJELescNyEwWM0JREH/2FbmD/khGzEtP8LW+o9D8iV
TBM02uWQxS1bbO1pV2vb+512YQO+iWfeQwJH9Jv2FZcrMvFv7uGRnYgAnJuXNGrl
W+LL6OhN22A24LSawC437RU3Xe7GqrtONIY/yLeJBPablfcDGzPK1eHRA0pUzcyX
VlyDruSHWX44VGBPV6JRd3x0vxpV8syeKOjbRvopRfn3Nwkbd76V3YSfEgwoTG5W
ET1sOnXLmHHdeifn/l1Am5FX1FYstpcd7usUTJ4Oto8y7e09tw3bGJCD0aMJ3vow
v1pCUWohEw7fHqoPc9rTrc1QEnkdML4vjJvMPUzwyTfPrN+7uEuMIEeJierW+qE=
=0qrg
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"The big highlight is support for the Scalable Vector Extension (SVE)
which required extensive ABI work to ensure we don't break existing
applications by blowing away their signal stack with the rather large
new vector context (<= 2 kbit per vector register). There's further
work to be done optimising things like exception return, but the ABI
is solid now.
Much of the line count comes from some new PMU drivers we have, but
they're pretty self-contained and I suspect we'll have more of them in
future.
Plenty of acronym soup here:
- initial support for the Scalable Vector Extension (SVE)
- improved handling for SError interrupts (required to handle RAS
events)
- enable GCC support for 128-bit integer types
- remove kernel text addresses from backtraces and register dumps
- use of WFE to implement long delay()s
- ACPI IORT updates from Lorenzo Pieralisi
- perf PMU driver for the Statistical Profiling Extension (SPE)
- perf PMU driver for Hisilicon's system PMUs
- misc cleanups and non-critical fixes"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (97 commits)
arm64: Make ARMV8_DEPRECATED depend on SYSCTL
arm64: Implement __lshrti3 library function
arm64: support __int128 on gcc 5+
arm64/sve: Add documentation
arm64/sve: Detect SVE and activate runtime support
arm64/sve: KVM: Hide SVE from CPU features exposed to guests
arm64/sve: KVM: Treat guest SVE use as undefined instruction execution
arm64/sve: KVM: Prevent guests from using SVE
arm64/sve: Add sysctl to set the default vector length for new processes
arm64/sve: Add prctl controls for userspace vector length management
arm64/sve: ptrace and ELF coredump support
arm64/sve: Preserve SVE registers around EFI runtime service calls
arm64/sve: Preserve SVE registers around kernel-mode NEON use
arm64/sve: Probe SVE capabilities and usable vector lengths
arm64: cpufeature: Move sys_caps_initialised declarations
arm64/sve: Backend logic for setting the vector length
arm64/sve: Signal handling support
arm64/sve: Support vector length resetting for new processes
arm64/sve: Core task context handling
arm64/sve: Low-level CPU setup
...
After commit 890da9cf09 (Revert "x86: do not use cpufreq_quick_get()
for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
on x86 can be either the nominal CPU frequency (which is constant)
or the frequency most recently requested by a scaling governor in
cpufreq, depending on the cpufreq configuration. That is somewhat
inconsistent and is different from what it was before 4.13, so in
order to restore the previous behavior, make it report the current
CPU frequency like the scaling_cur_freq sysfs file in cpufreq.
To that end, modify the /proc/cpuinfo implementation on x86 to use
aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
registers, if available, and use their values to compute the CPU
frequency to be reported as "cpu MHz".
However, do that carefully enough to avoid accumulating delays that
lead to unacceptable access times for /proc/cpuinfo on systems with
many CPUs. Run aperfmperf_snapshot_khz() once on all CPUs
asynchronously at the /proc/cpuinfo open time, add a single delay
upfront (if necessary) at that point and simply compute the current
frequency while running show_cpuinfo() for each individual CPU.
Also, to avoid slowing down /proc/cpuinfo accesses too much, reduce
the default delay between consecutive APERF and MPERF reads to 10 ms,
which should be sufficient to get large enough numbers for the
frequency computation in all cases.
Fixes: 890da9cf09 (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Currently, we're capping the values too low in the F_GETLK64 case. The
fields in that structure are 64-bit values, so we shouldn't need to do
any sort of fixup there.
Make sure we check that assumption at build time in the future however
by ensuring that the sizes we're copying will fit.
With this, we no longer need COMPAT_LOFF_T_MAX either, so remove it.
Fixes: 94073ad77f (fs/locks: don't mess with the address limit in compat_fcntl64)
Reported-by: Vitaly Lipatov <lav@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Currently we just return err here, but we need to put the fd reference
first.
Fixes: 94073ad77f (fs/locks: don't mess with the address limit in compat_fcntl64)
Signed-off-by: Jeff Layton <jlayton@redhat.com>
PMD faults on a zero length file on a file system mounted with -o dax
will not generate SIGBUS as expected.
fd = open(...O_TRUNC);
addr = mmap(NULL, 2*1024*1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
*addr = 'a';
<expect SIGBUS>
The problem is this code in dax_iomap_pmd_fault:
max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
If the inode size is zero, we end up with a max_pgoff that is way larger
than 0. :) Fix it by using DIV_ROUND_UP, as is done elsewhere in the
kernel.
I tested this with some simple test code that ensured that SIGBUS was
received where expected.
Cc: <stable@vger.kernel.org>
Fixes: 642261ac99 ("dax: add struct iomap based DAX PMD support")
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.
Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:
- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.
- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.
- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)
- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)
- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).
- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).
- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).
- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).
- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).
- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.
- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.
- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"
* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
- proper use of the bool type (Thomas Meyer)
- constification of struct config_item_type (Bhumika Goyal)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAloLSTALHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNxfhAAv3cunxiEPEAvs+1xuGd3cZYaxz7qinvIODPxIKoF
kRWiuy5PUklRMnJ8seOgJ1p1QokX6Sk4cZ8HcctDJVByqODjOq4K5eaKVN1ZqJoz
BUzO/gOqfs64r9yaFIlKfe8nFA+gpUftSeWyv3lThxAIJ1iSbue7OZ/A10tTOS1m
RWp9FPepFv+nJMfWqeQU64BsoDQ4kgZ2NcEA+jFxNx5dlmIbLD49tk0lfddvZQXr
j5WyAH73iugilLtNUGVOqSzHBY4kUvfCKUV7leirCegyMoGhFtA87m6Wzwbo6ZUI
DwQLzWvuPaGv1P2PpNEHfKiNbfIEp75DRyyyf87DD3lc5ffAxQSm28mGuwcr7Rn5
Ow/yWL6ERMzCLExoCzEkXYJISy7T5LIzYDgNggKMpeWxysAduF7Onx7KfW1bTuhK
mHvY7iOXCjEvaIVaF8uMKE6zvuY1vCMRXaJ+kC9jcIE3gwhg+2hmQvrdJ2uAFXY+
rkeF2Poj/JlblPU4IKWAjiPUbzB7Lv0gkypCB2pD4riaYIN5qCAgF8ULIGQp2hsO
lYW1EEgp5FBop85oSO/HAGWeH9dFg0WaV7WqNRVv0AGXhKjgy+bVd7iYPpvs7mGw
z9IqSQDORcG2ETLcFhZgiJpCk/itwqXBD+wgMOjJPP8lL+4kZ8FcuhtY9kc9WlJE
Tew=
=+tMO
-----END PGP SIGNATURE-----
Merge tag 'configfs-for-4.15' of git://git.infradead.org/users/hch/configfs
Pull configfs updates from Christoph Hellwig:
"A couple of configfs cleanups:
- proper use of the bool type (Thomas Meyer)
- constification of struct config_item_type (Bhumika Goyal)"
* tag 'configfs-for-4.15' of git://git.infradead.org/users/hch/configfs:
RDMA/cma: make config_item_type const
stm class: make config_item_type const
ACPI: configfs: make config_item_type const
nvmet: make config_item_type const
usb: gadget: configfs: make config_item_type const
PCI: endpoint: make config_item_type const
iio: make function argument and some structures const
usb: gadget: make config_item_type structures const
dlm: make config_item_type const
netconsole: make config_item_type const
nullb: make config_item_type const
ocfs2/cluster: make config_item_type const
target: make config_item_type const
configfs: make ci_type field, some pointers and function arguments const
configfs: make config_item_type const
configfs: Fix bool initialization/comparison
Pull quota, ext2, isofs and udf fixes from Jan Kara:
- two small quota error handling fixes
- two isofs fixes for architectures with signed char
- several udf block number overflow and signedness fixes
- ext2 rework of mount option handling to avoid GFP_KERNEL allocation
with spinlock held
- ... it also contains a patch to implement auditing of responses to
fanotify permission events. That should have been in the fanotify
pull request but I mistakenly merged that patch into a wrong branch
and noticed only now at which point I don't think it's worth rebasing
and redoing.
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: be aware of error from dquot_initialize
quota: fix potential infinite loop
isofs: use unsigned char types consistently
isofs: fix timestamps beyond 2027
udf: Fix some sign-conversion warnings
udf: Fix signed/unsigned format specifiers
udf: Fix 64-bit sign extension issues affecting blocks > 0x7FFFFFFF
udf: Remove some outdate references from documentation
udf: Avoid overflow when session starts at large offset
ext2: Fix possible sleep in atomic during mount option parsing
ext2: Parse mount options into a dedicated structure
audit: Record fanotify access control decisions
Pull fsnotify updates from Jan Kara:
- fixes of use-after-tree issues when handling fanotify permission
events from Miklos
- refcount_t conversions from Elena
- fixes of ENOMEM handling in dnotify and fsnotify from me
* 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: convert fsnotify_mark.refcnt from atomic_t to refcount_t
fanotify: clean up CONFIG_FANOTIFY_ACCESS_PERMISSIONS ifdefs
fsnotify: clean up fsnotify()
fanotify: fix fsnotify_prepare_user_wait() failure
fsnotify: fix pinning group in fsnotify_prepare_user_wait()
fsnotify: pin both inode and vfsmount mark
fsnotify: clean up fsnotify_prepare/finish_user_wait()
fsnotify: convert fsnotify_group.refcnt from atomic_t to refcount_t
fsnotify: Protect bail out path of fsnotify_add_mark_locked() properly
dnotify: Handle errors from fsnotify_add_mark_locked() in fcntl_dirnotify()
This set focuses, as usual, on fixes to the comms layer.
New testing of the dlm with ocfs2 uncovered a number of
bugs in the TCP connection handling during recovery,
starting, and stopping.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaCeJ8AAoJEDgbc8f8gGmqpqQP/A9GekFRRvm3QfpmHG3Lj6ey
O4IKINaB8F46KDBaWzTwKE3fOl0j19qICKuEibBZeJl4lGh7Q5GDOMZ20AfDU4wv
Qq40OEfFombCFsVX/Qc4AvdXj7cjpfjJwrZhW6CHOkYGZDaAmsHBeCgBeTvhkqR4
dj4pGFIwBpvV1gQrIteFx110kupeT8DvCSIVzWelD+Jb18vtht7YfKehRyQ3Cyix
8sbEPiKuhmLW/6wbliRqQL5cp9ZyUU5YtBqhmE8r2QIbOOB+k1xFIvVgUylawv3P
qi1SpBkX7zRM4BCTP0J3zbUzQHZhgjtgBLVMiSrAWBFb3XtpssEXVczKFDxFafEt
YJtPeqHxr8zwzQeF+6MGx6amRWW0T9yHv2sB79wBkz8wL483qL39k9DNa564NoSJ
rZtN0bk4g6CuDnHgEM3hzNsVU2sgdaQMZnRWYONHwvDeI+HgKJWD4nedD6wFmXlo
kimrQDQCzvx8ZnCKHH0/k23BV2SoYz+80fbW+TeFCWU6gPFGKcJZ12p1e9YYLZJh
yeY1Y/kdNhLWyIZlldIK1TtO0645YPBhXcaFBA/RF7g8EbwKrIG8FUZSHzWwQIoJ
hGtLBhWT12BGE2NCLHSMCrKZEb+JeXIN+jKxm9g2m5k6D+nQBt5K7Ae6j6n8pwUC
hxic9hQmXNxb0R51YD/+
=w3jk
-----END PGP SIGNATURE-----
Merge tag 'dlm-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set focuses, as usual, on fixes to the comms layer.
New testing of the dlm with ocfs2 uncovered a number of bugs in the
TCP connection handling during recovery, starting, and stopping"
* tag 'dlm-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: remove dlm_send_rcom_lookup_dump
dlm: recheck kthread_should_stop() before schedule()
DLM: fix NULL pointer dereference in send_to_sock()
DLM: fix to reschedule rwork
DLM: fix to use sk_callback_lock correctly
DLM: fix overflow dlm_cb_seq
DLM: fix memory leak in tcp_accept_from_sock()
DLM: fix conversion deadlock when DLM_LKF_NODLCKWT flag is set
DLM: use CF_CLOSE flag to stop dlm_send correctly
DLM: Reanimate CF_WRITE_PENDING flag
DLM: fix race condition between dlm_recoverd_stop and dlm_recoverd
DLM: close othercon at send/receive error
DLM: retry rcom when dlm_wait_function is timed out.
DLM: fix to use sock_mutex correctly in xxx_accept_from_sock
DLM: fix race condition between dlm_send and dlm_recv
DLM: fix double list_del()
DLM: fix remove save_cb argument from add_sock()
DLM: Fix saving of NULL callbacks
DLM: Eliminate CF_WRITE_PENDING flag
DLM: Eliminate CF_CONNECT_PENDING flag
patches are basically in three categories: (1) patches related to
broken xfstest cases, (2) patches related to improving iomap and
start using it in GFS2, and (3) general typos and clarifications.
Please note that one of the iomap patches extends beyond GFS2 and
affects other file systems, but it was publically reviewed by a
variety of file system people in the community.
1. Andreas has a patch that simply renames variable 'bsize' to 'factor'
to clarify the logic related to gfs2_block_map.
2. He also has a patch to correctly set ctime in the setflags ioctl,
which fixes broken xfstests test 277.
3. He also fixed broken xfstest 258, due to an atime initialization
problem.
4. He also fixed broken xfstest 307, in which GFS2 was not setting
ctime when setting acls.
5. He has a patch to switch general iomap code from blkno to disk
offset for a variety of file systems.
6. He has a patch to add a new IOMAP_F_DATA_INLINE flag for iomap
to indicate blocks that have data mixed with metadata.
7. I contributed a patch to make inode height info part of the
'metapath' data structure to facilitate using iomap in GFS2.
8. I have a patch to start using iomap inside GFS2 and switch GFS2's
block_map functions to use iomap under the covers.
9. I have a patch to switch GFS2's fiemap implementation from using
block_map to using iomap under the covers.
10. Andreas has a patch to implement SEEK_HOLE and SEEK_DATA via
iomap in GFS2.
11. I have a patch related to journaled data pages not being properly
synced to media when writing inodes. This was caught with xfstests.
12. I have a patch to fix another failing xfstest case in which
switching a file from ordered_write to journaled data via set_flags
caused a deadlock.
13. Andreas has a patch to fix failing xfstest case 066, which was
due to not properly syncing dirty inodes when changing extended
attributes.
14. Andreas fixed a minor typo in a comment.
15. Andreas contributed a patch to partially fix xfstest 424, which
involved GET_FLAGS and SET_FLAGS ioctl. This is also a cleanup
and simplification of the translation of flags from fs flags to
gfs2 flags.
16. He also added support for STATX_ATTR_ in statx, which fixed broken
xfstest 424.
17. He also contributed a fix for failing xfstest 093 which fixes a
recursive glock problem with gfs2_xattr_get and _set.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJaCx5uAAoJENeLYdPf93o7Nb8H/RLJ2CsEbTSJQ82RH4eptoxe
XbQ4HVig9Hm8k5teSTH9DdVypkxjPtbJZY9k1Y4mEtddtCZ/yS407aTdr/pP0C5r
3W8Ouu2JXmqPKWg0sp3wC/Pji2ThCYssQXNyBSDPADsF2C8XEuT7aL/YPzMitIdm
Lxa9JHo1tKgdFnkloNyaTt4MdBGNF5M5UBr6KgRfwhgooHWbxM0rNyZIXJtySb0I
vsaNNOA7a4VQp1Fo1DkHQomNbOG5hpVKfswUOOZvk2RdAewTPN+jXiOAmIhNjQ3Y
/PkJLjRCf8Ob/VIYmt2BTs16+07mODGv1d6DuhgXzH/dfiVihVGvVo71DxXx5uw=
=i8b2
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.15.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
"We've got a total of 17 GFS2 patches for this merge window. The
patches are basically in three categories: (1) patches related to
broken xfstest cases, (2) patches related to improving iomap and start
using it in GFS2, and (3) general typos and clarifications.
Please note that one of the iomap patches extends beyond GFS2 and
affects other file systems, but it was publically reviewed by a
variety of file system people in the community.
From Andreas Gruenbacher:
- rename variable 'bsize' to 'factor' to clarify the logic related to
gfs2_block_map.
- correctly set ctime in the setflags ioctl, which fixes broken
xfstests test 277.
- fix broken xfstest 258, due to an atime initialization problem.
- fix broken xfstest 307, in which GFS2 was not setting ctime when
setting acls.
- switch general iomap code from blkno to disk offset for a variety
of file systems.
- add a new IOMAP_F_DATA_INLINE flag for iomap to indicate blocks
that have data mixed with metadata.
- implement SEEK_HOLE and SEEK_DATA via iomap in GFS2.
- fix failing xfstest case 066, which was due to not properly syncing
dirty inodes when changing extended attributes.
- fix a minor typo in a comment.
- partially fix xfstest 424, which involved GET_FLAGS and SET_FLAGS
ioctl. This is also a cleanup and simplification of the translation
of flags from fs flags to gfs2 flags.
- add support for STATX_ATTR_ in statx, which fixed broken xfstest
424.
- fix for failing xfstest 093 which fixes a recursive glock problem
with gfs2_xattr_get and _set
From me:
- make inode height info part of the 'metapath' data structure to
facilitate using iomap in GFS2.
- start using iomap inside GFS2 and switch GFS2's block_map functions
to use iomap under the covers.
- switch GFS2's fiemap implementation from using block_map to using
iomap under the covers.
- fix journaled data pages not being properly synced to media when
writing inodes. This was caught with xfstests.
- fix another failing xfstest case in which switching a file from
ordered_write to journaled data via set_flags caused a deadlock"
* tag 'gfs2-4.15.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Allow gfs2_xattr_set to be called with the glock held
gfs2: Add support for statx inode flags
gfs2: Fix and clean up {GET,SET}FLAGS ioctl
gfs2: Fix a harmless typo
gfs2: Fix xattr fsync
GFS2: Take inode off order_write list when setting jdata flag
GFS2: flush the log and all pages for jdata as we do for WB_SYNC_ALL
gfs2: Implement SEEK_HOLE / SEEK_DATA via iomap
GFS2: Switch fiemap implementation to use iomap
GFS2: Implement iomap for block_map
GFS2: Make height info part of metapath
gfs2: Always update inode ctime in set_acl
gfs2: Support negative atimes
gfs2: Update ctime in setflags ioctl
gfs2: Clarify gfs2_block_map
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAloLPhEACgkQNqiEXrVA
jGTKUA//bdyfsVyx6GH8VMspHQcwYuIn40Pz3iD1MCEu0Fg4yjjF8RL0F+RC4xGi
OHVA5eLRW+6JOlTcZQ4Cf7+MNbAzcrsjbo3wsjWjBT5jfFh5Z6XcNel1WLBUsWBF
tK5tgOjB+vfdWF6oquCi4n4R+EWW2TcmI6qhtsV0Xr1trmbaTls4zDE9DAghJam+
UXwMSsZHe/YZ88kLMHQ2qf3mD/MmUCBCIM+LHUWlK+wCW9ZLMMh5mtizRqhGKAo4
hHBxv0tknlTIfd3qskxxFP5ii6bADcR11vNYx7DUjIhb/GOs6eGIWmlLyJN17P9X
F21mTuQ4SDfXnaBaq5w3mLRgAek9+bB1di5rIIEqXkiPZySX3wXGeZk60i6Bps55
Z2xUPuk0E5VImljEv8kqpbZRbMc+bP5IFM2lmsRmvvjlinqK82s3soqhystTVLKU
Cm5mMGN51mX+00OboC9lJg7caZmb2b1gCtrIgzE++8gN3RKdKzSQXEMJz28oIUoH
w9Fjm687MWbjyDU7fhIcIcouVYRcUKqQGKj3qw8iNXkXUx/2UXZaXdAYve8Fmj8U
BZdp7rD5SJyLxEc4NEDbi8rkWkDW6UwpOR/WD796AxT7rMORA17LUAg0wV8Xuq42
Zc6PW74038hfI1WoMrW97488c0dh3Kc+saKqwdVece/Z3HfcsQ8=
=oulW
-----END PGP SIGNATURE-----
Merge tag 'jfs-4.15' of git://github.com/kleikamp/linux-shaggy
Pull jfs updates from David Kleikamp:
"A couple small fixes for jfs"
* tag 'jfs-4.15' of git://github.com/kleikamp/linux-shaggy:
jfs: Add missing NULL pointer check in __get_metapage
jfs: remove increment of i_version counter
Pull btrfs updates from David Sterba:
"There are some new user features and the usual load of invisible
enhancements or cleanups.
New features:
- extend mount options to specify zlib compression level, -o
compress=zlib:9
- v2 of ioctl "extent to inode mapping", addressing a usecase where
we want to retrieve more but inaccurate results and do the
postprocessing in userspace, aiding defragmentation or
deduplication tools
- populate compression heuristics logic, do data sampling and try to
guess compressibility by: looking for repeated patterns, counting
unique byte values and distribution, calculating Shannon entropy;
this will need more benchmarking and possibly fine tuning, but the
base should be good enough
- enable indexing for btrfs as lower filesystem in overlayfs
- speedup page cache readahead during send on large files
Internal enhancements:
- more sanity checks of b-tree items when reading them from disk
- more EINVAL/EUCLEAN fixups, missing BLK_STS_* conversion, other
errno or error handling fixes
- remove some homegrown IO-related logic, that's been obsoleted by
core block layer changes (batching, plug/unplug, own counters)
- add ref-verify, optional debugging feature to verify extent
reference accounting
- simplify code handling outstanding extents, make it more clear
where and how the accounting is done
- make delalloc reservations per-inode, simplify the code and make
the logic more straightforward
- extensive cleanup of delayed refs code
Notable fixes:
- fix send ioctl on 32bit with 64bit kernel"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (102 commits)
btrfs: Fix bug for misused dev_t when lookup in dev state hash table.
Btrfs: heuristic: add Shannon entropy calculation
Btrfs: heuristic: add byte core set calculation
Btrfs: heuristic: add byte set calculation
Btrfs: heuristic: add detection of repeated data patterns
Btrfs: heuristic: implement sampling logic
Btrfs: heuristic: add bucket and sample counters and other defines
Btrfs: compression: separate heuristic/compression workspaces
btrfs: move btrfs_truncate_block out of trans handle
btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
btrfs: track refs in a rb_tree instead of a list
btrfs: add a comp_refs() helper
btrfs: switch args for comp_*_refs
btrfs: make the delalloc block rsv per inode
btrfs: add tracepoints for outstanding extents mods
Btrfs: rework outstanding_extents
btrfs: increase output size for LOGICAL_INO_V2 ioctl
btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
btrfs: send: remove unused code
...
- Refactor the incore extent map manipulations to use a cursor instead of
directly modifying extent data.
- Refactor the incore extent map cursor to use an in-memory btree instead
of a single high-order allocation. This eliminates a major source of
complaints about insufficient memory when opening a heavily fragmented
file into a system whose memory is also heavily fragmented.
- Fix a longstanding bug where deleting a file with a complex extended
attribute btree incorrectly handled memory pointers, which could lead
to memory corruption.
- Improve metadata validation to eliminate crashing problems found while
fuzzing xfs.
- Move the error injection tag definitions into libxfs to be shared with
userspace components.
- Fix some log recovery bugs where we'd underflow log block position
vector and incorrectly fail log recovery.
- Drain the buffer lru after log recovery to force recovered buffers back
through the verifiers after mount. On a v4 filesystem the log never
attaches verifiers during log replay (v5 does), so we could end up with
buffers marked verified but without having ever been verified.
- Fix various other bugs.
- Introduce the first part of a new online fsck tool. The new fsck tool
will be able to iterate every piece of metadata in the filesystem to
look for obvious errors and corruptions. In the next release cycle
the checking will be extended to cross-reference with the other fs
metadata, so this feature should only be used by the developers in the
mean time.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJaBdbZAAoJEPh/dxk0SrTrKoUP/RroXfZX3PSn3Z0Qo99E6Ev9
+Z3CoJSSfXPJtSPBh6mUgonzzpKMqoN3kj8ZezYRLaeSEo+36ZkBtdLOb/8PydOZ
4agNvtGDhwt88+1vSAccbT6l4wB/Z16NfzGaVN4dioHF1LpC4rORqdEuoq5xXxzo
JVjuwTbz8uPSCpTTukzll9XFghvvj+YXm20MgEOCJiR5uULlGW5gZ38mNCmS76Bk
Nks5dNSmNzlGwIpwsVmthd0s0jwj8WeQPnUOv27naRm4J6GOvB5gE8vn15e07AHT
EqeTTHy25lnJhmpazphvDwbN3B6UdWCHGoG8ll2B+45pZegS7SKt4G6b4ittHq9x
+ErCHFElrNCO77QDQmQoXHy6+DJV/Rdnyb5K575rA91TAb0q2C7OP6vQt6oV0rDM
obZ7M3MvW9jBVn9A07Hdsk4+J2/SYW0jf5Dv4O69U1KuvZYUES2B++PL+u7pdTpy
JPg1+pWO+AgxRKQNviFFzRwQDPE3JSp854TCE/5D/59h2ZeSWg+g4ZH5jcLjKwKM
+uHbJgqOdgk2/WPHiEFCOouom3RUxdE1Yg7S87sbaQC4iU5oWWQ8Kenl2AUyNQEN
yaU/leq6rqX3Z2z+T70ujWSvh5xl07YHLW3LJszZMi4w+i8C7c0lIX9F8CNu26Cf
yJApOvMWhhY3Mf7Gn1l5
=vQrJ
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"xfs: great scads of new stuff for 4.15.
This merge cycle, we're making some substantive changes to XFS. The
in-core extent mappings have been refactored to use proper iterators
and a btree to handle heavily fragmented files without needing
high-order memory allocations; some important log recovery bug fixes;
and the first part of the online fsck functionality.
(The online fsck feature is disabled by default and more pieces of it
will be coming in future release cycles.)
This giant pile of patches has been run through a full xfstests run
over the weekend and through a quick xfstests run against this
morning's master, with no major failures reported.
New in this version:
- Refactor the incore extent map manipulations to use a cursor
instead of directly modifying extent data.
- Refactor the incore extent map cursor to use an in-memory btree
instead of a single high-order allocation. This eliminates a major
source of complaints about insufficient memory when opening a
heavily fragmented file into a system whose memory is also heavily
fragmented.
- Fix a longstanding bug where deleting a file with a complex
extended attribute btree incorrectly handled memory pointers, which
could lead to memory corruption.
- Improve metadata validation to eliminate crashing problems found
while fuzzing xfs.
- Move the error injection tag definitions into libxfs to be shared
with userspace components.
- Fix some log recovery bugs where we'd underflow log block position
vector and incorrectly fail log recovery.
- Drain the buffer lru after log recovery to force recovered buffers
back through the verifiers after mount. On a v4 filesystem the log
never attaches verifiers during log replay (v5 does), so we could
end up with buffers marked verified but without having ever been
verified.
- Fix various other bugs.
- Introduce the first part of a new online fsck tool. The new fsck
tool will be able to iterate every piece of metadata in the
filesystem to look for obvious errors and corruptions. In the next
release cycle the checking will be extended to cross-reference with
the other fs metadata, so this feature should only be used by the
developers in the mean time"
* tag 'xfs-4.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (131 commits)
xfs: on failed mount, force-reclaim inodes after unmounting quota controls
xfs: check the uniqueness of the AGFL entries
xfs: remove u_int* type usage
xfs: handle zero entries case in xfs_iext_rebalance_leaf
xfs: add comments documenting the rebalance algorithm
xfs: trivial indentation fixup for xfs_iext_remove_node
xfs: remove a superflous assignment in xfs_iext_remove_node
xfs: add some comments to xfs_iext_insert/xfs_iext_insert_node
xfs: fix number of records handling in xfs_iext_split_leaf
fs/xfs: Remove NULL check before kmem_cache_destroy
xfs: only check da node header padding on v5 filesystems
xfs: fix btree scrub deref check
xfs: fix uninitialized return values in scrub code
xfs: pass inode number to xfs_scrub_ino_set_{preen,warning}
xfs: refactor the directory data block bestfree checks
xfs: mark xlog_verify_dest_ptr STATIC
xfs: mark xlog_recover_check_summary STATIC
xfs: mark xfs_btree_check_lblock and xfs_btree_check_ptr static
xfs: remove unreachable error injection code in xfs_qm_dqget
xfs: remove unused debug counts for xfs_lock_inodes
...
two data corruption bugs involving DAX, as well as a corruption bug
after a crash during a racing fallocate and delayed allocation.
Finally, a number of cleanups and optimizations.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAloJCiEACgkQ8vlZVpUN
gaOahAgAhcgdPagn/B5w+6vKFdH+hOJLKyGI0adGDyWD9YBXN0wFQvliVgXrTKei
hxW2GdQGc6yHv9mOjvD+4Fn2AnTZk8F3GtG6zdqRM08JGF/IN2Jax2boczG/XnUz
rT9cd3ic2Ff0KaUX+Yos55QwomTh5CAeRPgvB69o9D6L4VJzTlsWKSOBR19FmrSG
NDmzZibgWmHcqzW9Bq8ZrXXx+KB42kUlc8tYYm2n6MTaE0LMvp3d9XcFcnm/I7Bk
MGa2d3/3FArGD6Rkl/E82MXMSElOHJnY6jGYSDaadUeMI5FXkA6tECOSJYXqShdb
ZJwkOBwfv2lbYZJxIBJTy/iA6zdsoQ==
=ZzaJ
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
- Add support for online resizing of file systems with bigalloc
- Fix a two data corruption bugs involving DAX, as well as a corruption
bug after a crash during a racing fallocate and delayed allocation.
- Finally, a number of cleanups and optimizations.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: improve smp scalability for inode generation
ext4: add support for online resizing with bigalloc
ext4: mention noload when recovering on read-only device
Documentation: fix little inconsistencies
ext4: convert timers to use timer_setup()
jbd2: convert timers to use timer_setup()
ext4: remove duplicate extended attributes defs
ext4: add ext4_should_use_dax()
ext4: add sanity check for encryption + DAX
ext4: prevent data corruption with journaling + DAX
ext4: prevent data corruption with inline data + DAX
ext4: fix interaction between i_size, fallocate, and delalloc after a crash
ext4: retry allocations conservatively
ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA
ext4: Add iomap support for inline data
iomap: Add IOMAP_F_DATA_INLINE flag
iomap: Switch from blkno to disk offset
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAloI8AUACgkQ8vlZVpUN
gaMdjgf8CCW7UhPjoZYwF8sUNtAaX9+JZT1maOcXUhpJ3vRQiRn+AzRH6yBYMm79
+NZBwVlk4dlEe55Wh4yFIStMAstqzCrke4C9CSbExjgHNsJdU4znyYuLRMbLfyO0
6c4NObiAIKJdW1/te1aN90keGC6min8pBZot+FqZsRr+Kq2+IOtM43JAv7efOLev
v3LCjUf9JKxatoB8tgw4AJRa1p18p7D2APWTG05VlFq63TjhVIYNvvwcQlizLwGY
cuEq3X59FbFdX06fJnucujU3WP3ES4/3rhufBK4NNaec5e5dbnH2KlAx7J5SyMIZ
0qUFB/dmXDSb3gsfScSGo1F71Ad0CA==
=asAm
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
"Lots of cleanups, mostly courtesy by Eric Biggers"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: lock mutex before checking for bounce page pool
fscrypt: add a documentation file for filesystem-level encryption
ext4: switch to fscrypt_prepare_setattr()
ext4: switch to fscrypt_prepare_lookup()
ext4: switch to fscrypt_prepare_rename()
ext4: switch to fscrypt_prepare_link()
ext4: switch to fscrypt_file_open()
fscrypt: new helper function - fscrypt_prepare_setattr()
fscrypt: new helper function - fscrypt_prepare_lookup()
fscrypt: new helper function - fscrypt_prepare_rename()
fscrypt: new helper function - fscrypt_prepare_link()
fscrypt: new helper function - fscrypt_file_open()
fscrypt: new helper function - fscrypt_require_key()
fscrypt: remove unneeded empty fscrypt_operations structs
fscrypt: remove ->is_encrypted()
fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED()
fs, fscrypt: add an S_ENCRYPTED inode flag
fscrypt: clean up include file mess
Pull crypto updates from Herbert Xu:
"Here is the crypto update for 4.15:
API:
- Disambiguate EBUSY when queueing crypto request by adding ENOSPC.
This change touches code outside the crypto API.
- Reset settings when empty string is written to rng_current.
Algorithms:
- Add OSCCA SM3 secure hash.
Drivers:
- Remove old mv_cesa driver (replaced by marvell/cesa).
- Enable rfc3686/ecb/cfb/ofb AES in crypto4xx.
- Add ccm/gcm AES in crypto4xx.
- Add support for BCM7278 in iproc-rng200.
- Add hash support on Exynos in s5p-sss.
- Fix fallback-induced error in vmx.
- Fix output IV in atmel-aes.
- Fix empty GCM hash in mediatek.
Others:
- Fix DoS potential in lib/mpi.
- Fix potential out-of-order issues with padata"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits)
lib/mpi: call cond_resched() from mpi_powm() loop
crypto: stm32/hash - Fix return issue on update
crypto: dh - Remove pointless checks for NULL 'p' and 'g'
crypto: qat - Clean up error handling in qat_dh_set_secret()
crypto: dh - Don't permit 'key' or 'g' size longer than 'p'
crypto: dh - Don't permit 'p' to be 0
crypto: dh - Fix double free of ctx->p
hwrng: iproc-rng200 - Add support for BCM7278
dt-bindings: rng: Document BCM7278 RNG200 compatible
crypto: chcr - Replace _manual_ swap with swap macro
crypto: marvell - Add a NULL entry at the end of mv_cesa_plat_id_table[]
hwrng: virtio - Virtio RNG devices need to be re-registered after suspend/resume
crypto: atmel - remove empty functions
crypto: ecdh - remove empty exit()
MAINTAINERS: update maintainer for qat
crypto: caam - remove unused param of ctx_map_to_sec4_sg()
crypto: caam - remove unneeded edesc zeroization
crypto: atmel-aes - Reset the controller before each use
crypto: atmel-aes - properly set IV after {en,de}crypt
hwrng: core - Reset user selected rng by writing "" to rng_current
...
Here is the big tty/serial driver pull request for 4.15-rc1.
Lots of serial driver updates in here, some small vt cleanups, and a
raft of SPDX and license boilerplate cleanups, messing up the diffstat a
bit.
Nothing major, with no realy functional changes except better hardware
support for some platforms.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWgnD+w8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynAmgCfSSr/9qiCE0vfP5eVYjddzxfWyZ4AoMbKORZC
5x2KVW0Btrbs3WmnD7ZU
=PSea
-----END PGP SIGNATURE-----
Merge tag 'tty-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial updates from Greg KH:
"Here is the big tty/serial driver pull request for 4.15-rc1.
Lots of serial driver updates in here, some small vt cleanups, and a
raft of SPDX and license boilerplate cleanups, messing up the diffstat
a bit.
Nothing major, with no realy functional changes except better hardware
support for some platforms.
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (110 commits)
tty: ehv_bytechan: fix spelling mistake
tty: serial: meson: allow baud-rates lower than 9600
serial: 8250_fintek: Fix crash with baud rate B0
serial: 8250_fintek: Disable delays for ports != 0
serial: 8250_fintek: Return -EINVAL on invalid configuration
tty: Remove redundant license text
tty: serdev: Remove redundant license text
tty: hvc: Remove redundant license text
tty: serial: Remove redundant license text
tty: add SPDX identifiers to all remaining files in drivers/tty/
tty: serial: jsm: remove redundant pointer ts
tty: serial: jsm: add space before the open parenthesis '('
tty: serial: jsm: fix coding style
tty: serial: jsm: delete space between function name and '('
tty: serial: jsm: add blank line after declarations
tty: serial: jsm: change the type of local variable
tty: serial: imx: remove dead code imx_dma_rxint
tty: serial: imx: disable ageing timer interrupt if dma in use
serial: 8250: fix potential deadlock in rs485-mode
serial: m32r_sio: Drop redundant .data assignment
...
We need to clear FI_NO_PREALLOC flag in error path of f2fs_file_write_iter,
otherwise we will lose the chance to preallocate blocks in latter write()
at one time.
Fixes: dc91de78e5 ("f2fs: do not preallocate blocks which has wrong buffer")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch splits memory allocation part in nat_entry to avoid lock contention.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>