Commit Graph

47453 Commits

Author SHA1 Message Date
Chao Yu 9434fcde1f f2fs: add missing f2fs_balance_fs in f2fs_zero_range
f2fs_balance_fs should be called in between node page updating, otherwise
node page count will exceeded far beyond watermark of triggering
foreground garbage collection, result in facing high risk of hitting LFS
allocation failure.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:52 -08:00
Chao Yu 933439c8f3 f2fs: give a chance to detach from dirty list
If there is no dirty pages in inode, we should give a chance to detach
the inode from global dirty list, otherwise it needs to call another
unnecessary .writepages for detaching.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:51 -08:00
Chao Yu 2dd15654ac f2fs: fix to release discard entries during checkpoint
In f2fs_fill_super, if there is any IO error occurs during recovery,
cached discard entries will be leaked, in order to avoid this, make
write_checkpoint() handle memory release by itself, besides, move
clear_prefree_segments to write_checkpoint for readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:50 -08:00
Chao Yu 2411cf5bef f2fs: exclude free nids building and allocation
During nid allocation, it needs to exclude building and allocating flow
of free nids, this is because while building free nid cache, there are two
steps: a) load free nids from unused nat entries in NAT pages, b) update
free nid cache by checking nat journal. The two steps should be atomical,
otherwise an used nid can be allocated as free one after a) and before b).

This patch adds missing lock which covers build_free_nids in
unlock_operation and f2fs_balance_fs_bg to avoid that.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:49 -08:00
Jaegeuk Kim e87f7329bb f2fs: fix overflow due to condition check order
In the last ilen case, i was already increased, resulting in accessing out-
of-boundary entry of do_replace and blkaddr.
Fix to check ilen first to exit the loop.

Fixes: 2aa8fbb9693020 ("f2fs: refactor __exchange_data_block for speed up")
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:48 -08:00
Jan Kara ba6379f7e6 fs: Provide function to get superblock with exclusive s_umount
Quota code will need a variant of get_super_thawed() that returns
superblock with s_umount held in exclusive mode to serialize quota on
and quota off operations. Provide this functionality.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-23 12:53:00 +01:00
Jan Kara 4f5a763c9a ext4: Add select for CONFIG_FS_IOMAP
When ext4 is compiled with DAX support, it now needs the iomap code. Add
appropriate select to Kconfig.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-22 23:21:58 -05:00
Arnd Bergmann d55b352b01 NFSv4.x: hide array-bounds warning
A correct bugfix introduced a harmless warning that shows up with gcc-7:

fs/nfs/callback.c: In function 'nfs_callback_up':
fs/nfs/callback.c:214:14: error: array subscript is outside array bounds [-Werror=array-bounds]

What happens here is that the 'minorversion == 0' check tells the
compiler that we assume minorversion can be something other than 0,
but when CONFIG_NFS_V4_1 is disabled that would be invalid and
result in an out-of-bounds access.

The added check for IS_ENABLED(CONFIG_NFS_V4_1) tells gcc that this
really can't happen, which makes the code slightly smaller and also
avoids the warning.

The bugfix that introduced the warning is marked for stable backports,
we want this one backported to the same releases.

Fixes: 98b0f80c23 ("NFSv4.x: Fix a refcount leak in nfs_callback_up_net")
Cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-22 16:11:44 -05:00
Eric W. Biederman f84df2a6f2 exec: Ensure mm->user_ns contains the execed files
When the user namespace support was merged the need to prevent
ptrace from revealing the contents of an unreadable executable
was overlooked.

Correct this oversight by ensuring that the executed file
or files are in mm->user_ns, by adjusting mm->user_ns.

Use the new function privileged_wrt_inode_uidgid to see if
the executable is a member of the user namespace, and as such
if having CAP_SYS_PTRACE in the user namespace should allow
tracing the executable.  If not update mm->user_ns to
the parent user namespace until an appropriate parent is found.

Cc: stable@vger.kernel.org
Reported-by: Jann Horn <jann@thejh.net>
Fixes: 9e4a36ece6 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 13:21:00 -06:00
David S. Miller f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Eric W. Biederman 64b875f7ac ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked.  This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.

Further PT_PTRACE_CAP does not have enough information for making good
security decisions as it does not report which user namespace the
capability is in.  This has already allowed one mistake through
insufficient granulariy.

I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.

This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid.  More fundamentaly
this change allows what is allowable at all times, by using the correct
information in it's decision.

Cc: stable@vger.kernel.org
Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 11:49:49 -06:00
Ming Lei 05aea81b4b fs: logfs: remove unnecesary check
The check on bio->bi_vcnt doesn't make sense in erase_end_io().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei c124843678 fs: logfs: use bio_add_page() in do_erase()
Also code gets simplified a bit.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei d4f98a89f9 fs: logfs: use bio_add_page() in __bdev_writeseg()
Also this patch simplify the code a bit.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 739a997546 fs: logfs: convert to bio_add_page() in sync_request()
Always bio_add_page() is the standard and preferred way to
do the task.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 3a83f46775 block: bio: pass bvec table to bio_init()
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:21 -07:00
Jens Axboe 9a794fb9bd block_dev: get rid of blksize bits calculation
We store the bits in the bdev sector size locally, but we don't use
the calculation anymore. All we do with it is shift it back up to
the bdev sector size. So let's just use that directly and kill the
variable and bits calculation.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:56:25 -07:00
Damien Le Moal 4d1a476542 block_dev: Fixed direct I/O bio sector calculation
A direct I/O alignment must be always checked against the device blocks size,
but the I/O offset (bio->bi_iter.bi_sector must always use 512B sector unit, and
not the actual logical block size.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:09:09 -07:00
Benjamin Coddington d75a6a0e39 NFSv4.1: Keep a reference on lock states while checking
While walking the list of lock_states, keep a reference on each
nfs4_lock_state to be checked, otherwise the lock state could be removed
while the check performs TEST_STATEID and possible FREE_STATEID.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-21 11:58:39 -05:00
Eric Biggers 2f8f5e76c7 ext4: avoid lockdep warning when inheriting encryption context
On a lockdep-enabled kernel, xfstests generic/027 fails due to a lockdep
warning when run on ext4 mounted with -o test_dummy_encryption:

    xfs_io/4594 is trying to acquire lock:
     (jbd2_handle
    ){++++.+}, at:
    [<ffffffff813096ef>] jbd2_log_wait_commit+0x5/0x11b

    but task is already holding lock:
     (jbd2_handle
    ){++++.+}, at:
    [<ffffffff813000de>] start_this_handle+0x354/0x3d8

The abbreviated call stack is:

 [<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
 [<ffffffff8130972a>] jbd2_log_wait_commit+0x40/0x11b
 [<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
 [<ffffffff8130987b>] ? __jbd2_journal_force_commit+0x76/0xa6
 [<ffffffff81309896>] __jbd2_journal_force_commit+0x91/0xa6
 [<ffffffff813098b9>] jbd2_journal_force_commit_nested+0xe/0x18
 [<ffffffff812a6049>] ext4_should_retry_alloc+0x72/0x79
 [<ffffffff812f0c1f>] ext4_xattr_set+0xef/0x11f
 [<ffffffff812cc35b>] ext4_set_context+0x3a/0x16b
 [<ffffffff81258123>] fscrypt_inherit_context+0xe3/0x103
 [<ffffffff812ab611>] __ext4_new_inode+0x12dc/0x153a
 [<ffffffff812bd371>] ext4_create+0xb7/0x161

When a file is created in an encrypted directory, ext4_set_context() is
called to set an encryption context on the new file.  This calls
ext4_xattr_set(), which contains a retry loop where the journal is
forced to commit if an ENOSPC error is encountered.

If the task actually were to wait for the journal to commit in this
case, then it would deadlock because a handle remains open from
__ext4_new_inode(), so the running transaction can't be committed yet.
Fortunately, __jbd2_journal_force_commit() avoids the deadlock by not
allowing the running transaction to be committed while the current task
has it open.  However, the above lockdep warning is still triggered.

This was a false positive which was introduced by: 1eaa566d368b: jbd2:
track more dependencies on transaction commit

Fix the problem by passing the handle through the 'fs_data' argument to
ext4_set_context(), then using ext4_xattr_set_handle() instead of
ext4_xattr_set().  And in the case where no journal handle is specified
and ext4_set_context() has to open one, add an ENOSPC retry loop since
in that case it is the outermost transaction.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2016-11-21 11:52:44 -05:00
Ross Zwisler d086630e19 ext4: remove unused function ext4_aligned_io()
The last user of ext4_aligned_io() was the DAX path in
ext4_direct_IO_write().  This usage was removed by Jan Kara's patch
entitled "ext4: Rip out DAX handling from direct IO path".

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-21 11:51:44 -05:00
Jan Kara dd936e4313 dax: rip out get_block based IO support
No one uses functions using the get_block callback anymore. Rip them
out and update documentation.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 20:48:36 -05:00
Jan Kara 00697eed38 ext2: use iomap_zero_range() for zeroing truncated page in DAX path
Currently the last user of ext2_get_blocks() for DAX inodes was
dax_truncate_page(). Convert that to iomap_zero_range() so that all DAX
IO uses the iomap path.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 20:47:07 -05:00
Jan Kara 0bd2d5ec3d ext4: rip out DAX handling from direct IO path
Reads and writes for DAX inodes should no longer end up in direct IO
code. Rip out the support and add a warning.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:53:30 -05:00
Jan Kara e2ae766c1b ext4: convert DAX faults to iomap infrastructure
Convert DAX faults to use iomap infrastructure. We would not have to start
transaction in ext4_dax_fault() anymore since ext4_iomap_begin takes
care of that but so far we do that to avoid lock inversion of
transaction start with DAX entry lock which gets acquired in
dax_iomap_fault() before calling ->iomap_begin handler.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:51:24 -05:00
Jan Kara 96f8ba3dd6 ext4: avoid split extents for DAX writes
Currently mapping of blocks for DAX writes happen with
EXT4_GET_BLOCKS_PRE_IO flag set. That has a result that each
ext4_map_blocks() call creates a separate written extent, although it
could be merged to the neighboring extents in the extent tree.  The
reason for using this flag is that in case the extent is unwritten, we
need to convert it to written one and zero it out. However this "convert
mapped range to written" operation is already implemented by
ext4_map_blocks() for the case of data writes into unwritten extent. So
just use flags for that mode of operation, simplify the code, and avoid
unnecessary split extents.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:10:09 -05:00
Jan Kara 776722e85d ext4: DAX iomap write support
Implement DAX writes using the new iomap infrastructure instead of
overloading the direct IO path.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:09:11 -05:00
Jan Kara 47e6935136 ext4: use iomap for zeroing blocks in DAX mode
Use iomap infrastructure for zeroing blocks when in DAX mode.
ext4_iomap_begin() handles read requests just fine and that's all that
is needed for iomap_zero_range().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:08:05 -05:00
Jan Kara 364443cbcf ext4: convert DAX reads to iomap infrastructure
Implement basic iomap_begin function that handles reading and use it for
DAX reads.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:36:06 -05:00
Jan Kara a3caa24b70 ext4: only set S_DAX if DAX is really supported
Currently we have S_DAX set inode->i_flags for a regular file whenever
ext4 is mounted with dax mount option. However in some cases we cannot
really do DAX - e.g. when inode is marked to use data journalling, when
inode data is being encrypted, or when inode is stored inline. Make sure
S_DAX flag is appropriately set/cleared in these cases.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:32:59 -05:00
Jan Kara 213bcd9ccb ext4: factor out checks from ext4_file_write_iter()
Factor out checks of 'from' and whether we are overwriting out of
ext4_file_write_iter() so that the function is easier to follow.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:29:51 -05:00
Linus Torvalds d117b9acae A security fix (so a maliciously corrupted file system image won't
panic the kernel) and some fixes for CONFIG_VMAP_STACK.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlgxCMoACgkQ8vlZVpUN
 gaOX3Af/QOphB5pKrKijhDK9H40nKS6lHtL7klJpvRafUMtVxBDOP3dsRISyGMdF
 w+gQQQv+eFEPefwGcYzdO4PN7FFVirAF9RS/NTFSIB/c8V6FfHzn/DeiftU7CLRW
 ljTP7y8M9eo35TsU8s9D7wfbyfY55MEANiAP8vnpx4JKDb86I/8Eaa6YS91v17vp
 /7TKSUt7PE6UUp7mgTRCX8vK9SxJJ8Xvg2hSzulfrO1DdsfW61RQYXwif+biR85T
 uxFPnV0yvji2EU4cpeIekPqJKUb9Av0aIbSwg19QqcAE0xqxvtSRBKlYnF2IRTuv
 OXoaC30d4UcQrNCkxPDAdH/0BMdcNQ==
 =y+5G
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A security fix (so a maliciously corrupted file system image won't
  panic the kernel) and some fixes for CONFIG_VMAP_STACK"

* tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: sanity check the block and cluster size at mount time
  fscrypto: don't use on-stack buffer for key derivation
  fscrypto: don't use on-stack buffer for filename encryption
2016-11-19 18:33:50 -08:00
Theodore Ts'o 8cdf3372fe ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount.  This
is important for security reasons (although we shouldn't be just
depending on this check).

Ref: http://www.securityfocus.com/archive/1/539661
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-19 20:58:15 -05:00
Eric Biggers 0f0909e242 fscrypto: don't use on-stack buffer for key derivation
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page.  get_crypt_info() was using a stack buffer to hold the
output from the encryption operation used to derive the per-file key.
Fix it by using a heap buffer.

This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-19 20:56:13 -05:00
Eric Biggers 3c7018ebf8 fscrypto: don't use on-stack buffer for filename encryption
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page.  For short filenames, fname_encrypt() was encrypting a
stack buffer holding the padded filename.  Fix it by encrypting the
filename in-place in the output buffer, thereby making the temporary
buffer unnecessary.

This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-19 20:56:06 -05:00
Filipe Manana 2a2a83de54 Btrfs: remove rb_node field from the delayed ref node structure
After the last big change in the delayed references code that was needed
for the last qgroups rework, the red black tree node field of struct
btrfs_delayed_ref_node is no longer used, so just remove it, this helps
us save some memory (since struct rb_node is 24 bytes on x86_64) for
these structures.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-11-19 13:39:18 +00:00
Filipe Manana 001895b313 Btrfs: remove unused code when creating and merging reloc trees
In commit 5bc7247ac4 (Btrfs: fix broken nocow after balance) we started
abusing the rtransid and otransid fields of root items from relocation
trees to fix some issues with nodatacow mode. However later in commit
ba8b028933 (Btrfs: do not reset last_snapshot after relocation) we
dropped the code that made use of those fields but did not remove
the code that sets those fields.

So just remove them to avoid confusion.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-11-19 13:39:18 +00:00
Filipe Manana 054570a1dc Btrfs: fix relocation incorrectly dropping data references
During relocation of a data block group we create a relocation tree
for each fs/subvol tree by making a snapshot of each tree using
btrfs_copy_root() and the tree's commit root, and then setting the last
snapshot field for the fs/subvol tree's root to the value of the current
transaction id minus 1. However this can lead to relocation later
dropping references that it did not create if we have qgroups enabled,
leaving the filesystem in an inconsistent state that keeps aborting
transactions.

Lets consider the following example to explain the problem, which requires
qgroups to be enabled.

We are relocating data block group Y, we have a subvolume with id 258 that
has a root at level 1, that subvolume is used to store directory entries
for snapshots and we are currently at transaction 3404.

When committing transaction 3404, we have a pending snapshot and therefore
we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot()
in order to create its dentry at subvolume 258. This results in COWing
leaf A from root 258 in order to add the dentry. Note that leaf A
also contains file extent items referring to extents from some other
block group X (we are currently relocating block group Y). Later on, still
at create_pending_snapshot() we call qgroup_account_snapshot(), which
switches the commit root for root 258 when it calls switch_commit_roots(),
so now the COWed version of leaf A, lets call it leaf A', is accessible
from the commit root of tree 258. At the end of qgroup_account_snapshot(),
we call record_root_in_trans() with 258 as its argument, which results
in btrfs_init_reloc_root() being called, which in turn calls
relocation.c:create_reloc_root() in order to create a relocation tree
associated to root 258, which results in assigning the value of 3403
(which is the current transaction id minus 1 = 3404 - 1) to the
last_snapshot field of root 258. When creating the relocation tree root
at ctree.c:btrfs_copy_root() we add a shared reference for leaf A',
corresponding to the relocation tree's root, when we call btrfs_inc_ref()
against the COWed root (a copy of the commit root from tree 258), which
is at level 1. So at this point leaf A' has 2 references, one normal
reference corresponding to root 258 and one shared reference corresponding
to the root of the relocation tree.

Transaction 3404 finishes its commit and transaction 3405 is started by
relocation when calling merge_reloc_root() for the relocation tree
associated to root 258. In the meanwhile leaf A' is COWed again, in
response to some filesystem operation, when we are still at transaction
3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we
call btrfs_block_can_be_shared() in order to figure out if other trees
refer to the leaf and if any such trees exists, add a full back reference
to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false
because the following condition is false:

  btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item)

which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with
only one reference, corresponding to the shared reference we created when
we called btrfs_copy_root() to create the relocation tree's root and
btrfs_inc_ref() ends up not being called for leaf A' nor we end up setting
the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not
adding shared references for the extents from block group X that leaf A'
refers to with its file extent items.

Later, after merging the relocation root we do a call to to
btrfs_drop_snapshot() in order to delete the relocation tree. This ends
up calling do_walk_down() when path->slots[1] points to leaf A', which
results in calling btrfs_lookup_extent_info() to get the number of
references for leaf A', which is 1 at this time (only the shared reference
exists) and this value is stored at wc->refs[0]. After this walk_up_proc()
is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'.
Because the current level is 0 and wc->refs[0] is 1, it does call
btrfs_dec_ref() against leaf A', which results in removing the single
references that the extents from block group X have which are associated
to root 258 - the expectation was to have each of these extents with 2
references - one reference for root 258 and one shared reference related
to the root of the relocation tree, and so we would drop only the shared
reference (because leaf A' was supposed to have the flag
BTRFS_BLOCK_FLAG_FULL_BACKREF set).

This leaves the filesystem in an inconsistent state as we now have file
extent items in a subvolume tree that point to extents from block group X
without references in the extent tree. So later on when we try to decrement
the references for these extents, for example due to a file unlink operation,
truncate operation or overwriting ranges of a file, we fail because the
expected references do not exist in the extent tree.

This leads to warnings and transaction aborts like the following:

[  588.965795] ------------[ cut here ]------------
[  588.965815] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
[  588.965816] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
[  588.965831] CPU: 2 PID: 2479 Comm: kworker/u8:7 Not tainted 4.7.3-3-default-fdm+ #1
[  588.965832] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  588.965844] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[  588.965845]  0000000000000000 ffff8802263bfa28 ffffffff813af542 0000000000000000
[  588.965847]  0000000000000000 ffff8802263bfa68 ffffffff81081e8b 0000065900000000
[  588.965848]  ffff8801db2af000 000000012bbe2000 0000000000000000 ffff880215703b48
[  588.965849] Call Trace:
[  588.965852]  [<ffffffff813af542>] dump_stack+0x63/0x81
[  588.965854]  [<ffffffff81081e8b>] __warn+0xcb/0xf0
[  588.965855]  [<ffffffff81081f7d>] warn_slowpath_null+0x1d/0x20
[  588.965863]  [<ffffffffa0175042>] lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
[  588.965865]  [<ffffffff81143220>] ? trace_clock_local+0x10/0x30
[  588.965867]  [<ffffffff8114c5df>] ? rb_reserve_next_event+0x6f/0x460
[  588.965875]  [<ffffffffa0175215>] insert_inline_extent_backref+0x55/0xd0 [btrfs]
[  588.965882]  [<ffffffffa017531f>] __btrfs_inc_extent_ref.isra.55+0x8f/0x240 [btrfs]
[  588.965890]  [<ffffffffa017acea>] __btrfs_run_delayed_refs+0x74a/0x1260 [btrfs]
[  588.965892]  [<ffffffff810cb046>] ? cpuacct_charge+0x86/0xa0
[  588.965900]  [<ffffffffa017e74f>] btrfs_run_delayed_refs+0x9f/0x2c0 [btrfs]
[  588.965908]  [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs]
[  588.965918]  [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
[  588.965928]  [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[  588.965930]  [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0
[  588.965931]  [<ffffffff8109b658>] worker_thread+0x48/0x4e0
[  588.965932]  [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0
[  588.965934]  [<ffffffff810a1659>] kthread+0xc9/0xe0
[  588.965936]  [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40
[  588.965937]  [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170
[  588.965938] ---[ end trace 34e5232c933a1749 ]---
[  588.966187] ------------[ cut here ]------------
[  588.966196] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:2966 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[  588.966196] BTRFS: Transaction aborted (error -5)
[  588.966197] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
[  588.966206] CPU: 2 PID: 2479 Comm: kworker/u8:7 Tainted: G        W       4.7.3-3-default-fdm+ #1
[  588.966207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  588.966217] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[  588.966217]  0000000000000000 ffff8802263bfc98 ffffffff813af542 ffff8802263bfce8
[  588.966219]  0000000000000000 ffff8802263bfcd8 ffffffff81081e8b 00000b96345ee000
[  588.966220]  ffffffffa021ae1c ffff880215703b48 00000000000005fe ffff8802345ee000
[  588.966221] Call Trace:
[  588.966223]  [<ffffffff813af542>] dump_stack+0x63/0x81
[  588.966224]  [<ffffffff81081e8b>] __warn+0xcb/0xf0
[  588.966225]  [<ffffffff81081eff>] warn_slowpath_fmt+0x4f/0x60
[  588.966233]  [<ffffffffa017e93c>] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[  588.966241]  [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs]
[  588.966250]  [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
[  588.966259]  [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[  588.966260]  [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0
[  588.966261]  [<ffffffff8109b658>] worker_thread+0x48/0x4e0
[  588.966263]  [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0
[  588.966264]  [<ffffffff810a1659>] kthread+0xc9/0xe0
[  588.966265]  [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40
[  588.966267]  [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170
[  588.966268] ---[ end trace 34e5232c933a174a ]---
[  588.966269] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2966: errno=-5 IO failure
[  588.966270] BTRFS info (device sda2): forced readonly

This was happening often on openSUSE and SLE systems using btrfs as the
root filesystem (with its default layout where multiple subvolumes are
used) where balance happens in the background triggered by a cron job and
snapshots are automatically created before/after package installations,
upgrades and removals. The issue could be triggered simply by running the
following loop on the first system boot post installation:

  while true; do
     zypper -n in nfs-kernel-server
     zypper -n rm nfs-kernel-server
  done

(If we were fast enough and made that loop before the cron job triggered
a balance operation and the balance finished)

So fix by setting the last_snapshot field of the root to the value of the
generation of its commit root. Like this btrfs_block_can_be_shared()
behaves correctly for the case where the relocation root is created during
a transaction commit and for the case where it's created before a
transaction commit.

Fixes: 6426c7ad69 (btrfs: qgroup: Fix qgroup accounting when creating snapshot)
Cc: stable@vger.kernel.org  # 4.7+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-11-19 13:39:17 +00:00
Jonathan Corbet 917fef6f7e Linux 4.9-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYHmoCAAoJEHm+PkMAQRiG7RMIAI2i7Y5hpL5yCxK5AFaL4u/G
 KxXfp1B1UanUTgjOmd7zGqtDYcFX9t7GTTUFixQ7/9Opr4PD9qbnatoDGSc3xjbT
 msDgA1B78F1/Q3kHWfeGq32MihQ4mj5NwUCo+igUcUvvWG7mHgzErj/Nh5RoobQX
 p/izdpTbrw3GX6xXB8olbG7XWHaVye/+TT3q6+gmgm8I/QEujcLeGoycE0zlhPN8
 FG/JX76At/+ZM2Py7Oxo3k+oKL9CHrtOQYDp/wN0uslV5eYvvkZz0/M1HMOGZt+c
 gZU5jzM17K7C4Nzo06WAuBU9wUBGc25m+cPicLlOmljnzfU+f50SKaDjZq3p7QI=
 =2KUF
 -----END PGP SIGNATURE-----

Merge tag 'v4.9-rc4' into sound

Bring in -rc4 patches so I can successfully merge the sound doc changes.
2016-11-18 16:13:41 -07:00
Benjamin Coddington d41cbfc9a6 NFSv4.1: Handle NFS4ERR_OLD_STATEID in nfs4_reclaim_open_state
Now that we're doing TEST_STATEID in nfs4_reclaim_open_state(), we can have
a NFS4ERR_OLD_STATEID returned from nfs41_open_expired() .  Instead of
marking state recovery as failed, mark the state for recovery again.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-18 14:27:27 -05:00
Trond Myklebust 5cc7861eb5 NFSv4: Don't call close if the open stateid has already been cleared
Ensure we test to see if the open stateid is actually set, before we
send a CLOSE.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-18 14:18:02 -05:00
Theodore Ts'o c48ae41baf ext4: add sanity checking to count_overhead()
The commit "ext4: sanity check the block and cluster size at mount
time" should prevent any problems, but in case the superblock is
modified while the file system is mounted, add an extra safety check
to make sure we won't overrun the allocated buffer.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:37:47 -05:00
Trond Myklebust 3e7dfb1659 NFSv4: Fix CLOSE races with OPEN
If the reply to a successful CLOSE call races with an OPEN to the same
file, we can end up scribbling over the stateid that represents the
new open state.
The race looks like:

  Client				Server
  ======				======

  CLOSE stateid A on file "foo"
					CLOSE stateid A, return stateid C
  OPEN file "foo"
					OPEN "foo", return stateid B
  Receive reply to OPEN
  Reset open state for "foo"
  Associate stateid B to "foo"

  Receive CLOSE for A
  Reset open state for "foo"
  Replace stateid B with C

The fix is to examine the argument of the CLOSE, and check for a match
with the current stateid "other" field. If the two do not match, then
the above race occurred, and we should just ignore the CLOSE.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-18 13:35:58 -05:00
Trond Myklebust 23ea44c215 NFSv4.1: Fix a regression in DELEGRETURN
We don't want to call nfs4_free_revoked_stateid() in the case where
the delegreturn was successful.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-18 13:35:54 -05:00
Theodore Ts'o cd6bb35bf7 ext4: use more strict checks for inodes_per_block on mount
Centralize the checks for inodes_per_block and be more strict to make
sure the inodes_per_block_group can't end up being zero.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2016-11-18 13:28:30 -05:00
Theodore Ts'o 5aee0f8a3f ext4: fix in-superblock mount options processing
Fix a large number of problems with how we handle mount options in the
superblock.  For one, if the string in the superblock is long enough
that it is not null terminated, we could run off the end of the string
and try to interpret superblocks fields as characters.  It's unlikely
this will cause a security problem, but it could result in an invalid
parse.  Also, parse_options is destructive to the string, so in some
cases if there is a comma-separated string, it would be modified in
the superblock.  (Fortunately it only happens on file systems with a
1k block size.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:24:26 -05:00
Theodore Ts'o 9e47a4c9fc ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount.  This
is important for security reasons (although we shouldn't be just
depending on this check).

Ref: http://www.securityfocus.com/archive/1/539661
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:00:24 -05:00
Alexey Dobriyan c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
Linus Torvalds bec1b089ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of regression fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix iov_iter_advance() for ITER_PIPE
  xattr: Fix setting security xattrs on sockfs
2016-11-17 13:49:30 -08:00
Linus Torvalds d46bc34da9 orangefs: add .owner to debugfs file_operations
Without ".owner = THIS_MODULE" it is possible to crash the kernel
 by unloading the Orangefs module while someone is reading debugfs
 files.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYLfvEAAoJEM9EDqnrzg2+sHEQAJo4jn/sAQvO04ujaMrViLmy
 5+V93F7jwGFeLvAwjMvPAeBb+UmlgqjVi0VT85RzEe6eNOKN9qlj9ZNDutOfnbhr
 H6qu8AQsbO0znSTQuJA1M2Hca9h66EnN0pT8xW4wat1cCdAf6X6HcFcr1lZIRKZd
 E17EygXi+IW0c0evIq4UBsD0DfTZgtC4ONrR9N7+zprlg2PVX35So6Lr0ODceJQs
 StWHrZW9hDZ6KR8WocupuHPR8brOe+P5PU14fPzR1+EH7BsTf8uxWK7CfTE5ov0C
 UNkNeh81BOkwIQDFoPCJ5asaipdi5RRNTIQekhhQ2GnaaCdmCKln8OLjqDZZOmDj
 KRGB4mdPcCb3XlvMH3SaXNmyhmjt2cTS0/TQPexrTqjSNmbXmnzJOCguweoTIJ5w
 CgEnsrNp8GwlZo12Z8JkFGxC39ifjH4F+KFetU+eUNjw9Tce+zHwgEvsAMqDhWw8
 FJQWy+snG7m8ooytRObWPepchnd2XHkrJv4yu8uw3GirM+YTlxvuWnB54hVH17FQ
 0vKYhdAXBUmeeyyNKApBSGQezPWD9hfAY5Di7JGJlaTiai3pVxgXd8YY4DGXHj3t
 ebPpxEnlWrRLC5Cazd0yC9CoR8azQp9zvRgfPuPEM4wJSjUFVfmasmFg7s99h3Zq
 vnTqfV/uQwLm9f+3CfNB
 =s21f
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.9-rc5-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs fix from Mike Marshall:
 "orangefs: add .owner to debugfs file_operations

  Without ".owner = THIS_MODULE" it is possible to crash the kernel by
  unloading the Orangefs module while someone is reading debugfs files"

* tag 'for-linus-4.9-rc5-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: add .owner to debugfs file_operations
2016-11-17 13:45:57 -08:00
Christoph Hellwig 542ff7bf18 block: new direct I/O implementation
Similar to the simple fast path, but we now need a dio structure to
track multiple-bio completions.  It's basically a cut-down version
of the new iomap-based direct I/O code for filesystems, but without
all the logic to call into the filesystem for extent lookup or
allocation, and without the complex I/O completion workqueue handler
for AIO - instead we just use the FUA bit on the bios to ensure
data is flushed to stable storage.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:11 -07:00
Jens Axboe 78250c02d9 block: make __blkdev_direct_IO_sync() support O_SYNC/DSYNC
Split the op setting code into a helper, use it in both places.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:05 -07:00
Jens Axboe 72ecad22d9 block: support a full bio worth of IO for simplified bdev direct-io
Just alloc the bio_vec array if we exceed the inline limit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:02 -07:00
Christoph Hellwig 189ce2b9dc block: fast-path for small and simple direct I/O requests
This patch adds a small and simple fast patch for small direct I/O
requests on block devices that don't use AIO.  Between the neat
bio_iov_iter_get_pages helper that avoids allocating a page array
for get_user_pages and the on-stack bio and biovec this avoid memory
allocations and atomic operations entirely in the direct I/O code
(lower levels might still do memory allocations and will usually
have at least some atomic operations, though).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:45 -07:00
Seth Forshee f97df70b1c xenfs: Use proc_create_mount_point() to create /proc/xen
Mounting proc in user namespace containers fails if the xenbus
filesystem is mounted on /proc/xen because this directory fails
the "permanently empty" test. proc_create_mount_point() exists
specifically to create such mountpoints in proc but is currently
proc-internal. Export this interface to modules, then use it in
xenbus when creating /proc/xen.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2016-11-17 13:52:18 +01:00
Andreas Gruenbacher 4a59015372 xattr: Fix setting security xattrs on sockfs
The IOP_XATTR flag is set on sockfs because sockfs supports getting the
"system.sockprotoname" xattr.  Since commit 6c6ef9f2, this flag is checked for
setxattr support as well.  This is wrong on sockfs because security xattr
support there is supposed to be provided by security_inode_setsecurity.  The
smack security module relies on socket labels (xattrs).

Fix this by adding a security xattr handler on sockfs that returns
-EAGAIN, and by checking for -EAGAIN in setxattr.

We cannot simply check for -EOPNOTSUPP in setxattr because there are
filesystems that neither have direct security xattr support nor support
via security_inode_setsecurity.  A more proper fix might be to move the
call to security_inode_setsecurity into sockfs, but it's not clear to me
if that is safe: we would end up calling security_inode_post_setxattr after
that as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-17 00:00:23 -05:00
Linus Torvalds 984573abf8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "A regression fix and bug fix bound for stable"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix fuse_write_end() if zero bytes were copied
  fuse: fix root dentry initialization
2016-11-16 09:20:10 -08:00
Mike Marshall 19ff7fcc76 orangefs: add .owner to debugfs file_operations
Without ".owner = THIS_MODULE" it is possible to crash the kernel
by unloading the Orangefs module while someone is reading debugfs
files.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-11-16 11:52:19 -05:00
Nicolas Pitre baa73d9e47 posix-timers: Make them configurable
Some embedded systems have no use for them.  This removes about
25KB from the kernel binary size when configured out.

Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime: timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.

The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 09:26:35 +01:00
Kees Cook fc46d4e453 ramoops: add pdata NULL check to ramoops_probe
This adds a check for a NULL platform data, which should only be possible
if a driver incorrectly sets up a probe request without also having defined
the platform_data structure. This is based on a patch from Geliang Tang.

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:32 -08:00
Namhyung Kim 70ad35db33 pstore: Convert console write to use ->write_buf
Maybe I'm missing something, but I don't know why it needs to copy the
input buffer to psinfo->buf and then write.  Instead we can write the
input buffer directly.  The only implementation that supports console
message (i.e. ramoops) already does it for ftrace messages.

For the upcoming virtio backend driver, it needs to protect psinfo->buf
overwritten from console messages.  If it could use ->write_buf method
instead of ->write, the problem will be solved easily.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:32 -08:00
Namhyung Kim e9e360b08a pstore: Protect unlink with read_mutex
When update_ms is set, pstore_get_records() will be called when there's
a new entry.  But unlink can be called at the same time and might
contend with the open-read-close loop.  Depending on the implementation
of platform driver, it may be safe or not.  But I think it'd be better
to protect those race in the first place.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:31 -08:00
Joel Fernandes 7a0032f504 pstore: Use global ftrace filters for function trace filtering
Currently, pstore doesn't have any filters setup for function tracing.
This has the associated overhead and may not be useful for users looking
for tracing specific set of functions.

ftrace's regular function trace filtering is done writing to
tracing/set_ftrace_filter however this is not available if not requested.
In order to be able to use this feature, the support to request global
filtering introduced earlier in the series should be requested before
registering the ftrace ops. Here we do the same.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:30 -08:00
Kees Cook a5d23b956c pstore: Clarify context field przs as dprzs
Since "przs" (persistent ram zones) is a general name in the code now, so
rename the Oops-dump zones to dprzs from przs.

Based on a patch from Nobuhiro Iwamatsu.

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:29 -08:00
Kees Cook c443a5f3f1 pstore: improve error report for failed setup
When setting ramoops record sizes, sometimes it's not clear which
parameters contributed to the allocation failure. This adds a per-zone
name and expands the failure reports.

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:28 -08:00
Joel Fernandes 2fbea82bbb pstore: Merge per-CPU ftrace records into one
Up until this patch, each of the per CPU ftrace buffers appear as a
separate ftrace-ramoops-N file. In this patch we merge all the zones into
one and populate a single ftrace-ramoops-0 file.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: clarified variables names, added -ENOMEM handling]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:28 -08:00
Joel Fernandes fbccdeb8d7 pstore: Add ftrace timestamp counter
In preparation for merging the per CPU buffers into one buffer when
we retrieve the pstore ftrace data, we store the timestamp as a
counter in the ftrace pstore record.  We store the CPU number as well
if !PSTORE_CPU_IN_IP, in this case we shift the counter and may lose
ordering there but we preserve the same record size. The timestamp counter
is also racy, and not doing any locking or synchronization here results
in the benefit of lower overhead. Since we don't care much here for exact
ordering of function traces across CPUs, we don't synchronize and may lose
some counter updates but I'm ok with that.

Using trace_clock() results in much lower performance so avoid using it
since we don't want accuracy in timestamp and need a rough ordering to
perform merge.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: updated commit message, added comments]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:27 -08:00
Joel Fernandes a1cf53ac6d ramoops: Split ftrace buffer space into per-CPU zones
If the RAMOOPS_FLAG_FTRACE_PER_CPU flag is passed to ramoops pdata, split
the ftrace space into multiple zones depending on the number of CPUs.

This speeds up the performance of function tracing by about 280% in my
tests as we avoid the locking. The trade off being lesser space available
per CPU. Let the ramoops user decide which option they want based on pdata
flag.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: added max_ftrace_cnt to track size, added DT logic and docs]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:26 -08:00
Kees Cook de83209249 pstore: Make ramoops_init_przs generic for other prz arrays
Currently ramoops_init_przs() is hard wired only for panic dump zone
array. In preparation for the ftrace zone array (one zone per-cpu) and pmsg
zone array, make the function more generic to be able to handle this case.

Heavily based on similar work from Joel Fernandes.

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:25 -08:00
Joel Fernandes 663deb4788 pstore: Allow prz to control need for locking
In preparation of not locking at all for certain buffers depending on if
there's contention, make locking optional depending on the initialization
of the prz.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: moved locking flag into prz instead of via caller arguments]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:25 -08:00
David S. Miller bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
Miklos Szeredi 59c3b76cc6 fuse: fix fuse_write_end() if zero bytes were copied
If pos is at the beginning of a page and copied is zero then page is not
zeroed but is marked uptodate.

Fix by skipping everything except unlock/put of page if zero bytes were
copied.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 6b12c1b37e ("fuse: Implement write_begin/write_end callbacks")
Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-11-15 12:34:21 +01:00
Eric Whitney d5c8dab6a8 ext4: remove parameter from ext4_xattr_ibody_set()
The parameter "handle" isn't used.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-14 21:56:48 -05:00
Eric Whitney 88e0387769 ext4: allow inode expansion for nojournal file systems
Runs of xfstest ext4/022 on nojournal file systems result in failures
because the inodes of some of its test files do not expand as expected.
The cause is a conditional in ext4_mark_inode_dirty() that prevents inode
expansion unless the test file system has a journal.  Remove this
unnecessary restriction.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-14 21:48:35 -05:00
Deepa Dinamani eeca7ea1ba ext4: use current_time() for inode timestamps
CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.

current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().

Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2016-11-14 21:40:10 -05:00
Chandan Rajendra 30a9d7afe7 ext4: fix stack memory corruption with 64k block size
The number of 'counters' elements needed in 'struct sg' is
super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
elements in the array. This is insufficient for block sizes >= 32k. In
such cases the memcpy operation performed in ext4_mb_seq_groups_show()
would cause stack memory corruption.

Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2016-11-14 21:26:26 -05:00
Chandan Rajendra 69e43e8cc9 ext4: fix mballoc breakage with 64k block size
'border' variable is set to a value of 2 times the block size of the
underlying filesystem. With 64k block size, the resulting value won't
fit into a 16-bit variable. Hence this commit changes the data type of
'border' to 'unsigned int'.

Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2016-11-14 21:04:37 -05:00
Andreas Gruenbacher db978da8fa proc: Pass file mode to proc_pid_make_inode
Pass the file mode of the proc inode to be created to
proc_pid_make_inode.  In proc_pid_make_inode, initialize inode->i_mode
before calling security_task_to_inode.  This allows selinux to set
isec->sclass right away without introducing "half-initialized" inode
security structs.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-14 15:39:48 -05:00
Julia Lawall 7ba630f54c nfsd: constify reply_cache_stats_operations structure
reply_cache_stats_operations, of type struct file_operations, is never
modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-14 15:24:19 -05:00
J. Bruce Fields 8838203667 nfsd: update workqueue creation
No real change in functionality, but the old interface seems to be
deprecated.

We don't actually care about ordering necessarily, but we do depend on
running at most one work item at a time: nfsd4_process_cb_update()
assumes that no other thread is running it, and that no new callbacks
are starting while it's running.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-14 11:13:43 -05:00
Theodore Ts'o 1566a48aaa ext4: don't lock buffer in ext4_commit_super if holding spinlock
If there is an error reported in mballoc via ext4_grp_locked_error(),
the code is holding a spinlock, so ext4_commit_super() must not try to
lock the buffer head, or else it will trigger a BUG:

  BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358
  in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount
  CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
   ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166
   ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153
   ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0
  Call Trace:
    [<ffffffff81318c89>] dump_stack+0x67/0x9e
    [<ffffffff810810b0>] ___might_sleep+0xf0/0x140
    [<ffffffff81081153>] __might_sleep+0x53/0xb0
    [<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290
    [<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230
    [<ffffffff81081153>] ? __might_sleep+0x53/0xb0
    [<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320

Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
(and it is the only caller which does so), avoid locking and unlocking
the buffer in this case.

This can result in races with ext4_commit_super() if there are other
problems (which is what commit 4743f83990 was trying to address),
but a Warning is better than BUG.

Fixes: 4743f83990
Cc: stable@vger.kernel.org # 4.9
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:29 -05:00
Theodore Ts'o d0abb36db4 ext4: allow ext4_ext_truncate() to return an error
Return errors to the caller instead of declaring the file system
corrupted.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:28 -05:00
Theodore Ts'o 2c98eb5ea2 ext4: allow ext4_truncate() to return an error
This allows us to properly propagate errors back up to
ext4_truncate()'s callers.  This also means we no longer have to
silently ignore some errors (e.g., when trying to add the inode to the
orphan inode list).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:26 -05:00
Theodore Ts'o 6da22013bb Merge branch 'fscrypt' into origin 2016-11-13 22:02:22 -05:00
Theodore Ts'o a2f6d9c4c0 Merge branch 'dax-4.10-iomap-pmd' into origin 2016-11-13 22:02:15 -05:00
Eric Biggers a6e0891286 fscrypto: don't use on-stack buffer for key derivation
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page.  get_crypt_info() was using a stack buffer to hold the
output from the encryption operation used to derive the per-file key.
Fix it by using a heap buffer.

This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 21:56:25 -05:00
Eric Biggers 08ae877f4e fscrypto: don't use on-stack buffer for filename encryption
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page.  For short filenames, fname_encrypt() was encrypting a
stack buffer holding the padded filename.  Fix it by encrypting the
filename in-place in the output buffer, thereby making the temporary
buffer unnecessary.

This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 21:56:19 -05:00
David Gstir 9c4bb8a3a9 fscrypt: Let fs select encryption index/tweak
Avoid re-use of page index as tweak for AES-XTS when multiple parts of
same page are encrypted. This will happen on multiple (partial) calls of
fscrypt_encrypt_page on same page.
page->index is only valid for writeback pages.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 20:18:16 -05:00
David Gstir 0b93e1b94b fscrypt: Constify struct inode pointer
Some filesystems, such as UBIFS, maintain a const pointer for struct
inode.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 20:18:01 -05:00
David Gstir 7821d4dd45 fscrypt: Enable partial page encryption
Not all filesystems work on full pages, thus we should allow them to
hand partial pages to fscrypt for en/decryption.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 18:55:21 -05:00
David Gstir b50f7b268b fscrypt: Allow fscrypt_decrypt_page() to function with non-writeback pages
Some filesystem might pass pages which do not have page->mapping->host
set to the encrypted inode. We want the caller to explicitly pass the
corresponding inode.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 18:53:10 -05:00
David Gstir 1c7dcf69ee fscrypt: Add in-place encryption mode
ext4 and f2fs require a bounce page when encrypting pages. However, not
all filesystems will need that (eg. UBIFS). This is handled via a
flag on fscrypt_operations where a fs implementation can select in-place
encryption over using a bounce page (which is the default).

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 18:47:04 -05:00
Deepa Dinamani 362ad5d58e fs: jfs: Replace CURRENT_TIME_SEC by current_time()
jfs uses nanosecond granularity for filesystem timestamps.
Only this assignment is not using nanosecond granularity.
Use current_time() to get the right granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2016-11-11 15:51:39 -06:00
Jens Axboe bbd7bb7017 block: move poll code to blk-mq
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-11 13:40:25 -07:00
Joel Fernandes d8991f51e5 pstore: Warn on PSTORE_TYPE_PMSG using deprecated function
PMSG now uses ramoops_pstore_write_buf_user() instead of ...write_buf().
Print a ratelimited warning if gets accidentally called.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: adjusted commit log and added -EINVAL return]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-11 10:36:46 -08:00
Joel Fernandes 109704492e pstore: Make spinlock per zone instead of global
Currently pstore has a global spinlock for all zones. Since the zones
are independent and modify different areas of memory, there's no need
to have a global lock, so we should use a per-zone lock as introduced
here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag
introduced later, which splits the ftrace memory area into a single zone
per CPU, it will eliminate the need for locking. In preparation for this,
make the locking optional.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: updated commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-11 10:35:37 -08:00
Linus Torvalds 968ef8de55 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  lib/stackdepot: export save/fetch stack for drivers
  mm: kmemleak: scan .data.ro_after_init
  memcg: prevent memcg caches to be both OFF_SLAB & OBJFREELIST_SLAB
  coredump: fix unfreezable coredumping task
  mm/filemap: don't allow partially uptodate page for pipes
  mm/hugetlb: fix huge page reservation leak in private mapping error paths
  ocfs2: fix not enough credit panic
  Revert "console: don't prefer first registered if DT specifies stdout-path"
  mm: hwpoison: fix thp split handling in memory_failure()
  swapfile: fix memory corruption via malformed swapfile
  mm/cma.c: check the max limit for cma allocation
  scripts/bloat-o-meter: fix SIGPIPE
  shmem: fix pageflags after swapping DMA32 object
  mm, frontswap: make sure allocated frontswap map is assigned
  mm: remove extra newline from allocation stall warning
2016-11-11 09:44:23 -08:00
Linus Torvalds c5e4ca6da9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "Christoph's and Jan's aio fixes, fixup for generic_file_splice_read
  (removal of pointless detritus that actually breaks it when used for
  gfs2 ->splice_read()) and fixup for generic_file_read_iter()
  interaction with ITER_PIPE destinations."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  splice: remove detritus from generic_file_splice_read()
  mm/filemap: don't allow partially uptodate page for pipes
  aio: fix freeze protection of aio writes
  fs: remove aio_run_iocb
  fs: remove the never implemented aio_fsync file operation
  aio: hold an extra file reference over AIO read/write operations
2016-11-11 09:19:01 -08:00
Linus Torvalds ef091b3cef Ceph's ->read_iter() implementation is incompatible with the new
generic_file_splice_read() code that went into -rc1.  Switch to the
 less efficient default_file_splice_read() for now; the proper fix is
 being held for 4.10.
 
 We also have a fix for a 4.8 regression and a trival libceph fixup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYJdjPAAoJEEp/3jgCEfOLzEoH/A3B1qqiqs2WoMn0O4pnEdcM
 TxaU46VOkYcK2wh/xbYAns2kZEXKgcCySv+kXn4l3Gh6/lXVxv4WexNqWdO1o6yN
 GqEufIH7yQM6QOE/hkwtUciBXmPfQMPxF14vvprYQuyu5Bs96mrphiAa7vX6Vbk5
 VhfE/j0shb8Q2QQj/Om0mWqM6JtOAlr5aFtEcJcodbCk1k8CptUcBsSoQ31PXMC7
 UcaBHh1VHGLvx9WeG1Rw1g9tc2LiUyu+UK0csolp51+amB7HezgfmzDQzHtXzBmm
 n90SQwonf0DrdWUGuQlOpHnREwxLSgN19s68FCjLc0jeMTP4b6TFEIUgFxiqWc4=
 =Ws5s
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client

Pull Ceph fixes from Ilya Dryomov:
 "Ceph's ->read_iter() implementation is incompatible with the new
  generic_file_splice_read() code that went into -rc1.  Switch to the
  less efficient default_file_splice_read() for now; the proper fix is
  being held for 4.10.

  We also have a fix for a 4.8 regression and a trival libceph fixup"

* tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client:
  libceph: initialize last_linger_id with a large integer
  libceph: fix legacy layout decode with pool 0
  ceph: use default file splice read callback
2016-11-11 09:17:10 -08:00
Linus Torvalds ef5beed998 NFS client bugfixes for Linux 4.9
Bugfixes:
 - Trim extra slashes in v4 nfs_paths to fix tools that use this
 - Fix a -Wmaybe-uninitialized warnings
 - Fix suspicious RCU usages
 - Fix Oops when mounting multiple servers at once
 - Suppress a false-positive pNFS error
 - Fix a DMAR failure in NFS over RDMA
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYJOCbAAoJENfLVL+wpUDrbO0QAIkcxdUu2iQeOrk07VP48kDE
 UEfJTal8vbW/KtKyL9bIeRa1qCvYpSJXnnKcR/Uo5VHE5nMz/5omoJofWf5Zg0UM
 iEHyZfOsuGFieBbl1NBaLjEd6MCoJYpmWFUj+3drZ8zqSdqDTL+JgrP7k3XEU2Mx
 glKb7U0AKoclm3h1MKyCyo5TgDVeI5TOhi+i3VVw2IN79VY2CUp4lHWMY4vloghp
 h+GuJWeVFS1nBpfCF9PpTU6LdHDfg4o/J5+DrP+IjIffD1XGzGEjfFR0BX5HyDcN
 PgOSF3fc7uVOOUIBEAqHUHY/7XiKlv6TEMRPdM8ALVoCXZ6hPSSFxq8JBJSWoVEp
 r11ts66VgYxdQgHbs51Y5AaKudLBwU60KosWuddbdZVb4YPM0cn5WQzVezrpoQYu
 k4rfrpt+LFv23NGfIJa6JaTSFBzM+YXmggEGUI8TI/YUFSN+wEp4uzLB4r19nqAP
 ff32iunzV9Z5edpPQFDCf3/1HAhzrL5KWo7E8EvijpdQKZl5k5CnUJxbG22lh4ct
 QIyYg51LjhCayzbRH8Mu+TKUFT29ORlcSp851BotLjT8ZdUetWXcFab93nAkQI7g
 sMREml4DvcXWy8qFAOzi8mX1ddTBumxBfOD0m3skPg+odxwsl/KiwjLCRwfTrgwS
 jfSXsXmrwTniPCDWgKg3
 =hFod
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Most of these fix regressions in 4.9, and none are going to stable
  this time around.

  Bugfixes:
   - Trim extra slashes in v4 nfs_paths to fix tools that use this
   - Fix a -Wmaybe-uninitialized warnings
   - Fix suspicious RCU usages
   - Fix Oops when mounting multiple servers at once
   - Suppress a false-positive pNFS error
   - Fix a DMAR failure in NFS over RDMA"

* tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
  fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
  NFS: Don't print a pNFS error if we aren't using pNFS
  NFS: Ignore connections that have cl_rpcclient uninitialized
  SUNRPC: Fix suspicious RCU usage
  NFSv4.1: work around -Wmaybe-uninitialized warning
  NFS: Trim extra slash in v4 nfs_path
2016-11-11 09:15:30 -08:00
Linus Torvalds a4fac3b5d1 xfs: update for 4.9-rc5
In this update:
 o fix for aborting deferred transactions on filesystem shutdown.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYI/hIAAoJEK3oKUf0dfodjtIP/28PmMnwkSwC3IfNzMEuIlvM
 Q96IlnIXiOlqXSIlO69VM2cS8YCW2oVUYD6pjbhC2YcOaogYGo4TlFmQMrFL4dgG
 /q9aH+Hf6jGgGfwpC2MKq6WIYFU8oYYzG2FIm3jyEnnDCIFMH4H+FYGgzrNVIraQ
 31nWn3ye/xHD44EWvRbEUAE0ROxbZfgm7+QST+R2+lAhLitTqZLNKkenlN8v5P9h
 WInRYBHxq5beIbbhAgm50wKfvxalrYChDeujorGsGAjJLliWpgJnbah6TwPNvRXH
 SwSkJfzI53AFXqH/45P2X5Ib34P7mIz6hTl3zednfU1GBeB3PxUGay6QU0rxRKAL
 4vv/RB6pd5EzmU+hvnLJbM7lpRWpnV2FoP9YwRoqIR8vUvlXTYLpecEw1mtJwFmd
 VZ99F+8mtz+jGKK01hZDAtP7tHt1JSFhOnOQ506UsmnqhbgWaYuDEOOktX08iTik
 +zwTETuoh4frQDVLjrQ507RWJl3XblWXEgD4Dw+N19w7oTSAbuF6zIU7sTlcLJSx
 APqD1uJUzvfYV11eyY7bosuCx7QtTRwUQWQSkKu0f9FXAuILf2KdDI62lZCnKi5M
 6wTUKFNZfqED+ergGyzPuVuQr/PhbQPEC61EehYMpQPSjeVEay2e30ouM6A+8VPp
 ZQal03fQzJvbxm/+8m3v
 =62jA
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-linus-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fix from Dave Chinner:
 "This is a fix for an unmount hang (regression) when the filesystem is
  shutdown.  It was supposed to go to you for -rc3, but I accidentally
  tagged the commit prior to it in that pullreq.

  Summary:

   - fix for aborting deferred transactions on filesystem shutdown"

* tag 'xfs-fixes-for-linus-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: defer should abort intent items if the trans roll fails
2016-11-11 09:13:48 -08:00
Andrey Ryabinin 70d78fe7c8 coredump: fix unfreezable coredumping task
It could be not possible to freeze coredumping task when it waits for
'core_state->startup' completion, because threads are frozen in
get_signal() before they got a chance to complete 'core_state->startup'.

Inability to freeze a task during suspend will cause suspend to fail.
Also CRIU uses cgroup freezer during dump operation.  So with an
unfreezable task the CRIU dump will fail because it waits for a
transition from 'FREEZING' to 'FROZEN' state which will never happen.

Use freezer_do_not_count() to tell freezer to ignore coredumping task
while it waits for core_state->startup completion.

Link: http://lkml.kernel.org/r/1475225434-3753-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00
Junxiao Bi d006c71f8a ocfs2: fix not enough credit panic
The following panic was caught when run ocfs2 disconfig single test
(block size 512 and cluster size 8192).  ocfs2_journal_dirty() return
-ENOSPC, that means credits were used up.

The total credit should include 3 times of "num_dx_leaves" from
ocfs2_dx_dir_rebalance(), because 2 times will be consumed in
ocfs2_dx_dir_transfer_leaf() and 1 time will be consumed in
ocfs2_dx_dir_new_cluster() -> __ocfs2_dx_dir_new_cluster() ->
ocfs2_dx_dir_format_cluster().  But only two times is included in
ocfs2_dx_dir_rebalance_credits(), fix it.

This can cause read-only fs(v4.1+) or panic for mainline linux depending
on mount option.

  ------------[ cut here ]------------
  kernel BUG at fs/ocfs2/journal.c:775!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 #2
  Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
  task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000
  RIP: ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
  RSP: 0018:ffff8800a7d4b6d8  EFLAGS: 00010286
  RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9
  RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460
  RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460
  R10: 0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e
  R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070
  FS:  00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0
  Call Trace:
    ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2]
    ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2]
    ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2]
    ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2]
    ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2]
    ocfs2_mknod+0x5a2/0x1400 [ocfs2]
    ocfs2_create+0x73/0x180 [ocfs2]
    vfs_create+0xd8/0x100
    lookup_open+0x185/0x1c0
    do_last+0x36d/0x780
    path_openat+0x92/0x470
    do_filp_open+0x4a/0xa0
    do_sys_open+0x11a/0x230
    SyS_open+0x1e/0x20
    system_call_fastpath+0x12/0x71
  Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
  RIP  ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
  ---[ end trace 91ac5312a6ee1288 ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

Link: http://lkml.kernel.org/r/1478248135-31963-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00
Al Viro e519e77747 splice: remove detritus from generic_file_splice_read()
i_size check is a leftover from the horrors that used to play with
the page cache in that function.  With the switch to ->read_iter(),
it's neither needed nor correct - for gfs2 it ends up being buggy,
since i_size is not guaranteed to be correct until later (inside
->read_iter()).

Spotted-by: Abhi Das <adas@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-10 18:32:13 -05:00
Yan, Zheng 8a8d561766 ceph: use default file splice read callback
Splice read/write implementation changed recently. When using
generic_file_splice_read(), iov_iter with type == ITER_PIPE is
passed to filesystem's read_iter callback. But ceph_sync_read()
can't serve ITER_PIPE iov_iter correctly (ITER_PIPE iov_iter
expects pages from page cache).

Fixing ceph_sync_read() requires a big patch. So use default
splice read callback for now.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-11-10 20:13:04 +01:00
Dave Chinner 0fc204e2eb Merge branch 'xfs-4.10-misc-fixes-1' into for-next 2016-11-10 10:29:43 +11:00
Dave Chinner 8f23d318aa Merge branch 'xfs-4.10-libxfs-cleanups' into for-next 2016-11-10 10:29:29 +11:00
Dave Chinner b649c42e25 Merge branch 'dax-4.10-iomap-pmd' into for-next 2016-11-10 10:29:06 +11:00
Jan Kara 9484ab1bf4 dax: Introduce IOMAP_FAULT flag
Introduce a flag telling iomap operations whether they are handling a
fault or other IO. That may influence behavior wrt inode size and
similar things.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-10 10:26:50 +11:00
Sebastian Andrzej Siewior fc4d24c9b4 fs/buffer: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-2-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 23:45:25 +01:00
Brian Foster 98efe8af1c xfs: fix unbalanced inode reclaim flush locking
Filesystem shutdown testing on an older distro kernel has uncovered an
imbalanced locking pattern for the inode flush lock in
xfs_reclaim_inode(). Specifically, there is a double unlock sequence
between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
"reclaim:" label.

This actually does not cause obvious problems on current kernels due to
the current flush lock implementation. Older kernels use a counting
based flush lock mechanism, however, which effectively breaks the lock
indefinitely when an already unlocked flush lock is repeatedly unlocked.
Though this only currently occurs on filesystem shutdown, it has
reproduced the effect of elevating an fs shutdown to a system-wide crash
or hang.

As it turns out, the flush lock is not actually required for the reclaim
logic in xfs_reclaim_inode() because by that time we have already cycled
the flush lock once while holding ILOCK_EXCL. Therefore, remove the
additional flush lock/unlock cycle around the 'reclaim:' label and
update branches into this label to release the flush lock where
appropriate. Add an assert to xfs_ifunlock() to help prevent future
occurences of the same problem.

Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-10 08:23:22 +11:00
Linus Torvalds 3c6106da74 We recently refactored the Orangefs debugfs code.
The refactor seemed to trigger dan.carpenter@oracle.com's
 static tester to find a possible double-free in the code.
 
 While designing the fix we saw a condition under which the
 buffer being freed could also be overflowed.
 
 We also realized how to rebuild the related debugfs file's
 "contents" (a string) without deleting and re-creating the file.
 
 This fix should eliminate the possible double-free, the
 potential overflow and improve code readability.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYI04nAAoJEM9EDqnrzg2+rT4P/1sN1ZUDwKgyJ3Qk3n5AAvlR
 PtbqeRhzUD7QdTR5yb/k/37rqYB7BBB5xd5VDlYKuD8luppjoAS2J4SRkngPFQiV
 NZMP1Sq5nWeEyeG8it+MzH364zBUGK+D94VqlhUDLHKa27WMWTB2vrcLq/DI06np
 35r4dWnV3+2+lgg0zvJGP7QoQLlPByB5q7pwTA9TBaPnDHVh/Myq4jS70wfEYDqY
 NMxkq02vQHS7a2mysZbrE2fXC2OBRTCGP+9lsdvJx9XfYQkIHfe4qMAxu0XlqDSM
 6POHEr5cPizENGSp7myV9G97FuF0UHZXnLEKe04DLbuGao2omMiNHvrnS/zPgKRQ
 zpCMsf4FVChjCipQzne1eLQbQskWVDy58ziURzefO+bV8aFe61KBOvAiNEZ+7i21
 CeEBeU2A5brd1y6ELcFCf2SDjmyd/ScyYqwIIrsY0eK1D3GveeHX1tkMGH8y7hWD
 CoY/cKUSHwZ6ZwFpzeCs96wzZ19o7BLKogyIyYQWc1s09uijIHInxkRxUqtfVErj
 2SpOXOqLkpjpU0Kga8beU6OtXLRv/ob518RPTPF5NAzjl08xn20pZckKL092uhVj
 k/VlJUE72Om2XJmBbta2Sz7iN8QfujJO0Ql/Nl51lX++pVbQIjFDaRhIYt6Yb0af
 y1XRJXqDJ3sh9J/od3fN
 =ELyS
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.9-rc4-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs fix from Mike Marshall:
 "We recently refactored the Orangefs debugfs code. The refactor seemed
  to trigger dan.carpenter@oracle.com's static tester to find a possible
  double-free in the code.

  While designing the fix we saw a condition under which the buffer
  being freed could also be overflowed.

  We also realized how to rebuild the related debugfs file's "contents"
  (a string) without deleting and re-creating the file.

  This fix should eliminate the possible double-free, the potential
  overflow and improve code readability"

* tag 'for-linus-4.9-rc4-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: clean up debugfs
2016-11-09 11:36:43 -08:00
Darrick J. Wong bec9d48d7a xfs: check minimum block size for CRC filesystems
Check the minimum block size on v5 filesystems.

[dchinner: cleaned up XFS_MIN_CRC_BLOCKSIZE check]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-09 12:11:12 +11:00
Li Pengcheng 959217c84c pstore: Actually give up during locking failure
Without a return after the pr_err(), dumps will collide when two threads
call pstore_dump() at the same time.

Signed-off-by: Liu Hailong <liuhailong5@huawei.com>
Signed-off-by: Li Pengcheng <lipengcheng8@huawei.com>
Signed-off-by: Li Zhong <lizhong11@hisilicon.com>
[kees: improved commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-08 16:44:33 -08:00
Eric Sandeen 5d829300be xfs: provide helper for counting extents from if_bytes
The open-coded pattern:

ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

is all over the xfs code; provide a new helper
xfs_iext_count(ifp) to count the number of inline extents
in an inode fork.

[dchinner: pick up several missed conversions]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:59:42 +11:00
Eric Sandeen 4dfce57db6 xfs: fix up xfs_swap_extent_forks inline extent handling
There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
    [exception RIP: xfs_trans_log_inode+0x10]
 #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d

As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate.  When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.

This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.

Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.

Thanks to dchinner for looking at this with me and spotting the
root cause.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:55:18 +11:00
Brian Foster 04197b341f xfs: don't BUG() on mixed direct and mapped I/O
We've had reports of generic/095 causing XFS to BUG() in
__xfs_get_blocks() due to the existence of delalloc blocks on a
direct I/O read. generic/095 issues a mix of various types of I/O,
including direct and memory mapped I/O to a single file. This is
clearly not supported behavior and is known to lead to such
problems. E.g., the lack of exclusion between the direct I/O and
write fault paths means that a write fault can allocate delalloc
blocks in a region of a file that was previously a hole after the
direct read has attempted to flush/inval the file range, but before
it actually reads the block mapping. In turn, the direct read
discovers a delalloc extent and cannot proceed.

While the appropriate solution here is to not mix direct and memory
mapped I/O to the same regions of the same file, the current
BUG_ON() behavior is probably overkill as it can crash the entire
system.  Instead, localize the failure to the I/O in question by
returning an error for a direct I/O that cannot be handled safely
due to delalloc blocks. Be careful to allow the case of a direct
write to post-eof delalloc blocks. This can occur due to speculative
preallocation and is safe as post-eof blocks are not accompanied by
dirty pages in pagecache (conversely, preallocation within eof must
have been zeroed, and thus dirtied, before the inode size could have
been increased beyond said blocks).

Finally, provide an additional warning if a direct I/O write occurs
while the file is memory mapped. This may not catch all problematic
scenarios, but provides a hint that some known-to-be-problematic I/O
methods are in use.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:54:14 +11:00
Brian Foster 399372349a xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan
The cowblocks background scanner currently clears the cowblocks tag
for inodes without any real allocations in the cow fork. This
excludes inodes with only delalloc blocks in the cow fork. While we
might never expect to clear delalloc blocks from the cow fork in the
background scanner, it is not necessarily correct to clear the
cowblocks tag from such inodes.

For example, if the background scanner happens to process an inode
between a buffered write and writeback, the scanner catches the
inode in a state after delalloc blocks have been allocated to the
cow fork but before the delalloc blocks have been converted to real
blocks by writeback. The background scanner then incorrectly clears
the cowblocks tag, even if part of the aforementioned delalloc
reservation will not be remapped to the data fork (i.e., extra
blocks due to the cowextsize hint). This means that any such
additional blocks in the cow fork might never be reclaimed by the
background scanner and could persist until the inode itself is
reclaimed.

To address this problem, only skip and clear inodes without any cow
fork allocations whatsoever from the background scanner. While we
generally do not want to cancel delalloc reservations from the
background scanner, the pagecache dirty check following the
cowblocks check should prevent that situation. If we do end up with
delalloc cow fork blocks without a dirty address space mapping, this
is probably an indication that something has gone wrong and the
blocks should be reclaimed, as they may never be converted to a real
allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:53:33 +11:00
Darrick J. Wong 4fd29ec472 xfs: check return value of _trans_reserve_quota_nblks
Check the return value of xfs_trans_reserve_quota_nblks for errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:59:26 +11:00
Darrick J. Wong 5e52365ac8 xfs: move dir_ino_validate declaration per xfsprogs
Move the declaration of _dir_ino_validate out of the private
dir2 header file into the public one, since xfsprogs did that
for the benefit of xfs_repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:59:12 +11:00
Eric Sandeen e6fc6fcf44 xfs: don't call xfs_sb_quota_from_disk twice
Source xfsprogs commit: ee3754254e8c186c99b6cdd4d59f741759d04acb

Kernel commit 5ef828c4 ("xfs: avoid false quotacheck after unclean
shutdown") made xfs_sb_from_disk() also call xfs_sb_quota_from_disk
by default.

However, when this was merged to libxfs, existing separate
calls to libxfs_sb_quota_from_disk remained, and calling it
twice in a row on a V4 superblock leads to issues, because:

        if (sbp->sb_qflags & XFS_PQUOTA_ACCT)  {
...
                sbp->sb_pquotino = sbp->sb_gquotino;
                sbp->sb_gquotino = NULLFSINO;

and after the second call, we have set both pquotino and gquotino
to NULLFSINO.

Fix this by making it safe to call twice, and also remove the extra
calls to libxfs_sb_quota_from_disk.

This is only spotted when running xfstests with "-m crc=0" because
the sb_from_disk change came about after V5 became default, and
the above behavior only exists on a V4 superblock.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:58:55 +11:00
Darrick J. Wong 523b2e76e3 libxfs: clean up _dir2_data_freescan
Refactor the implementations of xfs_dir2_data_freescan into a
routine that takes the raw directory block parameters and
a second function that figures out the raw parameters from the
directory inode.  This enables us to use the exact same code
for both userspace and the kernel, since repair knows exactly
which directory block geometry parameters it needs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:56:51 +11:00
Darrick J. Wong ae90b994b4 libxfs: fix xfs_attr_shortform_bytesfit declaration
Change the xfs_attr_shortform_bytesfit declaration to have
struct xfs_inode to avoid tripping up the libxfs-diff scanner.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:56:20 +11:00
Darrick J. Wong 68c098582b libxfs: fix whitespace problems
Fix some whitespace problems that trip up my libxfs-diff script.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:56:13 +11:00
Darrick J. Wong 420fbeb4bf libxfs: synchronize dinode_verify with userspace
The userspace version of _dinode_verify takes a raw inode number
instead of an inode itself.  Since neither version actually needs
the inode, port the changes to the kernel.  This will also reduce
the libxfs diff noise.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:56:06 +11:00
Darrick J. Wong 755c7bf5dd libxfs: convert ushort to unsigned short
Since xfsprogs dropped ushort in favor of unsigned short, do that
here too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:55:48 +11:00
Ross Zwisler 190b5caad7 dax: remove "depends on BROKEN" from FS_DAX_PMD
Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:35:16 +11:00
Ross Zwisler 862f1b9d67 xfs: use struct iomap based DAX PMD fault path
Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:35:02 +11:00
Ross Zwisler 642261ac99 dax: add struct iomap based DAX PMD support
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:34:45 +11:00
Ross Zwisler 422476c464 dax: move put_(un)locked_mapping_entry() in dax.c
No functional change.

The static functions put_locked_mapping_entry() and
put_unlocked_mapping_entry() will soon be used in error cases in
grab_mapping_entry(), so move their definitions above this function.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:33:44 +11:00
Ross Zwisler fa28f7296a dax: move RADIX_DAX_* defines to dax.h
The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:33:35 +11:00
Ross Zwisler 1550290b08 dax: dax_iomap_fault() needs to call iomap_end()
Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS.  ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation.  However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc.  So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:33:26 +11:00
Ross Zwisler 333ccc978e dax: add dax_iomap_sector() helper function
To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:33:09 +11:00
Ross Zwisler 11c59c92f4 dax: correct dax iomap code namespace
The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:46 +11:00
Ross Zwisler b9fde0462e dax: remove dax_pmd_fault()
dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:35 +11:00
Ross Zwisler 63e95b5c4f dax: coordinate locking for offsets in PMD range
DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:20 +11:00
Ross Zwisler e3ad61c64a dax: consistent variable naming for DAX entries
No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:12 +11:00
Ross Zwisler aada54f980 dax: remove the last BUG_ON() from fs/dax.c
Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:00 +11:00
Ross Zwisler ce95ab0fa6 dax: make 'wait_table' global variable static
The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:31:44 +11:00
Ross Zwisler 03e0990fc8 ext2: remove support for DAX PMD faults
DAX PMD support was added via the following commit:

commit e7b1ea2ad6 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:31:33 +11:00
Ross Zwisler fa0d3fce7c dax: remove buffer_size_valid()
Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:31:14 +11:00
Ross Zwisler 547edce3ba ext4: tell DAX the size of allocation holes
When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is current worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:30:58 +11:00
Shuah Khan 0ac84b72c0 fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
Fix the following warn:

fs/nfs/nfs4session.c: In function ‘nfs4_slot_seqid_in_use’:
fs/nfs/nfs4session.c:203:54: warning: ‘cur_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (nfs4_slot_get_seqid(tbl, slotid, &cur_seq) == 0 &&
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      cur_seq == seq_nr && test_bit(slotid, tbl->used_slots))
      ~~~~~~~~~~~~~~~~~

Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 16:11:30 -05:00
Anna Schumaker 192747166a NFS: Don't print a pNFS error if we aren't using pNFS
We used to check for a valid layout type id before verifying pNFS flags
as an indicator for if we are using pNFS.  This changed in 3132e49ece
with the introduction of multiple layout types, since now we are passing
an array of ids instead of just one.  Since then, users have been seeing
a KERN_ERR printk show up whenever mounting NFS v4 without pNFS.  This
patch restores the original behavior of exiting set_pnfs_layoutdriver()
early if we aren't using pNFS.

Fixes 3132e49ece ("pnfs: track multiple layout types in fsinfo
structure")
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 16:11:30 -05:00
Petr Vandrovec 8ef3295530 NFS: Ignore connections that have cl_rpcclient uninitialized
cl_rpcclient starts as ERR_PTR(-EINVAL), and connections like that
are floating freely through the system.  Most places check whether
pointer is valid before dereferencing it, but newly added code
in nfs_match_client does not.

Which causes crashes when more than one NFS mount point is present.

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 16:11:29 -05:00
Mike Marshall dc0336214e orangefs: clean up debugfs
We recently refactored the Orangefs debugfs code.
The refactor seemed to trigger dan.carpenter@oracle.com's
static tester to find a possible double-free in the code.

While designing the fix we saw a condition under which the
buffer being freed could also be overflowed.

We also realized how to rebuild the related debugfs file's
"contents" (a string) without deleting and re-creating the file.

This fix should eliminate the possible double-free, the
potential overflow and improve code readability.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
2016-11-07 10:41:55 -05:00
Linus Torvalds fb415f222c Fixes for some recent regressions including fallout from the vmalloc'd
stack change (after which we can no longer encrypt stuff on the stack).
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYHNwpAAoJECebzXlCjuG++DMP/3mUUAF09DfFR/EHl7knDT1f
 kZ53UVHYzr02w0wXfwxVLlp2H7TdSAufgsSvPT6qksA3eY7gL6nJ9zHkl+Nv5yCx
 y6vsFWjO1QEUWFOZWCKcmT2dAI3Ddt9IhK13pfZEKN1XKvK2zWB16HEVzSg6fR2K
 NwHlpMnQUI4HWThURzwTZb1M5YhxRCAnyiv8BTAAPjbEfzPzdL7j3jxwqtH8bOWp
 qIcDDvjC744b9zy0YuAEY/NyGBhYZPdM6gWsBBes1TRzBWUL9qsUYTWDJTmg/F1l
 Or0Jz7CUEN9uOHLGnkATPDc+eBg9YFV+bSsSnJu1/W4Er7dX1Af/lol79zEp/Zw1
 Snd9FelSPj3vxmYAFTCLnHRTRgsyiDhbbb7gVrzH9bxnCrRNR6p2kY018s1Cl9Td
 uWQoNNFQwwnYxWYEeZdO5PgX+pcgoCzhHACNk5oA93YaBE0GuLHHugwwIrYE8TM1
 1iY20sLC5lJcnPqxdgnoprZnnHMuL6rx5KRbvBeflNZ4huK2PIcPJyeB83XH6s12
 G67PjJ0rfWzSBF14O/ZtQA6he+kXvnH3pKqpNnaMiBxZZ2J8E1eQvrKTLLIwmtlP
 18KKJpZIzh7jTTZ/99nAMAt/BGw97P9TToLdnI8dCxYygHEaywpEYtcsE8IWFAvA
 3XkS5QdlJhhAaAUUYBXy
 =oPbZ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfixes from Bruce Fields:
 "Fixes for some recent regressions including fallout from the vmalloc'd
  stack change (after which we can no longer encrypt stuff on the
  stack)"

* tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix general protection fault in release_lock_stateid()
  svcrdma: backchannel cannot share a page for send and rcv buffers
  sunrpc: fix some missing rq_rbuffer assignments
  sunrpc: don't pass on-stack memory to sg_set_buf
  nfsd: move blocked lock handling under a dedicated spinlock
2016-11-04 20:12:10 -07:00
Linus Torvalds 46d7cbb2c4 Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from Chris Mason:
 "Some fixes that Dave Sterba collected.  We held off on these last week
  because I was focused on the memory corruption testing"

* 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix WARNING in btrfs_select_ref_head()
  Btrfs: remove some no-op casts
  btrfs: pass correct args to btrfs_async_run_delayed_refs()
  btrfs: make file clone aware of fatal signals
  btrfs: qgroup: Prevent qgroup->reserved from going subzero
  Btrfs: kill BUG_ON in do_relocation
2016-11-04 20:08:16 -07:00
Linus Torvalds bd30fac18f Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Fix two more POSIX ACL bugs introduced in 4.8 and add a missing fsync
  during copy up to prevent possible data loss"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fsync after copy-up
  ovl: fix get_acl() on tmpfs
  ovl: update S_ISGID when setting posix ACLs
2016-11-04 20:03:14 -07:00
Jan Kara ce98321bf7 fs: Remove unmap_underlying_metadata
Nobody is using this function anymore. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara e64855c6cf fs: Add helper to clean bdev aliases under a bh and use it
Add a helper function that clears buffer heads from a block device
aliasing passed bh. Use this helper function from filesystems instead of
the original unmap_underlying_metadata() to save some boiler plate code
and also have a better name for the functionalily since it is not
unmapping anything for a *long* time.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara 69a9bea146 ext2: Use clean_bdev_aliases() instead of iteration
Use clean_bdev_aliases() instead of iterating through blocks one by one.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara 64e1c57fa4 ext4: Use clean_bdev_aliases() instead of iteration
Use clean_bdev_aliases() instead of iterating through blocks one by one.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara f734c89cc9 direct-io: Use clean_bdev_aliases() instead of handmade iteration
Use new provided function instead of an iteration through all allocated
blocks.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara 29f3ad7d83 fs: Provide function to unmap metadata for a range of blocks
Provide function equivalent to unmap_underlying_metadata() for a range
of blocks. We somewhat optimize the function to use pagevec lookups
instead of looking up buffer heads one by one and use page lock to pin
buffer heads instead of mapping's private_lock to improve scalability.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Alexey Dobriyan b4eb4f7f1a audit: less stack usage for /proc/*/loginuid
%u requires 10 characters at most not 20.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-03 17:20:00 -04:00
Jens Axboe 7637241e65 writeback: add wbc_to_write_flags()
Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-02 10:24:03 -06:00
Chris Mason e3597e6090 Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9 2016-11-01 12:54:45 -07:00
J. Bruce Fields e864c189e1 nfsd: catch errors in decode_fattr earlier
3c8e03166a "NFSv4: do exact check about attribute specified" fixed
some handling of unsupported-attribute errors, but it also delayed
checking for unwriteable attributes till after we decode them.  This
could lead to odd behavior in the case a client attemps to set an
attribute we don't know about followed by one we try to parse.  In that
case the parser for the known attribute will attempt to parse the
unknown attribute.  It should fail in some safe way, but the error might
at least be incorrect (probably bad_xdr instead of inval).  So, it's
better to do that check at the start.

As far as I know this doesn't cause any problems with current clients
but it might be a minor issue e.g. if we encounter a future client that
supports a new attribute that we currently don't.

Cc: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:52 -04:00
J. Bruce Fields 916d2d844a nfsd: clean up supported attribute handling
Minor cleanup, no change in behavior.

Provide helpers for some common attribute bitmap operations.  Drop some
comments that just echo the code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:52 -04:00
Jeff Layton 851238a22f nfsd: fix error handling for clients that fail to return the layout
Currently, when the client continually returns NFS4ERR_DELAY on a
CB_LAYOUTRECALL, we'll give up trying to retransmit after two lease
periods, but leave the layout in place.

What we really need to do here is fence the client in this case. Have it
fall through to that code in that case instead of into the
NFS4ERR_NOMATCHING_LAYOUT case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:43 -04:00
Jeff Layton 8f97514b42 nfsd: more robust allocation failure handling in nfsd_reply_cache_init
Currently, we try to allocate the cache as a single, large chunk, which
can fail if no big chunks of memory are available. We _do_ try to size
it according to the amount of memory in the box, but if the server is
started well after boot time, then the allocation can fail due to memory
fragmentation.

Fall back to doing a vzalloc if the kcalloc fails, and switch the
shutdown code to do a kvfree to handle freeing correctly.

Reported-by: Olaf Hering <olaf@aepfle.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:43 -04:00
Chuck Lever f46c445b79 nfsd: Fix general protection fault in release_lock_stateid()
When I push NFSv4.1 / RDMA hard, (xfstests generic/089, for example),
I get this crash on the server:

Oct 28 22:04:30 klimt kernel: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
Oct 28 22:04:30 klimt kernel: Modules linked in: cts rpcsec_gss_krb5 iTCO_wdt iTCO_vendor_support sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm btrfs irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd xor pcspkr raid6_pq i2c_i801 i2c_smbus lpc_ich mfd_core sg mei_me mei ioatdma shpchp wmi ipmi_si ipmi_msghandler rpcrdma ib_ipoib rdma_ucm acpi_power_meter acpi_pad ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb ahci libahci ptp mlx4_core pps_core dca libata i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
Oct 28 22:04:30 klimt kernel: CPU: 7 PID: 1558 Comm: nfsd Not tainted 4.9.0-rc2-00005-g82cd754 #8
Oct 28 22:04:30 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
Oct 28 22:04:30 klimt kernel: task: ffff880835c3a100 task.stack: ffff8808420d8000
Oct 28 22:04:30 klimt kernel: RIP: 0010:[<ffffffffa05a759f>]  [<ffffffffa05a759f>] release_lock_stateid+0x1f/0x60 [nfsd]
Oct 28 22:04:30 klimt kernel: RSP: 0018:ffff8808420dbce0  EFLAGS: 00010246
Oct 28 22:04:30 klimt kernel: RAX: ffff88084e6660f0 RBX: ffff88084e667020 RCX: 0000000000000000
Oct 28 22:04:30 klimt kernel: RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff88084e667020
Oct 28 22:04:30 klimt kernel: RBP: ffff8808420dbcf8 R08: 0000000000000001 R09: 0000000000000000
Oct 28 22:04:30 klimt kernel: R10: ffff880835c3a100 R11: ffff880835c3aca8 R12: 6b6b6b6b6b6b6b6b
Oct 28 22:04:30 klimt kernel: R13: ffff88084e6670d8 R14: ffff880835f546f0 R15: ffff880835f1c548
Oct 28 22:04:30 klimt kernel: FS:  0000000000000000(0000) GS:ffff88087bdc0000(0000) knlGS:0000000000000000
Oct 28 22:04:30 klimt kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 28 22:04:30 klimt kernel: CR2: 00007ff020389000 CR3: 0000000001c06000 CR4: 00000000001406e0
Oct 28 22:04:30 klimt kernel: Stack:
Oct 28 22:04:30 klimt kernel: ffff88084e667020 0000000000000000 ffff88084e6670d8 ffff8808420dbd20
Oct 28 22:04:30 klimt kernel: ffffffffa05ac80d ffff880835f54548 ffff88084e640008 ffff880835f545b0
Oct 28 22:04:30 klimt kernel: ffff8808420dbd70 ffffffffa059803d ffff880835f1c768 0000000000000870
Oct 28 22:04:30 klimt kernel: Call Trace:
Oct 28 22:04:30 klimt kernel: [<ffffffffa05ac80d>] nfsd4_free_stateid+0xfd/0x1b0 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffffa059803d>] nfsd4_proc_compound+0x40d/0x690 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffffa0583114>] nfsd_dispatch+0xd4/0x1d0 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffffa047bbf9>] svc_process_common+0x3d9/0x700 [sunrpc]
Oct 28 22:04:30 klimt kernel: [<ffffffffa047ca64>] svc_process+0xf4/0x330 [sunrpc]
Oct 28 22:04:30 klimt kernel: [<ffffffffa05827ca>] nfsd+0xfa/0x160 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffffa05826d0>] ? nfsd_destroy+0x170/0x170 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffff810b367b>] kthread+0x10b/0x120
Oct 28 22:04:30 klimt kernel: [<ffffffff810b3570>] ? kthread_stop+0x280/0x280
Oct 28 22:04:30 klimt kernel: [<ffffffff8174e8ba>] ret_from_fork+0x2a/0x40
Oct 28 22:04:30 klimt kernel: Code: c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 87 b0 00 00 00 48 89 fb 4c 8b a0 98 00 00 00 <49> 8b 44 24 20 48 8d b8 80 03 00 00 e8 10 66 1a e1 48 89 df e8
Oct 28 22:04:30 klimt kernel: RIP  [<ffffffffa05a759f>] release_lock_stateid+0x1f/0x60 [nfsd]
Oct 28 22:04:30 klimt kernel: RSP <ffff8808420dbce0>
Oct 28 22:04:30 klimt kernel: ---[ end trace cf5d0b371973e167 ]---

Jeff Layton says:
> Hm...now that I look though, this is a little suspicious:
>
>    struct nfs4_openowner *oo = openowner(stp->st_openstp->st_stateowner);
>
> I wonder if it's possible for the openstateid to have already been
> destroyed at this point.
>
> We might be better off doing something like this to get the client pointer:
>
>    stp->st_stid.sc_client;
>
> ...which should be more direct and less dependent on other stateids
> staying valid.

With the suggested change, I am no longer able to reproduce the above oops.

v2: Fix unhash_lock_stateid() as well

Fix-suggested-by: Jeff Layton <jlayton@redhat.com>
Fixes: 42691398be ('nfsd: Fix race between FREE_STATEID and LOCK')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:24:43 -04:00
Christoph Hellwig be297968da mm: only include blk_types in swap.h if CONFIG_SWAP is enabled
It's only needed for the CONFIG_SWAP-only use of bio_end_io_t.

Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some
ifdefs in blk_types.h.

Instead we'll need to add a few explicit includes that were implicit
before, though.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig 2f8b544477 block,fs: untangle fs.h and blk_types.h
Nothing in fs.h should require blk_types.h to be included.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig 70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig 67f055c798 btrfs: use op_is_sync to check for synchronous requests
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Andrey Vagin c62cce2cae net: add an ioctl to get a socket network namespace
Each socket operates in a network namespace where it has been created,
so if we want to dump and restore a socket, we have to know its network
namespace.

We have a socket_diag to get information about sockets, it doesn't
report sockets which are not bound or connected.

This patch introduces a new socket ioctl, which is called SIOCGSKNS
and used to get a file descriptor for a socket network namespace.

A task must have CAP_NET_ADMIN in a target network namespace to
use this ioctl.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 10:56:36 -04:00
Miklos Szeredi 641089c154 ovl: fsync after copy-up
Make sure the copied up file hits the disk before renaming to the final
destination.  If this is not done then the copy-up may corrupt the data in
the file in case of a crash.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-10-31 14:42:14 +01:00
Miklos Szeredi b93d4a0eb3 ovl: fix get_acl() on tmpfs
tmpfs doesn't have ->get_acl() because it only uses cached acls.

This fixes the acl tests in pjdfstest when tmpfs is used as the upper layer
of the overlay.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 39a25b2b37 ("ovl: define ->get_acl() for overlay inodes")
Cc: <stable@vger.kernel.org> # v4.8
2016-10-31 14:42:14 +01:00
Miklos Szeredi fd3220d37b ovl: update S_ISGID when setting posix ACLs
This change fixes xfstest generic/375, which failed to clear the
setgid bit in the following test case on overlayfs:

  touch $testfile
  chown 100:100 $testfile
  chmod 2755 $testfile
  _runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile

Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Amir Goldstein <amir73il@gmail.com>
Fixes: d837a49bd5 ("ovl: fix POSIX ACL setting")
Cc: <stable@vger.kernel.org> # v4.8
2016-10-31 14:42:14 +01:00
Jan Kara 70fe2f4815 aio: fix freeze protection of aio writes
Currently we dropped freeze protection of aio writes just after IO was
submitted. Thus aio write could be in flight while the filesystem was
frozen and that could result in unexpected situation like aio completion
wanting to convert extent type on frozen filesystem. Testcase from
Dmitry triggering this is like:

for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done &
fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \
    --runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite

Fix the problem by dropping freeze protection only once IO is completed
in aio_complete().

Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
[hch: forward ported on top of various VFS and aio changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30 13:09:42 -04:00
Christoph Hellwig 89319d31d2 fs: remove aio_run_iocb
Pass the ABI iocb structure to aio_setup_rw and let it handle the
non-vectored I/O case as well.  With that and a new helper for the AIO
return value handling we can now define new aio_read and aio_write
helpers that implement reads and writes in a self-contained way without
duplicating too much code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30 13:09:42 -04:00
Christoph Hellwig 723c038475 fs: remove the never implemented aio_fsync file operation
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30 13:09:42 -04:00
Christoph Hellwig 0b944d3a4b aio: hold an extra file reference over AIO read/write operations
Otherwise we might dereference an already freed file and/or inode
when aio_complete is called before we return from the read_iter or
write_iter method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30 13:09:42 -04:00
David S. Miller 27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Linus Torvalds 2a26d99b25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of fixes, mostly drivers as is usually the case.

   1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
      Khoroshilov.

   2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
      Pedersen.

   3) Don't put aead_req crypto struct on the stack in mac80211, from
      Ard Biesheuvel.

   4) Several uninitialized variable warning fixes from Arnd Bergmann.

   5) Fix memory leak in cxgb4, from Colin Ian King.

   6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.

   7) Several VRF semantic fixes from David Ahern.

   8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.

   9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.

  10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.

  11) Fix stale link state during failover in NCSCI driver, from Gavin
      Shan.

  12) Fix netdev lower adjacency list traversal, from Ido Schimmel.

  13) Propvide proper handle when emitting notifications of filter
      deletes, from Jamal Hadi Salim.

  14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.

  15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.

  16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.

  17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.

  18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
      Leitner.

  19) Revert a netns locking change that causes regressions, from Paul
      Moore.

  20) Add recursion limit to GRO handling, from Sabrina Dubroca.

  21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.

  22) Avoid accessing stale vxlan/geneve socket in data path, from
      Pravin Shelar"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
  geneve: avoid using stale geneve socket.
  vxlan: avoid using stale vxlan socket.
  qede: Fix out-of-bound fastpath memory access
  net: phy: dp83848: add dp83822 PHY support
  enic: fix rq disable
  tipc: fix broadcast link synchronization problem
  ibmvnic: Fix missing brackets in init_sub_crq_irqs
  ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
  Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
  arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
  net/mlx4_en: Save slave ethtool stats command
  net/mlx4_en: Fix potential deadlock in port statistics flow
  net/mlx4: Fix firmware command timeout during interrupt test
  net/mlx4_core: Do not access comm channel if it has not yet been initialized
  net/mlx4_en: Fix panic during reboot
  net/mlx4_en: Process all completions in RX rings after port goes up
  net/mlx4_en: Resolve dividing by zero in 32-bit system
  net/mlx4_core: Change the default value of enable_qos
  net/mlx4_core: Avoid setting ports to auto when only one port type is supported
  net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
  ...
2016-10-29 20:33:20 -07:00
Linus Torvalds efa563752c This pull request contains fixes for issues in both UBI and UBIFS:
- A regression wrt. overlayfs, introduced in -rc2.
 - An UBI issue, found by Dan Carpenter's static checker.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJYFPHWAAoJEEtJtSqsAOnWcK4P/AwBcqPa0em/HXrdCExanQXY
 8U3uCPbDua4sW1Eaw5dVFoZuVoPzhibLLaVoVIWs8LOXiD8v23VYQ8ezu0D0O9fc
 cAsrxg0MtQLF/hyyVbdihxaqCB2H/j9PDJdIdCiRindPEwm0k6KBkVMk3N8O3m2U
 xDSA+Oq8Ns5cgjx+yfOhMJbGOFUzky26SV/M+PTAIU9Sj2w7RJS9R18BtWv4EFoK
 q1sT8aEte3kryb+v/a4s9RNzWOOHqRvZ4XizOMvma9I6uX6hOU4oeLknmJx1gPnb
 U5z75uAVn+IeNRnrco3pD91N3X9hEtv4IgZhFafNseVTY9MirDX5ss4th+XrSM6y
 wKgWEC8UmcV9Y7zDV/towZjhCipIh1yJPu3493IVHB/1UDPoNDfOGpK6NuhIEZHy
 1sNY8F2j3BBnLw6Fc2uC1FxM3a9MQ9CgJWQ0y9src73VNgQ8miz1WH2rsFp5DwNu
 HdZGBXGElmhbJbNFSsRqC1j+K0Y2LzL5BVOrBblkJNpUmxufRx0LIdXE7p4tPazq
 8dVOH/Ktx+mDQFbtyA8vXK+Cyyp0c/snR3BZo3AWLfrlip6iwZPG6arN4Wu6P4Nl
 ZFWUlHKaMJS/lvsdAuCdZ/lawRvENTOEQMORJR8U7CX/7gDLV1KiaFRpB3fFDUW5
 xm5r2qsbVzElu6skk4xk
 =eOKJ
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.9-rc3' of git://git.infradead.org/linux-ubifs

Pull ubi/ubifs fixes from Richard Weinberger:
 "This contains fixes for issues in both UBI and UBIFS:

   - A regression wrt overlayfs, introduced in -rc2.
   - An UBI issue, found by Dan Carpenter's static checker"

* tag 'upstream-4.9-rc3' of git://git.infradead.org/linux-ubifs:
  ubifs: Fix regression in ubifs_readdir()
  ubi: fastmap: Fix add_vol() return value test in ubi_attach_fastmap()
2016-10-29 13:15:24 -07:00
Linus Torvalds c636e176d8 driver core fixes for 4.9-rc3
Here are two small driver core / kernfs fixes for 4.9-rc3.  One makes
 the Kconfig entry for DEBUG_TEST_DRIVER_REMOVE a bit more explicit that
 this is a crazy thing to enable for a distro kernel (thanks for trying
 Fedora!), the other resolves an issue with vim opening kernfs files
 (sysfs, configfs, etc.).
 
 Both have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iFYEABECABYFAlgU0CMPHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspgv4AoJhR
 YJeG57ReBKjlzAj497Z1X7QcAJ9GXcbbbxmwj2IcUln5I3uEyuPCkQ==
 =pS6k
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are two small driver core / kernfs fixes for 4.9-rc3.

  One makes the Kconfig entry for DEBUG_TEST_DRIVER_REMOVE a bit more
  explicit that this is a crazy thing to enable for a distro kernel
  (thanks for trying Fedora!), the other resolves an issue with vim
  opening kernfs files (sysfs, configfs, etc.)

  Both have been in linux-next with no reported issues"

* tag 'driver-core-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Make Kconfig text for DEBUG_TEST_DRIVER_REMOVE stronger
  kernfs: Add noop_fsync to supported kernfs_file_fops
2016-10-29 10:57:40 -07:00
Al Viro ad5cb123fd ceph: switch to use of ->d_init()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-28 22:05:13 -04:00
Al Viro 18fc8abdb7 ceph: unify dentry_operations instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-28 21:52:50 -04:00
Linus Torvalds f6167514c8 Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "My patch fixes the btrfs list_head abuse that we tracked down during
  Dave Jones' memory corruption investigation. With both Jens and my
  patches in place, I'm no longer able to trigger problems.

  Filipe is fixing a difficult old bug between snapshots, balance and
  send. Dave is cooking a few more for the next rc, but these are tested
  and ready"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix races on root_log_ctx lists
  btrfs: fix incremental send failure caused by balance
2016-10-28 10:07:35 -07:00
Christoph Hellwig ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Richard Weinberger a00052a296 ubifs: Fix regression in ubifs_readdir()
Commit c83ed4c9db ("ubifs: Abort readdir upon error") broke
overlayfs support because the fix exposed an internal error
code to VFS.

Reported-by: Peter Rosin <peda@axentia.se>
Tested-by: Peter Rosin <peda@axentia.se>
Reported-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Tested-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Fixes: c83ed4c9db ("ubifs: Abort readdir upon error")
Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-28 14:48:31 +02:00
Linus Torvalds 14970f204b Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "20 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  drivers/misc/sgi-gru/grumain.c: remove bogus 0x prefix from printk
  cris/arch-v32: cryptocop: print a hex number after a 0x prefix
  ipack: print a hex number after a 0x prefix
  block: DAC960: print a hex number after a 0x prefix
  fs: exofs: print a hex number after a 0x prefix
  lib/genalloc.c: start search from start of chunk
  mm: memcontrol: do not recurse in direct reclaim
  CREDITS: update credit information for Martin Kepplinger
  proc: fix NULL dereference when reading /proc/<pid>/auxv
  mm: kmemleak: ensure that the task stack is not freed during scanning
  lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB
  latent_entropy: raise CONFIG_FRAME_WARN by default
  kconfig.h: remove config_enabled() macro
  ipc: account for kmem usage on mqueue and msg
  mm/slab: improve performance of gathering slabinfo stats
  mm: page_alloc: use KERN_CONT where appropriate
  mm/list_lru.c: avoid error-path NULL pointer deref
  h8300: fix syscall restarting
  kcov: properly check if we are in an interrupt
  mm/slab: fix kmemcg cache creation delayed issue
2016-10-27 19:58:39 -07:00
Uwe Kleine-König 14f947c87a fs: exofs: print a hex number after a 0x prefix
It makes the message hard to interpret correctly if a base 10 number is
prefixed by 0x.  So change to a hex number.

Link: http://lkml.kernel.org/r/20161026125658.25728-2-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Boaz Harrosh <ooo@electrozaur.com>
Cc: Benny Halevy <bhalevy@primarydata.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 18:43:43 -07:00
Leon Yu 06b2849d10 proc: fix NULL dereference when reading /proc/<pid>/auxv
Reading auxv of any kernel thread results in NULL pointer dereferencing
in auxv_read() where mm can be NULL.  Fix that by checking for NULL mm
and bailing out early.  This is also the original behavior changed by
recent commit c531716785 ("proc: switch auxv to use of __mem_open()").

  # cat /proc/2/auxv
  Unable to handle kernel NULL pointer dereference at virtual address 000000a8
  Internal error: Oops: 17 [#1] PREEMPT SMP ARM
  CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
  Hardware name: BCM2709
  task: ea3b0b00 task.stack: e99b2000
  PC is at auxv_read+0x24/0x4c
  LR is at do_readv_writev+0x2fc/0x37c
  Process cat (pid: 113, stack limit = 0xe99b2210)
  Call chain:
    auxv_read
    do_readv_writev
    vfs_readv
    default_file_splice_read
    splice_direct_to_actor
    do_splice_direct
    do_sendfile
    SyS_sendfile64
    ret_fast_syscall

Fixes: c531716785 ("proc: switch auxv to use of __mem_open()")
Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Janis Danisevskis <jdanis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 18:43:43 -07:00
Johannes Berg 56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg 489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Linus Torvalds e3300ffef0 orangefs: a couple of cleanups sent in by other developers
use d_fsdata instead of d_time
     Miklos Szeredi <mszeredi@redhat.com>
 
   use file_inode(file) instead of file->f_path.dentry->d_inode
     Amir Goldstein <amir73il@gmail.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJYD65MAAoJEM9EDqnrzg2+kaUP/0HPDYJyWSgbGVSKuNqOiyml
 VbAGRbDAcpyYCFww2cRO9Xvvh6bJmGEqZUUbNxgi3q5L2KnvvoQ0jkHFfHaVii53
 uWP0WGrxBcRNxv72jfo1cBxYTcTqEfzXZBQb6HhzfbjMCvejbhSbYDowElTE7Oar
 AwcgEdv0Utm7zD/0K+OW56Q4fUYzOSFI4c/tNGUyjQCLE+N3R2roXdivz3maEfee
 uDg262lfQgkzbEYGJOdt8MpUak6YEp2bFa+Xf8bRoKMze8KbVDLwuTlYXuSdc/i8
 e8QO/Zr+irX/jJ/Sc998FwGquUljPuxz4wHSNEVO3HqYFIe30zkUD0mqQcxqx6YD
 F4DhSn8Ok5PuKv5aw1Q7AMA0Zd+bKaJzb/E0JdlHn1n9PFMiod82rdTfmGxP1rZb
 BwuOW/dsp/RLBZhCYpkNTBiNAH+TSIp8M7eOavO68AZ2zJXN69e/Qv2iJsaAZJZ0
 of+i9I4kmXUS4F6OPjgT6xJbH4aD/X4/jei4dPKDATXM0MW+GsZ7VodAYmAqGGCO
 l66UoL4o11BCMJNGfsdPxWJkUgpn4OBb+RSkS0f6qQ7Nlp1OaYeRYKNbX5ICHcgj
 A0PHXZ8Pub3iVgX5xUrQmYk3txbLt0ISDYBXzfPZ0rreztN0o5FRB4TNVLC82VwJ
 XHBdehhgLsNc1PMKSzZo
 =b9Ly
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.9-rc2-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull oreangefs updates from Mike Marshall:
 "A couple of orangefs cleanups sent in by other developers:

   - use d_fsdata instead of d_time (Miklos Szeredi)

   - use file_inode(file) instead of file->f_path.dentry->d_inode (Amir
     Goldstein)"

* tag 'for-linus-4.9-rc2-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: don't use d_time
  orangefs: user file_inode() where it is due
2016-10-27 12:52:46 -07:00
Linus Torvalds e890038e6a xfs: updates for 4.9-rc3
Changes in this update:
 o iomap page offset masking fix for page faults
 o add IOMAP_REPORT to distinguish between read and fiemap map requests
 o cleanups to new shared data extent code
 o fix mount active status on failed log recovery
 o fix broken dquots in a buffer calculation
 o fix locking order issues and merge xfs_reflink_remap_range and
   xfs_file_share_range
 o rework unmapping of CoW extents and remove now unused functions
 o clean state when CoW is done.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYEfdWAAoJEK3oKUf0dfoddg4P/0Tl/i58sBL/Um90kSGOjxjI
 yaOKuFImS3MFSYDwiYADnXdhq6BgVLUJWS07t9/P6Nn3OZr1wBCZDZdyRS1+JwAA
 qOui4sp/v21HprydscN+BAdxyYmuo4yFgu9lkFFSM55yiaAb5C8hsYKF42Gja1+m
 gS40/Lsa5nauSz58UOZ5oEljAvBldAdyMlk8rVSGXVm7+pqs7Lxmhjif/Y8y/Y+i
 097auIrGk+oRDukXqhtZyCQ7VP99WzM+ksajtrNwVOOzSMhrcDCHKuLe0i4LsyjN
 UTx1ioY/AD8PUYhSmLqALD9vtFHnJbx50/MQFHNLc+hDQb2jb/jQmqx9LyEYDt38
 sw/Wy55hh9PylILdE//bWH0vSgqmnNCWviBUzjDtAJ9FKfv19slFlwtu2K4lOHoq
 C6Q2uh2mB7BC6efksk9DeA6/N9tFQuiXa48sN5+D2zMfZAmdkgzDCKfGrpRnS1Yl
 4h+sfiK/DTf11Q2nTaPAHylt02SmHsikQWvb5Fxu76UI8k4RsjCZc3ep/NUNJBlU
 E8f+cdNlAF5k/AWBY7107N1iUqL/vS2wXLdburJkckmQqRcI5WuRaLhi9g4tFjFI
 o+m9EM1WuOP6jeOuVImwgCRJoLVnTVKwee/d4J8y9Ad//Rs6B9pB0SIDfxJa9LY6
 B1XjT8z/NVyK6GsfP1Qs
 =LDDu
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-linus-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fixes from Dave Chinner:
 "This update contains fixes for most of the outstanding regressions
  introduced with the 4.9-rc1 XFS merge. There is also a fix for an
  iomap bug, too.

  This is a quite a bit larger than I'd prefer for a -rc3, but most of
  the change comes from cleaning up the new reflink copy on write code;
  it's much simpler and easier to understand now. These changes fixed
  several bugs in the new code, and it wasn't clear that there was an
  easier/simpler way to fix them. The rest of the fixes are the usual
  size you'd expect at this stage.

  I've left the commits to soak in linux-next for a some extra time
  because of the size before asking you to pull, no new problems with
  them have been reported so I think it's all OK.

  Summary:
   - iomap page offset masking fix for page faults
   - add IOMAP_REPORT to distinguish between read and fiemap map
     requests
   - cleanups to new shared data extent code
   - fix mount active status on failed log recovery
   - fix broken dquots in a buffer calculation
   - fix locking order issues and merge xfs_reflink_remap_range and
     xfs_file_share_range
   - rework unmapping of CoW extents and remove now unused functions
   - clean state when CoW is done"

* tag 'xfs-fixes-for-linus-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (25 commits)
  xfs: clear cowblocks tag when cow fork is emptied
  xfs: fix up inode cowblocks tracking tracepoints
  fs: Do to trim high file position bits in iomap_page_mkwrite_actor
  xfs: remove xfs_bunmapi_cow
  xfs: optimize xfs_reflink_end_cow
  xfs: optimize xfs_reflink_cancel_cow_blocks
  xfs: refactor xfs_bunmapi_cow
  xfs: optimize writes to reflink files
  xfs: don't bother looking at the refcount tree for reads
  xfs: handle "raw" delayed extents xfs_reflink_trim_around_shared
  xfs: add xfs_trim_extent
  iomap: add IOMAP_REPORT
  xfs: merge xfs_reflink_remap_range and xfs_file_share_range
  xfs: remove xfs_file_wait_for_io
  xfs: move inode locking from xfs_reflink_remap_range to xfs_file_share_range
  xfs: fix the same_inode check in xfs_file_share_range
  xfs: remove the same fs check from xfs_file_share_range
  libxfs: v3 inodes are only valid on crc-enabled filesystems
  libxfs: clean up _calc_dquots_per_chunk
  xfs: unset MS_ACTIVE if mount fails
  ...
2016-10-27 12:34:50 -07:00
Chris Mason 570dd45042 btrfs: fix races on root_log_ctx lists
btrfs_remove_all_log_ctxs takes a shortcut where it avoids walking the
list because it knows all of the waiters are patiently waiting for the
commit to finish.

But, there's a small race where btrfs_sync_log can remove itself from
the list if it finds a log commit is already done.  Also, it uses
list_del_init() to remove itself from the list, but there's no way to
know if btrfs_remove_all_log_ctxs has already run, so we don't know for
sure if it is safe to call list_del_init().

This gets rid of all the shortcuts for btrfs_remove_all_log_ctxs(), and
just calls it with the proper locking.

This is part two of the corruption fixed by cbd60aa7cd.  I should have
done this in the first place, but convinced myself the optimizations were
safe.  A 12 hour run of dbench 2048 will eventually trigger a list debug
WARN_ON for the list_del_init() in btrfs_sync_log().

Fixes: d1433debe7
Reported-by: Dave Jones <davej@codemonkey.org.uk>
cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Chris Mason <clm@fb.com>
2016-10-27 10:42:20 -07:00
Tony Luck 2a9becdd4d kernfs: Add noop_fsync to supported kernfs_file_fops
If you edit a kernfs backed file with vi(1), you see an ugly error
message when you write the file because vi tries to fsync(2) the
file after writing, which fails.

We have noop_fsync() for this, use it.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-27 17:47:11 +02:00
Linus Torvalds 272ddc8b37 proc: don't use FOLL_FORCE for reading cmdline and environment
Now that Lorenzo cleaned things up and made the FOLL_FORCE users
explicit, it becomes obvious how some of them don't really need
FOLL_FORCE at all.

So remove FOLL_FORCE from the proc code that reads the command line and
arguments from user space.

The mem_rw() function actually does want FOLL_FORCE, because gdd (and
possibly many other debuggers) use it as a much more convenient version
of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part
conditional on actually being a ptracer.  This does not actually do
that, just moves adds a comment to that effect and moves the gup_flags
settings next to each other.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-24 19:00:44 -07:00
Jeff Layton 0cc11a61b8 nfsd: move blocked lock handling under a dedicated spinlock
Bruce was hitting some lockdep warnings in testing, showing that we
could hit a deadlock with the new CB_NOTIFY_LOCK handling, involving a
rather complex situation involving four different spinlocks.

The crux of the matter is that we end up taking the nn->client_lock in
the lm_notify handler. The simplest fix is to just declare a new
per-nfsd_net spinlock to protect the new CB_NOTIFY_LOCK structures.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-24 16:51:21 -04:00
Miklos Szeredi 804b1737d7 orangefs: don't use d_time
Instead use d_fsdata which is the same size.  Hoping to get rid of d_time,
which is used by very few filesystems by this time.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-10-24 14:50:07 -04:00
Amir Goldstein d62a9025ae orangefs: user file_inode() where it is due
Replace wrong use of file->f_path.dentry->d_inode with file_inode(file).
In case orangefs ever finds itself as an overelayfs layer, it would want
to get its own inode and not overlayfs's inode.

DISCLAIMER: I did not test this patch because I do not know how to setup
            an orangefs mount

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-10-24 14:29:39 -04:00
Arnd Bergmann 68a564006a NFSv4.1: work around -Wmaybe-uninitialized warning
A bugfix introduced a harmless gcc warning in nfs4_slot_seqid_in_use
if we enable -Wmaybe-uninitialized again:

fs/nfs/nfs4session.c:203:54: error: 'cur_seq' may be used uninitialized in this function [-Werror=maybe-uninitialized]

gcc is not smart enough to conclude that the IS_ERR/PTR_ERR pair
results in a nonzero return value here. Using PTR_ERR_OR_ZERO()
instead makes this clear to the compiler.

The warning originally did not appear in v4.8 as it was globally
disabled, but the bugfix that introduced the warning got backported
to stable kernels which again enable it, and this is now the only
warning in the v4.7 builds.

Fixes: e09c978aae ("NFSv4.1: Fix Oopsable condition in server callback races")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-24 13:54:43 -04:00
Wang Xiaoguang 9d1032cc49 btrfs: fix WARNING in btrfs_select_ref_head()
This issue was found when testing in-band dedupe enospc behaviour,
sometimes run_one_delayed_ref() may fail for enospc reason, then
__btrfs_run_delayed_refs()will return, but forget to add num_heads_read
back, which will trigger "WARN_ON(delayed_refs->num_heads_ready == 0)" in
btrfs_select_ref_head().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Dan Carpenter 9c894696f5 Btrfs: remove some no-op casts
We cast 0 to a u8 but then because of type promotion, it's immediately
cast to int back to int before we do a bitwise negate.  The cast doesn't
matter in this case, the code works as intended.  It causes a static
checker warning though so let's remove it.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Wang Xiaoguang dd4b857aab btrfs: pass correct args to btrfs_async_run_delayed_refs()
In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), we
swap the arg2 and arg3 wrongly, fix this.

This bug just impacts asynchronous delayed refs handle when we truncate inodes.
In delayed_ref_async_start(), there is such codes:

    trans = btrfs_join_transaction(async->root);
    if (trans->transid > async->transid)
        goto end;
    ret = btrfs_run_delayed_refs(trans, async->root, async->count);

From this codes, we can see that this just influence whether can we handle
delayed refs or the number of delayed refs to handle, this may impact
performance, but will not result in missing delayed refs, all delayed refs will
be handled in btrfs_commit_transaction().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Wang Xiaoguang 69ae5e4459 btrfs: make file clone aware of fatal signals
Indeed this just make the behavior similar to xfs when process has
fatal signals pending, and it'll make fstests/generic/298 happy.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Goldwyn Rodrigues 0b34c261e2 btrfs: qgroup: Prevent qgroup->reserved from going subzero
While free'ing qgroup->reserved resources, we much check if
the page has not been invalidated by a truncate operation
by checking if the page is still dirty before reducing the
qgroup resources. Resources in such a case are free'd when
the entire extent is released by delayed_ref.

This fixes a double accounting while releasing resources
in case of truncating a file, reproduced by the following testcase.

SCRATCH_DEV=/dev/vdb
SCRATCH_MNT=/mnt
mkfs.btrfs -f $SCRATCH_DEV
mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
cd $SCRATCH_MNT
btrfs quota enable $SCRATCH_MNT
btrfs subvolume create a
btrfs qgroup limit 500m a $SCRATCH_MNT
sync
for c in {1..15}; do
dd if=/dev/zero  bs=1M count=40 of=$SCRATCH_MNT/a/file;
done

sleep 10
sync
sleep 5

touch $SCRATCH_MNT/a/newfile

echo "Removing file"
rm $SCRATCH_MNT/a/file

Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:21 +02:00
Benjamin Coddington 86a6c211d6 NFS: Trim extra slash in v4 nfs_path
A NFSv4 mount of a subdirectory will show an extra slash (as in
'server://path') in proc's mountinfo which will not match the device name
and path.  This can cause problems for programs searching for the mount.
Fix this by checking for a leading slash in the dentry path, if so trim
away any trailing slashes in the device name.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-24 12:06:01 -04:00
Wei Yongjun 26c1ec2fe4 dlm: fix error return code in sctp_accept_from_sock()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-24 10:01:51 -05:00
Mauro Carvalho Chehab 8c27ceff36 docs: fix locations of several documents that got moved
The previous patch renamed several files that are cross-referenced
along the Kernel documentation. Adjust the links to point to
the right places.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2016-10-24 08:12:35 -02:00
Darrick J. Wong b77428b12b xfs: defer should abort intent items if the trans roll fails
If the deferred ops transaction roll fails, we need to abort the intent
items if we haven't already logged a done item for it, regardless of
whether or not the deferred ops has had a transaction committed.  Dave
found this while running generic/388.

Move the tracepoint to make it easier to track object lifetimes.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-24 14:21:18 +11:00
Brian Foster c17a8ef43d xfs: clear cowblocks tag when cow fork is emptied
The background cowblocks scan job takes care of scanning for inodes with
potentially lingering blocks in the cow fork and clearing them out. If
the background scanner reclaims the cow fork blocks, however, it doesn't
immediately clear the cowblocks tag from the inode. Instead, the inode
remains tagged until the background scanner comes around again,
discovers the inode cow fork has no blocks, clears the tag and fires the
trace_xfs_inode_free_cowblocks_invalid() tracepoint to indicate that the
inode may have been incorrectly tagged.

This is not a major functional problem as the tag is ultimately cleared.
Nonetheless, clear the tag when an inode cow fork is explicitly emptied
to avoid the extra round trip through the background scanner and
spurious "invalid" tracepoint.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-24 14:21:08 +11:00
Brian Foster 7b7381f043 xfs: fix up inode cowblocks tracking tracepoints
These calls are still using the eofblocks tracepoints. The cowblocks
equivalents are already defined, we just aren't actually calling them.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-24 14:21:00 +11:00
Jan Kara c663e29f88 fs: Do to trim high file position bits in iomap_page_mkwrite_actor
iomap_page_mkwrite_actor() calls __block_write_begin_int() with position
masked as pos & ~PAGE_MASK which is equivalent to pos & (PAGE_SIZE-1).
Thus it masks off high bits of file position. However
__block_write_begin_int() expects full file position on input. This does
not cause any visible issues because all __block_write_begin_int()
really cares about are low file position bits but still it is a bug
waiting to happen.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-24 14:20:25 +11:00
Linus Torvalds 5ff93abc7a This pull requests contains fixes for issues in both UBI and UBIFS:
- Fallout from the merge window, refactoring UBI code introduced some issues.
 - Fixes for an UBIFS readdir bug which can cause getdents() to busy loop
   for ever and a bug in the UBIFS xattr code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJYDMa5AAoJEEtJtSqsAOnWxKgP+wT4lCM9gP3/1FywhrJRxA4Z
 vH9YYWP6vjdYhZP8tt3RVIrJ/BxPMDx8+7IkZBzxVRQcnvaoaGibgEsfkGmTngyW
 2rFVqDuwFIDIWLvNrKW26ep4p1Ek8yZhIIcW4upHKtnaJpZwvn6BmwxRep1JeLuc
 yZjGIJtejRbvuuaVwEBu+Et3Rlflg5/D6oPWOJfYqwjJjxihkb4hfAgzJkLeBK3Q
 Qw65S8FxKDPa7vAj2+jor3Cq0ETg3b2cQR4+UnGmDat9RVMquS3dDTBzBn6TNZx+
 xw2aiOPi0JPMeEnJP+Z61/moeQhlLddZsEVdRQ5Ud6LcOeq6Rg7v5J+POkQ0hhIy
 DUfxHjnsmB4P9XqtaGGr74d8trjIm15cL6yAVKG/jMnb11oCWVDVyr0FmsXSmO7I
 O+b6P9hM7C3o+eAETdCLhd8Jg5isOm27WWQ2Bqq2FOjY9EmvTIFl+Imp+++3YHA6
 R6jlFfMbju0gCfyPZdDPmTc91CPtWdTze43bpIdl2N3L2/efG2I0xFjjlr+WWEkL
 htYQr+b3vjO+moTl8KvT7pmvVNPUtNljOZsHHJjrsBLvuMDb0+7X1Wy860klTOPp
 B7NntTqwBUF6HtPpeebHvEfBiTruyspGZfokvkud6rqPuO1DbsJrVNY7Lwh9XA8M
 iGn9LwwlNjQYiyZNx0GT
 =Gjo0
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.9-rc2' of git://git.infradead.org/linux-ubifs

Pull UBI[FS] fixes from Richard Weinberger:
 "This contains fixes for issues in both UBI and UBIFS:

   - Fallout from the merge window, refactoring UBI code introduced some
     issues.

   - Fixes for an UBIFS readdir bug which can cause getdents() to busy
     loop for ever and a bug in the UBIFS xattr code"

* tag 'upstream-4.9-rc2' of git://git.infradead.org/linux-ubifs:
  ubifs: Abort readdir upon error
  UBI: Fix crash in try_recover_peb()
  ubi: fix swapped arguments to call to ubi_alloc_aeb
  ubifs: Fix xattr_names length in exit paths
  ubifs: Rename ubifs_rename2
2016-10-23 16:58:55 -07:00
Linus Torvalds c761923cb8 A few bug fixes and add some missing KERN_CONT annotations
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJYDK6KAAoJEPL5WVaVDYGjhZ0H/2aLu4BQOmIPJZBBS+I2FurE
 7FFdnQ8r1gBPWktvfUTn6MzTE4VKe0b1js5EiRCiCJhJq9UadBu53dUWTgfZ5Egi
 Sc6p0NGqDRgixLXbFRt8wP7iPtVg0tlysE0EJ6ae4VA1wUpf5aoHaPqgO9V0hirW
 9pUJq8kzBGs628CROcYtQ5IL5AfouM1q/fzazw4Voz48LTgvhnDGCkqQmNsKkRo+
 bN5tkjSTQUdW3OrRVsNwNND/iDYpTa6PcX1XXQiFFhQ4SbZoNS/dzowz09QreGxA
 Uz/rt2hMnv552Zd52d5q6N/jPWg+O+x0b4PcYtn7NDjPn/1KZUyX0pQK/EoevXQ=
 =2Kri
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A few bug fixes and add some missing KERN_CONT annotations"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add missing KERN_CONT to a few more debugging uses
  fscrypto: lock inode while setting encryption policy
  ext4: correct endianness conversion in __xattr_check_inode()
  fscrypto: make XTS tweak initialization endian-independent
  ext4: do not advertise encryption support when disabled
  jbd2: fix incorrect unlock on j_list_lock
  ext4: super.c: Update logging style using KERN_CONT
2016-10-23 16:52:19 -07:00
Linus Torvalds 86c5bf7101 Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull vmap stack fixes from Ingo Molnar:
 "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
  accesses that used to be just somewhat questionable are now totally
  buggy.

  These changes try to do it without breaking the ABI: the fields are
  left there, they are just reporting zero, or reporting narrower
  information (the maps file change)"

* 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
  fs/proc: Stop trying to report thread stacks
  fs/proc: Stop reporting eip and esp in /proc/PID/stat
  mm/numa: Remove duplicated include from mprotect.c
2016-10-22 09:39:10 -07:00
Linus Torvalds 02593ac680 NFS client bugfixes for Linux 4.9
Stable bugfix:
 - Fix last_write_offset incorrectly set to page boundary
 
 Other bugfix:
 - Fix missing-braces warning
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYCnklAAoJENfLVL+wpUDrv5MQALGRrZyyvQVCGwHt8BhiZDMp
 5OAB1B7mFF0yf/L7j5rLUEvXs6+YyGVHTRrqWlAm1Mq7aqqGjW3YcE260KOJwse3
 sk0eZ8mj92Bbm19ktRRGJCWeeCi16BsywIJEIbFFLs0ssKltSJMMnhE/8gyZ3Oj1
 /TQ0jFCsAGxErr9GVny9FiVa4mlgkauOEfY/QJsgMHH7FBYftBU7mo7rH43RaxQ7
 XLLv9XTe/WFCedxAa0uY/SikmAplLCpShOHCnCvveOF4WhKdx1gjaCnp6nZSMMP8
 Hyd49AZHfxlQWK3B6amhHtI5iU2/tyNl8aFN49PXUdbN1VqJoSFnMCZgc/BWKIsQ
 NGpUuQSTqz2qnMtHC3sErWfi2/c9kNDn9R3DPkTJPtZKoE0+FHnnxlhTWl9YSvju
 iW4hisaDbldmP2davoMeKKDIrP9g+z0+8akcZx4lSoVEhswVtsDzpFpGETL8bM6Y
 0002b8UU0qj4QVLUoW1HNCad5/H0G3ir0utXr+//OduQb2SMAilQmscltOcFXzfe
 TzR6YD7RP2RZs/t5fqnxlvBB2kYkSa8vWC/dJdVC5MC0nq6L1yO1n6L0p+E7Keck
 9S2fWi89WnGN4guKxtIOo58vbr6wAcA+g6zM35WwLDgxtklQHZCKOTPJEsPbnlSr
 DpeZFTwLeG7/SENBFZhI
 =VadI
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Just two bugfixes this time:

  Stable bugfix:
   - Fix last_write_offset incorrectly set to page boundary

  Other bugfix:
   - Fix missing-braces warning"

* tag 'nfs-for-4.9-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  nfs4: fix missing-braces warning
  pnfs/blocklayout: fix last_write_offset incorrectly set to page boundary
2016-10-21 19:06:59 -07:00
Linus Torvalds bdcff41597 rbd exclusive-lock edge case fix and several filesystem fixups.
Nikolay's error path patch is tagged for stable, everything else but
 readdir vs frags race was introduced in 4.9-rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYCPFwAAoJEEp/3jgCEfOLQxkH/3t7m/NaC0S+1eISHQWne0rs
 GtI4wx6Yh5KUV0SKgzYTYs0AEusW459XvUzwLwe/Tp9Qdp/KehviGJdQY8WBP6Es
 J5u7WLU+Ja1GwB586YUzhG7L3PAi8DXxbkTB+MYB4circhZ0w8ecuJUL4o++5VuH
 yAfoKn6tFyCTpvhFGd9dBPn3tVl90/vpwiH/hHp04PWHq6dNvLyJuIbvUD4JaV3O
 NYQqq3fFG76jqwyu2dE0DN4IPNb3tUjJ1oY86Uvkq7DP4ZiI61JNx45XTW1XIplx
 lWi2f2MurwznAJZl9kaU0TiTdS7liizkRdb2cu56nMRmzVSDz+va5X3CdDSpQtg=
 =JwMW
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc2' of git://github.com/ceph/ceph-client

Pull Ceph fixes from Ilya Dryomov:
 "An rbd exclusive-lock edge case fix and several filesystem fixups.

  Nikolay's error path patch is tagged for stable, everything else but
  readdir vs frags race was introduced in this merge window"

* tag 'ceph-for-4.9-rc2' of git://github.com/ceph/ceph-client:
  ceph: fix non static symbol warning
  ceph: fix uninitialized dentry pointer in ceph_real_mount()
  ceph: fix readdir vs fragmentation race
  ceph: fix error handling in ceph_read_iter
  rbd: don't retry watch reregistration if header object is gone
  rbd: don't wait for the lock forever if blacklisted
2016-10-20 09:57:51 -07:00
Linus Torvalds a28ad14e05 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc filesystem fixes from Jan Kara:
 "A fix for an isofs change apparently breaking mount(8) in some cases
  and one ext2 warning fix"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: avoid bogus -Wmaybe-uninitialized warning
  isofs: Do not return EACCES for unknown filesystems
2016-10-20 08:49:03 -07:00
Andy Lutomirski b18cb64ead fs/proc: Stop trying to report thread stacks
This reverts more of:

  b76437579d ("procfs: mark thread stack correctly in proc/<pid>/maps")

... which was partially reverted by:

  65376df582 ("proc: revert /proc/<pid>/maps [stack:TID] annotation")

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.

In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA.  This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone:
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range.  This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 09:21:41 +02:00
Andy Lutomirski 0a1eb2d474 fs/proc: Stop reporting eip and esp in /proc/PID/stat
Reporting these fields on a non-current task is dangerous.  If the
task is in any state other than normal kernel code, they may contain
garbage or even kernel addresses on some architectures.  (x86_64
used to do this.  I bet lots of architectures still do.)  With
CONFIG_THREAD_INFO_IN_TASK=y, it can OOPS, too.

As far as I know, there are no use programs that make any material
use of these fields, so just get rid of them.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/a5fed4c3f4e33ed25d4bb03567e329bc5a712bcc.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 09:21:41 +02:00
Christoph Hellwig 64e6428ddd xfs: remove xfs_bunmapi_cow
Since no one uses it anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:59 +11:00
Christoph Hellwig c1112b6e62 xfs: optimize xfs_reflink_end_cow
Instead of doing a full extent list search for each extent that is
to be deleted using xfs_bmapi_read and then doing another one inside
of xfs_bunmapi_cow use the same scheme that xfs_bumapi uses:  look
up the last extent to be deleted and then use the extent index to
walk downward until we are outside the range to be deleted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:45 +11:00
Christoph Hellwig 3e0ee78f7a xfs: optimize xfs_reflink_cancel_cow_blocks
Rewrite xfs_reflink_cancel_cow_blocks so that we only do a search for
the first extent in the extent list and then iterate over the remaining
extents using the extent index, passing the extent we operate on
directly to xfs_bmap_del_extent_delay or xfs_bmap_del_extent_cow instead
of going through xfs_bunmapi and doing yet another extent list lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:31 +11:00
Christoph Hellwig fa5c836ca8 xfs: refactor xfs_bunmapi_cow
Split out two helpers for deleting delayed or real extents from the COW fork.
This allows to call them directly from xfs_reflink_cow_end_io once that
function is refactored to iterate the extent tree.  It will also allow
to reuse the delalloc deletion from xfs_bunmapi in the future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:14 +11:00
Christoph Hellwig 3ba020befe xfs: optimize writes to reflink files
Instead of reserving space as the first thing in write_begin move it past
reading the extent in the data fork.  That way we only have to read from
the data fork once and can reuse that information for trimming the extent
to the shared/unshared boundary.  Additionally this allows to easily
limit the actual write size to said boundary, and avoid a roundtrip on the
ilock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:53:50 +11:00
Christoph Hellwig 5f9268ca53 xfs: don't bother looking at the refcount tree for reads
There is no need to trim an extent into a shared or non-shared one, or
report any flags for plain old reads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:53:32 +11:00
Christoph Hellwig 62c5ac89de xfs: handle "raw" delayed extents xfs_reflink_trim_around_shared
Delalloc extents in the extent list contain the number of reserved
indirect blocks in their startblock value and don't use the magic
DELAYSTARTBLOCK constant.  Ensure that xfs_reflink_trim_around_shared
handles them properly by checking for isnullstartblock().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:52:00 +11:00
Darrick J. Wong 0a0af28cad xfs: add xfs_trim_extent
This helpers allows to trim an extent to a subset of it's original range
while making sure the block numbers in it remain valid,

In the future xfs_trim_extent and xfs_bmapi_trim_map should probably be
merged in some form.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: split from a previous patch from Darrick, moved around and added
 support for "raw" delayed extents"]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:51:50 +11:00
Christoph Hellwig d33fd776f9 iomap: add IOMAP_REPORT
This allows the file system to tell a FIEMAP from a read operation, and thus
avoids the need to report flags that aren't actually used in the read path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:51:28 +11:00
Christoph Hellwig 5faaf4fa0a xfs: merge xfs_reflink_remap_range and xfs_file_share_range
There is no clear division of responsibility between those functions, so
just merge them into one to keep the code simple.  Also move
xfs_file_wait_for_io to xfs_reflink.c together with its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:50:07 +11:00
Christoph Hellwig ec40759902 xfs: remove xfs_file_wait_for_io
filemap_write_and_wait_range operates on full pages, so there is no
need for the rounding operations.  Additionally this allows us to
micro-optimize by skipping the second inode_dio_wait for a
intra-file clone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:49:55 +11:00
Christoph Hellwig 576177818e xfs: move inode locking from xfs_reflink_remap_range to xfs_file_share_range
We need the iolock protection to stabilizie the IS_SWAPFILE and
IS_IMMUTABLE values, as well as preventing new buffered writers
re-dirtying the file data that we just wrote out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:49:19 +11:00
Christoph Hellwig a62e82b35b xfs: fix the same_inode check in xfs_file_share_range
The VFS i_ino is an unsigned long, while XFS inode numbers are 64-bit
wide, so checking i_ino for equality could lead to rate false positives
on 32-bit architectures.  Just compare the inode pointers themselves
to be safe.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:49:03 +11:00
Christoph Hellwig 4fbc2c6525 xfs: remove the same fs check from xfs_file_share_range
The VFS already does the check, and the placement of this duplicate
is in the way of the following locking rework.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:48:54 +11:00
Roger Willcocks 8cdcc8102c libxfs: v3 inodes are only valid on crc-enabled filesystems
xfs_repair was not detecting that version 3 inodes are invalid for
for non-CRC filesystems. The result is specific inode corruptions go
undetected and hence aren't repaired if only the version number is
out of range.

The core of the problem is that the XFS_DINODE_GOOD_VERSION() macro
doesn't know that valid inode versions are dependent on a superblock
version number. Fix this in libxfs, and propagate the new function
out into the rest of xfsprogs to fix the issue.

[Darrick: port to kernel from xfsprogs]

Reported-by: Leslie Rhorer <lrhorer@mygrande.net>
Signed-off-by: Roger Willcocks <roger@filmlight.ltd.uk>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:48:38 +11:00
Darrick J. Wong 58d7896785 libxfs: clean up _calc_dquots_per_chunk
The function xfs_calc_dquots_per_chunk takes a parameter in units
of basic blocks.  The kernel seems to get the units wrong, but
userspace got 'fixed' by commenting out the unnecessary conversion.
Fix both.

cc: <stable@vger.kernel.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:46:18 +11:00
Darrick J. Wong d099245297 xfs: unset MS_ACTIVE if mount fails
As part of the inode block map intent log item recovery process, we
had to set the IRECOVERY flag to prevent an unlinked inode from
being truncated during the first iput call.  This required us to set
MS_ACTIVE so that iput puts the inode on the lru instead of
immediately evicting the inode.

Unfortunately, if the mount fails later on, the inodes that have
been loaded (root dir and realtime) actually need to be evicted
since we're aborting the mount.  If we don't clear MS_ACTIVE in the
failure step, those inodes are not evicted and therefore leak.   The
leak was found by running xfs/130 and rmmoding xfs immediately after
the test.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:45:40 +11:00
Eric Sandeen fe23759eaf xfs: remove pointless error goto in xfs_bmap_remap_alloc
The commit:

f65306ea xfs: map an inode's offset to an exact physical block

added a pointless error0: target; remove it.

Addresses-Coverity-Id: 1373865
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:44:53 +11:00
Christoph Hellwig 0ee7a3f6b5 xfs: don't take the IOLOCK exclusive for direct I/O page invalidation
XFS historically took the iolock exclusive when invalidating pages
before direct I/O operations to protect against writeback starvations.

But this writeback starvation issues has been fixed a long time ago
in the core writeback code, and all other file systems manage to do
without the exclusive lock.  Convert XFS over to avoid the exclusive
lock in this case, and also move to range invalidations like done
by the other file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:44:14 +11:00
Eric Biggers f1b8243c55 xfs: add some 'static' annotations
sparse reported that several variables and a function were not
forward-declared anywhere and therefore should be 'static'.

Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/xfs/'

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:42:30 +11:00
Geert Uytterhoeven 1be7f9be0e xfs: Fix uninitialized variable in xfs_reflink_reserve_cow_range()
with gcc 4.1.2:

    fs/xfs/xfs_reflink.c: In function xfs_reflink_reserve_cow_range:
    fs/xfs/xfs_reflink.c:327: warning: error may be used uninitialized in this function

Indeed, if "count" is zero, the function will return an uninitialized
error value.

While "count" is unlikely to be zero, this function is called through
the public iomap API. Hence fix this by preinitializing error to zero.

Fixes: 2a06705cd5 ("xfs: create delalloc extents in CoW fork")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:41:48 +11:00
Colin Ian King 1d55a4bfd0 xfs: remove redundant assignment of ifp
Remove redundant ifp = ifp statement, it does nothing. Found with
static analysis by CoverityScan.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:40:55 +11:00
Richard Weinberger c83ed4c9db ubifs: Abort readdir upon error
If UBIFS is facing an error while walking a directory, it reports this
error and ubifs_readdir() returns the error code. But the VFS readdir
logic does not make the getdents system call fail in all cases. When the
readdir cursor indicates that more entries are present, the system call
will just return and the libc wrapper will try again since it also
knows that more entries are present.
This causes the libc wrapper to busy loop for ever when a directory is
corrupted on UBIFS.
A common approach do deal with corrupted directory entries is
skipping them by setting the cursor to the next entry. On UBIFS this
approach is not possible since we cannot compute the next directory
entry cursor position without reading the current entry. So all we can
do is setting the cursor to the "no more entries" position and make
getdents exit.

Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-20 00:06:11 +02:00
Richard Weinberger 843741c577 ubifs: Fix xattr_names length in exit paths
When the operation fails we also have to undo the changes
we made to ->xattr_names. Otherwise listxattr() will report
wrong lengths.

Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-20 00:05:54 +02:00
Richard Weinberger 390975ac39 ubifs: Rename ubifs_rename2
Since ->rename2 is gone, rename ubifs_rename2() to ubifs_rename().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-20 00:05:47 +02:00
Arnd Bergmann 83aa3e0f79 nfs4: fix missing-braces warning
A bugfix introduced a harmless warning for update_open_stateid:

fs/nfs/nfs4proc.c:1548:2: error: missing braces around initializer [-Werror=missing-braces]

Removing the zero in the initializer will do the right thing here
and initialize the entire structure to zero.

Fixes: 1393d9612b ("NFSv4: Fix a race when updating an open_stateid")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-19 14:39:15 -04:00
Bob Peterson aa9f101285 dlm: don't specify WQ_UNBOUND for the ast callback workqueue
This patch removes the WQ_UNBOUND flag (which implies WQ_HIGHPRI)
from the DLM's ast work queue, in favor of just WQ_HIGHPRI.
This has been shown to cause a 19 percent performance increase for
simultaneous inode creates on GFS2 with fs_mark.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:13:04 -05:00
Bob Peterson d2fee58a3b dlm: remove lock_sock to avoid scheduling while atomic
Before this patch, functions save_callbacks and restore_callbacks
called function lock_sock and release_sock to prevent other processes
from messing with the struct sock while the callbacks were saved and
restored. However, function add_sock calls write_lock_bh prior to
calling it save_callbacks, which disables preempts. So the call to
lock_sock would try to schedule when we can't schedule.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:00:03 -05:00
Bob Peterson 3735b4b9f1 dlm: don't save callbacks after accept
When DLM calls accept() on a socket, the comm code copies the sk
after we've saved its callbacks. Afterward, it calls add_sock which
saves the callbacks a second time. Since the error reporting function
lowcomms_error_report calls the previous callback too, this results
in a recursive call to itself. This patch adds a new parameter to
function add_sock to tell whether to save the callbacks. Function
tcp_accept_from_sock (and its sctp counterpart) then calls it with
false to avoid the recursion.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:00:03 -05:00
Paul Gortmaker 7963b8a598 dlm: audit and remove any unnecessary uses of module.h
Historically a lot of these existed because we did not have
a distinction between what was modular code and what was providing
support to modules via EXPORT_SYMBOL and friends.  That changed
when we forked out support for the latter into the export.h file.
This means we should be able to reduce the usage of module.h
in code that is obj-y Makefile or bool Kconfig.

In the case of some code where it is modular, we can extend that to
also include files that are building basic support functionality but
not related to loading or registering the final module; such files
also have no need whatsoever for module.h

The advantage in removing such instances is that module.h itself
sources about 15 other headers; adding significantly to what we feed
cpp, and it can obscure what headers we are effectively using.

Since module.h might have been the implicit source for init.h
(for __init) and for export.h (for EXPORT_SYMBOL) we consider each
instance for the presence of either and replace as needed.

In the dlm case, we remove module.h from a global header and only
introduce it in the files where it is explicitly required, since
there is nothing modular in dlm_internal.h itself.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:00:03 -05:00
Stephen Hemminger dbef1c0534 dlm: make genl_ops const
This table contains function points and should be const.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:00:03 -05:00
Linus Torvalds 63ae602cea Merge branch 'gup_flag-cleanups'
Merge the gup_flags cleanups from Lorenzo Stoakes:
 "This patch series adjusts functions in the get_user_pages* family such
  that desired FOLL_* flags are passed as an argument rather than
  implied by flags.

  The purpose of this change is to make the use of FOLL_FORCE explicit
  so it is easier to grep for and clearer to callers that this flag is
  being used.  The use of FOLL_FORCE is an issue as it overrides missing
  VM_READ/VM_WRITE flags for the VMA whose pages we are reading
  from/writing to, which can result in surprising behaviour.

  The patch series came out of the discussion around commit 38e0885465
  ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
  which addressed a BUG_ON() being triggered when a page was faulted in
  with PROT_NONE set but having been overridden by FOLL_FORCE.
  do_numa_page() was run on the assumption the page _must_ be one marked
  for NUMA node migration as an actual PROT_NONE page would have been
  dealt with prior to this code path, however FOLL_FORCE introduced a
  situation where this assumption did not hold.

  See

      https://marc.info/?l=linux-mm&m=147585445805166

  for the patch proposal"

Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
FOLL_WRITE by me.

[ This branch was rebased recently to add a few more acked-by's and
  reviewed-by's ]

* gup_flag-cleanups:
  mm: replace access_process_vm() write parameter with gup_flags
  mm: replace access_remote_vm() write parameter with gup_flags
  mm: replace __access_remote_vm() write parameter with gup_flags
  mm: replace get_user_pages_remote() write/force parameters with gup_flags
  mm: replace get_user_pages() write/force parameters with gup_flags
  mm: replace get_vaddr_frames() write/force parameters with gup_flags
  mm: replace get_user_pages_locked() write/force parameters with gup_flags
  mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
  mm: remove write/force parameters from __get_user_pages_unlocked()
  mm: remove write/force parameters from __get_user_pages_locked()
  mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
2016-10-19 08:39:47 -07:00
Lorenzo Stoakes 6347e8d5bc mm: replace access_remote_vm() write parameter with gup_flags
This removes the 'write' argument from access_remote_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19 08:12:14 -07:00
Lorenzo Stoakes 9beae1ea89 mm: replace get_user_pages_remote() write/force parameters with gup_flags
This removes the 'write' and 'force' from get_user_pages_remote() and
replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19 08:12:02 -07:00
Linus Torvalds 1a1891d762 This includes fixing a bug which references a wrong pointer, sum_page, in
f2fs_gc. It was newly introduced in 4.9-rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYBoL+AAoJEEAUqH6CSFDSQEoP/irufW4HDUszwKCxISLGksBN
 /i85zfbL9pY11Ci78qW4N5pegd2ouKk9WtUdrtXwT6y8Y+CWs9Gx5FaYbdA8aj7T
 nQBfRlfN/zKIJkRqqqXh/YRTMyQfUticur0dWZ00l7wpGAxA6Xhe+QDV1T7rxQxZ
 W8Ne9jcAD+SLu8G5Ci8zOTDcK+q6SWeXFNFtM1MPqr/S86PXiTRmlWFNcubACBi9
 iqLl5moD++oYBJSU6sqxaKXvf27GhJMZUGp+upYT9lEIGF98vyhw+yXT64HYDvrW
 lMJspz80m5CRT3Pz7nxG+1d6ctONbXwDjyBPeEINUlHi4DN07UJZCJUnUWyMYM8D
 JffKq4DGhmsh0CJoX5pKjzxDhuHtrWwN+d/gMuUZrgC7oecXs0U5O2WR49oMGCs8
 MMh6T/MUh1fWBI7F1Y3+OvyshU5XA9LCcNguu0T/RSKsjvqxkYTlpS/Zo1U/mpqv
 ar34mlvLXlLQNj8gTYDNybq/ufBfS2Fzq+juWbXCaVZ4OV0rsu66D9aUInQ66Iy6
 1EqjlLaqAwTF+DJsyGeAxNEn6tMPiqMVhK9LTT9dz5Uk5vK4FdvuDxqeFuwwTnCi
 xYHdsZn/yhs7Nn6akwOclJayegDKiV3OVEG4ZyUHrXVR+njFTJyW20VcuUvN9RCn
 AxHW5MI9jaU/U+O4Opbb
 =1MpQ
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs bugfix from Jaegeuk Kim:
 "This fixes a bug which referenced the wrong pointer, sum_page, in
  f2fs_gc.  It was newly introduced in 4.9-rc1.

* tag 'for-f2fs-4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: fix wrong sum_page pointer in f2fs_gc
2016-10-18 14:15:23 -07:00
Linus Torvalds e0ed1c22d4 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Two fixes:

   - a file locks fix (missing critical section, bug introduced in this
     merge window)

   - an x86 down_write() stack frame annotation"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking, fs/locks: Add missing file_sem locks
  locking/rwsem/x86: Add stack frame dependency for ____down_write()
2016-10-18 09:04:17 -07:00
Chris Mason 112a3edf4b Merge branch 'for-chris-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.9 2016-10-18 06:51:33 -07:00
Miklos Szeredi 0ce267ff95 fuse: fix root dentry initialization
Add missing dentry initialization to root dentry.

Fixes: f75fdf22b0 ("fuse: don't use ->d_time")
Reported-by: Andreas Reis <andreas.reis@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-18 15:36:48 +02:00
Wei Yongjun 5130ccea7c ceph: fix non static symbol warning
Fixes the following sparse warning:

fs/ceph/xattr.c:19:28: warning:
 symbol 'ceph_other_xattr_handler' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-18 12:30:32 +02:00
Peter Zijlstra 5f43086bb9 locking, fs/locks: Add missing file_sem locks
I overlooked a few code-paths that can lead to
locks_delete_global_locks().

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Bruce Fields <bfields@fieldses.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/20161008081228.GF3142@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-18 12:21:28 +02:00
Geert Uytterhoeven 31ca587810 ceph: fix uninitialized dentry pointer in ceph_real_mount()
fs/ceph/super.c: In function ‘ceph_real_mount’:
    fs/ceph/super.c:818: warning: ‘root’ may be used uninitialized in this function

If s_root is already valid, dentry pointer root is never initialized,
and returned by ceph_real_mount(). This will cause a crash later when
the caller dereferences the pointer.

Fixes: ce2728aaa8 ("ceph: avoid accessing / when mounting a subpath")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-10-18 12:10:59 +02:00
Yan, Zheng f72f94555a ceph: fix readdir vs fragmentation race
following sequence of events tigger the race

- client readdir frag 0* -> got item 'A'
- MDS merges frag 0* and frag 1*
- client send readdir request (frag 1*, offset 2, readdir_start 'A')
- MDS reply items (that are after item 'A') in frag *

Link: http://tracker.ceph.com/issues/17286
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-10-18 12:09:58 +02:00
Arnd Bergmann e952813e21 ext2: avoid bogus -Wmaybe-uninitialized warning
On ARM, we get this false-positive warning since the rework of
the ext2_get_blocks interface:

fs/ext2/inode.c: In function 'ext2_get_block':
include/linux/buffer_head.h:340:16: error: 'bno' may be used uninitialized in this function [-Werror=maybe-uninitialized]

The calling conventions for this function are rather complex, and it's
not surprising that the compiler gets this wrong, I spent a long time
trying to understand how it all fits together myself.

This change to avoid the warning makes sure the compiler sees that we
always set 'bno' pointer whenever we have a positive return code.
The transformation is correct because we always arrive at the 'got_it'
label with a positive count that gets used as the return value, while
any branch to the 'cleanup' label has a negative or zero 'err'.

Fixes: 6750ad7198 ("ext2: stop passing buffer_head to ext2_get_blocks")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-10-18 11:29:35 +02:00
Jan Kara a2ed0b391d isofs: Do not return EACCES for unknown filesystems
When isofs_mount() is called to mount a device read-write, it returns
EACCES even before it checks that the device actually contains an isofs
filesystem. This may confuse mount(8) which then tries to mount all
subsequent filesystem types in read-only mode.

Fix the problem by returning EACCES only once we verify that the device
indeed contains an iso9660 filesystem.

CC: stable@vger.kernel.org
Fixes: 17b7f7cf58
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-10-18 11:28:21 +02:00
Junjie Mao 14155cafea btrfs: assign error values to the correct bio structs
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Junjie Mao <junjie.mao@enight.me>
Acked-by: David Sterba <dsterba@suse.cz>
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-17 14:16:14 -07:00
Liu Bo 4547f4d8ff Btrfs: kill BUG_ON in do_relocation
While updating btree, we try to push items between sibling
nodes/leaves in order to keep height as low as possible.
But we don't memset the original places with zero when
pushing items so that we could end up leaving stale content
in nodes/leaves.  One may read the above stale content by
increasing btree blocks' @nritems.

One case I've come across is that in fs tree, a leaf has two
parent nodes, hence running balance ends up with processing
this leaf with two parent nodes, but it can only reach the
valid parent node through btrfs_search_slot, so it'd be like,

do_relocation
    for P in all parent nodes of block A:
        if !P->eb:
            btrfs_search_slot(key);   --> get path from P to A.
        if lowest:
            BUG_ON(A->bytenr != bytenr of A recorded in P);
        btrfs_cow_block(P, A);   --> change A's bytenr in P.

After btrfs_cow_block, P has the new bytenr of A, but with the
same @key, we get the same path again, and get panic by BUG_ON.

Note that this is only happening in a corrupted fs, for a
regular fs in which we have correct @nritems so that we won't
read stale content in any case.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-17 15:48:40 +02:00
Nikolay Borisov 0d7718f666 ceph: fix error handling in ceph_read_iter
In case __ceph_do_getattr returns an error and the retry_op in
ceph_read_iter is not READ_INLINE, then it's possible to invoke
__free_page on a page which is NULL, this naturally leads to a crash.
This can happen when, for example, a process waiting on a MDS reply
receives sigterm.

Fix this by explicitly checking whether the page is set or not.

Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-15 23:28:07 +02:00
Linus Torvalds df34d04a6f befs fixes for 4.9-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJYAnNOAAoJEGu/nxmHO1GNOzQH/3p+j1yPUR08+qhlZBdF/vCH
 i9Qb13yUT8yEN9tCZ7bsMhRZYQ70GuPMtLJbhklwGmnDAEZwzGoCrokexCsKoKiv
 0RmzLUsbN7GM6LFXOyTj3QwFGxjQnVzk5TKXSR2qUpqvvffFsAFlTpg/JqRNpTjF
 c85naRDFYmZ3fGi2mT/emoY8MAu90XnjWbAMrg+uipsriBqOcbUD487CubDeR0CK
 svO3JSvv2W6vjMVzkLSWnpFrhiWmqAcOHFS4NEcCeQaJkDmyRCnmVNXBaB1YGZey
 47+r8oLo64oByCt+Z60Dxb5rwDJfDLLDfRQeDOltgR4i2nnSZ5cS21V55Z5alqg=
 =sDD1
 -----END PGP SIGNATURE-----

Merge tag 'befs-v4.9-rc1' of git://github.com/luisbg/linux-befs

Pull befs fixes from Luis de Bethencourt:
 "I recently took maintainership of the befs file system [0]. This is
  the first time I send you a git pull request, so please let me know if
  all the below is OK.

  Salah Triki and myself have been cleaning the code and fixing a few
  small bugs.

  Sorry I couldn't send this sooner in the merge window, I was waiting
  to have my GPG key signed by kernel members at ELCE in Berlin a few
  days ago."

[0] https://lkml.org/lkml/2016/7/27/502

* tag 'befs-v4.9-rc1' of git://github.com/luisbg/linux-befs: (39 commits)
  befs: befs: fix style issues in datastream.c
  befs: improve documentation in datastream.c
  befs: fix typos in datastream.c
  befs: fix typos in btree.c
  befs: fix style issues in super.c
  befs: fix comment style
  befs: add check for ag_shift in superblock
  befs: dump inode_size superblock information
  befs: remove unnecessary initialization
  befs: fix typo in befs_sb_info
  befs: add flags field to validate superblock state
  befs: fix typo in befs_find_key
  befs: remove unused BEFS_BT_PARMATCH
  fs: befs: remove ret variable
  fs: befs: remove in vain variable assignment
  fs: befs: remove unnecessary *befs_sb variable
  fs: befs: remove useless initialization to zero
  fs: befs: remove in vain variable assignment
  fs: befs: Insert NULL inode to dentry
  fs: befs: Remove useless calls to brelse in befs_find_brun_dblindirect
  ...
2016-10-15 12:09:13 -07:00
Linus Torvalds 9ffc66941d This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
 possible, hoping to capitalize on any possible variation in CPU operation
 (due to runtime data differences, hardware differences, SMP ordering,
 thermal timing variation, cache behavior, etc).
 
 At the very least, this plugin is a much more comprehensive example for
 how to manipulate kernel code using the gcc plugin internals.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
 1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
 Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
 iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
 B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
 MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
 SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
 8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
 e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
 afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
 cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
 pa/A7CNQwibIV6YD8+/p
 =1dUK
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc plugins update from Kees Cook:
 "This adds a new gcc plugin named "latent_entropy". It is designed to
  extract as much possible uncertainty from a running system at boot
  time as possible, hoping to capitalize on any possible variation in
  CPU operation (due to runtime data differences, hardware differences,
  SMP ordering, thermal timing variation, cache behavior, etc).

  At the very least, this plugin is a much more comprehensive example
  for how to manipulate kernel code using the gcc plugin internals"

* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  latent_entropy: Mark functions with __latent_entropy
  gcc-plugins: Add latent_entropy plugin
2016-10-15 10:03:15 -07:00
Joe Perches d74f3d2528 ext4: add missing KERN_CONT to a few more debugging uses
Recent commits require line continuing printks to always use
pr_cont or KERN_CONT.  Add these markings to a few more printks.

Miscellaneaous:

o Integrate the ea_idebug and ea_bdebug macros to use a single
  call to printk(KERN_DEBUG instead of 3 separate printks
o Use the more common varargs macro style

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-10-15 09:57:31 -04:00
Eric Biggers 8906a8223a fscrypto: lock inode while setting encryption policy
i_rwsem needs to be acquired while setting an encryption policy so that
concurrent calls to FS_IOC_SET_ENCRYPTION_POLICY are correctly
serialized (especially the ->get_context() + ->set_context() pair), and
so that new files cannot be created in the directory during or after the
->empty_dir() check.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
2016-10-15 09:48:50 -04:00
Eric Biggers 199625098a ext4: correct endianness conversion in __xattr_check_inode()
It should be cpu_to_le32(), not le32_to_cpu().  No change in behavior.

Found with sparse, and this was the only endianness warning in fs/ext4/.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-10-15 09:39:31 -04:00
Linus Torvalds b26b5ef5ec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more misc uaccess and vfs updates from Al Viro:
 "The rest of the stuff from -next (more uaccess work) + assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  score: traps: Add missing include file to fix build error
  fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
  fs/super.c: fix race between freeze_super() and thaw_super()
  overlayfs: Fix setting IOP_XATTR flag
  iov_iter: kernel-doc import_iovec() and rw_copy_check_uvector()
  blackfin: no access_ok() for __copy_{to,from}_user()
  arm64: don't zero in __copy_from_user{,_inatomic}
  arm: don't zero in __copy_from_user_inatomic()/__copy_from_user()
  arc: don't leak bits of kernel stack into coredump
  alpha: get rid of tail-zeroing in __copy_user()
2016-10-14 18:19:05 -07:00
Linus Torvalds 87dbe42a16 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Including:

   - nine bug fixes for stable. Some of these we found at the recent two
     weeks of SMB3 test events/plugfests.

   - significant improvements in reconnection (e.g. if server or network
     crashes) especially when mounted with "persistenthandles" or to
     server which advertises Continuous Availability on the share.

   - a new mount option "idsfromsid" which improves POSIX compatibility
     in some cases (when winbind not configured e.g.) by better (and
     faster) fetching uid/gid from acl (when "cifsacl" mount option is
     enabled). NB: we are almost complete work on "cifsacl" (querying
     mode/uid/gid from ACL) for SMB3, but SMB3 support for cifsacl is
     not included in this set.

   - improved handling for SMB3 "credits" (even if server is buggy)

  Still working on two sets of changes:

   - cifsacl enablement for SMB3

   - cleanup of RFC1001 length calculation (so we can handle encryption
     and multichannel and RDMA)

  And a couple of new bugs were reported recently (unrelated to above)
  so will probably have another merge request next week"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (21 commits)
  CIFS: Retrieve uid and gid from special sid if enabled
  CIFS: Add new mount option to set owner uid and gid from special sids in acl
  CIFS: Reset read oplock to NONE if we have mandatory locks after reopen
  CIFS: Fix persistent handles re-opening on reconnect
  SMB2: Separate RawNTLMSSP authentication from SMB2_sess_setup
  SMB2: Separate Kerberos authentication from SMB2_sess_setup
  Expose cifs module parameters in sysfs
  Cleanup missing frees on some ioctls
  Enable previous version support
  Do not send SMB3 SET_INFO request if nothing is changing
  SMB3: Add mount parameter to allow user to override max credits
  fs/cifs: reopen persistent handles on reconnect
  Clarify locking of cifs file and tcon structures and make more granular
  Fix regression which breaks DFS mounting
  fs/cifs: keep guid when assigning fid to fileinfo
  SMB3: GUIDs should be constructed as random but valid uuids
  Set previous session id correctly on SMB3 reconnect
  cifs: Limit the overall credit acquired
  Display number of credits available
  Add way to query creation time of file via cifs xattr
  ...
2016-10-14 17:47:31 -07:00
Linus Torvalds d3304cadb2 Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Some fixes from Omar and Dave Sterba for our new free space tree.

  This isn't heavily used yet, but as we move toward making it the new
  default we wanted to nail down an endian bug"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: tests: uninline member definitions in free_space_extent
  btrfs: tests: constify free space extent specs
  Btrfs: expand free space tree sanity tests to catch endianness bug
  Btrfs: fix extent buffer bitmap tests on big-endian systems
  Btrfs: catch invalid free space trees
  Btrfs: fix mount -o clear_cache,space_cache=v2
  Btrfs: fix free space tree bitmaps on big-endian systems
2016-10-14 17:44:56 -07:00
Oleg Nesterov f1a9622037 fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the
false-positives. Now that xfs was fixed by Dave's commit dbad7c9930
("xfs: stop holding ILOCK over filldir callbacks") we can remove it and
change freeze_super() and thaw_super() to run with s_writers.rw_sem locks
held; we add two trivial helpers for that, lockdep_sb_freeze_release()
and lockdep_sb_freeze_acquire().

xfstests-dev/check `grep -il freeze tests/*/???` does not trigger any
warning from lockdep.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-14 20:41:59 -04:00
Linus Torvalds 1a892b485f Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This update contains fixes to the "use mounter's permission to access
  underlying layers" area, and miscellaneous other fixes and cleanups.

  No new features this time"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: use vfs_get_link()
  vfs: add vfs_get_link() helper
  ovl: use generic_readlink
  ovl: explain error values when removing acl from workdir
  ovl: Fix info leak in ovl_lookup_temp()
  ovl: during copy up, switch to mounter's creds early
  ovl: lookup: do getxattr with mounter's permission
  ovl: copy_up_xattr(): use strnlen
2016-10-14 17:23:33 -07:00
Oleg Nesterov 89f39af129 fs/super.c: fix race between freeze_super() and thaw_super()
Change thaw_super() to check frozen != SB_FREEZE_COMPLETE rather than
frozen == SB_UNFROZEN, otherwise it can race with freeze_super() which
drops sb->s_umount after SB_FREEZE_WRITE to preserve the lock ordering.

In this case thaw_super() will wrongly call s_op->unfreeze_fs() before
it was actually frozen, and call sb_freeze_unlock() which leads to the
unbalanced percpu_up_write(). Unfortunately lockdep can't detect this,
so this triggers misc BUG_ON()'s in kernel/rcu/sync.c.

Reported-and-tested-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-14 20:00:34 -04:00
Vivek Goyal 655042cc14 overlayfs: Fix setting IOP_XATTR flag
ovl_fill_super calls ovl_new_inode to create a root inode for the new
superblock before initializing sb->s_xattr.  This wrongly causes
IOP_XATTR to be cleared in i_opflags of the new inode, causing SELinux
to log the following message:

  SELinux: (dev overlay, type overlay) has no xattr support

Fix this by initializing sb->s_xattr and similar fields before calling
ovl_new_inode.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-14 20:00:34 -04:00
Vegard Nossum ffecee4f24 iov_iter: kernel-doc import_iovec() and rw_copy_check_uvector()
Both import_iovec() and rw_copy_check_uvector() take an array
(typically small and on-stack) which is used to hold an iovec array copy
from userspace. This is to avoid an expensive memory allocation in the
fast path (i.e. few iovec elements).

The caller may have to check whether these functions actually used
the provided buffer or allocated a new one -- but this differs between
the too. Let's just add a kernel doc to clarify what the semantics are
for each function.

Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-14 20:00:34 -04:00
Steve French 3514de3fd5 CIFS: Retrieve uid and gid from special sid if enabled
New mount option "idsfromsid" indicates to cifs.ko that
it should try to retrieve the uid and gid owner fields
from special sids.  This patch adds the code to parse the owner
sids in the ACL to see if they match, and if so populate the
uid and/or gid from them.  This is faster than upcalling for
them and asking winbind, and is a fairly common case, and is
also helpful when cifs.upcall and idmapping is not configured.

Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by:  Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-10-14 14:22:16 -05:00
Steve French 9593265531 CIFS: Add new mount option to set owner uid and gid from special sids in acl
Add "idsfromsid" mount option to indicate to cifs.ko that it should
try to retrieve the uid and gid owner fields from special sids in the
ACL if present.  This first patch just adds the parsing for the mount
option.

Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by:  Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-10-14 14:22:01 -05:00
Linus Torvalds f34d3606f7 Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - tracepoints for basic cgroup management operations added

 - kernfs and cgroup path formatting functions updated to behave in the
   style of strlcpy()

 - non-critical bug fixes

* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
  cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
  cpuset: fix error handling regression in proc_cpuset_show()
  cgroup: add tracepoints for basic operations
  cgroup: make cgroup_path() and friends behave in the style of strlcpy()
  kernfs: remove kernfs_path_len()
  kernfs: make kernfs_path*() behave in the style of strlcpy()
  kernfs: add dummy implementation of kernfs_path_from_node()
2016-10-14 12:18:50 -07:00
David S. Miller f1f081cef0 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV/+wwfSw1s6N8H32AQLpVA/+NByreKyI8cHL1zgz816iTrzYbEP5Gbtw
 RI9TI5iweUa9ySe4PFQUw+VC0yaAP9brY8tTtss8KHk808Wu4xhlg8fAClOaZXwy
 WmHASdwnRaDWguEpPHyHRST+s9ZO/VD5vwGhREB/hojzdzd135bq1d6GKaHoLFx2
 XDwDeyZc1z+aSzdMCoQuKJlqw9mfujsIOK5xZJ/h/JquYJ3iER55vdofettNCT+S
 hjueVBgWV988oORBtduPrfUYBbQI83QyiWl0xdo6QXWAoN784NfpVti8YAA9B7To
 qfup5aE6ky3LhuRD8GS00yWb96b43FGuPqt27LTH7SGnALX7KbETbBcgasyWqFeV
 UvPbVlk5R+0OXLxqOHvn20gRFS2c6HIdVAW6h7QHB0qwnLS1JJJToAczhK4QTsHQ
 eOXSGK4Gj0CiTd23bL7egaULKnD7eiZbagoty1UL05k8TAgPRnXXaMCRc0cupvKo
 5Amk7xT7ZN1Iyh9TQSt2MRIBIG6AOJogsmjSqgKJZY4YdCD5rYAWyAzc7K2l/L0e
 oUwDaPFWwNAfVYxJIXSd223squyxaXen0B+NlY501tqw4Ce4NnuEssqsSWk6OTHZ
 R9HCT200HvvZthPOStP7rdFiQ6VRDH3aSdl3zU4ila7cxr4wfdVslShuj91OiLqI
 /7zodCmls/8=
 =7krl
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20161013' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes

This set of patches contains a bunch of fixes:

 (1) Fix use of kunmap() after change from kunmap_atomic() within AFS.

 (2) Don't use of ERR_PTR() with an always zero value.

 (3) Check the right error when using ip6_route_output().

 (4) Be consistent about whether call->operation_ID is BE or CPU-E within
     AFS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:44:45 -04:00
Miklos Szeredi 7764235bec ovl: use vfs_get_link()
Resulting in a complete removal of a function basically implementing the
inverse of vfs_readlink().

As a bonus, now the proper security hook is also called.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-14 11:16:47 +02:00
Miklos Szeredi d60874cd58 vfs: add vfs_get_link() helper
This helper is for filesystems that want to read the symlink and are better
off with the get_link() interface (returning a char *) rather than the
readlink() interface (copy into a userspace buffer).

Also call the LSM hook for readlink (not get_link) since this is for
symlink reading not following.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-14 11:16:47 +02:00
Miklos Szeredi 78a3fa4f32 ovl: use generic_readlink
All filesystems that are backers for overlayfs would also use
generic_readlink().  Move this logic to the overlay itself, which is a nice
cleanup.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-14 11:16:46 +02:00
Miklos Szeredi cb348edb6b ovl: explain error values when removing acl from workdir
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> 
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-14 11:16:46 +02:00
Linus Torvalds c4a86165d1 NFS client updates for Linux 4.9
Highlights include:
 
 Stable bugfixes:
 - sunrpc: fix writ espace race causing stalls
 - NFS: Fix inode corruption in nfs_prime_dcache()
 - NFSv4: Don't report revoked delegations as valid in
   nfs_have_delegation()
 - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is
   invalid
 - NFSv4: Open state recovery must account for file permission changes
 - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
 
 Features:
 - Add support for tracking multiple layout types with an ordered list
 - Add support for using multiple backchannel threads on the client
 - Add support for pNFS file layout session trunking
 - Delay xprtrdma use of DMA API (for device driver removal)
 - Add support for xprtrdma remote invalidation
 - Add support for larger xprtrdma inline thresholds
 - Use a scatter/gather list for sending xprtrdma RPC calls
 - Add support for the CB_NOTIFY_LOCK callback
 - Improve hashing sunrpc auth_creds by using both uid and gid
 
 Bugfixes:
 - Fix xprtrdma use of DMA API
 - Validate filenames before adding to the dcache
 - Fix corruption of xdr->nwords in xdr_copy_to_scratch
 - Fix setting buffer length in xdr_set_next_buffer()
 - Don't deadlock the state manager on the SEQUENCE status flags
 - Various delegation and stateid related fixes
 - Retry operations if an interrupted slot receives EREMOTEIO
 - Make nfs boot time y2038 safe
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJX/+ZfAAoJENfLVL+wpUDr5MUP/16s2Kp9ZZZZ7ICi3yrHOzb0
 9WpCOmbKUIELXl8YgkxlvPUYMzTQTIc32TwbVgdFV0g41my/0+O3z3+IiTrUGxH5
 8LgouMWBZ9KKmyUB//+KQAXr3j/bvDdF6Li6wJfz8a2o+9xT4oTkK1+Js8p0kn6e
 HNKfRknfCKwvE+j4tPCLfs2RX5qDyBFILXwWhj1fAbmT3rbnp+QqkXD4mWUrXb9z
 DBgxciXRhOkOQQAD2KQBFd2kUqWDZ5ED23b+aYsu9D3VCW45zitBqQFAxkQWL0hp
 x8Mp+MDCxlgdEaGQPUmUiDtPkG1X9ZxUJCAwaJWWsZaItwR2Il+en2sETctnTZ1X
 0IAxZVFdolzSeLzIfNx3OG32JdWJdaNjUzkIZam8gO6i1f6PAmK4alR0J3CT31nJ
 /OEN76o1E7acGWRMmj+MAZ2U5gPfR7EitOzyE8ZUPcHgyeGMiynjwi56WIpeSvT2
 F/Sp5kRe5+D5gtnYuppGp7Srp5vYdtFaz1zgPDUKpDLcxfDweO8AHGjJf3Zmrunx
 X24yia4A14CnfcUy4vKpISXRykmkG/3Z0tpWwV53uXZm4nlQfRc7gPibiW7Ay521
 af8sDoItW98K3DK5NQU7IUn83ua1TStzpoqlAEafRw//g9zPMTbhHvNvOyrRfrcX
 kjWn6hNblMu9M34JOjtu
 =XOrF
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Stable bugfixes:
   - sunrpc: fix writ espace race causing stalls
   - NFS: Fix inode corruption in nfs_prime_dcache()
   - NFSv4: Don't report revoked delegations as valid in nfs_have_delegation()
   - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid
   - NFSv4: Open state recovery must account for file permission changes
   - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic

  Features:
   - Add support for tracking multiple layout types with an ordered list
   - Add support for using multiple backchannel threads on the client
   - Add support for pNFS file layout session trunking
   - Delay xprtrdma use of DMA API (for device driver removal)
   - Add support for xprtrdma remote invalidation
   - Add support for larger xprtrdma inline thresholds
   - Use a scatter/gather list for sending xprtrdma RPC calls
   - Add support for the CB_NOTIFY_LOCK callback
   - Improve hashing sunrpc auth_creds by using both uid and gid

  Bugfixes:
   - Fix xprtrdma use of DMA API
   - Validate filenames before adding to the dcache
   - Fix corruption of xdr->nwords in xdr_copy_to_scratch
   - Fix setting buffer length in xdr_set_next_buffer()
   - Don't deadlock the state manager on the SEQUENCE status flags
   - Various delegation and stateid related fixes
   - Retry operations if an interrupted slot receives EREMOTEIO
   - Make nfs boot time y2038 safe"

* tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (100 commits)
  NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
  fs: nfs: Make nfs boot time y2038 safe
  sunrpc: replace generic auth_cred hash with auth-specific function
  sunrpc: add RPCSEC_GSS hash_cred() function
  sunrpc: add auth_unix hash_cred() function
  sunrpc: add generic_auth hash_cred() function
  sunrpc: add hash_cred() function to rpc_authops struct
  Retry operation on EREMOTEIO on an interrupted slot
  pNFS: Fix atime updates on pNFS clients
  sunrpc: queue work on system_power_efficient_wq
  NFSv4.1: Even if the stateid is OK, we may need to recover the open modes
  NFSv4: If recovery failed for a specific open stateid, then don't retry
  NFSv4: Fix retry issues with nfs41_test/free_stateid
  NFSv4: Open state recovery must account for file permission changes
  NFSv4: Mark the lock and open stateids as invalid after freeing them
  NFSv4: Don't test open_stateid unless it is set
  NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid
  NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation
  NFSv4: Fix a race when updating an open_stateid
  NFSv4: Fix a race in nfs_inode_reclaim_delegation()
  ...
2016-10-13 21:28:20 -07:00
Linus Torvalds 2778556474 Some RDMA work and some good bugfixes, and two new features that could
benefit from user testing:
 
 Anna Schumacker contributed a simple NFSv4.2 COPY implementation.  COPY
 is already supported on the client side, so a call to copy_file_range()
 on a recent client should now result in a server-side copy that doesn't
 require all the data to make a round trip to the client and back.
 
 Jeff Layton implemented callbacks to notify clients when contended locks
 become available, which should reduce latency on workloads with
 contended locks.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX/mcsAAoJECebzXlCjuG+MU0P/3SzTLGYXU5yOTAorx255/uf
 fUVKQQhTzzaA2xj3gWWWztYx3y0ZJUVgwU56a+Ap5Z8/goqDQ78H+ePEc+MG7BT/
 /UXS/bITvt0MP/dvPrDzhSltvqx/wpelLPBo29hGLlAQ2dsnD4Y75IbOOQccWqcC
 iD2v6x7lnpWZ7j9Zhwzg/JNQHwISIb7tiLoYBjfcdNDEMU76KIyhxD0Cx9MSeBzH
 9Rq/oEdwGDFS5WqVfNe2jxbngoauq1IupziQ2eQGv2D/POyXCx8fphoYjDz1XaW8
 PxaJtJtM2owPGG+z2CxklJqNaS1Z4F+oppjg+nf4i/ibxmIBaTy8NluASX3vMh69
 CDO1+ly+TiF0l1VqMOQJWRnqn1qGk6fLpF6P1Ac62B0oWpeLGU7nmik7XN1ORgsi
 8ksxRKNAWeprZo3wl5xNrADu/wlZ7XCJTc4QoHEgYT04aHF+j8EMCHv+mtZ8+Bwn
 WWiA8iItZOgXV4vitCRJlvsixjYvmF3djPIoI2Lt5KDWIg+eL89sKwzTALSfeC4m
 Vjb0svzPX1MmZCNP1rCStFbl3gZYXZyqPk+uA6M7H8mjAjVeKxRPowWpMBgvYZHr
 FjCPb878bAuqCeBVbIyOLLcKWBLTw8PsUWZAor3gNg454JGkMjLUyJ/S22Cz5Nbo
 HdjoiTJtbPrHnCwTMXwa
 =nozl
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Some RDMA work and some good bugfixes, and two new features that could
  benefit from user testing:

   - Anna Schumacker contributed a simple NFSv4.2 COPY implementation.
     COPY is already supported on the client side, so a call to
     copy_file_range() on a recent client should now result in a
     server-side copy that doesn't require all the data to make a round
     trip to the client and back.

   - Jeff Layton implemented callbacks to notify clients when contended
     locks become available, which should reduce latency on workloads
     with contended locks"

* tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux:
  NFSD: Implement the COPY call
  nfsd: handle EUCLEAN
  nfsd: only WARN once on unmapped errors
  exportfs: be careful to only return expected errors.
  nfsd4: setclientid_confirm with unmatched verifier should fail
  nfsd: randomize SETCLIENTID reply to help distinguish servers
  nfsd: set the MAY_NOTIFY_LOCK flag in OPEN replies
  nfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant
  nfsd: add a LRU list for blocked locks
  nfsd: have nfsd4_lock use blocking locks for v4.1+ locks
  nfsd: plumb in a CB_NOTIFY_LOCK operation
  NFSD: fix corruption in notifier registration
  svcrdma: support Remote Invalidation
  svcrdma: Server-side support for rpcrdma_connect_private
  rpcrdma: RDMA/CM private message data structure
  svcrdma: Skip put_page() when send_reply() fails
  svcrdma: Tail iovec leaves an orphaned DMA mapping
  nfsd: fix dprintk in nfsd4_encode_getdeviceinfo
  nfsd: eliminate cb_minorversion field
  nfsd: don't set a FL_LAYOUT lease for flexfiles layouts
2016-10-13 21:04:42 -07:00
Linus Torvalds 35a891be96 xfs: reflink update for 4.9-rc1
< XFS has gained super CoW powers! >
  ----------------------------------
         \   ^__^
          \  (oo)\_______
             (__)\       )\/\
                 ||----w |
                 ||     ||
 
 Included in this update:
 - unshare range (FALLOC_FL_UNSHARE) support for fallocate
 - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr interface
 - shared extent support for XFS
 - copy-on-write support for shared extents
 - copy_file_range support
 - clone_file_range support (implements reflink)
 - dedupe_file_range support
 - defrag support for reverse mapping enabled filesystems
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX/hrZAAoJEK3oKUf0dfodpwcQAKkTerNPhhDcthqWUJ2+jC7w
 JIuhKUg2GYojJhIJ4+Ue1knmuBeIusda+PzGls+6gdy7GDGdux/esRIJSe1W7A5G
 RNeumiSKVX5iYsZNUEX35O2a/SwUM1Sm5mcIFs4CxUwIRwE/cayNby6vrlVExvz7
 Ns6YYOI2bldUHLsxedg8MLG0it1JGTADB9gwGgb98bxQ3bD/UBn3TF9xTlj+ZH22
 ebnWsogSJOnUigOOSGeaQsmy1pJAhRIhvt+f481KuZak1pdQcK2feL4RcKw0NpNt
 15LCYRqX6RexC684VYgJZxXB4EKyfS2Bma71q41A7dz1x36kw7+wG18xasBqU++p
 GZwwL6si02rIGPMz1oD8xxZ0F97ADCGRmkgUHsCJKbP5UmGiP08K6GEN3osr5hAN
 xAmn9AxcprXVnV3WmnFxpBeWY/qCEsvSQqJuKSThYqAilqUc8wN2u5g/eEpE6mmg
 KEEhzaq5P4ovS/HOIQJWdBu1j5E9Mg2o/ncy87Q6uE+9Fa5AAP6GBWOtGcMwdFSU
 adbN7dqjgoHMyNHFrmePqyJYtOZ2hZovDlVndxnYysl5ZBfiBEEDISmr+x6KcSlo
 3kyOltYQLjEVu1sLOT3COCddn0jt5Lr1QhGeVepnrMlU2E1h4461viCNMDinJRIp
 OYoMOS+J83G2FEFwgXYM
 =Sa+Y
 -----END PGP SIGNATURE-----

Merge tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

    < XFS has gained super CoW powers! >
     ----------------------------------
            \   ^__^
             \  (oo)\_______
                (__)\       )\/\
                    ||----w |
                    ||     ||

Pull XFS support for shared data extents from Dave Chinner:
 "This is the second part of the XFS updates for this merge cycle.  This
  pullreq contains the new shared data extents feature for XFS.

  Given the complexity and size of this change I am expecting - like the
  addition of reverse mapping last cycle - that there will be some
  follow-up bug fixes and cleanups around the -rc3 stage for issues that
  I'm sure will show up once the code hits a wider userbase.

  What it is:

  At the most basic level we are simply adding shared data extents to
  XFS - i.e. a single extent on disk can now have multiple owners. To do
  this we have to add new on-disk features to both track the shared
  extents and the number of times they've been shared. This is done by
  the new "refcount" btree that sits in every allocation group. When we
  share or unshare an extent, this tree gets updated.

  Along with this new tree, the reverse mapping tree needs to be updated
  to track each owner or a shared extent. This also needs to be updated
  ever share/unshare operation. These interactions at extent allocation
  and freeing time have complex ordering and recovery constraints, so
  there's a significant amount of new intent-based transaction code to
  ensure that operations are performed atomically from both the runtime
  and integrity/crash recovery perspectives.

  We also need to break sharing when writes hit a shared extent - this
  is where the new copy-on-write implementation comes in. We allocate
  new storage and copy the original data along with the overwrite data
  into the new location. We only do this for data as we don't share
  metadata at all - each inode has it's own metadata that tracks the
  shared data extents, the extents undergoing CoW and it's own private
  extents.

  Of course, being XFS, nothing is simple - we use delayed allocation
  for CoW similar to how we use it for normal writes. ENOSPC is a
  significant issue here - we build on the reservation code added in
  4.8-rc1 with the reverse mapping feature to ensure we don't get
  spurious ENOSPC issues part way through a CoW operation. These
  mechanisms also help minimise fragmentation due to repeated CoW
  operations. To further reduce fragmentation overhead, we've also
  introduced a CoW extent size hint, which indicates how large a region
  we should allocate when we execute a CoW operation.

  With all this functionality in place, we can hook up .copy_file_range,
  .clone_file_range and .dedupe_file_range and we gain all the
  capabilities of reflink and other vfs provided functionality that
  enable manipulation to shared extents. We also added a fallocate mode
  that explicitly unshares a range of a file, which we implemented as an
  explicit CoW of all the shared extents in a file.

  As such, it's a huge chunk of new functionality with new on-disk
  format features and internal infrastructure. It warns at mount time as
  an experimental feature and that it may eat data (as we do with all
  new on-disk features until they stabilise). We have not released
  userspace suport for it yet - userspace support currently requires
  download from Darrick's xfsprogs repo and build from source, so the
  access to this feature is really developer/tester only at this point.
  Initial userspace support will be released at the same time the kernel
  with this code in it is released.

  The new code causes 5-6 new failures with xfstests - these aren't
  serious functional failures but things the output of tests changing
  slightly due to perturbations in layouts, space usage, etc. OTOH,
  we've added 150+ new tests to xfstests that specifically exercise this
  new functionality so it's got far better test coverage than any
  functionality we've previously added to XFS.

  Darrick has done a pretty amazing job getting us to this stage, and
  special mention also needs to go to Christoph (review, testing,
  improvements and bug fixes) and Brian (caught several intricate bugs
  during review) for the effort they've also put in.

  Summary:

   - unshare range (FALLOC_FL_UNSHARE) support for fallocate

   - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr
     interface

   - shared extent support for XFS

   - copy-on-write support for shared extents

   - copy_file_range support

   - clone_file_range support (implements reflink)

   - dedupe_file_range support

   - defrag support for reverse mapping enabled filesystems"

* tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (71 commits)
  xfs: convert COW blocks to real blocks before unwritten extent conversion
  xfs: rework refcount cow recovery error handling
  xfs: clear reflink flag if setting realtime flag
  xfs: fix error initialization
  xfs: fix label inaccuracies
  xfs: remove isize check from unshare operation
  xfs: reduce stack usage of _reflink_clear_inode_flag
  xfs: check inode reflink flag before calling reflink functions
  xfs: implement swapext for rmap filesystems
  xfs: refactor swapext code
  xfs: various swapext cleanups
  xfs: recognize the reflink feature bit
  xfs: simulate per-AG reservations being critically low
  xfs: don't mix reflink and DAX mode for now
  xfs: check for invalid inode reflink flags
  xfs: set a default CoW extent size of 32 blocks
  xfs: convert unwritten status of reverse mappings for shared files
  xfs: use interval query for rmap alloc operations on shared files
  xfs: add shared rmap map/unmap/convert log item types
  xfs: increase log reservations for reflink
  ...
2016-10-13 20:28:22 -07:00
Pavel Shilovsky de74025052 CIFS: Reset read oplock to NONE if we have mandatory locks after reopen
We are already doing the same thing for an ordinary open case:
we can't keep read oplock on a file if we have mandatory byte-range
locks because pagereading can conflict with these locks on a server.
Fix it by setting oplock level to NONE.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-10-13 19:48:59 -05:00
Pavel Shilovsky f2cca6a7c9 CIFS: Fix persistent handles re-opening on reconnect
openFileList of tcon can be changed while cifs_reopen_file() is called
that can lead to an unexpected behavior when we return to the loop.
Fix this by introducing a temp list for keeping all file handles that
need to be reopen.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-10-13 19:48:55 -05:00
Sachin Prabhu 166cea4dc3 SMB2: Separate RawNTLMSSP authentication from SMB2_sess_setup
We split the rawntlmssp authentication into negotiate and
authencate parts. We also clean up the code and add helpers.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-10-13 19:48:34 -05:00
Sachin Prabhu 3baf1a7b92 SMB2: Separate Kerberos authentication from SMB2_sess_setup
Add helper functions and split Kerberos authentication off
SMB2_sess_setup.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-10-13 19:48:30 -05:00
Germano Percossi cb978ac8b8 Expose cifs module parameters in sysfs
/sys/module/cifs/parameters should display the three
other module load time configuration settings for cifs.ko

Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2016-10-13 19:48:25 -05:00
Steve French 24df1483c2 Cleanup missing frees on some ioctls
Cleanup some missing mem frees on some cifs ioctls, and
clarify others to make more obvious that no data is returned.

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
2016-10-13 19:48:20 -05:00
Steve French 834170c859 Enable previous version support
Add ioctl to query previous versions of file

Allows listing snapshots on files on SMB3 mounts.

Signed-off-by: Steve French <smfrench@gmail.com>
2016-10-13 19:48:11 -05:00
Steve French 18dd8e1a65 Do not send SMB3 SET_INFO request if nothing is changing
[CIFS] We had cases where we sent a SMB2/SMB3 setinfo request with all
timestamp (and DOS attribute) fields marked as 0 (ie do not change)
e.g. on chmod or chown.

Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
2016-10-13 19:46:51 -05:00
Linus Torvalds e3799a210d Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:

 - a new watchdog pretimeout governor framework

 - support to upload the firmware on the ziirave_wdt

 - several fixes and cleanups

* git://www.linux-watchdog.org/linux-watchdog: (26 commits)
  watchdog: imx2_wdt: add pretimeout function support
  watchdog: softdog: implement pretimeout support
  watchdog: pretimeout: add pretimeout_available_governors attribute
  watchdog: pretimeout: add option to select a pretimeout governor in runtime
  watchdog: pretimeout: add panic pretimeout governor
  watchdog: pretimeout: add noop pretimeout governor
  watchdog: add watchdog pretimeout governor framework
  watchdog: hpwdt: add support for iLO5
  fs: compat_ioctl: add pretimeout functions for watchdogs
  watchdog: add pretimeout support to the core
  watchdog: imx2_wdt: use preferred BIT macro instead of open coded values
  watchdog: st_wdt: Remove support for obsolete platforms
  watchdog: bindings: Remove obsolete platforms from dt doc.
  watchdog: mt7621_wdt: Remove assignment of dev pointer
  watchdog: rt2880_wdt: Remove assignment of dev pointer
  watchdog: constify watchdog_ops structures
  watchdog: tegra: constify watchdog_ops structures
  watchdog: iTCO_wdt: constify iTCO_wdt_pm structure
  watchdog: cadence_wdt: Fix the suspend resume
  watchdog: txx9wdt: Add missing clock (un)prepare calls for CCF
  ...
2016-10-13 16:44:20 -07:00
Benjamin Coddington a3f9d1b58a pnfs/blocklayout: fix last_write_offset incorrectly set to page boundary
Commit 41963c10c4 sets the block layout's
last written byte to the offset of the end of the extent rather than the
end of the write which incorrectly updates the inode's size for
partial-page writes.

Fixes: 41963c10c4 ("pnfs/blocklayout: update last_write_offset atomically with extents")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-13 16:42:53 -04:00
David Howells 50a2c95381 afs: call->operation_ID sometimes used as __be32 sometimes as u32
call->operation_ID is sometimes being used as __be32 sometimes is being
used as u32.  Be consistent and settle on using as u32.

Signed-off-by: David Howells <dhowells@redhat.com.
2016-10-13 17:03:52 +01:00
Dan Carpenter 233c9edcca afs: unmapping the wrong buffer
We switched from kmap_atomic() to kmap() so the kunmap() calls need to
be updated to match.

Fixes: d001648ec7 ('rxrpc: Don't expose skbs to in-kernel users [ver #2]')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-13 08:33:28 +01:00
Eric Biggers fb4454376d fscrypto: make XTS tweak initialization endian-independent
The XTS tweak (or IV) was initialized differently on little endian and
big endian systems.  Because the ciphertext depends on the XTS tweak, it
was not possible to use an encrypted filesystem created by a little
endian system on a big endian system and vice versa, even if they shared
the same PAGE_SIZE.  Fix this by always using little endian.

This will break hypothetical big endian users of ext4 or f2fs
encryption.  However, all users we are aware of are little endian, and
it's believed that "real" big endian users are unlikely to exist yet.
So this might as well be fixed now before it's too late.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-10-12 23:30:16 -04:00
Eric Biggers c4704a4fbe ext4: do not advertise encryption support when disabled
The sysfs file /sys/fs/ext4/features/encryption was present on kernels
compiled with CONFIG_EXT4_FS_ENCRYPTION=n.  This was misleading because
such kernels do not actually support ext4 encryption.  Therefore, only
provide this file on kernels compiled with CONFIG_EXT4_FS_ENCRYPTION=y.

Note: since the ext4 feature files are all hardcoded to have a contents
of "supported", it really is the presence or absence of the file that is
significant, not the contents (and this change reflects that).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-10-12 23:24:51 -04:00
Taesoo Kim 559cce698e jbd2: fix incorrect unlock on j_list_lock
When 'jh->b_transaction == transaction' (asserted by below)

  J_ASSERT_JH(jh, (jh->b_transaction == transaction || ...

'journal->j_list_lock' will be incorrectly unlocked, since
the the lock is aquired only at the end of if / else-if
statements (missing the else case).

Signed-off-by: Taesoo Kim <tsgatesv@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Fixes: 6e4862a5bb
Cc: stable@vger.kernel.org # 3.14+
2016-10-12 23:19:18 -04:00
Joe Perches 651e1c3b15 ext4: super.c: Update logging style using KERN_CONT
Recent commit require line continuing printks to use PR_CONT.

Update super.c to use KERN_CONT and use vsprintf extension %pV to
avoid a printk/vprintk/printk("\n") sequence as well.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-10-12 23:12:53 -04:00
Jaegeuk Kim de0dcc40f6 f2fs: fix wrong sum_page pointer in f2fs_gc
This patch fixes using a wrong pointer for sum_page in f2fs_gc.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-12 16:23:36 -07:00
Chris Mason d9ed71e545 Merge branch 'fst-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9
Signed-off-by: Chris Mason <clm@fb.com>
2016-10-12 13:16:00 -07:00
Steve French 141891f472 SMB3: Add mount parameter to allow user to override max credits
Add mount option "max_credits" to allow setting maximum SMB3
credits to any value from 10 to 64000 (default is 32000).
This can be useful to workaround servers with problems allocating
credits, or to throttle the client to use smaller amount of
simultaneous i/o or to workaround server performance issues.

Also adds a cap, so that even if the server granted us more than
65000 credits due to a server bug, we would not use that many.

Signed-off-by: Steve French <steve.french@primarydata.com>
2016-10-12 12:08:33 -05:00
Steve French 52ace1ef12 fs/cifs: reopen persistent handles on reconnect
Continuous Availability features like persistent handles
require that clients reconnect their open files, not
just the sessions, soon after the network connection comes
back up, otherwise the server will throw away the state
(byte range locks, leases, deny modes) on those handles
after a timeout.

Add code to reconnect handles when use_persistent set
(e.g. Continuous Availability shares) after tree reconnect.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Germano Percossi <germano.percossi@citrix.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-10-12 12:08:33 -05:00
Steve French 3afca265b5 Clarify locking of cifs file and tcon structures and make more granular
Remove the global file_list_lock to simplify cifs/smb3 locking and
have spinlocks that more closely match the information they are
protecting.

Add new tcon->open_file_lock and file->file_info_lock spinlocks.
Locks continue to follow a heirachy,
	cifs_socket --> cifs_ses --> cifs_tcon --> cifs_file
where global tcp_ses_lock still protects socket and cifs_ses, while the
the newer locks protect the lower level structure's information
(tcon and cifs_file respectively).

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Germano Percossi <germano.percossi@citrix.com>
2016-10-12 12:08:32 -05:00
Sachin Prabhu d171356ff1 Fix regression which breaks DFS mounting
Patch a6b5058 results in -EREMOTE returned by is_path_accessible() in
cifs_mount() to be ignored which breaks DFS mounting.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-10-12 12:08:32 -05:00
Aurelien Aptel 94f8737175 fs/cifs: keep guid when assigning fid to fileinfo
When we open a durable handle we give a Globally Unique
Identifier (GUID) to the server which we must keep for later reference
e.g. when reopening persistent handles on reconnection.

Without this the GUID generated for a new persistent handle was lost and
16 zero bytes were used instead on re-opening.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-10-12 12:08:32 -05:00
Steve French fa70b87cc6 SMB3: GUIDs should be constructed as random but valid uuids
GUIDs although random, and 16 bytes, need to be generated as
proper uuids.

Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reported-by: David Goebels <davidgoe@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2016-10-12 12:08:32 -05:00
Steve French c2afb8147e Set previous session id correctly on SMB3 reconnect
Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
Reported-by: David Goebel <davidgoe@microsoft.com>
2016-10-12 12:08:31 -05:00
Ross Lagerwall 7d414f396c cifs: Limit the overall credit acquired
The kernel client requests 2 credits for many operations even though
they only use 1 credit (presumably to build up a buffer of credit).
Some servers seem to give the client as much credit as is requested.  In
this case, the amount of credit the client has continues increasing to
the point where (server->credits * MAX_BUFFER_SIZE) overflows in
smb2_wait_mtu_credits().

Fix this by throttling the credit requests if an set limit is reached.
For async requests where the credit charge may be > 1, request as much
credit as what is charged.
The limit is chosen somewhat arbitrarily. The Windows client
defaults to 128 credits, the Windows server allows clients up to
512 credits (or 8192 for Windows 2016), and the NetApp server
(and at least one other) does not limit clients at all.
Choose a high enough value such that the client shouldn't limit
performance.

This behavior was seen with a NetApp filer (NetApp Release 9.0RC2).

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-10-12 12:08:31 -05:00
Steve French 9742805d6b Display number of credits available
In debugging smb3, it is useful to display the number
of credits available, so we can see when the server has not granted
sufficient operations for the client to make progress, or alternatively
the client has requested too many credits (as we saw in a recent bug)
so we can compare with the number of credits the server thinks
we have.

Add a /proc/fs/cifs/DebugData line to display the client view
on how many credits are available.

Signed-off-by: Steve French <steve.french@primarydata.com>
Reported-by: Germano Percossi <germano.percossi@citrix.com>
CC: Stable <stable@vger.kernel.org>
2016-10-12 12:08:31 -05:00
Steve French 6609804413 Add way to query creation time of file via cifs xattr
Add parsing for new pseudo-xattr user.cifs.creationtime file
attribute to allow backup and test applications to view
birth time of file on cifs/smb3 mounts.

Signed-off-by: Steve French <steve.french@primarydata.com>
2016-10-12 12:08:31 -05:00
Steve French a958fff242 Add way to query file attributes via cifs xattr
Add parsing for new pseudo-xattr user.cifs.dosattrib file attribute
so tools can recognize what kind of file it is, and verify if common
SMB3 attributes (system, hidden, archive, sparse, indexed etc.) are
set.

Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2016-10-12 12:08:30 -05:00
Filipe Manana d5e84fd8d0 Btrfs: fix incremental send failure caused by balance
Commit 951555856b ("Btrfs: send, don't bug on inconsistent snapshots")
removed some BUG_ON() statements (replacing them with returning errors
to user space and logging error messages) when a snapshot is in an
inconsistent state due to failures to update a delayed inode item (ENOMEM
or ENOSPC) after adding/updating/deleting references, xattrs or file
extent items.

However there is a case, when no errors happen, where a file extent item
can be modified without having the corresponding inode item updated. This
case happens during balance under very specific timings, when relocation
is in the stage where it updates data pointers and a leaf that contains
file extent items is COWed. When that happens file extent items get their
disk_bytenr field updated to a new value that reflects the post relocation
logical address of the extent, without updating their respective inode
items (as there is nothing that needs to be updated on them). This is
performed at relocation.c:replace_file_extents() through
relocation.c:btrfs_reloc_cow_block().

So make an incremental send deal with this case and don't do any processing
for a file extent item that got its disk_bytenr field updated by relocation,
since the extent's data is the same as the one pointed by the file extent
item in the parent snapshot.

After the recent commit mentioned above this case resulted in EIO errors
returned to user space (and an error message logged to dmesg/syslog) when
doing an incremental send, while before it, it resulted in hitting a
BUG_ON leading to the following trace:

[  952.206705] ------------[ cut here ]------------
[  952.206714] kernel BUG at ../fs/btrfs/send.c:5653!
[  952.206719] Internal error: Oops - BUG: 0 [#1] SMP
[  952.209854] Modules linked in: st dm_mod nls_utf8 isofs fuse nf_log_ipv6 xt_pkttype xt_physdev br_netfilter nf_log_ipv4 nf_log_common xt_LOG xt_limit ebtable_filter ebtables af_packet bridge stp llc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables xfs libcrc32c nls_iso8859_1 nls_cp437 vfat fat joydev aes_ce_blk ablk_helper cryptd snd_intel8x0 aes_ce_cipher snd_ac97_codec ac97_bus snd_pcm ghash_ce sha2_ce sha1_ce snd_timer snd virtio_net soundcore btrfs xor sr_mod cdrom hid_generic usbhid raid6_pq virtio_blk virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_mmio xhci_pci xhci_hcd usbcore usb_common virtio_pci virtio_ring virtio drm sg efivarfs
[  952.228333] Supported: Yes
[  952.228908] CPU: 0 PID: 12779 Comm: snapperd Not tainted 4.4.14-50-default #1
[  952.230329] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  952.231683] task: ffff800058e94100 ti: ffff8000d866c000 task.ti: ffff8000d866c000
[  952.233279] PC is at changed_cb+0x9f4/0xa48 [btrfs]
[  952.234375] LR is at changed_cb+0x58/0xa48 [btrfs]
[  952.236552] pc : [<ffff7ffffc39de7c>] lr : [<ffff7ffffc39d4e0>] pstate: 80000145
[  952.238049] sp : ffff8000d866fa20
[  952.238732] x29: ffff8000d866fa20 x28: 0000000000000019
[  952.239840] x27: 00000000000028d5 x26: 00000000000024a2
[  952.241008] x25: 0000000000000002 x24: ffff8000e66e92f0
[  952.242131] x23: ffff8000b8c76800 x22: ffff800092879140
[  952.243238] x21: 0000000000000002 x20: ffff8000d866fb78
[  952.244348] x19: ffff8000b8f8c200 x18: 0000000000002710
[  952.245607] x17: 0000ffff90d42480 x16: ffff800000237dc0
[  952.246719] x15: 0000ffff90de7510 x14: ab000c000a2faf08
[  952.247835] x13: 0000000000577c2b x12: ab000c000b696665
[  952.248981] x11: 2e65726f632f6966 x10: 652d34366d72612f
[  952.250101] x9 : 32627572672f746f x8 : ab000c00092f1671
[  952.251352] x7 : 8000000000577c2b x6 : ffff800053eadf45
[  952.252468] x5 : 0000000000000000 x4 : ffff80005e169494
[  952.253582] x3 : 0000000000000004 x2 : ffff8000d866fb78
[  952.254695] x1 : 000000000003e2a3 x0 : 000000000003e2a4
[  952.255803]
[  952.256150] Process snapperd (pid: 12779, stack limit = 0xffff8000d866c020)
[  952.257516] Stack: (0xffff8000d866fa20 to 0xffff8000d8670000)
[  952.258654] fa20: ffff8000d866fae0 ffff7ffffc308fc0 ffff800092879140 ffff8000e66e92f0
[  952.260219] fa40: 0000000000000035 ffff800055de6000 ffff8000b8c76800 ffff8000d866fb78
[  952.261745] fa60: 0000000000000002 00000000000024a2 00000000000028d5 0000000000000019
[  952.263269] fa80: ffff8000d866fae0 ffff7ffffc3090f0 ffff8000d866fae0 ffff7ffffc309128
[  952.264797] faa0: ffff800092879140 ffff8000e66e92f0 0000000000000035 ffff800055de6000
[  952.268261] fac0: ffff8000b8c76800 ffff8000d866fb78 0000000000000002 0000000000001000
[  952.269822] fae0: ffff8000d866fbc0 ffff7ffffc39ecfc ffff8000b8f8c200 ffff8000b8f8c368
[  952.271368] fb00: ffff8000b8f8c378 ffff800055de6000 0000000000000001 ffff8000ecb17500
[  952.272893] fb20: ffff8000b8c76800 ffff800092879140 ffff800062b6d000 ffff80007a9e2470
[  952.274420] fb40: ffff8000b8f8c208 0000000005784000 ffff8000580a8000 ffff8000b8f8c200
[  952.276088] fb60: ffff7ffffc39d488 00000002b8f8c368 0000000000000000 000000000003e2a4
[  952.280275] fb80: 000000000000006c ffff7ffffc39ec00 000000000003e2a4 000000000000006c
[  952.283219] fba0: ffff8000b8f8c300 0000000000000100 0000000000000001 ffff8000ecb17500
[  952.286166] fbc0: ffff8000d866fcd0 ffff7ffffc3643c0 ffff8000f8842700 0000ffff8ffe9278
[  952.289136] fbe0: 0000000040489426 ffff800055de6000 0000ffff8ffe9278 0000000040489426
[  952.292083] fc00: 000000000000011d 000000000000001d ffff80007a9e4598 ffff80007a9e43e8
[  952.294959] fc20: ffff8000b8c7693f 0000000000003b24 0000000000000019 ffff8000b8f8c218
[  952.301161] fc40: 00000001d866fc70 ffff8000b8c76800 0000000000000128 ffffffffffffff84
[  952.305749] fc60: ffff800058e941ff 0000000000003a58 ffff8000d866fcb0 ffff8000000f7390
[  952.308875] fc80: 000000000000012a 0000000000010290 ffff8000d866fc00 000000000000007b
[  952.311915] fca0: 0000000000010290 ffff800046c1b100 74732d7366727462 000001006d616572
[  952.314937] fcc0: ffff8000fffc4100 cb88537fdc8ba60e ffff8000d866fe10 ffff8000002499e8
[  952.318008] fce0: 0000000040489426 ffff8000f8842700 0000ffff8ffe9278 ffff80007a9e4598
[  952.321321] fd00: 0000ffff8ffe9278 0000000040489426 000000000000011d 000000000000001d
[  952.324280] fd20: ffff80000072c000 ffff8000d866c000 ffff8000d866fda0 ffff8000000e997c
[  952.327156] fd40: ffff8000fffc4180 00000000000031ed ffff8000fffc4180 ffff800046c1b7d4
[  952.329895] fd60: 0000000000000140 0000ffff907ea170 000000000000011d 00000000000000dc
[  952.334641] fd80: ffff80000072c000 ffff8000d866c000 0000000000000000 0000000000000002
[  952.338002] fda0: ffff8000d866fdd0 ffff8000000ebacc ffff800046c1b080 ffff800046c1b7d4
[  952.340724] fdc0: ffff8000d866fdf0 ffff8000000db67c 0000000000000040 ffff800000e69198
[  952.343415] fde0: 0000ffff8ffea790 00000000000031ed ffff8000d866fe20 ffff800000254000
[  952.346101] fe00: 000000000000001d 0000000000000004 ffff8000d866fe90 ffff800000249d3c
[  952.348980] fe20: ffff8000f8842700 0000000000000000 ffff8000f8842701 0000000000000008
[  952.351696] fe40: ffff8000d866fe70 0000000000000008 ffff8000d866fe90 ffff800000249cf8
[  952.354387] fe60: ffff8000f8842700 0000ffff8ffe9170 ffff8000f8842701 0000000000000008
[  952.357083] fe80: 0000ffff8ffe9278 ffff80008ff85500 0000ffff8ffe90c0 ffff800000085c84
[  952.359800] fea0: 0000000000000000 0000ffff8ffe9170 ffffffffffffffff 0000ffff90d473bc
[  952.365351] fec0: 0000000000000000 0000000000000015 0000000000000008 0000000040489426
[  952.369550] fee0: 0000ffff8ffe9278 0000ffff907ea790 0000ffff907ea170 0000ffff907ea790
[  952.372416] ff00: 0000ffff907ea170 0000000000000000 000000000000001d 0000000000000004
[  952.375223] ff20: 0000ffff90a32220 00000000003d0f00 0000ffff907ea0a0 0000ffff8ffe8f30
[  952.378099] ff40: 0000ffff9100f554 0000ffff91147000 0000ffff91117bc0 0000ffff90d473b0
[  952.381115] ff60: 0000ffff9100f620 0000ffff880069b0 0000ffff8ffe9170 0000ffff8ffe91a0
[  952.384003] ff80: 0000ffff8ffe9160 0000ffff8ffe9140 0000ffff88006990 0000ffff8ffe9278
[  952.386860] ffa0: 0000ffff88008a60 0000ffff8ffe9480 0000ffff88014ca0 0000ffff8ffe90c0
[  952.389654] ffc0: 0000ffff910be8e8 0000ffff8ffe90c0 0000ffff90d473bc 0000000000000000
[  952.410986] ffe0: 0000000000000008 000000000000001d 6e2079747265706f 72616d223d656d61
[  952.415497] Call trace:
[  952.417403] [<ffff7ffffc39de7c>] changed_cb+0x9f4/0xa48 [btrfs]
[  952.420023] [<ffff7ffffc308fc0>] btrfs_compare_trees+0x500/0x6b0 [btrfs]
[  952.422759] [<ffff7ffffc39ecfc>] btrfs_ioctl_send+0xb4c/0xe10 [btrfs]
[  952.425601] [<ffff7ffffc3643c0>] btrfs_ioctl+0x374/0x29a4 [btrfs]
[  952.428031] [<ffff8000002499e8>] do_vfs_ioctl+0x33c/0x600
[  952.430360] [<ffff800000249d3c>] SyS_ioctl+0x90/0xa4
[  952.432552] [<ffff800000085c84>] el0_svc_naked+0x38/0x3c
[  952.434803] Code: 2a1503e0 17fffdac b9404282 17ffff28 (d4210000)
[  952.437457] ---[ end trace 9afd7090c466cf15 ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-10-12 10:41:01 +01:00
Linus Torvalds a379f71a30 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - a few block updates that fell in my lap

 - lib/ updates

 - checkpatch

 - autofs

 - ipc

 - a ton of misc other things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
  mm: split gfp_mask and mapping flags into separate fields
  fs: use mapping_set_error instead of opencoded set_bit
  treewide: remove redundant #include <linux/kconfig.h>
  hung_task: allow hung_task_panic when hung_task_warnings is 0
  kthread: add kerneldoc for kthread_create()
  kthread: better support freezable kthread workers
  kthread: allow to modify delayed kthread work
  kthread: allow to cancel kthread work
  kthread: initial support for delayed kthread work
  kthread: detect when a kthread work is used by more workers
  kthread: add kthread_destroy_worker()
  kthread: add kthread_create_worker*()
  kthread: allow to call __kthread_create_on_node() with va_list args
  kthread/smpboot: do not park in kthread_create_on_cpu()
  kthread: kthread worker API cleanup
  kthread: rename probe_kthread_data() to kthread_probe_data()
  scripts/tags.sh: enable code completion in VIM
  mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping
  kdump, vmcoreinfo: report memory sections virtual addresses
  ipc/sem.c: add cond_resched in exit_sme
  ...
2016-10-11 17:34:10 -07:00
Michal Hocko 5114a97a8b fs: use mapping_set_error instead of opencoded set_bit
The mapping_set_error() helper sets the correct AS_ flag for the mapping
so there is no reason to open code it.  Use the helper directly.

[akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO]
Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Masahiro Yamada 97139d4a6f treewide: remove redundant #include <linux/kconfig.h>
Kernel source files need not include <linux/kconfig.h> explicitly
because the top Makefile forces to include it with:

  -include $(srctree)/include/linux/kconfig.h

This commit removes explicit includes except the following:

  * arch/s390/include/asm/facilities_src.h
  * tools/testing/radix-tree/linux/kernel.h

These two are used for host programs.

Link: http://lkml.kernel.org/r/1473656164-11929-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Michael Kerrisk (man-pages) 086e774a57 pipe: cap initial pipe capacity according to pipe-max-size limit
This is a patch that provides behavior that is more consistent, and
probably less surprising to users. I consider the change optional, and
welcome opinions about whether it should be applied.

By default, pipes are created with a capacity of 64 kiB.  However,
/proc/sys/fs/pipe-max-size may be set smaller than this value.  In this
scenario, an unprivileged user could thus create a pipe whose initial
capacity exceeds the limit. Therefore, it seems logical to cap the
initial pipe capacity according to the value of pipe-max-size.

The test program shown earlier in this patch series can be used to
demonstrate the effect of the change brought about with this patch:

    # cat /proc/sys/fs/pipe-max-size
    1048576
    # sudo -u mtk ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 65536
    # echo 10000 > /proc/sys/fs/pipe-max-size
    # cat /proc/sys/fs/pipe-max-size
    16384
    # sudo -u mtk ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 16384
    # ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 65536

The last two executions of 'test_F_SETPIPE_SZ' show that pipe-max-size
caps the initial allocation for a new pipe for unprivileged users, but
not for privileged users.

Link: http://lkml.kernel.org/r/31dc7064-2a17-9c5b-1df1-4e3012ee992c@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Michael Kerrisk (man-pages) 9c87bcf0a3 pipe: make account_pipe_buffers() return a value, and use it
This is an optional patch, to provide a small performance
improvement.  Alter account_pipe_buffers() so that it returns the
new value in user->pipe_bufs. This means that we can refactor
too_many_pipe_buffers_soft() and too_many_pipe_buffers_hard() to
avoid the costs of repeated use of atomic_long_read() to get the
value user->pipe_bufs.

Link: http://lkml.kernel.org/r/93e5f193-1e5e-3e1f-3a20-eae79b7e1310@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Michael Kerrisk (man-pages) a005ca0e68 pipe: fix limit checking in alloc_pipe_info()
The limit checking in alloc_pipe_info() (used by pipe(2) and when
opening a FIFO) has the following problems:

(1) When checking capacity required for the new pipe, the checks against
    the limit in /proc/sys/fs/pipe-user-pages-{soft,hard} are made
    against existing consumption, and exclude the memory required for
    the new pipe capacity. As a consequence: (1) the memory allocation
    throttling provided by the soft limit does not kick in quite as
    early as it should, and (2) the user can overrun the hard limit.

(2) As currently implemented, accounting and checking against the limits
    is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c).  The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

This patch addresses the above problems as follows:

* Alter the checks against limits to include the memory required for the
  new pipe.
* Re-order the accounting step so that it precedes the buffer allocation.
  If the accounting step determines that a limit has been reached, revert
  the accounting and cause the operation to fail.

Link: http://lkml.kernel.org/r/8ff3e9f9-23f6-510c-644f-8e70cd1c0bd9@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Michael Kerrisk (man-pages) 09b4d19900 pipe: simplify logic in alloc_pipe_info()
Replace an 'if' block that covers most of the code in this function
with a 'goto'. This makes the code a little simpler to read, and also
simplifies the next patch (fix limit checking in alloc_pipe_info())

Link: http://lkml.kernel.org/r/aef030c1-0257-98a9-4988-186efa48530c@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Michael Kerrisk (man-pages) b0b91d18e2 pipe: fix limit checking in pipe_set_size()
The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ))
has the following problems:

(1) When increasing the pipe capacity, the checks against the limits in
    /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing
    consumption, and exclude the memory required for the increased pipe
    capacity. The new increase in pipe capacity can then push the total
    memory used by the user for pipes (possibly far) over a limit. This
    can also trigger the problem described next.

(2) The limit checks are performed even when the new pipe capacity is
    less than the existing pipe capacity. This can lead to problems if a
    user sets a large pipe capacity, and then the limits are lowered,
    with the result that the user will no longer be able to decrease the
    pipe capacity.

(3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a)
    simultaneously, and then allocate pipe buffers that are accounted
    for only in step (c).  The race means that the user's pipe buffer
    allocation could be pushed over the limit (by an arbitrary amount,
    depending on how unlucky we were in the race). [Thanks to Vegard
    Nossum for spotting this point, which I had missed.]

This patch addresses the above problems as follows:

* Perform checks against the limits only when increasing a pipe's
  capacity; an unprivileged user can always decrease a pipe's capacity.
* Alter the checks against limits to include the memory required for
  the new pipe capacity.
* Re-order the accounting step so that it precedes the buffer
  allocation. If the accounting step determines that a limit has
  been reached, revert the accounting and cause the operation to fail.

The program below can be used to demonstrate problems 1 and 2, and the
effect of the fix. The program takes one or more command-line arguments.
The first argument specifies the number of pipes that the program should
create. The remaining arguments are, alternately, pipe capacities that
should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in
seconds) between the fcntl() operations. (The sleep intervals allow the
possibility to change the limits between fcntl() operations.)

Problem 1
=========

Using the test program on an unpatched kernel, we first set some
limits:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard    # 40.96 MB

Then show that we can set a pipe with capacity (100MB) that is
over the hard limit

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            F_SETPIPE_SZ returned 134217728

Now set the capacity to 100MB twice. The second call fails (which is
probably surprising to most users, since it seems like a no-op):

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            F_SETPIPE_SZ returned 134217728
        Loop 2: set pipe capacity to 100000000 bytes
            Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

With a patched kernel, setting a capacity over the limit fails at the
first attempt:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard
    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

There is a small chance that the change to fix this problem could
break user-space, since there are cases where fcntl(F_SETPIPE_SZ)
calls that previously succeeded might fail. However, the chances are
small, since (a) the pipe-user-pages-{soft,hard} limits are new (in
4.5), and the default soft/hard limits are high/unlimited.  Therefore,
it seems warranted to make these limits operate more precisely (and
behave more like what users probably expect).

Problem 2
=========

Running the test program on an unpatched kernel, we first set some limits:

    # getconf PAGESIZE
    4096
    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard    # 40.96 MB

Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe,
first setting a pipe capacity (10MB), sleeping for a few seconds,
during which time the hard limit is lowered, and then set pipe
capacity to a smaller amount (5MB):

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
    [1] 748
    # Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 10000000 bytes
            F_SETPIPE_SZ returned 16777216
            Sleeping 15 seconds

    # echo 1000 > /proc/sys/fs/pipe-user-pages-hard      # 4.096 MB
    #     Loop 2: set pipe capacity to 5000000 bytes
            Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

In this case, the user should be able to lower the limit.

With a kernel that has the patch below, the second fcntl()
succeeds:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard
    # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
    [1] 3215
    # Initial pipe capacity: 65536
    #     Loop 1: set pipe capacity to 10000000 bytes
            F_SETPIPE_SZ returned 16777216
            Sleeping 15 seconds

    # echo 1000 > /proc/sys/fs/pipe-user-pages-hard

    #     Loop 2: set pipe capacity to 5000000 bytes
            F_SETPIPE_SZ returned 8388608

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

/* test_F_SETPIPE_SZ.c

   (C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later

   Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity
   and interactions with limits defined by /proc/sys/fs/pipe-* files.
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int (*pfd)[2];
    int npipes;
    int pcap, rcap;
    int j, p, s, stime, loop;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s num-pipes "
                "[pipe-capacity sleep-time]...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    npipes = atoi(argv[1]);

    pfd = calloc(npipes, sizeof (int [2]));
    if (pfd == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    for (j = 0; j < npipes; j++) {
        if (pipe(pfd[j]) == -1) {
            fprintf(stderr, "Loop %d: pipe() failed: ", j);
            perror("pipe");
            exit(EXIT_FAILURE);
        }
    }

    printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ));

    for (j = 2; j < argc; j += 2 ) {
        loop = j / 2;
        pcap = atoi(argv[j]);
        printf("    Loop %d: set pipe capacity to %d bytes\n", loop, pcap);

        for (p = 0; p < npipes; p++) {
            s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap);
            if (s == -1) {
                fprintf(stderr, "        Loop %d, pipe %d: F_SETPIPE_SZ "
                        "failed: ", loop, p);
                perror("fcntl");
                exit(EXIT_FAILURE);
            }

            if (p == 0) {
                printf("        F_SETPIPE_SZ returned %d\n", s);
                rcap = s;
            } else {
                if (s != rcap) {
                    fprintf(stderr, "        Loop %d, pipe %d: F_SETPIPE_SZ "
                            "unexpected return: %d\n", loop, p, s);
                    exit(EXIT_FAILURE);
                }
            }

            stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0;
            if (stime > 0) {
                printf("        Sleeping %d seconds\n", stime);
                sleep(stime);
            }
        }
    }

    exit(EXIT_SUCCESS);
}

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

Patch history:

v2
   * Switch order of test in 'if' statement to avoid function call
      (to capability()) in normal path. [This is a fix to a preexisting
      wart in the code. Thanks to Willy Tarreau]
    * Perform (size > pipe_max_size) check before calling
      account_pipe_buffers().  [Thanks to Vegard Nossum]
      Quoting Vegard:

        The potential problem happens if the user passes a very large number
        which will overflow pipe->user->pipe_bufs.

        On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX
        then round_pipe_size() returns INT_MAX. Although it's true that the
        accounting is done in terms of pages and not bytes, so you'd need on
        the order of (1 << 13) = 8192 processes hitting the limit at the same
        time in order to make it overflow, which seems a bit unlikely.

        (See https://lkml.org/lkml/2016/8/12/215 for another discussion on the
        limit checking)

Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Michael Kerrisk (man-pages) 3734a13b96 pipe: refactor argument for account_pipe_buffers()
This is a preparatory patch for following work. account_pipe_buffers()
performs accounting in the 'user_struct'. There is no need to pass a
pointer to a 'pipe_inode_info' struct (which is then dereferenced to
obtain a pointer to the 'user' field). Instead, pass a pointer directly
to the 'user_struct'. This change is needed in preparation for a
subsequent patch that the fixes the limit checking in alloc_pipe_info()
(and the resulting code is a little more logical).

Link: http://lkml.kernel.org/r/7277bf8c-a6fc-4a7d-659c-f5b145c981ab@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Michael Kerrisk (man-pages) d37d416664 pipe: move limit checking logic into pipe_set_size()
This is a preparatory patch for following work. Move the F_SETPIPE_SZ
limit-checking logic from pipe_fcntl() into pipe_set_size().  This
simplifies the code a little, and allows for reworking required in
a later patch that fixes the limit checking in pipe_set_size()

Link: http://lkml.kernel.org/r/3701b2c5-2c52-2c3e-226d-29b9deb29b50@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Michael Kerrisk (man-pages) f491bd7111 pipe: relocate round_pipe_size() above pipe_set_size()
Patch series "pipe: fix limit handling", v2.

When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.

While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:

(1) When increasing the pipe capacity, the checks against the limits
    in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
    existing consumption, and exclude the memory required for the
    increased pipe capacity. The new increase in pipe capacity can then
    push the total memory used by the user for pipes (possibly far) over
    a limit. This can also trigger the problem described next.

(2) The limit checks are performed even when the new pipe capacity
    is less than the existing pipe capacity. This can lead to problems
    if a user sets a large pipe capacity, and then the limits are
    lowered, with the result that the user will no longer be able to
    decrease the pipe capacity.

(3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c).  The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

This patch series addresses these three problems.

This patch (of 8):

This is a minor preparatory patch.  After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().

Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi fcc24534b0 autofs: refactor ioctl fn vector in iookup_dev_ioctl()
cmd part of this struct is the same as an index of itself within
_ioctls[]. In fact this cmd is unused, so we can drop this part.

Link: http://lkml.kernel.org/r/20160831033414.9910.66697.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi 962ca7cfbd autofs: remove possibly misleading /* #define DEBUG */
Having this in autofs_i.h gives illusion that uncommenting this enables
pr_debug(), but it doesn't enable all the pr_debug() in autofs because
inclusion order matters.

XFS has the same DEBUG macro in its core header fs/xfs/xfs.h, however XFS
seems to have a rule to include this prior to other XFS headers as well as
kernel headers.  This is not the case with autofs, and DEBUG could be
enabled via Makefile, so autofs should just get rid of this comment to
make the code less confusing.  It's a comment, so there is literally no
functional difference.

Link: http://lkml.kernel.org/r/20160831033409.9910.77067.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi 390855547c autofs: fix print format for ioctl warning message
All other warnings use "cmd(0x%08x)" and this is the only one with
"cmd(%d)".  (below comes from my userspace debug program, but not
automount daemon)

[ 1139.905676] autofs4:pid:1640:check_dev_ioctl_version: ioctl control interface version mismatch: kernel(1.0), user(0.0), cmd(-1072131215)

Link: http://lkml.kernel.org/r/20160812024851.12352.75458.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Ian Kent d9e1923207 autofs: add autofs_dev_ioctl_version() for AUTOFS_DEV_IOCTL_VERSION_CMD
No functional changes, based on the following justification.

1. Make the code more consistent using the ioctl vector _ioctls[],
   rather than assigning NULL only for this ioctl command.
2. Remove goto done; for better maintainability in the long run.
3. The existing code is based on the fact that validate_dev_ioctl()
   sets ioctl version for any command, but AUTOFS_DEV_IOCTL_VERSION_CMD
   should explicitly set it regardless of the default behavior.

Link: http://lkml.kernel.org/r/20160812024846.12352.9885.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Ian Kent aa8419367b autofs: fix dev ioctl number range check
The count of miscellaneous device ioctls in fs/autofs4/autofs_i.h is wrong.

The number of ioctls is the difference between AUTOFS_DEV_IOCTL_VERSION_CMD
and AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD (14) not the difference between
AUTOFS_IOC_COUNT and 11 (21).

[kusumi.tomohiro@gmail.com: fix typo that made the count macro negative]
 Link: http://lkml.kernel.org/r/20160831033420.9910.16809.stgit@pluto.themaw.net
Link: http://lkml.kernel.org/r/20160812024841.12352.11975.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi b6e3795a06 autofs: fix pr_debug() message
This isn't a return value, so change the message to indicate the status is
the result of may_umount().

(or locate pr_debug() after put_user() with the same message)

Link: http://lkml.kernel.org/r/20160812024836.12352.74628.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi 41a4497a4f autofs: don't fail to free_dev_ioctl(param)
Returning -ENOTTY here fails to free dynamically allocated param.

Link: http://lkml.kernel.org/r/20160812024815.12352.69153.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi eea618e6d5 autofs: remove obsolete sb fields
These two were left from commit aa55ddf340 ("autofs4: remove unused
ioctls") which removed unused ioctls.

Link: http://lkml.kernel.org/r/20160812024810.12352.96377.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi ca552599bf autofs: use autofs4_free_ino() to kfree dentry data
kfree dentry data allocated by autofs4_new_ino() with autofs4_free_ino()
instead of raw kfree.  (since we have the interface to free autofs_info*)

This patch was modified to remove the need to set the dentry info field to
NULL dew to a change in the previous patch.

Link: http://lkml.kernel.org/r/20160812024805.12352.43650.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Ian Kent 1574fa7beb autofs: remove ino free in autofs4_dir_symlink()
The inode allocation failure case in autofs4_dir_symlink() frees the
autofs dentry info of the dentry without setting ->d_fsdata to NULL.

That could lead to a double free so just get rid of the free and leave it
to ->d_release().

Link: http://lkml.kernel.org/r/20160812024759.12352.10653.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi 97537b35b6 autofs: add WARN_ON(1) for non dir/link inode case
It's invalid if the given mode is neither dir nor link, so warn on else
case.

Link: http://lkml.kernel.org/r/20160812024754.12352.8536.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Ian Kent 1973a12269 autofs: fix autofs4_fill_super() error exit handling
Somewhere along the line the error handling gotos have become incorrect.

Link: http://lkml.kernel.org/r/20160812024749.12352.15100.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi 749800ef53 autofs: test autofs versions first on sb initialization
This patch does what the below comment says.  It could be and it's
considered better to do this first before various functions get called
during initialization.

/* Couldn't this be tested earlier? */

Link: http://lkml.kernel.org/r/20160812024744.12352.43075.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Tomohiro Kusumi 4a44c1859f autofs: drop unnecessary extern in autofs_i.h
autofs4_kill_sb() doesn't need to be declared as extern, and no other
functions in .h are explicitly declared as extern.

Link: http://lkml.kernel.org/r/20160812024739.12352.99354.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:31 -07:00
Vlastimil Babka 2d19309cf8 fs/select: add vmalloc fallback for select(2)
The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be physically
contiguous and the allocation is temporary for the duration of the syscall
only. There were some concerns, whether this would have negative impact on the
system by exposing vmalloc() to userspace. Although an excessive use of vmalloc
can cause some system wide performance issues - TLB flushes etc. - a large
order allocation is not for free either and an excessive reclaim/compaction can
have a similar effect. Also note that the size is effectively limited by
RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the
bitmaps will fit well within single page and thus the vmalloc() fallback could
be only excercised for processes where root allows a higher limit.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.

[eric.dumazet@gmail.com: fix failure path logic]
[akpm@linux-foundation.org: use proper type for size]
Link: http://lkml.kernel.org/r/20160927084536.5923-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:30 -07:00
Darrick J. Wong 25f4c41415 block: implement (some of) fallocate for block devices
After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted
for zeroing SCSI UNMAP.  Punch still requires that FALLOC_FL_KEEP_SIZE is
set.  A length that goes past the end of the device will be clamped to the
device size if KEEP_SIZE is set; or will return -EINVAL if not.  Both
start and length must be aligned to the device's logical block size.

Since the semantics of fallocate are fairly well established already, wire
up the two pieces.  The other fallocate variants (collapse range, insert
range, and allocate blocks) are not supported.

Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.org
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com> # tweaked header
Cc: Brian Foster <bfoster@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:30 -07:00
Guozhonghua 0cc482ee41 ocfs2: fix memory leak in dlm_migrate_request_handler()
In the dlm_migrate_request_handler(), when `ret' is -EEXIST, the mle
should be freed, otherwise the memory will be leaked.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D3522A@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Cc: Eric Ren <zren@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:30 -07:00
Linus Torvalds d09ba13110 libnvdimm for 4.9
* PMEM sub-division support: Allow a single PMEM region to be divided
   into multiple namespaces. Originally, ~2 years ago, it was thought that
   partitions of a /dev/pmemX block device could handle sub-allocations of
   persistent memory for different use cases. With the decision to not
   support DAX mappings of raw block-devices, and the genesis of
   device-dax, the need for having multiple pmem-namespace per region has
   grown.
 
 * Device-DAX unified inode: In support of dynamic-resizing of a
   device-dax instance the kernel arranges for all mappings of a
   device-dax node to share the same inode. This allows unmap / truncate /
   invalidation events to affect all instances of the device similar to the
   behavior of mmap on block devices.
 
 * Hardware error scrubbing reworks: The original address-range-scrub +
   badblocks tracking solution allowed clearing entries at the individual
   namespace level, but it failed to clear the internal list of media
   errors maintained at the bus level. The result was that the next scrub
   or namespace disable/re-enable event would restore the cleared
   badblocks, but now that is fixed. The v4.8 kernel introduced an
   auto-scrub-on-machine-check behavior to repopulate the badblocks list.
   Now, in v4.9, the auto-scrub behavior can be disabled and simply arrange
   for the error reported in the machine-check to be added to the list.
 
 * DIMM health-event notification support: ACPI 6.1 defines a
   notification event code that can be send to ACPI NVDIMM devices. A
   poll(2) capable file descriptor for these events can be obtained from
   the nmemX/nfit/flags sysfs-attribute of a libnvdimm memory device.
 
 * Miscellaneous fixes: NVDIMM-N probe error, device-dax build error, and
   a change to dedup the flush hint list to not flush the memory controller
   more than necessary.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX/B2oAAoJEB7SkWpmfYgCe3YQAJiH4ZYRxr6HeJzVQltbhB2k
 qyLC+7vIssefPPqn/Wycc3aHJjyk2ktetmFyjYE1q/vlJJWCG3y/ACfz2SZANXXx
 2tgLsI+3dXZaGgIxRsZF8MsB672owqCbzJHbbmTRu3EtgMplagfh27G7HFZxt4Jd
 FyKnRkknYsCEbHry/s0aRcZWPmacu5v1TDJyWgd0edNTG32GrKOtwxWrWEPRDJE1
 dIK5JjPaDwMFMKjV6lgRuBVlsMKCzIC4YjSYZZmN/Mf/JCJBJuPSlkYEdGZ+xx84
 /ZmKrE/XRPr7469f66QyD8iRtGAQ9OparhChbuzCagCHRAwgYy4yQGbK7rk0lwUM
 18jysZU8NJxp4jEJIt0u2ap6W9ySePX5Bm+3CSwqxT0Ernew2AUJDLIw9f1hAAbX
 rippSWyHp0JtBTjOeaV2ZY1LJlm+J//AycbFo51lAERHoX5zPimHL730EM8mJu7y
 fIbFpau3fjob+ovQMXMIYam8C/MpTqAvcjpBFhkSlsY7q/l+ARgFpjYpg9qVir8g
 v6PZ0UoGBhQvD2lTNTUjaCaHOc+sjo8PLeNI1ZsFebh63rF3k5sOLOk7wXllf8z5
 jQBnYtYnPCJI67BLLZmwWzoBb0HpCbcPp9/0/c1rdLTcAo+3gi6SY4pVJgznxCZZ
 +fkeOvSutJ687tFMarc1
 =SenK
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "Aside from the recently added pmem sub-division support these have
  been in -next for several releases with no reported issues. The sub-
  division support was included in next-20161010 with no reported
  issues. It passes all unit tests including new tests for all the new
  functionality below.

  Summary:

   - PMEM sub-division support: Allow a single PMEM region to be divided
     into multiple namespaces. Originally, ~2 years ago, it was thought
     that partitions of a /dev/pmemX block device could handle
     sub-allocations of persistent memory for different use cases. With
     the decision to not support DAX mappings of raw block-devices, and
     the genesis of device-dax, the need for having multiple
     pmem-namespace per region has grown.

   - Device-DAX unified inode: In support of dynamic-resizing of a
     device-dax instance the kernel arranges for all mappings of a
     device-dax node to share the same inode. This allows unmap /
     truncate / invalidation events to affect all instances of the
     device similar to the behavior of mmap on block devices.

   - Hardware error scrubbing reworks: The original address-range-scrub
     and badblocks tracking solution allowed clearing entries at the
     individual namespace level, but it failed to clear the internal
     list of media errors maintained at the bus level. The result was
     that the next scrub or namespace disable/re-enable event would
     restore the cleared badblocks, but now that is fixed. The v4.8
     kernel introduced an auto-scrub-on-machine-check behavior to
     repopulate the badblocks list. Now, in v4.9, the auto-scrub
     behavior can be disabled and simply arrange for the error reported
     in the machine-check to be added to the list.

   - DIMM health-event notification support: ACPI 6.1 defines a
     notification event code that can be send to ACPI NVDIMM devices. A
     poll(2) capable file descriptor for these events can be obtained
     from the nmemX/nfit/flags sysfs-attribute of a libnvdimm memory
     device.

   - Miscellaneous fixes: NVDIMM-N probe error, device-dax build error,
     and a change to dedup the flush hint list to not flush the memory
     controller more than necessary"

* tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  /dev/dax: fix Kconfig dependency build breakage
  dax: use correct dev_t value
  dax: convert devm_create_dax_dev to PTR_ERR
  libnvdimm, namespace: allow creation of multiple pmem-namespaces per region
  libnvdimm, namespace: lift single pmem limit in scan_labels()
  libnvdimm, namespace: filter out of range labels in scan_labels()
  libnvdimm, namespace: enable allocation of multiple pmem namespaces
  libnvdimm, namespace: update label implementation for multi-pmem
  libnvdimm, namespace: expand pmem device naming scheme for multi-pmem
  libnvdimm, region: update nd_region_available_dpa() for multi-pmem support
  libnvdimm, namespace: sort namespaces by dpa at init
  libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time
  tools/testing/nvdimm: support for sub-dividing a pmem region
  libnvdimm, namespace: unify blk and pmem label scanning
  libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper
  libnvdimm, label: convert label tracking to a linked list
  libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc
  nvdimm: reduce duplicated wpq flushes
  libnvdimm: clear the internal poison_list when clearing badblocks
  pmem: reduce kmap_atomic sections to the memcpys only
  ...
2016-10-11 12:19:31 -07:00
Linus Torvalds f29135b54b Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This is a big variety of fixes and cleanups.

  Liu Bo continues to fixup fuzzer related problems, and some of Josef's
  cleanups are prep for his bigger extent buffer changes (slated for
  v4.10)"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits)
  Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
  Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
  Btrfs: don't BUG() during drop snapshot
  btrfs: fix btrfs_no_printk stub helper
  Btrfs: memset to avoid stale content in btree leaf
  btrfs: parent_start initialization cleanup
  btrfs: Remove already completed TODO comment
  btrfs: Do not reassign count in btrfs_run_delayed_refs
  btrfs: fix a possible umount deadlock
  Btrfs: fix memory leak in do_walk_down
  btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
  btrfs: convert send's verbose_printk to btrfs_debug
  btrfs: convert pr_* to btrfs_* where possible
  btrfs: convert printk(KERN_* to use pr_* calls
  btrfs: unsplit printed strings
  btrfs: clean the old superblocks before freeing the device
  Btrfs: kill BUG_ON in run_delayed_tree_ref
  Btrfs: don't leak reloc root nodes on error
  btrfs: squash lines for simple wrapper functions
  Btrfs: improve check_node to avoid reading corrupted nodes
  ...
2016-10-11 11:23:06 -07:00
Linus Torvalds 4c609922a3 This pull request contains:
* Fixes for both UBI and UBIFS
 * overlayfs support (O_TMPFILE, RENAME_WHITEOUT/EXCHANGE)
 * Code refactoring for the upcoming MLC support
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJX/QOCAAoJEEtJtSqsAOnWtp4QAKItkx/LrW44rHhkoJfqG62i
 o+OaxMKNu43/v/io+68JNEkIqgEap2vMZVkfoIgIyuyPxMG7nA/zG3c2JFvQ/ReS
 uH0PmcpkIXbRBKe9IEn6rXmRz9q9UTNGhP2U5kg0rL22vwVGYIuzF4Bny25Irzf/
 LLtYOkpfZfaNTSjs1pmuJMWVFF1Rj68eVJEWL6JZ1BPQ4bRPbn5sNgOKNTJYkrJs
 GcXXNtonf3B0zOzFnmfFhVO5neo4FEG3QEQafR+qbhoNBvXSluVIAFoO4VKEcyHD
 BJbotsT64TBsBj7ol97EXxz+N6LkB3tNM3bFBvhAFXZ+EvrJ0o+2QoEOH0igWjMI
 4AXwSl6htCs+wRmqAqpJfZpfI7kv2MDUB9ZGAbuXRS888OK78Dzt1CupPW7Q12xh
 yYMNsXZvRvK82n0DfqBLQ53SIe/L3PotG2Cc29hjGaHjK+YcwVRvdp/2B3ID3O2L
 6ap/M6KA+i1SiYZI6yAEYT76jKOam9YG/psb76q66xILJ7h5XQOZODYQ9zC2towo
 Pjb+bCPzHZPm+v7xtSsP6aanZ+5xRXO91JjvsWl9UOQVDCA/Jt98H5qhCJZjIeIs
 OJ7z9PbTv0/jcBBRrjJyZIUE85omDliY4h04B3Yu44xa7Q9e7wbE+Vs/6L9txS0e
 L8TBNHmrYB7ZIprCIhcE
 =UB7l
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.9-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:
 "This pull request contains:

   - Fixes for both UBI and UBIFS
   - overlayfs support (O_TMPFILE, RENAME_WHITEOUT/EXCHANGE)
   - Code refactoring for the upcoming MLC support"

[ Ugh, we just got rid of the "rename2()" naming for the extended rename
  functionality. And this re-introduces it in ubifs with the cross-
  renaming and whiteout support.

  But rather than do any re-organizations in the merge itself, the
  naming can be cleaned up later ]

* tag 'upstream-4.9-rc1' of git://git.infradead.org/linux-ubifs: (27 commits)
  UBIFS: improve function-level documentation
  ubifs: fix host xattr_len when changing xattr
  ubifs: Use move variable in ubifs_rename()
  ubifs: Implement RENAME_EXCHANGE
  ubifs: Implement RENAME_WHITEOUT
  ubifs: Implement O_TMPFILE
  ubi: Fix Fastmap's update_vol()
  ubi: Fix races around ubi_refill_pools()
  ubi: Deal with interrupted erasures in WL
  UBI: introduce the VID buffer concept
  UBI: hide EBA internals
  UBI: provide an helper to query LEB information
  UBI: provide an helper to check whether a LEB is mapped or not
  UBI: add an helper to check lnum validity
  UBI: simplify LEB write and atomic LEB change code
  UBI: simplify recover_peb() code
  UBI: move the global ech and vidh variables into struct ubi_attach_info
  UBI: provide helpers to allocate and free aeb elements
  UBI: fastmap: use ubi_io_{read, write}_data() instead of ubi_io_{read, write}()
  UBI: fastmap: use ubi_rb_for_each_entry() in unmap_peb()
  ...
2016-10-11 10:49:44 -07:00
Linus Torvalds 6b5e09a748 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Netfilter list handling fix, from Linus.

 2) RXRPC/AFS bug fixes from David Howells (oops on call to serviceless
    endpoints, build warnings, missing notifications, etc.) From David
    Howells.

 3) Kernel log message missing newlines, from Colin Ian King.

 4) Don't enter direct reclaim in netlink dumps, the idea is to use a
    high order allocation first and fallback quickly to a 0-order
    allocation if such a high-order one cannot be done cheaply and
    without reclaim. From Eric Dumazet.

 5) Fix firmware download errors in btusb bluetooth driver, from Ethan
    Hsieh.

 6) Missing Kconfig deps for QCOM_EMAC, from Geert Uytterhoeven.

 7) Fix MDIO_XGENE dup Kconfig entry. From Laura Abbott.

 8) Constrain ipv6 rtr_solicits sysctl values properly, from Maciej
    Żenczykowski.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
  netfilter: Fix slab corruption.
  be2net: Enable VF link state setting for BE3
  be2net: Fix TX stats for TSO packets
  be2net: Update Copyright string in be_hw.h
  be2net: NCSI FW section should be properly updated with ethtool for BE3
  be2net: Provide an alternate way to read pf_num for BEx chips
  wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent()
  net: macb: NULL out phydev after removing mdio bus
  xen-netback: make sure that hashes are not send to unaware frontends
  Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion
  MAINTAINERS: add myself as a maintainer of xen-netback
  ipv6 addrconf: disallow rtr_solicits < -1
  Bluetooth: btusb: Fix atheros firmware download error
  drivers: net: phy: Correct duplicate MDIO_XGENE entry
  ethernet: qualcomm: QCOM_EMAC should depend on HAS_DMA and HAS_IOMEM
  net: ethernet: mediatek: remove hwlro property in the device tree
  net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi
  net: ethernet: mediatek: get the chip id by ETHDMASYS registers
  net: bgmac: Fix errant feature flag check
  netlink: do not enter direct reclaim from netlink_dump()
  ...
2016-10-11 08:10:19 -07:00
Linus Torvalds 101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro 3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Linus Torvalds 97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Christoph Hellwig feac470e36 xfs: convert COW blocks to real blocks before unwritten extent conversion
We need to splice COW blocks we've completed in xfs_end_io_direct_write
into the data fork before converting unwritten extents.  Otherwise
xfs_bmapi_write might first allocate blocks for any holes in the data
fork, which isn't only not needed but also harmful as it might cause
reserved block underruns in the transaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-11 09:03:19 +11:00
Emese Revfy 0766f788eb latent_entropy: Mark functions with __latent_entropy
The __latent_entropy gcc attribute can be used only on functions and
variables.  If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents.  The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:45 -07:00
Linus Torvalds 6763afe4b9 dlm for 4.9
This includes a bug fix for a bad memory access during workqueue
 cleanup, which can happen while shutting down the dlm networking
 layer.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX+63GAAoJEDgbc8f8gGmq4hIP/R63HIQkCQJOCrV34gdrk4tN
 7+mwqKkQeWfDYEB0TI+B/iwMsEqtE12Wob6lN9P1pYlTp1OOXulj/jV3xBcENMkM
 trxmcscCwKcVnLvkW1cVqKfLdswFEQZv95g0CVIAaLghI3v39Sf5WDVcaw+L6IEv
 7ko5vet2OY5eJm6vJEDXTJbxWQ3itkOvIrD8f50jA6IFkrgfJQ0oFkPnfVFTU0Mp
 g1v2w05voWINoHQ3b1AfTz7iAYUIv94CnVDYgyIwsqUW2M133Tvw2Cj78rK0EqJa
 t2vr/8+8twm0D2NDq7xX4BLLKkrlchhDVQWNivVHw9DGOtjXdMu3r7WmgDM8FvW6
 NethPt45xLNHtcP/GvE1z7YzUus50aSwjWIRHyRUhMDD/8QwIwCiw3FBx2H3TRuf
 OXZCXOpCXFxM63uOAkZaI9xCni0SWCVCdHxFEmd1VF41t4RNc4jbpGWSe1jqacLx
 a1Dpf/mLClQ1WOgLtnVQ+/OeRhayVnnhRD3u2XxqWl3oZkU4QqJp8cTSeq3gVbgE
 m9ZDi/MozN9amkGECkI7JlyRHgk73ZM0KPCcYgl9v0v9l6R1i8Bi8xvs5YS+i/Je
 e5jyDM2VtcqoGY9lzSQMMbqz6Lrv7aagrDYGPts/togYuKT/925L6BiIVoHIZtfy
 H+NKEqCPMq70mRDjbd/S
 =4las
 -----END PGP SIGNATURE-----

Merge tag 'dlm-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm fix from David Teigland:
 "This includes a bug fix for a bad memory access during workqueue
  cleanup, which can happen while shutting down the dlm networking
  layer"

* tag 'dlm-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: free workqueues after the connections
2016-10-10 13:58:06 -07:00
Linus Torvalds 8dfb790b15 The big ticket item here is support for rbd exclusive-lock feature,
with maintenance operations offloaded to userspace (Douglas Fuller,
 Mike Christie and myself).  Another block device bullet is a series
 fixing up layering error paths (myself).
 
 On the filesystem side, we've got patches that improve our handling of
 buffered vs dio write races (Neil Brown) and a few assorted fixes from
 Zheng.  Also included a couple of random cleanups and a minor CRUSH
 update.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJX+PjZAAoJEEp/3jgCEfOLVuoH/RwtFLIb6/KZUYtBOrVVrTun
 kReRlfq2xKYrGGtyQEqSuz7fBdwT1LVCVcL8kC4GFD4R67o+tNMAr6PfM/7pZABj
 HRoRLgSZ9FLw4W5n0VpBIznih75QUbCdXiTCtH9eorMHU5q1YpTvVHHlF9W9Pm2I
 eNGnBWpGyHVeiK66mpUCH+EQKQ4GkAVD9rneTNqLHgq2yotHkVl1j258+DL6JRGs
 OBoh3RmNQaGOAS37Lss8erCSusAGEcAeGV6ubuK2lFUKyR41EkD3I0xkhNSPe+CD
 RifFcpVziIeTu//cLgl0nnHGtmUytD7HgJubaPthArKIOen9ZDAfEkgI0o+JI2A=
 =45O7
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client

Pull Ceph updates from Ilya Dryomov:
 "The big ticket item here is support for rbd exclusive-lock feature,
  with maintenance operations offloaded to userspace (Douglas Fuller,
  Mike Christie and myself). Another block device bullet is a series
  fixing up layering error paths (myself).

  On the filesystem side, we've got patches that improve our handling of
  buffered vs dio write races (Neil Brown) and a few assorted fixes from
  Zheng. Also included a couple of random cleanups and a minor CRUSH
  update"

* tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits)
  crush: remove redundant local variable
  crush: don't normalize input of crush_ln iteratively
  libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
  libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
  ceph: fix description for rsize and rasize mount options
  rbd: use kmalloc_array() in rbd_header_from_disk()
  ceph: use list_move instead of list_del/list_add
  ceph: handle CEPH_SESSION_REJECT message
  ceph: avoid accessing / when mounting a subpath
  ceph: fix mandatory flock check
  ceph: remove warning when ceph_releasepage() is called on dirty page
  ceph: ignore error from invalidate_inode_pages2_range() in direct write
  ceph: fix error handling of start_read()
  rbd: add rbd_obj_request_error() helper
  rbd: img_data requests don't own their page array
  rbd: don't call rbd_osd_req_format_read() for !img_data requests
  rbd: rework rbd_img_obj_exists_submit() error paths
  rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()
  rbd: move bumping img_request refcount into rbd_obj_request_submit()
  rbd: mark the original request as done if stat request fails
  ...
2016-10-10 13:52:05 -07:00
Chris Mason 19c4d2f994 Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
This reverts commit 5d8eb6fe51.

When we remove devices, we free the device structures.  Delaying
btfs_remove_chunk() ends up hitting a use-after-free on them.

Signed-off-by: Chris Mason <clm@fb.com>
2016-10-10 13:43:31 -07:00
Linus Torvalds fed41f7d03 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice fixups from Al Viro:
 "A couple of fixups for interaction of pipe-backed iov_iter with
  O_DIRECT reads + constification of a couple of primitives in uio.h
  missed by previous rounds.

  Kudos to davej - his fuzzing has caught those bugs"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [btrfs] fix check_direct_IO() for non-iovec iterators
  constify iov_iter_count() and iter_is_iovec()
  fix ITER_PIPE interaction with direct_IO
2016-10-10 13:38:49 -07:00
Linus Torvalds abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Al Viro cd27e45504 [btrfs] fix check_direct_IO() for non-iovec iterators
looking for duplicate ->iov_base makes sense only for
iovec-backed iterators; for kvec-backed ones it's pointless,
for bvec-backed ones it's pointless and broken on 32bit (we
walk through an array of struct bio_vec accessing them as if
they were struct iovec; works by accident on 64bit, but on
32bit it'll blow up) and for pipe-backed ones it's pointless
and ends up oopsing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-10 13:58:16 -04:00
Al Viro c3a6902404 fix ITER_PIPE interaction with direct_IO
by making sure we call iov_iter_advance() on original
iov_iter even if direct_IO (done on its copy) has returned 0.
It's a no-op for old iov_iter flavours and does the right thing
(== truncation of the stuff we'd allocated, but not filled) in
ITER_PIPE case.  Failures (e.g. -EIO) get caught and dealt with
by cleanup in generic_file_read_iter().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-10 13:36:06 -04:00
Marcelo Ricardo Leitner 3a8db79889 dlm: free workqueues after the connections
After backporting commit ee44b4bc05 ("dlm: use sctp 1-to-1 API")
series to a kernel with an older workqueue which didn't use RCU yet, it
was noticed that we are freeing the workqueues in dlm_lowcomms_stop()
too early as free_conn() will try to access that memory for canceling
the queued works if any.

This issue was introduced by commit 0d737a8cfd as before it such
attempt to cancel the queued works wasn't performed, so the issue was
not present.

This patch fixes it by simply inverting the free order.

Cc: stable@vger.kernel.org
Fixes: 0d737a8cfd ("dlm: fix race while closing connections")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-10 09:54:00 -05:00
Darrick J. Wong 6f97077ff6 xfs: rework refcount cow recovery error handling
The error handling in xfs_refcount_recover_cow_leftovers is confused
and can potentially leak memory, so rework it to release resources
correctly on error.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-10 17:23:07 +11:00
Darrick J. Wong 1987fd7434 xfs: clear reflink flag if setting realtime flag
Since we can only turn on the rt flag if there are no data extents,
we can safely turn off the reflink flag if the rt flag is being
turned on.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-10 16:49:29 +11:00
Darrick J. Wong 9780643cde xfs: fix error initialization
Eric Sandeen reported a gcc complaint about uninitialized error
variables, so fix that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-10 16:49:18 +11:00
Darrick J. Wong 93fed47013 xfs: fix label inaccuracies
Since we don't unlock anything on the way out, change the label.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-10 16:49:10 +11:00
Darrick J. Wong 97a1b87ea7 xfs: remove isize check from unshare operation
Now that fallocate has an explicit unshare flag again, let's try
to remove the inode reflink flag whenever the user unshares any
part of a file since checking is cheap compared to the CoW.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-10 16:49:01 +11:00
Darrick J. Wong 024adf4870 xfs: reduce stack usage of _reflink_clear_inode_flag
The loop in _reflink_clear_inode_flag isn't necessary since we
jump out if any part of any extent is shared.  Remove the loop
and we no longer need two maps, so we can save some stack use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-10 16:47:40 +11:00
Darrick J. Wong 63646fc58d xfs: check inode reflink flag before calling reflink functions
There are a couple of places where we don't check the inode's
reflink flag before calling into the reflink code.  Fix those,
and add some asserts so we don't make this mistake again.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-10 16:47:32 +11:00
Al Viro e55f1d1d13 Merge remote-tracking branch 'jk/vfs' into work.misc 2016-10-08 11:06:08 -04:00
Al Viro f334bcd94b Merge remote-tracking branch 'ovl/misc' into work.misc 2016-10-08 11:00:01 -04:00
Al Viro 73e8fb2d59 Merge branch 'work.const-qstr' into work.misc 2016-10-08 10:44:55 -04:00
Al Viro 33e09f0ee7 Merge branch 'work.iget' into work.misc 2016-10-08 10:44:37 -04:00
Luis de Bethencourt a17e7d2010 befs: befs: fix style issues in datastream.c
Fixing the following checkpatch.pl errors:

ERROR: "foo * bar" should be "foo *bar"
+                            befs_blocknr_t blockno, befs_block_run * run);

WARNING: Missing a blank line after declarations
+       struct buffer_head *bh;
+       befs_debug(sb, "---> %s length: %llu", __func__, len);

WARNING: Block comments use * on subsequent lines
+       /*
+          Double indir block, plus all the indirect blocks it maps.

(and other instances of these)

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:36 +01:00
Luis de Bethencourt a20af5f9ea befs: improve documentation in datastream.c
Convert function descriptions to kernel-doc style.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:36 +01:00
Luis de Bethencourt d327e612bd befs: fix typos in datastream.c
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:35 +01:00
Luis de Bethencourt 02d91f97fd befs: fix typos in btree.c
Fixing typos in kernel-doc function descriptions in fs/befs/btree.c.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:34 +01:00
Luis de Bethencourt 103c0fb340 befs: fix style issues in super.c
Fixing the following checkpatch.pl error:

ERROR: "foo * bar" should be "foo *bar"
+befs_load_sb(struct super_block *sb, befs_super_block * disk_sb)

And the following warnings:

WARNING: suspect code indent for conditional statements (8, 12)
+       if (disk_sb->fs_byte_order == BEFS_BYTEORDER_NATIVE_LE)
+           befs_sb->byte_order = BEFS_BYTESEX_LE;

WARNING: suspect code indent for conditional statements (8, 12)
+       else if (disk_sb->fs_byte_order == BEFS_BYTEORDER_NATIVE_BE)
+           befs_sb->byte_order = BEFS_BYTESEX_BE;

WARNING: break quoted strings at a space character
+               befs_error(sb, "blocksize(%u) cannot be larger"
+                          "than system pagesize(%lu)", befs_sb->block_size,

WARNING: line over 80 characters
+       if (befs_sb->log_start != befs_sb->log_end || befs_sb->flags == BEFS_DIRTY) {

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:34 +01:00
Luis de Bethencourt 11674239f9 befs: fix comment style
The description of befs_load_sb was confusing the kernel-doc system since,
because it starts with /**, it thinks it will document the function with
kernel-doc formatting. Which it isn't.

Fix other comment style issues in the file while we are at it.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:33 +01:00
Luis de Bethencourt bbe1bd0b6b befs: add check for ag_shift in superblock
ag_shift and blocks_per_ag contain the same information in different ways,
same as block_shift and block_size do. It is worth checking this two are
consistent, but since blocks_per_ag isn't documented as mandatory to use
some implementations of befs don't enforce this, so making it non-fatal if
they don't match and just having it as a warning.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:31 +01:00
Luis de Bethencourt d1a8c70676 befs: dump inode_size superblock information
befs_dump_super_block() wasn't giving the inode_size information when
dumping all elements of the superblock. Add this element to have complete
information of the superblock.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:29 +01:00
Salah Triki 78f647c27f befs: remove unnecessary initialization
There is no need to init block, since it will be overwitten later by
iaddr2blockno().

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:28 +01:00
Salah Triki 2ac636b4d0 befs: fix typo in befs_sb_info
Fixing jornal to Journal.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:27 +01:00
Salah Triki 6ea4558f9b befs: add flags field to validate superblock state
For validating superblock state, add flags field to befs_sb_info, read the state from the disk
and check if it is equal to BEFS_DIRTY.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:27 +01:00
Luis de Bethencourt bb75e66627 befs: fix typo in befs_find_key
Fixing skeep to skip.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:26 +01:00
Luis de Bethencourt 672a8515ee befs: remove unused BEFS_BT_PARMATCH
befs_btree_find(), the only caller of befs_find_key(), only cares about if
the return from that function is BEFS_BT_MATCH or not. It never uses the
partial match given with BEFS_BT_PARMATCH. Make the overflow return clearer
by having BEFS_BT_OVERFLOW instead of BEFS_BT_PARMATCH.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:26 +01:00
Salah Triki 33c712b4fc fs: befs: remove ret variable
ret is initialized to -EIO and is never modified, so remove ret and use
-EIO directly.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:25 +01:00
Salah Triki abcf911691 fs: befs: remove in vain variable assignment
There is no need to init res, since it will be overwitten later by
befs_fblock2brun().

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:24 +01:00
Salah Triki f30661035b fs: befs: remove unnecessary *befs_sb variable
Remove *befs_sb and just call BEFS_SB(sb) directly, since the returned
value by this function is only used once.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:23 +01:00
Salah Triki 143d2a615f fs: befs: remove useless initialization to zero
node_off is unconditionally set to bt_super.root_node_ptr, so no need to
init it to zero.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:23 +01:00
Salah Triki 88ff34446b fs: befs: remove in vain variable assignment
There is no need to set *value, it will be overwritten later.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:22 +01:00
Salah Triki a26bc1adc7 fs: befs: Insert NULL inode to dentry
As VFS expects, lookup inserts NULL inode to dentry when the named inode
does not exist.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:21 +01:00
Salah Triki d70ee4f2de fs: befs: Remove useless calls to brelse in befs_find_brun_dblindirect
The calls to brelse are useless since dbl_indir_block and indir_block
are NULL.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:20 +01:00
Salah Triki 4bb594329a fs: befs: Coding style fix
Constant has to be capitalized.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:20 +01:00
Salah Triki d84e4a5a09 fs: befs: Remove redundant validation from befs_find_brun_direct
The only caller of befs_find_brun_direct is befs_fblock2brun, which
already validates that the block is within the range of direct blocks.
So remove the duplicate validation.

Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:19 +01:00
Luis de Bethencourt 2dfa8a6e56 befs: fix typo in befs_bt_read_node documentation
Fixing a grammatical error in the documentation.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:18 +01:00
Luis de Bethencourt cfe0cb20e6 befs: in memory free_node_ptr and max_size never read
The only place the values of free_node_ptr and max_size are read is in
befs_dump_index_entry(), which both times it is called, it is passed the on
disk superblock. Removing assignment of unused values.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:17 +01:00
Luis de Bethencourt 4c3897cce0 befs: make consistent use of befs_error()
befs_error() is used in potential errors that could happen in befs to
provide informational log messages. befs_debug() is silent when
CONFIG_BEFS_DEBUG=no, and very verbose when switched on, which is why it is
used for general debugging but not for errors.

Fix a few cases where the befs debug utility usage isn't following the
expected pattern. To make sure we have consistent information in the logs.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:16 +01:00
Luis de Bethencourt 9ae51a32b1 befs: use simpler while loop
Replace goto with simpler while loop to make befs_readdir() more readable.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:16 +01:00
Luis de Bethencourt 50858ef96d befs: remove constant variable
Use macro directly instead of via assigning it to an unchanging variable.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Acked-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:15 +01:00
Luis de Bethencourt f7769f9cf9 befs: avoid dereferencing dentry twice
No need to dereference dentry twice to get the name when we already have
it stored in a local variable.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:15 +01:00
Luis de Bethencourt 39dcfd3b34 fs: befs: remove comment that confuses kernel-doc
This comment with a mysterious unfinished line confuses the kernel-doc
system since, because it starts with /**, it thinks it is documenting a
function.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-10-08 10:01:14 +01:00
Luis de Bethencourt a64998504e fs: befs: check silent flag before logging error
Log error only when silent flag is not set.

Fixes: dbe6460388bc ("fs/befs/linuxvfs.c: check silent flag before logging errors")
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Acked-by: Salah Triki <salah.triki@gmail.com>
2016-10-08 10:01:13 +01:00
Salah Triki f7f675406b fs: befs: replace befs_bread by sb_bread
Since befs_bread merely calls sb_bread, replace it by sb_bread.

Link: http://lkml.kernel.org/r/1466800258-4542-1-git-send-email-salah.triki@gmail.com
Signed-off-by: Salah Triki <salah.triki@gmail.com>
Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-08 10:01:12 +01:00
Luis de Bethencourt f0f2536fe3 befs: remove unused functions
befs_iaddr_is_empty() and befs_brun_size() are unused. Remove them.

Link: http://lkml.kernel.org/r/1465700235-22881-3-git-send-email-luisbg@osg.samsung.com
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-08 10:01:12 +01:00
Luis de Bethencourt 10145d6116 befs: fix function name in documentation
Documentation of function befs_load_cb() lists it as load_befs_sb().  Fix
the misnomer.

Link: http://lkml.kernel.org/r/1465700235-22881-2-git-send-email-luisbg@osg.samsung.com
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-08 10:01:11 +01:00
Luis de Bethencourt 173b066f58 befs: check return of sb_min_blocksize
Confirm sb_min_blocksize() succeeded before continuing.

Link: http://lkml.kernel.org/r/1465700235-22881-1-git-send-email-luisbg@osg.samsung.com
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-08 10:01:10 +01:00
Salah Triki c08f1cb627 fs: befs: remove useless pr_err in befs_init_inodecache()
Remove pr_err since kmem_cache_create log error and dump stack.

Link: http://lkml.kernel.org/r/e6d03cbc9542495dc6174b59e32fcd41c1393cfc.1464226521.git.salah.triki@acm.org
Signed-off-by: Salah Triki <salah.triki@acm.org>
2016-10-08 10:01:10 +01:00
Salah Triki e808792784 fs/befs/linuxvfs.c: remove useless befs_error
Remove befs_error since when kmalloc fails there is a generic out of
memory and stack dump.

Link: http://lkml.kernel.org/r/3de4d388d98bbb570462a5eb8e64623e17fb5d74.1464226521.git.salah.triki@acm.org
Signed-off-by: Salah Triki <salah.triki@acm.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-08 10:01:10 +01:00
Salah Triki c625426fb6 fs/befs/linuxvfs.c: remove useless pr_err in befs_fill_super()
Remove pr_err since when kzalloc fails there is a generic out of memory
and stack dump.

Link: http://lkml.kernel.org/r/c5a7f2d42ec0fc8465c118248e88cd221c483391.1464226521.git.salah.triki@acm.org
Signed-off-by: Salah Triki <salah.triki@acm.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-08 10:01:09 +01:00
Salah Triki dceee2e230 fs/befs/linuxvfs.c: check silent flag before logging errors
Log errors only when silent flag is not set.

Link: http://lkml.kernel.org/r/d400aaf5a7430de79bd956e40ec075fb1cb08474.1464226521.git.salah.triki@acm.org
Signed-off-by: Salah Triki <salah.triki@acm.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-08 10:01:08 +01:00
Salah Triki 30982583e4 fs/befs/linuxvfs.c: move useless assignment
Control is transfered to unacquire_none when sb->s_fs_info is equal to
NULL, so the assignment to NULL is useless and it is moved above
unacquire_none.

Link: http://lkml.kernel.org/r/ed41da113fc693c7daa4e8813ca04cc766ddfc05.1464226521.git.salah.triki@acm.org
Signed-off-by: Salah Triki <salah.triki@acm.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-08 10:01:08 +01:00
Linus Torvalds b66484cd74 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - fsnotify updates

 - ocfs2 updates

 - all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  console: don't prefer first registered if DT specifies stdout-path
  cred: simpler, 1D supplementary groups
  CREDITS: update Pavel's information, add GPG key, remove snail mail address
  mailmap: add Johan Hovold
  .gitattributes: set git diff driver for C source code files
  uprobes: remove function declarations from arch/{mips,s390}
  spelling.txt: "modeled" is spelt correctly
  nmi_backtrace: generate one-line reports for idle cpus
  arch/tile: adopt the new nmi_backtrace framework
  nmi_backtrace: do a local dump_stack() instead of a self-NMI
  nmi_backtrace: add more trigger_*_cpu_backtrace() methods
  min/max: remove sparse warnings when they're nested
  Documentation/filesystems/proc.txt: add more description for maps/smaps
  mm, proc: fix region lost in /proc/self/smaps
  proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
  proc: add LSM hook checks to /proc/<tid>/timerslack_ns
  proc: relax /proc/<tid>/timerslack_ns capability requirements
  meminfo: break apart a very long seq_printf with #ifdefs
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  proc: faster /proc/*/status
  ...
2016-10-07 21:38:00 -07:00
Andreas Gruenbacher fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Alexey Dobriyan 81243eacfa cred: simpler, 1D supplementary groups
Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes).  But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.

2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).

All of the above is unnecessary.  Switch to the usual
trailing-zero-len-array scheme.  Memory is allocated with
kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).

Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
think kernel can handle such allocation.

On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!

Nice side effects:

 - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,

 - fix little mess in net/ipv4/ping.c
   should have been using GROUP_AT macro but this point becomes moot,

 - aux group allocation is persistent and should be accounted as such.

Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Robert Ho 855af072b6 mm, proc: fix region lost in /proc/self/smaps
Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
more detailed info please refer to:

   https://bugzilla.redhat.com/show_bug.cgi?id=1365721

Actually, this bug is not only for NVDIMM/DAX but also for any other
file systems.  This simple test case abstracted from nvml can easily
reproduce this bug in common environment:

-------------------------- testcase.c -----------------------------

int
is_pmem_proc(const void *addr, size_t len)
{
        const char *caddr = addr;

        FILE *fp;
        if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
                printf("!/proc/self/smaps");
                return 0;
        }

        int retval = 0;         /* assume false until proven otherwise */
        char line[PROCMAXLEN];  /* for fgets() */
        char *lo = NULL;        /* beginning of current range in smaps file */
        char *hi = NULL;        /* end of current range in smaps file */
        int needmm = 0;         /* looking for mm flag for current range */
        while (fgets(line, PROCMAXLEN, fp) != NULL) {
                static const char vmflags[] = "VmFlags:";
                static const char mm[] = " wr";

                /* check for range line */
                if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
                        if (needmm) {
                                /* last range matched, but no mm flag found */
                                printf("never found mm flag.\n");
                                break;
                        } else if (caddr < lo) {
                                /* never found the range for caddr */
                                printf("#######no match for addr %p.\n", caddr);
                                break;
                        } else if (caddr < hi) {
                                /* start address is in this range */
                                size_t rangelen = (size_t)(hi - caddr);

                                /* remember that matching has started */
                                needmm = 1;

                                /* calculate remaining range to search for */
                                if (len > rangelen) {
                                        len -= rangelen;
                                        caddr += rangelen;
                                        printf("matched %zu bytes in range "
                                                "%p-%p, %zu left over.\n",
                                                        rangelen, lo, hi, len);
                                } else {
                                        len = 0;
                                        printf("matched all bytes in range "
                                                        "%p-%p.\n", lo, hi);
                                }
                        }
                } else if (needmm && strncmp(line, vmflags,
                                        sizeof(vmflags) - 1) == 0) {
                        if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
                                printf("mm flag found.\n");
                                if (len == 0) {
                                        /* entire range matched */
                                        retval = 1;
                                        break;
                                }
                                needmm = 0;     /* saw what was needed */
                        } else {
                                /* mm flag not set for some or all of range */
                                printf("range has no mm flag.\n");
                                break;
                        }
                }
        }

        fclose(fp);

        printf("returning %d.\n", retval);
        return retval;
}

void *Addr;
size_t Size;

/*
 * worker -- the work each thread performs
 */
static void *
worker(void *arg)
{
        int *ret = (int *)arg;
        *ret =  is_pmem_proc(Addr, Size);
        return NULL;
}

int main(int argc, char *argv[])
{
        if (argc <  2 || argc > 3) {
                printf("usage: %s file [env].\n", argv[0]);
                return -1;
        }

        int fd = open(argv[1], O_RDWR);

        struct stat stbuf;
        fstat(fd, &stbuf);

        Size = stbuf.st_size;
        Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

        close(fd);

        pthread_t threads[NTHREAD];
        int ret[NTHREAD];

        /* kick off NTHREAD threads */
        for (int i = 0; i < NTHREAD; i++)
                pthread_create(&threads[i], NULL, worker, &ret[i]);

        /* wait for all the threads to complete */
        for (int i = 0; i < NTHREAD; i++)
                pthread_join(threads[i], NULL);

        /* verify that all the threads return the same value */
        for (int i = 1; i < NTHREAD; i++) {
                if (ret[0] != ret[i]) {
                        printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
                                ret[0], ret[i]);
                }
        }

        printf("%d", ret[0]);
        return 0;
}

It failed as some threads can not find the memory region in
"/proc/self/smaps" which is allocated in the main process

It is caused by proc fs which uses 'file->version' to indicate the VMA that
is the last one has already been handled by read() system call. When the
next read() issues, it uses the 'version' to find the VMA, then the next
VMA is what we want to handle, the related code is as follows:

        if (last_addr) {
                vma = find_vma(mm, last_addr);
                if (vma && (vma = m_next_vma(priv, vma)))
                        return vma;
        }

However, VMA will be lost if the last VMA is gone, e.g:

The process VMA list is A->B->C->D

CPU 0                                  CPU 1
read() system call
   handle VMA B
   version = B
return to userspace

                                   unmap VMA B

issue read() again to continue to get
the region info
   find_vma(version) will get VMA C
   m_next_vma(C) will get VMA D
   handle D
   !!! VMA C is lost !!!

In order to fix this bug, we make 'file->version' indicate the end address
of the current VMA.  m_start will then look up a vma which with vma_start
< last_vm_end and moves on to the next vma if we found the same or an
overlapping vma.  This will guarantee that we will not miss an exclusive
vma but we can still miss one if the previous vma was shrunk.  This is
acceptable because guaranteeing "never miss a vma" is simply not feasible.
User has to cope with some inconsistencies if the file is not read in one
go.

[mhocko@suse.com: changelog fixes]
Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Robert Hu <robert.hu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
John Stultz 4b2bd5fec0 proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS)
to capable(CAP_SYS_NICE), I missed that ptrace_my_access succeeds when p
== current, but the CAP_SYS_NICE doesn't.

Thus while the previous commit was intended to loosen the needed
privileges to modify a processes timerslack, it needlessly restricted a
task modifying its own timerslack via the proc/<tid>/timerslack_ns
(which is permitted also via the PR_SET_TIMERSLACK method).

This patch corrects this by checking if p == current before checking the
CAP_SYS_NICE value.

This patch applies on top of my two previous patches currently in -mm

Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
John Stultz 904763e1fb proc: add LSM hook checks to /proc/<tid>/timerslack_ns
As requested, this patch checks the existing LSM hooks
task_getscheduler/task_setscheduler when reading or modifying the task's
timerslack value.

Previous versions added new get/settimerslack LSM hooks, but since they
checked the same PROCESS__SET/GETSCHED values as existing hooks, it was
suggested we just use the existing ones.

Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: James Morris <jmorris@namei.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
John Stultz 7abbaf9404 proc: relax /proc/<tid>/timerslack_ns capability requirements
When an interface to allow a task to change another tasks timerslack was
first proposed, it was suggested that something greater then
CAP_SYS_NICE would be needed, as a task could be delayed further then
what normally could be done with nice adjustments.

So CAP_SYS_PTRACE was adopted instead for what became the
/proc/<tid>/timerslack_ns interface.  However, for Android (where this
feature originates), giving the system_server CAP_SYS_PTRACE would allow
it to observe and modify all tasks memory.  This is considered too high
a privilege level for only needing to change the timerslack.

After some discussion, it was realized that a CAP_SYS_NICE process can
set a task as SCHED_FIFO, so they could fork some spinning processes and
set them all SCHED_FIFO 99, in effect delaying all other tasks for an
infinite amount of time.

So as a CAP_SYS_NICE task can already cause trouble for other tasks,
using it as a required capability for accessing and modifying
/proc/<tid>/timerslack_ns seems sufficient.

Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
removes CAP_SYS_PTRACE, simplifying some of the code flow as well.

This is technically an ABI change, but as the feature just landed in
4.6, I suspect no one is yet using it.

Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Nick Kralevich <nnk@google.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Joe Perches e16e2d8e14 meminfo: break apart a very long seq_printf with #ifdefs
Use a specific routine to emit most lines so that the code is easier to
read and maintain.

akpm:
   text    data     bss     dec     hex filename
   2976       8       0    2984     ba8 fs/proc/meminfo.o before
   2669       8       0    2677     a75 fs/proc/meminfo.o after

Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Joe Perches 75ba1d07fd seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
Allow some seq_puts removals by taking a string instead of a single
char.

[akpm@linux-foundation.org: update vmstat_show(), per Joe]
Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Joe Perches <joe@perches.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Alexey Dobriyan f7a5f132b4 proc: faster /proc/*/status
top(1) opens the following files for every PID:

	/proc/*/stat
	/proc/*/statm
	/proc/*/status

This patch switches /proc/*/status away from seq_printf().
The result is 13.5% speedup.

Benchmark is open("/proc/self/status")+read+close 1.000.000 million times.

				BEFORE
$ perf stat -r 10 taskset -c 3 ./proc-self-status

 Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

      10748.474301      task-clock (msec)         #    0.954 CPUs utilized            ( +-  0.91% )
                12      context-switches          #    0.001 K/sec                    ( +-  1.09% )
                 1      cpu-migrations            #    0.000 K/sec
               104      page-faults               #    0.010 K/sec                    ( +-  0.45% )
    37,424,127,876      cycles                    #    3.482 GHz                      ( +-  0.04% )
     8,453,010,029      stalled-cycles-frontend   #   22.59% frontend cycles idle     ( +-  0.12% )
     3,747,609,427      stalled-cycles-backend    #  10.01% backend cycles idle       ( +-  0.68% )
    65,632,764,147      instructions              #    1.75  insn per cycle
                                                  #    0.13  stalled cycles per insn  ( +-  0.00% )
    13,981,324,775      branches                  # 1300.773 M/sec                    ( +-  0.00% )
       138,967,110      branch-misses             #    0.99% of all branches          ( +-  0.18% )

      11.263885428 seconds time elapsed                                          ( +-  0.04% )
      ^^^^^^^^^^^^

				AFTER
$ perf stat -r 10 taskset -c 3 ./proc-self-status

 Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

       9010.521776      task-clock (msec)         #    0.925 CPUs utilized            ( +-  1.54% )
                11      context-switches          #    0.001 K/sec                    ( +-  1.54% )
                 1      cpu-migrations            #    0.000 K/sec                    ( +- 11.11% )
               103      page-faults               #    0.011 K/sec                    ( +-  0.60% )
    32,352,310,603      cycles                    #    3.591 GHz                      ( +-  0.07% )
     7,849,199,578      stalled-cycles-frontend   #   24.26% frontend cycles idle     ( +-  0.27% )
     3,269,738,842      stalled-cycles-backend    #  10.11% backend cycles idle       ( +-  0.73% )
    56,012,163,567      instructions              #    1.73  insn per cycle
                                                  #    0.14  stalled cycles per insn  ( +-  0.00% )
    11,735,778,795      branches                  # 1302.453 M/sec                    ( +-  0.00% )
        98,084,459      branch-misses             #    0.84% of all branches          ( +-  0.28% )

       9.741247736 seconds time elapsed                                          ( +-  0.07% )
       ^^^^^^^^^^^

Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
zhong jiang 72e2936c04 mm: remove unnecessary condition in remove_inode_hugepages
When the huge page is added to the page cahce (huge_add_to_page_cache),
the page private flag will be cleared.  since this code
(remove_inode_hugepages) will only be called for pages in the page
cahce, PagePrivate(page) will always be false.

The patch remove the code without any functional change.

Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Yisheng Xie 461a718432 mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE
Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic
page.  No functional change with this patch.

Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Huang Ying 8cd797887a mm: remove page_file_index
After using the offset of the swap entry as the key of the swap cache,
the page_index() becomes exactly same as page_file_index().  So the
page_file_index() is removed and the callers are changed to use
page_index() instead.

Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Aaron Lu 6fcb52a56f thp: reduce usage of huge zero page's atomic counter
The global zero page is used to satisfy an anonymous read fault.  If
THP(Transparent HugePage) is enabled then the global huge zero page is
used.  The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.

CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults.  This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.

To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
the global counter in two cases:

 1 The first time it uses the global huge zero page;
 2 The time when mm_user of its mm_struct reaches zero.

Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away.  With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.

And with the use of mm_user, the kthread is not eligible to use huge
zero page either.  Since no kthread is using huge zero page today, there
is no difference after applying this patch.  But if that is not desired,
I can change it to when mm_count reaches zero.

Case used for test on Haswell EP:

  usemem -n 72 --readonly -j 0x200000 100G

Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.

  CPU cycles from perf report for base commit:
      54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
  CPU cycles from perf report for this commit:
       0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page

Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.

Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.

Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
James Morse 0f30206bf2 fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious
Trying to walk all of virtual memory requires architecture specific
knowledge.  On x86_64, addresses must be sign extended from bit 48,
whereas on arm64 the top VA_BITS of address space have their own set of
page tables.

clear_refs_write() calls walk_page_range() on the range 0 to ~0UL, it
provides a test_walk() callback that only expects to be walking over
VMAs.  Currently walk_pmd_range() will skip memory regions that don't
have a VMA, reporting them as a hole.

As this call only expects to walk user address space, make it walk 0 to
'highest_vm_end'.

Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Toshi Kani dbe6ec8156 ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings
To support DAX pmd mappings with unmodified applications, filesystems
need to align an mmap address by the pmd size.

Call thp_get_unmapped_area() from f_op->get_unmapped_area.

Note, there is no change in behavior for a non-DAX file.

Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Joseph Qi 48e509ece9 ocfs2: fix undefined struct variable in inode.h
The extern struct variable ocfs2_inode_cache is not defined. It meant to
use ocfs2_inode_cachep defined in super.c, I think. Fortunately it is
not used anywhere now, so no impact actually. Clean it up to fix this
mistake.

Link: http://lkml.kernel.org/r/57E1E49D.8050503@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Eric Ren <zren@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Bhaktipriya Shridhar 055fdcff35 fs/ocfs2/dlm: remove deprecated create_singlethread_workqueue()
The workqueue "dlm_worker" queues a single work item &dlm->dispatched_work
and thus it doesn't require execution ordering.  Hence, alloc_workqueue
has been used to replace the deprecated create_singlethread_workqueue
instance.

The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure.

Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.

Link: http://lkml.kernel.org/r/2b5ad8d6688effe1a9ddb2bc2082d26fbbe00302.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Bhaktipriya Shridhar 44be975691 fs/ocfs2/super: remove deprecated create_singlethread_workqueue()
The workqueue "ocfs2_wq" queues multiple work items viz
&osb->la_enable_wq, &journal->j_recovery_work, &os->os_orphan_scan_work,
&osb->osb_truncate_log_wq which require strict execution ordering.  Hence,
an ordered dedicated workqueue has been used.

WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure because the workqueue is being used on a memory reclaim path.

Link: http://lkml.kernel.org/r/66279de510a7f4cfc6e386d99b7e04b3f65fb11b.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Bhaktipriya Shridhar bf940776c0 fs/ocfs2/cluster: remove deprecated create_singlethread_workqueue()
The workqueue "o2net_wq" queues multiple work items viz
&old_sc->sc_shutdown_work, &sc->sc_rx_work, &sc->sc_connect_work which
require strict execution ordering.  Hence, an ordered dedicated
workqueue has been used.

WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.

Link: http://lkml.kernel.org/r/ddc12e5766c79ba26f8a00d98049107f8a1d4866.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Bhaktipriya Shridhar 0b41be0763 fs/ocfs2/dlmfs: remove deprecated create_singlethread_workqueue()
The workqueue "user_dlm_worker" queues a single work item
&lockres->l_work per user_lock_res instance and so it doesn't require
execution ordering.  Hence, alloc_workqueue has been used to replace the
deprecated create_singlethread_workqueue instance.

The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure.

Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.

Link: http://lkml.kernel.org/r/9748136d3a3b18138ad1d6ba708367aa1fe9f98c.1472590094.git.bhaktipriya96@gmail.com
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara ed2726406c fsnotify: clean up spinlock assertions
Use assert_spin_locked() macro instead of hand-made BUG_ON statements.

Link: http://lkml.kernel.org/r/1474537439-18919-1-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Suggested-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara 0b1b86527d fanotify: fix possible false warning when freeing events
When freeing permission events by fsnotify_destroy_event(), the warning
WARN_ON(!list_empty(&event->list)); may falsely hit.

This is because although fanotify_get_response() saw event->response
set, there is nothing to make sure the current CPU also sees the removal
of the event from the list.  Add proper locking around the WARN_ON() to
avoid the false warning.

Link: http://lkml.kernel.org/r/1473797711-14111-7-git-send-email-jack@suse.cz
Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara 073f65522a fanotify: use notification_lock instead of access_lock
Fanotify code has its own lock (access_lock) to protect a list of events
waiting for a response from userspace.

However this is somewhat awkward as the same list_head in the event is
protected by notification_lock if it is part of the notification queue
and by access_lock if it is part of the fanotify private queue which
makes it difficult for any reliable checks in the generic code.  So make
fanotify use the same lock - notification_lock - for protecting its
private event list.

Link: http://lkml.kernel.org/r/1473797711-14111-6-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara c21dbe20f6 fsnotify: convert notification_mutex to a spinlock
notification_mutex is used to protect the list of pending events.  As such
there's no reason to use a sleeping lock for it.  Convert it to a
spinlock.

[jack@suse.cz: fixed version]
  Link: http://lkml.kernel.org/r/1474031567-1831-1-git-send-email-jack@suse.cz
Link: http://lkml.kernel.org/r/1473797711-14111-5-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara 1404ff3cc3 fsnotify: drop notification_mutex before destroying event
fsnotify_flush_notify() and fanotify_release() destroy notification
event while holding notification_mutex.

The destruction of fanotify event includes a path_put() call which may
end up calling into a filesystem to delete an inode if we happen to be
the last holders of dentry reference which happens to be the last holder
of inode reference.

That in turn may violate lock ordering for some filesystems since
notification_mutex is also acquired e. g. during write when generating
fanotify event.

Also this is the only thing that forces notification_mutex to be a
sleeping lock.  So drop notification_mutex before destroying a
notification event.

Link: http://lkml.kernel.org/r/1473797711-14111-4-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Al Viro 41fefa36be Merge remote-tracking branch 'fuse/xattr' into work.xattr 2016-10-07 20:10:55 -04:00
Andreas Gruenbacher 6c6ef9f26e xattr: Stop calling {get,set,remove}xattr inode operations
All filesystems that support xattrs by now do so via xattr handlers.
They all define sb->s_xattr, and their getxattr, setxattr, and
removexattr inode operations use the generic inode operations.  On
filesystems that don't support xattrs, the xattr inode operations are
all NULL, and sb->s_xattr is also NULL.

This means that we can remove the getxattr, setxattr, and removexattr
inode operations and directly call the generic handlers, or better,
inline expand those handlers into fs/xattr.c.

Filesystems that do not support xattrs on some inodes should clear the
IOP_XATTR i_opflags flag in those inodes.  (Right now, some filesystems
have checks to disable xattrs on some inodes in the ->list, ->get, and
->set xattr handler operations instead.)  The IOP_XATTR flag is
automatically cleared in inodes of filesystems that don't have xattr
support.

In orangefs, symlinks do have a setxattr iop but no getxattr iop.  Add a
check for symlinks to orangefs_inode_getxattr to preserve the current,
weird behavior; that check may not be necessary though.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 20:10:44 -04:00
Andreas Gruenbacher bf3ee71363 vfs: Check for the IOP_XATTR flag in listxattr
When an inode doesn't support xattrs, turn listxattr off as well.

(When xattrs are "turned off", the VFS still passes security xattr
operations through to security modules, which can still expose inode
security labels that way.)

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 20:10:44 -04:00
Andreas Gruenbacher 5d6c31910b xattr: Add __vfs_{get,set,remove}xattr helpers
Right now, various places in the kernel check for the existence of
getxattr, setxattr, and removexattr inode operations and directly call
those operations.  Switch to helper functions and test for the IOP_XATTR
flag instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 20:10:44 -04:00
Andreas Gruenbacher f5c2443837 libfs: Use IOP_XATTR flag for empty directory handling
Instead of special xattr inode operations, use the IOP_XATTR inode
operations flag for the special libfs empty directories.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 20:10:43 -04:00
Andreas Gruenbacher 5f6e59ae82 vfs: Use IOP_XATTR flag for bad-inode handling
With this change, all the xattr handler based operations will produce an
-EIO result for bad inodes, and we no longer only depend on inode->i_op
to be set to bad_inode_ops.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 20:10:43 -04:00
Andreas Gruenbacher d0a5b995a3 vfs: Add IOP_XATTR inode operations flag
The IOP_XATTR inode operations flag in inode->i_opflags indicates that
the inode has xattr support.  The flag is automatically set by
new_inode() on filesystems with xattr support (where sb->s_xattr is
defined), and cleared otherwise.  Filesystems can explicitly clear it
for inodes that should not have xattr support.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 20:10:42 -04:00
Andreas Gruenbacher b6ba11773d vfs: Move xattr_resolve_name to the front of fs/xattr.c
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 20:10:42 -04:00
Dan Williams e476f94482 Merge branch 'for-4.9/dax' into libnvdimm-for-next 2016-10-07 16:46:30 -07:00
Linus Torvalds d1f5323370 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS splice updates from Al Viro:
 "There's a bunch of branches this cycle, both mine and from other folks
  and I'd rather send pull requests separately.

  This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
  (and introduction of such). Gets rid of a lot of code in fs/splice.c
  and elsewhere; there will be followups, but these are for the next
  cycle...  Some pipe/splice-related cleanups from Miklos in the same
  branch as well"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pipe: fix comment in pipe_buf_operations
  pipe: add pipe_buf_steal() helper
  pipe: add pipe_buf_confirm() helper
  pipe: add pipe_buf_release() helper
  pipe: add pipe_buf_get() helper
  relay: simplify relay_file_read()
  switch default_file_splice_read() to use of pipe-backed iov_iter
  switch generic_file_splice_read() to use of ->read_iter()
  new iov_iter flavour: pipe-backed
  fuse_dev_splice_read(): switch to add_to_pipe()
  skb_splice_bits(): get rid of callback
  new helper: add_to_pipe()
  splice: lift pipe_lock out of splice_to_pipe()
  splice: switch get_iovec_page_array() to iov_iter
  splice_to_pipe(): don't open-code wakeup_pipe_readers()
  consistent treatment of EFAULT on O_DIRECT read/write
2016-10-07 15:36:58 -07:00
Linus Torvalds 2eee010d09 Lots of bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJX9pA6AAoJEPL5WVaVDYGj7fwH/0YcdQWBg0O5d7iXFnTcimh9
 fiYkqKniBWQhgBAOFPMoNPRIW4tyeQmTtu8Rywx2Hr+v4lzJvuOaT18NDANdq/pp
 u5eDrnJ4R+uqPJlgxVOzopLVJ6I2glgSSRdvAKYxwTYcv8F88ObzVfsJ4M415gPq
 cbEKF+JT3l5hTGENR5sqmYvHYaNfOFkOqt4gulPtgk1eshy+BH/05M+qBSeA5a6k
 srdon0pFRoUV68m+T4G8FqOZxdybeT5Yx6X0GJf0eQJoX7IaiQTPcDrXzlrbDBbN
 rrzbpwsDeDKtgSOckbarCBroZKdToHFekfnOJ7IPWYq8IwYTSnZKFCWIRKO6z38=
 =IvhS
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Lots of bug fixes and cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: remove unused variable
  ext4: use journal inode to determine journal overhead
  ext4: create function to read journal inode
  ext4: unmap metadata when zeroing blocks
  ext4: remove plugging from ext4_file_write_iter()
  ext4: allow unlocked direct IO when pages are cached
  ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
  fscrypto: use standard macros to compute length of fname ciphertext
  ext4: do not unnecessarily null-terminate encrypted symlink data
  ext4: release bh in make_indexed_dir
  ext4: Allow parallel DIO reads
  ext4: allow DAX writeback for hole punch
  jbd2: fix lockdep annotation in add_transaction_credits()
  blockgroup_lock.h: simplify definition of NR_BG_LOCKS
  blockgroup_lock.h: remove debris from bgl_lock_ptr() conversion
  fscrypto: make filename crypto functions return 0 on success
  fscrypto: rename completion callbacks to reflect usage
  fscrypto: remove unnecessary includes
  fscrypto: improved validation when loading inode encryption metadata
  ext4: fix memory leak when symlink decryption fails
  ...
2016-10-07 15:15:33 -07:00
Linus Torvalds 513a4befae Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main pull request for block layer changes in 4.9.

  As mentioned at the last merge window, I've changed things up and now
  do just one branch for core block layer changes, and driver changes.
  This avoids dependencies between the two branches. Outside of this
  main pull request, there are two topical branches coming as well.

  This pull request contains:

   - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

   - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
     Followup dependency fix from Geert.

   - General fixes from Bart, Baoyou, Guoqing, and Linus W.

   - CFQ async write starvation fix from Glauber.

   - Add supprot for delayed kick of the requeue list, from Mike.

   - Pull out the scalable bitmap code from blk-mq-tag.c and make it
     generally available under the name of sbitmap. Only blk-mq-tag uses
     it for now, but the blk-mq scheduling bits will use it as well.
     From Omar.

   - bdev thaw error progagation from Pierre.

   - Improve the blk polling statistics, and allow the user to clear
     them. From Stephen.

   - Set of minor cleanups from Christoph in block/blk-mq.

   - Set of cleanups and optimizations from me for block/blk-mq.

   - Various nvme/nvmet/nvmeof fixes from the various folks"

* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
  fs/block_dev.c: return the right error in thaw_bdev()
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme/scsi: Remove power management support
  nvmet: Make dsm number of ranges zero based
  nvmet: Use direct IO for writes
  admin-cmd: Added smart-log command support.
  nvme-fabrics: Add host_traddr options field to host infrastructure
  nvme-fabrics: revise host transport option descriptions
  nvme-fabrics: rework nvmf_get_address() for variable options
  nbd: use BLK_MQ_F_BLOCKING
  blkcg: Annotate blkg_hint correctly
  cfq: fix starvation of asynchronous writes
  blk-mq: add flag for drivers wanting blocking ->queue_rq()
  blk-mq: remove non-blocking pass in blk_mq_map_request
  blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
  block: export bio_free_pages to other modules
  lightnvm: propagate device_add() error code
  lightnvm: expose device geometry through sysfs
  lightnvm: control life of nvm_dev in driver
  blk-mq: register device instead of disk
  ...
2016-10-07 14:42:05 -07:00
Anna Schumaker 29ae7f9dc2 NFSD: Implement the COPY call
I only implemented the sync version of this call, since it's the
easiest.  I can simply call vfs_copy_range() and have the vfs do the
right thing for the filesystem being exported.

Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-07 14:54:25 -04:00
J. Bruce Fields 42e616167a nfsd: handle EUCLEAN
Eric Sandeen reports that xfs can return this if filesystem corruption
prevented completing the operation.

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-07 14:54:19 -04:00
J. Bruce Fields ff30f08c32 nfsd: only WARN once on unmapped errors
No need to spam the logs here.

The only drawback is losing information if we ever encounter two
different unmapped errors, but in practice we've rarely see even one.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-07 14:53:33 -04:00
Andreas Gruenbacher 4b899da50d ecryptfs: Switch to generic xattr handlers
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Andreas Gruenbacher bba0bd31b1 sockfs: Get rid of getxattr iop
If we allow pseudo-filesystems created with mount_pseudo to have xattr
handlers, we can replace sockfs_getxattr with a sockfs_xattr_get handler
to use the xattr handler name parsing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Andreas Gruenbacher e72a1a8b3a kernfs: Switch to generic xattr handlers
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Andreas Gruenbacher b8020eff7f hfs: Switch to generic xattr handlers
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Andreas Gruenbacher 6966f842c0 jffs2: Remove jffs2_{get,set,remove}xattr macros
When CONFIG_JFFS2_FS_XATTR is off, jffs2_xattr_handlers is defined as
NULL. With sb->s_xattr == NULL, the generic_{get,set,remove}xattr
functions produce the same result as setting the {get,set,remove}xattr
inode operations to NULL, so there is no need for these macros.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Andreas Gruenbacher 5d18cbf16c xattr: Remove unnecessary NULL attribute name check
When NULL is passed to one of the xattr system calls as the attribute
name, copying that name from user space already fails with -EFAULT;
xattr_resolve_name is never called with a NULL attribute name.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
David S. Miller 0d818c2889 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV/YbX/Sw1s6N8H32AQLwDg//W0fGt3OSFrOpEQHtKUSCWO3m4RRJgn/m
 Xbaz8ZO6Z8qmdkM267yrLCAp5hx0E77WP46l7V3B9p9wX0vA+P2QO7K5Kis6sNaY
 aceCCAKHqvUSiZa8tQ2aGpbxxa8qICbjHjiCg0lFABiGDWGRnIBNW8qV5LyGKZkI
 7b3i9MGBkGLdZxetcJd498j6Gck9cuqOZDnfqgb0Q5pAtsjVM3EZXXsHO1ZD5WHG
 GUieQgY9Tp0rlVKjlLdR94fW/acMZYs0c5RO1uzGAoUeBALnSUS5+bSRSlGp1KOM
 C7r5/dK4FvkZY+xuS5pLXoI8WpsA4EDpBINGdO6L03wTJ10zx5y5CdTTl7G6Y53R
 BpmY8SDFmWYqpJs+gZiWYIlbnBQ+b0Mu7p7rKeSJS/q0+YEVwJlz3UFo2k1O+J3A
 ovpxP5E6IvOjlKF21Zs1hOR2m/sfR42v/TfwpApImSeY2k2m8vzyfXBJP4ClAk29
 PGYOOqMLYwzIjLwdapDxL3ccjKvOwYeClCs1t6bKva2XCrF1ybtBnAQDxFp6KzXi
 p/y/QkHnseSeYct8mElDopRekbwoqa9YPwXn7lagvQhNxqNGIR4HT82IeohI/Dqe
 GtQbjSPc3uebk5lRf535kTZixu+l5/yKQeuRTsfoIgsMjVlMdqS9dUAphzI4IXLp
 FE0q49uLTVI=
 =+Jr3
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20161004' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes

This set of patches contains a bunch of fixes:

 (1) Fix an oops on incoming call to a local endpoint without a bound
     service.

 (2) Only ping for a lost reply in a client call (this is inapplicable to
     service calls).

 (3) Fix maybe uninitialised variable warnings in the ACK/ABORT sending
     function by splitting it.

 (4) Fix loss of PING RESPONSE ACKs due to them being subsumed by PING ACK
     generation.

 (5) OpenAFS improperly terminates calls it makes as a client under some
     circumstances by not fully hard-ACK'ing the last DATA packets.  This
     is alleviated by a new call appearing on the same channel implicitly
     completing the previous call on that channel.  Handle this implicit
     completion.

 (6) Properly handle expiry of service calls due to the aforementioned
     improper termination with no follow up call to implicitly complete it:

     (a) The call's background processor needs to be queued to complete the
     	 call, send an abort and notify the socket.

     (b) The call's background processor needs to notify the socket (or the
     	 kernel service) when it has completed the call.

     (c) A negative error code must thence be returned to the kernel
     	 service so that it knows the call died.

     (d) The AFS filesystem must detect the fatal error and end the call.

 (7) Must produce a DELAY ACK when the actual service operation takes a
     while to process and must cancel the ACK when the reply is ready.

 (8) Don't request an ACK on the last DATA packet of the Tx phase as this
     confuses OpenAFS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 21:04:24 -04:00
Linus Torvalds 4c1fad64ef In this round, we've investigated how f2fs deals with errors given by our fault
injection facility. With this, we could fix several corner cases. And, in order
 to improve the performance, we set inline_dentry by default and enhance the
 exisiting discard issue flow. In addition, we added f2fs_migrate_page for better
 memory management.
 
 = Enhancement =
  - set inline_dentry by default
  - improve discard issue flow
  - add more fault injection cases in f2fs
  - allow block preallocation for encrypted files
  - introduce migrate_page callback function
  - avoid truncating the next direct node block at every checkpoint
 
 = Bug fixes =
  - set page flag correctly between write_begin and write_end
  - missing error handling cases detected by fault injection
  - preallocate blocks regarding to 4KB alignement correctly
  - dentry and filename handling of encryption
  - lost xattrs of directories
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX9sMhAAoJEEAUqH6CSFDSFhQQAIQ99GkcaPmSACHg7JNa9zG1
 wb6eeKIDee+Jr4vu7yQ++T3Ih4lesl2ZLABVaP+IcXlsYWI2VUvlChczuwVSDQMg
 ZiBIR2IwXVVY6Zpb0xuw8C/vmQAJjLZTBV33s+wgsYHaTDobYexVUjkCM+pekrzj
 HBXrk7zx8NHUh41yr/kVQl6FY8KPC6bTtBH23UUp6Vuy1zMZDR/VjL440IyT5Ded
 JRSBX0XSAC9He6n+kZ4S2kMc11kmqZYW7mE4SmiPDzAhGwUv4SmQ1871lK00EOUp
 5EN1Lcy8M7kkl8en2zpZ002R/LDbzRTYjb1fjGJVR+s5Q3piGokxtwAMd0/a7k9v
 wwZm64Bm4NMHBEK6uc/DPWFUmnUySrboTvOCDRunNogPGTjMJwnzAQmTcB/Hdpr5
 oAJQwyAq7ZzkMk3xt0ifeNqy+78uiwfpPEnZDoWqU6zxa+vIyqpFDD+8wEPBO9qo
 JLRocH0Yl7+ExJvi+2W9wMQq9DsxZWR+CwUc8pg68E+1oOEycJ3weAwg5XSVHoNr
 59I2blZQU6P922sH2HVhp0n58xZfYrR7Z3NSsiSfKXeL4gN222dHHT1UfRUmY+A3
 7EeuYm8EUecKV0fZimMcqCCrUXQpubT+qGZfI6NZhu3Qhno1Y8ApxqH8Ieypx7ol
 YD5prZs2qqVKO5LjLV5o
 =crpN
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've investigated how f2fs deals with errors given by
  our fault injection facility. With this, we could fix several corner
  cases. And, in order to improve the performance, we set inline_dentry
  by default and enhance the exisiting discard issue flow. In addition,
  we added f2fs_migrate_page for better memory management.

  Enhancements:
   - set inline_dentry by default
   - improve discard issue flow
   - add more fault injection cases in f2fs
   - allow block preallocation for encrypted files
   - introduce migrate_page callback function
   - avoid truncating the next direct node block at every checkpoint

  Bug fixes:
   - set page flag correctly between write_begin and write_end
   - missing error handling cases detected by fault injection
   - preallocate blocks regarding to 4KB alignement correctly
   - dentry and filename handling of encryption
   - lost xattrs of directories"

* tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (69 commits)
  f2fs: introduce update_ckpt_flags to clean up
  f2fs: don't submit irrelevant page
  f2fs: fix to commit bio cache after flushing node pages
  f2fs: introduce get_checkpoint_version for cleanup
  f2fs: remove dead variable
  f2fs: remove redundant io plug
  f2fs: support checkpoint error injection
  f2fs: fix to recover old fault injection config in ->remount_fs
  f2fs: do fault injection initialization in default_options
  f2fs: remove redundant value definition
  f2fs: support configuring fault injection per superblock
  f2fs: adjust display format of segment bit
  f2fs: remove dirty inode pages in error path
  f2fs: do not unnecessarily null-terminate encrypted symlink data
  f2fs: handle errors during recover_orphan_inodes
  f2fs: avoid gc in cp_error case
  f2fs: should put_page for summary page
  f2fs: assign return value in f2fs_gc
  f2fs: add customized migrate_page callback
  f2fs: introduce cp_lock to protect updating of ckpt_flags
  ...
2016-10-06 15:30:40 -07:00
Linus Torvalds 0fb3ca447d Fix bug in module unloading.
Switch to always using spinlock over cmpxchg.
 Explicitly define pstore backend's supported modes.
 Remove bounce buffer from pmsg.
 Switch to using memcpy_to/fromio().
 Error checking improvements.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJX9XPtAAoJEIly9N/cbcAmRr8P/0NoEX3bzEYgQWVMmsvzlk4U
 /mJ7LUk1+TDL0DOdQ84O1Tr3k6MQ2wRyiGXHjxhQ+aC2ompvmuT+SHEARWlqUZZx
 bEKr3u6nJ5qz1KZ5KwaPOH2EPs2MDq2jh6VvYDFzDGpBYsueDTzRqWJo7VhO/kmq
 MyVCePtEY3m1q4dZtaVLfDMGUEAU8s8j+D5HM9lmoijmzQuKAz3BFRuakasBIYSf
 4ILY0W1E57HAUWsi19jhnYMHOvJt2Gcog0wRUYo4CYmPTyNqud6I5WU6HXeY2F7v
 LtWbhaS2QcpJRAxDEzzKBBSZ4IS6TINYDBBOf/0NEVo2qj4PHyy3f14MCtSo2LDg
 4hoeI0DUgnAmp+NFgp1mQQ25DhR8TZlunBuntGXdeugb5qgT65NYXGtQxnMp5QJd
 s3DsfGW/diKbKfLWQN7GVcHHM/GNe+XM1yl1Q3TyDgSLJVjgAB21r/kPE7AIQzTO
 vDTLcv1w+KLdhDIrHlZqz1IAPATidTA21A7h8JeUWrOSetOhpZ0uXUwBR5+IZhyN
 tG1Wt0ohZAqlhv9ERXYN1g3iRHCCJ26V0LYOKsf80wAAutT8iRO4iH0PKdEYKX+a
 U0TqeX4TIh+4Q3FgnR7efFACzPXrM1RG9qnc1o5OR/BiyXIzLPdrpYYCVpejzj9K
 x6AoYCxRl6qYLJgYUR/H
 =FRpQ
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:

 - Fix bug in module unloading

 - Switch to always using spinlock over cmpxchg

 - Explicitly define pstore backend's supported modes

 - Remove bounce buffer from pmsg

 - Switch to using memcpy_to/fromio()

 - Error checking improvements

* tag 'pstore-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  ramoops: move spin_lock_init after kmalloc error checking
  pstore/ram: Use memcpy_fromio() to save old buffer
  pstore/ram: Use memcpy_toio instead of memcpy
  pstore/pmsg: drop bounce buffer
  pstore/ram: Set pstore flags dynamically
  pstore: Split pstore fragile flags
  pstore/core: drop cmpxchg based updates
  pstore/ramoops: fixup driver removal
2016-10-06 15:16:16 -07:00
Linus Torvalds 3940ee36a0 orangefs: miscellaneous improvements and feature negotiation
miscellaneous improvements
 
     - clean up debugfs globals
     - remove dead code in sysfs
     - reorganize duplicated sysfs attribute structs
     - consolidate sysfs show and store functions
     - remove duplicated sysfs_ops structures
     - describe organization of sysfs
     - make devreq_mutex static
     - g_orangefs_stats -> orangefs_stats for consistency
     - rename most remaining global variables
 
   feature negotiation
 
     enable Orangefs userspace and kernel module to negotiate mutually
     supported features.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJX9Aa0AAoJEM9EDqnrzg2+/JIP/iBDvWIxWvqs1cywLQoWJhPx
 1Lm0p1a7RQEFjYI1AJ3W5U2dr12Drxezgn/a1Yfn/5vX8d868gtcj4uv8hD6PY2Y
 wY69yidiA6GL1/vHOSyiTBofT7jeniCt44QbxS3fXNpSXEiGD2d1pJ4lSwg0Mkyp
 E+JcAnmp6rVUvQV0Kx+djBvaFBNQ1tT84UqLqdGBTpx4DqG+zGTw3tOgRPh4jAZt
 mDmtF8TKR9DhjzxnkeX66tfErxdGNZEHrNNeHSM/3ds1IMn09d1pxFkE+y5lWhd3
 d3FJeONt6CJG+k7iPXGWScvvo83DoIfvjsDx3S4vJIvQxxRuKDwp3pR34BQYvbKO
 nSnaDBZ1okLaQEg0GYt6BlqWZHcEdKEiR870dBTDmzlGwIY33m4G2mx9uZibR6dt
 pcsel4e2q3Js1tZob0MXwtbrR7pl/4TPVpf/ZEiTTprX0egL2SMhxCNwk4DHDMyv
 JszdjdxC+SJgpsaBRYcdGEzb8Fd+FbDIVWxAei8uaQUUSK40j7kPoSdx4w17mHQl
 s7Mmp/12miO0/eGgKmI+cJjXhRzCxu8HG6ovzlBWLdfQKmPYtk+Hm38HXz2fz4P5
 pWKHgwsFXHtAZ0pQ9VbOVmctCehbuAS3nef2rZsWfA3x65Z+O4GIwrUDtfMCsiXK
 OuDgcDysqhPMCKbmdSEw
 =kxaL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "Miscellaneous improvements:
   - clean up debugfs globals
   - remove dead code in sysfs
   - reorganize duplicated sysfs attribute structs
   - consolidate sysfs show and store functions
   - remove duplicated sysfs_ops structures
   - describe organization of sysfs
   - make devreq_mutex static
   - g_orangefs_stats -> orangefs_stats for consistency
   - rename most remaining global variables

  Feature negotiation:
   - enable Orangefs userspace and kernel module to negotiate mutually
     supported features"

* tag 'for-linus-4.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  Revert "orangefs: bump minimum userspace version"
  orangefs: bump minimum userspace version
  orangefs: rename most remaining global variables
  orangefs: g_orangefs_stats -> orangefs_stats for consistency
  orangefs: make devreq_mutex static
  orangefs: describe organization of sysfs
  orangefs: remove duplicated sysfs_ops structures
  orangefs: consolidate sysfs show and store functions
  orangefs: reorganize duplicated sysfs attribute structs
  orangefs: remove dead code in sysfs
  orangefs: clean up debugfs globals
  orangefs: do not allow client readahead cache without feature bit
  orangefs: add features op
  orangefs: record userspace version for feature compatbility
  orangefs: add readahead count and size to sysfs
  orangefs: re-add flush_racache from out-of-tree
  orangefs: turn param response value into union
  orangefs: add missing param request ops
  orangefs: rename remaining bits of mmap readahead cache
2016-10-06 13:33:35 -07:00
Linus Torvalds 14986a34e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This set of changes is a number of smaller things that have been
  overlooked in other development cycles focused on more fundamental
  change. The devpts changes are small things that were a distraction
  until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
  trivial regression fix to autofs for the unprivileged mount changes
  that went in last cycle. A pair of ioctls has been added by Andrey
  Vagin making it is possible to discover the relationships between
  namespaces when referring to them through file descriptors.

  The big user visible change is starting to add simple resource limits
  to catch programs that misbehave. With namespaces in general and user
  namespaces in particular allowing users to use more kinds of
  resources, it has become important to have something to limit errant
  programs. Because the purpose of these limits is to catch errant
  programs the code needs to be inexpensive to use as it always on, and
  the default limits need to be high enough that well behaved programs
  on well behaved systems don't encounter them.

  To this end, after some review I have implemented per user per user
  namespace limits, and use them to limit the number of namespaces. The
  limits being per user mean that one user can not exhause the limits of
  another user. The limits being per user namespace allow contexts where
  the limit is 0 and security conscious folks can remove from their
  threat anlysis the code used to manage namespaces (as they have
  historically done as it root only). At the same time the limits being
  per user namespace allow other parts of the system to use namespaces.

  Namespaces are increasingly being used in application sand boxing
  scenarios so an all or nothing disable for the entire system for the
  security conscious folks makes increasing use of these sandboxes
  impossible.

  There is also added a limit on the maximum number of mounts present in
  a single mount namespace. It is nontrivial to guess what a reasonable
  system wide limit on the number of mount structure in the kernel would
  be, especially as it various based on how a system is using
  containers. A limit on the number of mounts in a mount namespace
  however is much easier to understand and set. In most cases in
  practice only about 1000 mounts are used. Given that some autofs
  scenarious have the potential to be 30,000 to 50,000 mounts I have set
  the default limit for the number of mounts at 100,000 which is well
  above every known set of users but low enough that the mount hash
  tables don't degrade unreaonsably.

  These limits are a start. I expect this estabilishes a pattern that
  other limits for resources that namespaces use will follow. There has
  been interest in making inotify event limits per user per user
  namespace as well as interest expressed in making details about what
  is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
  autofs:  Fix automounts by using current_real_cred()->uid
  mnt: Add a per mount namespace limit on the number of mounts
  netns: move {inc,dec}_net_namespaces into #ifdef
  nsfs: Simplify __ns_get_path
  tools/testing: add a test to check nsfs ioctl-s
  nsfs: add ioctl to get a parent namespace
  nsfs: add ioctl to get an owning user namespace for ns file descriptor
  kernel: add a helper to get an owning user namespace for a namespace
  devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
  devpts: Remove sync_filesystems
  devpts: Make devpts_kill_sb safe if fsi is NULL
  devpts: Simplify devpts_mount by using mount_nodev
  devpts: Move the creation of /dev/pts/ptmx into fill_super
  devpts: Move parse_mount_options into fill_super
  userns: When the per user per user namespace limit is reached return ENOSPC
  userns; Document per user per user namespace limits.
  mntns: Add a limit on the number of mount namespaces.
  netns: Add a limit on the number of net namespaces
  cgroupns: Add a limit on the number of cgroup namespaces
  ipcns: Add a  limit on the number of ipc namespaces
  ...
2016-10-06 09:52:23 -07:00
Linus Torvalds 8d37059581 xfs: updates for 4.9-rc1
Included in this update:
 - change of XFS mailing list to linux-xfs@vger.kernel.org
 - iomap-based DAX infrastructure w/ XFS and ext2 support
 - small iomap fixes and additions
 - more efficient XFS delayed allocation infrastructure based on iomap
 - a rework of log recovery writeback scheduling to ensure we don't
   fail recovery when trying to replay items that are already on disk
 - some preparation patches for upcoming reflink support
 - configurable error handling fixes and documentation
 - aio access time update race fixes for XFS and generic_file_read_iter
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX9WvjAAoJEK3oKUf0dfodrl8P/R1cS8tEHnrmNlKeENNWFTlN
 q8HEfP3tX43QLHXpeHd9F9qXs5/esrOFfWYFjeoAaB1cWiRXDJsUNOEH3PuQf0Go
 NKHgrL8GiU6XY9keZI6KJYphr2a5//qWJywxOeBuJh3446MDSYwOmI3eEIY8ac3/
 k0e8bMnLhfryWOvyZE6v2w75lMi+SL1LH/W6OSJqGFKS3N+GqdqRKkMfYGQToHkM
 ZgIX1vDSq4xgJzkR1Q+AACCaSTGE2wEG/bnqZ1R3l19/bERB17LaOyEegBDXbrTT
 vI31EQnrN92O/Q2eYJlap8nFIm4lVaCFTU1R7KEVEXvUBRXXfxllu1sOSBpn1PSQ
 OrC5bbcCodcG8b1SlwRrcstqc42weojqwyl65eJxOa17valghaYEcLkqEZrrrssv
 Y+C0okfL3UB2JAxG4O1nFQ3py1cYlkYURf6CuhxNQfktXZxSpAMTLy9wYCRylBiO
 Eu6Say4zfnfKiVaSg0xlMhIaAyugVH+uVro62hZYxCU2mJ/biZHeQAUC6Krl6NsY
 NsAk0T7eUgMd7lLW+C9/rL2AQaXYwR72cl/1jAWBE2piBM2Gu1lcGHGwWHvOcYjO
 K2Yg4RMnR9TDbUX2jl1r4bZoQD3IZ3HpUjgVInmbTPtKY4q89kfC40haSpBQykm7
 QzGLPvFz2sMrkmKPLbV2
 =R9uL
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs and iomap updates from Dave Chinner:
 "The main things in this update are the iomap-based DAX infrastructure,
  an XFS delalloc rework, and a chunk of fixes to how log recovery
  schedules writeback to prevent spurious corruption detections when
  recovery of certain items was not required.

  The other main chunk of code is some preparation for the upcoming
  reflink functionality. Most of it is generic and cleanups that stand
  alone, but they were ready and reviewed so are in this pull request.

  Speaking of reflink, I'm currently planning to send you another pull
  request next week containing all the new reflink functionality. I'm
  working through a similar process to the last cycle, where I sent the
  reverse mapping code in a separate request because of how large it
  was. The reflink code merge is even bigger than reverse mapping, so
  I'll be doing the same thing again....

  Summary for this update:

   - change of XFS mailing list to linux-xfs@vger.kernel.org

   - iomap-based DAX infrastructure w/ XFS and ext2 support

   - small iomap fixes and additions

   - more efficient XFS delayed allocation infrastructure based on iomap

   - a rework of log recovery writeback scheduling to ensure we don't
     fail recovery when trying to replay items that are already on disk

   - some preparation patches for upcoming reflink support

   - configurable error handling fixes and documentation

   - aio access time update race fixes for XFS and
     generic_file_read_iter"

* tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits)
  fs: update atime before I/O in generic_file_read_iter
  xfs: update atime before I/O in xfs_file_dio_aio_read
  ext2: fix possible integer truncation in ext2_iomap_begin
  xfs: log recovery tracepoints to track current lsn and buffer submission
  xfs: update metadata LSN in buffers during log recovery
  xfs: don't warn on buffers not being recovered due to LSN
  xfs: pass current lsn to log recovery buffer validation
  xfs: rework log recovery to submit buffers on LSN boundaries
  xfs: quiesce the filesystem after recovery on readonly mount
  xfs: remote attribute blocks aren't really userdata
  ext2: use iomap to implement DAX
  ext2: stop passing buffer_head to ext2_get_blocks
  xfs: use iomap to implement DAX
  xfs: refactor xfs_setfilesize
  xfs: take the ilock shared if possible in xfs_file_iomap_begin
  xfs: fix locking for DAX writes
  dax: provide an iomap based fault handler
  dax: provide an iomap based dax read/write path
  dax: don't pass buffer_head to copy_user_dax
  dax: don't pass buffer_head to dax_insert_mapping
  ...
2016-10-06 08:18:10 -07:00
Linus Torvalds 82fa407da0 Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:

 - Correct ARMs dma-mapping to use the correct printk format strings.

 - Avoid defining OBJCOPYFLAGS globally which upsets lkdtm rodata
   testing.

 - Cleanups to ARMs asm/memory.h include.

 - L2 cache cleanups.

 - Allow flat nommu binaries to be executed on ARM MMU systems.

 - Kernel hardening - add more read-only after init annotations,
   including making some kernel vdso variables const.

 - Ensure AMBA primecell clocks are appropriately defaulted.

 - ARM breakpoint cleanup.

 - Various StrongARM 11x0 and companion chip (SA1111) updates to bring
   this legacy platform to use more modern APIs for (eg) GPIOs and
   interrupts, which will allow us in the future to reduce some of the
   board-level driver clutter and elimate function callbacks into board
   code via platform data. There still appears to be interest in these
   platforms!

 - Remove the now redundant secure_flush_area() API.

 - Module PLT relocation optimisations. Ard says: This series of 4
   patches optimizes the ARM PLT generation code that is invoked at
   module load time, to get rid of the O(n^2) algorithm that results in
   pathological load times of 10 seconds or more for large modules on
   certain STB platforms.

 - ARMv7M cache maintanence support.

 - L2 cache PMU support

* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: (35 commits)
  ARM: sa1111: provide to_sa1111_device() macro
  ARM: sa1111: add sa1111_get_irq()
  ARM: sa1111: clean up duplication in IRQ chip implementation
  ARM: sa1111: implement a gpio_chip for SA1111 GPIOs
  ARM: sa1111: move irq cleanup to separate function
  ARM: sa1111: use devm_clk_get()
  ARM: sa1111: use devm_kzalloc()
  ARM: sa1111: ensure we only touch RAB bus type devices when removing
  ARM: 8611/1: l2x0: add PMU support
  ARM: 8610/1: V7M: Add dsb before jumping in handler mode
  ARM: 8609/1: V7M: Add support for the Cortex-M7 processor
  ARM: 8608/1: V7M: Indirect proc_info construction for V7M CPUs
  ARM: 8607/1: V7M: Wire up caches for V7M processors with cache support.
  ARM: 8606/1: V7M: introduce cache operations
  ARM: 8605/1: V7M: fix notrace variant of save_and_disable_irqs
  ARM: 8604/1: V7M: Add support for reading the CTR with read_cpuid_cachetype()
  ARM: 8603/1: V7M: Add addresses for mem-mapped V7M cache operations
  ARM: 8602/1: factor out CSSELR/CCSIDR operations that use cp15 directly
  ARM: kernel: avoid brute force search on PLT generation
  ARM: kernel: sort relocation sections before allocating PLTs
  ...
2016-10-06 07:59:37 -07:00
NeilBrown 09bb8bfffd exportfs: be careful to only return expected errors.
When nfsd calls fh_to_dentry, it expect ESTALE or ENOMEM as errors.
In particular it can be tempting to return ENOENT, but this is not
handled well by nfsd.

Rather than requiring strict adherence to error code code filesystems,
treat all unexpected error codes the same as ESTALE.  This is safest.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-06 09:07:44 -04:00
Russell King 301a36fa70 Merge branches 'misc' and 'sa1111-base' into for-linus 2016-10-06 08:56:43 +01:00
David Howells 9008f998a2 afs: Check for fatal error when in waiting for ack state
When it's in the waiting-for-ACK state, the AFS filesystem needs to check
the result of rxrpc_kernel_recv_data() any time it is notified to see if it
is indicating a fatal error.  If this is the case, it needs to mark the
call completed otherwise the call just sits there and never goes away.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-10-06 08:11:50 +01:00
Darrick J. Wong 1f08af52e7 xfs: implement swapext for rmap filesystems
Implement swapext for filesystems that have reverse mapping.  Back in
the reflink patches, we augmented the bmap code with a 'REMAP' flag
that updates only the bmbt and doesn't touch the allocator and
implemented log redo items for those two operations.  Now we can
rewrite extent swapping as a (looong) series of remap operations.

This is far less efficient than the fork swapping method implemented
in the past, so we only switch this on for rmap.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:32 -07:00
Darrick J. Wong 39aff5fdb9 xfs: refactor swapext code
Refactor the swapext function to pull out the fork swapping piece
into a separate function.  In the next patch we'll add in the bit
we need to make it work with rmap filesystems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:32 -07:00
Darrick J. Wong e06259aa08 xfs: various swapext cleanups
Replace structure typedefs with struct expressions and fix some
whitespace issues that result.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:32 -07:00
Darrick J. Wong e54b5bf9d7 xfs: recognize the reflink feature bit
Add the reflink feature flag to the set of recognized feature flags.
This enables users to write to reflink filesystems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:31 -07:00
Darrick J. Wong a35eb41519 xfs: simulate per-AG reservations being critically low
Create an error injection point that enables us to simulate being
critically low on per-AG block reservations.  This should enable us to
simulate this specific ENOSPC condition so that we can test falling back
to a regular file copy.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:31 -07:00
Darrick J. Wong 4f435ebe7d xfs: don't mix reflink and DAX mode for now
Since we don't have a strategy for handling both DAX and reflink,
for now we'll just prohibit both being set at the same time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:31 -07:00
Darrick J. Wong c8e156ac33 xfs: check for invalid inode reflink flags
We don't support sharing blocks on the realtime device.  Flag inodes
with the reflink or cowextsize flags set when the reflink feature is
disabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:31 -07:00
Darrick J. Wong e153aa7990 xfs: set a default CoW extent size of 32 blocks
If the admin doesn't set a CoW extent size or a regular extent size
hint, default to creating CoW reservations 32 blocks long to reduce
fragmentation.

Signed-off-by: DarricK J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:31 -07:00
Darrick J. Wong 3f165b334e xfs: convert unwritten status of reverse mappings for shared files
Provide a function to convert an unwritten extent to a real one and
vice versa when shared extents are possible.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:29 -07:00
Darrick J. Wong ceeb9c832e xfs: use interval query for rmap alloc operations on shared files
When it's possible for reverse mappings to overlap (data fork extents
of files on reflink filesystems), use the interval query function to
find the left neighbor of an extent we're trying to add; and be
careful to use the lookup functions to update the neighbors and/or
add new extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:29 -07:00
Darrick J. Wong 0e07c039ba xfs: add shared rmap map/unmap/convert log item types
Wire up some rmap log redo item type codes to map, unmap, or convert
shared data block extents.  The actual log item recovery comes in a
later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:29 -07:00
Darrick J. Wong 80de462e09 xfs: increase log reservations for reflink
Increase the log reservations to handle the increased rolling that
happens at the end of copy-on-write operations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:29 -07:00
Darrick J. Wong 83104d449e xfs: garbage collect old cowextsz reservations
Trim CoW reservations made on behalf of a cowextsz hint if they get too
old or we run low on quota, so long as we don't have dirty data awaiting
writeback or directio operations in progress.

Garbage collection of the cowextsize extents are kept separate from
prealloc extent reaping because setting the CoW prealloc lifetime to a
(much) higher value than the regular prealloc extent lifetime has been
useful for combatting CoW fragmentation on VM hosts where the VMs
experience bursty write behaviors and we can keep the utilization ratios
low enough that we don't start to run out of space.  IOWs, it benefits
us to keep the CoW fork reservations around for as long as we can unless
we run out of blocks or hit inode reclaim.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:28 -07:00
Darrick J. Wong 90e2056d76 xfs: try other AGs to allocate a BMBT block
Prior to the introduction of reflink, allocating a block and mapping
it into a file was performed in a single transaction with a single
block reservation, and the allocator was supposed to find enough
blocks to allocate the extent and any BMBT blocks that might be
necessary (unless we're low on space).

However, due to the way copy on write works, allocation and mapping
have been split into two transactions, which means that we must be
able to handle the case where we allocate an extent for CoW but that
AG runs out of free space before the blocks can be mapped into a file,
and the mapping requires a new BMBT block.  When this happens, look in
one of the other AGs for a BMBT block instead of taking the FS down.

The same applies to the functions that convert a data fork to extents
and later btree format.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:28 -07:00
Darrick J. Wong 6fa164b865 xfs: don't allow reflink when the AG is low on space
If the AG free space is down to the reserves, refuse to reflink our
way out of space.  Hopefully userspace will make a real copy and/or go
elsewhere.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:27 -07:00
Darrick J. Wong 84d6961910 xfs: preallocate blocks for worst-case btree expansion
To gracefully handle the situation where a CoW operation turns a
single refcount extent into a lot of tiny ones and then run out of
space when a tree split has to happen, use the per-AG reserved block
pool to pre-allocate all the space we'll ever need for a maximal
btree.  For a 4K block size, this only costs an overhead of 0.3% of
available disk space.

When reflink is enabled, we have an unfortunate problem with rmap --
since we can share a block billions of times, this means that the
reverse mapping btree can expand basically infinitely.  When an AG is
so full that there are no free blocks with which to expand the rmapbt,
the filesystem will shut down hard.

This is rather annoying to the user, so use the AG reservation code to
reserve a "reasonable" amount of space for rmap.  We'll prevent
reflinks and CoW operations if we think we're getting close to
exhausting an AG's free space rather than shutting down, but this
permanent reservation should be enough for "most" users.  Hopefully.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch@lst.de: ensure that we invalidate the freed btree buffer]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:27 -07:00
Darrick J. Wong f7ca352272 xfs: create a separate cow extent size hint for the allocator
Create a per-inode extent size allocator hint for copy-on-write.  This
hint is separate from the existing extent size hint so that CoW can
take advantage of the fragmentation-reducing properties of extent size
hints without disabling delalloc for regular writes.

The extent size hint that's fed to the allocator during a copy on
write operation is the greater of the cowextsize and regular extsize
hint.

During reflink, if we're sharing the entire source file to the entire
destination file and the destination file doesn't already have a
cowextsize hint, propagate the source file's cowextsize hint to the
destination file.

Furthermore, zero the bulkstat buffer prior to setting the fields
so that we don't copy kernel memory contents into userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:26 -07:00
Darrick J. Wong 98cc2db5b8 xfs: unshare a range of blocks via fallocate
Unshare all shared extents if the user calls fallocate with the new
unshare mode flag set, so that we can guarantee that a subsequent
write will not ENOSPC.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: pass inode instead of file to xfs_reflink_dirty_range,
      use iomap infrastructure for copy up]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:26 -07:00
Darrick J. Wong f0bc4d134b xfs: swap inode reflink flags when swapping inode extents
When we're swapping the extents of two inodes, be sure to swap the
reflink inode flag too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:26 -07:00
Darrick J. Wong f86f403794 xfs: teach get_bmapx about shared extents and the CoW fork
Teach xfs_getbmapx how to report shared extents and CoW fork contents
accurately in the bmap output by querying the refcount btree
appropriately.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:26 -07:00
Darrick J. Wong cc714660bb xfs: add dedupe range vfs function
Define a VFS function which allows userspace to request that the
kernel reflink a range of blocks between two files if the ranges'
contents match.  The function fits the new VFS ioctl that standardizes
the checking for the btrfs EXTENT SAME ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:26 -07:00
Darrick J. Wong 9fe26045e9 xfs: add clone file and clone range vfs functions
Define two VFS functions which allow userspace to reflink a range of
blocks between two files or to reflink one file's contents to another.
These functions fit the new VFS ioctls that standardize the checking
for the btrfs CLONE and CLONE RANGE ioctls.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:25 -07:00
Darrick J. Wong 862bb360ef xfs: reflink extents from one file to another
Reflink extents from one file to another; that is to say, iteratively
remove the mappings from the destination file, copy the mappings from
the source file to the destination file, and increment the reference
count of all the blocks that got remapped.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:05 -07:00
Darrick J. Wong 174edb0e46 xfs: store in-progress CoW allocations in the refcount btree
Due to the way the CoW algorithm in XFS works, there's an interval
during which blocks allocated to handle a CoW can be lost -- if the FS
goes down after the blocks are allocated but before the block
remapping takes place.  This is exacerbated by the cowextsz hint --
allocated reservations can sit around for a while, waiting to get
used.

Since the refcount btree doesn't normally store records with refcount
of 1, we can use it to record these in-progress extents.  In-progress
blocks cannot be shared because they're not user-visible, so there
shouldn't be any conflicts with other programs.  This is a better
solution than holding EFIs during writeback because (a) EFIs can't be
relogged currently, (b) even if they could, EFIs are bound by
available log space, which puts an unnecessary upper bound on how much
CoW we can have in flight, and (c) we already have a mechanism to
track blocks.

At mount time, read the refcount records and free anything we find
with a refcount of 1 because those were in-progress when the FS went
down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:05 -07:00
Darrick J. Wong 5e7e605c4d xfs: cancel pending CoW reservations when destroying inodes
When destroying the inode, cancel all pending reservations in the CoW
fork so that all the reserved blocks go back to the free pile.  In
theory this sort of cleanup is only needed to clean up after write
errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:05 -07:00
Darrick J. Wong aa8968f227 xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks
When we're freeing blocks (truncate, punch, etc.), clear all CoW
reservations in the range being freed.  If the file block count
drops to zero, also clear the inode reflink flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:04 -07:00
Darrick J. Wong 0613f16cd2 xfs: implement CoW for directio writes
For O_DIRECT writes to shared blocks, we have to CoW them just like
we would with buffered writes.  For writes that are not block-aligned,
just bounce them to the page cache.

For block-aligned writes, however, we can do better than that.  Use
the same mechanisms that we employ for buffered CoW to set up a
delalloc reservation, allocate all the blocks at once, issue the
writes against the new blocks and use the same ioend functions to
remap the blocks after the write.  This should be fairly performant.

Christoph discovered that xfs_reflink_allocate_cow_range may stumble
over invalid entries in the extent array given that it drops the ilock
but still expects the index to be stable.  Simple fixing it to a new
lookup for every iteration still isn't correct given that
xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
there is nothing preventing a xfs_bunmapi_cow call removing extents
once we dropped the ilock either.

This patch duplicates the inner loop of xfs_bmapi_allocate into a
helper for xfs_reflink_allocate_cow_range so that it can be done under
the same ilock critical section as our CoW fork delayed allocation.
The directio CoW warts will be revisited in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:04 -07:00
Darrick J. Wong db1327b16c xfs: report shared extent mappings to userspace correctly
Report shared extents through the iomap interface so that FIEMAP flags
shared blocks accurately.  Have xfs_vm_bmap return zero for reflinked
files because the bmap-based swap code requires static block mappings,
which is incompatible with copy on write.

NOTE: Existing userspace bmap users such as lilo will have the same
problem with reflink files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-10-05 16:26:04 -07:00
Al Viro c531716785 proc: switch auxv to use of __mem_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:43:43 -04:00
Mikulas Patocka 91fff9b347 hpfs: support FIEMAP
Support the FIEMAP ioctl that reports extents allocated by a file.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:31:58 -04:00
Miklos Szeredi ca76f5b6bd pipe: add pipe_buf_steal() helper
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:59 -04:00
Miklos Szeredi fba597db42 pipe: add pipe_buf_confirm() helper
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:59 -04:00
Miklos Szeredi a779638cf6 pipe: add pipe_buf_release() helper
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:58 -04:00
Miklos Szeredi 7bf2d1df80 pipe: add pipe_buf_get() helper
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:57 -04:00
Al Viro 523ac9afc7 switch default_file_splice_read() to use of pipe-backed iov_iter
we only use iov_iter_get_pages_alloc() and iov_iter_advance() -
pages are filled by kernel_readv() via a kvec array (as we used
to do all along), so iov_iter here is used only as a way of
arranging for those pages to be in pipe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:56 -04:00
Al Viro 82c156f853 switch generic_file_splice_read() to use of ->read_iter()
... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:56 -04:00
Al Viro 241699cd72 new iov_iter flavour: pipe-backed
iov_iter variant for passing data into pipe.  copy_to_iter()
copies data into page(s) it has allocated and stuffs them into
the pipe; copy_page_to_iter() stuffs there a reference to the
page given to it.  Both will try to coalesce if possible.
iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
and friends will do as copy_to_iter() would have and return the
pages where the data would've been copied.  iov_iter_advance()
will truncate everything past the spot it has advanced to.

New primitive: iov_iter_pipe(), used for initializing those.
pipe should be locked all along.

Running out of space acts as fault would for iovec-backed ones;
in other words, giving it to ->read_iter() may result in short
read if the pipe overflows, or -EFAULT if it happens with nothing
copied there.

In other words, ->read_iter() on those acts pretty much like
->splice_read().  Moreover, all generic_file_splice_read() users,
as well as many other ->splice_read() instances can be switched
to that scheme - that'll happen in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:36 -04:00
Darrick J. Wong 43caeb187d xfs: move mappings from cow fork to data fork after copy-write
After the write component of a copy-write operation finishes, clean up
the bookkeeping left behind.  On error, we simply free the new blocks
and pass the error up.  If we succeed, however, then we must remove
the old data fork mapping and move the cow fork mapping to the data
fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: Call the CoW failure function during xfs_cancel_ioend]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05 13:55:40 -07:00
Darrick J. Wong 4862cfe825 xfs: support removing extents from CoW fork
Create a helper method to remove extents from the CoW fork without
any of the side effects (rmapbt/bmbt updates) of the regular extent
deletion routine.  We'll eventually use this to clear out the CoW fork
during ioend processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05 13:55:40 -07:00
Pierre Morel 997198ba1e fs/block_dev.c: return the right error in thaw_bdev()
When triggering thaw-filesystems via magic sysrq, the system enters a
loop in do_thaw_one(), as thaw_bdev() still returns success if
bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
error (and simplify the code a bit at the same time).

Reviewed-by: Eric Farman <farman@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-05 14:35:13 -06:00
Linus Torvalds edadd0e5a7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "This adds POSIX ACL permission checking to the fuse kernel module.

  In addition there are minor bug fixes as well as cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: limit xattr returned size
  fuse: remove duplicate cs->offset assignment
  fuse: don't use fuse_ioctl_copy_user() helper
  fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter()
  fuse: get rid of fc->flags
  fuse: use timespec64
  fuse: don't use ->d_time
  fuse: Add posix ACL support
  fuse: handle killpriv in userspace fs
  fuse: fix killing s[ug]id in setattr
  fuse: invalidate dir dentry after chmod
  fuse: Use generic xattr ops
  fuse: listxattr: verify xattr list
2016-10-05 10:58:15 -07:00
Linus Torvalds 3fb75cb80d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc filesystem and quota fixes from Jan Kara:
 "Some smaller udf, ext2, quota & reiserfs fixes"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: Unmap metadata when zeroing blocks
  udf: don't bother with full-page write optimisations in adinicb case
  reiserfs: Unlock superblock before calling reiserfs_quota_on_mount()
  udf: Remove useless check in udf_adinicb_write_begin()
  quota: fill in Q_XGETQSTAT inode information for inactive quotas
  ext2: Check return value from ext2_get_group_desc()
2016-10-05 10:53:03 -07:00
Linus Torvalds 687ee0ad4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
    co. at Google. https://lwn.net/Articles/701165/

 2) Do TCP Small Queues for retransmits, from Eric Dumazet.

 3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
    Starovoitov.

 4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

 5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

 6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.

 7) Support ndo_poll_controller in mlx5, from Calvin Owens.

 8) Move VRF processing to an output hook and allow l3mdev to be
    loopback, from David Ahern.

 9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

10) Congestion control in RXRPC, from David Howells.

11) Support geneve RX offload in ixgbe, from Emil Tantilov.

12) When hitting pressure for new incoming TCP data SKBs, perform a
    partial rathern than a full purge of the OFO queue (which could be
    huge). From Eric Dumazet.

13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

14) Support RX network flow classification to igb, from Gangfeng Huang.

15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

16) New skbmod packet action, from Jamal Hadi Salim.

17) Remove some inefficiencies in snmp proc output, from Jia He.

18) Add FIB notifications to properly propagate route changes to
    hardware which is doing forwarding offloading. From Jiri Pirko.

19) New dsa driver for qca8xxx chips, from John Crispin.

20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
    Żenczykowski.

21) Add L3 mode to ipvlan, from Mahesh Bandewar.

22) Support 802.1ad in mlx4, from Moshe Shemesh.

23) Support hardware LRO in mediatek driver, from Nelson Chang.

24) Add TC offloading to mlx5, from Or Gerlitz.

25) Convert various drivers to ethtool ksettings interfaces, from
    Philippe Reynes.

26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

27) NAPI support for ath10k, from Rajkumar Manoharan.

28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

29) UDP replicast support in TIPC, from Richard Alpe.

30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

31) Support BQL in thunderx driver, from Sunil Goutham.

32) TSO support in alx driver, from Tobias Regnery.

33) Add stream parser engine and use it in kcm.

34) Support async DHCP replies in ipconfig module, from Uwe
    Kleine-König.

35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
  mlxsw: switchx2: Fix misuse of hard_header_len
  mlxsw: spectrum: Fix misuse of hard_header_len
  net/faraday: Stop NCSI device on shutdown
  net/ncsi: Introduce ncsi_stop_dev()
  net/ncsi: Rework the channel monitoring
  net/ncsi: Allow to extend NCSI request properties
  net/ncsi: Rework request index allocation
  net/ncsi: Don't probe on the reserved channel ID (0x1f)
  net/ncsi: Introduce NCSI_RESERVED_CHANNEL
  net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
  net: Add netdev all_adj_list refcnt propagation to fix panic
  net: phy: Add Edge-rate driver for Microsemi PHYs.
  vmxnet3: Wake queue from reset work
  i40e: avoid NULL pointer dereference and recursive errors on early PCI error
  qed: Add RoCE ll2 & GSI support
  qed: Add support for memory registeration verbs
  qed: Add support for QP verbs
  qed: PD,PKEY and CQ verb support
  qed: Add support for RoCE hw init
  qede: Add qedr framework
  ...
2016-10-05 10:11:24 -07:00
Darrick J. Wong ef4736678f xfs: allocate delayed extents in CoW fork
Modify the writepage handler to find and convert pending delalloc
extents to real allocations.  Furthermore, when we're doing non-cow
writes to a part of a file that already has a CoW reservation (the
cowextsz hint that we set up in a subsequent patch facilitates this),
promote the write to copy-on-write so that the entire extent can get
written out as a single extent on disk, thereby reducing post-CoW
fragmentation.

Christoph moved the CoW support code in _map_blocks to a separate helper
function, refactored other functions, and reduced the number of CoW fork
lookups, so I merged those changes here to reduce churn.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:41 -07:00
Darrick J. Wong 60b4984fc3 xfs: support allocating delayed extents in CoW fork
Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
allocation extents in the CoW fork to real allocations, and wire this
up all the way back to xfs_iomap_write_allocate().  In a subsequent
patch, we'll modify the writepage handler to call this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:41 -07:00
Darrick J. Wong 2a06705cd5 xfs: create delalloc extents in CoW fork
Wire up iomap_begin to detect shared extents and create delayed allocation
extents in the CoW fork:

 1) Check if we already have an extent in the COW fork for the area.
    If so nothing to do, we can move along.
 2) Look up block number for the current extent, and if there is none
    it's not shared move along.
 3) Unshare the current extent as far as we are going to write into it.
    For this we avoid an additional COW fork lookup and use the
    information we set aside in step 1) above.
 4) Goto 1) unless we've covered the whole range.

Last but not least, this updates the xfs_reflink_reserve_cow_range calling
convention to pass a byte offset and length, as that is what both callers
expect anyway.  This patch has been refactored considerably as part of the
iomap transition.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:40 -07:00
Darrick J. Wong be51f8119c xfs: support bmapping delalloc extents in the CoW fork
Allow the creation of delayed allocation extents in the CoW fork.  In
a subsequent patch we'll wire up iomap_begin to actually do this via
reflink helper functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:40 -07:00
Darrick J. Wong 3993baeb3c xfs: introduce the CoW fork
Introduce a new in-core fork for storing copy-on-write delalloc
reservations and allocated extents that are in the process of being
written out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:40 -07:00
Darrick J. Wong 11715a21bc xfs: don't allow reflinked dir/dev/fifo/socket/pipe files
Only non-rt files can be reflinked, so check that when we load an
inode.  Also, don't leak the attr fork if there's a failure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:40 -07:00
Darrick J. Wong f0ec1b8ef1 xfs: add reflink feature flag to geometry
Report the reflink feature in the XFS geometry so that xfs_info and
friends know the filesystem has this feature.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:40 -07:00
Darrick J. Wong 53aa1c34f4 xfs: define tracepoints for reflink activities
Define all the tracepoints we need to inspect the runtime operation
of reflink/dedupe/copy-on-write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:39 -07:00
Darrick J. Wong 4453593be6 xfs: return work remaining at the end of a bunmapi operation
Return the range of file blocks that bunmapi didn't free.  This hint
is used by CoW and reflink to figure out what part of an extent
actually got freed so that it can set up the appropriate atomic
remapping of just the freed range.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:39 -07:00
Linus Torvalds a3443cda55 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:

  SELinux/LSM:
   - overlayfs support, necessary for container filesystems

  LSM:
   - finally remove the kernel_module_from_file hook

  Smack:
   - treat signal delivery as an 'append' operation

  TPM:
   - lots of bugfixes & updates

  Audit:
   - new audit data type: LSM_AUDIT_DATA_FILE

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (47 commits)
  Revert "tpm/tpm_crb: implement tpm crb idle state"
  Revert "tmp/tpm_crb: fix Intel PTT hw bug during idle state"
  Revert "tpm/tpm_crb: open code the crb_init into acpi_add"
  Revert "tmp/tpm_crb: implement runtime pm for tpm_crb"
  lsm,audit,selinux: Introduce a new audit data type LSM_AUDIT_DATA_FILE
  tmp/tpm_crb: implement runtime pm for tpm_crb
  tpm/tpm_crb: open code the crb_init into acpi_add
  tmp/tpm_crb: fix Intel PTT hw bug during idle state
  tpm/tpm_crb: implement tpm crb idle state
  tpm: add check for minimum buffer size in tpm_transmit()
  tpm: constify TPM 1.x header structures
  tpm/tpm_crb: fix the over 80 characters checkpatch warring
  tpm/tpm_crb: drop useless cpu_to_le32 when writing to registers
  tpm/tpm_crb: cache cmd_size register value.
  tmp/tpm_crb: drop include to platform_device
  tpm/tpm_tis: remove unused itpm variable
  tpm_crb: fix incorrect values of cmdReady and goIdle bits
  tpm_crb: refine the naming of constants
  tpm_crb: remove wmb()'s
  tpm_crb: fix crb_req_canceled behavior
  ...
2016-10-04 14:48:27 -07:00
Linus Torvalds 2105b9ff73 Minor jfs updates
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJX8//uAAoJEDaohF61QIxkBi8QAKKHQVPK+QAcoSYf3fqV7PKo
 +j83RnBOcPAo8Rhtycb5+azz1Or1IWmsZETbhSo8+4GZJgV8E0XZbgC5Cj630Lg8
 ltmi31GfHD959kzIukDJKbPBsBROgOCk0k8Gry3/tWFdQRblPreoJkR0c3FeD6kJ
 AY7GrITRuxZQFde7pRM5mXmgO2CO6ERaXQit+BeG+cdMXpeoC3PHQvs8LQphV/ah
 Ybn6oJJnO/fP2lzbNoe8aN+owuaJbA2EasjCtZpuhRAUAsBpSGDy+nGlkBCg8MAZ
 DLQzLOYAafCyoXu5GuqStUjRJAtz7GWcL+QWYcHKpNZgztYyTQhzDbmDA3pdWffG
 CZUqYk6PR9+3dIa2wE8UJqRQ4YmggFhBC1zcHkOarFyzuTNMje0bOP+5BTJ5bJgB
 j2R/R8b3vn9PY0tobeV6Mju1ArXJHxde3mEvJ3RsOMPrwlGTUd3te9ANAu/T2MJG
 5s1msjbY+SKk+605IF2gWrWqbDrvP8MBGkcBAqV/0jW+MhhDuj1c2+r417kKS6tj
 sZb71zoslVJW1y3dxQ0oLb3VWECH7X6GDA5Nz3JrHFuQRjSNHnFulXEdOFxmf7EZ
 y9Ld+YyOpKWqT0ifQlBoyO95IHL9EsvhUO+eNHsLvTDD+Z1W8YdgMxUNX2EdqDow
 8tnw1N7Um+6FyrDm2OER
 =L+vH
 -----END PGP SIGNATURE-----

Merge tag 'jfs-4.9' of git://github.com/kleikamp/linux-shaggy

Pull jfs updates from David Kleikamp:
 "Minor jfs updates"

* tag 'jfs-4.9' of git://github.com/kleikamp/linux-shaggy:
  jfs: Simplify code
  jfs: jump to error_out when filemap_{fdatawait, write_and_wait} fails
2016-10-04 13:45:09 -07:00
Linus Torvalds 5fdf4939dc We've only got six GFS2 patches for this merge window. In patch order:
1. Fabian Frederick submitted a nice cleanup that uses the BIT macro
    rather than bit shifting.
 2. Andreas Gruenbacher contributed a patch that fixes a long-standing
    annoyance whereby GFS2 warned about dirty pages.
 3. Andreas also fixed a problem with the recent extended attribute
    readahead feature.
 4. Chao Yu contributed a patch that checks the return code from function
    register_shrinker and reacts accordingly. Previously, it was not checked.
 5. Andreas Gruenbacher also fixed a problem whereby incore file timestamps
    were forgotten if the file was invalidated. This merely moves the
    assignment inside the inode glock where it belongs.
 6. He also fixed a problem where incore timestamps were not initialized.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJX8oqYAAoJENeLYdPf93o7m5YIAIvBQ4WAmMmNuLT0AkvXIKXW
 ZHXtV5oizSOl+qOrb5x3ANbnZWZ5NnWRP6E0frDf3Y5wk6U4qWAqU0V8BTbdr2E+
 IryOLQ+62CAa4UbHqgQRFCpwkPxEaCsOde7eQh/ppTyBKjP0da7tUvSfPcLrWU+9
 qhYiqAv5qVk38JjFiwhw4zER+dOCPDIg1xkkMPG6fspjM8/CkXR9p4lh73qNJT/j
 NDzyjHSBYK32lkcb5xagjpLjmN/fIm6gXvdk65bD1euqxfUeuSCg6AF8QWkEXkcB
 pbqQVIOWrZixS9HMTqT7w8nNstsBKSrEwQhulZWBZygRAzJJAWu6IaHQ9gZkUsE=
 =1Fjo
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've only got six GFS2 patches for this merge window.  In patch
  order:

   - Fabian Frederick submitted a nice cleanup that uses the BIT macro
     rather than bit shifting.

   - Andreas Gruenbacher contributed a patch that fixes a long-standing
     annoyance whereby GFS2 warned about dirty pages.

   - Andreas also fixed a problem with the recent extended attribute
     readahead feature.

   - Chao Yu contributed a patch that checks the return code from
     function register_shrinker and reacts accordingly. Previously, it
     was not checked.

   - Andreas Gruenbacher also fixed a problem whereby incore file
     timestamps were forgotten if the file was invalidated. This merely
     moves the assignment inside the inode glock where it belongs.

   - Andreas also fixed a problem where incore timestamps were not
     initialized"

* tag 'gfs2-4.8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Initialize atime of I_NEW inodes
  gfs2: Update file times after grabbing glock
  gfs2: fix to detect failure of register_shrinker
  gfs2: Fix extended attribute readahead optimization
  gfs2: Remove dirty buffer warning from gfs2_releasepage
  GFS2: use BIT() macro
2016-10-04 13:42:13 -07:00
Linus Torvalds c35bcfd8e4 File locking related changes for v4.9
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX8EMFAAoJEAAOaEEZVoIV138QALm9BtIpuLeg3m2L7DffC6tk
 uRhu0a+sZhES8n1YF8/Z40KqlGvZ8qlbRv08vYQ1xNGYQ/RMBEdVZUXuOvN1NDSt
 CgU3JSEtBo1Qg8eNkAUwvzfyLsfTazLYf6rus2v2wwrH/1pF8yeU2OZUhv4FhKd2
 EoIczZ5NsWabJLktb4drckD+Xng9WHLKyB5bE7VKXR38cK7HWbuY30wg03JyX/em
 rkfw00rcRhh5JWqyL2NOO7INJSNXyJKBVZ/xeIQYnhj4ZA7aTFN+LgQebPqpfyzw
 g5jVet1ygaI+/8lp3IpB8rrkpmVSbtqLgmbPOvnDltiZOQbBlGOsw84TX/Dxp9VH
 7q04zCmcDWGD1ZMnQmXDPJxQZ8+pYdutfSNait0Q7lYSySqO0+1nSLpMQ2yIrebS
 hSREgj/MyOWewn5todNCh102IpSPUvo0J9mcDijlUBFWmPrK30QDGWrG20Qzb6ON
 olYRxztSX7cs0rNIOSjeRNCiy6E5Eoz8zm22JuDgKd2TGzES0ZoPea++1iqsTKbM
 KZrjGw5oQPkRbOePxoIk8ZP1iGbZyXQgMsPVHe+cuKBhiPqujgRNex4bwGQzKBT0
 O9o1YORl/wN2H04+K+HfsdAIh0cWeSZDiU7F9vPP5RmjVqzMwDc5YbP+KZFF3Nod
 Yu292qD+EcZL25PDt/Da
 =MUd+
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.9-1' of git://git.samba.org/jlayton/linux

Pull file locking updates from Jeff Layton:
 "Only a single patch from Nikolay this cycle, with a small change to
  better handle /proc/locks in a containerized host"

* tag 'locks-v4.9-1' of git://git.samba.org/jlayton/linux:
  locks: Filter /proc/locks output on proc pid ns
2016-10-04 13:36:19 -07:00
Jeff Layton 3f807e5ae5 NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
The caller of rpc_run_task also gets a reference that must be put.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-04 16:30:54 -04:00
Deepa Dinamani 2f86e0919a fs: nfs: Make nfs boot time y2038 safe
boot_time is represented as a struct timespec.
struct timespec and CURRENT_TIME are not y2038 safe.
Overall, the plan is to use timespec64 and ktime_t for
all internal kernel representation of timestamps.
CURRENT_TIME will also be removed.

boot_time is used to construct the nfs client boot verifier.

Use ktime_t to represent boot_time and ktime_get_real() for
the boot_time value.

Following Trond's request https://lkml.org/lkml/2016/6/9/22 ,
use ktime_t instead of converting to struct timespec64.

Use higher and lower 32 bit parts of ktime_t for the boot
verifier.

Use the lower 32 bit part of ktime_t for the authsys_parms
stamp field.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-04 16:20:26 -04:00
Darrick J. Wong 17c12bcd30 xfs: when replaying bmap operations, don't let unlinked inodes get reaped
Log recovery will iget an inode to replay BUI items and iput the inode
when it's done.  Unfortunately, if the inode was unlinked, the iput
will see that i_nlink == 0 and decide to truncate & free the inode,
which prevents us from replaying subsequent BUIs.  We can't skip the
BUIs because we have to replay all the redo items to ensure that
atomic operations complete.

Since unlinked inode recovery will reap the inode anyway, we can
safely introduce a new inode flag to indicate that an inode is in this
'unlinked recovery' state and should not be auto-reaped in the
drop_inode path.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:44 -07:00
Darrick J. Wong 9f3afb57d5 xfs: implement deferred bmbt map/unmap operations
Implement deferred versions of the inode block map/unmap functions.
These will be used in subsequent patches to make reflink operations
atomic.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:44 -07:00
Darrick J. Wong 4847acf868 xfs: pass bmapi flags through to bmap_del_extent
Pass BMAPI_ flags from bunmapi into bmap_del_extent and extend
BMAPI_REMAP (which means "don't touch the allocator or the quota
accounting") to apply to bunmapi as well.  This will be used to
implement the unmap operation, which will be used by swapext.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:44 -07:00
Darrick J. Wong f65306ea52 xfs: map an inode's offset to an exact physical block
Teach the bmap routine to know how to map a range of file blocks to a
specific range of physical blocks, instead of simply allocating fresh
blocks.  This enables reflink to map a file to blocks that are already
in use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:44 -07:00
Darrick J. Wong 77d61fe45e xfs: log bmap intent items
Provide a mechanism for higher levels to create BUI/BUD items, submit
them to the log, and a stub function to deal with recovered BUI items.
These parts will be connected to the rmapbt in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:44 -07:00
Darrick J. Wong 6413a01420 xfs: create bmbt update intent log items
Create bmbt update intent/done log items to record redo information in
the log.  Because we roll transactions multiple times for reflink
operations, we also have to track the status of the metadata updates
that will be recorded in the post-roll transactions in case we crash
before committing the final transaction.  This mechanism enables log
recovery to finish what was already started.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 11:05:43 -07:00
Linus Torvalds e6dce825fb TTY/Serial patches for 4.9-rc1
Here is the big TTY and Serial patch set for 4.9-rc1.
 
 It also includes some drivers/dma/ changes, as those were needed by some
 serial drivers, and they were all acked by the DMA maintainer.  Also in
 here is the long-suffering ACPI SPCR patchset, which was passed around
 from maintainer to maintainer like a hot-potato.  Seems I was the
 sucker^Wlucky one.  All of those patches have been acked by the various
 subsystem maintainers as well.
 
 All of this has been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iFYEABECABYFAlfyNjEPHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspwIcAn2uN
 qCD8xQJ0Cs61hD1nUzhNygG8AJ94I4zz/fPGpyh/CtJfLQwtUdLhNA==
 =Rken
 -----END PGP SIGNATURE-----

Merge tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty and serial updates from Greg KH:
 "Here is the big tty and serial patch set for 4.9-rc1.

  It also includes some drivers/dma/ changes, as those were needed by
  some serial drivers, and they were all acked by the DMA maintainer.

  Also in here is the long-suffering ACPI SPCR patchset, which was
  passed around from maintainer to maintainer like a hot-potato. Seems I
  was the sucker^Wlucky one. All of those patches have been acked by the
  various subsystem maintainers as well.

  All of this has been in linux-next with no reported issues"

* tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (111 commits)
  Revert "serial: pl011: add console matching function"
  MAINTAINERS: update entry for atmel_serial driver
  serial: pl011: add console matching function
  ARM64: ACPI: enable ACPI_SPCR_TABLE
  ACPI: parse SPCR and enable matching console
  of/serial: move earlycon early_param handling to serial
  Revert "drivers/tty: Explicitly pass current to show_stack"
  tty: amba-pl011: Don't complain on -EPROBE_DEFER when no irq
  nios2: dts: 10m50: Add tx-threshold parameter
  serial: 8250: Set Altera 16550 TX FIFO Threshold
  serial: 8250: of: Load TX FIFO Threshold from DT
  Documentation: dt: serial: Add TX FIFO threshold parameter
  drivers/tty: Explicitly pass current to show_stack
  serial: imx: Fix DCD reading
  serial: stm32: mark symbols static where possible
  serial: xuartps: Add some register initialisation to cdns_early_console_setup()
  serial: xuartps: Removed unwanted checks while reading the error conditions
  serial: xuartps: Rewrite the interrupt handling logic
  serial: stm32: use mapbase instead of membase for DMA
  tty/serial: atmel: fix fractional baud rate computation
  ...
2016-10-03 20:11:49 -07:00
Linus Torvalds 9929780e86 Driver core patches for 4.9-rc1
Here are the "big" driver core patches for 4.9-rc1.  Also in here are a
 number of debugfs fixes that cropped up due to the changes that happened
 in 4.8 for that filesystem.  Overall, nothing major, just a few fixes
 and cleanups.
 
 All of these have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iFYEABECABYFAlfyNw4PHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspLVYAoNXr
 FXBHGb2tNT/1PLfvUCwd5PqWAJ9Khb5WAHtvjTmEN1zabz45aSbcrA==
 =Uz6V
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here are the "big" driver core patches for 4.9-rc1. Also in here are a
  number of debugfs fixes that cropped up due to the changes that
  happened in 4.8 for that filesystem. Overall, nothing major, just a
  few fixes and cleanups.

  All of these have been in linux-next with no reported issues"

* tag 'driver-core-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (23 commits)
  drivers: dma-coherent: Move spinlock in dma_alloc_from_coherent()
  drivers: dma-coherent: Fix DMA coherent size for less than page
  MAINTAINERS: extend firmware_class maintainer list
  debugfs: propagate release() call result
  driver-core: platform: Catch errors from calls to irq_get_irq_data
  sysfs print name of undiscoverable attribute group
  carl9170: fix debugfs crashes
  b43legacy: fix debugfs crash
  b43: fix debugfs crash
  debugfs: introduce a public file_operations accessor
  device core: Remove deprecated create_singlethread_workqueue
  drivers/base dmam_declare_coherent_memory leaks
  platform: don't return 0 from platform_get_irq[_byname]() on error
  cpu: clean up register_cpu func
  dma-mapping: use vma_pages().
  drivers: dma-coherent: use vma_pages().
  attribute_container: Fix typo
  base: soc: make it explicitly non-modular
  drivers: base: dma-mapping: page align the size when unmap_kernel_range
  platform driver: fix use-after-free in platform_device_del()
  ...
2016-10-03 20:03:24 -07:00
Al Viro d82718e348 fuse_dev_splice_read(): switch to add_to_pipe()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:56 -04:00
Al Viro 79fddc4efd new helper: add_to_pipe()
single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched
to that, leaving splice_to_pipe() only for ->splice_read() instances
(and that only until they are converted as well).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:55 -04:00
Al Viro 8924feff66 splice: lift pipe_lock out of splice_to_pipe()
* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
  arrange for waiting, looping, etc. themselves.

That should make pipe_lock the outermost one.

Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can be easily broken
by changing those.  It's not even "no more than the maximal capacity of
this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
leave instead of waiting".

Considering how poorly these rules are documented, let's try "wait for some
space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
and if we run into overflow, we are done".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:55 -04:00
Al Viro db85a9eb2e splice: switch get_iovec_page_array() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:54 -04:00
Al Viro e7c3c64624 splice_to_pipe(): don't open-code wakeup_pipe_readers()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:40:54 -04:00
Al Viro 4038acdb18 consistent treatment of EFAULT on O_DIRECT read/write
Make local filesystems treat a fault as shortened IO,
returning -EFAULT only if nothing had been transferred.
That's how everything else (NFS, FUSE, ceph, Lustre)
behaves.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-03 20:38:55 -04:00
Linus Torvalds 8e4ef63867 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Ingo Molnar:
 "The main changes in this cycle centered around adding support for
  32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
  Safonov"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
  x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
  x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
  x86/signal: Add SA_{X32,IA32}_ABI sa_flags
  x86/ptrace: Down with test_thread_flag(TIF_IA32)
  x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
  x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
  x86/vdso: Replace calculate_addr in map_vdso() with addr
  x86/vdso: Unmap vdso blob on vvar mapping failure
2016-10-03 17:29:01 -07:00
Linus Torvalds 1a4a2bc460 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull low-level x86 updates from Ingo Molnar:
 "In this cycle this topic tree has become one of those 'super topics'
  that accumulated a lot of changes:

   - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
     x86 - preceded by an array of changes. v4.8 saw preparatory changes
     in this area already - this is the rest of the work. Includes the
     thread stack caching performance optimization. (Andy Lutomirski)

   - switch_to() cleanups and all around enhancements. (Brian Gerst)

   - A large number of dumpstack infrastructure enhancements and an
     unwinder abstraction. The secret long term plan is safe(r) live
     patching plus maybe another attempt at debuginfo based unwinding -
     but all these current bits are standalone enhancements in a frame
     pointer based debug environment as well. (Josh Poimboeuf)

   - More __ro_after_init and const annotations. (Kees Cook)

   - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

[ The virtually mapped stack changes are pretty fundamental, and not
  x86-specific per se, even if they are only used on x86 right now. ]

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  x86/asm: Get rid of __read_cr4_safe()
  thread_info: Use unsigned long for flags
  x86/alternatives: Add stack frame dependency to alternative_call_2()
  x86/dumpstack: Fix show_stack() task pointer regression
  x86/dumpstack: Remove dump_trace() and related callbacks
  x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
  oprofile/x86: Convert x86_backtrace() to use the new unwinder
  x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
  perf/x86: Convert perf_callchain_kernel() to use the new unwinder
  x86/unwind: Add new unwind interface and implementations
  x86/dumpstack: Remove NULL task pointer convention
  fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
  sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  lib/syscall: Pin the task stack in collect_syscall()
  x86/process: Pin the target stack in get_wchan()
  x86/dumpstack: Pin the target stack when dumping it
  kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
  sched/core: Add try_get_task_stack() and put_task_stack()
  x86/entry/64: Fix a minor comment rebase error
  iommu/amd: Don't put completion-wait semaphore on stack
  ...
2016-10-03 16:13:28 -07:00
Linus Torvalds 00bcf5cdd6 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - rwsem micro-optimizations (Davidlohr Bueso)

   - Improve the implementation and optimize the performance of
     percpu-rwsems. (Peter Zijlstra.)

   - Convert all lglock users to better facilities such as percpu-rwsems
     or percpu-spinlocks and remove lglocks. (Peter Zijlstra)

   - Remove the ticket (spin)lock implementation. (Peter Zijlstra)

   - Korean translation of memory-barriers.txt and related fixes to the
     English document. (SeongJae Park)

   - misc fixes and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cmpxchg, locking/atomics: Remove superfluous definitions
  x86, locking/spinlocks: Remove ticket (spin)lock implementation
  locking/lglock: Remove lglock implementation
  stop_machine: Remove stop_cpus_lock and lg_double_lock/unlock()
  fs/locks: Use percpu_down_read_preempt_disable()
  locking/percpu-rwsem: Add down_read_preempt_disable()
  fs/locks: Replace lg_local with a per-cpu spinlock
  fs/locks: Replace lg_global with a percpu-rwsem
  locking/percpu-rwsem: Add DEFINE_STATIC_PERCPU_RWSEMand percpu_rwsem_assert_held()
  locking/pv-qspinlock: Use cmpxchg_release() in __pv_queued_spin_unlock()
  locking/rwsem, x86: Drop a bogus cc clobber
  futex: Add some more function commentry
  locking/hung_task: Show all locks
  locking/rwsem: Scan the wait_list for readers only once
  locking/rwsem: Remove a few useless comments
  locking/rwsem: Return void in __rwsem_mark_wake()
  locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()
  locking/Documentation: Add Korean translation
  locking/Documentation: Fix a typo of example result
  locking/Documentation: Fix wrong section reference
  ...
2016-10-03 12:15:00 -07:00
Mike Marshall f60fbdbf41 Revert "orangefs: bump minimum userspace version"
The features op did make it into OrangeFS 2.9.6 after all.

This reverts commit 0c95ad7636.
2016-10-03 15:07:36 -04:00
Linus Torvalds de956b8f45 Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "Main changes in this cycle were:

   - Refactor the EFI memory map code into architecture neutral files
     and allow drivers to permanently reserve EFI boot services regions
     on x86, as well as ARM/arm64. (Matt Fleming)

   - Add ARM support for the EFI ESRT driver. (Ard Biesheuvel)

   - Make the EFI runtime services and efivar API interruptible by
     swapping spinlocks for semaphores. (Sylvain Chouleur)

   - Provide the EFI identity mapping for kexec which allows kexec to
     work on SGI/UV platforms with requiring the "noefi" kernel command
     line parameter. (Alex Thorlton)

   - Add debugfs node to dump EFI page tables on arm64. (Ard Biesheuvel)

   - Merge the EFI test driver being carried out of tree until now in
     the FWTS project. (Ivan Hu)

   - Expand the list of flags for classifying EFI regions as "RAM" on
     arm64 so we align with the UEFI spec. (Ard Biesheuvel)

   - Optimise out the EFI mixed mode if it's unsupported (CONFIG_X86_32)
     or disabled (CONFIG_EFI_MIXED=n) and switch the early EFI boot
     services function table for direct calls, alleviating us from
     having to maintain the custom function table. (Lukas Wunner)

   - Miscellaneous cleanups and fixes"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  x86/efi: Round EFI memmap reservations to EFI_PAGE_SIZE
  x86/efi: Allow invocation of arbitrary boot services
  x86/efi: Optimize away setup_gop32/64 if unused
  x86/efi: Use kmalloc_array() in efi_call_phys_prolog()
  efi/arm64: Treat regions with WT/WC set but WB cleared as memory
  efi: Add efi_test driver for exporting UEFI runtime service interfaces
  x86/efi: Defer efi_esrt_init until after memblock_x86_fill
  efi/arm64: Add debugfs node to dump UEFI runtime page tables
  x86/efi: Remove unused find_bits() function
  fs/efivarfs: Fix double kfree() in error path
  x86/efi: Map in physical addresses in efi_map_region_fixed
  lib/ucs2_string: Speed up ucs2_utf8size()
  firmware-gsmi: Delete an unnecessary check before the function call "dma_pool_destroy"
  x86/efi: Initialize status to ensure garbage is not returned on small size
  efi: Replace runtime services spinlock with semaphore
  efi: Don't use spinlocks for efi vars
  efi: Use a file local lock for efivars
  efi/arm*: esrt: Add missing call to efi_esrt_init()
  efi/esrt: Use memremap not ioremap to access ESRT table in memory
  x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data
  ...
2016-10-03 11:33:18 -07:00
David Sterba 0e6757859e btrfs: tests: uninline member definitions in free_space_extent
The recommended way is to put all members on separate lines.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:15 +02:00
David Sterba d2d9ac6aae btrfs: tests: constify free space extent specs
We don't change the given extent ranges, mark them const to catch
accidental changes.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:15 +02:00
Omar Sandoval 781e3bcf0e Btrfs: expand free space tree sanity tests to catch endianness bug
The free space tree format conversion functions were broken on
big-endian systems, but the sanity tests didn't catch it because all of
the operations were aligned to multiple words. This was meant to catch
any bugs in the extent buffer code's handling of high memory, but it
ended up hiding the endianness bug. Expand the tests to do both
sector-aligned and page-aligned operations.

Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Omar Sandoval 9426ce754f Btrfs: fix extent buffer bitmap tests on big-endian systems
The in-memory bitmap code manipulates words and is therefore sensitive
to endianness, while the extent buffer bitmap code addresses bytes and
is byte-order agnostic. Because the byte addressing of the extent buffer
bitmaps is equivalent to a little-endian in-memory bitmap, the extent
buffer bitmap tests fail on big-endian systems.

34b3e6c92a ("Btrfs: self-tests: Fix extent buffer bitmap test fail on
BE system") worked around another endianness bug in the tests but missed
this one because ed9e4afdb0 ("Btrfs: self-tests: Execute page
straddling test only when nodesize < PAGE_SIZE") disables this part of
the test on ppc64. That change lost the original meaning of the test,
however. We really want to test that an equivalent series of operations
using the in-memory bitmap API and the extent buffer bitmap API produces
equivalent results.

To fix this, don't use memcmp_extent_buffer() or write_extent_buffer();
do everything bit-by-bit.

Reported-by: Anatoly Pugachev <matorola@gmail.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com>
Tested-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Omar Sandoval 6675df311d Btrfs: catch invalid free space trees
There are two separate issues that can lead to corrupted free space
trees.

1. The free space tree bitmaps had an endianness issue on big-endian
   systems which is fixed by an earlier patch in this series.
2. btrfs-progs before v4.7.3 modified filesystems without updating the
   free space tree.

To catch both of these issues at once, we need to force the free space
tree to be rebuilt. To do so, add a FREE_SPACE_TREE_VALID compat_ro bit.
If the bit isn't set, we know that it was either produced by a broken
big-endian kernel or may have been corrupted by btrfs-progs.

This also provides us with a way to add rudimentary read-write support
for the free space tree to btrfs-progs: it can just clear this bit and
have the kernel rebuild the free space tree.

Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Omar Sandoval f8d468a15c Btrfs: fix mount -o clear_cache,space_cache=v2
We moved the code for creating the free space tree the first time that
it's enabled, but didn't move the clearing code along with it. This
breaks my (undocumented) intention that `mount -o
clear_cache,space_cache=v2` would clear the free space tree and then
recreate it.

Fixes: 511711af91 ("btrfs: don't run delayed references while we are creating the free space tree")
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Omar Sandoval 2fe1d55134 Btrfs: fix free space tree bitmaps on big-endian systems
In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
bitmaps in memory and copy them directly to/from the extent buffers with
{read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
granularity, which is equivalent to a little-endian bitmap. This means
that on big-endian systems, the in-memory bitmaps will be written to
disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
memory.

Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-03 18:52:14 +02:00
Darrick J. Wong 350a27a6a6 xfs: introduce reflink utility functions
These functions will be used by the other reflink functions to find
the maximum length of a range of shared blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:25 -07:00
Darrick J. Wong d0e853f360 xfs: reserve AG space for the refcount btree root
Reduce the max AG usable space size so that we always have space for
the refcount btree root.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:24 -07:00
Darrick J. Wong a90c00f055 xfs: add refcount btree block detection to log recovery
Identify refcountbt blocks in the log correctly so that we can
validate them during log recovery.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:23 -07:00
Darrick J. Wong 62aab20f08 xfs: adjust refcount when unmapping file blocks
When we're unmapping blocks from a reflinked file, decrease the
refcount of the affected blocks and free the extents that are no
longer in use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:23 -07:00
Darrick J. Wong 33ba612920 xfs: connect refcount adjust functions to upper layers
Plumb in the upper level interface to schedule and finish deferred
refcount operations via the deferred ops mechanism.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:22 -07:00
Darrick J. Wong 3172725814 xfs: adjust refcount of an extent of blocks in refcount btree
Provide functions to adjust the reference counts for an extent of
physical blocks stored in the refcount btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:21 -07:00
Darrick J. Wong f997ee2137 xfs: log refcount intent items
Provide a mechanism for higher levels to create CUI/CUD items, submit
them to the log, and a stub function to deal with recovered CUI items.
These parts will be connected to the refcountbt in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:21 -07:00
Darrick J. Wong baf4bcacb7 xfs: create refcount update intent log items
Create refcount update intent/done log items to record redo
information in the log.  Because we need to roll transactions between
updating the bmbt mapping and updating the reverse mapping, we also
have to track the status of the metadata updates that will be recorded
in the post-roll transactions, just in case we crash before committing
the final transaction.  This mechanism enables log recovery to finish
what was already started.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:20 -07:00
Darrick J. Wong bdf28630b7 xfs: add refcount btree operations
Implement the generic btree operations required to manipulate refcount
btree blocks.  The implementation is similar to the bmapbt, though it
will only allocate and free blocks from the AG.

Since the refcount root and level fields are separate from the
existing roots and levels array, they need a separate logging flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: fix logging of AGF refcount btree fields]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:19 -07:00
Darrick J. Wong f310bd2ecd xfs: account for the refcount btree in the alloc/free log reservation
Every time we allocate or free a data extent, we might need to split
the refcount btree.  Reserve some blocks in the transaction to handle
this possibility.  Even though the deferred refcount code can roll a
transaction to avoid overloading the transaction, we can still exceed
the reservation.

Certain pathological workloads (1k blocks, no cowextsize hint, random
directio writes), cause a perfect storm wherein a refcount adjustment
of a large range of blocks causes full tree splits in two separate
extents in two separate refcount tree blocks; allocating new refcount
tree blocks causes rmap btree splits; and all the allocation activity
causes the freespace btrees to split, blowing the reservation.

(Reproduced by generic/167 over NFS atop XFS)

Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick.wong@oracle.com: add commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-10-03 09:11:19 -07:00
Darrick J. Wong ac4fef6938 xfs: add refcount btree support to growfs
Modify the growfs code to initialize new refcount btree blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:18 -07:00
Darrick J. Wong 1946b91cee xfs: define the on-disk refcount btree format
Start constructing the refcount btree implementation by establishing
the on-disk format and everything needed to read, write, and
manipulate the refcount btree blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:18 -07:00
Darrick J. Wong af30dfa144 xfs: refcount btree add more reserved blocks
Since XFS reserves a small amount of space in each AG as the minimum
free space needed for an operation, save some more space in case we
touch the refcount btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:17 -07:00
Darrick J. Wong 46eeb521b9 xfs: introduce refcount btree definitions
Add new per-AG refcount btree definitions to the per-AG structures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:16 -07:00
Darrick J. Wong c75c752d03 xfs: define tracepoints for refcount btree activities
Define all the tracepoints we need to inspect the refcount btree
runtime operation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:15 -07:00
Darrick J. Wong 9cdafd8a76 xfs: return an error when an inline directory is too small
If the size of an inline directory is so small that it doesn't
even cover the required header size, return an error to userspace
instead of ASSERTing and returning 0 like everything's ok.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03 09:11:15 -07:00
Darrick J. Wong 71be6b4942 vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks
Add a new fallocate mode flag that explicitly unshares blocks on
filesystems that support such features.  The new flag can only
be used with an allocate-mode fallocate call.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-10-03 09:11:14 -07:00
Wei Yongjun 8cdcc07dde ceph: use list_move instead of list_del/list_add
Using list_move() instead of list_del() + list_add().

Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03 16:13:50 +02:00
Yan, Zheng fcff415c94 ceph: handle CEPH_SESSION_REJECT message
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-10-03 16:13:50 +02:00
Yan, Zheng ce2728aaa8 ceph: avoid accessing / when mounting a subpath
Accessing / causes failuire if the client has caps that restrict path

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-10-03 16:13:50 +02:00
Yan, Zheng db4a63aab4 ceph: fix mandatory flock check
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-10-03 16:13:49 +02:00
NeilBrown e55f1a1871 ceph: remove warning when ceph_releasepage() is called on dirty page
If O_DIRECT writes are racing with buffered writes, then
the call to invalidate_inode_pages2_range() can call ceph_releasepage()
on dirty pages.

Most filesystems hold inode_lock() across O_DIRECT writes so they do not
suffer this race, but cephfs deliberately drops the lock, and opens a window
for the race.

This race can be triggered with the generic/036 test from the xfstests
test suite.  It doesn't happen every time, but it does happen often.

As the possibilty is expected, remove the warning, and instead include
the PageDirty() status in the debug message.

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-10-03 16:13:49 +02:00
NeilBrown 5d7eb1a322 ceph: ignore error from invalidate_inode_pages2_range() in direct write
This call can fail if there are dirty pages.  The preceding call to
filemap_write_and_wait_range() will normally remove dirty pages, but
as inode_lock() is not held over calls to ceph_direct_read_write(), it
could race with non-direct writes and pages could be dirtied
immediately after filemap_write_and_wait_range() returns

If there are dirty pages, they will be removed by the subsequent call
to truncate_inode_pages_range(), so having them here is not a problem.

If the 'ret' value is left holding an error, then in the async IO case
(aio_req is not NULL) the loop that would normally call
ceph_osdc_start_request() will see the error in 'ret' and abort all
requests.  This doesn't seem like correct behaviour.

So use separate 'ret2' instead of overloading 'ret'.

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-10-03 16:13:49 +02:00
Yan, Zheng 1afe478569 ceph: fix error handling of start_read()
If start_page() fails to add a page to page cache or fails to send
OSD request. It should cal put_page() (instead of free_page()) for
relevant pages.

Besides, start_page() need to cancel fscache readpage if it fails
to send OSD request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reported-by: Zhi Zhang <zhang.david2011@gmail.com>
2016-10-03 16:13:49 +02:00
Miklos Szeredi 63401ccdb2 fuse: limit xattr returned size
Don't let userspace filesystem give bogus values for the size of xattr and
xattr list.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-03 11:06:05 +02:00
David S. Miller b50afd203a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes.  Nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02 22:20:41 -04:00
Dave Chinner 155cd433b5 Merge branch 'xfs-4.9-log-recovery-fixes' into for-next 2016-10-03 09:56:28 +11:00
Dave Chinner a1f45e668e Merge branch 'iomap-4.9-dax' into for-next 2016-10-03 09:53:59 +11:00
Dave Chinner a89b3f97bb Merge branch 'xfs-4.9-delalloc-rework' into for-next 2016-10-03 09:52:51 +11:00
Dave Chinner 79ad576124 Merge branch 'xfs-4.9-reflink-prep' into for-next 2016-10-03 09:52:31 +11:00
Dave Chinner b036b97050 Merge branch 'iomap-4.9-misc-fixes-1' into for-next 2016-10-03 09:52:11 +11:00
Christoph Hellwig a447d7cd15 xfs: update atime before I/O in xfs_file_dio_aio_read
After the call to __blkdev_direct_IO the final reference to the file
might have been dropped by aio_complete already, and the call to
file_accessed might cause a use after free.

Instead update the access time before the I/O, similar to how we
update the time stamps before writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-and-tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-03 09:47:34 +11:00
Christoph Hellwig d5bfccdf38 ext2: fix possible integer truncation in ext2_iomap_begin
For 32-bit architectures we need to cast first_block to u64 before
shifting it left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-03 09:46:04 +11:00
Julia Lawall ec037dfcc0 UBIFS: improve function-level documentation
Fix various inconsistencies in the documentation associated with various
functions.

In the case of fs/ubifs/lprops.c, the second parameter of
ubifs_get_lp_stats was renamed from st to lst in commit 84abf972cc
("UBIFS: add re-mount debugging checks")

In the case of fs/ubifs/lpt_commit.c, the excess variables have never
existed in the associated functions since the code was introduced into the
kernel.

The others appear to be straightforward typos.

Issues detected using Coccinelle (http://coccinelle.lip6.fr/)

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-02 22:55:02 +02:00
Pascal Eberhard 74e9c700bc ubifs: fix host xattr_len when changing xattr
When an extended attribute is changed, xattr_len of host inode is
recalculated. ui->data_len is updated before computation and result
is wrong. This patch adds a temporary variable to fix computation.

To reproduce the issue:

~# > a.txt
~# attr -s an-attr -V a-value a.txt
~# attr -s an-attr -V a-bit-bigger-value a.txt

Now host inode xattr_len is wrong. Forcing dbg_check_filesystem()
generates the following error:

[  130.620140] UBIFS (ubi0:2): background thread "ubifs_bgt0_2" started, PID 565
[  131.470790] UBIFS error (ubi0:2 pid 564): check_inodes: inode 646 has xattr size 240, but calculated size is 256
[  131.481697] UBIFS (ubi0:2): dump of the inode 646 sitting in LEB 29:114688
[  131.488953]  magic          0x6101831
[  131.492876]  crc            0x9fce9091
[  131.496836]  node_type      0 (inode node)
[  131.501193]  group_type     1 (in node group)
[  131.505788]  sqnum          9278
[  131.509191]  len            160
[  131.512549]  key            (646, inode)
[  131.516688]  creat_sqnum    9270
[  131.520133]  size           0
[  131.523264]  nlink          1
[  131.526398]  atime          1053025857.0
[  131.530574]  mtime          1053025857.0
[  131.534714]  ctime          1053025906.0
[  131.538849]  uid            0
[  131.542009]  gid            0
[  131.545140]  mode           33188
[  131.548636]  flags          0x1
[  131.551977]  xattr_cnt      1
[  131.555108]  xattr_size     240
[  131.558420]  xattr_names    12
[  131.561670]  compr_type     0x1
[  131.564983]  data len       0
[  131.568125] UBIFS error (ubi0:2 pid 564): dbg_check_filesystem: file-system check failed with error -22
[  131.578074] CPU: 0 PID: 564 Comm: mount Not tainted 4.4.12-g3639bea54a #24
[  131.585352] Hardware name: Generic AM33XX (Flattened Device Tree)
[  131.591918] [<c00151c0>] (unwind_backtrace) from [<c0012acc>] (show_stack+0x10/0x14)
[  131.600177] [<c0012acc>] (show_stack) from [<c01c950c>] (dbg_check_filesystem+0x464/0x4d0)
[  131.608934] [<c01c950c>] (dbg_check_filesystem) from [<c019f36c>] (ubifs_mount+0x14f8/0x2130)
[  131.617991] [<c019f36c>] (ubifs_mount) from [<c00d7088>] (mount_fs+0x14/0x98)
[  131.625572] [<c00d7088>] (mount_fs) from [<c00ed674>] (vfs_kern_mount+0x4c/0xd4)
[  131.633435] [<c00ed674>] (vfs_kern_mount) from [<c00efb5c>] (do_mount+0x988/0xb50)
[  131.641471] [<c00efb5c>] (do_mount) from [<c00f004c>] (SyS_mount+0x74/0xa0)
[  131.648837] [<c00f004c>] (SyS_mount) from [<c000fe20>] (ret_fast_syscall+0x0/0x3c)
[  131.665315] UBIFS (ubi0:2): background thread "ubifs_bgt0_2" stops

Signed-off-by: Pascal Eberhard <pascal.eberhard@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-02 22:55:02 +02:00
Richard Weinberger 1e03953388 ubifs: Use move variable in ubifs_rename()
...to make the code more consistent since we use
move already in other places.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-02 22:55:02 +02:00
Richard Weinberger 9ec64962af ubifs: Implement RENAME_EXCHANGE
Adds RENAME_EXCHANGE to UBIFS, the operation itself
is completely disjunct from a regular rename() that's
why we dispatch very early in ubifs_reaname().

RENAME_EXCHANGE used by the renameat2() system call
allows the caller to exchange two paths atomically.
Both paths have to exist and have to be on the same
filesystem.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-02 22:55:02 +02:00
Richard Weinberger 9e0a1fff8d ubifs: Implement RENAME_WHITEOUT
Adds RENAME_WHITEOUT support to UBIFS, we implement
it in the same way as ext4 and xfs do.
For an overview of other ways to implement it please
refere to commit 7dcf5c3e45 ("xfs: add RENAME_WHITEOUT support").

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-02 22:55:02 +02:00
Richard Weinberger 474b93704f ubifs: Implement O_TMPFILE
This patchs adds O_TMPFILE support to UBIFS.
A temp file is a reference to an unlinked inode, a user
holding the reference can use it. As soon it is being closed
all data vanishes.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-02 22:55:02 +02:00
Miklos Szeredi 4680a7ee5d fuse: remove duplicate cs->offset assignment
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:33 +02:00
Miklos Szeredi acbe5fda1f fuse: don't use fuse_ioctl_copy_user() helper
The two invocations share little code.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:33 +02:00
Al Viro 3daa9c5165 fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:33 +02:00
Seth Forshee 703c73629f fuse: Use generic xattr ops
In preparation for posix acl support, rework fuse to use xattr handlers and
the generic setxattr/getxattr/listxattr callbacks.  Split the xattr code
out into it's own file, and promote symbols to module-global scope as
needed.

Functionally these changes have no impact, as fuse still uses a single
handler for all xattrs which uses the old callbacks.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:32 +02:00
Miklos Szeredi 29433a2991 fuse: get rid of fc->flags
Only two flags: "default_permissions" and "allow_other".  All other flags
are handled via bitfields.  So convert these two as well.  They don't
change during the lifetime of the filesystem, so this is quite safe.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:32 +02:00
Miklos Szeredi cb3ae6d25a fuse: listxattr: verify xattr list
Make sure userspace filesystem is returning a well formed list of xattr
names (zero or more nonzero length, null terminated strings).

[Michael Theall: only verify in the nonzero size case]

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-10-01 07:32:32 +02:00
Miklos Szeredi bcb6f6d2b9 fuse: use timespec64
And check for valid nsec value before passing into timespec64_to_jiffies().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:32 +02:00
Miklos Szeredi f75fdf22b0 fuse: don't use ->d_time
Store in memory pointed to by ->d_fsdata.  Use ->d_init() to allocate the
storage.  Need to use RCU freeing because the data is used in RCU lookup
mode.

We could cast ->d_fsdata directly on 64bit archs, but I don't think this is
worth the extra complexity.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:32 +02:00
Seth Forshee 60bcc88ad1 fuse: Add posix ACL support
Add a new INIT flag, FUSE_POSIX_ACL, for negotiating ACL support with
userspace.  When it is set in the INIT response, ACL support will be
enabled.  ACL support also implies "default_permissions".

When ACL support is enabled, the kernel will cache and have responsibility
for enforcing ACLs.  ACL xattrs will be passed to userspace, which is
responsible for updating the ACLs in the filesystem, keeping the file mode
in sync, and inheritance of default ACLs when new filesystem nodes are
created.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:32 +02:00
Miklos Szeredi 5e940c1dd3 fuse: handle killpriv in userspace fs
Only userspace filesystem can do the killing of suid/sgid without races.
So introduce an INIT flag and negotiate support for this.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-10-01 07:32:32 +02:00
Miklos Szeredi a09f99edde fuse: fix killing s[ug]id in setattr
Fuse allowed VFS to set mode in setattr in order to clear suid/sgid on
chown and truncate, and (since writeback_cache) write.  The problem with
this is that it'll potentially restore a stale mode.

The poper fix would be to let the filesystems do the suid/sgid clearing on
the relevant operations.  Possibly some are already doing it but there's no
way we can detect this.

So fix this by refreshing and recalculating the mode.  Do this only if
ATTR_KILL_S[UG]ID is set to not destroy performance for writes.  This is
still racy but the size of the window is reduced.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-10-01 07:32:32 +02:00
Miklos Szeredi 5e2b8828ff fuse: invalidate dir dentry after chmod
Without "default_permissions" the userspace filesystem's lookup operation
needs to perform the check for search permission on the directory.

If directory does not allow search for everyone (this is quite rare) then
userspace filesystem has to set entry timeout to zero to make sure
permissions are always performed.

Changing the mode bits of the directory should also invalidate the
(previously cached) dentry to make sure the next lookup will have a chance
of updating the timeout, if needed.

Reported-by: Jean-Pierre André <jean-pierre.andre@wanadoo.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-10-01 07:32:32 +02:00
Jaegeuk Kim e4c5d8489a f2fs: introduce update_ckpt_flags to clean up
This patch add update_ckpt_flags() to clean up the flow.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:55:24 -07:00
Chao Yu 6ca56ca429 f2fs: don't submit irrelevant page
While we call ->writepages, there are two cases:
a. we didn't writeout any dirty pages, since they are writebacked by other
thread concurrently.
b. we writeout dirty pages, and have already submitted bio to block layer.

In these cases, we don't need to do additional bio flushing unnecessarily,
it may split bio in cache into smaller one.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:39 -07:00
Chao Yu 3f5f4959b1 f2fs: fix to commit bio cache after flushing node pages
In sync_node_pages, we won't check and commit last merged pages in private
bio cache of f2fs, as these pages were taged as writeback, someone who is
waiting for writebacking of the page will be blocked until the cache was
committed by someone else.

We need to commit node type bio cache to avoid potential deadlock or long
delay of waiting writeback.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:38 -07:00
Tiezhu Yang fc0065adb2 f2fs: introduce get_checkpoint_version for cleanup
There exists almost same codes when get the value of pre_version
and cur_version in function validate_checkpoint, this patch adds
get_checkpoint_version to clean up redundant codes.

Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:37 -07:00
Sheng Yong 3fa565039e f2fs: remove dead variable
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:37 -07:00
Chao Yu 7fd748df45 f2fs: remove redundant io plug
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:36 -07:00
Chao Yu 0f34802858 f2fs: support checkpoint error injection
This patch adds to support checkpoint error injection in f2fs for testing
fatal error tolerance, it will be useful that it can simulate abnormal
power off by f2fs itself instead of calling godown ioctl by running apps.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:35 -07:00
Chao Yu 2443b8b363 f2fs: fix to recover old fault injection config in ->remount_fs
In ->remount_fs, we didn't recover original fault injection config if
we encounter error, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:34 -07:00
Chao Yu 36dbd3287f f2fs: do fault injection initialization in default_options
Do fault injection initialization in default_options to keep consistent
with other default option configurating.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:33 -07:00
Yunlei He 9c094040c5 f2fs: remove redundant value definition
This patch remove redundant value definition in build_sit_entries

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:32 -07:00
Chao Yu 1ecc0c5c50 f2fs: support configuring fault injection per superblock
Previously, we only support global fault injection configuration, so that
when we configure type/rate of fault injection through sysfs, mount
option, it will influence all f2fs partition which is being used.

It is not make sence, since it will be not convenient if developer want
to test separated partitions with different fault injection rate/type
simultaneously, also it's not possible to enable fault injection in one
partition and disable fault injection in other one.

>From now on, we move global configuration of fault injection in module
into per-superblock, hence injection testing can be more flexible.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:31 -07:00
Chao Yu d32853de50 f2fs: adjust display format of segment bit
Just adjust segment bit info printed in procfs.

Before:
1008      5|0  |0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1009      3|183|0 0 61 20 20 0 0 21 80 c0 2 e4 e 54 0 21 21 17 a 44 d0 28 e4 50 40 30 8 0 2d 32 0 5 b0 80 1 43 2 8e f8 7b 2 25 93 bf e0 73 8e 9a 19 44 60 ff e4 cc e6 8e bf f9 ff 5 3d 31 3d 13
1010      3|1  |0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After:
1008      5|0  | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1009      4|434| ff 7d ff bf d9 3f ff e7 ff bf d7 bf ff bb be ff fb df f7 fb fa bf fb fe bb df dd ff fe ef ff fe ef e2 27 bf ab bf fb df fd bd bf fb db fc ff ff 3f ff ff bf ff 5f db 3f fb fb bf fb bf 4f ff ef
1010      4|422| ff bb fe ff ef d7 ee ff ff fc bf ef 7d eb ec fd fb 3f 97 7f ef ff af ff db ff ff 69 bf ff f6 e7 ff fb f7 7b fb df be ff ff ef f3 fe ff ff df fe f7 fa ff b7 77 be fe fb a9 7f 87 a2 ac c7 ff 75

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:30 -07:00
Jaegeuk Kim bb5dada7d2 f2fs: remove dirty inode pages in error path
When getting EIO while handling orphan inodes, we can get some dirty node
pages. Then, f2fs_write_node_pages() called by iput(node_inode) will try
to flush node pages. But in this case, we should prevent to do that, since
we will try again from the start.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:29 -07:00
Eric Biggers ef68bf1197 f2fs: do not unnecessarily null-terminate encrypted symlink data
Null-terminating the fscrypt_symlink_data on read is unnecessary because
it is not string data --- it contains binary ciphertext.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:28 -07:00
Jaegeuk Kim d41065e204 f2fs: handle errors during recover_orphan_inodes
This patch fixes to handle EIO during recover_orphan_inode() given the below
panic.

F2FS-fs : inject IO error in f2fs_read_end_io+0xe6/0x100 [f2fs]
------------[ cut here ]------------
RIP: 0010:[<ffffffffc0b244e3>]  [<ffffffffc0b244e3>] f2fs_evict_inode+0x433/0x470 [f2fs]
RSP: 0018:ffff92f8b7fb7c30  EFLAGS: 00010246
RAX: ffff92fb88a13500 RBX: ffff92f890566ea0 RCX: 00000000fd3c255c
RDX: 0000000000000001 RSI: ffff92fb88a13d90 RDI: ffff92fb8ee127e8
RBP: ffff92f8b7fb7c58 R08: 0000000000000001 R09: ffff92fb88a13d58
R10: 000000005a6a9373 R11: 0000000000000001 R12: 00000000fffffffb
R13: ffff92fb8ee12000 R14: 00000000000034ca R15: ffff92fb8ee12620
FS:  00007f1fefd8e880(0000) GS:ffff92fb95600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc211d34cdb CR3: 000000012d43a000 CR4: 00000000001406e0
Stack:
 ffff92f890566ea0 ffff92f890567078 ffffffffc0b5a0c0 ffff92f890566f28
 ffff92fb888b2000 ffff92f8b7fb7c80 ffffffffbc27ff55 ffff92f890566ea0
 ffff92fb8bf10000 ffffffffc0b5a0c0 ffff92f8b7fb7cb0 ffffffffbc28090d
Call Trace:
 [<ffffffffbc27ff55>] evict+0xc5/0x1a0
 [<ffffffffbc28090d>] iput+0x1ad/0x2c0
 [<ffffffffc0b3304c>] recover_orphan_inodes+0x10c/0x2e0 [f2fs]
 [<ffffffffc0b2e0f4>] f2fs_fill_super+0x884/0x1150 [f2fs]
 [<ffffffffbc2644ac>] mount_bdev+0x18c/0x1c0
 [<ffffffffc0b2d870>] ? f2fs_commit_super+0x100/0x100 [f2fs]
 [<ffffffffc0b2a755>] f2fs_mount+0x15/0x20 [f2fs]
 [<ffffffffbc264e49>] mount_fs+0x39/0x170
 [<ffffffffbc28555b>] vfs_kern_mount+0x6b/0x160
 [<ffffffffbc2881df>] do_mount+0x1cf/0xd00
 [<ffffffffbc287f2c>] ? copy_mount_options+0xac/0x170
 [<ffffffffbc289003>] SyS_mount+0x83/0xd0
 [<ffffffffbc8ee880>] entry_SYSCALL_64_fastpath+0x23/0xc1

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:27 -07:00
Jaegeuk Kim 646e759a4d f2fs: avoid gc in cp_error case
Otherwise, we can hit
	f2fs_bug_on(sbi, !PageUptodate(sum_page));

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:26 -07:00
Jaegeuk Kim f6fe2be3c6 f2fs: should put_page for summary page
We should call put_page for preloaded summary pages in do_garbage_collect.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:25 -07:00
Jaegeuk Kim 2956e450fa f2fs: assign return value in f2fs_gc
This patch adds a return value of write_checkpoint for f2fs_gc.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:24 -07:00
Weichao Guo 5b7a487cf3 f2fs: add customized migrate_page callback
This patch improves the migration of dirty pages and allows migrating atomic
written pages that F2FS uses in Page Cache. Instead of the fallback releasing
page path, it provides better performance for memory compaction, CMA and other
users of memory page migrating. For dirty pages, there is no need to write back
first when migrating. For an atomic written page before committing, we can
migrate the page and update the related 'inmem_pages' list at the same time.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix some coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:23 -07:00
Chao Yu aaec2b1d18 f2fs: introduce cp_lock to protect updating of ckpt_flags
This patch introduces spinlock to protect updating process of ckpt_flags
field in struct f2fs_checkpoint, it avoids incorrectly updating in race
condition.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: add __is_set_ckpt_flags likewise __set_ckpt_flags]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 17:34:20 -07:00
Eric Ren c33f0785bf ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally.

In this testcase, we create a 2*CLUSTER_SIZE file and mmap() on it;
there are 2 process repeatedly performing the following operations
respectively: one is doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a',
1), while the another is playing ftruncate(fd, 2*CLUSTER_SIZE) and then
ftruncate(fd, CLUSTER_SIZE) again and again.

This is the backtrace when the deadlock happens:

   __wait_on_bit_lock+0x50/0xa0
   __lock_page+0xb7/0xc0
   ocfs2_write_begin_nolock+0x163f/0x1790 [ocfs2]
   ocfs2_page_mkwrite+0x1c7/0x2a0 [ocfs2]
   do_page_mkwrite+0x66/0xc0
   handle_mm_fault+0x685/0x1350
   __do_page_fault+0x1d8/0x4d0
   trace_do_page_fault+0x37/0xf0
   do_async_page_fault+0x19/0x70
   async_page_fault+0x28/0x30

In ocfs2_write_begin_nolock(), we first grab the pages and then allocate
disk space for this write; ocfs2_try_to_free_truncate_log() will be
called if -ENOSPC is returned; if we're lucky to get enough clusters,
which is usually the case, we start over again.

But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we
will deadlock when trying to grab the target page again.

Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write().
Another deadlock will happen in __do_page_mkwrite() if
ocfs2_page_mkwrite() returns non-VM_FAULT_LOCKED, and along with a
locked target page.

These two errors fail on the same path, so fix them by unlocking the
target page manually before ocfs2_free_write_ctxt().

Jan Kara helps me clear out the JBD2 part, and suggest the hint for root
cause.

Changes since v1:
1. Also put ENOMEM error case into consideration.

Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: He Gang <ghe@suse.com>
Acked-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-30 15:26:52 -07:00
Eric W. Biederman 069d5ac9ae autofs: Fix automounts by using current_real_cred()->uid
Seth Forshee reports that in 4.8-rcN some automounts are failing
because the requesting the automount changed.

The relevant call path is:
follow_automount()
    ->d_automount
    autofs4_d_automount
       autofs4_mount_wait
           autofs4_wait

In autofs4_wait wq_uid and wq_gid are set to current_uid() and
current_gid respectively.  With follow_automount now overriding creds
uid that we export to userspace changes and that breaks existing
setups.

To remove the regression set wq_uid and wq_gid from
current_real_cred()->uid and current_real_cred()->gid respectively.
This restores the current behavior as current->real_cred is identical
to current->cred except when override creds are used.

Cc: stable@vger.kernel.org
Fixes: aeaa4a79ff ("fs: Call d_automount with the filesystems creds")
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-30 12:48:01 -05:00
Eric W. Biederman d29216842a mnt: Add a per mount namespace limit on the number of mounts
CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-30 12:46:48 -05:00
Chao Yu fadb2fb8af f2fs: fix to avoid race condition when updating sbi flag
Making updating of sbi flag atomic by using {test,set,clear}_bit,
otherwise in concurrency scenario, the flag could be updated incorrectly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 10:05:50 -07:00
Jaegeuk Kim 9e1e6df412 f2fs: put directory inodes before checkpoint in roll-forward recovery
Before checkpoint, we'd be better drop any inodes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 10:05:49 -07:00
Jaegeuk Kim a468f0ef51 f2fs: use crc and cp version to determine roll-forward recovery
Previously, we used cp_version only to detect recoverable dnodes.
In order to avoid same garbage cp_version, we needed to truncate the next
dnode during checkpoint, resulting in additional discard or data write.
If we can distinguish this by using crc in addition to cp_version, we can
remove this overhead.

There is backward compatibility concern where it changes node_footer layout.
So, this patch introduces a new checkpoint flag, CP_CRC_RECOVERY_FLAG, to
detect new layout. New layout will be activated only when this flag is set.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-30 10:05:46 -07:00
Thomas Gleixner d7e25c66c9 Merge branch 'x86/urgent' into x86/asm
Get the cr4 fixes so we can apply the final cleanup
2016-09-30 12:38:28 +02:00
Ingo Molnar 0b429e18c2 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:46 +02:00
Eric Engestrom 18017479ca ext4: remove unused variable
Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
2016-09-30 02:14:56 -04:00
Eric Whitney 3c816ded78 ext4: use journal inode to determine journal overhead
When a file system contains an internal journal that has not been
loaded, use the journal inode's i_size field to determine its
contribution to the file system's overhead.  (The journal's j_maxlen
field is normally used to determine its size, but it's unavailable when
the journal has not been loaded.)

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 02:08:49 -04:00
Eric Whitney c6cb7e776a ext4: create function to read journal inode
Factor out the code used in ext4_get_journal() to read a valid journal
inode from storage, enabling its reuse in other functions.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 02:05:09 -04:00
Jan Kara 9b623df614 ext4: unmap metadata when zeroing blocks
When zeroing blocks for DAX allocations, we also have to unmap aliases
in the block device mappings.  Otherwise writeback can overwrite zeros
with stale data from block device page cache.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2016-09-30 02:02:29 -04:00
Jan Kara 51e8137b82 ext4: remove plugging from ext4_file_write_iter()
do_blockdev_direct_IO() takes care of properly plugging direct IO so
there's no need to plug again inside ext4_file_write_iter().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:57:41 -04:00
Jan Kara 4b0524aae0 ext4: allow unlocked direct IO when pages are cached
Currently we do not allow unlocked (meaning without inode_lock) direct
IO when the file has any pages cached. This check is not needed anymore
as we keep inode lock until ext4_direct_IO_write() and thus can happily
writeback and evict any pages conflicting with current direct IO write.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:55:32 -04:00
Richard Weinberger 9a200d075e ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
...otherwise an user can enable encryption for certain files even
when the filesystem is unable to support it.
Such a case would be a filesystem created by mkfs.ext4's default
settings, 1KiB block size. Ext4 supports encyption only when block size
is equal to PAGE_SIZE.
But this constraint is only checked when the encryption feature flag
is set.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:49:55 -04:00
Eric Biggers 55be3145d1 fscrypto: use standard macros to compute length of fname ciphertext
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:46:18 -04:00
Eric Biggers cc91542ac8 ext4: do not unnecessarily null-terminate encrypted symlink data
Null-terminating the fscrypt_symlink_data on read is unnecessary because
it is not string data --- it contains binary ciphertext.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:44:17 -04:00
gmail e81d44778d ext4: release bh in make_indexed_dir
The commit 6050d47adcad: "ext4: bail out from make_indexed_dir() on
first error" could end up leaking bh2 in the error path.

[ Also avoid renaming bh2 to bh, which just confuses things --tytso ]

Cc: stable@vger.kernel.org
Signed-off-by: yangsheng <yngsion@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:33:37 -04:00
Jan Kara 16c5468859 ext4: Allow parallel DIO reads
We can easily support parallel direct IO reads. We only have to make
sure we cannot expose uninitialized data by reading allocated block to
which data was not written yet, or which was already truncated. That is
easily achieved by holding inode_lock in shared mode - that excludes all
writes, truncates, hole punches. We also have to guard against page
writeback allocating blocks for delay-allocated pages - that race is
handled by the fact that we writeback all the pages in the affected
range and the lock protects us from new pages being created there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:03:17 -04:00
Olga Kornievskaia a865880e20 Retry operation on EREMOTEIO on an interrupted slot
If an operation got interrupted, then since we don't know if the
server processed it on not, we keep the seq#. Upon reuse of slot
and seq# if we get reply from the cache (ie EREMOTEIO) then we
need to retry the operation after bumping the seq#

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-29 12:31:48 -04:00
Martin Brandenburg b78b11985a Merge branch 'misc' into for-next
Pull in an OrangeFS branch containing miscellaneous improvements.

- clean up debugfs globals
- remove dead code in sysfs
- reorganize duplicated sysfs attribute structs
- consolidate sysfs show and store functions
- remove duplicated sysfs_ops structures
- describe organization of sysfs
- make devreq_mutex static
- g_orangefs_stats -> orangefs_stats for consistency
- rename most remaining global variables
2016-09-28 14:50:46 -04:00
Al Viro dbbab32574 cifs: get rid of unused arguments of CIFSSMBWrite()
they used to be used, but...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:54:53 -04:00
Andreas Gruenbacher 2211d5ba5c posix_acl: xattr representation cleanups
Remove the unnecessary typedefs and the zero-length a_entries array in
struct posix_acl_xattr_header.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:52:00 -04:00