Commit Graph

47453 Commits

Author SHA1 Message Date
Anna Schumaker 7d38de3ffa NFS: Remove unused authflavour parameter from nfs_get_client()
This parameter hasn't been used since f8407299 (Linux 3.11-rc2), so
let's remove it from this function and callers.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:46:32 -05:00
J. Bruce Fields ced85a7568 nfs: fix false positives in nfs40_walk_client_list()
It's possible that two different servers can return the same (clientid,
verifier) pair purely by coincidence.  Both are 64-bit values, but
depending on the server implementation, they can be highly predictable
and collisions may be quite likely, especially when there are lots of
servers.

So, check for this case.  If the clientid and verifier both match, then
we actually know they *can't* be the same server, since a new
SETCLIENTID to an already-known server should have changed the verifier.

This helps fix a bug that could cause the client to mount a filesystem
from the wrong server.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Yongcheng Yang <yoyang@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:43:31 -05:00
Trond Myklebust b85f562049 pNFS: Skip invalid stateids when doing a bulk destroy
If the layout stateid is already invalid, we have no work to do.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:51 -05:00
Trond Myklebust 29ade5db12 pNFS: Wait on outstanding layoutreturns to complete in pnfs_roc()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:50 -05:00
Trond Myklebust abb3e1c877 pNFS: Don't mark the layout as freed if the last lseg is marked for return
Address another memory leak.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:50 -05:00
Trond Myklebust 4aab97327f pNFS: Sync the layout state bits in pnfs_cache_lseg_for_layoutreturn
Ensure that the layout state bits are synced when we cache a layout
segment for layoutreturn using an appropriate call to
pnfs_set_plh_return_info.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:49 -05:00
Trond Myklebust 24408f5282 pNFS: Fix bugs in _pnfs_return_layout
We need to honour the NFS_LAYOUT_RETURN_REQUESTED bit regardless of
whether or not there are layout segments pending.
Furthermore, we should ensure that we leave the plh_return_segs list
empty.

This patch fixes a memory leak of the layout segments on plh_return_segs.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:49 -05:00
Trond Myklebust fe1cf9469d pNFS: Clear all layout segment state in pnfs_mark_layout_stateid_invalid
When the layout state is invalidated, then so is the layout segment
state, and hence we do need to clean up the state bits.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:48 -05:00
Trond Myklebust 53e6fc86ab pNFS: Prevent unnecessary layoutreturns after delegreturn
If we cannot grab the inode or superblock, then we cannot pin the
layout header, and so we cannot send a layoutreturn as part of an
async delegreturn call. In this case, we currently end up sending
an extra layoutreturn after the delegreturn. Since the layout was
implicitly returned by the delegreturn, that just gets a BAD_STATEID.

The fix is to simply complete the return-on-close immediately.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:48 -05:00
Trond Myklebust 1c5bd76d17 pNFS: Enable layoutreturn operation for return-on-close
Amend the pnfs return on close helper functions to enable sending the
layoutreturn op in CLOSE/DELEGRETURN. This closes a potential race between
CLOSE/DELEGRETURN and parallel OPEN calls to the same file, and allows the
client and the server to agree on whether or not there is an outstanding
layout.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:47 -05:00
Trond Myklebust 828ed9ec1b pNFS: Clean up - add a helper to initialise struct layoutreturn_args
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:47 -05:00
Trond Myklebust 586f1c39da NFSv4: Add encode/decode of the layoutreturn op in DELEGRETURN
Add XDR encoding for the layoutreturn op, and storage for the layoutreturn
arguments to the DELEGRETURN compound.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:46 -05:00
Trond Myklebust cf80516579 NFSv4: Add encode/decode of the layoutreturn op in CLOSE
Add XDR encoding for the layoutreturn op, and storage for the layoutreturn
arguments to the CLOSE compound.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:46 -05:00
Trond Myklebust d8434d4c54 NFSv4: Fix missing operation accounting in NFS4_dec_delegreturn_sz
We need to account for the reply to the PUTFH operation in the
DELEGRETURN compound.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:45 -05:00
Trond Myklebust 69820d22c5 pNFS: Don't mark layout segments invalid on layoutreturn in pnfs_roc
The layoutreturn call will take care of invalidating the layout segments
once the call is successful.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:45 -05:00
Trond Myklebust 94e5c571fc pNFS: Get rid of unnecessary layout parameter in encode_layoutreturn callback
The parameter is already present in the "args" structure.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:44 -05:00
Trond Myklebust 0cdc329ec9 pNFS: Skip checking for return-on-close if the layout is invalid
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:44 -05:00
Trond Myklebust e685d237e6 pNFS: Remove spurious wake up in pnfs_layout_remove_lseg()
There is no change to the value of NFS_LAYOUT_RETURN, so we should
not be waking up the RPC call.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:43 -05:00
Trond Myklebust 2a974425e5 NFSv4: Ignore LAYOUTRETURN result if the layout doesn't match or is invalid
Fix a potential race with CB_LAYOUTRECALL in which the server recalls the
remaining layout segments while our LAYOUTRETURN is still in transit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:43 -05:00
Trond Myklebust 68f744797e pNFS: Do not free layout segments that are marked for return
We may want to process and transmit layout stat information for the
layout segments that are being returned, so we should defer freeing
them until after the layoutreturn has completed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:42 -05:00
Trond Myklebust 7b410d9ce4 pNFS: Delay getting the layout header in CB_LAYOUTRECALL handlers
Instead of grabbing the layout, we want to get the inode so that we
can reduce races between layoutget and layoutrecall when the server
does not support call referring.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:42 -05:00
Trond Myklebust 17822b207f pNFS: consolidate the different range intersection tests
Both pnfs.c and the flexfiles code have their own versions of the
range intersection testing, and the "end_offset" helper.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:41 -05:00
Trond Myklebust ee284e35d8 pNFS: Fix race in pnfs_wait_on_layoutreturn
We must put the task to sleep while holding the inode->i_lock in order
to ensure atomicity with the test for NFS_LAYOUT_RETURN.

Fixes: 500d701f33 ("NFS41: make close wait for layoutreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:41 -05:00
Trond Myklebust 6604b203fb pNFS: On error, do not send LAYOUTGET until the LAYOUTRETURN has completed
If there is an I/O error, we should not call LAYOUTGET until the
LAYOUTRETURN that reports the error is complete.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+
2016-12-01 17:21:40 -05:00
Trond Myklebust 9888d837f3 pNFS: Force a retry of LAYOUTGET if the stateid doesn't match our cache
If the server sends us a completely new stateid, and the client thinks
it already holds a layout, then force a retry of the LAYOUTGET after
invalidating the existing layout in order to avoid corruption due to
races.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:40 -05:00
Trond Myklebust ae5a459d5f pNFS: Clear NFS_LAYOUT_RETURN_REQUESTED when invalidating the layout stateid
We must ensure that we don't schedule a layoutreturn if the layout stateid
has been marked as invalid.

Fixes: 2a59a04116 ("pNFS: Fix pnfs_set_layout_stateid() to clear...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+
2016-12-01 17:21:39 -05:00
Trond Myklebust 7b650994ab pNFS: Don't clear the layout stateid if a layout return is outstanding
If we no longer hold any layout segments, we're normally expected to
consider the layout stateid to be invalid. However we cannot assume this
if we're about to, or in the process of sending a layoutreturn.

Fixes: 334a8f3711 ("pNFS: Don't forget the layout stateid if...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+
2016-12-01 17:21:39 -05:00
Trond Myklebust 54e4a0dfa2 pNFS: Fix a deadlock between read resends and layoutreturn
We must not call nfs_pageio_init_read() on a new nfs_pageio_descriptor
while holding a reference to a layout segment, as that can deadlock
pnfs_update_layout().

Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.0+
2016-12-01 17:21:38 -05:00
Fred Isaman 9a837856cf NFSv4.1: Fix regression in callback retry handling
When initializing a freshly created slot for the calllback channel,
the seq_nr needs to be 0, not 1.  Otherwise validate_seqid
and nfs4_slot_wait_on_seqid get confused and believe that the
mpty slot corresponds to a previously sent reply.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:38 -05:00
Trond Myklebust 1ad13dbc85 NFSv4: Optimise away forced revalidation when we know the attributes are OK
The NFS_INO_REVAL_FORCED flag needs to be set if we just got a delegation,
and we see that there might still be some ambiguity as to whether or not
our attribute or data cache are valid.
In practice, this means that a call to nfs_check_inode_attributes() will
have noticed a discrepancy between cached attributes and measured ones,
so let's move the setting of NFS_INO_REVAL_FORCED to there.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:37 -05:00
Trond Myklebust 3ecefc9295 NFSv4: Don't request close-to-open attribute when holding a delegation
If holding a delegation, we do not need to ask the server to return
close-to-open cache consistency attributes as part of the CLOSE
compound.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:37 -05:00
Trond Myklebust 3947b74d0f NFSv4: Don't request a GETATTR on open_downgrade.
If we're not closing the file completely, there is no need to request
close-to-open attributes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:36 -05:00
Trond Myklebust 1cc1baf14b NFSv4: Don't ask for the change attribute when reclaiming state
We don't need to ask for the change attribute when returning a delegation
or recovering from a server reboot, and it could actually cause us to
obtain an incorrect value if we're using a pNFS flavour that requires
LAYOUTCOMMIT.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:36 -05:00
Trond Myklebust 536585ccf9 NFSv4: Don't check file access when reclaiming state
If we're reclaiming state after a reboot, or as part of returning a
delegation, we don't need to check access modes again.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:35 -05:00
Eryu Guan 3a4b77cd47 ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. And it turns out that kernel crashed when
calculating fs overhead (ext4_calculate_overhead()), this is because
the image has very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting bitmap buffer, which is PAGE_SIZE.

ext4_calculate_overhead():
  buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
  blks = count_overhead(sb, i, buf);

count_overhead():
  for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
          ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
          count++;
  }

This can be reproduced easily for me by this script:

  #!/bin/bash
  rm -f fs.img
  mkdir -p /mnt/ext4
  fallocate -l 16M fs.img
  mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
  debugfs -w -R "ssv first_meta_bg 842150400" fs.img
  mount -o loop fs.img /mnt/ext4

Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.

Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 15:08:37 -05:00
Eric Biggers d7614cc161 ext4: correctly detect when an xattr value has an invalid size
It was possible for an xattr value to have a very large size, which
would then pass validation on 32-bit architectures due to a pointer
wraparound.  Fix this by validating the size in a way which avoids
pointer wraparound.

It was also possible that a value's size would fit in the available
space but its padded size would not.  This would cause an out-of-bounds
memory write in ext4_xattr_set_entry when replacing the xattr value.
For example, if an xattr value of unpadded size 253 bytes went until the
very end of the inode or block, then using setxattr(2) to replace this
xattr's value with 256 bytes would cause a write to the 3 bytes past the
end of the inode or buffer, and the new xattr value would be incorrectly
truncated.  Fix this by requiring that the padded size fit in the
available space rather than the unpadded size.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 14:57:29 -05:00
Eric Biggers 290ab23001 ext4: don't read out of bounds when checking for in-inode xattrs
With i_extra_isize equal to or close to the available space, it was
possible for us to read past the end of the inode when trying to detect
or validate in-inode xattrs.  Fix this by checking for the needed extra
space first.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 14:51:58 -05:00
Eric Biggers 2dc8d9e19b ext4: forbid i_extra_isize not divisible by 4
i_extra_isize not divisible by 4 is problematic for several reasons:

- It causes the in-inode xattr space to be misaligned, but the xattr
  header and entries are not declared __packed to express this
  possibility.  This may cause poor performance or incorrect code
  generation on some platforms.
- When validating the xattr entries we can read past the end of the
  inode if the size available for xattrs is not a multiple of 4.
- It allows the nonsensical i_extra_isize=1, which doesn't even leave
  enough room for i_extra_isize itself.

Therefore, update ext4_iget() to consider i_extra_isize not divisible by
4 to be an error, like the case where i_extra_isize is too large.

This also matches the rule recently added to e2fsck for determining
whether an inode has valid i_extra_isize.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems, since the size of ext4_inode
has always been a multiple of 4.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 14:43:33 -05:00
Linus Torvalds 2caceb3294 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fix from Miklos Szeredi:
 "This fixes a regression introduced in 4.8"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix d_real() for stacked fs
2016-12-01 10:31:53 -08:00
Eric Biggers ba679017ef ext4: disable pwsalt ioctl when encryption disabled by config
On a CONFIG_EXT4_FS_ENCRYPTION=n kernel, the ioctls to get and set
encryption policies were disabled but EXT4_IOC_GET_ENCRYPTION_PWSALT was
not.  But there's no good reason to expose the pwsalt ioctl if the
kernel doesn't support encryption.  The pwsalt ioctl was also disabled
pre-4.8 (via ext4_sb_has_crypto() previously returning 0 when encryption
was disabled by config) and seems to have been enabled by mistake when
ext4 encryption was refactored to use fs/crypto/.  So let's disable it
again.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:55:51 -05:00
Eric Biggers 35997d1ce8 ext4: get rid of ext4_sb_has_crypto()
ext4_sb_has_crypto() just called through to ext4_has_feature_encrypt(),
and all callers except one were already using the latter.  So remove it
and switch its one caller to ext4_has_feature_encrypt().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:54:18 -05:00
Daeho Jeong 05ac5aa18a ext4: fix inode checksum calculation problem if i_extra_size is small
We've fixed the race condition problem in calculating ext4 checksum
value in commit b47820edd1 ("ext4: avoid modifying checksum fields
directly during checksum veficationon"). However, by this change,
when calculating the checksum value of inode whose i_extra_size is
less than 4, we couldn't calculate the checksum value in a proper way.
This problem was found and reported by Nix, Thank you.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:49:12 -05:00
Jan Kara 6dcc693bc5 ext4: warn when page is dirtied without buffers
Warn when a page is dirtied without buffers (as that will likely lead to
a crash in ext4_writepages()) or when it gets newly dirtied without the
page being locked (as there is nothing that prevents buffers to get
stripped just before calling set_page_dirty() under memory pressure).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:46:40 -05:00
Rabin Vincent af309226db block: protect iterate_bdevs() against concurrent close
If a block device is closed while iterate_bdevs() is handling it, the
following NULL pointer dereference occurs because bdev->b_disk is NULL
in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
turn called by the mapping_cap_writeback_dirty() call in
__filemap_fdatawrite_range()):

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
 IP: [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
 PGD 9e62067 PUD 9ee8067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
 RIP: 0010:[<ffffffff81314790>]  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
 RSP: 0018:ffff880009f5fe68  EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
 RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
 RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
 R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
 FS:  00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
 Stack:
  ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
  0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
  ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
 Call Trace:
  [<ffffffff8112e7f5>] __filemap_fdatawrite_range+0x85/0x90
  [<ffffffff8112e81f>] filemap_fdatawrite+0x1f/0x30
  [<ffffffff811b25d6>] fdatawrite_one_bdev+0x16/0x20
  [<ffffffff811bc402>] iterate_bdevs+0xf2/0x130
  [<ffffffff811b2763>] sys_sync+0x63/0x90
  [<ffffffff815d4272>] entry_SYSCALL_64_fastpath+0x12/0x76
 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
 RIP  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
  RSP <ffff880009f5fe68>
 CR2: 0000000000000508
 ---[ end trace 2487336ceb3de62d ]---

The crash is easily reproducible by running the following command, if an
msleep(100) is inserted before the call to func() in iterate_devs():

 while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

Fix it by holding the bd_mutex across the func() call and only calling
func() if the bdev is opened.

Cc: stable@vger.kernel.org
Fixes: 5c0d6b60a0 ("vfs: Create function for iterating over block devices")
Reported-and-tested-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 08:26:39 -07:00
Steve French 8b217fe7fc SMB3: parsing for new snapshot timestamp mount parm
New mount option "snapshot=<time>" to allow mounting an earlier
version of the remote volume (if such a snapshot exists on
the server).

Note that eventually specifying a snapshot time of 1 will allow
the user to mount the oldest snapshot. A subsequent patch
add the processing for that and another for actually specifying
the "time warp" create context on SMB2/SMB3 open.

Check to make sure SMB2 negotiated, and ensure that
we use a different tcon if mount same share twice
but with different snaphshot times

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-12-01 00:23:20 -06:00
Mike Rapoport a107bf8b39 isofs: add KERN_CONT to printing of ER records
The ER records are printed without explicit log level presuming line
continuation until "\n".  After the commit 4bcc595ccd (printk:
reinstate KERN_CONT for printing continuation lines), the ER records are
printed a character per line.

Adding KERN_CONT to appropriate printk statements restores the printout
behavior.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30 10:41:26 -08:00
Robbie Ko 2a7bf53f57 Btrfs: fix tree search logic when replaying directory entry deletes
If a log tree has a layout like the following:

leaf N:
        ...
        item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8
                dir log end 1275809046
leaf N + 1:
        item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8
                dir log end 18446744073709551615
        ...

When we pass the value 1275809046 + 1 as the parameter start_ret to the
function tree-log.c:find_dir_range() (done by replay_dir_deletes()), we
end up with path->slots[0] having the value 239 (points to the last item
of leaf N, item 240). Because the dir log item in that position has an
offset value smaller than *start_ret (1275809046 + 1) we need to move on
to the next leaf, however the logic for that is wrong since it compares
the current slot to the number of items in the leaf, which is smaller
and therefore we don't lookup for the next leaf but instead we set the
slot to point to an item that does not exist, at slot 240, and we later
operate on that slot which has unexpected content or in the worst case
can result in an invalid memory access (accessing beyond the last page
of leaf N's extent buffer).

So fix the logic that checks when we need to lookup at the next leaf
by first incrementing the slot and only after to check if that slot
is beyond the last item of the current leaf.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Fixes: e02119d5a7 (Btrfs: Add a write ahead tree log to optimize synchronous operations)
Cc: stable@vger.kernel.org  # 2.6.29+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Modified changelog for clarity and correctness]
2016-11-30 16:56:12 +00:00
Robbie Ko ec125cfb7a Btrfs: fix deadlock caused by fsync when logging directory entries
While logging new directory entries, at tree-log.c:log_new_dir_dentries(),
after we call btrfs_search_forward() we get a leaf with a read lock on it,
and without unlocking that leaf we can end up calling btrfs_iget() to get
an inode pointer. The later (btrfs_iget()) can end up doing a read-only
search on the same tree again, if the inode is not in memory already, which
ends up causing a deadlock if some other task in the meanwhile started a
write search on the tree and is attempting to write lock the same leaf
that btrfs_search_forward() locked while holding write locks on upper
levels of the tree blocking the read search from btrfs_iget(). In this
scenario we get a deadlock.

So fix this by releasing the search path before calling btrfs_iget() at
tree-log.c:log_new_dir_dentries().

Example trace of such deadlock:

[ 4077.478852] kworker/u24:10  D ffff88107fc90640     0 14431      2 0x00000000
[ 4077.486752] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[ 4077.494346]  ffff880ffa56bad0 0000000000000046 0000000000009000 ffff880ffa56bfd8
[ 4077.502629]  ffff880ffa56bfd8 ffff881016ce21c0 ffffffffa06ecb26 ffff88101a5d6138
[ 4077.510915]  ffff880ebb5173b0 ffff880ffa56baf8 ffff880ebb517410 ffff881016ce21c0
[ 4077.519202] Call Trace:
[ 4077.528752]  [<ffffffffa06ed5ed>] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs]
[ 4077.536049]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
[ 4077.542574]  [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
[ 4077.550171]  [<ffffffffa06a5073>] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs]
[ 4077.558252]  [<ffffffffa06c600b>] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs]
[ 4077.566140]  [<ffffffffa06fc9e2>] ? add_delayed_data_ref+0xe2/0x150 [btrfs]
[ 4077.573928]  [<ffffffffa06fd629>] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs]
[ 4077.582399]  [<ffffffffa06cf3c0>] ? __set_extent_bit+0x4c0/0x5c0 [btrfs]
[ 4077.589896]  [<ffffffffa06b4a64>] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
[ 4077.599632]  [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs]
[ 4077.607134]  [<ffffffffa06bab57>] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs]
[ 4077.615329]  [<ffffffff8104cbc2>] ? process_one_work+0x142/0x3d0
[ 4077.622043]  [<ffffffff8104d729>] ? worker_thread+0x109/0x3b0
[ 4077.628459]  [<ffffffff8104d620>] ? manage_workers.isra.26+0x270/0x270
[ 4077.635759]  [<ffffffff81052b0f>] ? kthread+0xaf/0xc0
[ 4077.641404]  [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110
[ 4077.648696]  [<ffffffff814a9ac8>] ? ret_from_fork+0x58/0x90
[ 4077.654926]  [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110

[ 4078.358087] kworker/u24:15  D ffff88107fcd0640     0 14436      2 0x00000000
[ 4078.365981] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[ 4078.373574]  ffff880ffa57fad0 0000000000000046 0000000000009000 ffff880ffa57ffd8
[ 4078.381864]  ffff880ffa57ffd8 ffff88103004d0a0 ffffffffa06ecb26 ffff88101a5d6138
[ 4078.390163]  ffff880fbeffc298 ffff880ffa57faf8 ffff880fbeffc2f8 ffff88103004d0a0
[ 4078.398466] Call Trace:
[ 4078.408019]  [<ffffffffa06ed5ed>] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs]
[ 4078.415322]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
[ 4078.421844]  [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
[ 4078.429438]  [<ffffffffa06a5073>] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs]
[ 4078.437518]  [<ffffffffa06c600b>] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs]
[ 4078.445404]  [<ffffffffa06fc9e2>] ? add_delayed_data_ref+0xe2/0x150 [btrfs]
[ 4078.453194]  [<ffffffffa06fd629>] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs]
[ 4078.461663]  [<ffffffffa06cf3c0>] ? __set_extent_bit+0x4c0/0x5c0 [btrfs]
[ 4078.469161]  [<ffffffffa06b4a64>] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
[ 4078.478893]  [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs]
[ 4078.486388]  [<ffffffffa06bab57>] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs]
[ 4078.494561]  [<ffffffff8104cbc2>] ? process_one_work+0x142/0x3d0
[ 4078.501278]  [<ffffffff8104a507>] ? pwq_activate_delayed_work+0x27/0x40
[ 4078.508673]  [<ffffffff8104d729>] ? worker_thread+0x109/0x3b0
[ 4078.515098]  [<ffffffff8104d620>] ? manage_workers.isra.26+0x270/0x270
[ 4078.522396]  [<ffffffff81052b0f>] ? kthread+0xaf/0xc0
[ 4078.528032]  [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110
[ 4078.535325]  [<ffffffff814a9ac8>] ? ret_from_fork+0x58/0x90
[ 4078.541552]  [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110

[ 4079.355824] user-space-program D ffff88107fd30640     0 32020      1 0x00000000
[ 4079.363716]  ffff880eae8eba10 0000000000000086 0000000000009000 ffff880eae8ebfd8
[ 4079.372003]  ffff880eae8ebfd8 ffff881016c162c0 ffffffffa06ecb26 ffff88101a5d6138
[ 4079.380294]  ffff880fbed4b4c8 ffff880eae8eba38 ffff880fbed4b528 ffff881016c162c0
[ 4079.388586] Call Trace:
[ 4079.398134]  [<ffffffffa06ed595>] ? btrfs_tree_lock+0x85/0x2f0 [btrfs]
[ 4079.405431]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
[ 4079.411955]  [<ffffffffa06876fb>] ? btrfs_lock_root_node+0x2b/0x40 [btrfs]
[ 4079.419644]  [<ffffffffa068ce83>] ? btrfs_search_slot+0xa03/0xb10 [btrfs]
[ 4079.427237]  [<ffffffffa06aba52>] ? btrfs_buffer_uptodate+0x52/0x70 [btrfs]
[ 4079.435041]  [<ffffffffa0689b60>] ? generic_bin_search.constprop.38+0x80/0x190 [btrfs]
[ 4079.443897]  [<ffffffffa068ea44>] ? btrfs_insert_empty_items+0x74/0xd0 [btrfs]
[ 4079.451975]  [<ffffffffa072c443>] ? copy_items+0x128/0x850 [btrfs]
[ 4079.458890]  [<ffffffffa072da10>] ? btrfs_log_inode+0x629/0xbf3 [btrfs]
[ 4079.466292]  [<ffffffffa06f34a1>] ? btrfs_log_inode_parent+0xc61/0xf30 [btrfs]
[ 4079.474373]  [<ffffffffa06f45a9>] ? btrfs_log_dentry_safe+0x59/0x80 [btrfs]
[ 4079.482161]  [<ffffffffa06c298d>] ? btrfs_sync_file+0x20d/0x330 [btrfs]
[ 4079.489558]  [<ffffffff8112777c>] ? do_fsync+0x4c/0x80
[ 4079.495300]  [<ffffffff81127a0a>] ? SyS_fdatasync+0xa/0x10
[ 4079.501422]  [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b

[ 4079.508334] user-space-program D ffff88107fc30640     0 32021      1 0x00000004
[ 4079.516226]  ffff880eae8efbf8 0000000000000086 0000000000009000 ffff880eae8effd8
[ 4079.524513]  ffff880eae8effd8 ffff881030279610 ffffffffa06ecb26 ffff88101a5d6138
[ 4079.532802]  ffff880ebb671d88 ffff880eae8efc20 ffff880ebb671de8 ffff881030279610
[ 4079.541092] Call Trace:
[ 4079.550642]  [<ffffffffa06ed595>] ? btrfs_tree_lock+0x85/0x2f0 [btrfs]
[ 4079.557941]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
[ 4079.564463]  [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
[ 4079.572058]  [<ffffffffa06bb7d8>] ? btrfs_truncate_inode_items+0x168/0xb90 [btrfs]
[ 4079.580526]  [<ffffffffa06b04be>] ? join_transaction.isra.15+0x1e/0x3a0 [btrfs]
[ 4079.588701]  [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs]
[ 4079.596196]  [<ffffffffa0690ac6>] ? block_rsv_add_bytes+0x16/0x50 [btrfs]
[ 4079.603789]  [<ffffffffa06bc2e9>] ? btrfs_truncate+0xe9/0x2e0 [btrfs]
[ 4079.610994]  [<ffffffffa06bd00b>] ? btrfs_setattr+0x30b/0x410 [btrfs]
[ 4079.618197]  [<ffffffff81117c1c>] ? notify_change+0x1dc/0x680
[ 4079.624625]  [<ffffffff8123c8a4>] ? aa_path_perm+0xd4/0x160
[ 4079.630854]  [<ffffffff810f4fcb>] ? do_truncate+0x5b/0x90
[ 4079.636889]  [<ffffffff810f59fa>] ? do_sys_ftruncate.constprop.15+0x10a/0x160
[ 4079.644869]  [<ffffffff8110d87b>] ? SyS_fcntl+0x5b/0x570
[ 4079.650805]  [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b

[ 4080.410607] user-space-program D ffff88107fc70640     0 32028  12639 0x00000004
[ 4080.418489]  ffff880eaeccbbe0 0000000000000086 0000000000009000 ffff880eaeccbfd8
[ 4080.426778]  ffff880eaeccbfd8 ffff880f317ef1e0 ffffffffa06ecb26 ffff88101a5d6138
[ 4080.435067]  ffff880ef7e93928 ffff880f317ef1e0 ffff880eaeccbc08 ffff880f317ef1e0
[ 4080.443353] Call Trace:
[ 4080.452920]  [<ffffffffa06ed15d>] ? btrfs_tree_read_lock+0xdd/0x190 [btrfs]
[ 4080.460703]  [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30
[ 4080.467225]  [<ffffffffa06876bb>] ? btrfs_read_lock_root_node+0x2b/0x40 [btrfs]
[ 4080.475400]  [<ffffffffa068cc81>] ? btrfs_search_slot+0x801/0xb10 [btrfs]
[ 4080.482994]  [<ffffffffa06b2df0>] ? btrfs_clean_one_deleted_snapshot+0xe0/0xe0 [btrfs]
[ 4080.491857]  [<ffffffffa06a70a6>] ? btrfs_lookup_inode+0x26/0x90 [btrfs]
[ 4080.499353]  [<ffffffff810ec42f>] ? kmem_cache_alloc+0xaf/0xc0
[ 4080.505879]  [<ffffffffa06bd905>] ? btrfs_iget+0xd5/0x5d0 [btrfs]
[ 4080.512696]  [<ffffffffa06caf04>] ? btrfs_get_token_64+0x104/0x120 [btrfs]
[ 4080.520387]  [<ffffffffa06f341f>] ? btrfs_log_inode_parent+0xbdf/0xf30 [btrfs]
[ 4080.528469]  [<ffffffffa06f45a9>] ? btrfs_log_dentry_safe+0x59/0x80 [btrfs]
[ 4080.536258]  [<ffffffffa06c298d>] ? btrfs_sync_file+0x20d/0x330 [btrfs]
[ 4080.543657]  [<ffffffff8112777c>] ? do_fsync+0x4c/0x80
[ 4080.549399]  [<ffffffff81127a0a>] ? SyS_fdatasync+0xa/0x10
[ 4080.555534]  [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Fixes: 2f2ff0ee5e (Btrfs: fix metadata inconsistencies after directory fsync)
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Modified changelog for clarity and correctness]
2016-11-30 13:49:16 +00:00
Robbie Ko 2cdaf447e8 Btrfs: fix enospc in hole punching
The hole punching can result in adding new leafs (and as a consequence
new nodes) to the tree because when we find file extent items that span
beyond the hole range we may end up not deleting them (just adjusting
them, reducing their range by reducing their length or increasing their
offset field) and add new file extent items representing holes.

So after splitting a leaf (therefore creating a new one) to insert a new
file extent item representing a hole, a new node might be added to each
level of the tree in the worst case scenario (since there's a new key
and every parent node was full).

For example if a file has an extent item representing the range 0 to 64Mb
and we punch a hole in the range 1Mb to 20Mb, the existing extent item is
duplicated and one of the copies is adjusted to represent the range 0 to
1Mb, the other copy adjusted to represent the range 20Mb to 64Mb, and a
new file extent item representing a hole in the range 1Mb to 20Mb is
inserted.

Fix this by using btrfs_calc_trans_metadata_size() instead of
btrfs_calc_trunc_metadata_size(), so that enough metadata space is
reserved for the worst possible case.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Modified changelog for clarity and correctness]
2016-11-30 13:44:16 +00:00
Wang Xiaoguang 1d57ee9416 btrfs: improve delayed refs iterations
This issue was found when I tried to delete a heavily reflinked file,
when deleting such files, other transaction operation will not have a
chance to make progress, for example, start_transaction() will blocked
in wait_current_trans(root) for long time, sometimes it even triggers
soft lockups, and the time taken to delete such heavily reflinked file
is also very large, often hundreds of seconds. Using perf top, it reports
that:

PerfTop:    7416 irqs/sec  kernel:99.8%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
---------------------------------------------------------------------------------------
    84.37%  [btrfs]             [k] __btrfs_run_delayed_refs.constprop.80
    11.02%  [kernel]            [k] delay_tsc
     0.79%  [kernel]            [k] _raw_spin_unlock_irq
     0.78%  [kernel]            [k] _raw_spin_unlock_irqrestore
     0.45%  [kernel]            [k] do_raw_spin_lock
     0.18%  [kernel]            [k] __slab_alloc
It seems __btrfs_run_delayed_refs() took most cpu time, after some debug
work, I found it's select_delayed_ref() causing this issue, for a delayed
head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but
select_delayed_ref() will firstly try to iterate node list to find
BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and
waste much time.

To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head,
then in select_delayed_ref(), if this list is not empty, we can directly use
nodes in this list. With this patch, it just took about 10~15 seconds to
delte the same file. Now using perf top, it reports that:

PerfTop:    2734 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
----------------------------------------------------------------------------------------

    20.74%  [kernel]          [k] _raw_spin_unlock_irqrestore
    16.33%  [kernel]          [k] __slab_alloc
     5.41%  [kernel]          [k] lock_acquired
     4.42%  [kernel]          [k] lock_acquire
     4.05%  [kernel]          [k] lock_release
     3.37%  [kernel]          [k] _raw_spin_unlock_irq

For normal files, this patch also gives help, at least we do not need to
iterate whole list to found BTRFS_ADD_DELAYED_REF nodes.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:21 +01:00
Qu Wenruo 824d8dff88 btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
Commit 62b99540a1 (btrfs: relocation: Fix leaking qgroups numbers
on data extents) only fixes the problem partly.

The previous fix is to trace all new data extents at transaction commit
time when balance finishes.

However balance is not done in a large transaction, every path
replacement can happen in its own transaction.
This makes the fix useless if transaction commits during relocation.

For example:
relocate_block_group()
|-merge_reloc_roots()
|  |- merge_reloc_root()
|     |- btrfs_start_transaction()         <- Trans X
|     |- replace_path()                    <- Cause leak
|     |- btrfs_end_transaction_throttle()  <- Trans X commits here
|     |                                       Leak not fixed
|     |
|     |- btrfs_start_transaction()         <- Trans Y
|     |- replace_path()                    <- Cause leak
|     |- btrfs_end_transaction_throttle()  <- Trans Y ends
|                                             but not committed
|-btrfs_join_transaction()                 <- Still trans Y
|-qgroup_fix()                             <- Only fixes data leak
|                                             in trans Y
|-btrfs_commit_transaction()               <- Trans Y commits

In that case, qgroup fixup can only fix data leak in trans Y, data leak
in trans X is out of fix.

So the correct fix should happen in the same transaction of
replace_path().

This patch fixes it by tracing both subtrees of tree block swap, so it
can fix the problem and ensure all leaking and fix are in the same
transaction, so no leak again.

Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:21 +01:00
Qu Wenruo 33d1f05ccb btrfs: Export and move leaf/subtree qgroup helpers to qgroup.c
Move account_shared_subtree() to qgroup.c and rename it to
btrfs_qgroup_trace_subtree().

Do the same thing for account_leaf_items() and rename it to
btrfs_qgroup_trace_leaf_items().

Since all these functions are only for qgroup, move them to qgroup.c and
export them is more appropriate.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:21 +01:00
Qu Wenruo 50b3e040b7 btrfs: qgroup: Rename functions to make it follow reserve,trace,account steps
Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
btrfs_qgroup_trace_extent(_nolock), according to the new
reserve/trace/account naming schema.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:21 +01:00
Qu Wenruo 1d2beaa95b btrfs: qgroup: Add comments explaining how btrfs qgroup works
Add explaination how btrfs qgroups work.

Qgroup is split into 3 main phrases:
1) Reserve
   To ensure qgroup doesn't exceed its limit

2) Trace
   To info qgroup to trace which extent

3) Account
   Calculate qgroup number change for each traced extent.

This should save quite some time for new developers.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:21 +01:00
Christoph Hellwig 1621f8f3f9 btrfs: use bio_for_each_segment_all in __btrfsic_submit_bio
And remove the bogus check for a NULL return value from kmap, which
can't happen.  While we're at it: I don't think that kmapping up to 256
will work without deadlocks on highmem machines, a better idea would
be to use vm_map_ram to map all of them into a single virtual address
range.  Incidentally that would also simplify the code a lot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:20 +01:00
Christoph Hellwig 4989d277eb btrfs: refactor __btrfs_lookup_bio_sums to use bio_for_each_segment_all
Rework the loop a little bit to use the generic bio_for_each_segment_all
helper for iterating over the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:20 +01:00
Christoph Hellwig 2a4d0c9068 btrfs: calculate end of bio offset properly
Use the bvec offset and len members to prepare for multipage bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:20 +01:00
Christoph Hellwig 81381053d0 btrfs: use bi_size
Instead of using bi_vcnt to calculate it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:20 +01:00
Christoph Hellwig 6cd7ce4935 btrfs: don't access the bio directly in btrfs_csum_one_bio
Use bio_for_each_segment_all to iterate over the segments instead.
This requires a bit of reshuffling so that we only lookup up the ordered
item once inside the bio_for_each_segment_all loop.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:20 +01:00
Christoph Hellwig 6a2de22f6b btrfs: don't access the bio directly in the direct I/O code
Just use bio_for_each_segment_all to iterate over all segments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:20 +01:00
Christoph Hellwig 80ace3e403 btrfs: don't access the bio directly in the raid5/6 code
Just use bio_for_each_segment_all to iterate over all segments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:19 +01:00
Christoph Hellwig 974b1adc3b btrfs: use bio iterators for the decompression handlers
Pass the full bio to the decompression routines and use bio iterators
to iterate over the data in the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:19 +01:00
Jeff Mahoney 0c476a5d7f btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space
This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in
btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount
with qgroups enabled.

I was able to reproduce this by setting up a small (~500kb) quota limit
and writing a file one byte at a time until I hit the limit.  The warnings
would all hit on umount.

The root cause is that we would reserve a block-sized range in both
the reservation and the quota in btrfs_check_data_free_space, but if we
encountered a problem (like e.g. EDQUOT), we would only release the single
byte in the qgroup reservation.  That caused an iotree state split, which
increased the number of outstanding extents, in turn disallowing releasing
the metadata reservation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:19 +01:00
Josef Bacik f94480bd7b Btrfs: abort transaction if fill_holes() fails
At this point we will have dropped extent entries from the file, so if we fail
to insert the new hole entries then we are leaving the fs in a corrupt state
(albeit an easily fixed one).  Abort the transaciton if this happens so we can
avoid corrupting the fs.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:19 +01:00
Josef Bacik 62fe51c1d0 Btrfs: fix file extent corruption
In order to do hole punching we have a block reserve to hold the reservation we
need to drop the extents in our range.  Since we could end up dropping a lot of
extents we set rsv->failfast so we can just loop around again and drop the
remaining of the range.  Unfortunately we unconditionally fill the hole extents
in and start from the last extent we encountered, which we may or may not have
dropped.  So this can result in overlapping file extent entries, which can be
tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
fill_holes() after the search, or in btrfs_set_item_key_safe() in
btrfs_drop_extent() at a later time by an unrelated task.  Fix this by only
setting drop_end to the last extent we did actually drop.  This way our holes
are filled in properly for the range that we did drop, and the rest of the range
that remains to be dropped is actually dropped.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:19 +01:00
Jeff Mahoney d2fbb2b589 btrfs: increment ctx->pos for every emitted or skipped dirent in readdir
If we process the last item in the leaf and hit an I/O error while
reading the next leaf, we return -EIO without having adjusted the
position.  Since we have emitted dirents, getdents() will return
the byte count to the user instead of the error.  Subsequent callers
will emit the last successful dirent again, and return -EIO again,
with the same result.  Callers loop forever.

Instead, if we always increment ctx->pos after emitting or skipping
the dirent, we'll be sure that we won't hit the same one again.  When
we go to process the next leaf, we won't have emitted any dirents
and the -EIO will be returned to the user properly.  We also don't
need to track if we've emitted a dirent already or if we've changed
the position yet.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:19 +01:00
Jeff Mahoney c2951f32d3 btrfs: remove old tree_root dirent processing in btrfs_real_readdir()
Commit 3de4586c52 (Btrfs: Allow subvolumes and snapshots anywhere
in the directory tree) introduced the current system of placing
snapshots in the directory tree.  It also introduced the behavior of
creating the snapshot and then creating the directory entries for it.

We've kept this code around for compatibility reasons, but it turns
out that no file systems with the old tree_root based snapshots can
be mounted on newer (>= 2009) kernels anyway.  About a month after the
above commit, commit 2a7108ad89 (Btrfs: rev the disk format for the
inode compat and csum selection changes) landed, changing the superblock
magic number.

As a result, we know that we'll never encounter tree_root-based dirents
or have to deal with skipping our own snapshot dirents.  Since that
also means that we're now only iterating over DIR_INDEX items, which only
contain one directory entry per leaf item, we don't need to loop over
the leaf item contents anymore either.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:19 +01:00
Nick Terrell d1111a7547 btrfs: Call kunmap if zlib_inflateInit2 fails
If zlib_inflateInit2 fails, the input page is never unmapped.
Add a call to kunmap when it fails.

Signed-off-by: Nick Terrell <nickrterrell@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:18 +01:00
David Sterba ed0df618b1 btrfs: store and load values of stripes_min/stripes_max in balance status item
The balance status item contains currently known filter values, but the
stripes filter was unintentionally not among them. This would mean, that
interrupted and automatically restarted balance does not apply the
stripe filters.

Fixes: dee32d0ac3
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:18 +01:00
Christophe JAILLET 4d5106a126 btrfs: remove redundant check of btrfs_iget return value
'btrfs_iget()' can not return NULL, so this test can be removed.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:18 +01:00
Domagoj Tršan 0b5e3dafb6 btrfs: change btrfs_csum_final result param type to u8
csum member of struct btrfs_super_block has array type of u8. It makes
sense that function btrfs_csum_final should be also declared to accept
u8 *. I changed the declaration of method void btrfs_csum_final(u32 crc,
char *result); to void btrfs_csum_final(u32 crc, u8 *result);

Signed-off-by: Domagoj Tršan <domagoj.trsan@gmail.com>
[ changed cast to u8 at several call sites ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:18 +01:00
Liu Bo a23eaa875f Btrfs: adjust len of writes if following a preallocated extent
If we have

|0--hole--4095||4096--preallocate--12287|

instead of using preallocated space, a 8K direct write will just
create a new 8K extent and it'll end up with

|0--new extent--8191||8192--preallocate--12287|

It's because we find a hole em and then go to create a new 8K
extent directly without adjusting @len.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:18 +01:00
Shailendra Verma 7b9ea6279b btrfs: return early from failed memory allocations in ioctl handlers
There is no need to call kfree() if memdup_user() fails, as no memory
was allocated and the error in the error-valued pointer should be returned.

Signed-off-by: Shailendra Verma <shailendra.v@samsung.com>
[ edit subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:18 +01:00
David Sterba 58e8012cc1 btrfs: add optimized version of eb to eb copy
Using copy_extent_buffer is suitable for copying betwenn buffers from an
arbitrary offset and deals with page boundaries. This is not necessary
when doing a full extent_buffer-to-extent_buffer copy. We can utilize
the copy_page helper as well.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:17 +01:00
David Sterba b159fa2808 btrfs: remove constant parameter to memset_extent_buffer and rename it
The only memset we do is to 0, so sink the parameter to the function and
simplify all calls. Rename the function to reflect the behaviour.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:17 +01:00
David Sterba fba1acf9ff btrfs: use specialized page copying helpers in btrfs_clone_extent_buffer
The copy_page is usually optimized and can be faster than memcpy.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:17 +01:00
David Sterba d24ee97b96 btrfs: use new helpers to set uuids in eb
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:17 +01:00
David Sterba f157bf765b btrfs: introduce helpers for updating eb uuids
The fsid and chunk tree uuid are always located in the first page,
we don't need the to use write_extent_buffer.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:17 +01:00
David Sterba 2230adffe4 btrfs: delete unused member from superblock
__bdev' has never been used since
 0b86a832a1 (2008).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:17 +01:00
David Sterba 62d1f9fe97 btrfs: remove trivial helper btrfs_find_tree_block
During the time, the function has been shrunk to the point that it just
calls find_extent_buffer, just passing the parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:16 +01:00
David Sterba b917bb3878 btrfs: reada, remove pointless BUG_ON check for fs_info
We dereference fs_info several times, besides that post-mount functions
should never see a NULL fs_info.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:16 +01:00
David Sterba 8694bb6136 btrfs: reada, remove pointless BUG_ON in reada_find_extent
The lock is held, we make the same lookup that previously failed with
EEXIST and we don't insert NULL pointers.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:16 +01:00
David Sterba fc2e901f26 btrfs: reada, sink start parameter to btree_readahead_hook
Originally, the eb and start were passed separately in case eb is NULL.
Since the readahead has been refactored in 4.6, this is not true anymore
and we can get rid of the parameter.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:16 +01:00
David Sterba bcdc51b204 btrfs: reada, remove unused parameter from __readahead_hook
'start' is not used since "btrfs: reada: Pass reada_extent into
__readahead_hook directly" (6e39dbe8b9).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:16 +01:00
David Sterba 04998b3324 btrfs: reada, cleanup remove unneeded variable in __readahead_hook
We can't touch the eb directly in case the function is called with a
non-zero error, so we can read the eb level when needed.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:15 +01:00
David Sterba ef2fff64fd btrfs: rename helper macros for qgroup and aux data casts
The helpers are not meant to be generic, the name is misleading. Convert
them to static inlines for type checking.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:15 +01:00
David Sterba 5d9dbe617a btrfs: remove stale comment from btrfs_statfs
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:15 +01:00
David Sterba 926b92335a btrfs: remove unused headers, statfs.h
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Xiaoguang Wang 745699ef62 btrfs: remove useless comments
Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Adam Borowski ebce0e01b9 btrfs: make block group flags in balance printks human-readable
They're not even documented anywhere, letting users with no recourse but
to RTFS.  It's no big burden to output the bitfield as words.

Also, display unknown flags as hex.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Omar Sandoval 8e2bd3b7fa Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
My QEMU VM was seeing inexplicable I/O errors that I tracked down to
errors coming from the qcow2 virtual drive in the host system. The qcow2
file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
Every once in awhile, pread() or pwrite() would return EEXIST, which
makes no sense. This turned out to be a bug in btrfs_get_extent().

Commit 8dff9c8534 ("Btrfs: deal with duplciates during extent_map
insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
two threads race on adding the same extent map to an inode's extent map
tree. However, if the added em is merged with an adjacent em in the
extent tree, then we'll end up with an existing extent that is not
identical to but instead encompasses the extent we tried to add. When we
call merge_extent_mapping() to find the nonoverlapping part of the new
em, the arithmetic overflows because there is no such thing. We then end
up trying to add a bogus em to the em_tree, which results in a EEXIST
that can bubble all the way up to userspace.

Fix it by extending the identical extent map special case.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Wang Xiaoguang 939659dfd3 btrfs: add necessary comments about tickets_id
Tickets_id's name may result in some misunderstandings,  it just indicates
the next ticket will be handled and is not stored per ticket.

Fixes: ce12965 ("btrfs: introduce tickets_id to determine whether
asynchronous metadata reclaim work makes progress")
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Jan Kara c3b004460d quota: Remove dqonoff_mutex
The only places that were grabbing dqonoff_mutex are functions turning
quotas on and off and these are properly serialized using s_umount
semaphore. Remove dqonoff_mutex.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-30 08:38:07 +01:00
Jan Kara 5f530de63c ocfs2: Use s_umount for quota recovery protection
Currently we use dqonoff_mutex to serialize quota recovery protection
and turning of quotas on / off. Use s_umount semaphore instead.

Tested-by: Eric Ren <zren@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-30 08:37:21 +01:00
Jan Kara ee1ac541a2 quota: Remove dqonoff_mutex from dquot_scan_active()
All callers of dquot_scan_active() now hold s_umount so we can rely on
that lock to protect us against quota state changes.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-30 08:37:11 +01:00
Jan Kara 2a64f80ee3 ocfs2: Protect periodic quota syncing with s_umount semaphore
New quota locking rules will require s_umount semaphore for all quota
scanning functions. Add is for periodic quota syncing.

Tested-by: Eric Ren <zren@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-30 08:36:54 +01:00
Dave Chinner 5f1c6d28cf Merge branch 'iomap-4.10-directio' into for-next 2016-11-30 14:39:29 +11:00
Christoph Hellwig acdda3aae1 xfs: use iomap_dio_rw
Straight switch over to using iomap for direct I/O - we already have the
non-COW dio path in write_begin for DAX and files with extent size hints,
so nothing to add there.  The COW path is ported over from the old
get_blocks version and a bit of a mess, but I have some work in progress
to make it look more like the buffered I/O COW path.

This gets rid of xfs_get_blocks_direct and the last caller of
xfs_get_blocks with the create flag set, so all that code can be removed.

Last but not least I've removed a comment in xfs_filemap_fault that
refers to xfs_get_blocks entirely instead of updating it - while the
reference is correct, the whole DAX fault path looks different than
the non-DAX one, so it seems rather pointless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-30 14:37:15 +11:00
Christoph Hellwig ff6a9292e6 iomap: implement direct I/O
This adds a full fledget direct I/O implementation using the iomap
interface. Full fledged in this case means all features are supported:
AIO, vectored I/O, any iov_iter type including kernel pointers, bvecs
and pipes, support for hole filling and async apending writes.  It does
not mean supporting all the warts of the old generic code.  We expect
i_rwsem to be held over the duration of the call, and we expect to
maintain i_dio_count ourselves, and we pass on any kinds of mapping
to the file system for now.

The algorithm used is very simple: We use iomap_apply to iterate over
the range of the I/O, and then we use the new bio_iov_iter_get_pages
helper to lock down the user range for the size of the extent.
bio_iov_iter_get_pages can currently lock down twice as many pages as
the old direct I/O code did, which means that we will have a better
batch factor for everything but overwrites of badly fragmented files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-30 14:36:01 +11:00
Christoph Hellwig ec1b826097 fs: make sb_init_dio_done_wq available outside of direct-io.c
We want to use the per-sb completion workqueue from the new iomap
direct I/O code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-30 14:33:53 +11:00
Christoph Hellwig 6552321831 xfs: remove i_iolock and use i_rwsem in the VFS inode instead
This patch drops the XFS-own i_iolock and uses the VFS i_rwsem which
recently replaced i_mutex instead.  This means we only have to take
one lock instead of two in many fast path operations, and we can
also shrink the xfs_inode structure.  Thanks to the xfs_ilock family
there is very little churn, the only thing of note is that we need
to switch to use the lock_two_directory helper for taking the i_rwsem
on two inodes in a few places to make sure our lock order matches
the one used in the VFS.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-30 14:33:25 +11:00
Dave Chinner e3df41f978 Merge branch 'xfs-4.10-misc-fixes-2' into iomap-4.10-directio 2016-11-30 12:49:38 +11:00
Chao Yu 0002b61bda f2fs: return AOP_WRITEPAGE_ACTIVATE for writepage
We should use AOP_WRITEPAGE_ACTIVATE when we bypass writing pages.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-29 15:43:00 -08:00
Jaegeuk Kim 26787236b3 f2fs: do not activate auto_recovery for fallocated i_size
If a file needs to keep its i_size by fallocate, we need to turn off auto
recovery during roll-forward recovery.

This will resolve the below scenario.

1. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4096" -c "fsync"
2. xfs_io -f /mnt/f2fs/file -c "falloc -k 4096 4096" -c "fsync"
3. md5sum /mnt/f2fs/file;
4. godown /mnt/f2fs/
5. umount /mnt/f2fs/
6. mount -t f2fs /dev/sdx /mnt/f2fs
7. md5sum /mnt/f2fs/file

Reported-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-29 15:42:58 -08:00
Bart Van Assche b5a0623444 kernfs: Declare two local data structures static
This was spotted by the 'sparse' static checker.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-29 20:58:31 +01:00
Jan Kara d14e7683ec ext4: be more strict when verifying flags set via SETFLAGS ioctls
Currently we just silently ignore flags that we don't understand (or
that cannot be manipulated) through EXT4_IOC_SETFLAGS and
EXT4_IOC_FSSETXATTR ioctls. This makes it problematic for the unused
flags to be used in future (some app may be inadvertedly setting them
and we won't notice until the flag gets used). Also this is inconsistent
with other filesystems like XFS or BTRFS which return EOPNOTSUPP when
they see a flag they cannot set.

ext4 has the additional problem that there are flags which are returned
by EXT4_IOC_GETFLAGS ioctl but which cannot be modified via
EXT4_IOC_SETFLAGS. So we have to be careful to ignore value of these
flags and not fail the ioctl when they are set (as e.g. chattr(1) passes
flags returned from EXT4_IOC_GETFLAGS to EXT4_IOC_SETFLAGS without any
masking and thus we'd break this utility).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-29 11:18:39 -05:00
Jan Kara f8011d93a2 ext4: add EXT4_JOURNAL_DATA_FL and EXT4_EXTENTS_FL to modifiable mask
Add EXT4_JOURNAL_DATA_FL and EXT4_EXTENTS_FL to EXT4_FL_USER_MODIFIABLE
to recognize that they are modifiable by userspace. So far we got away
without having them there because ext4_ioctl_setflags() treats them in a
special way. But it was really confusing like that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-29 11:13:13 -05:00
Wang Xiaoguang dc1a90c6aa btrfs: cleanup: use already calculated value in btrfs_should_throttle_delayed_refs()
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29 14:10:38 +01:00
Christoph Hellwig cf8cddd38b btrfs: don't abuse REQ_OP_* flags for btrfs_map_block
btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations.  But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags.  This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29 14:10:38 +01:00
Miklos Szeredi c4fcfc1619 ovl: fix d_real() for stacked fs
Handling of recursion in d_real() is completely broken.  Recursion is only
done in the 'inode != NULL' case.  But when opening the file we have
'inode == NULL' hence d_real() will return an overlay dentry.  This won't
work since overlayfs doesn't define its own file operations, so all file
ops will fail.

Fix by doing the recursion first and the check against the inode second.

Bash script to reproduce the issue written by Quentin:

 - 8< - - - - - 8< - - - - - 8< - - - - - 8< - - - -
tmpdir=$(mktemp -d)
pushd ${tmpdir}

mkdir -p {upper,lower,work}
echo -n 'rocks' > lower/ksplice
mount -t overlay level_zero upper -o lowerdir=lower,upperdir=upper,workdir=work
cat upper/ksplice

tmpdir2=$(mktemp -d)
pushd ${tmpdir2}

mkdir -p {upper,work}
mount -t overlay level_one upper -o lowerdir=${tmpdir}/upper,upperdir=upper,workdir=work
ls -l upper/ksplice
cat upper/ksplice
 - 8< - - - - - 8< - - - - - 8< - - - - - 8< - - - - 

Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 2d902671ce ("vfs: merge .d_select_inode() into .d_real()")
Cc: <stable@vger.kernel.org> # v4.8+
2016-11-29 10:20:24 +01:00
Eryu Guan ae9ebe7c4e CIFS: iterate over posix acl xattr entry correctly in ACL_to_cifs_posix()
Commit 2211d5ba5c ("posix_acl: xattr representation cleanups")
removes the typedefs and the zero-length a_entries array in struct
posix_acl_xattr_header, and uses bare struct posix_acl_xattr_header
and struct posix_acl_xattr_entry directly.

But it failed to iterate over posix acl slots when converting posix
acls to CIFS format, which results in several test failures in
xfstests (generic/053 generic/105) when testing against a samba v1
server, starting from v4.9-rc1 kernel. e.g.

  [root@localhost xfstests]# diff -u tests/generic/105.out /root/xfstests/results//generic/105.out.bad
  --- tests/generic/105.out       2016-09-19 16:33:28.577962575 +0800
  +++ /root/xfstests/results//generic/105.out.bad 2016-10-22 15:41:15.201931110 +0800
  @@ -1,3 +1,4 @@
   QA output created by 105
   -rw-r--r-- root
  +setfacl: subdir: Invalid argument
   -rw-r--r-- root

Fix it by introducing a new "ace" var, like what
cifs_copy_posix_acl() does, and iterating posix acl xattr entries
over it in the for loop.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-11-28 23:08:53 -06:00
Sachin Prabhu b8c600120f Call echo service immediately after socket reconnect
Commit 4fcd1813e6 ("Fix reconnect to not defer smb3 session reconnect
long after socket reconnect") changes the behaviour of the SMB2 echo
service and causes it to renegotiate after a socket reconnect. However
under default settings, the echo service could take up to 120 seconds to
be scheduled.

The patch forces the echo service to be called immediately resulting a
negotiate call being made immediately on reconnect.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-11-28 23:08:52 -06:00
Sachin Prabhu 5f4b55699a CIFS: Fix BUG() in calc_seckey()
Andy Lutromirski's new virtually mapped kernel stack allocations moves
kernel stacks the vmalloc area. This triggers the bug
 kernel BUG at ./include/linux/scatterlist.h:140!
at calc_seckey()->sg_init()

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2016-11-28 23:08:52 -06:00
Jaegeuk Kim 8508e44ae9 f2fs: fix to determine start_cp_addr by sbi->cur_cp_pack
We don't guarantee cp_addr is fixed by cp_version.
This is to sync with f2fs-tools.

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-28 13:39:58 -08:00
Dave Chinner b7b26110ed Merge branch 'xfs-4.10-misc-fixes-2' into for-next 2016-11-28 15:06:03 +11:00
Brian Foster f782088c9e xfs: pass post-eof speculative prealloc blocks to bmapi
xfs_file_iomap_begin_delay() implements post-eof speculative
preallocation by extending the block count of the requested delayed
allocation. Now that xfs_bmapi_reserve_delalloc() has been updated to
handle prealloc blocks separately and tag the inode, update
xfs_file_iomap_begin_delay() to use the new parameter and rely on the
former to tag the inode.

Note that this patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Brian Foster 0260d8ff5f xfs: clean up cow fork reservation and tag inodes correctly
COW fork reservation is implemented via delayed allocation. The code is
modeled after the traditional delalloc allocation code, but is slightly
different in terms of how preallocation occurs. Rather than post-eof
speculative preallocation, COW fork preallocation is implemented via a
COW extent size hint that is designed to minimize fragmentation as a
reflinked file is split over time.

xfs_reflink_reserve_cow() still uses logic that is oriented towards
dealing with post-eof speculative preallocation, however, and is stale
or not necessarily correct. First, the EOF alignment to the COW extent
size hint is implemented in xfs_bmapi_reserve_delalloc() (which does so
correctly by aligning the start and end offsets) and so is not necessary
in xfs_reflink_reserve_cow(). The backoff and retry logic on ENOSPC is
also ineffective for the same reason, as xfs_bmapi_reserve_delalloc()
will simply perform the same allocation request on the retry. Finally,
since the COW extent size hint aligns the start and end offset of the
range to allocate, the end_fsb != orig_end_fsb logic is not sufficient.
Indeed, if a write request happens to end on an aligned offset, it is
possible that we do not tag the inode for COW preallocation even though
xfs_bmapi_reserve_delalloc() may have preallocated at the start offset.

Kill the unnecessary, duplicate code in xfs_reflink_reserve_cow().
Remove the inode tag logic as well since xfs_bmapi_reserve_delalloc()
has been updated to tag the inode correctly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Brian Foster 974ae922ef xfs: track preallocation separately in xfs_bmapi_reserve_delalloc()
Speculative preallocation is currently processed entirely by the callers
of xfs_bmapi_reserve_delalloc(). The caller determines how much
preallocation to include, adjusts the extent length and passes down the
resulting request.

While this works fine for post-eof speculative preallocation, it is not
as reliable for COW fork preallocation. COW fork preallocation is
implemented via the cowextszhint, which aligns the start offset as well
as the length of the extent. Further, it is difficult for the caller to
accurately identify when preallocation occurs because the returned
extent could have been merged with neighboring extents in the fork.

To simplify this situation and facilitate further COW fork preallocation
enhancements, update xfs_bmapi_reserve_delalloc() to take a separate
preallocation parameter to incorporate into the allocation request. The
preallocation blocks value is tacked onto the end of the request and
adjusted to accommodate neighboring extents and extent size limits.
Since xfs_bmapi_reserve_delalloc() now knows precisely how much
preallocation was included in the allocation, it can also tag the inodes
appropriately to support preallocation reclaim.

Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to
use the preallocation mechanism. This patch should not change behavior
outside of correctly tagging reflink inodes when start offset
preallocation occurs (which the caller does not handle correctly).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Darrick J. Wong fba3e594ef xfs: always succeed when deduping zero bytes
It turns out that btrfs and xfs had differing interpretations of what
to do when the dedupe length is zero.  Change xfs to follow btrfs'
semantics so that the userland interface is consistent.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Bhumika Goyal cf7841c12d fs: xfs: libxfs: constify xfs_nameops structures
Declare the structure xfs_nameops as const as it is only stored in the
m_dirnameops field of a xfs_mount structure. This field is of type
const struct xfs_nameops *, so xfs_nameops structures having this
property can be declared as const.
Done using Coccinelle:
@r1 disable optional_qualifier @
identifier i;
position p;
@@
static struct xfs_nameops i@p = {...};

@ok1@
identifier r1.i;
position p;
struct xfs_mount mp;
@@
mp.m_dirnameops=&i@p

@bad@
position p!={r1.p,ok1.p};
identifier r1.i;
@@
i@p

@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
static
+const
struct xfs_nameops i={...};

@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
+const
struct xfs_nameops i;

File size before:
   text	   data	    bss	    dec	    hex	filename
   5302	     85	      0	   5387	   150b	fs/xfs/libxfs/xfs_dir2.o

File size after:
   text	   data	    bss	    dec	    hex	filename
   5318	     69	      0	   5387	   150b	fs/xfs/libxfs/xfs_dir2.o

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Bhumika Goyal bb6e0ebed7 fs: xfs: xfs_icreate_item: constify xfs_item_ops structure
Declare the structure xfs_item_ops as const as it is only passed as an
argument to the function xfs_log_item_init. As this argument is of type
const struct xfs_item_ops *, so xfs_item_ops structures having this
property can be declared as const.
Done using Coccinelle:

@r1 disable optional_qualifier @
identifier i;
position p;
@@
static struct xfs_item_ops i@p = {...};

@ok1@
identifier r1.i;
position p;
expression e1,e2,e3;
@@
xfs_log_item_init(e1,e2,e3,&i@p)

@bad@
position p!={r1.p,ok1.p};
identifier r1.i;
@@
i@p

@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
static
+const
struct xfs_item_ops i={...};

@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
+const
struct xfs_item_ops i;

File size before:
   text	   data	    bss	    dec	    hex	filename
    737	     64	      8	    809	    329	fs/xfs/xfs_icreate_item.o

File size after:
   text	   data	    bss	    dec	    hex	filename
    801	      0	      8	    809	    329	fs/xfs/xfs_icreate_item.o

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Darrick J. Wong fd26a88093 xfs: factor rmap btree size into the indlen calculations
When we're estimating the amount of space it's going to take to satisfy
a delalloc reservation, we need to include the space that we might need
to grow the rmapbt.  This helps us to avoid running out of space later
when _iomap_write_allocate needs more space than we reserved.  Eryu Guan
observed this happening on generic/224 when sunit/swidth were set.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
Eric Sandeen 1247ec4c5f xfs: add XBF_XBF_NO_IOACCT to buf trace output
When XBF_NO_IOACCT got added, it missed the translation
in XFS_BUF_FLAGS, so we see "0x8" in trace output rather
than the flag name.  Fix it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-28 14:57:42 +11:00
David S. Miller 0b42f25d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.

Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26 23:42:21 -05:00
Al Viro 8e54cadab4 fix default_file_splice_read()
Botched calculation of number of pages.  As the result,
we were dropping pieces when doing splice to pipe from
e.g. 9p.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-26 20:05:42 -05:00
Eric Sandeen 9060dd2c50 ext4: fix mmp use after free during unmount
In ext4_put_super, we call brelse on the buffer head containing
the ext4 superblock, but then try to use it when we stop the
mmp thread, because when the thread shuts down it does:

write_mmp_block
  ext4_mmp_csum_set
    ext4_has_metadata_csum
      WARN_ON_ONCE(ext4_has_feature_metadata_csum(sb)...)

which reaches into sb->s_fs_info->s_es->s_feature_ro_compat,
which lives in the superblock buffer s_sbh which we just released.

Fix this by moving the brelse down to a point where we are no
longer using it.

Reported-by: Wang Shu <shuwang@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-11-26 14:24:51 -05:00
Arnd Bergmann 19c526515f f2fs: fix 32-bit build
The addition of multiple-device support broke CONFIG_BLK_DEV_ZONED
on 32-bit machines because of a 64-bit division:

fs/f2fs/f2fs.o: In function `__issue_discard_async':
extent_cache.c:(.text.__issue_discard_async+0xd4): undefined reference to `__aeabi_uldivmod'

Fortunately, bdev_zone_size() is guaranteed to return a power-of-two
number, so we can replace the % operator with a cheaper bit mask.

Fixes: 792b84b74b54 ("f2fs: support multiple devices")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:09 -08:00
Nicolai Stange 05e6ea2685 f2fs: set ->owner for debugfs status file's file_operations
The struct file_operations instance serving the f2fs/status debugfs file
lacks an initialization of its ->owner.

This means that although that file might have been opened, the f2fs module
can still get removed. Any further operation on that opened file, releasing
included,  will cause accesses to unmapped memory.

Indeed, Mike Marshall reported the following:

  BUG: unable to handle kernel paging request at ffffffffa0307430
  IP: [<ffffffff8132a224>] full_proxy_release+0x24/0x90
  <...>
  Call Trace:
   [] __fput+0xdf/0x1d0
   [] ____fput+0xe/0x10
   [] task_work_run+0x8e/0xc0
   [] do_exit+0x2ae/0xae0
   [] ? __audit_syscall_entry+0xae/0x100
   [] ? syscall_trace_enter+0x1ca/0x310
   [] do_group_exit+0x44/0xc0
   [] SyS_exit_group+0x14/0x20
   [] do_syscall_64+0x61/0x150
   [] entry_SYSCALL64_slow_path+0x25/0x25
  <...>
  ---[ end trace f22ae883fa3ea6b8 ]---
  Fixing recursive fault but reboot is needed!

Fix this by initializing the f2fs/status file_operations' ->owner with
THIS_MODULE.

This will allow debugfs to grab a reference to the f2fs module upon any
open on that file, thus preventing it from getting removed.

Fixes: 902829aa0b ("f2fs: move proc files to debugfs")
Reported-by: Mike Marshall <hubcap@omnibond.com>
Reported-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:08 -08:00
Chao Yu b08b12d2dd f2fs: fix incorrect free inode count in ->statfs
While calculating inode count that we can create at most in the left space,
we should consider space which data/node blocks occupied, since we create
data/node mixly in main area. So fix the wrong calculation in ->statfs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:07 -08:00
Geliang Tang b4ceec2921 f2fs: drop duplicate header timer.h
Drop duplicate header timer.h from segment.c.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:06 -08:00
Jaegeuk Kim 97dd26ad83 f2fs: fix wrong AUTO_RECOVER condition
If i_size is not aligned to the f2fs's block size, we should not skip inode
update during fsync.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:05 -08:00
Jaegeuk Kim 3a3a5ead7b f2fs: do not recover i_size if it's valid
If i_size is already valid during roll_forward recovery, we should not update
it according to the block alignment.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:04 -08:00
Chao Yu 281518c694 f2fs: fix fdatasync
For below two cases, we can't guarantee data consistence:

a)
1. xfs_io "pwrite 0 4195328" "fsync"
2. xfs_io "pwrite 4195328 1024" "fdatasync"
3. godown
4. umount & mount
--> isize we updated before fdatasync won't be recovered

b)
1. xfs_io "pwrite -S 0xcc 0 4202496" "fsync"
2. xfs_io "fpunch 4194304 4096" "fdatasync"
3. godown
4. umount & mount
--> dnode we punched before fdatasync won't be recovered

The reason is that normally fdatasync won't be aware of modification
of metadata in file, e.g. isize changing, dnode updating, so in ->fsync
we will skip flushing node pages for above cases, result in making
fdatasynced file being lost during recovery.

Currently we have introduced DIRTY_META global list in sbi for tracking
dirty inode selectively, so in fdatasync we can choose to flush nodes
depend on dirty state of current inode in the list.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:03 -08:00
Chao Yu 04d47e6738 f2fs: fix to account total free nid correctly
Thread A		Thread B		Thread C
- f2fs_create
 - f2fs_new_inode
  - f2fs_lock_op
   - alloc_nid
    alloc last nid
  - f2fs_unlock_op
			- f2fs_create
			 - f2fs_new_inode
			  - f2fs_lock_op
			   - alloc_nid
			    as node count still not
			    be increased, we will
			    loop in alloc_nid
						- f2fs_write_node_pages
						 - f2fs_balance_fs_bg
						  - f2fs_sync_fs
						   - write_checkpoint
						    - block_operations
						     - f2fs_lock_all
 - f2fs_lock_op

While creating new inode, we do not allocate and account nid atomically,
so that when there is almost no free nids left, we may encounter deadloop
like above stack.

In order to avoid that, reuse nm_i::available_nids for accounting free nids
and make nid allocation and counting being atomical during node creation.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:01 -08:00
Yunlei He d40a43af0a f2fs: fix an infinite loop when flush nodes in cp
Thread A			Thread B

- write_checkpoint
 - block_operations
   -blk_start_plug
    -sync_node_pages		- f2fs_do_sync_file
				 - fsync_node_pages
				  - f2fs_wait_on_page_writeback

Thread A wait for global F2FS_DIRTY_NODES decreased to zero,
it start a plug list, some requests have been added to this list.
Thread B lock one dirty node page, and wait this page write back.
But this page has been in plug list of thread A with PG_writeback flag.
Thread A keep on running and its plug list has no chance to finish,
so it seems a deadlock between cp and fsync path.

This patch add a wait on page write back before set node page dirty
to avoid this problem.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Pengyang Hou <houpengyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:16:00 -08:00
Chao Yu 36951b38d1 f2fs: don't wait writeback for datas during checkpoint
Normally, while committing checkpoint, we will wait on all pages to be
writebacked no matter the page is data or metadata, so in scenario where
there are lots of data IO being submitted with metadata, we may suffer
long latency for waiting writeback during checkpoint.

Indeed, we only care about persistence for pages with metadata, but not
pages with data, as file system consistent are only related to metadate,
so in order to avoid encountering long latency in above scenario, let's
recognize and reference metadata in submitted IOs, wait writeback only
for metadatas.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:15:59 -08:00
Jaegeuk Kim c79b7ff1d3 f2fs: fix wrong written_valid_blocks counting
Previously, written_valid_blocks was got by ckpt->valid_block_count. But if
the last checkpoint has some NEW_ADDR due to power-cut, we can get wrong value.
Fix it to get the number from actual written block count from sit entries.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:15:58 -08:00
Jaegeuk Kim 7702bdbe50 f2fs: avoid BG_GC in f2fs_balance_fs
If many threads hit has_not_enough_free_secs() in f2fs_balance_fs() at the same
time, all the threads would do FG_GC or BG_GC.
In this critical path, we totally don't need to do BG_GC at all.
Let's avoid that.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:15:57 -08:00
Jaegeuk Kim c040ff9d69 f2fs: fix redundant block allocation
In direct_IO path of f2fs_file_write_iter(),
1. f2fs_preallocate_blocks(F2FS_GET_BLOCK_PRE_DIO)
   -> allocate LBA X
2. f2fs_direct_IO()
   -> return 0;

Then,
f2fs_write_data_page() will allocate another LBA X+1.

This makes EIO triggered by HM-SMR.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:15:50 -08:00
Jaegeuk Kim a7de608691 f2fs: use err for f2fs_preallocate_blocks
This patch has no functional change.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:15:14 -08:00
Jaegeuk Kim 3c62be17d4 f2fs: support multiple devices
This patch implements multiple devices support for f2fs.
Given multiple devices by mkfs.f2fs, f2fs shows them entirely as one big
volume under one f2fs instance.

Internal block management is very simple, but we will modify block allocation
and background GC policy to boost IO speed by exploiting them accoording to
each device speed.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:15:13 -08:00
Jaegeuk Kim e57e9ae5b1 f2fs: allow dio read for LFS mode
We can allow dio reads for LFS mode, while doing buffered writes for dio writes.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:15:12 -08:00
Jaegeuk Kim 6ae1be13e8 f2fs: revert segment allocation for direct IO
Now we don't need to be too much careful about storage alignment for dio, since
its speed becomes quite fast and we'd better avoid any misalignment first.

Revert: 38aa0889b2 (f2fs: align direct_io'ed data to section)

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-25 10:15:02 -08:00
Filipe Manana 8d9eddad19 Btrfs: fix qgroup rescan worker initialization
We were setting the qgroup_rescan_running flag to true only after the
rescan worker started (which is a task run by a queue). So if a user
space task starts a rescan and immediately after asks to wait for the
rescan worker to finish, this second call might happen before the rescan
worker task starts running, in which case the rescan wait ioctl returns
immediatley, not waiting for the rescan worker to finish.

This was making the fstest btrfs/022 fail very often.

Fixes: d2c609b834 (btrfs: properly track when rescan worker is running)
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2016-11-25 18:06:50 +00:00
Jan Kara 9d1ccbe70e quota: Use s_umount protection for quota operations
Writeback quota is protected by s_umount semaphore held for reading
because every writeback must be protected by that lock (grabbed either
by the generic writeback code or by quotactl handler). Getting next
available ID in quota file, querying quota state, setting quota
information, getting quota format are all quotactl operations protected
by s_umount semaphore held for reading grabbed in quotactl handler.

This also fixes lockdep splat about possible deadlock during filesystem
freezing where sync_filesystem() is called with page-faults already
blocked but sync_filesystem() calls into dquot_writeback_dquots() which
grabs dqonoff_mutex which ranks above i_mutex (vfs_load_quota_inode()
grabs i_mutex under dqonoff_mutex) which clearly ranks below page fault
freeze protection (e.g. via mmap_sem dependencies). The reported problem
is not a real deadlock possibility since during quota on we check
whether filesystem freezing is not in progress but still it is good to
have this fixed.

Reported-by: Ted Tso <tytso@mit.edu>
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-24 15:27:16 +01:00
Jan Kara 7d6cd73d33 quota: Hold s_umount in exclusive mode when enabling / disabling quotas
Currently we hold s_umount semaphore only in shared mode when enabling
or disabling quotas and use dqonoff_mutex for serializing quota state
changes on a filesystem and also quota state changes with other places
depending on current quota state. Using dedicated mutex for this causes
possible deadlocks during filesystem freezing (see following commit for
details) so we transition to using s_umount semaphore for the necessary
synchronization whose lock ordering is properly handled by the
filesystem freezing code. As a start grab s_umount in exclusive mode
when enabling / disabling quotas.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-24 15:26:53 +01:00
Dave Chinner ed24bee6f2 Merge branch 'xfs-4.10-extent-lookup' into for-next 2016-11-24 11:41:59 +11:00
Christoph Hellwig 0e8d630ba0 xfs: remove NULLEXTNUM
We only ever set a field to this constant for an impossible to reach
error case in xfs_bmap_search_extents.  That functions has been removed,
so we can remove the constant as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:40:32 +11:00
Christoph Hellwig 6edc977f77 xfs: remove xfs_bmap_search_extents
Now that all users are gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:40:14 +11:00
Christoph Hellwig 4ab8671c19 xfs: use new extent lookup helpers in xfs_reflink_end_cow
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:50 +11:00
Christoph Hellwig df5ab1b5a8 xfs: use new extent lookup helpers in xfs_reflink_cancel_cow_blocks
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:50 +11:00
Christoph Hellwig 86f12ab05f xfs: use new extent lookup helpers in xfs_reflink_trim_irec_to_next_cow
And remove the unused return value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:50 +11:00
Christoph Hellwig 092d5d9d58 xfs: cleanup xfs_reflink_find_cow_mapping
Use xfs_iext_lookup_extent to look up the extent, drop a useless check,
drop a unneeded return value and clean up the general style a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:49 +11:00
Christoph Hellwig 2755fc4438 xfs: use new extent lookup helpers in __xfs_reflink_reserve_cow
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:49 +11:00
Christoph Hellwig 656152e552 xfs: use new extent lookup helpers xfs_file_iomap_begin_delay
And only lookup the previous extent inside xfs_iomap_prealloc_size
if we actually need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:44 +11:00
Christoph Hellwig 65c5f41978 xfs: remove prev argument to xfs_bmapi_reserve_delalloc
We can easily lookup the previous extent for the cases where we need it,
which saves the callers from looking it up for us later in the series.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:44 +11:00
Christoph Hellwig 7efc794561 xfs: use new extent lookup helpers in __xfs_bunmapi
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:44 +11:00
Christoph Hellwig 2d58f6ef79 xfs: use new extent lookup helpers in xfs_bmapi_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:43 +11:00
Christoph Hellwig 334f3423d6 xfs: use new extent lookup helpers in xfs_bmapi_read
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:43 +11:00
Christoph Hellwig 86685f7ba5 xfs: cleanup xfs_bmap_last_before
Rewrite the function using xfs_iext_lookup_extent and xfs_iext_get_extent,
and massage the flow into something easily understandable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:38 +11:00
Christoph Hellwig 93533c7855 xfs: new inode extent list lookup helpers
xfs_iext_lookup_extent looks up a single extent at the passed in offset,
and returns the extent covering the area, or the one behind it in case
of a hole, as well as the index of the returned extent in arguments,
as well as a simple bool as return value that is set to false if no
extent could be found because the offset is behind EOF.  It is a simpler
replacement for xfs_bmap_search_extent that leaves looking up the rarely
needed previous extent to the caller and has a nicer calling convention.

xfs_iext_get_extent is a helper for iterating over the extent list,
it takes an extent index as input, and returns the extent at that index
in it's expanded form in an argument if it exists.  The actual return
value is a bool whether the index is valid or not.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-24 11:39:32 +11:00
James Morris 0821e30cd2 Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/selinux into next 2016-11-24 11:21:25 +11:00
Linus Torvalds 10b9dd5686 NFS client bugfixes for Linux 4.9 part 4
Stable Bugfixes:
 - Hide array-bounds warning
 
 Bugfixes:
 - Keep a reference on lock states while checking
 - Handle NFS4ERR_OLD_STATEID in nfs4_reclaim_open_state
 - Don't call close if the open stateid has already been cleared
 - Fix CLOSE rases with OPEN
 - Fix a regression in DELEGRETURN
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYNhGKAAoJENfLVL+wpUDrGgEP/0okAGQfb7yHVNYjDpMmVh7u
 6T1Vh+xbIMsGmuLXPOJH3FRFDnPWCrZO77K+l1y5oMl1fW/hA5h07yt0g0wT94+u
 if1wunZ6bak6KFeevo4xphpqXCjLhwpe801SbBcJPY6D6YxMckobHR8NcuzTjFab
 Kc9OAjnpIzS2lJBThaeyavGGnrlhNvH+Le+zEgMv/bSBTiPSymLlpj12a88cuHRF
 hx2vBao3UuR1vaTaZ5Zdp954DtNXNo7Pikye11cvVJVhesNwpZe37SszcRZ1U6P4
 o4LnYf/ImkjDrcRyvFRxc6bu/Q1jLBuAYZjB4oMcx7YQW8rJqcS/UkEpGzOfER3i
 3NQXFqacIAGhULfJxF8W0vPGzKM74koa0HRRI34C10qZAPe06Iy8slkdIjM4t2IX
 ASJI+uyrbIqTQ/x3FObWlqvw4TCOntYFpOsHF6G8M0uj+tX+3iXjpmwDGsJDVyFE
 y+egnnVn9LmGGfg1SBU2VBKL2945e/VAWfHtDGmJYgEwNDiqtutoIMDn+szESX60
 yGLPJdIL3O7pTWmDXdSSpUJZ+wqa90rrU34kGmk3njydaNHeA1SEhcNTi2Ha5ALb
 NcVD0omnhrZUFE5MRY0OtmHRwhsaa9CYlMyqzb5SEeb46Z3KUm1KX9qEy4I4rZHG
 C4MlTY5AScHqqNXmT8Pu
 =YhQv
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-4' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Most of these fix regressions or races, but there is one patch for
  stable that Arnd sent me

  Stable bugfix:
   - Hide array-bounds warning

  Bugfixes:
   - Keep a reference on lock states while checking
   - Handle NFS4ERR_OLD_STATEID in nfs4_reclaim_open_state
   - Don't call close if the open stateid has already been cleared
   - Fix CLOSE rases with OPEN
   - Fix a regression in DELEGRETURN"

* tag 'nfs-for-4.9-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.x: hide array-bounds warning
  NFSv4.1: Keep a reference on lock states while checking
  NFSv4.1: Handle NFS4ERR_OLD_STATEID in nfs4_reclaim_open_state
  NFSv4: Don't call close if the open stateid has already been cleared
  NFSv4: Fix CLOSE races with OPEN
  NFSv4.1: Fix a regression in DELEGRETURN
2016-11-23 14:43:40 -08:00
Filipe Manana f177d73949 Btrfs: fix emptiness check for dirtied extent buffers at check_leaf()
We can not simply use the owner field from an extent buffer's header to
get the id of the respective tree when the extent buffer is from a
relocation tree. When we create the root for a relocation tree we leave
(on purpose) the owner field with the same value as the subvolume's tree
root (we do this at ctree.c:btrfs_copy_root()). So we must ignore extent
buffers from relocation trees, which have the BTRFS_HEADER_FLAG_RELOC
flag set, because otherwise we will always consider the extent buffer
as not being the root of the tree (the root of original subvolume tree
is always different from the root of the respective relocation tree).

This lead to assertion failures when running with the integrity checker
enabled (CONFIG_BTRFS_FS_CHECK_INTEGRITY=y) such as the following:

[  643.393409] BTRFS critical (device sdg): corrupt leaf, non-root leaf's nritems is 0: block=38506496, root=260, slot=0
[  643.397609] BTRFS info (device sdg): leaf 38506496 total ptrs 0 free space 3995
[  643.407075] assertion failed: 0, file: fs/btrfs/disk-io.c, line: 4078
[  643.408425] ------------[ cut here ]------------
[  643.409112] kernel BUG at fs/btrfs/ctree.h:3419!
[  643.409773] invalid opcode: 0000 [#1] PREEMPT SMP
[  643.410447] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq ppdev psmouse acpi_cpufreq parport_pc evdev parport tpm_tis tpm_tis_core pcspkr serio_raw i2c_piix4 sg tpm i2c_core button processor loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring scsi_mod virtio e1000 floppy
[  643.414356] CPU: 11 PID: 32726 Comm: btrfs Not tainted 4.8.0-rc8-btrfs-next-35+ #1
[  643.414356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  643.414356] task: ffff880145e95b00 task.stack: ffff88014826c000
[  643.414356] RIP: 0010:[<ffffffffa0352759>]  [<ffffffffa0352759>] assfail.constprop.41+0x1c/0x1e [btrfs]
[  643.414356] RSP: 0018:ffff88014826fa28  EFLAGS: 00010292
[  643.414356] RAX: 0000000000000039 RBX: ffff88014e2d7c38 RCX: 0000000000000001
[  643.414356] RDX: ffff88023f4d2f58 RSI: ffffffff81806c63 RDI: 00000000ffffffff
[  643.414356] RBP: ffff88014826fa28 R08: 0000000000000001 R09: 0000000000000000
[  643.414356] R10: ffff88014826f918 R11: ffffffff82f3c5ed R12: ffff880172910000
[  643.414356] R13: ffff880233992230 R14: ffff8801a68a3310 R15: fffffffffffffff8
[  643.414356] FS:  00007f9ca305e8c0(0000) GS:ffff88023f4c0000(0000) knlGS:0000000000000000
[  643.414356] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  643.414356] CR2: 00007f9ca3071000 CR3: 000000015d01b000 CR4: 00000000000006e0
[  643.414356] Stack:
[  643.414356]  ffff88014826fa50 ffffffffa02d655a 000000000000000a ffff88014e2d7c38
[  643.414356]  0000000000000000 ffff88014826faa8 ffffffffa02b72f3 ffff88014826fab8
[  643.414356]  00ffffffa03228e4 0000000000000000 0000000000000000 ffff8801bbd4e000
[  643.414356] Call Trace:
[  643.414356]  [<ffffffffa02d655a>] btrfs_mark_buffer_dirty+0xdf/0xe5 [btrfs]
[  643.414356]  [<ffffffffa02b72f3>] btrfs_copy_root+0x18a/0x1d1 [btrfs]
[  643.414356]  [<ffffffffa0322921>] create_reloc_root+0x72/0x1ba [btrfs]
[  643.414356]  [<ffffffffa03267c2>] btrfs_init_reloc_root+0x7b/0xa7 [btrfs]
[  643.414356]  [<ffffffffa02d9e44>] record_root_in_trans+0xdf/0xed [btrfs]
[  643.414356]  [<ffffffffa02db04e>] btrfs_record_root_in_trans+0x50/0x6a [btrfs]
[  643.414356]  [<ffffffffa030ad2b>] create_subvol+0x472/0x773 [btrfs]
[  643.414356]  [<ffffffffa030b406>] btrfs_mksubvol+0x3da/0x463 [btrfs]
[  643.414356]  [<ffffffffa030b406>] ? btrfs_mksubvol+0x3da/0x463 [btrfs]
[  643.414356]  [<ffffffff810781ac>] ? preempt_count_add+0x65/0x68
[  643.414356]  [<ffffffff811a6e97>] ? __mnt_want_write+0x62/0x77
[  643.414356]  [<ffffffffa030b55d>] btrfs_ioctl_snap_create_transid+0xce/0x187 [btrfs]
[  643.414356]  [<ffffffffa030b67d>] btrfs_ioctl_snap_create+0x67/0x81 [btrfs]
[  643.414356]  [<ffffffffa030ecfd>] btrfs_ioctl+0x508/0x20dd [btrfs]
[  643.414356]  [<ffffffff81293e39>] ? __this_cpu_preempt_check+0x13/0x15
[  643.414356]  [<ffffffff81155eca>] ? handle_mm_fault+0x976/0x9ab
[  643.414356]  [<ffffffff81091300>] ? arch_local_irq_save+0x9/0xc
[  643.414356]  [<ffffffff8119a2b0>] vfs_ioctl+0x18/0x34
[  643.414356]  [<ffffffff8119a8e8>] do_vfs_ioctl+0x581/0x600
[  643.414356]  [<ffffffff814b9552>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[  643.414356]  [<ffffffff81093fe9>] ? trace_hardirqs_on_caller+0x17b/0x197
[  643.414356]  [<ffffffff8119a9be>] SyS_ioctl+0x57/0x79
[  643.414356]  [<ffffffff814b9565>] entry_SYSCALL_64_fastpath+0x18/0xa8
[  643.414356]  [<ffffffff81091b08>] ? trace_hardirqs_off_caller+0x3f/0xaa
[  643.414356] Code: 89 83 88 00 00 00 31 c0 5b 41 5c 41 5d 5d c3 55 89 f1 48 c7 c2 98 bc 35 a0 48 89 fe 48 c7 c7 05 be 35 a0 48 89 e5 e8 13 46 dd e0 <0f> 0b 55 89 f1 48 c7 c2 9f d3 35 a0 48 89 fe 48 c7 c7 7a d5 35
[  643.414356] RIP  [<ffffffffa0352759>] assfail.constprop.41+0x1c/0x1e [btrfs]
[  643.414356]  RSP <ffff88014826fa28>
[  643.468267] ---[ end trace 6a1b3fb1a9d7d6e3 ]---

This can be easily reproduced by running xfstests with the integrity
checker enabled.

Fixes: 1ba98d086f (Btrfs: detect corruption when non-root leaf has zero item)
Cc: stable@vger.kernel.org  # 4.8+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-11-23 20:24:35 +00:00
Liu Bo ef85b25e98 Btrfs: fix BUG_ON in btrfs_mark_buffer_dirty
This can only happen with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y.

Commit 1ba98d0 ("Btrfs: detect corruption when non-root leaf has zero item")
assumes that a leaf is its root when leaf->bytenr == btrfs_root_bytenr(root),
however, we should not use btrfs_root_bytenr(root) since it's mainly got
updated during committing transaction.  So the check can fail when doing
COW on this leaf while it is a root.

This changes to use "if (leaf == btrfs_root_node(root))" instead, just like
how we check whether leaf is a root in __btrfs_cow_block().

Fixes: 1ba98d086f (Btrfs: detect corruption when non-root leaf has zero item)
Cc: stable@vger.kernel.org  # 4.8+
Reported-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
2016-11-23 20:23:20 +00:00
Yunlei He 2061471128 f2fs: return directly if block has been removed from the victim
If one block has been to written to a new place, just return
in move data process. This patch check it again with holding
page lock.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:31 -08:00
Chao Yu d47b871595 Revert "f2fs: do not recover from previous remained wrong dnodes"
i_times of inode will be set with current system time which can be
configured through 'date', so it's not safe to judge dnode block as
garbage data or unchanged inode depend on i_times.

Now, we have used enhanced 'cp_ver + cp' crc method to verify valid
dnode block, so I expect recoverying invalid dnode is almost not
possible.

This reverts commit 807b1e1c8e.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:30 -08:00
Jaegeuk Kim b4b9d34c85 f2fs: remove checkpoint in f2fs_freeze
The generic freeze_super() calls sync_filesystems() before f2fs_freeze().
So, basically we don't need to do checkpoint in f2fs_freeze(). But, in xfs/068,
it triggers circular locking problem below due to gc_mutex for checkpoint.

======================================================
[ INFO: possible circular locking dependency detected ]
4.9.0-rc1+ #132 Tainted: G           OE
-------------------------------------------------------

1. wait for __sb_start_write() by

 [<ffffffff9845f353>] dump_stack+0x85/0xc2
 [<ffffffff980e80bf>] print_circular_bug+0x1cf/0x230
 [<ffffffff980eb4d0>] __lock_acquire+0x19e0/0x1bc0
 [<ffffffff980ebdcb>] lock_acquire+0x11b/0x220
 [<ffffffffc08c7c3b>] ? f2fs_drop_inode+0x9b/0x160 [f2fs]
 [<ffffffff9826bdd0>] __sb_start_write+0x130/0x200
 [<ffffffffc08c7c3b>] ? f2fs_drop_inode+0x9b/0x160 [f2fs]
 [<ffffffffc08c7c3b>] f2fs_drop_inode+0x9b/0x160 [f2fs]
 [<ffffffff98289991>] iput+0x171/0x2c0
 [<ffffffffc08cfccf>] f2fs_sync_inode_meta+0x3f/0xf0 [f2fs]
 [<ffffffffc08cfe04>] block_operations+0x84/0x110 [f2fs]
 [<ffffffffc08cff78>] write_checkpoint+0xe8/0xf20 [f2fs]
 [<ffffffff980e979d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffc08c6de9>] ? f2fs_sync_fs+0x79/0x190 [f2fs]
 [<ffffffff9803e9d9>] ? sched_clock+0x9/0x10
 [<ffffffffc08c6de9>] ? f2fs_sync_fs+0x79/0x190 [f2fs]
 [<ffffffffc08c6df5>] f2fs_sync_fs+0x85/0x190 [f2fs]
 [<ffffffff982a4f90>] ? do_fsync+0x70/0x70
 [<ffffffff982a4f90>] ? do_fsync+0x70/0x70
 [<ffffffff982a4fb0>] sync_fs_one_sb+0x20/0x30
 [<ffffffff9826ca3e>] iterate_supers+0xae/0x100
 [<ffffffff982a50b5>] sys_sync+0x55/0x90
 [<ffffffff9890b345>] entry_SYSCALL_64_fastpath+0x23/0xc6

2. wait for sbi->gc_mutex by

 [<ffffffff980ebdcb>] lock_acquire+0x11b/0x220
 [<ffffffff989063d6>] mutex_lock_nested+0x76/0x3f0
 [<ffffffffc08c6de9>] f2fs_sync_fs+0x79/0x190 [f2fs]
 [<ffffffffc08c7a6c>] f2fs_freeze+0x1c/0x20 [f2fs]
 [<ffffffff9826b6ef>] freeze_super+0xcf/0x190
 [<ffffffff9827eebc>] do_vfs_ioctl+0x53c/0x6a0
 [<ffffffff9827f099>] SyS_ioctl+0x79/0x90
 [<ffffffff9890b345>] entry_SYSCALL_64_fastpath+0x23/0xc6

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:29 -08:00
Jaegeuk Kim bdb7d964c4 f2fs: assign segments correctly for direct_io
Previously, we assigned CURSEG_WARM_DATA for direct_io, but if we have two or
four logs, we do not use that type at all.
Let's fix it.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:28 -08:00
Chao Yu 9f0552e078 f2fs: fix wrong i_atime recovery
Shouldn't update in-memory i_atime with on-disk i_mtime of inode when
recovering inode.

Shuoran found this bug which is hidden for a long time, honour is belong
to him.

Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:26 -08:00
Chao Yu 60dcedc997 f2fs: record inode updating status correctly
We should record updating status of inode only for living inode, for those
unlinked inode it needs to clear its ino cache, otherwise after the ino
was been reused, it will cause unneeded node page writing during ->fsync.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:25 -08:00
Damien Le Moal 126606c7a9 f2fs: Trace reset zone events
Similarly to the regular discard, trace zone reset events.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:24 -08:00
Damien Le Moal f46e8809e8 f2fs: Reset sequential zones on zoned block devices
When a zoned block device is mounted, discarding sections
contained in sequential zones must reset the zone write pointer.
For sections contained in conventional zones, the regular discard
is used if the drive supports it.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:23 -08:00
Damien Le Moal 178053e2f1 f2fs: Cache zoned block devices zone type
With the zoned block device feature enabled, section discard
need to do a zone reset for sections contained in sequential
zones, and a regular discard (if supported) for sections
stored in conventional zones. Avoid the need for a costly
report zones to obtain a section zone type when discarding it
by caching the types of the device zones in the super block
information. This cache is initialized at mount time for mounts
with the zoned block device feature enabled.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:22 -08:00
Damien Le Moal 3adc57e977 f2fs: Do not allow adaptive mode for host-managed zoned block devices
The LFS mode is mandatory for host-managed zoned block devices as
update in place optimizations are not possible for segments in
sequential zones.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:20 -08:00
Damien Le Moal 96ba2decb4 f2fs: Always enable discard for zoned blocks devices
Zone write pointer reset acts as discard for zoned block
devices. So if the zoned block device feature is enabled,
always declare that discard is enabled, even if the device
does not actually support the command.
For the same reason, prevent the use the "nodicard" mount
option.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:19 -08:00
Damien Le Moal 0ab0299835 f2fs: Suppress discard warning message for zoned block devices
For zoned block devices, discard is replaced by zone reset. So
do not warn if the device does not supports discard.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:18 -08:00
Damien Le Moal d1b959c877 f2fs: Check zoned block feature for host-managed zoned block devices
The F2FS_FEATURE_BLKZONED feature indicates that the drive was formatted
 with zone alignment optimization. This is optional for host-aware
devices, but mandatory for host-managed zoned block devices.
So check that the feature is set in this latter case.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:17 -08:00
Damien Le Moal 0bfd7a091c f2fs: Use generic zoned block device terminology
SMR stands for "Shingled Magnetic Recording" which makes sense
only for hard disk drives (spinning rust). The ZBC/ZAC standards
enable management of SMR disks, but solid state drives may also
support those standards. So rename the HMSMR feature to BLKZONED
to avoid a HDD centric terminology. For the same reason, rename
f2fs_sb_mounted_hmsmr to f2fs_sb_mounted_blkzoned.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:16 -08:00
Damien Le Moal 487df616de f2fs: Add missing break in switch-case
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:14 -08:00
Jaegeuk Kim 099228000e f2fs: avoid infinite loop in the EIO case on recover_orphan_inodes
This patch should fix an infinite loop case below.

F2FS-fs : inject IO error in f2fs_read_end_io+0xf3/0x120 [f2fs]
F2FS-fs (nvme0n1p1): recover_orphan_inode: orphan failed (ino=39ac1a), run fsck to fix.
...
[<ffffffffc0b11ede>] sync_meta_pages+0xae/0x270 [f2fs]
[<ffffffffc0b288dd>] ? flush_sit_entries+0x8d/0x960 [f2fs]
[<ffffffffc0b13801>] write_checkpoint+0x361/0xf20 [f2fs]
[<ffffffffb40e979d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffffc0b0a199>] ? f2fs_sync_fs+0x79/0x190 [f2fs]
[<ffffffffc0b0a1a5>] f2fs_sync_fs+0x85/0x190 [f2fs]
[<ffffffffc0b2560e>] f2fs_balance_fs_bg+0x7e/0x1c0 [f2fs]
[<ffffffffc0b216c4>] f2fs_write_node_pages+0x34/0x320 [f2fs]
[<ffffffffb41dff21>] do_writepages+0x21/0x30
[<ffffffffb429edb1>] __writeback_single_inode+0x61/0x760
[<ffffffffb490a937>] ? _raw_spin_unlock+0x27/0x40
[<ffffffffb42a0805>] writeback_single_inode+0xd5/0x190
[<ffffffffb42a0959>] write_inode_now+0x99/0xc0
[<ffffffffb4289a16>] iput+0x1f6/0x2c0
[<ffffffffc0b0e3be>] f2fs_fill_super+0xe0e/0x1300 [f2fs]
[<ffffffffb426c394>] ? sget_userns+0x4f4/0x530
[<ffffffffb426c692>] mount_bdev+0x182/0x1b0
[<ffffffffc0b0d5b0>] ? f2fs_commit_super+0x100/0x100 [f2fs]
[<ffffffffc0b0a375>] f2fs_mount+0x15/0x20 [f2fs]
[<ffffffffb426d038>] mount_fs+0x38/0x170
[<ffffffffb428ec9b>] vfs_kern_mount+0x6b/0x160
[<ffffffffb4291d9e>] do_mount+0x1be/0xd60
[<ffffffffb4291a57>] ? copy_mount_options+0xb7/0x220
[<ffffffffb4292c54>] SyS_mount+0x94/0xd0
[<ffffffffb490b345>] entry_SYSCALL_64_fastpath+0x23/0xc6

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:13 -08:00
Chao Yu ed6bd4b146 f2fs: report error of f2fs_fill_dentries
Report error of f2fs_fill_dentries to ->iterate_shared, otherwise when
error ocurrs, user may just list part of dirents in target directory
without any hints.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:12 -08:00
Arnd Bergmann 230436b3ef f2fs: hide a maybe-uninitialized warning
gcc is unsure about the use of last_ofs_in_node, which might happen
without a prior initialization:

fs/f2fs//git/arm-soc/fs/f2fs/data.c: In function ‘f2fs_map_blocks’:
fs/f2fs/data.c:799:54: warning: ‘last_ofs_in_node’ may be used uninitialized in this function [-Wmaybe-uninitialized]
   if (prealloc && dn.ofs_in_node != last_ofs_in_node + 1) {

As pointed out by Chao Yu, the code is actually correct as 'prealloc'
is only set if the last_ofs_in_node has been set, the two always
get updated together.

This initializes last_ofs_in_node to dn.ofs_in_node for each
new dnode at the start of the 'next_block' loop, which at that
point is a correct initialization as well. I assume that compilers
that correctly track the contents of the variables and do not
warn about the condition also figure out that they can eliminate
the extra assignment here.

Fixes: 46008c6d42 ("f2fs: support in batch multi blocks preallocation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:11 -08:00
Jaegeuk Kim 35782b233f f2fs: remove percpu_count due to performance regression
This patch removes percpu_count usage due to performance regression in iozone.

Fixes: 523be8a6b3 ("f2fs: use percpu_counter for page counters")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:10 -08:00
Jaegeuk Kim 18340edc8d f2fs: make clean inodes when flushing inode page
This patch tries to make more clean inodes when flushing dirty inodes in
checkpoint.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:09 -08:00
Jaegeuk Kim 7c45729a4d f2fs: keep dirty inodes selectively for checkpoint
This is to avoid no free segment bug during checkpoint caused by a number of
dirty inodes.

The case was reported by Chao like this.
1. mount with lazytime option
2. fill 4k file until disk is full
3. sync filesystem
4. read all files in the image
5. umount

In this case, we actually don't need to flush dirty inode to inode page during
checkpoint.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:08 -08:00
Jaegeuk Kim 664ba972df f2fs: use BIO_MAX_PAGES for bio allocation
We don't need to allocate bio partially in order to maximize sequential writes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:07 -08:00
Jaegeuk Kim 3e7b5bbbef f2fs: declare static function for __build_free_nids
This patch avoids build warning.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:06 -08:00
Jaegeuk Kim 15d0435455 f2fs: call f2fs_balance_fs for setattr
If inode becomes dirty, we need to check the # of dirty inodes whether or not
further checkpoint would be required.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:05 -08:00
Jaegeuk Kim b9610bdfcb f2fs: count dirty inodes to flush node pages during checkpoint
If there are a lot of dirty inodes, we need to flush all of them when doing
checkpoint. So, we need to count this for enough free space.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:04 -08:00
Chao Yu 02110a4fd5 f2fs: avoid casted negative value as shrink count
This patch makes sure it returns a positive value instead of a probable
casted negative value as shrink count.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:03 -08:00
Chao Yu 3a2ad5672b f2fs: don't interrupt free nids building during nid allocation
Let build_free_nids support sync/async methods, in allocation flow of nids,
we use synchronuous method, so that we can avoid looping in alloc_nid when
free memory is low; in unblock_operations and f2fs_balance_fs_bg we use
asynchronuous method in where low memory condition can interrupt us.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:02 -08:00
Jaegeuk Kim eb0aa4b807 f2fs: clean up free nid list operations
This patch cleans up to use consistent free nid list ops.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:01 -08:00
Chao Yu b8559dc242 f2fs: split free nid list
During free nid allocation, in order to do preallocation, we will tag free
nid entry as allocated one and still leave it in free nid list, for other
allocators who want to grab free nids, it needs to traverse the free nid
list for lookup. It becomes overhead in scenario of allocating free nid
intensively by multithreads.

This patch splits free nid list to two list: {free,alloc}_nid_list, to
keep free nids and preallocated free nids separately, after that, traverse
latency will be gone, besides split nid_cnt for separate statistic.

Additionally, introduce __insert_nid_to_list and __remove_nid_from_list for
cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: modify f2fs_bug_on to avoid needless branches]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:11:00 -08:00
Chao Yu a11b9f65ea f2fs: clear nlink if fail to add_link
We don't need to keep incomplete created inode in cache, so if we fail to
add link into directory during new inode creation, it's better to set
nlink of inode to zero, then we can evict inode immediately. Otherwise
release of nid belong to inode will be delayed until inode cache is being
shrunk, it may cause a seemingly endless loop while allocating free nids
in time of testing generic/269 case of fstest suit.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: add update_inode_page to fix kernel panic]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:59 -08:00
Eric Biggers 0c0b471e43 f2fs: fix sparse warnings
f2fs contained a number of endianness conversion bugs.

Also, one function should have been 'static'.

Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/f2fs/'

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:57 -08:00
Chao Yu 9de6927975 f2fs: fix error handling in fsync_node_pages
In fsync_node_pages, if f2fs was taged with CP_ERROR_FLAG, make sure bio
cache was flushed before return.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:56 -08:00
Chao Yu b691d98fdd f2fs: fix to update largest extent under lock
In order to avoid racing problem, make largest extent cache being updated
under lock.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:55 -08:00
Chao Yu 58736fa60f f2fs: be aware of extent beyond EOF in fiemap
f2fs can support fallocating blocks beyond file size without changing the
size, but ->fiemap of f2fs was restricted and can't detect these extents
fallocated past EOF, now relieve the restriction.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:54 -08:00
Chao Yu 6f2d8ed654 f2fs: don't miss any f2fs_balance_fs cases
In f2fs_map_blocks, let f2fs_balance_fs detects node page modification
with dn.node_changed to avoid miss some corner cases.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:53 -08:00
Chao Yu 9434fcde1f f2fs: add missing f2fs_balance_fs in f2fs_zero_range
f2fs_balance_fs should be called in between node page updating, otherwise
node page count will exceeded far beyond watermark of triggering
foreground garbage collection, result in facing high risk of hitting LFS
allocation failure.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:52 -08:00
Chao Yu 933439c8f3 f2fs: give a chance to detach from dirty list
If there is no dirty pages in inode, we should give a chance to detach
the inode from global dirty list, otherwise it needs to call another
unnecessary .writepages for detaching.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:51 -08:00
Chao Yu 2dd15654ac f2fs: fix to release discard entries during checkpoint
In f2fs_fill_super, if there is any IO error occurs during recovery,
cached discard entries will be leaked, in order to avoid this, make
write_checkpoint() handle memory release by itself, besides, move
clear_prefree_segments to write_checkpoint for readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:50 -08:00
Chao Yu 2411cf5bef f2fs: exclude free nids building and allocation
During nid allocation, it needs to exclude building and allocating flow
of free nids, this is because while building free nid cache, there are two
steps: a) load free nids from unused nat entries in NAT pages, b) update
free nid cache by checking nat journal. The two steps should be atomical,
otherwise an used nid can be allocated as free one after a) and before b).

This patch adds missing lock which covers build_free_nids in
unlock_operation and f2fs_balance_fs_bg to avoid that.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:49 -08:00
Jaegeuk Kim e87f7329bb f2fs: fix overflow due to condition check order
In the last ilen case, i was already increased, resulting in accessing out-
of-boundary entry of do_replace and blkaddr.
Fix to check ilen first to exit the loop.

Fixes: 2aa8fbb9693020 ("f2fs: refactor __exchange_data_block for speed up")
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-11-23 12:10:48 -08:00
Jan Kara ba6379f7e6 fs: Provide function to get superblock with exclusive s_umount
Quota code will need a variant of get_super_thawed() that returns
superblock with s_umount held in exclusive mode to serialize quota on
and quota off operations. Provide this functionality.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-23 12:53:00 +01:00
Jan Kara 4f5a763c9a ext4: Add select for CONFIG_FS_IOMAP
When ext4 is compiled with DAX support, it now needs the iomap code. Add
appropriate select to Kconfig.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-22 23:21:58 -05:00
Arnd Bergmann d55b352b01 NFSv4.x: hide array-bounds warning
A correct bugfix introduced a harmless warning that shows up with gcc-7:

fs/nfs/callback.c: In function 'nfs_callback_up':
fs/nfs/callback.c:214:14: error: array subscript is outside array bounds [-Werror=array-bounds]

What happens here is that the 'minorversion == 0' check tells the
compiler that we assume minorversion can be something other than 0,
but when CONFIG_NFS_V4_1 is disabled that would be invalid and
result in an out-of-bounds access.

The added check for IS_ENABLED(CONFIG_NFS_V4_1) tells gcc that this
really can't happen, which makes the code slightly smaller and also
avoids the warning.

The bugfix that introduced the warning is marked for stable backports,
we want this one backported to the same releases.

Fixes: 98b0f80c23 ("NFSv4.x: Fix a refcount leak in nfs_callback_up_net")
Cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-22 16:11:44 -05:00
Eric W. Biederman f84df2a6f2 exec: Ensure mm->user_ns contains the execed files
When the user namespace support was merged the need to prevent
ptrace from revealing the contents of an unreadable executable
was overlooked.

Correct this oversight by ensuring that the executed file
or files are in mm->user_ns, by adjusting mm->user_ns.

Use the new function privileged_wrt_inode_uidgid to see if
the executable is a member of the user namespace, and as such
if having CAP_SYS_PTRACE in the user namespace should allow
tracing the executable.  If not update mm->user_ns to
the parent user namespace until an appropriate parent is found.

Cc: stable@vger.kernel.org
Reported-by: Jann Horn <jann@thejh.net>
Fixes: 9e4a36ece6 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 13:21:00 -06:00
David S. Miller f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Eric W. Biederman 64b875f7ac ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked.  This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.

Further PT_PTRACE_CAP does not have enough information for making good
security decisions as it does not report which user namespace the
capability is in.  This has already allowed one mistake through
insufficient granulariy.

I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.

This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid.  More fundamentaly
this change allows what is allowable at all times, by using the correct
information in it's decision.

Cc: stable@vger.kernel.org
Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 11:49:49 -06:00
Ming Lei 05aea81b4b fs: logfs: remove unnecesary check
The check on bio->bi_vcnt doesn't make sense in erase_end_io().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei c124843678 fs: logfs: use bio_add_page() in do_erase()
Also code gets simplified a bit.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei d4f98a89f9 fs: logfs: use bio_add_page() in __bdev_writeseg()
Also this patch simplify the code a bit.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 739a997546 fs: logfs: convert to bio_add_page() in sync_request()
Always bio_add_page() is the standard and preferred way to
do the task.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 3a83f46775 block: bio: pass bvec table to bio_init()
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:21 -07:00
Jens Axboe 9a794fb9bd block_dev: get rid of blksize bits calculation
We store the bits in the bdev sector size locally, but we don't use
the calculation anymore. All we do with it is shift it back up to
the bdev sector size. So let's just use that directly and kill the
variable and bits calculation.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:56:25 -07:00
Damien Le Moal 4d1a476542 block_dev: Fixed direct I/O bio sector calculation
A direct I/O alignment must be always checked against the device blocks size,
but the I/O offset (bio->bi_iter.bi_sector must always use 512B sector unit, and
not the actual logical block size.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:09:09 -07:00
Benjamin Coddington d75a6a0e39 NFSv4.1: Keep a reference on lock states while checking
While walking the list of lock_states, keep a reference on each
nfs4_lock_state to be checked, otherwise the lock state could be removed
while the check performs TEST_STATEID and possible FREE_STATEID.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-21 11:58:39 -05:00
Eric Biggers 2f8f5e76c7 ext4: avoid lockdep warning when inheriting encryption context
On a lockdep-enabled kernel, xfstests generic/027 fails due to a lockdep
warning when run on ext4 mounted with -o test_dummy_encryption:

    xfs_io/4594 is trying to acquire lock:
     (jbd2_handle
    ){++++.+}, at:
    [<ffffffff813096ef>] jbd2_log_wait_commit+0x5/0x11b

    but task is already holding lock:
     (jbd2_handle
    ){++++.+}, at:
    [<ffffffff813000de>] start_this_handle+0x354/0x3d8

The abbreviated call stack is:

 [<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
 [<ffffffff8130972a>] jbd2_log_wait_commit+0x40/0x11b
 [<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
 [<ffffffff8130987b>] ? __jbd2_journal_force_commit+0x76/0xa6
 [<ffffffff81309896>] __jbd2_journal_force_commit+0x91/0xa6
 [<ffffffff813098b9>] jbd2_journal_force_commit_nested+0xe/0x18
 [<ffffffff812a6049>] ext4_should_retry_alloc+0x72/0x79
 [<ffffffff812f0c1f>] ext4_xattr_set+0xef/0x11f
 [<ffffffff812cc35b>] ext4_set_context+0x3a/0x16b
 [<ffffffff81258123>] fscrypt_inherit_context+0xe3/0x103
 [<ffffffff812ab611>] __ext4_new_inode+0x12dc/0x153a
 [<ffffffff812bd371>] ext4_create+0xb7/0x161

When a file is created in an encrypted directory, ext4_set_context() is
called to set an encryption context on the new file.  This calls
ext4_xattr_set(), which contains a retry loop where the journal is
forced to commit if an ENOSPC error is encountered.

If the task actually were to wait for the journal to commit in this
case, then it would deadlock because a handle remains open from
__ext4_new_inode(), so the running transaction can't be committed yet.
Fortunately, __jbd2_journal_force_commit() avoids the deadlock by not
allowing the running transaction to be committed while the current task
has it open.  However, the above lockdep warning is still triggered.

This was a false positive which was introduced by: 1eaa566d368b: jbd2:
track more dependencies on transaction commit

Fix the problem by passing the handle through the 'fs_data' argument to
ext4_set_context(), then using ext4_xattr_set_handle() instead of
ext4_xattr_set().  And in the case where no journal handle is specified
and ext4_set_context() has to open one, add an ENOSPC retry loop since
in that case it is the outermost transaction.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2016-11-21 11:52:44 -05:00
Ross Zwisler d086630e19 ext4: remove unused function ext4_aligned_io()
The last user of ext4_aligned_io() was the DAX path in
ext4_direct_IO_write().  This usage was removed by Jan Kara's patch
entitled "ext4: Rip out DAX handling from direct IO path".

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-21 11:51:44 -05:00
Jan Kara dd936e4313 dax: rip out get_block based IO support
No one uses functions using the get_block callback anymore. Rip them
out and update documentation.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 20:48:36 -05:00
Jan Kara 00697eed38 ext2: use iomap_zero_range() for zeroing truncated page in DAX path
Currently the last user of ext2_get_blocks() for DAX inodes was
dax_truncate_page(). Convert that to iomap_zero_range() so that all DAX
IO uses the iomap path.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 20:47:07 -05:00
Jan Kara 0bd2d5ec3d ext4: rip out DAX handling from direct IO path
Reads and writes for DAX inodes should no longer end up in direct IO
code. Rip out the support and add a warning.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:53:30 -05:00
Jan Kara e2ae766c1b ext4: convert DAX faults to iomap infrastructure
Convert DAX faults to use iomap infrastructure. We would not have to start
transaction in ext4_dax_fault() anymore since ext4_iomap_begin takes
care of that but so far we do that to avoid lock inversion of
transaction start with DAX entry lock which gets acquired in
dax_iomap_fault() before calling ->iomap_begin handler.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:51:24 -05:00
Jan Kara 96f8ba3dd6 ext4: avoid split extents for DAX writes
Currently mapping of blocks for DAX writes happen with
EXT4_GET_BLOCKS_PRE_IO flag set. That has a result that each
ext4_map_blocks() call creates a separate written extent, although it
could be merged to the neighboring extents in the extent tree.  The
reason for using this flag is that in case the extent is unwritten, we
need to convert it to written one and zero it out. However this "convert
mapped range to written" operation is already implemented by
ext4_map_blocks() for the case of data writes into unwritten extent. So
just use flags for that mode of operation, simplify the code, and avoid
unnecessary split extents.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:10:09 -05:00
Jan Kara 776722e85d ext4: DAX iomap write support
Implement DAX writes using the new iomap infrastructure instead of
overloading the direct IO path.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:09:11 -05:00
Jan Kara 47e6935136 ext4: use iomap for zeroing blocks in DAX mode
Use iomap infrastructure for zeroing blocks when in DAX mode.
ext4_iomap_begin() handles read requests just fine and that's all that
is needed for iomap_zero_range().

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:08:05 -05:00
Jan Kara 364443cbcf ext4: convert DAX reads to iomap infrastructure
Implement basic iomap_begin function that handles reading and use it for
DAX reads.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:36:06 -05:00
Jan Kara a3caa24b70 ext4: only set S_DAX if DAX is really supported
Currently we have S_DAX set inode->i_flags for a regular file whenever
ext4 is mounted with dax mount option. However in some cases we cannot
really do DAX - e.g. when inode is marked to use data journalling, when
inode data is being encrypted, or when inode is stored inline. Make sure
S_DAX flag is appropriately set/cleared in these cases.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:32:59 -05:00
Jan Kara 213bcd9ccb ext4: factor out checks from ext4_file_write_iter()
Factor out checks of 'from' and whether we are overwriting out of
ext4_file_write_iter() so that the function is easier to follow.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:29:51 -05:00
Linus Torvalds d117b9acae A security fix (so a maliciously corrupted file system image won't
panic the kernel) and some fixes for CONFIG_VMAP_STACK.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlgxCMoACgkQ8vlZVpUN
 gaOX3Af/QOphB5pKrKijhDK9H40nKS6lHtL7klJpvRafUMtVxBDOP3dsRISyGMdF
 w+gQQQv+eFEPefwGcYzdO4PN7FFVirAF9RS/NTFSIB/c8V6FfHzn/DeiftU7CLRW
 ljTP7y8M9eo35TsU8s9D7wfbyfY55MEANiAP8vnpx4JKDb86I/8Eaa6YS91v17vp
 /7TKSUt7PE6UUp7mgTRCX8vK9SxJJ8Xvg2hSzulfrO1DdsfW61RQYXwif+biR85T
 uxFPnV0yvji2EU4cpeIekPqJKUb9Av0aIbSwg19QqcAE0xqxvtSRBKlYnF2IRTuv
 OXoaC30d4UcQrNCkxPDAdH/0BMdcNQ==
 =y+5G
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A security fix (so a maliciously corrupted file system image won't
  panic the kernel) and some fixes for CONFIG_VMAP_STACK"

* tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: sanity check the block and cluster size at mount time
  fscrypto: don't use on-stack buffer for key derivation
  fscrypto: don't use on-stack buffer for filename encryption
2016-11-19 18:33:50 -08:00
Theodore Ts'o 8cdf3372fe ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount.  This
is important for security reasons (although we shouldn't be just
depending on this check).

Ref: http://www.securityfocus.com/archive/1/539661
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-19 20:58:15 -05:00
Eric Biggers 0f0909e242 fscrypto: don't use on-stack buffer for key derivation
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page.  get_crypt_info() was using a stack buffer to hold the
output from the encryption operation used to derive the per-file key.
Fix it by using a heap buffer.

This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-19 20:56:13 -05:00
Eric Biggers 3c7018ebf8 fscrypto: don't use on-stack buffer for filename encryption
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page.  For short filenames, fname_encrypt() was encrypting a
stack buffer holding the padded filename.  Fix it by encrypting the
filename in-place in the output buffer, thereby making the temporary
buffer unnecessary.

This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-19 20:56:06 -05:00
Filipe Manana 2a2a83de54 Btrfs: remove rb_node field from the delayed ref node structure
After the last big change in the delayed references code that was needed
for the last qgroups rework, the red black tree node field of struct
btrfs_delayed_ref_node is no longer used, so just remove it, this helps
us save some memory (since struct rb_node is 24 bytes on x86_64) for
these structures.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-11-19 13:39:18 +00:00
Filipe Manana 001895b313 Btrfs: remove unused code when creating and merging reloc trees
In commit 5bc7247ac4 (Btrfs: fix broken nocow after balance) we started
abusing the rtransid and otransid fields of root items from relocation
trees to fix some issues with nodatacow mode. However later in commit
ba8b028933 (Btrfs: do not reset last_snapshot after relocation) we
dropped the code that made use of those fields but did not remove
the code that sets those fields.

So just remove them to avoid confusion.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-11-19 13:39:18 +00:00
Filipe Manana 054570a1dc Btrfs: fix relocation incorrectly dropping data references
During relocation of a data block group we create a relocation tree
for each fs/subvol tree by making a snapshot of each tree using
btrfs_copy_root() and the tree's commit root, and then setting the last
snapshot field for the fs/subvol tree's root to the value of the current
transaction id minus 1. However this can lead to relocation later
dropping references that it did not create if we have qgroups enabled,
leaving the filesystem in an inconsistent state that keeps aborting
transactions.

Lets consider the following example to explain the problem, which requires
qgroups to be enabled.

We are relocating data block group Y, we have a subvolume with id 258 that
has a root at level 1, that subvolume is used to store directory entries
for snapshots and we are currently at transaction 3404.

When committing transaction 3404, we have a pending snapshot and therefore
we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot()
in order to create its dentry at subvolume 258. This results in COWing
leaf A from root 258 in order to add the dentry. Note that leaf A
also contains file extent items referring to extents from some other
block group X (we are currently relocating block group Y). Later on, still
at create_pending_snapshot() we call qgroup_account_snapshot(), which
switches the commit root for root 258 when it calls switch_commit_roots(),
so now the COWed version of leaf A, lets call it leaf A', is accessible
from the commit root of tree 258. At the end of qgroup_account_snapshot(),
we call record_root_in_trans() with 258 as its argument, which results
in btrfs_init_reloc_root() being called, which in turn calls
relocation.c:create_reloc_root() in order to create a relocation tree
associated to root 258, which results in assigning the value of 3403
(which is the current transaction id minus 1 = 3404 - 1) to the
last_snapshot field of root 258. When creating the relocation tree root
at ctree.c:btrfs_copy_root() we add a shared reference for leaf A',
corresponding to the relocation tree's root, when we call btrfs_inc_ref()
against the COWed root (a copy of the commit root from tree 258), which
is at level 1. So at this point leaf A' has 2 references, one normal
reference corresponding to root 258 and one shared reference corresponding
to the root of the relocation tree.

Transaction 3404 finishes its commit and transaction 3405 is started by
relocation when calling merge_reloc_root() for the relocation tree
associated to root 258. In the meanwhile leaf A' is COWed again, in
response to some filesystem operation, when we are still at transaction
3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we
call btrfs_block_can_be_shared() in order to figure out if other trees
refer to the leaf and if any such trees exists, add a full back reference
to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false
because the following condition is false:

  btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item)

which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with
only one reference, corresponding to the shared reference we created when
we called btrfs_copy_root() to create the relocation tree's root and
btrfs_inc_ref() ends up not being called for leaf A' nor we end up setting
the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not
adding shared references for the extents from block group X that leaf A'
refers to with its file extent items.

Later, after merging the relocation root we do a call to to
btrfs_drop_snapshot() in order to delete the relocation tree. This ends
up calling do_walk_down() when path->slots[1] points to leaf A', which
results in calling btrfs_lookup_extent_info() to get the number of
references for leaf A', which is 1 at this time (only the shared reference
exists) and this value is stored at wc->refs[0]. After this walk_up_proc()
is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'.
Because the current level is 0 and wc->refs[0] is 1, it does call
btrfs_dec_ref() against leaf A', which results in removing the single
references that the extents from block group X have which are associated
to root 258 - the expectation was to have each of these extents with 2
references - one reference for root 258 and one shared reference related
to the root of the relocation tree, and so we would drop only the shared
reference (because leaf A' was supposed to have the flag
BTRFS_BLOCK_FLAG_FULL_BACKREF set).

This leaves the filesystem in an inconsistent state as we now have file
extent items in a subvolume tree that point to extents from block group X
without references in the extent tree. So later on when we try to decrement
the references for these extents, for example due to a file unlink operation,
truncate operation or overwriting ranges of a file, we fail because the
expected references do not exist in the extent tree.

This leads to warnings and transaction aborts like the following:

[  588.965795] ------------[ cut here ]------------
[  588.965815] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
[  588.965816] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
[  588.965831] CPU: 2 PID: 2479 Comm: kworker/u8:7 Not tainted 4.7.3-3-default-fdm+ #1
[  588.965832] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  588.965844] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[  588.965845]  0000000000000000 ffff8802263bfa28 ffffffff813af542 0000000000000000
[  588.965847]  0000000000000000 ffff8802263bfa68 ffffffff81081e8b 0000065900000000
[  588.965848]  ffff8801db2af000 000000012bbe2000 0000000000000000 ffff880215703b48
[  588.965849] Call Trace:
[  588.965852]  [<ffffffff813af542>] dump_stack+0x63/0x81
[  588.965854]  [<ffffffff81081e8b>] __warn+0xcb/0xf0
[  588.965855]  [<ffffffff81081f7d>] warn_slowpath_null+0x1d/0x20
[  588.965863]  [<ffffffffa0175042>] lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
[  588.965865]  [<ffffffff81143220>] ? trace_clock_local+0x10/0x30
[  588.965867]  [<ffffffff8114c5df>] ? rb_reserve_next_event+0x6f/0x460
[  588.965875]  [<ffffffffa0175215>] insert_inline_extent_backref+0x55/0xd0 [btrfs]
[  588.965882]  [<ffffffffa017531f>] __btrfs_inc_extent_ref.isra.55+0x8f/0x240 [btrfs]
[  588.965890]  [<ffffffffa017acea>] __btrfs_run_delayed_refs+0x74a/0x1260 [btrfs]
[  588.965892]  [<ffffffff810cb046>] ? cpuacct_charge+0x86/0xa0
[  588.965900]  [<ffffffffa017e74f>] btrfs_run_delayed_refs+0x9f/0x2c0 [btrfs]
[  588.965908]  [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs]
[  588.965918]  [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
[  588.965928]  [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[  588.965930]  [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0
[  588.965931]  [<ffffffff8109b658>] worker_thread+0x48/0x4e0
[  588.965932]  [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0
[  588.965934]  [<ffffffff810a1659>] kthread+0xc9/0xe0
[  588.965936]  [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40
[  588.965937]  [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170
[  588.965938] ---[ end trace 34e5232c933a1749 ]---
[  588.966187] ------------[ cut here ]------------
[  588.966196] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:2966 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[  588.966196] BTRFS: Transaction aborted (error -5)
[  588.966197] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
[  588.966206] CPU: 2 PID: 2479 Comm: kworker/u8:7 Tainted: G        W       4.7.3-3-default-fdm+ #1
[  588.966207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  588.966217] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[  588.966217]  0000000000000000 ffff8802263bfc98 ffffffff813af542 ffff8802263bfce8
[  588.966219]  0000000000000000 ffff8802263bfcd8 ffffffff81081e8b 00000b96345ee000
[  588.966220]  ffffffffa021ae1c ffff880215703b48 00000000000005fe ffff8802345ee000
[  588.966221] Call Trace:
[  588.966223]  [<ffffffff813af542>] dump_stack+0x63/0x81
[  588.966224]  [<ffffffff81081e8b>] __warn+0xcb/0xf0
[  588.966225]  [<ffffffff81081eff>] warn_slowpath_fmt+0x4f/0x60
[  588.966233]  [<ffffffffa017e93c>] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[  588.966241]  [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs]
[  588.966250]  [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
[  588.966259]  [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[  588.966260]  [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0
[  588.966261]  [<ffffffff8109b658>] worker_thread+0x48/0x4e0
[  588.966263]  [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0
[  588.966264]  [<ffffffff810a1659>] kthread+0xc9/0xe0
[  588.966265]  [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40
[  588.966267]  [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170
[  588.966268] ---[ end trace 34e5232c933a174a ]---
[  588.966269] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2966: errno=-5 IO failure
[  588.966270] BTRFS info (device sda2): forced readonly

This was happening often on openSUSE and SLE systems using btrfs as the
root filesystem (with its default layout where multiple subvolumes are
used) where balance happens in the background triggered by a cron job and
snapshots are automatically created before/after package installations,
upgrades and removals. The issue could be triggered simply by running the
following loop on the first system boot post installation:

  while true; do
     zypper -n in nfs-kernel-server
     zypper -n rm nfs-kernel-server
  done

(If we were fast enough and made that loop before the cron job triggered
a balance operation and the balance finished)

So fix by setting the last_snapshot field of the root to the value of the
generation of its commit root. Like this btrfs_block_can_be_shared()
behaves correctly for the case where the relocation root is created during
a transaction commit and for the case where it's created before a
transaction commit.

Fixes: 6426c7ad69 (btrfs: qgroup: Fix qgroup accounting when creating snapshot)
Cc: stable@vger.kernel.org  # 4.7+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-11-19 13:39:17 +00:00
Jonathan Corbet 917fef6f7e Linux 4.9-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYHmoCAAoJEHm+PkMAQRiG7RMIAI2i7Y5hpL5yCxK5AFaL4u/G
 KxXfp1B1UanUTgjOmd7zGqtDYcFX9t7GTTUFixQ7/9Opr4PD9qbnatoDGSc3xjbT
 msDgA1B78F1/Q3kHWfeGq32MihQ4mj5NwUCo+igUcUvvWG7mHgzErj/Nh5RoobQX
 p/izdpTbrw3GX6xXB8olbG7XWHaVye/+TT3q6+gmgm8I/QEujcLeGoycE0zlhPN8
 FG/JX76At/+ZM2Py7Oxo3k+oKL9CHrtOQYDp/wN0uslV5eYvvkZz0/M1HMOGZt+c
 gZU5jzM17K7C4Nzo06WAuBU9wUBGc25m+cPicLlOmljnzfU+f50SKaDjZq3p7QI=
 =2KUF
 -----END PGP SIGNATURE-----

Merge tag 'v4.9-rc4' into sound

Bring in -rc4 patches so I can successfully merge the sound doc changes.
2016-11-18 16:13:41 -07:00
Benjamin Coddington d41cbfc9a6 NFSv4.1: Handle NFS4ERR_OLD_STATEID in nfs4_reclaim_open_state
Now that we're doing TEST_STATEID in nfs4_reclaim_open_state(), we can have
a NFS4ERR_OLD_STATEID returned from nfs41_open_expired() .  Instead of
marking state recovery as failed, mark the state for recovery again.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-18 14:27:27 -05:00
Trond Myklebust 5cc7861eb5 NFSv4: Don't call close if the open stateid has already been cleared
Ensure we test to see if the open stateid is actually set, before we
send a CLOSE.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-18 14:18:02 -05:00
Theodore Ts'o c48ae41baf ext4: add sanity checking to count_overhead()
The commit "ext4: sanity check the block and cluster size at mount
time" should prevent any problems, but in case the superblock is
modified while the file system is mounted, add an extra safety check
to make sure we won't overrun the allocated buffer.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:37:47 -05:00
Trond Myklebust 3e7dfb1659 NFSv4: Fix CLOSE races with OPEN
If the reply to a successful CLOSE call races with an OPEN to the same
file, we can end up scribbling over the stateid that represents the
new open state.
The race looks like:

  Client				Server
  ======				======

  CLOSE stateid A on file "foo"
					CLOSE stateid A, return stateid C
  OPEN file "foo"
					OPEN "foo", return stateid B
  Receive reply to OPEN
  Reset open state for "foo"
  Associate stateid B to "foo"

  Receive CLOSE for A
  Reset open state for "foo"
  Replace stateid B with C

The fix is to examine the argument of the CLOSE, and check for a match
with the current stateid "other" field. If the two do not match, then
the above race occurred, and we should just ignore the CLOSE.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-18 13:35:58 -05:00
Trond Myklebust 23ea44c215 NFSv4.1: Fix a regression in DELEGRETURN
We don't want to call nfs4_free_revoked_stateid() in the case where
the delegreturn was successful.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-18 13:35:54 -05:00
Theodore Ts'o cd6bb35bf7 ext4: use more strict checks for inodes_per_block on mount
Centralize the checks for inodes_per_block and be more strict to make
sure the inodes_per_block_group can't end up being zero.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2016-11-18 13:28:30 -05:00
Theodore Ts'o 5aee0f8a3f ext4: fix in-superblock mount options processing
Fix a large number of problems with how we handle mount options in the
superblock.  For one, if the string in the superblock is long enough
that it is not null terminated, we could run off the end of the string
and try to interpret superblocks fields as characters.  It's unlikely
this will cause a security problem, but it could result in an invalid
parse.  Also, parse_options is destructive to the string, so in some
cases if there is a comma-separated string, it would be modified in
the superblock.  (Fortunately it only happens on file systems with a
1k block size.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:24:26 -05:00
Theodore Ts'o 9e47a4c9fc ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount.  This
is important for security reasons (although we shouldn't be just
depending on this check).

Ref: http://www.securityfocus.com/archive/1/539661
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:00:24 -05:00
Alexey Dobriyan c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
Linus Torvalds bec1b089ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of regression fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix iov_iter_advance() for ITER_PIPE
  xattr: Fix setting security xattrs on sockfs
2016-11-17 13:49:30 -08:00
Linus Torvalds d46bc34da9 orangefs: add .owner to debugfs file_operations
Without ".owner = THIS_MODULE" it is possible to crash the kernel
 by unloading the Orangefs module while someone is reading debugfs
 files.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYLfvEAAoJEM9EDqnrzg2+sHEQAJo4jn/sAQvO04ujaMrViLmy
 5+V93F7jwGFeLvAwjMvPAeBb+UmlgqjVi0VT85RzEe6eNOKN9qlj9ZNDutOfnbhr
 H6qu8AQsbO0znSTQuJA1M2Hca9h66EnN0pT8xW4wat1cCdAf6X6HcFcr1lZIRKZd
 E17EygXi+IW0c0evIq4UBsD0DfTZgtC4ONrR9N7+zprlg2PVX35So6Lr0ODceJQs
 StWHrZW9hDZ6KR8WocupuHPR8brOe+P5PU14fPzR1+EH7BsTf8uxWK7CfTE5ov0C
 UNkNeh81BOkwIQDFoPCJ5asaipdi5RRNTIQekhhQ2GnaaCdmCKln8OLjqDZZOmDj
 KRGB4mdPcCb3XlvMH3SaXNmyhmjt2cTS0/TQPexrTqjSNmbXmnzJOCguweoTIJ5w
 CgEnsrNp8GwlZo12Z8JkFGxC39ifjH4F+KFetU+eUNjw9Tce+zHwgEvsAMqDhWw8
 FJQWy+snG7m8ooytRObWPepchnd2XHkrJv4yu8uw3GirM+YTlxvuWnB54hVH17FQ
 0vKYhdAXBUmeeyyNKApBSGQezPWD9hfAY5Di7JGJlaTiai3pVxgXd8YY4DGXHj3t
 ebPpxEnlWrRLC5Cazd0yC9CoR8azQp9zvRgfPuPEM4wJSjUFVfmasmFg7s99h3Zq
 vnTqfV/uQwLm9f+3CfNB
 =s21f
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.9-rc5-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs fix from Mike Marshall:
 "orangefs: add .owner to debugfs file_operations

  Without ".owner = THIS_MODULE" it is possible to crash the kernel by
  unloading the Orangefs module while someone is reading debugfs files"

* tag 'for-linus-4.9-rc5-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: add .owner to debugfs file_operations
2016-11-17 13:45:57 -08:00
Christoph Hellwig 542ff7bf18 block: new direct I/O implementation
Similar to the simple fast path, but we now need a dio structure to
track multiple-bio completions.  It's basically a cut-down version
of the new iomap-based direct I/O code for filesystems, but without
all the logic to call into the filesystem for extent lookup or
allocation, and without the complex I/O completion workqueue handler
for AIO - instead we just use the FUA bit on the bios to ensure
data is flushed to stable storage.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:11 -07:00
Jens Axboe 78250c02d9 block: make __blkdev_direct_IO_sync() support O_SYNC/DSYNC
Split the op setting code into a helper, use it in both places.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:05 -07:00
Jens Axboe 72ecad22d9 block: support a full bio worth of IO for simplified bdev direct-io
Just alloc the bio_vec array if we exceed the inline limit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:02 -07:00
Christoph Hellwig 189ce2b9dc block: fast-path for small and simple direct I/O requests
This patch adds a small and simple fast patch for small direct I/O
requests on block devices that don't use AIO.  Between the neat
bio_iov_iter_get_pages helper that avoids allocating a page array
for get_user_pages and the on-stack bio and biovec this avoid memory
allocations and atomic operations entirely in the direct I/O code
(lower levels might still do memory allocations and will usually
have at least some atomic operations, though).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:45 -07:00
Seth Forshee f97df70b1c xenfs: Use proc_create_mount_point() to create /proc/xen
Mounting proc in user namespace containers fails if the xenbus
filesystem is mounted on /proc/xen because this directory fails
the "permanently empty" test. proc_create_mount_point() exists
specifically to create such mountpoints in proc but is currently
proc-internal. Export this interface to modules, then use it in
xenbus when creating /proc/xen.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2016-11-17 13:52:18 +01:00
Andreas Gruenbacher 4a59015372 xattr: Fix setting security xattrs on sockfs
The IOP_XATTR flag is set on sockfs because sockfs supports getting the
"system.sockprotoname" xattr.  Since commit 6c6ef9f2, this flag is checked for
setxattr support as well.  This is wrong on sockfs because security xattr
support there is supposed to be provided by security_inode_setsecurity.  The
smack security module relies on socket labels (xattrs).

Fix this by adding a security xattr handler on sockfs that returns
-EAGAIN, and by checking for -EAGAIN in setxattr.

We cannot simply check for -EOPNOTSUPP in setxattr because there are
filesystems that neither have direct security xattr support nor support
via security_inode_setsecurity.  A more proper fix might be to move the
call to security_inode_setsecurity into sockfs, but it's not clear to me
if that is safe: we would end up calling security_inode_post_setxattr after
that as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-17 00:00:23 -05:00
Linus Torvalds 984573abf8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "A regression fix and bug fix bound for stable"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix fuse_write_end() if zero bytes were copied
  fuse: fix root dentry initialization
2016-11-16 09:20:10 -08:00
Mike Marshall 19ff7fcc76 orangefs: add .owner to debugfs file_operations
Without ".owner = THIS_MODULE" it is possible to crash the kernel
by unloading the Orangefs module while someone is reading debugfs
files.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-11-16 11:52:19 -05:00
Nicolas Pitre baa73d9e47 posix-timers: Make them configurable
Some embedded systems have no use for them.  This removes about
25KB from the kernel binary size when configured out.

Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime: timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.

The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 09:26:35 +01:00
Kees Cook fc46d4e453 ramoops: add pdata NULL check to ramoops_probe
This adds a check for a NULL platform data, which should only be possible
if a driver incorrectly sets up a probe request without also having defined
the platform_data structure. This is based on a patch from Geliang Tang.

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:32 -08:00
Namhyung Kim 70ad35db33 pstore: Convert console write to use ->write_buf
Maybe I'm missing something, but I don't know why it needs to copy the
input buffer to psinfo->buf and then write.  Instead we can write the
input buffer directly.  The only implementation that supports console
message (i.e. ramoops) already does it for ftrace messages.

For the upcoming virtio backend driver, it needs to protect psinfo->buf
overwritten from console messages.  If it could use ->write_buf method
instead of ->write, the problem will be solved easily.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:32 -08:00
Namhyung Kim e9e360b08a pstore: Protect unlink with read_mutex
When update_ms is set, pstore_get_records() will be called when there's
a new entry.  But unlink can be called at the same time and might
contend with the open-read-close loop.  Depending on the implementation
of platform driver, it may be safe or not.  But I think it'd be better
to protect those race in the first place.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:31 -08:00
Joel Fernandes 7a0032f504 pstore: Use global ftrace filters for function trace filtering
Currently, pstore doesn't have any filters setup for function tracing.
This has the associated overhead and may not be useful for users looking
for tracing specific set of functions.

ftrace's regular function trace filtering is done writing to
tracing/set_ftrace_filter however this is not available if not requested.
In order to be able to use this feature, the support to request global
filtering introduced earlier in the series should be requested before
registering the ftrace ops. Here we do the same.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:30 -08:00
Kees Cook a5d23b956c pstore: Clarify context field przs as dprzs
Since "przs" (persistent ram zones) is a general name in the code now, so
rename the Oops-dump zones to dprzs from przs.

Based on a patch from Nobuhiro Iwamatsu.

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:29 -08:00
Kees Cook c443a5f3f1 pstore: improve error report for failed setup
When setting ramoops record sizes, sometimes it's not clear which
parameters contributed to the allocation failure. This adds a per-zone
name and expands the failure reports.

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:28 -08:00
Joel Fernandes 2fbea82bbb pstore: Merge per-CPU ftrace records into one
Up until this patch, each of the per CPU ftrace buffers appear as a
separate ftrace-ramoops-N file. In this patch we merge all the zones into
one and populate a single ftrace-ramoops-0 file.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: clarified variables names, added -ENOMEM handling]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:28 -08:00
Joel Fernandes fbccdeb8d7 pstore: Add ftrace timestamp counter
In preparation for merging the per CPU buffers into one buffer when
we retrieve the pstore ftrace data, we store the timestamp as a
counter in the ftrace pstore record.  We store the CPU number as well
if !PSTORE_CPU_IN_IP, in this case we shift the counter and may lose
ordering there but we preserve the same record size. The timestamp counter
is also racy, and not doing any locking or synchronization here results
in the benefit of lower overhead. Since we don't care much here for exact
ordering of function traces across CPUs, we don't synchronize and may lose
some counter updates but I'm ok with that.

Using trace_clock() results in much lower performance so avoid using it
since we don't want accuracy in timestamp and need a rough ordering to
perform merge.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: updated commit message, added comments]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:27 -08:00
Joel Fernandes a1cf53ac6d ramoops: Split ftrace buffer space into per-CPU zones
If the RAMOOPS_FLAG_FTRACE_PER_CPU flag is passed to ramoops pdata, split
the ftrace space into multiple zones depending on the number of CPUs.

This speeds up the performance of function tracing by about 280% in my
tests as we avoid the locking. The trade off being lesser space available
per CPU. Let the ramoops user decide which option they want based on pdata
flag.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: added max_ftrace_cnt to track size, added DT logic and docs]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:26 -08:00
Kees Cook de83209249 pstore: Make ramoops_init_przs generic for other prz arrays
Currently ramoops_init_przs() is hard wired only for panic dump zone
array. In preparation for the ftrace zone array (one zone per-cpu) and pmsg
zone array, make the function more generic to be able to handle this case.

Heavily based on similar work from Joel Fernandes.

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:25 -08:00
Joel Fernandes 663deb4788 pstore: Allow prz to control need for locking
In preparation of not locking at all for certain buffers depending on if
there's contention, make locking optional depending on the initialization
of the prz.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: moved locking flag into prz instead of via caller arguments]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-15 16:34:25 -08:00
David S. Miller bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
Miklos Szeredi 59c3b76cc6 fuse: fix fuse_write_end() if zero bytes were copied
If pos is at the beginning of a page and copied is zero then page is not
zeroed but is marked uptodate.

Fix by skipping everything except unlock/put of page if zero bytes were
copied.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 6b12c1b37e ("fuse: Implement write_begin/write_end callbacks")
Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-11-15 12:34:21 +01:00
Eric Whitney d5c8dab6a8 ext4: remove parameter from ext4_xattr_ibody_set()
The parameter "handle" isn't used.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-14 21:56:48 -05:00
Eric Whitney 88e0387769 ext4: allow inode expansion for nojournal file systems
Runs of xfstest ext4/022 on nojournal file systems result in failures
because the inodes of some of its test files do not expand as expected.
The cause is a conditional in ext4_mark_inode_dirty() that prevents inode
expansion unless the test file system has a journal.  Remove this
unnecessary restriction.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-14 21:48:35 -05:00
Deepa Dinamani eeca7ea1ba ext4: use current_time() for inode timestamps
CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.

current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().

Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2016-11-14 21:40:10 -05:00
Chandan Rajendra 30a9d7afe7 ext4: fix stack memory corruption with 64k block size
The number of 'counters' elements needed in 'struct sg' is
super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
elements in the array. This is insufficient for block sizes >= 32k. In
such cases the memcpy operation performed in ext4_mb_seq_groups_show()
would cause stack memory corruption.

Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2016-11-14 21:26:26 -05:00
Chandan Rajendra 69e43e8cc9 ext4: fix mballoc breakage with 64k block size
'border' variable is set to a value of 2 times the block size of the
underlying filesystem. With 64k block size, the resulting value won't
fit into a 16-bit variable. Hence this commit changes the data type of
'border' to 'unsigned int'.

Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2016-11-14 21:04:37 -05:00
Andreas Gruenbacher db978da8fa proc: Pass file mode to proc_pid_make_inode
Pass the file mode of the proc inode to be created to
proc_pid_make_inode.  In proc_pid_make_inode, initialize inode->i_mode
before calling security_task_to_inode.  This allows selinux to set
isec->sclass right away without introducing "half-initialized" inode
security structs.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-14 15:39:48 -05:00
Julia Lawall 7ba630f54c nfsd: constify reply_cache_stats_operations structure
reply_cache_stats_operations, of type struct file_operations, is never
modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-14 15:24:19 -05:00
J. Bruce Fields 8838203667 nfsd: update workqueue creation
No real change in functionality, but the old interface seems to be
deprecated.

We don't actually care about ordering necessarily, but we do depend on
running at most one work item at a time: nfsd4_process_cb_update()
assumes that no other thread is running it, and that no new callbacks
are starting while it's running.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-14 11:13:43 -05:00
Theodore Ts'o 1566a48aaa ext4: don't lock buffer in ext4_commit_super if holding spinlock
If there is an error reported in mballoc via ext4_grp_locked_error(),
the code is holding a spinlock, so ext4_commit_super() must not try to
lock the buffer head, or else it will trigger a BUG:

  BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358
  in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount
  CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
   ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166
   ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153
   ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0
  Call Trace:
    [<ffffffff81318c89>] dump_stack+0x67/0x9e
    [<ffffffff810810b0>] ___might_sleep+0xf0/0x140
    [<ffffffff81081153>] __might_sleep+0x53/0xb0
    [<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290
    [<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230
    [<ffffffff81081153>] ? __might_sleep+0x53/0xb0
    [<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320

Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
(and it is the only caller which does so), avoid locking and unlocking
the buffer in this case.

This can result in races with ext4_commit_super() if there are other
problems (which is what commit 4743f83990 was trying to address),
but a Warning is better than BUG.

Fixes: 4743f83990
Cc: stable@vger.kernel.org # 4.9
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:29 -05:00
Theodore Ts'o d0abb36db4 ext4: allow ext4_ext_truncate() to return an error
Return errors to the caller instead of declaring the file system
corrupted.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:28 -05:00
Theodore Ts'o 2c98eb5ea2 ext4: allow ext4_truncate() to return an error
This allows us to properly propagate errors back up to
ext4_truncate()'s callers.  This also means we no longer have to
silently ignore some errors (e.g., when trying to add the inode to the
orphan inode list).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:26 -05:00
Theodore Ts'o 6da22013bb Merge branch 'fscrypt' into origin 2016-11-13 22:02:22 -05:00
Theodore Ts'o a2f6d9c4c0 Merge branch 'dax-4.10-iomap-pmd' into origin 2016-11-13 22:02:15 -05:00
Eric Biggers a6e0891286 fscrypto: don't use on-stack buffer for key derivation
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page.  get_crypt_info() was using a stack buffer to hold the
output from the encryption operation used to derive the per-file key.
Fix it by using a heap buffer.

This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 21:56:25 -05:00
Eric Biggers 08ae877f4e fscrypto: don't use on-stack buffer for filename encryption
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page.  For short filenames, fname_encrypt() was encrypting a
stack buffer holding the padded filename.  Fix it by encrypting the
filename in-place in the output buffer, thereby making the temporary
buffer unnecessary.

This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 21:56:19 -05:00
David Gstir 9c4bb8a3a9 fscrypt: Let fs select encryption index/tweak
Avoid re-use of page index as tweak for AES-XTS when multiple parts of
same page are encrypted. This will happen on multiple (partial) calls of
fscrypt_encrypt_page on same page.
page->index is only valid for writeback pages.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 20:18:16 -05:00
David Gstir 0b93e1b94b fscrypt: Constify struct inode pointer
Some filesystems, such as UBIFS, maintain a const pointer for struct
inode.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 20:18:01 -05:00
David Gstir 7821d4dd45 fscrypt: Enable partial page encryption
Not all filesystems work on full pages, thus we should allow them to
hand partial pages to fscrypt for en/decryption.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 18:55:21 -05:00
David Gstir b50f7b268b fscrypt: Allow fscrypt_decrypt_page() to function with non-writeback pages
Some filesystem might pass pages which do not have page->mapping->host
set to the encrypted inode. We want the caller to explicitly pass the
corresponding inode.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 18:53:10 -05:00
David Gstir 1c7dcf69ee fscrypt: Add in-place encryption mode
ext4 and f2fs require a bounce page when encrypting pages. However, not
all filesystems will need that (eg. UBIFS). This is handled via a
flag on fscrypt_operations where a fs implementation can select in-place
encryption over using a bounce page (which is the default).

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-13 18:47:04 -05:00
Deepa Dinamani 362ad5d58e fs: jfs: Replace CURRENT_TIME_SEC by current_time()
jfs uses nanosecond granularity for filesystem timestamps.
Only this assignment is not using nanosecond granularity.
Use current_time() to get the right granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2016-11-11 15:51:39 -06:00
Jens Axboe bbd7bb7017 block: move poll code to blk-mq
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-11 13:40:25 -07:00
Joel Fernandes d8991f51e5 pstore: Warn on PSTORE_TYPE_PMSG using deprecated function
PMSG now uses ramoops_pstore_write_buf_user() instead of ...write_buf().
Print a ratelimited warning if gets accidentally called.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: adjusted commit log and added -EINVAL return]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-11 10:36:46 -08:00
Joel Fernandes 109704492e pstore: Make spinlock per zone instead of global
Currently pstore has a global spinlock for all zones. Since the zones
are independent and modify different areas of memory, there's no need
to have a global lock, so we should use a per-zone lock as introduced
here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag
introduced later, which splits the ftrace memory area into a single zone
per CPU, it will eliminate the need for locking. In preparation for this,
make the locking optional.

Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: updated commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-11 10:35:37 -08:00
Linus Torvalds 968ef8de55 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  lib/stackdepot: export save/fetch stack for drivers
  mm: kmemleak: scan .data.ro_after_init
  memcg: prevent memcg caches to be both OFF_SLAB & OBJFREELIST_SLAB
  coredump: fix unfreezable coredumping task
  mm/filemap: don't allow partially uptodate page for pipes
  mm/hugetlb: fix huge page reservation leak in private mapping error paths
  ocfs2: fix not enough credit panic
  Revert "console: don't prefer first registered if DT specifies stdout-path"
  mm: hwpoison: fix thp split handling in memory_failure()
  swapfile: fix memory corruption via malformed swapfile
  mm/cma.c: check the max limit for cma allocation
  scripts/bloat-o-meter: fix SIGPIPE
  shmem: fix pageflags after swapping DMA32 object
  mm, frontswap: make sure allocated frontswap map is assigned
  mm: remove extra newline from allocation stall warning
2016-11-11 09:44:23 -08:00
Linus Torvalds c5e4ca6da9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "Christoph's and Jan's aio fixes, fixup for generic_file_splice_read
  (removal of pointless detritus that actually breaks it when used for
  gfs2 ->splice_read()) and fixup for generic_file_read_iter()
  interaction with ITER_PIPE destinations."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  splice: remove detritus from generic_file_splice_read()
  mm/filemap: don't allow partially uptodate page for pipes
  aio: fix freeze protection of aio writes
  fs: remove aio_run_iocb
  fs: remove the never implemented aio_fsync file operation
  aio: hold an extra file reference over AIO read/write operations
2016-11-11 09:19:01 -08:00
Linus Torvalds ef091b3cef Ceph's ->read_iter() implementation is incompatible with the new
generic_file_splice_read() code that went into -rc1.  Switch to the
 less efficient default_file_splice_read() for now; the proper fix is
 being held for 4.10.
 
 We also have a fix for a 4.8 regression and a trival libceph fixup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYJdjPAAoJEEp/3jgCEfOLzEoH/A3B1qqiqs2WoMn0O4pnEdcM
 TxaU46VOkYcK2wh/xbYAns2kZEXKgcCySv+kXn4l3Gh6/lXVxv4WexNqWdO1o6yN
 GqEufIH7yQM6QOE/hkwtUciBXmPfQMPxF14vvprYQuyu5Bs96mrphiAa7vX6Vbk5
 VhfE/j0shb8Q2QQj/Om0mWqM6JtOAlr5aFtEcJcodbCk1k8CptUcBsSoQ31PXMC7
 UcaBHh1VHGLvx9WeG1Rw1g9tc2LiUyu+UK0csolp51+amB7HezgfmzDQzHtXzBmm
 n90SQwonf0DrdWUGuQlOpHnREwxLSgN19s68FCjLc0jeMTP4b6TFEIUgFxiqWc4=
 =Ws5s
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client

Pull Ceph fixes from Ilya Dryomov:
 "Ceph's ->read_iter() implementation is incompatible with the new
  generic_file_splice_read() code that went into -rc1.  Switch to the
  less efficient default_file_splice_read() for now; the proper fix is
  being held for 4.10.

  We also have a fix for a 4.8 regression and a trival libceph fixup"

* tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client:
  libceph: initialize last_linger_id with a large integer
  libceph: fix legacy layout decode with pool 0
  ceph: use default file splice read callback
2016-11-11 09:17:10 -08:00
Linus Torvalds ef5beed998 NFS client bugfixes for Linux 4.9
Bugfixes:
 - Trim extra slashes in v4 nfs_paths to fix tools that use this
 - Fix a -Wmaybe-uninitialized warnings
 - Fix suspicious RCU usages
 - Fix Oops when mounting multiple servers at once
 - Suppress a false-positive pNFS error
 - Fix a DMAR failure in NFS over RDMA
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYJOCbAAoJENfLVL+wpUDrbO0QAIkcxdUu2iQeOrk07VP48kDE
 UEfJTal8vbW/KtKyL9bIeRa1qCvYpSJXnnKcR/Uo5VHE5nMz/5omoJofWf5Zg0UM
 iEHyZfOsuGFieBbl1NBaLjEd6MCoJYpmWFUj+3drZ8zqSdqDTL+JgrP7k3XEU2Mx
 glKb7U0AKoclm3h1MKyCyo5TgDVeI5TOhi+i3VVw2IN79VY2CUp4lHWMY4vloghp
 h+GuJWeVFS1nBpfCF9PpTU6LdHDfg4o/J5+DrP+IjIffD1XGzGEjfFR0BX5HyDcN
 PgOSF3fc7uVOOUIBEAqHUHY/7XiKlv6TEMRPdM8ALVoCXZ6hPSSFxq8JBJSWoVEp
 r11ts66VgYxdQgHbs51Y5AaKudLBwU60KosWuddbdZVb4YPM0cn5WQzVezrpoQYu
 k4rfrpt+LFv23NGfIJa6JaTSFBzM+YXmggEGUI8TI/YUFSN+wEp4uzLB4r19nqAP
 ff32iunzV9Z5edpPQFDCf3/1HAhzrL5KWo7E8EvijpdQKZl5k5CnUJxbG22lh4ct
 QIyYg51LjhCayzbRH8Mu+TKUFT29ORlcSp851BotLjT8ZdUetWXcFab93nAkQI7g
 sMREml4DvcXWy8qFAOzi8mX1ddTBumxBfOD0m3skPg+odxwsl/KiwjLCRwfTrgwS
 jfSXsXmrwTniPCDWgKg3
 =hFod
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Most of these fix regressions in 4.9, and none are going to stable
  this time around.

  Bugfixes:
   - Trim extra slashes in v4 nfs_paths to fix tools that use this
   - Fix a -Wmaybe-uninitialized warnings
   - Fix suspicious RCU usages
   - Fix Oops when mounting multiple servers at once
   - Suppress a false-positive pNFS error
   - Fix a DMAR failure in NFS over RDMA"

* tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
  fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
  NFS: Don't print a pNFS error if we aren't using pNFS
  NFS: Ignore connections that have cl_rpcclient uninitialized
  SUNRPC: Fix suspicious RCU usage
  NFSv4.1: work around -Wmaybe-uninitialized warning
  NFS: Trim extra slash in v4 nfs_path
2016-11-11 09:15:30 -08:00
Linus Torvalds a4fac3b5d1 xfs: update for 4.9-rc5
In this update:
 o fix for aborting deferred transactions on filesystem shutdown.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYI/hIAAoJEK3oKUf0dfodjtIP/28PmMnwkSwC3IfNzMEuIlvM
 Q96IlnIXiOlqXSIlO69VM2cS8YCW2oVUYD6pjbhC2YcOaogYGo4TlFmQMrFL4dgG
 /q9aH+Hf6jGgGfwpC2MKq6WIYFU8oYYzG2FIm3jyEnnDCIFMH4H+FYGgzrNVIraQ
 31nWn3ye/xHD44EWvRbEUAE0ROxbZfgm7+QST+R2+lAhLitTqZLNKkenlN8v5P9h
 WInRYBHxq5beIbbhAgm50wKfvxalrYChDeujorGsGAjJLliWpgJnbah6TwPNvRXH
 SwSkJfzI53AFXqH/45P2X5Ib34P7mIz6hTl3zednfU1GBeB3PxUGay6QU0rxRKAL
 4vv/RB6pd5EzmU+hvnLJbM7lpRWpnV2FoP9YwRoqIR8vUvlXTYLpecEw1mtJwFmd
 VZ99F+8mtz+jGKK01hZDAtP7tHt1JSFhOnOQ506UsmnqhbgWaYuDEOOktX08iTik
 +zwTETuoh4frQDVLjrQ507RWJl3XblWXEgD4Dw+N19w7oTSAbuF6zIU7sTlcLJSx
 APqD1uJUzvfYV11eyY7bosuCx7QtTRwUQWQSkKu0f9FXAuILf2KdDI62lZCnKi5M
 6wTUKFNZfqED+ergGyzPuVuQr/PhbQPEC61EehYMpQPSjeVEay2e30ouM6A+8VPp
 ZQal03fQzJvbxm/+8m3v
 =62jA
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-linus-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fix from Dave Chinner:
 "This is a fix for an unmount hang (regression) when the filesystem is
  shutdown.  It was supposed to go to you for -rc3, but I accidentally
  tagged the commit prior to it in that pullreq.

  Summary:

   - fix for aborting deferred transactions on filesystem shutdown"

* tag 'xfs-fixes-for-linus-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: defer should abort intent items if the trans roll fails
2016-11-11 09:13:48 -08:00
Andrey Ryabinin 70d78fe7c8 coredump: fix unfreezable coredumping task
It could be not possible to freeze coredumping task when it waits for
'core_state->startup' completion, because threads are frozen in
get_signal() before they got a chance to complete 'core_state->startup'.

Inability to freeze a task during suspend will cause suspend to fail.
Also CRIU uses cgroup freezer during dump operation.  So with an
unfreezable task the CRIU dump will fail because it waits for a
transition from 'FREEZING' to 'FROZEN' state which will never happen.

Use freezer_do_not_count() to tell freezer to ignore coredumping task
while it waits for core_state->startup completion.

Link: http://lkml.kernel.org/r/1475225434-3753-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00
Junxiao Bi d006c71f8a ocfs2: fix not enough credit panic
The following panic was caught when run ocfs2 disconfig single test
(block size 512 and cluster size 8192).  ocfs2_journal_dirty() return
-ENOSPC, that means credits were used up.

The total credit should include 3 times of "num_dx_leaves" from
ocfs2_dx_dir_rebalance(), because 2 times will be consumed in
ocfs2_dx_dir_transfer_leaf() and 1 time will be consumed in
ocfs2_dx_dir_new_cluster() -> __ocfs2_dx_dir_new_cluster() ->
ocfs2_dx_dir_format_cluster().  But only two times is included in
ocfs2_dx_dir_rebalance_credits(), fix it.

This can cause read-only fs(v4.1+) or panic for mainline linux depending
on mount option.

  ------------[ cut here ]------------
  kernel BUG at fs/ocfs2/journal.c:775!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 #2
  Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
  task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000
  RIP: ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
  RSP: 0018:ffff8800a7d4b6d8  EFLAGS: 00010286
  RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9
  RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460
  RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460
  R10: 0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e
  R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070
  FS:  00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0
  Call Trace:
    ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2]
    ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2]
    ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2]
    ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2]
    ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2]
    ocfs2_mknod+0x5a2/0x1400 [ocfs2]
    ocfs2_create+0x73/0x180 [ocfs2]
    vfs_create+0xd8/0x100
    lookup_open+0x185/0x1c0
    do_last+0x36d/0x780
    path_openat+0x92/0x470
    do_filp_open+0x4a/0xa0
    do_sys_open+0x11a/0x230
    SyS_open+0x1e/0x20
    system_call_fastpath+0x12/0x71
  Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
  RIP  ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
  ---[ end trace 91ac5312a6ee1288 ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

Link: http://lkml.kernel.org/r/1478248135-31963-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00
Al Viro e519e77747 splice: remove detritus from generic_file_splice_read()
i_size check is a leftover from the horrors that used to play with
the page cache in that function.  With the switch to ->read_iter(),
it's neither needed nor correct - for gfs2 it ends up being buggy,
since i_size is not guaranteed to be correct until later (inside
->read_iter()).

Spotted-by: Abhi Das <adas@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-10 18:32:13 -05:00
Yan, Zheng 8a8d561766 ceph: use default file splice read callback
Splice read/write implementation changed recently. When using
generic_file_splice_read(), iov_iter with type == ITER_PIPE is
passed to filesystem's read_iter callback. But ceph_sync_read()
can't serve ITER_PIPE iov_iter correctly (ITER_PIPE iov_iter
expects pages from page cache).

Fixing ceph_sync_read() requires a big patch. So use default
splice read callback for now.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-11-10 20:13:04 +01:00
Dave Chinner 0fc204e2eb Merge branch 'xfs-4.10-misc-fixes-1' into for-next 2016-11-10 10:29:43 +11:00
Dave Chinner 8f23d318aa Merge branch 'xfs-4.10-libxfs-cleanups' into for-next 2016-11-10 10:29:29 +11:00
Dave Chinner b649c42e25 Merge branch 'dax-4.10-iomap-pmd' into for-next 2016-11-10 10:29:06 +11:00
Jan Kara 9484ab1bf4 dax: Introduce IOMAP_FAULT flag
Introduce a flag telling iomap operations whether they are handling a
fault or other IO. That may influence behavior wrt inode size and
similar things.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-10 10:26:50 +11:00
Sebastian Andrzej Siewior fc4d24c9b4 fs/buffer: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-2-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 23:45:25 +01:00
Brian Foster 98efe8af1c xfs: fix unbalanced inode reclaim flush locking
Filesystem shutdown testing on an older distro kernel has uncovered an
imbalanced locking pattern for the inode flush lock in
xfs_reclaim_inode(). Specifically, there is a double unlock sequence
between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
"reclaim:" label.

This actually does not cause obvious problems on current kernels due to
the current flush lock implementation. Older kernels use a counting
based flush lock mechanism, however, which effectively breaks the lock
indefinitely when an already unlocked flush lock is repeatedly unlocked.
Though this only currently occurs on filesystem shutdown, it has
reproduced the effect of elevating an fs shutdown to a system-wide crash
or hang.

As it turns out, the flush lock is not actually required for the reclaim
logic in xfs_reclaim_inode() because by that time we have already cycled
the flush lock once while holding ILOCK_EXCL. Therefore, remove the
additional flush lock/unlock cycle around the 'reclaim:' label and
update branches into this label to release the flush lock where
appropriate. Add an assert to xfs_ifunlock() to help prevent future
occurences of the same problem.

Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-10 08:23:22 +11:00
Linus Torvalds 3c6106da74 We recently refactored the Orangefs debugfs code.
The refactor seemed to trigger dan.carpenter@oracle.com's
 static tester to find a possible double-free in the code.
 
 While designing the fix we saw a condition under which the
 buffer being freed could also be overflowed.
 
 We also realized how to rebuild the related debugfs file's
 "contents" (a string) without deleting and re-creating the file.
 
 This fix should eliminate the possible double-free, the
 potential overflow and improve code readability.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYI04nAAoJEM9EDqnrzg2+rT4P/1sN1ZUDwKgyJ3Qk3n5AAvlR
 PtbqeRhzUD7QdTR5yb/k/37rqYB7BBB5xd5VDlYKuD8luppjoAS2J4SRkngPFQiV
 NZMP1Sq5nWeEyeG8it+MzH364zBUGK+D94VqlhUDLHKa27WMWTB2vrcLq/DI06np
 35r4dWnV3+2+lgg0zvJGP7QoQLlPByB5q7pwTA9TBaPnDHVh/Myq4jS70wfEYDqY
 NMxkq02vQHS7a2mysZbrE2fXC2OBRTCGP+9lsdvJx9XfYQkIHfe4qMAxu0XlqDSM
 6POHEr5cPizENGSp7myV9G97FuF0UHZXnLEKe04DLbuGao2omMiNHvrnS/zPgKRQ
 zpCMsf4FVChjCipQzne1eLQbQskWVDy58ziURzefO+bV8aFe61KBOvAiNEZ+7i21
 CeEBeU2A5brd1y6ELcFCf2SDjmyd/ScyYqwIIrsY0eK1D3GveeHX1tkMGH8y7hWD
 CoY/cKUSHwZ6ZwFpzeCs96wzZ19o7BLKogyIyYQWc1s09uijIHInxkRxUqtfVErj
 2SpOXOqLkpjpU0Kga8beU6OtXLRv/ob518RPTPF5NAzjl08xn20pZckKL092uhVj
 k/VlJUE72Om2XJmBbta2Sz7iN8QfujJO0Ql/Nl51lX++pVbQIjFDaRhIYt6Yb0af
 y1XRJXqDJ3sh9J/od3fN
 =ELyS
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.9-rc4-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs fix from Mike Marshall:
 "We recently refactored the Orangefs debugfs code. The refactor seemed
  to trigger dan.carpenter@oracle.com's static tester to find a possible
  double-free in the code.

  While designing the fix we saw a condition under which the buffer
  being freed could also be overflowed.

  We also realized how to rebuild the related debugfs file's "contents"
  (a string) without deleting and re-creating the file.

  This fix should eliminate the possible double-free, the potential
  overflow and improve code readability"

* tag 'for-linus-4.9-rc4-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: clean up debugfs
2016-11-09 11:36:43 -08:00
Darrick J. Wong bec9d48d7a xfs: check minimum block size for CRC filesystems
Check the minimum block size on v5 filesystems.

[dchinner: cleaned up XFS_MIN_CRC_BLOCKSIZE check]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-09 12:11:12 +11:00
Li Pengcheng 959217c84c pstore: Actually give up during locking failure
Without a return after the pr_err(), dumps will collide when two threads
call pstore_dump() at the same time.

Signed-off-by: Liu Hailong <liuhailong5@huawei.com>
Signed-off-by: Li Pengcheng <lipengcheng8@huawei.com>
Signed-off-by: Li Zhong <lizhong11@hisilicon.com>
[kees: improved commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-08 16:44:33 -08:00
Eric Sandeen 5d829300be xfs: provide helper for counting extents from if_bytes
The open-coded pattern:

ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

is all over the xfs code; provide a new helper
xfs_iext_count(ifp) to count the number of inline extents
in an inode fork.

[dchinner: pick up several missed conversions]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:59:42 +11:00
Eric Sandeen 4dfce57db6 xfs: fix up xfs_swap_extent_forks inline extent handling
There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
    [exception RIP: xfs_trans_log_inode+0x10]
 #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d

As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate.  When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.

This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.

Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.

Thanks to dchinner for looking at this with me and spotting the
root cause.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:55:18 +11:00
Brian Foster 04197b341f xfs: don't BUG() on mixed direct and mapped I/O
We've had reports of generic/095 causing XFS to BUG() in
__xfs_get_blocks() due to the existence of delalloc blocks on a
direct I/O read. generic/095 issues a mix of various types of I/O,
including direct and memory mapped I/O to a single file. This is
clearly not supported behavior and is known to lead to such
problems. E.g., the lack of exclusion between the direct I/O and
write fault paths means that a write fault can allocate delalloc
blocks in a region of a file that was previously a hole after the
direct read has attempted to flush/inval the file range, but before
it actually reads the block mapping. In turn, the direct read
discovers a delalloc extent and cannot proceed.

While the appropriate solution here is to not mix direct and memory
mapped I/O to the same regions of the same file, the current
BUG_ON() behavior is probably overkill as it can crash the entire
system.  Instead, localize the failure to the I/O in question by
returning an error for a direct I/O that cannot be handled safely
due to delalloc blocks. Be careful to allow the case of a direct
write to post-eof delalloc blocks. This can occur due to speculative
preallocation and is safe as post-eof blocks are not accompanied by
dirty pages in pagecache (conversely, preallocation within eof must
have been zeroed, and thus dirtied, before the inode size could have
been increased beyond said blocks).

Finally, provide an additional warning if a direct I/O write occurs
while the file is memory mapped. This may not catch all problematic
scenarios, but provides a hint that some known-to-be-problematic I/O
methods are in use.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:54:14 +11:00
Brian Foster 399372349a xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan
The cowblocks background scanner currently clears the cowblocks tag
for inodes without any real allocations in the cow fork. This
excludes inodes with only delalloc blocks in the cow fork. While we
might never expect to clear delalloc blocks from the cow fork in the
background scanner, it is not necessarily correct to clear the
cowblocks tag from such inodes.

For example, if the background scanner happens to process an inode
between a buffered write and writeback, the scanner catches the
inode in a state after delalloc blocks have been allocated to the
cow fork but before the delalloc blocks have been converted to real
blocks by writeback. The background scanner then incorrectly clears
the cowblocks tag, even if part of the aforementioned delalloc
reservation will not be remapped to the data fork (i.e., extra
blocks due to the cowextsize hint). This means that any such
additional blocks in the cow fork might never be reclaimed by the
background scanner and could persist until the inode itself is
reclaimed.

To address this problem, only skip and clear inodes without any cow
fork allocations whatsoever from the background scanner. While we
generally do not want to cancel delalloc reservations from the
background scanner, the pagecache dirty check following the
cowblocks check should prevent that situation. If we do end up with
delalloc cow fork blocks without a dirty address space mapping, this
is probably an indication that something has gone wrong and the
blocks should be reclaimed, as they may never be converted to a real
allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 12:53:33 +11:00
Darrick J. Wong 4fd29ec472 xfs: check return value of _trans_reserve_quota_nblks
Check the return value of xfs_trans_reserve_quota_nblks for errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:59:26 +11:00
Darrick J. Wong 5e52365ac8 xfs: move dir_ino_validate declaration per xfsprogs
Move the declaration of _dir_ino_validate out of the private
dir2 header file into the public one, since xfsprogs did that
for the benefit of xfs_repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:59:12 +11:00
Eric Sandeen e6fc6fcf44 xfs: don't call xfs_sb_quota_from_disk twice
Source xfsprogs commit: ee3754254e8c186c99b6cdd4d59f741759d04acb

Kernel commit 5ef828c4 ("xfs: avoid false quotacheck after unclean
shutdown") made xfs_sb_from_disk() also call xfs_sb_quota_from_disk
by default.

However, when this was merged to libxfs, existing separate
calls to libxfs_sb_quota_from_disk remained, and calling it
twice in a row on a V4 superblock leads to issues, because:

        if (sbp->sb_qflags & XFS_PQUOTA_ACCT)  {
...
                sbp->sb_pquotino = sbp->sb_gquotino;
                sbp->sb_gquotino = NULLFSINO;

and after the second call, we have set both pquotino and gquotino
to NULLFSINO.

Fix this by making it safe to call twice, and also remove the extra
calls to libxfs_sb_quota_from_disk.

This is only spotted when running xfstests with "-m crc=0" because
the sb_from_disk change came about after V5 became default, and
the above behavior only exists on a V4 superblock.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:58:55 +11:00
Darrick J. Wong 523b2e76e3 libxfs: clean up _dir2_data_freescan
Refactor the implementations of xfs_dir2_data_freescan into a
routine that takes the raw directory block parameters and
a second function that figures out the raw parameters from the
directory inode.  This enables us to use the exact same code
for both userspace and the kernel, since repair knows exactly
which directory block geometry parameters it needs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:56:51 +11:00
Darrick J. Wong ae90b994b4 libxfs: fix xfs_attr_shortform_bytesfit declaration
Change the xfs_attr_shortform_bytesfit declaration to have
struct xfs_inode to avoid tripping up the libxfs-diff scanner.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:56:20 +11:00
Darrick J. Wong 68c098582b libxfs: fix whitespace problems
Fix some whitespace problems that trip up my libxfs-diff script.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:56:13 +11:00
Darrick J. Wong 420fbeb4bf libxfs: synchronize dinode_verify with userspace
The userspace version of _dinode_verify takes a raw inode number
instead of an inode itself.  Since neither version actually needs
the inode, port the changes to the kernel.  This will also reduce
the libxfs diff noise.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:56:06 +11:00
Darrick J. Wong 755c7bf5dd libxfs: convert ushort to unsigned short
Since xfsprogs dropped ushort in favor of unsigned short, do that
here too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:55:48 +11:00
Ross Zwisler 190b5caad7 dax: remove "depends on BROKEN" from FS_DAX_PMD
Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:35:16 +11:00
Ross Zwisler 862f1b9d67 xfs: use struct iomap based DAX PMD fault path
Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:35:02 +11:00
Ross Zwisler 642261ac99 dax: add struct iomap based DAX PMD support
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:34:45 +11:00
Ross Zwisler 422476c464 dax: move put_(un)locked_mapping_entry() in dax.c
No functional change.

The static functions put_locked_mapping_entry() and
put_unlocked_mapping_entry() will soon be used in error cases in
grab_mapping_entry(), so move their definitions above this function.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:33:44 +11:00
Ross Zwisler fa28f7296a dax: move RADIX_DAX_* defines to dax.h
The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:33:35 +11:00
Ross Zwisler 1550290b08 dax: dax_iomap_fault() needs to call iomap_end()
Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS.  ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation.  However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc.  So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:33:26 +11:00
Ross Zwisler 333ccc978e dax: add dax_iomap_sector() helper function
To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:33:09 +11:00
Ross Zwisler 11c59c92f4 dax: correct dax iomap code namespace
The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:46 +11:00
Ross Zwisler b9fde0462e dax: remove dax_pmd_fault()
dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:35 +11:00
Ross Zwisler 63e95b5c4f dax: coordinate locking for offsets in PMD range
DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:20 +11:00
Ross Zwisler e3ad61c64a dax: consistent variable naming for DAX entries
No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:12 +11:00
Ross Zwisler aada54f980 dax: remove the last BUG_ON() from fs/dax.c
Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:32:00 +11:00
Ross Zwisler ce95ab0fa6 dax: make 'wait_table' global variable static
The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:31:44 +11:00
Ross Zwisler 03e0990fc8 ext2: remove support for DAX PMD faults
DAX PMD support was added via the following commit:

commit e7b1ea2ad6 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:31:33 +11:00
Ross Zwisler fa0d3fce7c dax: remove buffer_size_valid()
Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:31:14 +11:00
Ross Zwisler 547edce3ba ext4: tell DAX the size of allocation holes
When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is current worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08 11:30:58 +11:00
Shuah Khan 0ac84b72c0 fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
Fix the following warn:

fs/nfs/nfs4session.c: In function ‘nfs4_slot_seqid_in_use’:
fs/nfs/nfs4session.c:203:54: warning: ‘cur_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (nfs4_slot_get_seqid(tbl, slotid, &cur_seq) == 0 &&
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      cur_seq == seq_nr && test_bit(slotid, tbl->used_slots))
      ~~~~~~~~~~~~~~~~~

Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 16:11:30 -05:00
Anna Schumaker 192747166a NFS: Don't print a pNFS error if we aren't using pNFS
We used to check for a valid layout type id before verifying pNFS flags
as an indicator for if we are using pNFS.  This changed in 3132e49ece
with the introduction of multiple layout types, since now we are passing
an array of ids instead of just one.  Since then, users have been seeing
a KERN_ERR printk show up whenever mounting NFS v4 without pNFS.  This
patch restores the original behavior of exiting set_pnfs_layoutdriver()
early if we aren't using pNFS.

Fixes 3132e49ece ("pnfs: track multiple layout types in fsinfo
structure")
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 16:11:30 -05:00
Petr Vandrovec 8ef3295530 NFS: Ignore connections that have cl_rpcclient uninitialized
cl_rpcclient starts as ERR_PTR(-EINVAL), and connections like that
are floating freely through the system.  Most places check whether
pointer is valid before dereferencing it, but newly added code
in nfs_match_client does not.

Which causes crashes when more than one NFS mount point is present.

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07 16:11:29 -05:00
Mike Marshall dc0336214e orangefs: clean up debugfs
We recently refactored the Orangefs debugfs code.
The refactor seemed to trigger dan.carpenter@oracle.com's
static tester to find a possible double-free in the code.

While designing the fix we saw a condition under which the
buffer being freed could also be overflowed.

We also realized how to rebuild the related debugfs file's
"contents" (a string) without deleting and re-creating the file.

This fix should eliminate the possible double-free, the
potential overflow and improve code readability.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
2016-11-07 10:41:55 -05:00
Linus Torvalds fb415f222c Fixes for some recent regressions including fallout from the vmalloc'd
stack change (after which we can no longer encrypt stuff on the stack).
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYHNwpAAoJECebzXlCjuG++DMP/3mUUAF09DfFR/EHl7knDT1f
 kZ53UVHYzr02w0wXfwxVLlp2H7TdSAufgsSvPT6qksA3eY7gL6nJ9zHkl+Nv5yCx
 y6vsFWjO1QEUWFOZWCKcmT2dAI3Ddt9IhK13pfZEKN1XKvK2zWB16HEVzSg6fR2K
 NwHlpMnQUI4HWThURzwTZb1M5YhxRCAnyiv8BTAAPjbEfzPzdL7j3jxwqtH8bOWp
 qIcDDvjC744b9zy0YuAEY/NyGBhYZPdM6gWsBBes1TRzBWUL9qsUYTWDJTmg/F1l
 Or0Jz7CUEN9uOHLGnkATPDc+eBg9YFV+bSsSnJu1/W4Er7dX1Af/lol79zEp/Zw1
 Snd9FelSPj3vxmYAFTCLnHRTRgsyiDhbbb7gVrzH9bxnCrRNR6p2kY018s1Cl9Td
 uWQoNNFQwwnYxWYEeZdO5PgX+pcgoCzhHACNk5oA93YaBE0GuLHHugwwIrYE8TM1
 1iY20sLC5lJcnPqxdgnoprZnnHMuL6rx5KRbvBeflNZ4huK2PIcPJyeB83XH6s12
 G67PjJ0rfWzSBF14O/ZtQA6he+kXvnH3pKqpNnaMiBxZZ2J8E1eQvrKTLLIwmtlP
 18KKJpZIzh7jTTZ/99nAMAt/BGw97P9TToLdnI8dCxYygHEaywpEYtcsE8IWFAvA
 3XkS5QdlJhhAaAUUYBXy
 =oPbZ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfixes from Bruce Fields:
 "Fixes for some recent regressions including fallout from the vmalloc'd
  stack change (after which we can no longer encrypt stuff on the
  stack)"

* tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix general protection fault in release_lock_stateid()
  svcrdma: backchannel cannot share a page for send and rcv buffers
  sunrpc: fix some missing rq_rbuffer assignments
  sunrpc: don't pass on-stack memory to sg_set_buf
  nfsd: move blocked lock handling under a dedicated spinlock
2016-11-04 20:12:10 -07:00
Linus Torvalds 46d7cbb2c4 Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from Chris Mason:
 "Some fixes that Dave Sterba collected.  We held off on these last week
  because I was focused on the memory corruption testing"

* 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix WARNING in btrfs_select_ref_head()
  Btrfs: remove some no-op casts
  btrfs: pass correct args to btrfs_async_run_delayed_refs()
  btrfs: make file clone aware of fatal signals
  btrfs: qgroup: Prevent qgroup->reserved from going subzero
  Btrfs: kill BUG_ON in do_relocation
2016-11-04 20:08:16 -07:00
Linus Torvalds bd30fac18f Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Fix two more POSIX ACL bugs introduced in 4.8 and add a missing fsync
  during copy up to prevent possible data loss"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fsync after copy-up
  ovl: fix get_acl() on tmpfs
  ovl: update S_ISGID when setting posix ACLs
2016-11-04 20:03:14 -07:00
Jan Kara ce98321bf7 fs: Remove unmap_underlying_metadata
Nobody is using this function anymore. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara e64855c6cf fs: Add helper to clean bdev aliases under a bh and use it
Add a helper function that clears buffer heads from a block device
aliasing passed bh. Use this helper function from filesystems instead of
the original unmap_underlying_metadata() to save some boiler plate code
and also have a better name for the functionalily since it is not
unmapping anything for a *long* time.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara 69a9bea146 ext2: Use clean_bdev_aliases() instead of iteration
Use clean_bdev_aliases() instead of iterating through blocks one by one.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara 64e1c57fa4 ext4: Use clean_bdev_aliases() instead of iteration
Use clean_bdev_aliases() instead of iterating through blocks one by one.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara f734c89cc9 direct-io: Use clean_bdev_aliases() instead of handmade iteration
Use new provided function instead of an iteration through all allocated
blocks.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Jan Kara 29f3ad7d83 fs: Provide function to unmap metadata for a range of blocks
Provide function equivalent to unmap_underlying_metadata() for a range
of blocks. We somewhat optimize the function to use pagevec lookups
instead of looking up buffer heads one by one and use page lock to pin
buffer heads instead of mapping's private_lock to improve scalability.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-04 14:34:47 -06:00
Alexey Dobriyan b4eb4f7f1a audit: less stack usage for /proc/*/loginuid
%u requires 10 characters at most not 20.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-03 17:20:00 -04:00
Jens Axboe 7637241e65 writeback: add wbc_to_write_flags()
Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-02 10:24:03 -06:00
Chris Mason e3597e6090 Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9 2016-11-01 12:54:45 -07:00
J. Bruce Fields e864c189e1 nfsd: catch errors in decode_fattr earlier
3c8e03166a "NFSv4: do exact check about attribute specified" fixed
some handling of unsupported-attribute errors, but it also delayed
checking for unwriteable attributes till after we decode them.  This
could lead to odd behavior in the case a client attemps to set an
attribute we don't know about followed by one we try to parse.  In that
case the parser for the known attribute will attempt to parse the
unknown attribute.  It should fail in some safe way, but the error might
at least be incorrect (probably bad_xdr instead of inval).  So, it's
better to do that check at the start.

As far as I know this doesn't cause any problems with current clients
but it might be a minor issue e.g. if we encounter a future client that
supports a new attribute that we currently don't.

Cc: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:52 -04:00
J. Bruce Fields 916d2d844a nfsd: clean up supported attribute handling
Minor cleanup, no change in behavior.

Provide helpers for some common attribute bitmap operations.  Drop some
comments that just echo the code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:52 -04:00
Jeff Layton 851238a22f nfsd: fix error handling for clients that fail to return the layout
Currently, when the client continually returns NFS4ERR_DELAY on a
CB_LAYOUTRECALL, we'll give up trying to retransmit after two lease
periods, but leave the layout in place.

What we really need to do here is fence the client in this case. Have it
fall through to that code in that case instead of into the
NFS4ERR_NOMATCHING_LAYOUT case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:43 -04:00
Jeff Layton 8f97514b42 nfsd: more robust allocation failure handling in nfsd_reply_cache_init
Currently, we try to allocate the cache as a single, large chunk, which
can fail if no big chunks of memory are available. We _do_ try to size
it according to the amount of memory in the box, but if the server is
started well after boot time, then the allocation can fail due to memory
fragmentation.

Fall back to doing a vzalloc if the kcalloc fails, and switch the
shutdown code to do a kvfree to handle freeing correctly.

Reported-by: Olaf Hering <olaf@aepfle.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:47:43 -04:00
Chuck Lever f46c445b79 nfsd: Fix general protection fault in release_lock_stateid()
When I push NFSv4.1 / RDMA hard, (xfstests generic/089, for example),
I get this crash on the server:

Oct 28 22:04:30 klimt kernel: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
Oct 28 22:04:30 klimt kernel: Modules linked in: cts rpcsec_gss_krb5 iTCO_wdt iTCO_vendor_support sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm btrfs irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd xor pcspkr raid6_pq i2c_i801 i2c_smbus lpc_ich mfd_core sg mei_me mei ioatdma shpchp wmi ipmi_si ipmi_msghandler rpcrdma ib_ipoib rdma_ucm acpi_power_meter acpi_pad ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb ahci libahci ptp mlx4_core pps_core dca libata i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
Oct 28 22:04:30 klimt kernel: CPU: 7 PID: 1558 Comm: nfsd Not tainted 4.9.0-rc2-00005-g82cd754 #8
Oct 28 22:04:30 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
Oct 28 22:04:30 klimt kernel: task: ffff880835c3a100 task.stack: ffff8808420d8000
Oct 28 22:04:30 klimt kernel: RIP: 0010:[<ffffffffa05a759f>]  [<ffffffffa05a759f>] release_lock_stateid+0x1f/0x60 [nfsd]
Oct 28 22:04:30 klimt kernel: RSP: 0018:ffff8808420dbce0  EFLAGS: 00010246
Oct 28 22:04:30 klimt kernel: RAX: ffff88084e6660f0 RBX: ffff88084e667020 RCX: 0000000000000000
Oct 28 22:04:30 klimt kernel: RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff88084e667020
Oct 28 22:04:30 klimt kernel: RBP: ffff8808420dbcf8 R08: 0000000000000001 R09: 0000000000000000
Oct 28 22:04:30 klimt kernel: R10: ffff880835c3a100 R11: ffff880835c3aca8 R12: 6b6b6b6b6b6b6b6b
Oct 28 22:04:30 klimt kernel: R13: ffff88084e6670d8 R14: ffff880835f546f0 R15: ffff880835f1c548
Oct 28 22:04:30 klimt kernel: FS:  0000000000000000(0000) GS:ffff88087bdc0000(0000) knlGS:0000000000000000
Oct 28 22:04:30 klimt kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 28 22:04:30 klimt kernel: CR2: 00007ff020389000 CR3: 0000000001c06000 CR4: 00000000001406e0
Oct 28 22:04:30 klimt kernel: Stack:
Oct 28 22:04:30 klimt kernel: ffff88084e667020 0000000000000000 ffff88084e6670d8 ffff8808420dbd20
Oct 28 22:04:30 klimt kernel: ffffffffa05ac80d ffff880835f54548 ffff88084e640008 ffff880835f545b0
Oct 28 22:04:30 klimt kernel: ffff8808420dbd70 ffffffffa059803d ffff880835f1c768 0000000000000870
Oct 28 22:04:30 klimt kernel: Call Trace:
Oct 28 22:04:30 klimt kernel: [<ffffffffa05ac80d>] nfsd4_free_stateid+0xfd/0x1b0 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffffa059803d>] nfsd4_proc_compound+0x40d/0x690 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffffa0583114>] nfsd_dispatch+0xd4/0x1d0 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffffa047bbf9>] svc_process_common+0x3d9/0x700 [sunrpc]
Oct 28 22:04:30 klimt kernel: [<ffffffffa047ca64>] svc_process+0xf4/0x330 [sunrpc]
Oct 28 22:04:30 klimt kernel: [<ffffffffa05827ca>] nfsd+0xfa/0x160 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffffa05826d0>] ? nfsd_destroy+0x170/0x170 [nfsd]
Oct 28 22:04:30 klimt kernel: [<ffffffff810b367b>] kthread+0x10b/0x120
Oct 28 22:04:30 klimt kernel: [<ffffffff810b3570>] ? kthread_stop+0x280/0x280
Oct 28 22:04:30 klimt kernel: [<ffffffff8174e8ba>] ret_from_fork+0x2a/0x40
Oct 28 22:04:30 klimt kernel: Code: c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 87 b0 00 00 00 48 89 fb 4c 8b a0 98 00 00 00 <49> 8b 44 24 20 48 8d b8 80 03 00 00 e8 10 66 1a e1 48 89 df e8
Oct 28 22:04:30 klimt kernel: RIP  [<ffffffffa05a759f>] release_lock_stateid+0x1f/0x60 [nfsd]
Oct 28 22:04:30 klimt kernel: RSP <ffff8808420dbce0>
Oct 28 22:04:30 klimt kernel: ---[ end trace cf5d0b371973e167 ]---

Jeff Layton says:
> Hm...now that I look though, this is a little suspicious:
>
>    struct nfs4_openowner *oo = openowner(stp->st_openstp->st_stateowner);
>
> I wonder if it's possible for the openstateid to have already been
> destroyed at this point.
>
> We might be better off doing something like this to get the client pointer:
>
>    stp->st_stid.sc_client;
>
> ...which should be more direct and less dependent on other stateids
> staying valid.

With the suggested change, I am no longer able to reproduce the above oops.

v2: Fix unhash_lock_stateid() as well

Fix-suggested-by: Jeff Layton <jlayton@redhat.com>
Fixes: 42691398be ('nfsd: Fix race between FREE_STATEID and LOCK')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-11-01 15:24:43 -04:00
Christoph Hellwig be297968da mm: only include blk_types in swap.h if CONFIG_SWAP is enabled
It's only needed for the CONFIG_SWAP-only use of bio_end_io_t.

Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some
ifdefs in blk_types.h.

Instead we'll need to add a few explicit includes that were implicit
before, though.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig 2f8b544477 block,fs: untangle fs.h and blk_types.h
Nothing in fs.h should require blk_types.h to be included.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig 70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig 67f055c798 btrfs: use op_is_sync to check for synchronous requests
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Andrey Vagin c62cce2cae net: add an ioctl to get a socket network namespace
Each socket operates in a network namespace where it has been created,
so if we want to dump and restore a socket, we have to know its network
namespace.

We have a socket_diag to get information about sockets, it doesn't
report sockets which are not bound or connected.

This patch introduces a new socket ioctl, which is called SIOCGSKNS
and used to get a file descriptor for a socket network namespace.

A task must have CAP_NET_ADMIN in a target network namespace to
use this ioctl.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 10:56:36 -04:00
Miklos Szeredi 641089c154 ovl: fsync after copy-up
Make sure the copied up file hits the disk before renaming to the final
destination.  If this is not done then the copy-up may corrupt the data in
the file in case of a crash.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-10-31 14:42:14 +01:00
Miklos Szeredi b93d4a0eb3 ovl: fix get_acl() on tmpfs
tmpfs doesn't have ->get_acl() because it only uses cached acls.

This fixes the acl tests in pjdfstest when tmpfs is used as the upper layer
of the overlay.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 39a25b2b37 ("ovl: define ->get_acl() for overlay inodes")
Cc: <stable@vger.kernel.org> # v4.8
2016-10-31 14:42:14 +01:00
Miklos Szeredi fd3220d37b ovl: update S_ISGID when setting posix ACLs
This change fixes xfstest generic/375, which failed to clear the
setgid bit in the following test case on overlayfs:

  touch $testfile
  chown 100:100 $testfile
  chmod 2755 $testfile
  _runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile

Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Amir Goldstein <amir73il@gmail.com>
Fixes: d837a49bd5 ("ovl: fix POSIX ACL setting")
Cc: <stable@vger.kernel.org> # v4.8
2016-10-31 14:42:14 +01:00
Jan Kara 70fe2f4815 aio: fix freeze protection of aio writes
Currently we dropped freeze protection of aio writes just after IO was
submitted. Thus aio write could be in flight while the filesystem was
frozen and that could result in unexpected situation like aio completion
wanting to convert extent type on frozen filesystem. Testcase from
Dmitry triggering this is like:

for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done &
fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \
    --runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite

Fix the problem by dropping freeze protection only once IO is completed
in aio_complete().

Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
[hch: forward ported on top of various VFS and aio changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30 13:09:42 -04:00
Christoph Hellwig 89319d31d2 fs: remove aio_run_iocb
Pass the ABI iocb structure to aio_setup_rw and let it handle the
non-vectored I/O case as well.  With that and a new helper for the AIO
return value handling we can now define new aio_read and aio_write
helpers that implement reads and writes in a self-contained way without
duplicating too much code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30 13:09:42 -04:00
Christoph Hellwig 723c038475 fs: remove the never implemented aio_fsync file operation
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30 13:09:42 -04:00
Christoph Hellwig 0b944d3a4b aio: hold an extra file reference over AIO read/write operations
Otherwise we might dereference an already freed file and/or inode
when aio_complete is called before we return from the read_iter or
write_iter method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-30 13:09:42 -04:00
David S. Miller 27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Linus Torvalds 2a26d99b25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of fixes, mostly drivers as is usually the case.

   1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
      Khoroshilov.

   2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
      Pedersen.

   3) Don't put aead_req crypto struct on the stack in mac80211, from
      Ard Biesheuvel.

   4) Several uninitialized variable warning fixes from Arnd Bergmann.

   5) Fix memory leak in cxgb4, from Colin Ian King.

   6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.

   7) Several VRF semantic fixes from David Ahern.

   8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.

   9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.

  10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.

  11) Fix stale link state during failover in NCSCI driver, from Gavin
      Shan.

  12) Fix netdev lower adjacency list traversal, from Ido Schimmel.

  13) Propvide proper handle when emitting notifications of filter
      deletes, from Jamal Hadi Salim.

  14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.

  15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.

  16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.

  17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.

  18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
      Leitner.

  19) Revert a netns locking change that causes regressions, from Paul
      Moore.

  20) Add recursion limit to GRO handling, from Sabrina Dubroca.

  21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.

  22) Avoid accessing stale vxlan/geneve socket in data path, from
      Pravin Shelar"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
  geneve: avoid using stale geneve socket.
  vxlan: avoid using stale vxlan socket.
  qede: Fix out-of-bound fastpath memory access
  net: phy: dp83848: add dp83822 PHY support
  enic: fix rq disable
  tipc: fix broadcast link synchronization problem
  ibmvnic: Fix missing brackets in init_sub_crq_irqs
  ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
  Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
  arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
  net/mlx4_en: Save slave ethtool stats command
  net/mlx4_en: Fix potential deadlock in port statistics flow
  net/mlx4: Fix firmware command timeout during interrupt test
  net/mlx4_core: Do not access comm channel if it has not yet been initialized
  net/mlx4_en: Fix panic during reboot
  net/mlx4_en: Process all completions in RX rings after port goes up
  net/mlx4_en: Resolve dividing by zero in 32-bit system
  net/mlx4_core: Change the default value of enable_qos
  net/mlx4_core: Avoid setting ports to auto when only one port type is supported
  net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
  ...
2016-10-29 20:33:20 -07:00
Linus Torvalds efa563752c This pull request contains fixes for issues in both UBI and UBIFS:
- A regression wrt. overlayfs, introduced in -rc2.
 - An UBI issue, found by Dan Carpenter's static checker.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJYFPHWAAoJEEtJtSqsAOnWcK4P/AwBcqPa0em/HXrdCExanQXY
 8U3uCPbDua4sW1Eaw5dVFoZuVoPzhibLLaVoVIWs8LOXiD8v23VYQ8ezu0D0O9fc
 cAsrxg0MtQLF/hyyVbdihxaqCB2H/j9PDJdIdCiRindPEwm0k6KBkVMk3N8O3m2U
 xDSA+Oq8Ns5cgjx+yfOhMJbGOFUzky26SV/M+PTAIU9Sj2w7RJS9R18BtWv4EFoK
 q1sT8aEte3kryb+v/a4s9RNzWOOHqRvZ4XizOMvma9I6uX6hOU4oeLknmJx1gPnb
 U5z75uAVn+IeNRnrco3pD91N3X9hEtv4IgZhFafNseVTY9MirDX5ss4th+XrSM6y
 wKgWEC8UmcV9Y7zDV/towZjhCipIh1yJPu3493IVHB/1UDPoNDfOGpK6NuhIEZHy
 1sNY8F2j3BBnLw6Fc2uC1FxM3a9MQ9CgJWQ0y9src73VNgQ8miz1WH2rsFp5DwNu
 HdZGBXGElmhbJbNFSsRqC1j+K0Y2LzL5BVOrBblkJNpUmxufRx0LIdXE7p4tPazq
 8dVOH/Ktx+mDQFbtyA8vXK+Cyyp0c/snR3BZo3AWLfrlip6iwZPG6arN4Wu6P4Nl
 ZFWUlHKaMJS/lvsdAuCdZ/lawRvENTOEQMORJR8U7CX/7gDLV1KiaFRpB3fFDUW5
 xm5r2qsbVzElu6skk4xk
 =eOKJ
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.9-rc3' of git://git.infradead.org/linux-ubifs

Pull ubi/ubifs fixes from Richard Weinberger:
 "This contains fixes for issues in both UBI and UBIFS:

   - A regression wrt overlayfs, introduced in -rc2.
   - An UBI issue, found by Dan Carpenter's static checker"

* tag 'upstream-4.9-rc3' of git://git.infradead.org/linux-ubifs:
  ubifs: Fix regression in ubifs_readdir()
  ubi: fastmap: Fix add_vol() return value test in ubi_attach_fastmap()
2016-10-29 13:15:24 -07:00
Linus Torvalds c636e176d8 driver core fixes for 4.9-rc3
Here are two small driver core / kernfs fixes for 4.9-rc3.  One makes
 the Kconfig entry for DEBUG_TEST_DRIVER_REMOVE a bit more explicit that
 this is a crazy thing to enable for a distro kernel (thanks for trying
 Fedora!), the other resolves an issue with vim opening kernfs files
 (sysfs, configfs, etc.).
 
 Both have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iFYEABECABYFAlgU0CMPHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspgv4AoJhR
 YJeG57ReBKjlzAj497Z1X7QcAJ9GXcbbbxmwj2IcUln5I3uEyuPCkQ==
 =pS6k
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are two small driver core / kernfs fixes for 4.9-rc3.

  One makes the Kconfig entry for DEBUG_TEST_DRIVER_REMOVE a bit more
  explicit that this is a crazy thing to enable for a distro kernel
  (thanks for trying Fedora!), the other resolves an issue with vim
  opening kernfs files (sysfs, configfs, etc.)

  Both have been in linux-next with no reported issues"

* tag 'driver-core-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Make Kconfig text for DEBUG_TEST_DRIVER_REMOVE stronger
  kernfs: Add noop_fsync to supported kernfs_file_fops
2016-10-29 10:57:40 -07:00
Al Viro ad5cb123fd ceph: switch to use of ->d_init()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-28 22:05:13 -04:00
Al Viro 18fc8abdb7 ceph: unify dentry_operations instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-28 21:52:50 -04:00
Linus Torvalds f6167514c8 Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "My patch fixes the btrfs list_head abuse that we tracked down during
  Dave Jones' memory corruption investigation. With both Jens and my
  patches in place, I'm no longer able to trigger problems.

  Filipe is fixing a difficult old bug between snapshots, balance and
  send. Dave is cooking a few more for the next rc, but these are tested
  and ready"

* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix races on root_log_ctx lists
  btrfs: fix incremental send failure caused by balance
2016-10-28 10:07:35 -07:00
Christoph Hellwig ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Richard Weinberger a00052a296 ubifs: Fix regression in ubifs_readdir()
Commit c83ed4c9db ("ubifs: Abort readdir upon error") broke
overlayfs support because the fix exposed an internal error
code to VFS.

Reported-by: Peter Rosin <peda@axentia.se>
Tested-by: Peter Rosin <peda@axentia.se>
Reported-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Tested-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Fixes: c83ed4c9db ("ubifs: Abort readdir upon error")
Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-28 14:48:31 +02:00
Linus Torvalds 14970f204b Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "20 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  drivers/misc/sgi-gru/grumain.c: remove bogus 0x prefix from printk
  cris/arch-v32: cryptocop: print a hex number after a 0x prefix
  ipack: print a hex number after a 0x prefix
  block: DAC960: print a hex number after a 0x prefix
  fs: exofs: print a hex number after a 0x prefix
  lib/genalloc.c: start search from start of chunk
  mm: memcontrol: do not recurse in direct reclaim
  CREDITS: update credit information for Martin Kepplinger
  proc: fix NULL dereference when reading /proc/<pid>/auxv
  mm: kmemleak: ensure that the task stack is not freed during scanning
  lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB
  latent_entropy: raise CONFIG_FRAME_WARN by default
  kconfig.h: remove config_enabled() macro
  ipc: account for kmem usage on mqueue and msg
  mm/slab: improve performance of gathering slabinfo stats
  mm: page_alloc: use KERN_CONT where appropriate
  mm/list_lru.c: avoid error-path NULL pointer deref
  h8300: fix syscall restarting
  kcov: properly check if we are in an interrupt
  mm/slab: fix kmemcg cache creation delayed issue
2016-10-27 19:58:39 -07:00
Uwe Kleine-König 14f947c87a fs: exofs: print a hex number after a 0x prefix
It makes the message hard to interpret correctly if a base 10 number is
prefixed by 0x.  So change to a hex number.

Link: http://lkml.kernel.org/r/20161026125658.25728-2-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Boaz Harrosh <ooo@electrozaur.com>
Cc: Benny Halevy <bhalevy@primarydata.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 18:43:43 -07:00
Leon Yu 06b2849d10 proc: fix NULL dereference when reading /proc/<pid>/auxv
Reading auxv of any kernel thread results in NULL pointer dereferencing
in auxv_read() where mm can be NULL.  Fix that by checking for NULL mm
and bailing out early.  This is also the original behavior changed by
recent commit c531716785 ("proc: switch auxv to use of __mem_open()").

  # cat /proc/2/auxv
  Unable to handle kernel NULL pointer dereference at virtual address 000000a8
  Internal error: Oops: 17 [#1] PREEMPT SMP ARM
  CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1
  Hardware name: BCM2709
  task: ea3b0b00 task.stack: e99b2000
  PC is at auxv_read+0x24/0x4c
  LR is at do_readv_writev+0x2fc/0x37c
  Process cat (pid: 113, stack limit = 0xe99b2210)
  Call chain:
    auxv_read
    do_readv_writev
    vfs_readv
    default_file_splice_read
    splice_direct_to_actor
    do_splice_direct
    do_sendfile
    SyS_sendfile64
    ret_fast_syscall

Fixes: c531716785 ("proc: switch auxv to use of __mem_open()")
Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Janis Danisevskis <jdanis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 18:43:43 -07:00
Johannes Berg 56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg 489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Linus Torvalds e3300ffef0 orangefs: a couple of cleanups sent in by other developers
use d_fsdata instead of d_time
     Miklos Szeredi <mszeredi@redhat.com>
 
   use file_inode(file) instead of file->f_path.dentry->d_inode
     Amir Goldstein <amir73il@gmail.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJYD65MAAoJEM9EDqnrzg2+kaUP/0HPDYJyWSgbGVSKuNqOiyml
 VbAGRbDAcpyYCFww2cRO9Xvvh6bJmGEqZUUbNxgi3q5L2KnvvoQ0jkHFfHaVii53
 uWP0WGrxBcRNxv72jfo1cBxYTcTqEfzXZBQb6HhzfbjMCvejbhSbYDowElTE7Oar
 AwcgEdv0Utm7zD/0K+OW56Q4fUYzOSFI4c/tNGUyjQCLE+N3R2roXdivz3maEfee
 uDg262lfQgkzbEYGJOdt8MpUak6YEp2bFa+Xf8bRoKMze8KbVDLwuTlYXuSdc/i8
 e8QO/Zr+irX/jJ/Sc998FwGquUljPuxz4wHSNEVO3HqYFIe30zkUD0mqQcxqx6YD
 F4DhSn8Ok5PuKv5aw1Q7AMA0Zd+bKaJzb/E0JdlHn1n9PFMiod82rdTfmGxP1rZb
 BwuOW/dsp/RLBZhCYpkNTBiNAH+TSIp8M7eOavO68AZ2zJXN69e/Qv2iJsaAZJZ0
 of+i9I4kmXUS4F6OPjgT6xJbH4aD/X4/jei4dPKDATXM0MW+GsZ7VodAYmAqGGCO
 l66UoL4o11BCMJNGfsdPxWJkUgpn4OBb+RSkS0f6qQ7Nlp1OaYeRYKNbX5ICHcgj
 A0PHXZ8Pub3iVgX5xUrQmYk3txbLt0ISDYBXzfPZ0rreztN0o5FRB4TNVLC82VwJ
 XHBdehhgLsNc1PMKSzZo
 =b9Ly
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.9-rc2-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull oreangefs updates from Mike Marshall:
 "A couple of orangefs cleanups sent in by other developers:

   - use d_fsdata instead of d_time (Miklos Szeredi)

   - use file_inode(file) instead of file->f_path.dentry->d_inode (Amir
     Goldstein)"

* tag 'for-linus-4.9-rc2-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: don't use d_time
  orangefs: user file_inode() where it is due
2016-10-27 12:52:46 -07:00
Linus Torvalds e890038e6a xfs: updates for 4.9-rc3
Changes in this update:
 o iomap page offset masking fix for page faults
 o add IOMAP_REPORT to distinguish between read and fiemap map requests
 o cleanups to new shared data extent code
 o fix mount active status on failed log recovery
 o fix broken dquots in a buffer calculation
 o fix locking order issues and merge xfs_reflink_remap_range and
   xfs_file_share_range
 o rework unmapping of CoW extents and remove now unused functions
 o clean state when CoW is done.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYEfdWAAoJEK3oKUf0dfoddg4P/0Tl/i58sBL/Um90kSGOjxjI
 yaOKuFImS3MFSYDwiYADnXdhq6BgVLUJWS07t9/P6Nn3OZr1wBCZDZdyRS1+JwAA
 qOui4sp/v21HprydscN+BAdxyYmuo4yFgu9lkFFSM55yiaAb5C8hsYKF42Gja1+m
 gS40/Lsa5nauSz58UOZ5oEljAvBldAdyMlk8rVSGXVm7+pqs7Lxmhjif/Y8y/Y+i
 097auIrGk+oRDukXqhtZyCQ7VP99WzM+ksajtrNwVOOzSMhrcDCHKuLe0i4LsyjN
 UTx1ioY/AD8PUYhSmLqALD9vtFHnJbx50/MQFHNLc+hDQb2jb/jQmqx9LyEYDt38
 sw/Wy55hh9PylILdE//bWH0vSgqmnNCWviBUzjDtAJ9FKfv19slFlwtu2K4lOHoq
 C6Q2uh2mB7BC6efksk9DeA6/N9tFQuiXa48sN5+D2zMfZAmdkgzDCKfGrpRnS1Yl
 4h+sfiK/DTf11Q2nTaPAHylt02SmHsikQWvb5Fxu76UI8k4RsjCZc3ep/NUNJBlU
 E8f+cdNlAF5k/AWBY7107N1iUqL/vS2wXLdburJkckmQqRcI5WuRaLhi9g4tFjFI
 o+m9EM1WuOP6jeOuVImwgCRJoLVnTVKwee/d4J8y9Ad//Rs6B9pB0SIDfxJa9LY6
 B1XjT8z/NVyK6GsfP1Qs
 =LDDu
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-linus-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fixes from Dave Chinner:
 "This update contains fixes for most of the outstanding regressions
  introduced with the 4.9-rc1 XFS merge. There is also a fix for an
  iomap bug, too.

  This is a quite a bit larger than I'd prefer for a -rc3, but most of
  the change comes from cleaning up the new reflink copy on write code;
  it's much simpler and easier to understand now. These changes fixed
  several bugs in the new code, and it wasn't clear that there was an
  easier/simpler way to fix them. The rest of the fixes are the usual
  size you'd expect at this stage.

  I've left the commits to soak in linux-next for a some extra time
  because of the size before asking you to pull, no new problems with
  them have been reported so I think it's all OK.

  Summary:
   - iomap page offset masking fix for page faults
   - add IOMAP_REPORT to distinguish between read and fiemap map
     requests
   - cleanups to new shared data extent code
   - fix mount active status on failed log recovery
   - fix broken dquots in a buffer calculation
   - fix locking order issues and merge xfs_reflink_remap_range and
     xfs_file_share_range
   - rework unmapping of CoW extents and remove now unused functions
   - clean state when CoW is done"

* tag 'xfs-fixes-for-linus-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (25 commits)
  xfs: clear cowblocks tag when cow fork is emptied
  xfs: fix up inode cowblocks tracking tracepoints
  fs: Do to trim high file position bits in iomap_page_mkwrite_actor
  xfs: remove xfs_bunmapi_cow
  xfs: optimize xfs_reflink_end_cow
  xfs: optimize xfs_reflink_cancel_cow_blocks
  xfs: refactor xfs_bunmapi_cow
  xfs: optimize writes to reflink files
  xfs: don't bother looking at the refcount tree for reads
  xfs: handle "raw" delayed extents xfs_reflink_trim_around_shared
  xfs: add xfs_trim_extent
  iomap: add IOMAP_REPORT
  xfs: merge xfs_reflink_remap_range and xfs_file_share_range
  xfs: remove xfs_file_wait_for_io
  xfs: move inode locking from xfs_reflink_remap_range to xfs_file_share_range
  xfs: fix the same_inode check in xfs_file_share_range
  xfs: remove the same fs check from xfs_file_share_range
  libxfs: v3 inodes are only valid on crc-enabled filesystems
  libxfs: clean up _calc_dquots_per_chunk
  xfs: unset MS_ACTIVE if mount fails
  ...
2016-10-27 12:34:50 -07:00
Chris Mason 570dd45042 btrfs: fix races on root_log_ctx lists
btrfs_remove_all_log_ctxs takes a shortcut where it avoids walking the
list because it knows all of the waiters are patiently waiting for the
commit to finish.

But, there's a small race where btrfs_sync_log can remove itself from
the list if it finds a log commit is already done.  Also, it uses
list_del_init() to remove itself from the list, but there's no way to
know if btrfs_remove_all_log_ctxs has already run, so we don't know for
sure if it is safe to call list_del_init().

This gets rid of all the shortcuts for btrfs_remove_all_log_ctxs(), and
just calls it with the proper locking.

This is part two of the corruption fixed by cbd60aa7cd.  I should have
done this in the first place, but convinced myself the optimizations were
safe.  A 12 hour run of dbench 2048 will eventually trigger a list debug
WARN_ON for the list_del_init() in btrfs_sync_log().

Fixes: d1433debe7
Reported-by: Dave Jones <davej@codemonkey.org.uk>
cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Chris Mason <clm@fb.com>
2016-10-27 10:42:20 -07:00
Tony Luck 2a9becdd4d kernfs: Add noop_fsync to supported kernfs_file_fops
If you edit a kernfs backed file with vi(1), you see an ugly error
message when you write the file because vi tries to fsync(2) the
file after writing, which fails.

We have noop_fsync() for this, use it.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-27 17:47:11 +02:00
Linus Torvalds 272ddc8b37 proc: don't use FOLL_FORCE for reading cmdline and environment
Now that Lorenzo cleaned things up and made the FOLL_FORCE users
explicit, it becomes obvious how some of them don't really need
FOLL_FORCE at all.

So remove FOLL_FORCE from the proc code that reads the command line and
arguments from user space.

The mem_rw() function actually does want FOLL_FORCE, because gdd (and
possibly many other debuggers) use it as a much more convenient version
of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part
conditional on actually being a ptracer.  This does not actually do
that, just moves adds a comment to that effect and moves the gup_flags
settings next to each other.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-24 19:00:44 -07:00
Jeff Layton 0cc11a61b8 nfsd: move blocked lock handling under a dedicated spinlock
Bruce was hitting some lockdep warnings in testing, showing that we
could hit a deadlock with the new CB_NOTIFY_LOCK handling, involving a
rather complex situation involving four different spinlocks.

The crux of the matter is that we end up taking the nn->client_lock in
the lm_notify handler. The simplest fix is to just declare a new
per-nfsd_net spinlock to protect the new CB_NOTIFY_LOCK structures.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-10-24 16:51:21 -04:00
Miklos Szeredi 804b1737d7 orangefs: don't use d_time
Instead use d_fsdata which is the same size.  Hoping to get rid of d_time,
which is used by very few filesystems by this time.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-10-24 14:50:07 -04:00
Amir Goldstein d62a9025ae orangefs: user file_inode() where it is due
Replace wrong use of file->f_path.dentry->d_inode with file_inode(file).
In case orangefs ever finds itself as an overelayfs layer, it would want
to get its own inode and not overlayfs's inode.

DISCLAIMER: I did not test this patch because I do not know how to setup
            an orangefs mount

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-10-24 14:29:39 -04:00
Arnd Bergmann 68a564006a NFSv4.1: work around -Wmaybe-uninitialized warning
A bugfix introduced a harmless gcc warning in nfs4_slot_seqid_in_use
if we enable -Wmaybe-uninitialized again:

fs/nfs/nfs4session.c:203:54: error: 'cur_seq' may be used uninitialized in this function [-Werror=maybe-uninitialized]

gcc is not smart enough to conclude that the IS_ERR/PTR_ERR pair
results in a nonzero return value here. Using PTR_ERR_OR_ZERO()
instead makes this clear to the compiler.

The warning originally did not appear in v4.8 as it was globally
disabled, but the bugfix that introduced the warning got backported
to stable kernels which again enable it, and this is now the only
warning in the v4.7 builds.

Fixes: e09c978aae ("NFSv4.1: Fix Oopsable condition in server callback races")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-24 13:54:43 -04:00
Wang Xiaoguang 9d1032cc49 btrfs: fix WARNING in btrfs_select_ref_head()
This issue was found when testing in-band dedupe enospc behaviour,
sometimes run_one_delayed_ref() may fail for enospc reason, then
__btrfs_run_delayed_refs()will return, but forget to add num_heads_read
back, which will trigger "WARN_ON(delayed_refs->num_heads_ready == 0)" in
btrfs_select_ref_head().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Dan Carpenter 9c894696f5 Btrfs: remove some no-op casts
We cast 0 to a u8 but then because of type promotion, it's immediately
cast to int back to int before we do a bitwise negate.  The cast doesn't
matter in this case, the code works as intended.  It causes a static
checker warning though so let's remove it.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Wang Xiaoguang dd4b857aab btrfs: pass correct args to btrfs_async_run_delayed_refs()
In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), we
swap the arg2 and arg3 wrongly, fix this.

This bug just impacts asynchronous delayed refs handle when we truncate inodes.
In delayed_ref_async_start(), there is such codes:

    trans = btrfs_join_transaction(async->root);
    if (trans->transid > async->transid)
        goto end;
    ret = btrfs_run_delayed_refs(trans, async->root, async->count);

From this codes, we can see that this just influence whether can we handle
delayed refs or the number of delayed refs to handle, this may impact
performance, but will not result in missing delayed refs, all delayed refs will
be handled in btrfs_commit_transaction().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Wang Xiaoguang 69ae5e4459 btrfs: make file clone aware of fatal signals
Indeed this just make the behavior similar to xfs when process has
fatal signals pending, and it'll make fstests/generic/298 happy.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Goldwyn Rodrigues 0b34c261e2 btrfs: qgroup: Prevent qgroup->reserved from going subzero
While free'ing qgroup->reserved resources, we much check if
the page has not been invalidated by a truncate operation
by checking if the page is still dirty before reducing the
qgroup resources. Resources in such a case are free'd when
the entire extent is released by delayed_ref.

This fixes a double accounting while releasing resources
in case of truncating a file, reproduced by the following testcase.

SCRATCH_DEV=/dev/vdb
SCRATCH_MNT=/mnt
mkfs.btrfs -f $SCRATCH_DEV
mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
cd $SCRATCH_MNT
btrfs quota enable $SCRATCH_MNT
btrfs subvolume create a
btrfs qgroup limit 500m a $SCRATCH_MNT
sync
for c in {1..15}; do
dd if=/dev/zero  bs=1M count=40 of=$SCRATCH_MNT/a/file;
done

sleep 10
sync
sleep 5

touch $SCRATCH_MNT/a/newfile

echo "Removing file"
rm $SCRATCH_MNT/a/file

Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:21 +02:00
Benjamin Coddington 86a6c211d6 NFS: Trim extra slash in v4 nfs_path
A NFSv4 mount of a subdirectory will show an extra slash (as in
'server://path') in proc's mountinfo which will not match the device name
and path.  This can cause problems for programs searching for the mount.
Fix this by checking for a leading slash in the dentry path, if so trim
away any trailing slashes in the device name.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-24 12:06:01 -04:00
Wei Yongjun 26c1ec2fe4 dlm: fix error return code in sctp_accept_from_sock()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-24 10:01:51 -05:00
Mauro Carvalho Chehab 8c27ceff36 docs: fix locations of several documents that got moved
The previous patch renamed several files that are cross-referenced
along the Kernel documentation. Adjust the links to point to
the right places.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2016-10-24 08:12:35 -02:00
Darrick J. Wong b77428b12b xfs: defer should abort intent items if the trans roll fails
If the deferred ops transaction roll fails, we need to abort the intent
items if we haven't already logged a done item for it, regardless of
whether or not the deferred ops has had a transaction committed.  Dave
found this while running generic/388.

Move the tracepoint to make it easier to track object lifetimes.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-24 14:21:18 +11:00
Brian Foster c17a8ef43d xfs: clear cowblocks tag when cow fork is emptied
The background cowblocks scan job takes care of scanning for inodes with
potentially lingering blocks in the cow fork and clearing them out. If
the background scanner reclaims the cow fork blocks, however, it doesn't
immediately clear the cowblocks tag from the inode. Instead, the inode
remains tagged until the background scanner comes around again,
discovers the inode cow fork has no blocks, clears the tag and fires the
trace_xfs_inode_free_cowblocks_invalid() tracepoint to indicate that the
inode may have been incorrectly tagged.

This is not a major functional problem as the tag is ultimately cleared.
Nonetheless, clear the tag when an inode cow fork is explicitly emptied
to avoid the extra round trip through the background scanner and
spurious "invalid" tracepoint.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-24 14:21:08 +11:00
Brian Foster 7b7381f043 xfs: fix up inode cowblocks tracking tracepoints
These calls are still using the eofblocks tracepoints. The cowblocks
equivalents are already defined, we just aren't actually calling them.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-24 14:21:00 +11:00
Jan Kara c663e29f88 fs: Do to trim high file position bits in iomap_page_mkwrite_actor
iomap_page_mkwrite_actor() calls __block_write_begin_int() with position
masked as pos & ~PAGE_MASK which is equivalent to pos & (PAGE_SIZE-1).
Thus it masks off high bits of file position. However
__block_write_begin_int() expects full file position on input. This does
not cause any visible issues because all __block_write_begin_int()
really cares about are low file position bits but still it is a bug
waiting to happen.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-24 14:20:25 +11:00
Linus Torvalds 5ff93abc7a This pull requests contains fixes for issues in both UBI and UBIFS:
- Fallout from the merge window, refactoring UBI code introduced some issues.
 - Fixes for an UBIFS readdir bug which can cause getdents() to busy loop
   for ever and a bug in the UBIFS xattr code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJYDMa5AAoJEEtJtSqsAOnWxKgP+wT4lCM9gP3/1FywhrJRxA4Z
 vH9YYWP6vjdYhZP8tt3RVIrJ/BxPMDx8+7IkZBzxVRQcnvaoaGibgEsfkGmTngyW
 2rFVqDuwFIDIWLvNrKW26ep4p1Ek8yZhIIcW4upHKtnaJpZwvn6BmwxRep1JeLuc
 yZjGIJtejRbvuuaVwEBu+Et3Rlflg5/D6oPWOJfYqwjJjxihkb4hfAgzJkLeBK3Q
 Qw65S8FxKDPa7vAj2+jor3Cq0ETg3b2cQR4+UnGmDat9RVMquS3dDTBzBn6TNZx+
 xw2aiOPi0JPMeEnJP+Z61/moeQhlLddZsEVdRQ5Ud6LcOeq6Rg7v5J+POkQ0hhIy
 DUfxHjnsmB4P9XqtaGGr74d8trjIm15cL6yAVKG/jMnb11oCWVDVyr0FmsXSmO7I
 O+b6P9hM7C3o+eAETdCLhd8Jg5isOm27WWQ2Bqq2FOjY9EmvTIFl+Imp+++3YHA6
 R6jlFfMbju0gCfyPZdDPmTc91CPtWdTze43bpIdl2N3L2/efG2I0xFjjlr+WWEkL
 htYQr+b3vjO+moTl8KvT7pmvVNPUtNljOZsHHJjrsBLvuMDb0+7X1Wy860klTOPp
 B7NntTqwBUF6HtPpeebHvEfBiTruyspGZfokvkud6rqPuO1DbsJrVNY7Lwh9XA8M
 iGn9LwwlNjQYiyZNx0GT
 =Gjo0
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.9-rc2' of git://git.infradead.org/linux-ubifs

Pull UBI[FS] fixes from Richard Weinberger:
 "This contains fixes for issues in both UBI and UBIFS:

   - Fallout from the merge window, refactoring UBI code introduced some
     issues.

   - Fixes for an UBIFS readdir bug which can cause getdents() to busy
     loop for ever and a bug in the UBIFS xattr code"

* tag 'upstream-4.9-rc2' of git://git.infradead.org/linux-ubifs:
  ubifs: Abort readdir upon error
  UBI: Fix crash in try_recover_peb()
  ubi: fix swapped arguments to call to ubi_alloc_aeb
  ubifs: Fix xattr_names length in exit paths
  ubifs: Rename ubifs_rename2
2016-10-23 16:58:55 -07:00
Linus Torvalds c761923cb8 A few bug fixes and add some missing KERN_CONT annotations
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJYDK6KAAoJEPL5WVaVDYGjhZ0H/2aLu4BQOmIPJZBBS+I2FurE
 7FFdnQ8r1gBPWktvfUTn6MzTE4VKe0b1js5EiRCiCJhJq9UadBu53dUWTgfZ5Egi
 Sc6p0NGqDRgixLXbFRt8wP7iPtVg0tlysE0EJ6ae4VA1wUpf5aoHaPqgO9V0hirW
 9pUJq8kzBGs628CROcYtQ5IL5AfouM1q/fzazw4Voz48LTgvhnDGCkqQmNsKkRo+
 bN5tkjSTQUdW3OrRVsNwNND/iDYpTa6PcX1XXQiFFhQ4SbZoNS/dzowz09QreGxA
 Uz/rt2hMnv552Zd52d5q6N/jPWg+O+x0b4PcYtn7NDjPn/1KZUyX0pQK/EoevXQ=
 =2Kri
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A few bug fixes and add some missing KERN_CONT annotations"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add missing KERN_CONT to a few more debugging uses
  fscrypto: lock inode while setting encryption policy
  ext4: correct endianness conversion in __xattr_check_inode()
  fscrypto: make XTS tweak initialization endian-independent
  ext4: do not advertise encryption support when disabled
  jbd2: fix incorrect unlock on j_list_lock
  ext4: super.c: Update logging style using KERN_CONT
2016-10-23 16:52:19 -07:00
Linus Torvalds 86c5bf7101 Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull vmap stack fixes from Ingo Molnar:
 "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
  accesses that used to be just somewhat questionable are now totally
  buggy.

  These changes try to do it without breaking the ABI: the fields are
  left there, they are just reporting zero, or reporting narrower
  information (the maps file change)"

* 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
  fs/proc: Stop trying to report thread stacks
  fs/proc: Stop reporting eip and esp in /proc/PID/stat
  mm/numa: Remove duplicated include from mprotect.c
2016-10-22 09:39:10 -07:00
Linus Torvalds 02593ac680 NFS client bugfixes for Linux 4.9
Stable bugfix:
 - Fix last_write_offset incorrectly set to page boundary
 
 Other bugfix:
 - Fix missing-braces warning
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYCnklAAoJENfLVL+wpUDrv5MQALGRrZyyvQVCGwHt8BhiZDMp
 5OAB1B7mFF0yf/L7j5rLUEvXs6+YyGVHTRrqWlAm1Mq7aqqGjW3YcE260KOJwse3
 sk0eZ8mj92Bbm19ktRRGJCWeeCi16BsywIJEIbFFLs0ssKltSJMMnhE/8gyZ3Oj1
 /TQ0jFCsAGxErr9GVny9FiVa4mlgkauOEfY/QJsgMHH7FBYftBU7mo7rH43RaxQ7
 XLLv9XTe/WFCedxAa0uY/SikmAplLCpShOHCnCvveOF4WhKdx1gjaCnp6nZSMMP8
 Hyd49AZHfxlQWK3B6amhHtI5iU2/tyNl8aFN49PXUdbN1VqJoSFnMCZgc/BWKIsQ
 NGpUuQSTqz2qnMtHC3sErWfi2/c9kNDn9R3DPkTJPtZKoE0+FHnnxlhTWl9YSvju
 iW4hisaDbldmP2davoMeKKDIrP9g+z0+8akcZx4lSoVEhswVtsDzpFpGETL8bM6Y
 0002b8UU0qj4QVLUoW1HNCad5/H0G3ir0utXr+//OduQb2SMAilQmscltOcFXzfe
 TzR6YD7RP2RZs/t5fqnxlvBB2kYkSa8vWC/dJdVC5MC0nq6L1yO1n6L0p+E7Keck
 9S2fWi89WnGN4guKxtIOo58vbr6wAcA+g6zM35WwLDgxtklQHZCKOTPJEsPbnlSr
 DpeZFTwLeG7/SENBFZhI
 =VadI
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.9-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Just two bugfixes this time:

  Stable bugfix:
   - Fix last_write_offset incorrectly set to page boundary

  Other bugfix:
   - Fix missing-braces warning"

* tag 'nfs-for-4.9-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  nfs4: fix missing-braces warning
  pnfs/blocklayout: fix last_write_offset incorrectly set to page boundary
2016-10-21 19:06:59 -07:00
Linus Torvalds bdcff41597 rbd exclusive-lock edge case fix and several filesystem fixups.
Nikolay's error path patch is tagged for stable, everything else but
 readdir vs frags race was introduced in 4.9-rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYCPFwAAoJEEp/3jgCEfOLQxkH/3t7m/NaC0S+1eISHQWne0rs
 GtI4wx6Yh5KUV0SKgzYTYs0AEusW459XvUzwLwe/Tp9Qdp/KehviGJdQY8WBP6Es
 J5u7WLU+Ja1GwB586YUzhG7L3PAi8DXxbkTB+MYB4circhZ0w8ecuJUL4o++5VuH
 yAfoKn6tFyCTpvhFGd9dBPn3tVl90/vpwiH/hHp04PWHq6dNvLyJuIbvUD4JaV3O
 NYQqq3fFG76jqwyu2dE0DN4IPNb3tUjJ1oY86Uvkq7DP4ZiI61JNx45XTW1XIplx
 lWi2f2MurwznAJZl9kaU0TiTdS7liizkRdb2cu56nMRmzVSDz+va5X3CdDSpQtg=
 =JwMW
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc2' of git://github.com/ceph/ceph-client

Pull Ceph fixes from Ilya Dryomov:
 "An rbd exclusive-lock edge case fix and several filesystem fixups.

  Nikolay's error path patch is tagged for stable, everything else but
  readdir vs frags race was introduced in this merge window"

* tag 'ceph-for-4.9-rc2' of git://github.com/ceph/ceph-client:
  ceph: fix non static symbol warning
  ceph: fix uninitialized dentry pointer in ceph_real_mount()
  ceph: fix readdir vs fragmentation race
  ceph: fix error handling in ceph_read_iter
  rbd: don't retry watch reregistration if header object is gone
  rbd: don't wait for the lock forever if blacklisted
2016-10-20 09:57:51 -07:00
Linus Torvalds a28ad14e05 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc filesystem fixes from Jan Kara:
 "A fix for an isofs change apparently breaking mount(8) in some cases
  and one ext2 warning fix"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: avoid bogus -Wmaybe-uninitialized warning
  isofs: Do not return EACCES for unknown filesystems
2016-10-20 08:49:03 -07:00
Andy Lutomirski b18cb64ead fs/proc: Stop trying to report thread stacks
This reverts more of:

  b76437579d ("procfs: mark thread stack correctly in proc/<pid>/maps")

... which was partially reverted by:

  65376df582 ("proc: revert /proc/<pid>/maps [stack:TID] annotation")

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.

In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA.  This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone:
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range.  This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 09:21:41 +02:00
Andy Lutomirski 0a1eb2d474 fs/proc: Stop reporting eip and esp in /proc/PID/stat
Reporting these fields on a non-current task is dangerous.  If the
task is in any state other than normal kernel code, they may contain
garbage or even kernel addresses on some architectures.  (x86_64
used to do this.  I bet lots of architectures still do.)  With
CONFIG_THREAD_INFO_IN_TASK=y, it can OOPS, too.

As far as I know, there are no use programs that make any material
use of these fields, so just get rid of them.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/a5fed4c3f4e33ed25d4bb03567e329bc5a712bcc.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 09:21:41 +02:00
Christoph Hellwig 64e6428ddd xfs: remove xfs_bunmapi_cow
Since no one uses it anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:59 +11:00
Christoph Hellwig c1112b6e62 xfs: optimize xfs_reflink_end_cow
Instead of doing a full extent list search for each extent that is
to be deleted using xfs_bmapi_read and then doing another one inside
of xfs_bunmapi_cow use the same scheme that xfs_bumapi uses:  look
up the last extent to be deleted and then use the extent index to
walk downward until we are outside the range to be deleted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:45 +11:00
Christoph Hellwig 3e0ee78f7a xfs: optimize xfs_reflink_cancel_cow_blocks
Rewrite xfs_reflink_cancel_cow_blocks so that we only do a search for
the first extent in the extent list and then iterate over the remaining
extents using the extent index, passing the extent we operate on
directly to xfs_bmap_del_extent_delay or xfs_bmap_del_extent_cow instead
of going through xfs_bunmapi and doing yet another extent list lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:31 +11:00
Christoph Hellwig fa5c836ca8 xfs: refactor xfs_bunmapi_cow
Split out two helpers for deleting delayed or real extents from the COW fork.
This allows to call them directly from xfs_reflink_cow_end_io once that
function is refactored to iterate the extent tree.  It will also allow
to reuse the delalloc deletion from xfs_bunmapi in the future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:54:14 +11:00
Christoph Hellwig 3ba020befe xfs: optimize writes to reflink files
Instead of reserving space as the first thing in write_begin move it past
reading the extent in the data fork.  That way we only have to read from
the data fork once and can reuse that information for trimming the extent
to the shared/unshared boundary.  Additionally this allows to easily
limit the actual write size to said boundary, and avoid a roundtrip on the
ilock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:53:50 +11:00
Christoph Hellwig 5f9268ca53 xfs: don't bother looking at the refcount tree for reads
There is no need to trim an extent into a shared or non-shared one, or
report any flags for plain old reads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:53:32 +11:00
Christoph Hellwig 62c5ac89de xfs: handle "raw" delayed extents xfs_reflink_trim_around_shared
Delalloc extents in the extent list contain the number of reserved
indirect blocks in their startblock value and don't use the magic
DELAYSTARTBLOCK constant.  Ensure that xfs_reflink_trim_around_shared
handles them properly by checking for isnullstartblock().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:52:00 +11:00
Darrick J. Wong 0a0af28cad xfs: add xfs_trim_extent
This helpers allows to trim an extent to a subset of it's original range
while making sure the block numbers in it remain valid,

In the future xfs_trim_extent and xfs_bmapi_trim_map should probably be
merged in some form.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: split from a previous patch from Darrick, moved around and added
 support for "raw" delayed extents"]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:51:50 +11:00
Christoph Hellwig d33fd776f9 iomap: add IOMAP_REPORT
This allows the file system to tell a FIEMAP from a read operation, and thus
avoids the need to report flags that aren't actually used in the read path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:51:28 +11:00
Christoph Hellwig 5faaf4fa0a xfs: merge xfs_reflink_remap_range and xfs_file_share_range
There is no clear division of responsibility between those functions, so
just merge them into one to keep the code simple.  Also move
xfs_file_wait_for_io to xfs_reflink.c together with its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:50:07 +11:00
Christoph Hellwig ec40759902 xfs: remove xfs_file_wait_for_io
filemap_write_and_wait_range operates on full pages, so there is no
need for the rounding operations.  Additionally this allows us to
micro-optimize by skipping the second inode_dio_wait for a
intra-file clone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:49:55 +11:00
Christoph Hellwig 576177818e xfs: move inode locking from xfs_reflink_remap_range to xfs_file_share_range
We need the iolock protection to stabilizie the IS_SWAPFILE and
IS_IMMUTABLE values, as well as preventing new buffered writers
re-dirtying the file data that we just wrote out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:49:19 +11:00
Christoph Hellwig a62e82b35b xfs: fix the same_inode check in xfs_file_share_range
The VFS i_ino is an unsigned long, while XFS inode numbers are 64-bit
wide, so checking i_ino for equality could lead to rate false positives
on 32-bit architectures.  Just compare the inode pointers themselves
to be safe.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:49:03 +11:00
Christoph Hellwig 4fbc2c6525 xfs: remove the same fs check from xfs_file_share_range
The VFS already does the check, and the placement of this duplicate
is in the way of the following locking rework.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:48:54 +11:00
Roger Willcocks 8cdcc8102c libxfs: v3 inodes are only valid on crc-enabled filesystems
xfs_repair was not detecting that version 3 inodes are invalid for
for non-CRC filesystems. The result is specific inode corruptions go
undetected and hence aren't repaired if only the version number is
out of range.

The core of the problem is that the XFS_DINODE_GOOD_VERSION() macro
doesn't know that valid inode versions are dependent on a superblock
version number. Fix this in libxfs, and propagate the new function
out into the rest of xfsprogs to fix the issue.

[Darrick: port to kernel from xfsprogs]

Reported-by: Leslie Rhorer <lrhorer@mygrande.net>
Signed-off-by: Roger Willcocks <roger@filmlight.ltd.uk>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:48:38 +11:00
Darrick J. Wong 58d7896785 libxfs: clean up _calc_dquots_per_chunk
The function xfs_calc_dquots_per_chunk takes a parameter in units
of basic blocks.  The kernel seems to get the units wrong, but
userspace got 'fixed' by commenting out the unnecessary conversion.
Fix both.

cc: <stable@vger.kernel.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:46:18 +11:00
Darrick J. Wong d099245297 xfs: unset MS_ACTIVE if mount fails
As part of the inode block map intent log item recovery process, we
had to set the IRECOVERY flag to prevent an unlinked inode from
being truncated during the first iput call.  This required us to set
MS_ACTIVE so that iput puts the inode on the lru instead of
immediately evicting the inode.

Unfortunately, if the mount fails later on, the inodes that have
been loaded (root dir and realtime) actually need to be evicted
since we're aborting the mount.  If we don't clear MS_ACTIVE in the
failure step, those inodes are not evicted and therefore leak.   The
leak was found by running xfs/130 and rmmoding xfs immediately after
the test.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:45:40 +11:00
Eric Sandeen fe23759eaf xfs: remove pointless error goto in xfs_bmap_remap_alloc
The commit:

f65306ea xfs: map an inode's offset to an exact physical block

added a pointless error0: target; remove it.

Addresses-Coverity-Id: 1373865
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:44:53 +11:00
Christoph Hellwig 0ee7a3f6b5 xfs: don't take the IOLOCK exclusive for direct I/O page invalidation
XFS historically took the iolock exclusive when invalidating pages
before direct I/O operations to protect against writeback starvations.

But this writeback starvation issues has been fixed a long time ago
in the core writeback code, and all other file systems manage to do
without the exclusive lock.  Convert XFS over to avoid the exclusive
lock in this case, and also move to range invalidations like done
by the other file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:44:14 +11:00
Eric Biggers f1b8243c55 xfs: add some 'static' annotations
sparse reported that several variables and a function were not
forward-declared anywhere and therefore should be 'static'.

Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/xfs/'

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:42:30 +11:00
Geert Uytterhoeven 1be7f9be0e xfs: Fix uninitialized variable in xfs_reflink_reserve_cow_range()
with gcc 4.1.2:

    fs/xfs/xfs_reflink.c: In function xfs_reflink_reserve_cow_range:
    fs/xfs/xfs_reflink.c:327: warning: error may be used uninitialized in this function

Indeed, if "count" is zero, the function will return an uninitialized
error value.

While "count" is unlikely to be zero, this function is called through
the public iomap API. Hence fix this by preinitializing error to zero.

Fixes: 2a06705cd5 ("xfs: create delalloc extents in CoW fork")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:41:48 +11:00
Colin Ian King 1d55a4bfd0 xfs: remove redundant assignment of ifp
Remove redundant ifp = ifp statement, it does nothing. Found with
static analysis by CoverityScan.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20 15:40:55 +11:00
Richard Weinberger c83ed4c9db ubifs: Abort readdir upon error
If UBIFS is facing an error while walking a directory, it reports this
error and ubifs_readdir() returns the error code. But the VFS readdir
logic does not make the getdents system call fail in all cases. When the
readdir cursor indicates that more entries are present, the system call
will just return and the libc wrapper will try again since it also
knows that more entries are present.
This causes the libc wrapper to busy loop for ever when a directory is
corrupted on UBIFS.
A common approach do deal with corrupted directory entries is
skipping them by setting the cursor to the next entry. On UBIFS this
approach is not possible since we cannot compute the next directory
entry cursor position without reading the current entry. So all we can
do is setting the cursor to the "no more entries" position and make
getdents exit.

Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-20 00:06:11 +02:00
Richard Weinberger 843741c577 ubifs: Fix xattr_names length in exit paths
When the operation fails we also have to undo the changes
we made to ->xattr_names. Otherwise listxattr() will report
wrong lengths.

Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-20 00:05:54 +02:00
Richard Weinberger 390975ac39 ubifs: Rename ubifs_rename2
Since ->rename2 is gone, rename ubifs_rename2() to ubifs_rename().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-10-20 00:05:47 +02:00
Arnd Bergmann 83aa3e0f79 nfs4: fix missing-braces warning
A bugfix introduced a harmless warning for update_open_stateid:

fs/nfs/nfs4proc.c:1548:2: error: missing braces around initializer [-Werror=missing-braces]

Removing the zero in the initializer will do the right thing here
and initialize the entire structure to zero.

Fixes: 1393d9612b ("NFSv4: Fix a race when updating an open_stateid")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-10-19 14:39:15 -04:00
Bob Peterson aa9f101285 dlm: don't specify WQ_UNBOUND for the ast callback workqueue
This patch removes the WQ_UNBOUND flag (which implies WQ_HIGHPRI)
from the DLM's ast work queue, in favor of just WQ_HIGHPRI.
This has been shown to cause a 19 percent performance increase for
simultaneous inode creates on GFS2 with fs_mark.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:13:04 -05:00
Bob Peterson d2fee58a3b dlm: remove lock_sock to avoid scheduling while atomic
Before this patch, functions save_callbacks and restore_callbacks
called function lock_sock and release_sock to prevent other processes
from messing with the struct sock while the callbacks were saved and
restored. However, function add_sock calls write_lock_bh prior to
calling it save_callbacks, which disables preempts. So the call to
lock_sock would try to schedule when we can't schedule.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:00:03 -05:00
Bob Peterson 3735b4b9f1 dlm: don't save callbacks after accept
When DLM calls accept() on a socket, the comm code copies the sk
after we've saved its callbacks. Afterward, it calls add_sock which
saves the callbacks a second time. Since the error reporting function
lowcomms_error_report calls the previous callback too, this results
in a recursive call to itself. This patch adds a new parameter to
function add_sock to tell whether to save the callbacks. Function
tcp_accept_from_sock (and its sctp counterpart) then calls it with
false to avoid the recursion.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:00:03 -05:00
Paul Gortmaker 7963b8a598 dlm: audit and remove any unnecessary uses of module.h
Historically a lot of these existed because we did not have
a distinction between what was modular code and what was providing
support to modules via EXPORT_SYMBOL and friends.  That changed
when we forked out support for the latter into the export.h file.
This means we should be able to reduce the usage of module.h
in code that is obj-y Makefile or bool Kconfig.

In the case of some code where it is modular, we can extend that to
also include files that are building basic support functionality but
not related to loading or registering the final module; such files
also have no need whatsoever for module.h

The advantage in removing such instances is that module.h itself
sources about 15 other headers; adding significantly to what we feed
cpp, and it can obscure what headers we are effectively using.

Since module.h might have been the implicit source for init.h
(for __init) and for export.h (for EXPORT_SYMBOL) we consider each
instance for the presence of either and replace as needed.

In the dlm case, we remove module.h from a global header and only
introduce it in the files where it is explicitly required, since
there is nothing modular in dlm_internal.h itself.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:00:03 -05:00
Stephen Hemminger dbef1c0534 dlm: make genl_ops const
This table contains function points and should be const.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-10-19 11:00:03 -05:00