Commit Graph

53554 Commits

Author SHA1 Message Date
Nikolay Borisov 2b87733134 btrfs: Split btrfs_del_delalloc_inode into 2 functions
This is in preparation of fixing delalloc inodes leakage on transaction
abort. Also export the new function.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17 14:18:26 +02:00
Liu Bo 02a3307aa9 btrfs: fix reading stale metadata blocks after degraded raid1 mounts
If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cb ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block.  It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.

A simple PoC is included in btrfs/124,

0. A two-disk raid1 btrfs,

1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

2. Mount this filesystem and put it in use, after a while, device tree's
   root got COW but block A hasn't been allocated/overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks), since raid1
   can pick up one disk by the writer task's pid, if btrfs_search_slot()
   needs to read block A, dev2 which does NOT have the latest metadata
   might be read for block A, then we got a stale block A.

6. As parent transid is not checked, block A is marked as uptodate and
   put into the extent buffer cache, so the future search won't bother
   to read disk again, which means it'll make changes on this stale
   one and make it dirty and flush it onto disk.

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.

This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c176041). The fix is to replace 0 by 'gen'.

Fixes: 5bdd3536cb ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17 14:18:25 +02:00
Misono Tomohiro 1a63c198dd btrfs: property: Set incompat flag if lzo/zstd compression is set
Incompat flag of LZO/ZSTD compression should be set at:

 1. mount time (-o compress/compress-force)
 2. when defrag is done
 3. when property is set

Currently 3. is missing and this commit adds this.

This could lead to a filesystem that uses ZSTD but is not marked as
such. If a kernel without a ZSTD support encounteres a ZSTD compressed
extent, it will handle that but this could be confusing to the user.

Typically the filesystem is mounted with the ZSTD option, but the
discrepancy can arise when a filesystem is never mounted with ZSTD and
then the property on some file is set (and some new extents are
written). A simple mount with -o compress=zstd will fix that up on an
unpatched kernel.

Same goes for LZO, but this has been around for a very long time
(2.6.37) so it's unlikely that a pre-LZO kernel would be used.

Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add user visible impact ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17 14:18:25 +02:00
Filipe Manana 31d11b83b9 Btrfs: fix duplicate extents after fsync of file with prealloc extents
In commit 471d557afe ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay"), on fsync,  we started to always log all prealloc
extents beyond an inode's i_size in order to avoid losing them after a
power failure. However under some cases this can lead to the log replay
code to create duplicate extent items, with different lengths, in the
extent tree. That happens because, as of that commit, we can now log
extent items based on extent maps that are not on the "modified" list
of extent maps of the inode's extent map tree. Logging extent items based
on extent maps is used during the fast fsync path to save time and for
this to work reliably it requires that the extent maps are not merged
with other adjacent extent maps - having the extent maps in the list
of modified extents gives such guarantee.

Consider the following example, captured during a long run of fsstress,
which illustrates this problem.

We have inode 271, in the filesystem tree (root 5), for which all of the
following operations and discussion apply to.

A buffered write starts at offset 312391 with a length of 933471 bytes
(end offset at 1245862). At this point we have, for this inode, the
following extent maps with the their field values:

em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
      block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
      block_len 376832, orig_block_len 376832
em C, start 417792, orig_start 417792, len 782336, block_start
      18446744073709551613, block_len 0, orig_block_len 0
em D, start 1200128, orig_start 1200128, len 835584, block_start
      1106776064, block_len 835584, orig_block_len 835584
em E, start 2035712, orig_start 2035712, len 245760, block_start
      1107611648, block_len 245760, orig_block_len 245760

Extent map A corresponds to a hole and extent maps D and E correspond to
preallocated extents.

Extent map D ends where extent map E begins (1106776064 + 835584 =
1107611648), but these extent maps were not merged because they are in
the inode's list of modified extent maps.

An fsync against this inode is made, which triggers the fast path
(BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
of the data previously written using buffered IO, and when the respective
ordered extent finishes, btrfs_drop_extents() is called against the
(aligned) range 311296..1249279. This causes a split of extent map D at
btrfs_drop_extent_cache(), replacing extent map D with a new extent map
D', also added to the list of modified extents,  with the following
values:

em D', start 1249280, orig_start of 1200128,
       block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
       orig_block_len 835584,
       block_len 786432 (835584 - (1249280 - 1200128))

Then, during the fast fsync, btrfs_log_changed_extents() is called and
extent maps D' and E are removed from the list of modified extents. The
flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged
clear_em_logging() is called on each of them, and that makes extent map E
to be merged with extent map D' (try_merge_map()), resulting in D' being
deleted and E adjusted to:

em E, start 1249280, orig_start 1200128, len 1032192,
      block_start 1106825216, block_len 1032192,
      orig_block_len 245760

A direct IO write at offset 1847296 and length of 360448 bytes (end offset
at 2207744) starts, and at that moment the following extent maps exist for
our inode:

em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
      block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
      block_len 270336, orig_block_len 376832
em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
      block_len 937984, orig_block_len 937984
em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
      block_start 1106825216, block_len 1032192, orig_block_len 245760

The dio write results in drop_extent_cache() being called twice. The first
time for a range that starts at offset 1847296 and ends at offset 2035711
(length of 188416), which results in a double split of extent map E,
replacing it with two new extent maps:

em F, start 1249280, orig_start 1200128, block_start 1106825216,
      block_len 598016, orig_block_len 598016
em G, start 2035712, orig_start 1200128, block_start 1107611648,
      block_len 245760, orig_block_len 1032192

It also creates a new extent map that represents a part of the requested
IO (through create_io_em()):

em H, start 1847296, len 188416, block_start 1107423232, block_len 188416

The second call to drop_extent_cache() has a range with a start offset of
2035712 and end offset of 2207743 (length of 172032). This leads to
replacing extent map G with a new extent map I with the following values:

em I, start 2207744, orig_start 1200128, block_start 1107783680,
      block_len 73728, orig_block_len 1032192

It also creates a new extent map that represents the second part of the
requested IO (through create_io_em()):

em J, start 2035712, len 172032, block_start 1107611648, block_len 172032

The dio write set the inode's i_size to 2207744 bytes.

After the dio write the inode has the following extent maps:

em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
      block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
      block_len 270336, orig_block_len 376832
em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
      block_len 937984, orig_block_len 937984
em F, start 1249280, orig_start 1200128, len 598016,
      block_start 1106825216, block_len 598016, orig_block_len 598016
em H, start 1847296, orig_start 1200128, len 188416,
      block_start 1107423232, block_len 188416, orig_block_len 835584
em J, start 2035712, orig_start 2035712, len 172032,
      block_start 1107611648, block_len 172032, orig_block_len 245760
em I, start 2207744, orig_start 1200128, len 73728,
      block_start 1107783680, block_len 73728, orig_block_len 1032192

Now do some change to the file, like adding a xattr for example and then
fsync it again. This triggers a fast fsync path, and as of commit
471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync
log replay"), we use the extent map I to log a file extent item because
it's a prealloc extent and it starts at an offset matching the inode's
i_size. However when we log it, we create a file extent item with a value
for the disk byte location that is wrong, as can be seen from the
following output of "btrfs inspect-internal dump-tree":

 item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
     generation 22 type 2 (prealloc)
     prealloc data disk byte 1106776064 nr 1032192
     prealloc data offset 1007616 nr 73728

Here the disk byte value corresponds to calculation based on some fields
from the extent map I:

  1106776064 = block_start (1107783680) - 1007616 (extent_offset)
  extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616

The disk byte value of 1106776064 clashes with disk byte values of the
file extent items at offsets 1249280 and 1847296 in the fs tree:

        item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
                generation 20 type 2 (prealloc)
                prealloc data disk byte 1106776064 nr 835584
                prealloc data offset 49152 nr 598016
        item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
                generation 20 type 1 (regular)
                extent data disk byte 1106776064 nr 835584
                extent data offset 647168 nr 188416 ram 835584
                extent compression 0 (none)
        item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
                generation 20 type 1 (regular)
                extent data disk byte 1107611648 nr 245760
                extent data offset 0 nr 172032 ram 245760
                extent compression 0 (none)
        item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
                generation 20 type 2 (prealloc)
                prealloc data disk byte 1107611648 nr 245760
                prealloc data offset 172032 nr 73728

Instead of the disk byte value of 1106776064, the value of 1107611648
should have been logged. Also the data offset value should have been
172032 and not 1007616.
After a log replay we end up getting two extent items in the extent tree
with different lengths, one of 835584, which is correct and existed
before the log replay, and another one of 1032192 which is wrong and is
based on the logged file extent item:

 item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
    refs 2 gen 15 flags DATA
    extent data backref root 5 objectid 271 offset 1200128 count 2
 item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
    refs 1 gen 22 flags DATA
    extent data backref root 5 objectid 271 offset 1200128 count 1

Obviously this leads to many problems and a filesystem check reports many
errors:

 (...)
 checking extents
 Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
 extent item 1106776064 has multiple extent items
 ref mismatch on [1106776064 835584] extent item 2, found 3
 Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
 Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
 Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
 Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
 backpointer mismatch on [1106776064 835584]
 checking free space cache
 block group 1103101952 has wrong amount of free space
 failed to load free space cache for block group 1103101952
 checking fs roots
 (...)

So fix this by logging the prealloc extents beyond the inode's i_size
based on searches in the subvolume tree instead of the extent maps.

Fixes: 471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17 14:18:19 +02:00
Linus Torvalds 21b9f1c7e3 AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWvmaZvu3V2unywtrAQKZoA/9HzO6QsB7h7hWY6tTuoL0gD8T8S4hC7l3
 UYFtTgq0rFHJYiET4SWoy0Sfs8rY1iFPtaIeFVQG804SrnXu5/Q1tsv+1lRhZIuo
 /upAtZ3xEcqvAqU8pgcksKl/KUdmm7ZHUbhAFCasu+1eczGF5Q55UAUgonFrnEMi
 9N0WviRUkRAlTre7cvCMRI05c+HJV+PCYrJPjStAkJeuS1CuTEAT/d58NumquMAt
 6ENkpR4OhRUJZDhYH7XIRLm7hsYjr9v3VIeCiLpYqUZGuvhaj3jzPi0e9zD5PDzZ
 lyyodQVegBs88V2rXrjjZHohNQRiuSzI+42pMXrdaDu5jBFFqYLEeaBoperJY7nl
 W6l6HSb/I8VValM7iwkyzNWeQ6KhdUhYvA5ljYaJufZvqxp4di9xT4mAxRqbHSX+
 H5I/n+R27FEOFAqnWInaksj5IO80HGThrGhdz9O/4pa8xITz7W2ZKg5YMLEoF9yp
 /QUxsn3lz4VD4tjPrqampJ+IwbpQB+XDiJhM4boI47kC2IxEc9L2QiYWlFl/okZ4
 CGuXsluQFPleR3Mo8xq1WaQzmT40iYQ+aBOPq1/OhDisexZJ55Cjha1GHk/8aHDu
 GL5UiL7AfWEwY20mJiCObg8u2nnkwg/0YPR3awDBlCMDBeYhxbSFOLrKiQxUjWM9
 Pp6PUhTtSjU=
 =1ow3
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20180514' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:
 "Here's a set of patches that fix a number of bugs in the in-kernel AFS
  client, including:

   - Fix directory locking to not use individual page locks for
     directory reading/scanning but rather to use a semaphore on the
     afs_vnode struct as the directory contents must be read in a single
     blob and data from different reads must not be mixed as the entire
     contents may be shuffled about between reads.

   - Fix address list parsing to handle port specifiers correctly.

   - Only give up callback records on a server if we actually talked to
     that server (we might not be able to access a server).

   - Fix some callback handling bugs, including refcounting,
     whole-volume callbacks and when callbacks actually get broken in
     response to a CB.CallBack op.

   - Fix some server/address rotation bugs, including giving up if we
     can't probe a server; giving up if a server says it doesn't have a
     volume, but there are more servers to try.

   - Fix the decoding of fetched statuses to be OpenAFS compatible.

   - Fix the handling of server lookups in Cache Manager ops (such as
     CB.InitCallBackState3) to use a UUID if possible and to handle no
     server being found.

   - Fix a bug in server lookup where not all addresses are compared.

   - Fix the non-encryption of calls that prevents some servers from
     being accessed (this also requires an AF_RXRPC patch that has
     already gone in through the net tree).

  There's also a patch that adds tracepoints to log Cache Manager ops
  that don't find a matching server, either by UUID or by address"

* tag 'afs-fixes-20180514' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix the non-encryption of calls
  afs: Fix CB.CallBack handling
  afs: Fix whole-volume callback handling
  afs: Fix afs_find_server search loop
  afs: Fix the handling of an unfound server in CM operations
  afs: Add a tracepoint to record callbacks from unlisted servers
  afs: Fix the handling of CB.InitCallBackState3 to find the server by UUID
  afs: Fix VNOVOL handling in address rotation
  afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility
  afs: Fix server rotation's handling of fileserver probe failure
  afs: Fix refcounting in callback registration
  afs: Fix giving up callbacks on server destruction
  afs: Fix address list parsing
  afs: Fix directory page locking
2018-05-15 10:48:36 -07:00
Filipe Manana 9a8fca62aa Btrfs: fix xattr loss after power failure
If a file has xattrs, we fsync it, to ensure we clear the flags
BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
inode, the current transaction commits and then we fsync it (without
either of those bits being set in its inode), we end up not logging
all its xattrs. This results in deleting all xattrs when replying the
log after a power failure.

Trivial reproducer

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/foobar
  $ setfattr -n user.xa -v qwerty /mnt/foobar
  $ xfs_io -c "fsync" /mnt/foobar

  $ sync

  $ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
  $ xfs_io -c "fsync" /mnt/foobar
  <power failure>

  $ mount /dev/sdb /mnt
  $ getfattr --absolute-names --dump /mnt/foobar
  <empty output>
  $

So fix this by making sure all xattrs are logged if we log a file's inode
item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
BTRFS_INODE_COPY_EVERYTHING were set in the inode.

Fixes: 36283bf777 ("Btrfs: fix fsync xattr loss in the fast fsync path")
Cc: <stable@vger.kernel.org> # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-14 16:42:43 +02:00
Robbie Ko 6f2f0b394b Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting
[BUG]
btrfs incremental send BUG happens when creating a snapshot of snapshot
that is being used by send.

[REASON]
The problem can happen if while we are doing a send one of the snapshots
used (parent or send) is snapshotted, because snapshoting implies COWing
the root of the source subvolume/snapshot.

1. When doing an incremental send, the send process will get the commit
   roots from the parent and send snapshots, and add references to them
   through extent_buffer_get().

2. When a snapshot/subvolume is snapshotted, its root node is COWed
   (transaction.c:create_pending_snapshot()).

3. COWing releases the space used by the node immediately, through:

   __btrfs_cow_block()
   --btrfs_free_tree_block()
   ----btrfs_add_free_space(bytenr of node)

4. Because send doesn't hold a transaction open, it's possible that
   the transaction used to create the snapshot commits, switches the
   commit root and the old space used by the previous root node gets
   assigned to some other node allocation. Allocation of a new node will
   use the existing extent buffer found in memory, which we previously
   got a reference through extent_buffer_get(), and allow the extent
   buffer's content (pages) to be modified:

   btrfs_alloc_tree_block
   --btrfs_reserve_extent
   ----find_free_extent (get bytenr of old node)
   --btrfs_init_new_buffer (use bytenr of old node)
   ----btrfs_find_create_tree_block
   ------alloc_extent_buffer
   --------find_extent_buffer (get old node)

5. So send can access invalid memory content and have unpredictable
   behaviour.

[FIX]
So we fix the problem by copying the commit roots of the send and
parent snapshots and use those copies.

CallTrace looks like this:
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.c:1861!
 invalid opcode: 0000 [#1] SMP
 CPU: 6 PID: 24235 Comm: btrfs Tainted: P           O 3.10.105 #23721
 ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
 RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs]
 RSP: 0018:ffff88041b723b68  EFLAGS: 00010246
 RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
 RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
 RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
 R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
 R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
 FS:  00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Stack:
 ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
 ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
 ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
 Call Trace:
 [<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
 [<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs]
 [<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
 [<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs]
 [<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
 [<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
 [<ffffffff81111133>] ? handle_pte_fault+0x373/0x990
 [<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20
 [<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0
 [<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0
 [<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500
 [<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90
 [<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990
 [<ffffffff81034f83>] ? do_fork+0x113/0x3b0
 [<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c
 [<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0
 [<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b
 ---[ end trace 29576629ee80b2e1 ]---

Fixes: 7069830a9e ("Btrfs: add btrfs_compare_trees function")
CC: stable@vger.kernel.org # 3.6+
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-14 16:42:34 +02:00
David Howells 4776cab43f afs: Fix the non-encryption of calls
Some AFS servers refuse to accept unencrypted traffic, so can't be accessed
with kAFS.  Set the AF_RXRPC security level to encrypt client calls to deal
with this.

Note that incoming service calls are set by the remote client and so aren't
affected by this.

This requires an AF_RXRPC patch to pass the value set by setsockopt to calls
begun by the kernel.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:19 +01:00
David Howells 428edade4e afs: Fix CB.CallBack handling
The handling of CB.CallBack messages sent by the fileserver to the client
is broken in that they are currently being processed after the reply has
been transmitted.

This is not what the fileserver expects, however.  It holds up change
visibility until the reply comes so as to maintain cache coherency, and so
expects the client to have to refetch the state on the affected files.

Fix CB.CallBack handling to perform the callback break before sending the
reply.

The fileserver is free to hold up status fetches issued by other threads on
the same client that occur in reponse to the callback until any pending
changes have been committed.

Fixes: d001648ec7 ("rxrpc: Don't expose skbs to in-kernel users [ver #2]")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:19 +01:00
David Howells 68251f0a68 afs: Fix whole-volume callback handling
It's possible for an AFS file server to issue a whole-volume notification
that callbacks on all the vnodes in the file have been broken.  This is
done for R/O and backup volumes (which don't have per-file callbacks) and
for things like a volume being taken offline.

Fix callback handling to detect whole-volume notifications, to track it
across operations and to check it during inode validation.

Fixes: c435ee3455 ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:18 +01:00
Marc Dionne f9c1bba3d3 afs: Fix afs_find_server search loop
The code that looks up servers by addresses makes the assumption
that the list of addresses for a server is sorted.  It exits the
loop if it finds that the target address is larger than the
current candidate.  As the list is not currently sorted, this
can lead to a failure to find a matching server, which can cause
callbacks from that server to be ignored.

Remove the early exit case so that the complete list is searched.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:18 +01:00
David Howells a86b06d1cc afs: Fix the handling of an unfound server in CM operations
If the client cache manager operations that need the server record
(CB.Callback, CB.InitCallBackState, and CB.InitCallBackState3) can't find
the server record, they abort the call from the file server with
RX_CALL_DEAD when they should return okay.

Fixes: c35eccb1f6 ("[AFS]: Implement the CB.InitCallBackState3 operation.")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:18 +01:00
David Howells 3709a399c1 afs: Add a tracepoint to record callbacks from unlisted servers
Add a tracepoint to record callbacks from servers for which we don't have a
record.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:18 +01:00
David Howells 001ab5a67e afs: Fix the handling of CB.InitCallBackState3 to find the server by UUID
Fix the handling of the CB.InitCallBackState3 service call to find the
record of a server that we're using by looking it up by the UUID passed as
the parameter rather than by its address (of which it might have many, and
which may change).

Fixes: c35eccb1f6 ("[AFS]: Implement the CB.InitCallBackState3 operation.")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:18 +01:00
David Howells 3d9fa91161 afs: Fix VNOVOL handling in address rotation
If a volume location record lists multiple file servers for a volume, then
it's possible that due to a misconfiguration or a changing configuration
that one of the file servers doesn't know about it yet and will abort
VNOVOL.  Currently, the rotation algorithm will stop with EREMOTEIO.

Fix this by moving on to try the next server if VNOVOL is returned.  Once
all the servers have been tried and the record rechecked, the algorithm
will stop with EREMOTEIO or ENOMEDIUM.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:18 +01:00
David Howells 684b0f68cf afs: Fix AFSFetchStatus decoder to provide OpenAFS compatibility
The OpenAFS server's RXAFS_InlineBulkStatus implementation has a bug
whereby if an error occurs on one of the vnodes being queried, then the
errorCode field is set correctly in the corresponding status, but the
interfaceVersion field is left unset.

Fix kAFS to deal with this by evaluating the AFSFetchStatus blob against
the following cases when called from FS.InlineBulkStatus delivery:

 (1) If InterfaceVersion == 0 then:

     (a) If errorCode != 0 then it indicates the abort code for the
         corresponding vnode.

     (b) If errorCode == 0 then the status record is invalid.

 (2) If InterfaceVersion == 1 then:

     (a) If errorCode != 0 then it indicates the abort code for the
         corresponding vnode.

     (b) If errorCode == 0 then the status record is valid and can be
     	 parsed.

 (3) If InterfaceVersion is anything else then the status record is
     invalid.

Fixes: dd9fbcb8e1 ("afs: Rearrange status mapping")
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 15:15:18 +01:00
David Howells ec5a3b4b50 afs: Fix server rotation's handling of fileserver probe failure
The server rotation algorithm just gives up if it fails to probe a
fileserver.  Fix this by rotating to the next fileserver instead.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 13:26:44 +01:00
David Howells d4a96bec7a afs: Fix refcounting in callback registration
The refcounting on afs_cb_interest struct objects in
afs_register_server_cb_interest() is wrong as it uses the server list
entry's call back interest pointer without regard for the fact that it
might be replaced at any time and the object thrown away.

Fix this by:

 (1) Put a lock on the afs_server_list struct that can be used to
     mediate access to the callback interest pointers in the servers array.

 (2) Keep a ref on the callback interest that we get from the entry.

 (3) Dropping the old reference held by vnode->cb_interest if we replace
     the pointer.

Fixes: c435ee3455 ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 13:17:35 +01:00
David Howells f2686b0926 afs: Fix giving up callbacks on server destruction
When a server record is destroyed, we want to send a message to the server
telling it that we're giving up all the callbacks it has promised us.

Apply two fixes to this:

 (1) Only send the FS.GiveUpAllCallBacks message if we actually got a
     callback from that server.  We assume this to be the case if we
     performed at least one successful FS operation on that server.

 (2) Send it to the address last used for that server rather than always
     picking the first address in the list (which might be unreachable).

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 13:17:35 +01:00
David Howells 01fd79e6de afs: Fix address list parsing
The parsing of port specifiers in the address list obtained from the DNS
resolution upcall doesn't work as in4_pton() and in6_pton() will fail on
encountering an unexpected delimiter (in this case, the '+' marking the
port number).  However, in*_pton() can't be given multiple specifiers.

Fix this by finding the delimiter in advance and not relying on in*_pton()
to find the end of the address for us.

Fixes: 8b2a464ced ("afs: Add an address list concept")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 13:17:35 +01:00
David Howells b61f7dcf4e afs: Fix directory page locking
The afs directory loading code (primarily afs_read_dir()) locks all the
pages that hold a directory's content blob to defend against
getdents/getdents races and getdents/lookup races where the competitors
issue conflicting reads on the same data.  As the reads will complete
consecutively, they may retrieve different versions of the data and
one may overwrite the data that the other is busy parsing.

Fix this by not locking the pages at all, but rather by turning the
validation lock into an rwsem and getting an exclusive lock on it whilst
reading the data or validating the attributes and a shared lock whilst
parsing the data.  Sharing the attribute validation lock should be fine as
the data fetch will retrieve the attributes also.

The individual page locks aren't needed at all as the only place they're
being used is to serialise data loading.

Without this patch, the:

 	if (!test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags)) {
		...
	}

part of afs_read_dir() may be skipped, leaving the pages unlocked when we
hit the success: clause - in which case we try to unlock the not-locked
pages, leading to the following oops:

  page:ffffe38b405b4300 count:3 mapcount:0 mapping:ffff98156c83a978 index:0x0
  flags: 0xfffe000001004(referenced|private)
  raw: 000fffe000001004 ffff98156c83a978 0000000000000000 00000003ffffffff
  raw: dead000000000100 dead000000000200 0000000000000001 ffff98156b27c000
  page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
  page->mem_cgroup:ffff98156b27c000
  ------------[ cut here ]------------
  kernel BUG at mm/filemap.c:1205!
  ...
  RIP: 0010:unlock_page+0x43/0x50
  ...
  Call Trace:
   afs_dir_iterate+0x789/0x8f0 [kafs]
   ? _cond_resched+0x15/0x30
   ? kmem_cache_alloc_trace+0x166/0x1d0
   ? afs_do_lookup+0x69/0x490 [kafs]
   ? afs_do_lookup+0x101/0x490 [kafs]
   ? key_default_cmp+0x20/0x20
   ? request_key+0x3c/0x80
   ? afs_lookup+0xf1/0x340 [kafs]
   ? __lookup_slow+0x97/0x150
   ? lookup_slow+0x35/0x50
   ? walk_component+0x1bf/0x490
   ? path_lookupat.isra.52+0x75/0x200
   ? filename_lookup.part.66+0xa0/0x170
   ? afs_end_vnode_operation+0x41/0x60 [kafs]
   ? __check_object_size+0x9c/0x171
   ? strncpy_from_user+0x4a/0x170
   ? vfs_statx+0x73/0xe0
   ? __do_sys_newlstat+0x39/0x70
   ? __x64_sys_getdents+0xc9/0x140
   ? __x64_sys_getdents+0x140/0x140
   ? do_syscall_64+0x5b/0x160
   ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: f3ddee8dc4 ("afs: Fix directory handling")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14 13:17:35 +01:00
Linus Torvalds ccda3c4b77 some small SMB3 fixes for 4.17-rc5, some for stable
-----BEGIN PGP SIGNATURE-----
 
 iQHHBAABCgAxFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlr04iwTHHNtZnJlbmNo
 QGdtYWlsLmNvbQAKCRCKLL1wB3JPUaQ/C/4xelt5SEsr3AEj9SUJFwjHZJ09pSCI
 T1e/ztJLVorE2VAKhVub9bN5hP4vrAfUzWTHDeDB02eQy8BLUaUKFMxhN2UjRfix
 pdL9vLIGMXNCuGROWnO3UHbE99b1Dht8pdpmt+tTevSwYnHj7ACbyoZJsESOhged
 5WDwUCM2FTmyu4L3UqD8ZDp5jrvx2q6qTOpck53v3lLbnkRx/rNLLdamFXo1IRd3
 cypXIck+v+g1pv0mbs5cYdtHRQqiiP7RbY7bmvNJrV0G95iNGgonuWaCwiMOLqmZ
 R9aw9ktGZepu2aWlhhiV1nNxI1zo7pvMSoRYT2Pd1ZaxUbTb3LvBwkviAFgz8Dft
 a10MzvLMrxY79q/4FdWf+FdsQ0QHh98XwsMdAFuraI0zlNN+9/0UPDYd8X/LFgoK
 3GOZDNyQ6LPvCsJ4ZmzugneuCRzWk/HU4y1uo67vjqUODKjPPZ8qXGOz0sNSJfnp
 gjpdw6G2GW4PLH4q3oppbNfetAaqhm24pMw=
 =qAFz
 -----END PGP SIGNATURE-----

Merge tag '4.17-rc4-SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Some small SMB3 fixes for 4.17-rc5, some for stable"

* tag '4.17-rc4-SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: directory sync should not return an error
  cifs: smb2ops: Fix listxattr() when there are no EAs
  cifs: smbd: Enable signing with smbdirect
  cifs: Allocate validate negotiation request through kmalloc
2018-05-12 18:49:53 -07:00
Linus Torvalds f0ab773f5c Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  rbtree: include rcu.h
  scripts/faddr2line: fix error when addr2line output contains discriminator
  ocfs2: take inode cluster lock before moving reflinked inode from orphan dir
  mm, oom: fix concurrent munlock and oom reaper unmap, v3
  mm: migrate: fix double call of radix_tree_replace_slot()
  proc/kcore: don't bounds check against address 0
  mm: don't show nr_indirectly_reclaimable in /proc/vmstat
  mm: sections are not offlined during memory hotremove
  z3fold: fix reclaim lock-ups
  init: fix false positives in W+X checking
  lib/find_bit_benchmark.c: avoid soft lockup in test_find_first_bit()
  KASAN: prohibit KASAN+STRUCTLEAK combination
  MAINTAINERS: update Shuah's email address
2018-05-11 18:04:12 -07:00
Ashish Samant e438302920 ocfs2: take inode cluster lock before moving reflinked inode from orphan dir
While reflinking an inode, we create a new inode in orphan directory,
then take EX lock on it, reflink the original inode to orphan inode and
release EX lock.  Once the lock is released another node could request
it in EX mode from ocfs2_recover_orphans() which causes downconvert of
the lock, on this node, to NL mode.

Later we attempt to initialize security acl for the orphan inode and
move it to the reflink destination.  However, while doing this we dont
take EX lock on the inode.  This could potentially cause problems
because we could be starting transaction, accessing journal and
modifying metadata of the inode while holding NL lock and with another
node holding EX lock on the inode.

Fix this by taking orphan inode cluster lock in EX mode before
initializing security and moving orphan inode to reflink destination.
Use the __tracker variant while taking inode lock to avoid recursive
locking in the ocfs2_init_security_and_acl() call chain.

Link: http://lkml.kernel.org/r/1523475107-7639-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Laura Abbott 3955333df9 proc/kcore: don't bounds check against address 0
The existing kcore code checks for bad addresses against __va(0) with
the assumption that this is the lowest address on the system.  This may
not hold true on some systems (e.g.  arm64) and produce overflows and
crashes.  Switch to using other functions to validate the address range.

It's currently only seen on arm64 and it's not clear if anyone wants to
use that particular combination on a stable release.  So this is not
urgent for stable.

Link: http://lkml.kernel.org/r/20180501201143.15121-1-labbott@redhat.com
Signed-off-by: Laura Abbott <labbott@redhat.com>
Tested-by: Dave Anderson <anderson@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>a
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Dave Chinner 79f546a696 fs: don't scan the inode cache before SB_BORN is set
We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.  It produces
an oops down this path during the failed mount:

  radix_tree_gang_lookup_tag+0xc4/0x130
  xfs_perag_get_tag+0x37/0xf0
  xfs_reclaim_inodes_count+0x32/0x40
  xfs_fs_nr_cached_objects+0x11/0x20
  super_cache_count+0x35/0xc0
  shrink_slab.part.66+0xb1/0x370
  shrink_node+0x7e/0x1a0
  try_to_free_pages+0x199/0x470
  __alloc_pages_slowpath+0x3a1/0xd20
  __alloc_pages_nodemask+0x1c3/0x200
  cache_grow_begin+0x20b/0x2e0
  fallback_alloc+0x160/0x200
  kmem_cache_alloc+0x111/0x4e0

The problem is that the superblock shrinker is running before the
filesystem structures it depends on have been fully set up. i.e.
the shrinker is registered in sget(), before ->fill_super() has been
called, and the shrinker can call into the filesystem before
fill_super() does it's setup work. Essentially we are exposed to
both use-after-free and use-before-initialisation bugs here.

To fix this, add a check for the SB_BORN flag in super_cache_count.
In general, this flag is not set until ->fs_mount() completes
successfully, so we know that it is set after the filesystem
setup has completed. This matches the trylock_super() behaviour
which will not let super_cache_scan() run if SB_BORN is not set, and
hence will not allow the superblock shrinker from entering the
filesystem while it is being set up or after it has failed setup
and is being torn down.

Cc: stable@kernel.org
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-11 15:37:57 -04:00
Al Viro 1e2e547a93 do d_instantiate/unlock_new_inode combinations safely
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
	lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex.  Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.

	Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode().  All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.

Cc: stable@kernel.org	# 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-11 15:36:37 -04:00
Steve French 6e70c267e6 smb3: directory sync should not return an error
As with NFS, which ignores sync on directory handles,
fsync on a directory handle is a noop for CIFS/SMB3.
Do not return an error on it.  It breaks some database
apps otherwise.

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-05-10 19:21:14 -05:00
Ilya Dryomov fc218544fb ceph: fix iov_iter issues in ceph_direct_read_write()
dio_get_pagev_size() and dio_get_pages_alloc() introduced in commit
b5b98989dc ("ceph: combine as many iovec as possile into one OSD
request") assume that the passed iov_iter is ITER_IOVEC.  This isn't
the case with splice where it ends up poking into the guts of ITER_BVEC
or ITER_PIPE iterators, causing lockups and crashes easily reproduced
with generic/095.

Rather than trying to figure out gap alignment and stuff pages into
a page vector, add a helper for going from iov_iter to a bio_vec array
and make use of the new CEPH_OSD_DATA_TYPE_BVECS code.

Fixes: b5b98989dc ("ceph: combine as many iovec as possile into one OSD request")
Link: http://tracker.ceph.com/issues/18130
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
2018-05-10 10:15:12 +02:00
Ilya Dryomov 3a15b38fd2 ceph: fix rsize/wsize capping in ceph_direct_read_write()
rsize/wsize cap should be applied before ceph_osdc_new_request() is
called.  Otherwise, if the size is limited by the cap instead of the
stripe unit, ceph_osdc_new_request() would setup an extent op that is
bigger than what dio_get_pages_alloc() would pin and add to the page
vector, triggering asserts in the messenger.

Cc: stable@vger.kernel.org
Fixes: 95cca2b44e ("ceph: limit osd write size")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-05-10 10:15:00 +02:00
Konrad Rzeszutek Wilk e96f46ee85 proc: Use underscores for SSBD in 'status'
The style for the 'status' file is CamelCase or this. _.

Fixes: fae1fa0fc ("proc: Provide details on speculation flaw mitigations")
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-09 21:41:38 +02:00
Paulo Alcantara ae2cd7fb47 cifs: smb2ops: Fix listxattr() when there are no EAs
As per listxattr(2):

       On success, a nonnegative number is returned indicating the size
       of the extended attribute name list.  On failure, -1 is returned
       and errno  is set appropriately.

In SMB1, when the server returns an empty EA list through a listxattr(),
it will correctly return 0 as there are no EAs for the given file.

However, in SMB2+, it returns -ENODATA in listxattr() which is wrong since
the request and response were sent successfully, although there's no actual
EA for the given file.

This patch fixes listxattr() for SMB2+ by returning 0 in cifs_listxattr()
when the server returns an empty list of EAs.

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-09 11:48:42 -05:00
Long Li f7c439668a cifs: smbd: Enable signing with smbdirect
Now signing is supported with RDMA transport.

Remove the code that disabled it.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-05-09 11:48:35 -05:00
Long Li 2796d303e3 cifs: Allocate validate negotiation request through kmalloc
The data buffer allocated on the stack can't be DMA'ed, ib_dma_map_page will
return an invalid DMA address for a buffer on stack. Even worse, this
incorrect address can't be detected by ib_dma_mapping_error. Sending data
from this address to hardware will not fail, but the remote peer will get
junk data.

Fix this by allocating the request on the heap in smb3_validate_negotiate.

Changes in v2:
Removed duplicated code on freeing buffers on function exit.
(Thanks to Parav Pandit <parav@mellanox.com>)
Fixed typo in the patch title.

Changes in v3:
Added "Fixes" to the patch.
Changed several sizeof() to use *pointer in place of struct.

Changes in v4:
Added detailed comments on the failure through RDMA.
Allocate request buffer using GPF_NOFS.
Fixed possible memory leak.

Changes in v5:
Removed variable ret for checking return value.
Changed to use pneg_inbuf->Dialects[0] to calculate unused space in pneg_inbuf.

Fixes: ff1c038add ("Check SMB3 dialects against downgrade attacks")
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Tom Talpey <ttalpey@microsoft.com>
2018-05-09 11:48:31 -05:00
Greg Thelen 9533b292a7 IB: remove redundant INFINIBAND kconfig dependencies
INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So there's no need for
options which depend INFINIBAND_ADDR_TRANS to also depend on INFINIBAND.
Remove the unnecessary INFINIBAND depends.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 08:51:03 -04:00
Linus Torvalds eb4f959b26 First pull request for 4.17-rc
- Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)
 - SPDX license tag cleanups (new tag Linux-OpenIB)
 - RoCE GID fixes related to default GIDs
 - Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
   mlx4, mlx5, and hfi1 (medium batch)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJa7JXPAAoJELgmozMOVy/dc0AP/0i7EajAmgl1ihka6BYVj2pa
 DV8iSrVMDPulh9AVnAtwLJSbdwmgN/HeVzLzcutHyMYk6tAf8RCs6TsyoB36XiOL
 oUh5+V2GyNnyh9veWPwyGTgZKCpPJc3uQFV6502lZVDYwArMfGatumApBgQVKiJ+
 YdPEXEQZPNIs6YZB1WXkNYV/ra9u0aBByQvUrxwVZ2AND+srJYO82tqZit2wBtjK
 UXrhmZbWXGWMFg8K3/lpfUkQhkG3Arj+tMKkCfqsVzC7wUPhlTKBHR9NmvdLIiy9
 5Vhv7Xp78udcxZKtUeTFsbhaMqqK7x7sKHnpKAs7hOZNZ/Eg47BrMwMrZVLOFuDF
 nBLUL1H+nJ1mASZoMWH5xzOpVew+e9X0cot09pVDBIvsOIh97wCG7hgptQ2Z5xig
 fcDiMmg6tuakMsaiD0dzC9JI5HR6Z7+6oR1tBkQFDxQ+XkkcoFabdmkJaIRRwOj7
 CUhXRgcm0UgVd03Jdta6CtYXsjSODirWg4AvSSMt9lUFpjYf9WZP00/YojcBbBEH
 UlVrPbsKGyncgrm3FUP6kXmScESfdTljTPDLiY9cO9+bhhPGo1OHf005EfAp178B
 jGp6hbKlt+rNs9cdXrPSPhjds+QF8HyfSlwyYVWKw8VWlh/5DG8uyGYjF05hYO0q
 xhjIS6/EZjcTAh5e4LzR
 =PI8v
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Doug Ledford:
 "This is our first pull request of the rc cycle. It's not that it's
  been overly quiet, we were just waiting on a few things before sending
  this off.

  For instance, the 6 patch series from Intel for the hfi1 driver had
  actually been pulled in on Tuesday for a Wednesday pull request, only
  to have Jason notice something I missed, so we held off for some
  testing, and then on Thursday had to respin the series because the
  very first patch needed a minor fix (unnecessary cast is all).

  There is a sizable hns patch series in here, as well as a reasonably
  largish hfi1 patch series, then all of the lines of uapi updates are
  just the change to the new official Linux-OpenIB SPDX tag (a bunch of
  our files had what amounts to a BSD-2-Clause + MIT Warranty statement
  as their license as a result of the initial code submission years ago,
  and the SPDX folks decided it was unique enough to warrant a unique
  tag), then the typical mlx4 and mlx5 updates, and finally some cxgb4
  and core/cache/cma updates to round out the bunch.

  None of it was overly large by itself, but in the 2 1/2 weeks we've
  been collecting patches, it has added up :-/.

  As best I can tell, it's been through 0day (I got a notice about my
  last for-next push, but not for my for-rc push, but Jason seems to
  think that failure messages are prioritized and success messages not
  so much). It's also been through linux-next. And yes, we did notice in
  the context portion of the CMA query gid fix patch that there is a
  dubious BUG_ON() in the code, and have plans to audit our BUG_ON usage
  and remove it anywhere we can.

  Summary:

   - Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)

   - SPDX license tag cleanups (new tag Linux-OpenIB)

   - RoCE GID fixes related to default GIDs

   - Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
     mlx4, mlx5, and hfi1 (medium batch)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (52 commits)
  RDMA/cma: Do not query GID during QP state transition to RTR
  IB/mlx4: Fix integer overflow when calculating optimal MTT size
  IB/hfi1: Fix memory leak in exception path in get_irq_affinity()
  IB/{hfi1, rdmavt}: Fix memory leak in hfi1_alloc_devdata() upon failure
  IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used
  IB/hfi1: Fix loss of BECN with AHG
  IB/hfi1 Use correct type for num_user_context
  IB/hfi1: Fix handling of FECN marked multicast packet
  IB/core: Make ib_mad_client_id atomic
  iw_cxgb4: Atomically flush per QP HW CQEs
  IB/uverbs: Fix kernel crash during MR deregistration flow
  IB/uverbs: Prevent reregistration of DM_MR to regular MR
  RDMA/mlx4: Add missed RSS hash inner header flag
  RDMA/hns: Fix a couple misspellings
  RDMA/hns: Submit bad wr
  RDMA/hns: Update assignment method for owner field of send wqe
  RDMA/hns: Adjust the order of cleanup hem table
  RDMA/hns: Only assign dqpn if IB_QP_PATH_DEST_QPN bit is set
  RDMA/hns: Remove some unnecessary attr_mask judgement
  RDMA/hns: Only assign mtu if IB_QP_PATH_MTU bit is set
  ...
2018-05-04 20:51:10 -10:00
Linus Torvalds 2f50037a1c for-linus-20180504
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJa7HerAAoJEPfTWPspceCmxdsP/3ncJdw/PJRGaQNt99ogEIbl
 y/YscWxPWsxbigM0Yc0zh134vO5ZeE7v12InpoE3i5OO4UW+oC+WYP/KDAo3TJIy
 j/9r25p1kfb3j/8fNlb8uMf/6/nKk29cu+gqIZleHMOj6hfap5AFdTwW0/B/gC/p
 BJ+C3e3s41intl+NikZmD4M959gpPTgm5ma8wyCz1XKtGQMH5AxFFrIc22vug/Fb
 3Nk++xuFvgF04tCXwimhgny2eOtHt5L6KNuYYHFWBnd1gXALttsisLgAW2vXbfFB
 c9PDEya3c+btr8+ied27Tp0hHlcQa2/ZY+yFJ3RJ35AXMvTVNDx6bKF3PzfJWzt+
 ynjrywsXC/k7G1JBZntdXF7+y8b52keaIBS8DBBxzhhmzrv0NOTGTaQRhuK5eeem
 tHrvEZlP5iqPRGGQz7F1RYztdWulo/iMLJwibuy2rcNYeHL5T0Olhv9hdH26OVqV
 CNEuEvy+xO4uzkXAGm3j/EoHryHvGgp2xD/8OuQfTnjB6IdcuLznJuyBiUyOj/te
 PgSAI/SdUKPnWyVVONKjXyOyvAglcenNtWMmAZQbsOSNZAW2blrXSFvzHa8wDVe+
 Zpw5+fWJOioemMo+gf884jMRbNDfwyq5hcgjpbkYRz+qg60abqefNt7e87mTqTcJ
 WqP9luNiP9RmXsXo4k+w
 =P6V8
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180504' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A collection of fixes that should to into this release. This contains:

   - Set of bcache fixes from Coly, fixing regression in patches that
     went into this series.

   - Set of NVMe fixes by way of Keith.

   - Set of bdi related fixes, one from Jan and two from Tetsuo Handa,
     fixing various issues around device addition/removal.

   - Two block inflight fixes from Omar, fixing issues around the
     transition to using tags for blk-mq inflight accounting that we
     did a few releases ago"

* tag 'for-linus-20180504' of git://git.kernel.dk/linux-block:
  bdi: Fix oops in wb_workfn()
  nvmet: switch loopback target state to connecting when resetting
  nvme/multipath: Fix multipath disabled naming collisions
  nvme/multipath: Disable runtime writable enabling parameter
  nvme: Set integrity flag for user passthrough commands
  nvme: fix potential memory leak in option parsing
  bdi: Fix use after free bug in debugfs_remove()
  bdi: wake up concurrent wb_shutdown() callers.
  bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
  bcache: set dc->io_disable to true in conditional_stop_bcache_device()
  bcache: add wait_for_kthread_stop() in bch_allocator_thread()
  bcache: count backing device I/O error for writeback I/O
  bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()
  bcache: store disk name in struct cache and struct cached_dev
  blk-mq: fix sysfs inflight counter
  blk-mq: count allocated but not started requests in iostats inflight
2018-05-04 20:41:44 -10:00
Linus Torvalds 2e171ffcdf Changes since last update:
- Cap the maximum length of a deduplication request at MAX_RW_COUNT/2
   to avoid kernel livelock due to excessively large IO requests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlrp5f0ACgkQ+H93GTRK
 tOuflQ//VOKn73zjiwhYPcN4a45/O4sWVxGdAS59+W3usFYG5S6u/OC9XNmaGP2V
 XVmf5VhfYT6MsYq562JVMuXo9GiqU2cMbvFuHQHibdlk+hHVPeIQI5ipBn7K7OMj
 cgvlE3UWKm407QamC9mM62HyeDb9fyXJnDN+nXi6LruPg6ClJQHt1AQLxlDRu6Bm
 ioIpyBJKWbCGbXEGiGIUu5Jv8ohKyZ0lUnxJTRqYrKvqEQ/pDpkMI6POwC0TuHVr
 Kd9z2KCKH7RQmtC1zSFzaXn7N5jb2C1Z70i50F7eeC8MLZJbsXxK3pJlugsOYU4D
 NCgELpqoYV5R08B+Z4yp2Wj91PWBDuY6cTigArLjqvRN1IX2Q+g8uOQhKi1q0c4x
 x+18y+cAPsNgpzu92SFUzJHGLJg5kHxN2wcqt2rIsTdJQt3Z3mIW/sCW/+69v4nu
 arHRMhfa6whcFm30phGyw3juYOsxwypasCGWoDR4A5x8Jy0hrsGhVN1SW6Iw1v7W
 AGYNH1BIb+2qjQ6To6YgRvwvhbT8EpvxwBi3c0wgmOuI3/oOdiJBS4A7tGXo4fHQ
 9CK6gucf9VCX3Nij+juHBXwIeBJ6m6yWX4zD1R5zhA5Z8RIZRMe16aFoOfmHyrAg
 SnwYrOBZFD1k3I5fwcWovxxY9V0CxiZOGg217YkSZP5N7+PamZc=
 =l9pt
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "I've got one more bug fix for xfs for 4.17-rc4, which caps the amount
  of data we try to handle in one dedupe request so that userspace can't
  livelock the kernel.

  This series has been run through a full xfstests run during the week
  and through a quick xfstests run against this morning's master, with
  no ajor failures reported.

  Summary:

  - Cap the maximum length of a deduplication request at MAX_RW_COUNT/2
    to avoid kernel livelock due to excessively large IO requests"

* tag 'xfs-4.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: cap the length of deduplication requests
2018-05-04 20:36:50 -10:00
Linus Torvalds 4148d3884a for-4.17-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrsbkEACgkQxWXV+ddt
 WDvm6Q/+KdFKJ7T8hBOc6o5EeULXCDF3FmMA7HvDC696WXKsckXJFKk52awvrSb6
 3wnIzfWmI3K+rwX3cKqLRKe6tMXtBrTjVWXyezfvx1SMcCO4hSQ+nWLqK08htaNf
 h7m3OC3y0xO8QHcFSkvHUov6KRWG3rH+4p46JsJjN7GTBtWmR6tsiyQQ9JMC3gNR
 8Jnl1YaQq/JDLFm8GmFfPqIK+MLnNJ+GOJC1pm2vJQFtjnDw9dic+dI2hGX2oh9M
 SSRAoJu7jUvTWSmQN9aJfbBUr4atzoKKGYsyAgx5qgXbzOnbUGTIhtyAZirRWWBy
 0pT2b/8XuqsIabwR5dR4UbL4Ke1h5DS4c6GFydwO4DeddTovHtDzbN0cPuPQABL1
 rwFzlnHhcM/qRu9SKXx1jRy7w4Vju8fVX9D4lyjLcyk24flkEAn1NlJCWEqSzPYR
 ikTTm71r/1/62XqE6AcOyugS8E6EYtYHo3PjrfFXr3fQCbctTLEaUKoegFkTezbX
 EZLRPy9KNGfuyUyh3eiSypNHZZ9WiT+W42BaLIcpbHJLnqB+A14Z0oZx6aHVhY0T
 VFLL91O//OmUvjpsZ7I99LyswvrzsmU0jKS41GXOQlLsLTxtGhcZbvRt4uX4UUie
 UOQrNCgO0Y2slGfv+uoDCLhH0tTCRziuEvmgu8bjPcfZE7tph+s=
 =Jdmp
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two regression fixes and one fix for stable"

* tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: send, fix missing truncate for inode with prealloc extent past eof
  btrfs: Take trans lock before access running trans in check_delayed_ref
  btrfs: Fix wrong first_key parameter in replace_path
2018-05-04 20:32:18 -10:00
Thomas Gleixner 356e4bfff2 prctl: Add force disable speculation
For certain use cases it is desired to enforce mitigations so they cannot
be undone afterwards. That's important for loader stubs which want to
prevent a child from disabling the mitigation again. Will also be used for
seccomp(). The extra state preserving of the prctl state for SSB is a
preparatory step for EBPF dymanic speculation control.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-05 00:51:43 +02:00
Jan Kara b8b784958e bdi: Fix oops in wb_workfn()
Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb->bdi->dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:

	mod_delayed_work(bdi_wq, &wb->dwork, 0);

and if this happens while wb_shutdown() is waiting in:

	flush_delayed_work(&wb->dwork);

the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb->bdi->dev.

Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.

CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
CC: Tejun Heo <tj@kernel.org>
Fixes: 839a8e8660
Reported-by: syzbot <syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 16:11:37 -06:00
Kees Cook fae1fa0fc6 proc: Provide details on speculation flaw mitigations
As done with seccomp and no_new_privs, also show speculation flaw
mitigation state in /proc/$pid/status.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-03 13:55:51 +02:00
Darrick J. Wong 021ba8e98f xfs: cap the length of deduplication requests
Since deduplication potentially has to read in all the pages in both
files in order to compare the contents, cap the deduplication request
length at MAX_RW_COUNT/2 (roughly 1GB) so that we have /some/ upper bound
on the request length and can't just lock up the kernel forever.  Found
by running generic/304 after commit 1ddae54555b62 ("common/rc: add
missing 'local' keywords").

Reported-by: matorola@gmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-05-02 09:21:33 -07:00
Filipe Manana a6aa10c70b Btrfs: send, fix missing truncate for inode with prealloc extent past eof
An incremental send operation can miss a truncate operation when an inode
has an increased size in the send snapshot and a prealloc extent beyond
its size.

Consider the following scenario where a necessary truncate operation is
missing in the incremental send stream:

1) In the parent snapshot an inode has a size of 1282957 bytes and it has
   no prealloc extents beyond its size;

2) In the the send snapshot it has a size of 5738496 bytes and has a new
   extent at offsets 1884160 (length of 106496 bytes) and a prealloc
   extent beyond eof at offset 6729728 (and a length of 339968 bytes);

3) When processing the prealloc extent, at offset 6729728, we end up at
   send.c:send_write_or_clone() and set the @len variable to a value of
   18446744073708560384 because @offset plus the original @len value is
   larger then the inode's size (6729728 + 339968 > 5738496). We then
   call send_extent_data(), with that @offset and @len, which in turn
   calls send_write(), and then the later calls fill_read_buf(). Because
   the offset passed to fill_read_buf() is greater then inode's i_size,
   this function returns 0 immediately, which makes send_write() and
   send_extent_data() do nothing and return immediately as well. When
   we get back to send.c:send_write_or_clone() we adjust the value
   of sctx->cur_inode_next_write_offset to @offset plus @len, which
   corresponds to 6729728 + 18446744073708560384 = 5738496, which is
   precisely the the size of the inode in the send snapshot;

4) Later when at send.c:finish_inode_if_needed() we determine that
   we don't need to issue a truncate operation because the value of
   sctx->cur_inode_next_write_offset corresponds to the inode's new
   size, 5738496 bytes. This is wrong because the last write operation
   that was issued started at offset 1884160 with a length of 106496
   bytes, so the correct value for sctx->cur_inode_next_write_offset
   should be 1990656 (1884160 + 106496), so that a truncate operation
   with a value of 5738496 bytes would have been sent to insert a
   trailing hole at the destination.

So fix the issue by making send.c:send_write_or_clone() not attempt
to send write or clone operations for extents that start beyond the
inode's size, since such attempts do nothing but waste time by
calling helper functions and allocating path structures, and send
currently has no fallocate command in order to create prealloc extents
at the destination (either beyond a file's eof or not).

The issue was found running the test btrfs/007 from fstests using a seed
value of 1524346151 for fsstress.

Reported-by: Gu, Jinxiang <gujx@cn.fujitsu.com>
Fixes: ffa7c4296e ("Btrfs: send, do not issue unnecessary truncate operations")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-02 11:55:29 +02:00
ethanwu 998ac6d21c btrfs: Take trans lock before access running trans in check_delayed_ref
In preivous patch:
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
We avoid starting btrfs transaction and get this information from
fs_info->running_transaction directly.

When accessing running_transaction in check_delayed_ref, there's a
chance that current transaction will be freed by commit transaction
after the NULL pointer check of running_transaction is passed.

After looking all the other places using fs_info->running_transaction,
they are either protected by trans_lock or holding the transactions.

Fix this by using trans_lock and increasing the use_count.

Fixes: e4c3b2dcd1 ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-02 11:54:58 +02:00
Linus Torvalds f2125992e7 Changes since last update:
- Enhance inode fork verifiers to prevent loading of corrupted metadata.
 - Fix a crash when we try to convert extents format inodes to btree
   format, we run out of space, but forget to revert the in-core state
   changes.
 - Fix file size checks when doing INSERT_RANGE that could cause files
   to end up negative size if there previously was an extent mapped at
   s_maxbytes.
 - Fix a bug when doing a remove-then-add ATTR_REPLACE xattr update where
   we forget to clear ATTR_REPLACE after the remove, which causes the
   attr to be lost and the fs to shut down due to (what it thinks is)
   inconsistent in-core state.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJa2gs1AAoJEPh/dxk0SrTru50P+wZuY2DRfBSCH20Oq3qTbO9M
 dQGBnOIqRUC+9IDkpRExB4BYVZoK9/v3fnW3zttRMlEmTQfHtMsoEUXpByoETY5b
 /bTZh+c7fMhj7eNZspfc+8xsHvq0k3hSxrITe4zjL2rSy72KUrsYtDoIu2UvXyZK
 nJsqCiyFOdFgMi6IBRLOAVBPzs/q8sIBfl/axjyvokLL/6ki/TfvCAtLkdT4FRIt
 UHzE8ly/Z99honciPQW4axZ9TobAVd6g2d11XJpbhku3ijTL/vHyflRCFgmM/2T3
 VyJ3tTH/w3rCAXOEEj3H8TvKAlHiKpB+g9VwhsTZrP0B14/ljkClm/yEFnCOLmb1
 /26t+A++fkqy71PoqeQvXGJGEAxC/1TGo7dxq6Gn5SESc1yr8CjXXMiWTMBBWgfF
 xIsBNA6Ok0F5+OkEhfSUtfIBziPeUIjbmNGTnMV/EiL9stpHwrQS+JLAK5+POHvQ
 ZoH5RTY69mK5rJtzN6USGyPe4J0f+S7YTf9fjKTcjjpWHLjJYEYJcGvQb3ynZSV+
 5Wu1TqaUkl/Mp3mkZ1KFMq+MXBgOnarOZug80fbhkv3Vyzbelzpebuc5fMJnzofp
 0x6BQUC+bwOxB2KJX8TC+9ajen8O+oxrE2RyehVtCZL0O830NvZ4hQg3yxjKUbwv
 8Y0Mq6Q2CQEmsBJkIHH2
 =lFf0
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are a few more bug fixes for xfs for 4.17-rc4. Most of them are
  fixes for bad behavior.

  This series has been run through a full xfstests run during LSF and
  through a quick xfstests run against this morning's master, with no
  major failures reported.

  Summary:

   - Enhance inode fork verifiers to prevent loading of corrupted
     metadata.

   - Fix a crash when we try to convert extents format inodes to btree
     format, we run out of space, but forget to revert the in-core state
     changes.

   - Fix file size checks when doing INSERT_RANGE that could cause files
     to end up negative size if there previously was an extent mapped at
     s_maxbytes.

   - Fix a bug when doing a remove-then-add ATTR_REPLACE xattr update
     where we forget to clear ATTR_REPLACE after the remove, which
     causes the attr to be lost and the fs to shut down due to (what it
     thinks is) inconsistent in-core state"

* tag 'xfs-4.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE
  xfs: prevent creating negative-sized file via INSERT_RANGE
  xfs: set format back to extents if xfs_bmap_extents_to_btree
  xfs: enhance dinode verifier
2018-05-01 09:11:45 -07:00
Linus Torvalds cdface5209 Fix misc. bugs and a regression for ext4.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlrlM9IACgkQ8vlZVpUN
 gaOtrQf+OhNwH0vIfCQwQG36m7DM5DNJ1eiFV6Qp++7B+nzMO7nRaXrINTKty/iO
 vSK59a2lCa1Uufjii/vIiVoVjiaiRfVJJt4/WFV6jtESxnBAPTj//nFDGBoQANsR
 XxTNDO9Bkra9QlWgasSiqbkyyKd2KFHf23LP1fdfspXuRFrGhu6pYqaZpbx8V0/2
 j+TeLS9V8vNx/rWNmvpMK+WopapvrGYoA0YESAZJBCLMNO5uZZ+2qteORPp+Y5oQ
 0d0mCLPedYy5gagHUIN4EnpwP4zNh8efQhQA16teEqs+foIHnyp7VnYYG1lJ3z+Y
 bYQoQGmKKdnd6Hl+/5sLYct8yZEhXQ==
 =EF3H
 -----END PGP SIGNATURE-----

Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix misc bugs and a regression for ext4"

* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add MODULE_SOFTDEP to ensure crc32c is included in the initramfs
  ext4: fix bitmap position validation
  ext4: set h_journal if there is a failure starting a reserved handle
  ext4: prevent right-shifting extents beyond EXT_MAX_BLOCKS
2018-04-28 20:07:21 -07:00
Linus Torvalds cac264288a Security fixes for SMB3 for 4.17-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQGwBAABCAAaBQJa4LQCExxzbWZyZW5jaEBnbWFpbC5jb20ACgkQiiy9cAdyT1GQ
 OAv+KPrprp+2jkoEZRgy/cJBZGmMCfyfjxM9eAZUr35FfkuMF8ir4cJ0AbbSpQOY
 E+WDdeRhS9FjueIVGKGi74C9yhJpEEDPtvFCFqhGJ5/AIDBBDpC4KyCQEfJDTlGo
 b+USbBcu+6bzqASN/L4fwhz1E+Q45RUb98E/IHPOKzV7qASyjYpB9CUcFaANt03k
 GK+VKNJF5ppa9YRgXwoPFbD4M2B/Wfe5ZP5+9ZYnYmZnpIUFCuY1rrASuJdcwFN2
 z97sGmqacR+a12FkdfZCoeK+dpsQ+ZeyKiB5sOpj+gr7apKAEmESjRlC1xpNRJ4B
 TOvkOSyYaNUr94HJo8FXzs1a/j8I59Cn2ER2o8Z0f1s7QvgFxvF09AnDUPQt1OBK
 197KNO6a/E1Rd+umPzKvgTIbrm7fcPYZgEWNHdbd3Hf8eBvFs8372GZXGLSkupmJ
 jnBkwYbukz6KsVNEx8m/9fGhKhyJCZ34yTEKrT+aNDQ0aA4vVuP/WUa+GtNSkDuL
 WX0u
 =U4ba
 -----END PGP SIGNATURE-----

Merge tag '4.17-rc2-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "A few security related fixes for SMB3, most importantly for SMB3.11
  encryption"

* tag '4.17-rc2-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: smbd: Avoid allocating iov on the stack
  cifs: smbd: Don't use RDMA read/write when signing is used
  SMB311: Fix reconnect
  SMB3: Fix 3.11 encryption to Windows and handle encrypted smb3 tcon
  CIFS: set *resp_buf_type to NO_BUFFER on error
2018-04-28 09:51:56 -07:00
Greg Thelen 3c6b03d18d cifs: smbd: depend on INFINIBAND_ADDR_TRANS
CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Tarick Bedeir <tarick@google.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 11:15:44 -04:00
Qu Wenruo 17515f1b76 btrfs: Fix wrong first_key parameter in replace_path
Commit 581c176041 ("btrfs: Validate child tree block's level and first
key") introduced new @first_key parameter for read_tree_block(), however
caller in replace_path() is parasing wrong key to read_tree_block().

It should use parameter @first_key other than @key.

Normally it won't expose problem as @key is normally initialzied to the
same value of @first_key we expect.
However in relocation recovery case, @key can be set to (0, 0, 0), and
since no valid key in relocation tree can be (0, 0, 0), it will cause
read_tree_block() to return -EUCLEAN and interrupt relocation recovery.

Fix it by setting @first_key correctly.

Fixes: 581c176041 ("btrfs: Validate child tree block's level and first key")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-26 13:21:04 +02:00
Theodore Ts'o 7ef79ad521 ext4: add MODULE_SOFTDEP to ensure crc32c is included in the initramfs
Fixes: a45403b515 ("ext4: always initialize the crc32c checksum driver")
Reported-by: François Valenduc <francoisvalenduc@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-04-26 00:44:46 -04:00
Long Li 8bcda1d2a7 cifs: smbd: Avoid allocating iov on the stack
It's not necessary to allocate another iov when going through the buffers
in smbd_send() through RDMA send.

Remove it to reduce stack size.

Thanks to Matt for spotting a printk typo in the earlier version of this.

CC: Matt Redfearn <matt.redfearn@mips.com>
Signed-off-by: Long Li <longli@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
2018-04-25 11:15:58 -05:00
Long Li bb4c041947 cifs: smbd: Don't use RDMA read/write when signing is used
SMB server will not sign data transferred through RDMA read/write. When
signing is used, it's a good idea to have all the data signed.

In this case, use RDMA send/recv for all data transfers. This will degrade
performance as this is not generally configured in RDMA environemnt. So
warn the user on signing and RDMA send/recv.

Signed-off-by: Long Li <longli@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
2018-04-25 11:15:53 -05:00
Steve French 0d5ec281c0 SMB311: Fix reconnect
The preauth hash was not being recalculated properly on reconnect
of SMB3.11 dialect mounts (which caused access denied repeatedly
on auto-reconnect).

Fixes: 8bd68c6e47 ("CIFS: implement v3.11 preauth integrity")

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-04-25 11:15:20 -05:00
Lukas Czerner 22be37acce ext4: fix bitmap position validation
Currently in ext4_valid_block_bitmap() we expect the bitmap to be
positioned anywhere between 0 and s_blocksize clusters, but that's
wrong because the bitmap can be placed anywhere in the block group. This
causes false positives when validating bitmaps on perfectly valid file
system layouts. Fix it by checking whether the bitmap is within the group
boundary.

The problem can be reproduced using the following

mkfs -t ext3 -E stride=256 /dev/vdb1
mount /dev/vdb1 /mnt/test
cd /mnt/test
wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.16.3.tar.xz
tar xf linux-4.16.3.tar.xz

This will result in the warnings in the logs

EXT4-fs error (device vdb1): ext4_validate_block_bitmap:399: comm tar: bg 84: block 2774529: invalid block bitmap

[ Changed slightly for clarity and to not drop a overflow test -- TYT ]

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: 7dac4a1726 ("ext4: add validity checks for bitmap block numbers")
Cc: stable@vger.kernel.org
2018-04-24 11:31:44 -04:00
Steve French 23657ad730 SMB3: Fix 3.11 encryption to Windows and handle encrypted smb3 tcon
Temporarily disable AES-GCM, as AES-CCM is only currently
enabled mechanism on client side.  This fixes SMB3.11
encrypted mounts to Windows.

Also the tree connect request itself should be encrypted if
requested encryption ("seal" on mount), in addition we should be
enabling encryption in 3.11 based on whether we got any valid
encryption ciphers back in negprot (the corresponding session flag is
not set as it is in 3.0 and 3.02)

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
2018-04-24 10:07:14 -05:00
Steve French 117e3b7fed CIFS: set *resp_buf_type to NO_BUFFER on error
Dan Carpenter had pointed this out a while ago, but the code around
this had changed so wasn't causing any problems since that field
was not used in this error path.

Still, it is cleaner to always initialize this field, so changing
the error path to set it.

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-04-24 10:06:28 -05:00
Yan, Zheng f191982689 ceph: check if mds create snaprealm when setting quota
If mds does not, return -EOPNOTSUPP.

Link: http://tracker.ceph.com/issues/23491
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-23 17:35:19 +02:00
Linus Torvalds 5ec83b22a2 various SMB3/CIFS fixes for stable 4.17-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQGwBAABCAAaBQJa2jQtExxzbWZyZW5jaEBnbWFpbC5jb20ACgkQiiy9cAdyT1Fb
 AAv9FqmD7ElRfZAHBO1gFhOMojhGngf4VsdoKD7raFhkzQLJbEB/Vjjf0OQ6IeR9
 EJ+yW7H7lLIifv+xFandBxhjFvg3H+HTFdOuAUk4zpffripOgyfIJeHmc0Lt3DDA
 sGGsIcgl5wWXV9hEEtMGlUOoKeVfMKoDqOmIvtOUeSZ616G2NXSRM7ySnSX3KEQR
 SJo+rI16P4OQbEZ3W5ElJE6otytKQF76cCoOdEuWLF5mSJfkaCiovrDw/VM2oXFF
 C8Uu4cHJFDDxL2uPrYGFPGKWisi0K0G8Lq5q0nOQWC8Ajh9EgfGR667p3btEGQ0d
 TPYLHaXleb45qtMUDq7JiPy3VTs5qvr5uJyQdB9s3E+M+9pq1bgQmIVymKJi9ccu
 QbKospypFsz81t0DwB7/2TEVsh00Y6bcNIALTALUNntP/emeiQwLrwOkWb3PBlIf
 WM0OJH1JJChvypDZ8n29wy2A2p1YIflrGghieONf2jG/dL7bUzssfMCXFxY2MVvf
 GCUk
 =lT4e
 -----END PGP SIGNATURE-----

Merge tag '4.17-rc1-SMB3-CIFS' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Various SMB3/CIFS fixes.

  There are three more security related fixes in progress that are not
  included in this set but they are still being tested and reviewed, so
  sending this unrelated set of smaller fixes now"

* tag '4.17-rc1-SMB3-CIFS' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: fix typo in cifs_dbg
  cifs: do not allow creating sockets except with SMB1 posix exensions
  cifs: smbd: Dump SMB packet when configured
  cifs: smbd: Check for iov length on sending the last iov
  fs: cifs: Adding new return type vm_fault_t
  cifs: smb2ops: Fix NULL check in smb2_query_symlink
2018-04-22 12:13:04 -07:00
Linus Torvalds d54b5c1315 for-4.17-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrcop8ACgkQxWXV+ddt
 WDuT2Q/9HZo9b1N4f+dM38P/cyBwEt9VYT0MLzpIpOid6ogpdF+1E7F6rDnUOyEQ
 0Me+znhC7kgXLBnpUDS16/2uGbp9VgCZo9NSq8juLrowGYpe9asadWbCqfFBmAh6
 Pc6r03Z1Z1EhXXUSm0cqJwvNHpNN6WnfBP1ithcp4OAltjVZ7IU45xHCcymfTxi+
 R6/kvJlG56YXoT97on/CHF85pAidELJ7CysLBGpCHP7Yl/lkVkyPVxSZO9up6ZtS
 078l0nCG9PpzONbdUcB9QGICPDcNLbBDhrS8yzQ+oajiRBz0+Im4c2UUvmZEUiMb
 6EAo6QV58OFUskgsY1cHWgZGS3YIFwzpmdsRQrCyuI4Z+1KXVu+RWMv1hF1G1Dqw
 8fHDm6+V2oqcoK3lSnb40UUySGw2d7PwjGxUk5iGKWISdmrxQsLcgkY/ERbjZxar
 emelwAZbsPwUNYNJj8qbm1QZlmys8FACKzJqyCR6+7/n353uWE30Tnv1MgfJg0nQ
 7ejk3U96R+oynN/9WOBFCLpuvPCbYaZk4eA9+3EjcSLzN81r/WtbgQDntEWi9auO
 /OUADaHOXoIaUnTBNtFV9DTseDW21hQ/4NseTVVsjKGWjh2qWRoEwuqI7TEZD/4M
 kGgdJL+xvD0MzR6Tq7xbhnQTcCOwH6NQ7aI4rI3abDbU9ZPuDrY=
 =lgaO
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "This contains a few fixups to the qgroup patches that were merged this
  dev cycle, unaligned access fix, blockgroup removal corner case fix
  and a small debugging output tweak"

* tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: print-tree: debugging output enhancement
  btrfs: Fix race condition between delayed refs and blockgroup removal
  btrfs: fix unaligned access in readdir
  btrfs: Fix wrong btrfs_delalloc_release_extents parameter
  btrfs: delayed-inode: Remove wrong qgroup meta reservation calls
  btrfs: qgroup: Use independent and accurate per inode qgroup rsv
  btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
2018-04-22 12:09:27 -07:00
Tetsuo Handa d23a61ee90 fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST error
Commit 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map") is
printing spurious messages under memory pressure due to map_addr == -ENOMEM.

 9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already
 14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already
 16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already

Complain only if -EEXIST, and use %px for printing the address.

Link: http://lkml.kernel.org/r/201804182307.FAC17665.SFMOFJVFtHOLOQ@I-love.SAKURA.ne.jp
Fixes: 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map") is
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:36 -07:00
Alexey Dobriyan 9a1015b32f proc: fix /proc/loadavg regression
Commit 95846ecf9d ("pid: replace pid bitmap implementation with IDR
API") changed last field of /proc/loadavg (last pid allocated) to be off
by one:

	# unshare -p -f --mount-proc cat /proc/loadavg
	0.00 0.00 0.00 1/60 2	<===

It should be 1 after first fork into pid namespace.

This is formally a regression but given how useless this field is I
don't think anyone is affected.

Bug was found by /proc testsuite!

Link: http://lkml.kernel.org/r/20180413175408.GA27246@avx2
Fixes: 95846ecf9d ("pid: replace pid bitmap implementation with IDR API")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gargi Sharma <gs051095@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:36 -07:00
Alexey Dobriyan 2e0ad552f5 proc: revalidate kernel thread inodes to root:root
task_dump_owner() has the following code:

	mm = task->mm;
	if (mm) {
		if (get_dumpable(mm) != SUID_DUMP_USER) {
			uid = ...
		}
	}

Check for ->mm is buggy -- kernel thread might be borrowing mm
and inode will go to some random uid:gid pair.

Link: http://lkml.kernel.org/r/20180412220109.GA20978@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
Ian Kent 1e6306652b autofs: mount point create should honour passed in mode
The autofs file system mkdir inode operation blindly sets the created
directory mode to S_IFDIR | 0555, ingoring the passed in mode, which can
cause selinux dac_override denials.

But the function also checks if the caller is the daemon (as no-one else
should be able to do anything here) so there's no point in not honouring
the passed in mode, allowing the daemon to set appropriate mode when
required.

Link: http://lkml.kernel.org/r/152361593601.8051.14014139124905996173.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
Greg Thelen 2e898e4c0a writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
Huang Ying 88c28f2469 mm, pagemap: fix swap offset value for PMD migration entry
The swap offset reported by /proc/<pid>/pagemap may be not correct for
PMD migration entries.  If addr passed into pagemap_pmd_range() isn't
aligned with PMD start address, the swap offset reported doesn't
reflect this.  And in the loop to report information of each sub-page,
the swap offset isn't increased accordingly as that for PFN.

This may happen after opening /proc/<pid>/pagemap and seeking to a page
whose address doesn't align with a PMD start address.  I have verified
this with a simple test program.

BTW: migration swap entries have PFN information, do we need to restrict
whether to show them?

[akpm@linux-foundation.org: fix typo, per Huang, Ying]
Link: http://lkml.kernel.org/r/20180408033737.10897-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Jerome Glisse" <jglisse@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
Aurelien Aptel 596632de04 CIFS: fix typo in cifs_dbg
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reported-by: Long Li <longli@microsoft.com>
2018-04-20 13:39:10 -05:00
Steve French 1d0cffa674 cifs: do not allow creating sockets except with SMB1 posix exensions
RHBZ: 1453123

Since at least the 3.10 kernel and likely a lot earlier we have
not been able to create unix domain sockets in a cifs share
when mounted using the SFU mount option (except when mounted
with the cifs unix extensions to Samba e.g.)
Trying to create a socket, for example using the af_unix command from
xfstests will cause :
BUG: unable to handle kernel NULL pointer dereference at 00000000
00000040

Since no one uses or depends on being able to create unix domains sockets
on a cifs share the easiest fix to stop this vulnerability is to simply
not allow creation of any other special files than char or block devices
when sfu is used.

Added update to Ronnie's patch to handle a tcon link leak, and
to address a buf leak noticed by Gustavo and Colin.

Acked-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
CC:  Colin Ian King <colin.king@canonical.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: stable@vger.kernel.org
2018-04-20 13:31:32 -05:00
Long Li ff30b89e0a cifs: smbd: Dump SMB packet when configured
When sending through SMB Direct, also dump the packet in SMB send path.

Also fixed a typo in debug message.

Signed-off-by: Long Li <longli@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-04-20 12:18:28 -05:00
Qu Wenruo c087232374 btrfs: print-tree: debugging output enhancement
This patch enhances the following things:

- tree block header
  * add generation and owner output for node and leaf
- node pointer generation output
- allow btrfs_print_tree() to not follow nodes
  * just like btrfs-progs

Please note that, although function btrfs_print_tree() is not called by
anyone right now, it's still a pretty useful function to debug kernel.
So that function is still kept for later use.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-20 19:18:16 +02:00
Nikolay Borisov 5e388e9581 btrfs: Fix race condition between delayed refs and blockgroup removal
When the delayed refs for a head are all run, eventually
cleanup_ref_head is called which (in case of deletion) obtains a
reference for the relevant btrfs_space_info struct by querying the bg
for the range. This is problematic because when the last extent of a
bg is deleted a race window emerges between removal of that bg and the
subsequent invocation of cleanup_ref_head. This can result in cache being null
and either a null pointer dereference or assertion failure.

	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
	Call Trace:
	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
	 evict+0xc6/0x190
	 do_unlinkat+0x19c/0x300
	 do_syscall_64+0x74/0x140
	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
	RIP: 0033:0x7fbf589c57a7

To fix this, introduce a new flag "is_system" to head_ref structs,
which is populated at insertion time. This allows to decouple the
querying for the spaceinfo from querying the possibly deleted bg.

Fixes: d7eae3403f ("Btrfs: rework delayed ref total_bytes_pinned accounting")
CC: stable@vger.kernel.org # 4.14+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-20 19:17:25 +02:00
David Howells a9e5b73288 vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion
In do_mount() when the MS_* flags are being converted to MNT_* flags,
MS_RDONLY got accidentally convered to SB_RDONLY.

Undo this change.

Fixes: e462ec50cb ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 09:59:33 -07:00
David Howells 660625922b afs: Fix server record deletion
AFS server records get removed from the net->fs_servers tree when
they're deleted, but not from the net->fs_addresses{4,6} lists, which
can lead to an oops in afs_find_server() when a server record has been
removed, for instance during rmmod.

Fix this by deleting the record from the by-address lists before posting
it for RCU destruction.

The reason this hasn't been noticed before is that the fileserver keeps
probing the local cache manager, thereby keeping the service record
alive, so the oops would only happen when a fileserver eventually gets
bored and stops pinging or if the module gets rmmod'd and a call comes
in from the fileserver during the window between the server records
being destroyed and the socket being closed.

The oops looks something like:

  BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
  ...
  Workqueue: kafsd afs_process_async_call [kafs]
  RIP: 0010:afs_find_server+0x271/0x36f [kafs]
  ...
  Call Trace:
   afs_deliver_cb_init_call_back_state3+0x1f2/0x21f [kafs]
   afs_deliver_to_call+0x1ee/0x5e8 [kafs]
   afs_process_async_call+0x5b/0xd0 [kafs]
   process_one_work+0x2c2/0x504
   worker_thread+0x1d4/0x2ac
   kthread+0x11f/0x127
   ret_from_fork+0x24/0x30

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 09:59:33 -07:00
Linus Torvalds b9abdcfd10 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes.

  Some of that is only a matter with fault injection (broken handling of
  small allocation failure in various mount-related places), but the
  last one is a root-triggerable stack overflow, and combined with
  userns it gets really nasty ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Don't leak MNT_INTERNAL away from internal mounts
  mm,vmscan: Allow preallocating memory for register_shrinker().
  rpc_pipefs: fix double-dput()
  orangefs_kill_sb(): deal with allocation failures
  jffs2_kill_sb(): deal with failed allocations
  hypfs_kill_super(): deal with failed allocations
2018-04-20 09:15:14 -07:00
Linus Torvalds 43f70c9601 Minor cleanups and a bug fix to completely ignore unencrypted filenames
in the lower filesystem when filename encryption is enabled at the
 eCryptfs layer.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJa2LwFAAoJENaSAD2qAscKt00P+wSOy/iUn7wd3L/mtjN+B34l
 8MaBNoruoZ87pPPHm4ZjB85zk5fSzwQl/jKgS8y6dRA1bxcwsUYRMooTIsiOBFeg
 c1BEWfJVL8uq5AtmZ+czwUD2+DNn/U6S+XejXKr/bQyzOOgUtAeX8IPERPzGqJkB
 dXbz21NLBpICxslIHTTHCpCWodm8B9SJSrBeKd1WIad1d+ITjB6fnyHyIlEyFxlk
 yxkXLVZskR/yihIQ8JugJ4Li0nWz4KcdAvS2fbSIG5zgkAtLS9ypPAAQrfCf9pCA
 BYc4Q2Ai425tgVqAwK2uHXOzy7/WsPL4yrIXt/04vTra1XcK9cv2EdojYi1cK04W
 /a6xIWzFtp7HYEhuHZiVdvSTRXEOF9IYUXTw/gavE0KvMZ36k2FU9bLo4WKmx/91
 atr+9p4vE6uA932YY54Wit++IjpLGN7FzXYxVVoACVz7HJ+sESqytPiLVX5iufux
 /xrrbmlEpLEeJa6/F1Yboo9R7aclrMsxEz4Y+Txkk/zmjzP8wYYiYodOCJ23pauY
 5MJruAxLtlcmAzW7FbES/Id+6Rfj42BuzXi3TRceiUv8NWmDTk3y205o94CfJW5U
 mxEiFgb7Nxr1UsL66rZGmpnwcfJY4FFEyuXRGEmATBquEGJPTqhku9HKUkmORa7I
 mYJzK27VX7dE2lAzuMpr
 =bwS0
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-4.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs fixes from Tyler Hicks:
 "Minor cleanups and a bug fix to completely ignore unencrypted
  filenames in the lower filesystem when filename encryption is enabled
  at the eCryptfs layer"

* tag 'ecryptfs-4.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: don't pass up plaintext names when using filename encryption
  ecryptfs: fix spelling mistake: "cadidate" -> "candidate"
  ecryptfs: lookup: Don't check if mount_crypt_stat is NULL
2018-04-20 09:08:37 -07:00
Linus Torvalds 0d9cf33b4a \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlrXYi4ACgkQnJ2qBz9k
 QNmp5Af9HhwGDuTgY+OcVEeZzXQ30BbFQNP4cby5LZ3Wmffio8T97c2DnlDDjr2E
 l8Biq5HTm3Gg5x7WEuopyrS6L4on76RDyr1Ohf34mOoIcxdcWQ7XpiyA8jeP9uLn
 9IJSPYPbA2RsQk4yr8GBRlVQh/udJfSG0vQOQ5cPylgICLsfxuMIC294vFfcgDt1
 wLOy24INlJYbyECMIQKrleyoZQF37Cv7v12KPU0w3eLtuNX0rdI4gwpOS4eEb27G
 2KrAsRYCb5KyFl3mRXBeCPNytiJ3ffFgvFC1Z57GVWVmNnadj/Jctw3zFirsMC2d
 j7IUA4XI/T+1Gql5rCOXcSgeJ0LRZA==
 =KQzD
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

 - isofs memory leak fix

 - two fsnotify fixes of event mask handling

 - udf fix of UTF-16 handling

 - couple other smaller cleanups

* tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix leak of UTF-16 surrogates into encoded strings
  fs: ext2: Adding new return type vm_fault_t
  isofs: fix potential memory leak in mount option parsing
  MAINTAINERS: add an entry for FSNOTIFY infrastructure
  fsnotify: fix typo in a comment about mark->g_list
  fsnotify: fix ignore mask logic in send_to_group()
  isofs compress: Remove VLA usage
  fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init
  fanotify: fix logic of events on child
2018-04-20 09:01:26 -07:00
Al Viro 16a34adb93 Don't leak MNT_INTERNAL away from internal mounts
We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
their copies.  As it is, creating a deep stack of bindings of /proc/*/ns/*
somewhere in a new namespace and exiting yields a stack overflow.

Cc: stable@kernel.org
Reported-by: Alexander Aring <aring@mojatatu.com>
Bisected-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-19 23:52:15 -04:00
Long Li ab60ee7bf9 cifs: smbd: Check for iov length on sending the last iov
When sending the last iov that breaks into smaller buffers to fit the
transfer size, it's necessary to check if this is the last iov.

If this is the latest iov, stop and proceed to send pages.

Signed-off-by: Long Li <longli@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-04-18 22:02:49 -05:00
David Sterba 92d3217084 btrfs: fix unaligned access in readdir
The last update to readdir introduced a temporary buffer to store the
emitted readdir data, but as there are file names of variable length,
there's a lot of unaligned access.

This was observed on a sparc64 machine:

  Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs]

Fixes: 23b5ec7494 ("btrfs: fix readdir deadlock with pagefault")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: René Rebe <rene@exactcode.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-19 00:35:08 +02:00
Theodore Ts'o b2569260d5 ext4: set h_journal if there is a failure starting a reserved handle
If ext4 tries to start a reserved handle via
jbd2_journal_start_reserved(), and the journal has been aborted, this
can result in a NULL pointer dereference.  This is because the fields
h_journal and h_transaction in the handle structure share the same
memory, via a union, so jbd2_journal_start_reserved() will clear
h_journal before calling start_this_handle().  If this function fails
due to an aborted handle, h_journal will still be NULL, and the call
to jbd2_journal_free_reserved() will pass a NULL journal to
sub_reserve_credits().

This can be reproduced by running "kvm-xfstests -c dioread_nolock
generic/475".

Cc: stable@kernel.org # 3.11
Fixes: 8f7d89f368 ("jbd2: transaction reservation support")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-04-18 11:49:31 -04:00
Qu Wenruo 336a8bb8e3 btrfs: Fix wrong btrfs_delalloc_release_extents parameter
Commit 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type
for delalloc") merged into mainline is not the latest version submitted
to mail list in Dec 2017.

It has a fatal wrong @qgroup_free parameter, which results increasing
qgroup metadata pertrans reserved space, and causing a lot of early EDQUOT.

Fix it by applying the correct diff on top of current branch.

Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18 16:46:57 +02:00
Qu Wenruo f218ea6c47 btrfs: delayed-inode: Remove wrong qgroup meta reservation calls
Commit 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for
delayed inode and item") merged into mainline was not latest version
submitted to the mail list in Dec 2017.

Which lacks the following fixes:

1) Remove btrfs_qgroup_convert_reserved_meta() call in
   btrfs_delayed_item_release_metadata()
2) Remove btrfs_qgroup_reserve_meta_prealloc() call in
   btrfs_delayed_inode_reserve_metadata()

Those fixes will resolve unexpected EDQUOT problems.

Fixes: 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18 16:46:55 +02:00
Qu Wenruo ff6bc37eb7 btrfs: qgroup: Use independent and accurate per inode qgroup rsv
Unlike reservation calculation used in inode rsv for metadata, qgroup
doesn't really need to care about things like csum size or extent usage
for the whole tree COW.

Qgroups care more about net change of the extent usage.
That's to say, if we're going to insert one file extent, it will mostly
find its place in COWed tree block, leaving no change in extent usage.
Or causing a leaf split, resulting in one new net extent and increasing
qgroup number by nodesize.
Or in an even more rare case, increase the tree level, increasing qgroup
number by 2 * nodesize.

So here instead of using the complicated calculation for extent
allocator, which cares more about accuracy and no error, qgroup doesn't
need that over-estimated reservation.

This patch will maintain 2 new members in btrfs_block_rsv structure for
qgroup, using much smaller calculation for qgroup rsv, reducing false
EDQUOT.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
2018-04-18 16:46:51 +02:00
Qu Wenruo a514d63882 btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
Unlike previous method that tries to commit transaction inside
qgroup_reserve(), this time we will try to commit transaction using
fs_info->transaction_kthread to avoid nested transaction and no need to
worry about locking context.

Since it's an asynchronous function call and we won't wait for
transaction commit, unlike previous method, we must call it before we
hit the qgroup limit.

So this patch will use the ratio and size of qgroup meta_pertrans
reservation as indicator to check if we should trigger a transaction
commit.  (meta_prealloc won't be cleaned in transaction committ, it's
useless anyway)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18 16:46:47 +02:00
Jan Kara 44f06ba829 udf: Fix leak of UTF-16 surrogates into encoded strings
OSTA UDF specification does not mention whether the CS0 charset in case
of two bytes per character encoding should be treated in UTF-16 or
UCS-2. The sample code in the standard does not treat UTF-16 surrogates
in any special way but on systems such as Windows which work in UTF-16
internally, filenames would be treated as being in UTF-16 effectively.
In Linux it is more difficult to handle characters outside of Base
Multilingual plane (beyond 0xffff) as NLS framework works with 2-byte
characters only. Just make sure we don't leak UTF-16 surrogates into the
resulting string when loading names from the filesystem for now.

CC: stable@vger.kernel.org # >= v4.6
Reported-by: Mingye Wang <arthur200126@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-18 16:34:55 +02:00
Darrick J. Wong 7b38460dc8 xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE
Kanda Motohiro reported that expanding a tiny xattr into a large xattr
fails on XFS because we remove the tiny xattr from a shortform fork and
then try to re-add it after converting the fork to extents format having
not removed the ATTR_REPLACE flag.  This fails because the attr is no
longer present, causing a fs shutdown.

This is derived from the patch in his bug report, but we really
shouldn't ignore a nonzero retval from the remove call.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119
Reported-by: kanda.motohiro@gmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 19:10:15 -07:00
Darrick J. Wong 7d83fb1425 xfs: prevent creating negative-sized file via INSERT_RANGE
During the "insert range" fallocate operation, i_size grows by the
specified 'len' bytes.  XFS verifies that i_size + len < s_maxbytes, as
it should.  But this comparison is done using the signed 'loff_t', and
'i_size + len' can wrap around to a negative value, causing the check to
incorrectly pass, resulting in an inode with "negative" i_size.  This is
possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX.
ext4 and f2fs don't run into this because they set a smaller s_maxbytes.

Fix it by using subtraction instead.

Reproducer:
    xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096"

Fixes: a904b1ca57 ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate")
Cc: <stable@vger.kernel.org> # v4.1+
Originally-From: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix signed integer addition overflow too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 17:29:11 -07:00
Eric Sandeen 2c4306f719 xfs: set format back to extents if xfs_bmap_extents_to_btree
If xfs_bmap_extents_to_btree fails in a mode where we call
xfs_iroot_realloc(-1) to de-allocate the root, set the
format back to extents.

Otherwise we can assume we can dereference ifp->if_broot
based on the XFS_DINODE_FMT_BTREE format, and crash.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 17:10:17 -07:00
Eric Sandeen b42db0860e xfs: enhance dinode verifier
Add several more validations to xfs_dinode_verify:

- For LOCAL data fork formats, di_nextents must be 0.
- For LOCAL attr fork formats, di_anextents must be 0.
- For inodes with no attr fork offset,
  - format must be XFS_DINODE_FMT_EXTENTS if set at all
  - di_anextents must be 0.

Thanks to dchinner for pointing out a couple related checks I had
forgotten to add.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199377
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-04-17 17:10:17 -07:00
Souptick Joarder a5240cbde2 fs: cifs: Adding new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite
handler.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-04-17 14:44:35 -05:00
Gustavo A. R. Silva 0d568cd34e cifs: smb2ops: Fix NULL check in smb2_query_symlink
The current code null checks variable err_buf, which is always null
when it is checked, hence utf16_path is free'd and the function
returns -ENOENT everytime it is called, making it impossible for the
execution path to reach the following code:

err_buf = err_iov.iov_base;

Fix this by null checking err_iov.iov_base instead of err_buf. Also,
notice that err_buf no longer needs to be initialized to NULL.

Addresses-Coverity-ID: 1467876 ("Logically dead code")
Fixes: 2d636199e400 ("cifs: Change SMB2_open to return an iov for the error parameter")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-04-17 14:44:30 -05:00
Tyler Hicks e86281e700 eCryptfs: don't pass up plaintext names when using filename encryption
Both ecryptfs_filldir() and ecryptfs_readlink_lower() use
ecryptfs_decode_and_decrypt_filename() to translate lower filenames to
upper filenames. The function correctly passes up lower filenames,
unchanged, when filename encryption isn't in use. However, it was also
passing up lower filenames when the filename wasn't encrypted or
when decryption failed. Since 88ae4ab980, eCryptfs refuses to lookup
lower plaintext names when filename encryption is enabled so this
resulted in a situation where userspace would see lower plaintext
filenames in calls to getdents(2) but then not be able to lookup those
filenames.

An example of this can be seen when enabling filename encryption on an
eCryptfs mount at the root directory of an Ext4 filesystem:

$ ls -1i /lower
12 ECRYPTFS_FNEK_ENCRYPTED.FWYZD8TcW.5FV-TKTEYOHsheiHX9a-w.NURCCYIMjI8pn5BDB9-h3fXwrE--
11 lost+found
$ ls -1i /upper
ls: cannot access '/upper/lost+found': No such file or directory
 ? lost+found
12 test

With this change, the lower lost+found dentry is ignored:

$ ls -1i /lower
12 ECRYPTFS_FNEK_ENCRYPTED.FWYZD8TcW.5FV-TKTEYOHsheiHX9a-w.NURCCYIMjI8pn5BDB9-h3fXwrE--
11 lost+found
$ ls -1i /upper
12 test

Additionally, some potentially noisy error/info messages in the related
code paths are turned into debug messages so that the logs can't be
easily filled.

Fixes: 88ae4ab980 ("ecryptfs_lookup(): try either only encrypted or plaintext name")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2018-04-16 18:51:22 +00:00
Souptick Joarder 0685693811 fs: ext2: Adding new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite,
pfn_mkwrite and fault handler.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-16 09:52:24 +02:00
Chengguang Xu 4f34a5130a isofs: fix potential memory leak in mount option parsing
When specifying string type mount option (e.g., iocharset)
several times in a mount, current option parsing may
cause memory leak. Hence, call kfree for previous one
in this case. Meanwhile, check memory allocation result
for it.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-16 09:47:41 +02:00
Yan, Zheng ffdeec7aa4 ceph: always update atime/mtime/ctime for new inode
For new inode, atime/mtime/ctime are uninitialized.  Don't compare
against them.

Cc: stable@kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-16 09:38:40 +02:00
Tetsuo Handa 8e04944f0e mm,vmscan: Allow preallocating memory for register_shrinker().
syzbot is catching so many bugs triggered by commit 9ee332d99e
("sget(): handle failures of register_shrinker()"). That commit expected
that calling kill_sb() from deactivate_locked_super() without successful
fill_super() is safe, but the reality was different; some callers assign
attributes which are needed for kill_sb() after sget() succeeds.

For example, [1] is a report where sb->s_mode (which seems to be either
FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not
assigned unless sget() succeeds. But it does not worth complicate sget()
so that register_shrinker() failure path can safely call
kill_block_super() via kill_sb(). Making alloc_super() fail if memory
allocation for register_shrinker() failed is much simpler. Let's avoid
calling deactivate_locked_super() from sget_userns() by preallocating
memory for the shrinker and making register_shrinker() in sget_userns()
never fail.

[1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+5a170e19c963a2e0df79@syzkaller.appspotmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-16 02:06:47 -04:00
Al Viro 659038428c orangefs_kill_sb(): deal with allocation failures
orangefs_fill_sb() might've failed to allocate ORANGEFS_SB(s); don't
oops in that case.

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15 23:49:12 -04:00
Al Viro c66b23c284 jffs2_kill_sb(): deal with failed allocations
jffs2_fill_super() might fail to allocate jffs2_sb_info;
jffs2_kill_sb() must survive that.

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15 23:49:05 -04:00
Linus Torvalds e37563bb6c for-4.17-part2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrTQ4sACgkQxWXV+ddt
 WDti3Q/+MAeqsLTjvre2RQ3ka5hNyCuVftUIBmcP3YfJbt+xZYQyaewW4Xkfi3cm
 cbJE+zehzf5ag+RJhxk3OwFTvNfLGIO9asWs3b08NGUi6VzwL0/8B/iOdZPuHSAV
 TrecQIBE2Tp+xax9cQEnxav34D4dUtXNaDweGjp1MIIUkDneQP/I0vlTu7vafBgX
 UVxP6riL/MCs7sjTHGIPs0lv8L/fgdmo+dk5SnNuIPTOcFTQXgVrtHjw9IvbKWd4
 aq+sbNWoSrhXUfllbFg/wZqDe9tWn9E2f6m/H0ThSoNdxusSVgacOjFRYh20NKLW
 WGB8Amd/ItGtJwJ1CIypa7VX2U11UAi0XT7BeiK82rUNEJ6moRqFOXG861gRLoTZ
 SpH8uWO+e+CogfXob1KCndn5lot4AM2ZTkCqfrjpM35Nul72PZdne0CxNlmiRupY
 Fdt5GB+sg8plcMaRiYr++BbbHP5tggX1MrhLGEbx2XBs2eRdn+2Lv2I1Ig/U4NUb
 Vf+xk/tFLKGOTSZlbv7SV7ekXxG/3+7gAuL7A1XMETZCwBF4L3hwyW7CgkEBb3PC
 TqX8BwMaRpyp/FgW/QL6edjXZ3a64VaHIfqRPNks3lWWcCHVbzyiVPfEjx+AEJ/0
 abx6jXTJhLUBPuPxEgb5rsSv62RoxPoYHqCErrG95XwnZ8rSCts=
 =I0y0
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull more btrfs updates from David Sterba:
 "We have queued a few more fixes (error handling, log replay,
  softlockup) and the rest is SPDX updates that touche almost all files
  so the diffstat is long"

* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Only check first key for committed tree blocks
  btrfs: add SPDX header to Kconfig
  btrfs: replace GPL boilerplate by SPDX -- sources
  btrfs: replace GPL boilerplate by SPDX -- headers
  Btrfs: fix loss of prealloc extents past i_size after fsync log replay
  Btrfs: clean up resources during umount after trans is aborted
  btrfs: Fix possible softlock on single core machines
  Btrfs: bail out on error during replay_dir_deletes
  Btrfs: fix NULL pointer dereference in log_dir_items
2018-04-15 18:08:35 -07:00
Linus Torvalds 09c9b0eaa0 SMB3 fixes, a few for stable, and some important cleanup work from Ronnie of the smb3 transport code
-----BEGIN PGP SIGNATURE-----
 
 iQGwBAABCAAaBQJa0mDhExxzbWZyZW5jaEBnbWFpbC5jb20ACgkQiiy9cAdyT1HA
 EQv+KxZAFONG/YGBIdoyJevnSobzZSTYgsrEmox28rIvWSFr6Xkt9K6WN4lki+Td
 nacByZBhyeKPp7CxdgjmiNunbGfBZJsWk5LwtgdE2jOLjM9CFq8Z7m2qmkXQ4IkJ
 EoH+NWpCUvEe9nj8oaTdSelR2OHH63SG5E5dK29eOB59ixcVe7qKUG29SpJks9qd
 mCdZyhYyuzh469CKGFeQ8qfWopBH+QFBs4SrpQ9d4liNtZ3W8zTsQ3ezX069AGlU
 Ef2xVTT1DtlZ4/6SiXBKlXGH3jjUg67vEGHzT3ZJwfLVNhTLEpKM/hsPiduSnjS8
 JqdVRdFLc2lQzr4HknhNR9yuE5wT5tZM4iZOjNDwT6Sef2Mt0QIBuYI6kb4mv6Qg
 aCVv7uXUMdvSrTk0mYiO1dXYozGUC+uFhJCuaiLax583o3ihp+/INDosQyqTbavZ
 amRBZpWeNysSJwwJu21RdJiM74+fasR/CDy11U94ssxplg1qE72M8DShK7kXtUIy
 g9ZW
 =Ah5z
 -----END PGP SIGNATURE-----

Merge tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "SMB3 fixes, a few for stable, and some important cleanup work from
  Ronnie of the smb3 transport code"

* tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: change validate_buf to validate_iov
  cifs: remove rfc1002 hardcoded constants from cifs_discard_remaining_data()
  cifs: Change SMB2_open to return an iov for the error parameter
  cifs: add resp_buf_size to the mid_q_entry structure
  smb3.11: replace a 4 with server->vals->header_preamble_size
  cifs: replace a 4 with server->vals->header_preamble_size
  cifs: add pdu_size to the TCP_Server_Info structure
  SMB311: Improve checking of negotiate security contexts
  SMB3: Fix length checking of SMB3.11 negotiate request
  CIFS: add ONCE flag for cifs_dbg type
  cifs: Use ULL suffix for 64-bit constant
  SMB3: Log at least once if tree connect fails during reconnect
  cifs: smb2pdu: Fix potential NULL pointer dereference
2018-04-15 18:06:22 -07:00