Commit Graph

44550 Commits

Author SHA1 Message Date
Jaegeuk Kim 1beba1b3a9 f2fs: use percpu_counter for # of dirty pages in inode
This patch adds percpu_counter for # of dirty pages in inode.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18 13:57:28 -07:00
Jaegeuk Kim 523be8a6b3 f2fs: use percpu_counter for page counters
This patch substitutes percpu_counter for atomic_counter when counting
various types of pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18 13:57:27 -07:00
Jaegeuk Kim f573018491 f2fs: use bio count instead of F2FS_WRITEBACK page count
This can reduce page counting overhead.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-18 13:57:25 -07:00
Linus Torvalds 9e17632c0a Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs cleanups from Al Viro:
 "Assorted cleanups and fixes all over the place"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  coredump: only charge written data against RLIMIT_CORE
  coredump: get rid of coredump_params->written
  ecryptfs_lookup(): try either only encrypted or plaintext name
  ecryptfs: avoid multiple aliases for directories
  bpf: reject invalid names right in ->lookup()
  __d_alloc(): treat NULL name as QSTR("/", 1)
  mtd: switch ubi_open_volume_path() to vfs_stat()
  mtd: switch open_mtd_by_chdev() to use of vfs_stat()
2016-05-18 11:51:59 -07:00
Linus Torvalds 69370471d0 Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter cleanups from Al Viro.

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fold checks into iterate_and_advance()
  rw_verify_area(): saner calling conventions
  aio: remove a pointless assignment
2016-05-18 11:46:23 -07:00
Linus Torvalds e34df3344d Merge branch 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel lookup fixups from Al Viro:
 "Fix for xfs parallel readdir (turns out the cxfs exposure was not
  enough to catch all problems), and a reversion of btrfs back to
  ->iterate() until the fs/btrfs/delayed-inode.c gets fixed"

* 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xfs: concurrent readdir hangs on data buffer locks
  Revert "btrfs: switch to ->iterate_shared()"
2016-05-18 10:28:45 -07:00
Dave Chinner 9f54180109 xfs: concurrent readdir hangs on data buffer locks
There's a three-process deadlock involving shared/exclusive barriers
and inverted lock orders in the directory readdir implementation.
It's a pre-existing problem with lock ordering, exposed by the
VFS parallelisation code.

process 1               process 2               process 3
---------               ---------               ---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
       ilock(shared)
       map, lock and read buffer
       iunlock(shared)
       process entries in buffer
       .....
                                                readdir
                                                iolock(shared)
                                                  get_leaf_dents
                                                    iterate entries
                                                      ilock(shared)
                                                      map, lock buffer
                                                      <blocks>
                        finish ->iterate_shared
                        file_accessed()
                          ->update_time
                            start transaction
                            ilock(excl)
                            <blocks>
        .....
        finishes processing buffer
        get next buffer
          ilock(shared)
          <blocks>

And that's the deadlock.

Fix this by dropping the current buffer lock in process 1 before
trying to map the next buffer. This means we keep the lock order of
ilock -> buffer lock intact and hence will allow process 3 to make
progress and drop it's ilock(shared) once it is done.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-18 13:20:21 -04:00
Al Viro fe742fd4f9 Revert "btrfs: switch to ->iterate_shared()"
This reverts commit 972b241f84.
Quoth Chris:
	didn't take the delayed inode stuff into account
	it got an rbtree of items and it pulls things out
	so in shared mode, its hugely racey
	sorry, lets revert and fix it for real inside of btrfs

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-18 13:19:17 -04:00
Linus Torvalds 442c9ac989 Merge branch 'sendmsg.cifs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull cifs iovec cleanups from Al Viro.

* 'sendmsg.cifs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  cifs: don't bother with kmap on read_pages side
  cifs_readv_receive: use cifs_read_from_socket()
  cifs: no need to wank with copying and advancing iovec on recvmsg side either
  cifs: quit playing games with draining iovecs
  cifs: merge the hash calculation helpers
2016-05-18 10:17:56 -07:00
Linus Torvalds ba5a2655c2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull remaining vfs xattr work from Al Viro:
 "The rest of work.xattr (non-cifs conversions)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  btrfs: Switch to generic xattr handlers
  ubifs: Switch to generic xattr handlers
  jfs: Switch to generic xattr handlers
  jfs: Clean up xattr name mapping
  gfs2: Switch to generic xattr handlers
  ceph: kill __ceph_removexattr()
  ceph: Switch to generic xattr handlers
  ceph: Get rid of d_find_alias in ceph_set_acl
2016-05-18 10:08:45 -07:00
Linus Torvalds 8908c94d6c Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
 "Various small CIFS and SMB3 fixes (including some for stable)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  remove directory incorrectly tries to set delete on close on non-empty directories
  Update cifs.ko version to 2.09
  fs/cifs: correctly to anonymous authentication for the NTLM(v2) authentication
  fs/cifs: correctly to anonymous authentication for the NTLM(v1) authentication
  fs/cifs: correctly to anonymous authentication for the LANMAN authentication
  fs/cifs: correctly to anonymous authentication via NTLMSSP
  cifs: remove any preceding delimiter from prefix_path
  cifs: Use file_dentry()
2016-05-18 10:01:47 -07:00
Linus Torvalds 16bf834805 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits)
  gitignore: fix wording
  mfd: ab8500-debugfs: fix "between" in printk
  memstick: trivial fix of spelling mistake on management
  cpupowerutils: bench: fix "average"
  treewide: Fix typos in printk
  IB/mlx4: printk fix
  pinctrl: sirf/atlas7: fix printk spelling
  serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/
  w1: comment spelling s/minmum/minimum/
  Blackfin: comment spelling s/divsor/divisor/
  metag: Fix misspellings in comments.
  ia64: Fix misspellings in comments.
  hexagon: Fix misspellings in comments.
  tools/perf: Fix misspellings in comments.
  cris: Fix misspellings in comments.
  c6x: Fix misspellings in comments.
  blackfin: Fix misspelling of 'register' in comment.
  avr32: Fix misspelling of 'definitions' in comment.
  treewide: Fix typos in printk
  Doc: treewide : Fix typos in DocBook/filesystem.xml
  ...
2016-05-17 17:05:30 -07:00
Linus Torvalds a7fd20d1c4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support SPI based w5100 devices, from Akinobu Mita.

   2) Partial Segmentation Offload, from Alexander Duyck.

   3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

   4) Allow cls_flower stats offload, from Amir Vadai.

   5) Implement bpf blinding, from Daniel Borkmann.

   6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
      actually using FASYNC these atomics are superfluous.  From Eric
      Dumazet.

   7) Run TCP more preemptibly, also from Eric Dumazet.

   8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
      driver, from Gal Pressman.

   9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

  10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

  11) Support tunneling offloads in qed, from Manish Chopra.

  12) aRFS offloading in mlx5e, from Maor Gottlieb.

  13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
      Leitner.

  14) Add MSG_EOR support to TCP, this allows controlling packet
      coalescing on application record boundaries for more accurate
      socket timestamp sampling.  From Martin KaFai Lau.

  15) Fix alignment of 64-bit netlink attributes across the board, from
      Nicolas Dichtel.

  16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

  17) Several conversions of drivers to ethtool ksettings, from Philippe
      Reynes.

  18) Checksum neutral ILA in ipv6, from Tom Herbert.

  19) Factorize all of the various marvell dsa drivers into one, from
      Vivien Didelot

  20) Add VF support to qed driver, from Yuval Mintz"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
  Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
  Revert "phy dp83867: Make rgmii parameters optional"
  r8169: default to 64-bit DMA on recent PCIe chips
  phy dp83867: Make rgmii parameters optional
  phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
  bpf: arm64: remove callee-save registers use for tmp registers
  asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
  switchdev: pass pointer to fib_info instead of copy
  net_sched: close another race condition in tcf_mirred_release()
  tipc: fix nametable publication field in nl compat
  drivers: net: Don't print unpopulated net_device name
  qed: add support for dcbx.
  ravb: Add missing free_irq() calls to ravb_close()
  qed: Remove a stray tab
  net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
  net: ethernet: fec-mpc52xx: use phydev from struct net_device
  bpf, doc: fix typo on bpf_asm descriptions
  stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
  net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
  net: ethernet: fs-enet: use phydev from struct net_device
  ...
2016-05-17 16:26:30 -07:00
Andreas Gruenbacher e0d46f5c6e btrfs: Switch to generic xattr handlers
The btrfs_{set,remove}xattr inode operations check for a read-only root
(btrfs_root_readonly) before calling into generic_{set,remove}xattr.  If
this check is moved into __btrfs_setxattr, we can get rid of
btrfs_{set,remove}xattr.

This patch applies to mainline, I would like to keep it together with
the other xattr cleanups if possible, though.  Could you please review?

Thanks,
Andreas

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-17 19:17:09 -04:00
Andreas Gruenbacher 2b88fc21ca ubifs: Switch to generic xattr handlers
Ubifs internally uses special inodes for storing xattrs. Those inodes
had NULL {get,set,remove}xattr inode operations before this change, so
xattr operations on them would fail. The super block's s_xattr field
would also apply to those special inodes. However, the inodes are not
visible outside of ubifs, and so no xattr operations will ever be
carried out on them anyway.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-17 19:16:23 -04:00
Linus Torvalds c2e7b20705 Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs cleanups from Al Viro:
 "More cleanups from Christoph"

* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: use RWF_SYNC
  fs: add RWF_DSYNC aand RWF_SYNC
  ceph: use generic_write_sync
  fs: simplify the generic_write_sync prototype
  fs: add IOCB_SYNC and IOCB_DSYNC
  direct-io: remove the offset argument to dio_complete
  direct-io: eliminate the offset argument to ->direct_IO
  xfs: eliminate the pos variable in xfs_file_dio_aio_write
  filemap: remove the pos argument to generic_file_direct_write
  filemap: remove pos variables in generic_file_read_iter
2016-05-17 15:05:23 -07:00
Chris Mason c315ef8d9d Merge branch 'for-chris-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7
Signed-off-by: Chris Mason <clm@fb.com>
2016-05-17 14:43:19 -07:00
Linus Torvalds c52b76185b Merge branch 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct path' constification update from Al Viro:
 "'struct path' is passed by reference to a bunch of Linux security
  methods; in theory, there's nothing to stop them from modifying the
  damn thing and LSM community being what it is, sooner or later some
  enterprising soul is going to decide that it's a good idea.

  Let's remove the temptation and constify all of those..."

* 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  constify ima_d_path()
  constify security_sb_pivotroot()
  constify security_path_chroot()
  constify security_path_{link,rename}
  apparmor: remove useless checks for NULL ->mnt
  constify security_path_{mkdir,mknod,symlink}
  constify security_path_{unlink,rmdir}
  apparmor: constify common_perm_...()
  apparmor: constify aa_path_link()
  apparmor: new helper - common_path_perm()
  constify chmod_common/security_path_chmod
  constify security_sb_mount()
  constify chown_common/security_path_chown
  tomoyo: constify assorted struct path *
  apparmor_path_truncate(): path->mnt is never NULL
  constify vfs_truncate()
  constify security_path_truncate()
  [apparmor] constify struct path * in a bunch of helpers
2016-05-17 14:41:03 -07:00
Linus Torvalds 681750c046 Merge branch 'for-cifs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull cifs xattr updates from Al Viro:
 "This is the remaining parts of the xattr work - the cifs bits"

* 'for-cifs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  cifs: Switch to generic xattr handlers
  cifs: Fix removexattr for os2.* xattrs
  cifs: Check for equality with ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT
  cifs: Fix xattr name checks
2016-05-17 14:35:45 -07:00
Linus Torvalds 820c687b70 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fixes from Jan Kara:
 "A fix for UDF crash on corrupted media and one UDF header fixup"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Export superblock magic to userspace
  udf: Prevent stack overflow on corrupted filesystem mount
2016-05-17 14:25:02 -07:00
Linus Torvalds dba1e98731 Some jfs logging cleanups from Joe Perches
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXOjXsAAoJEDaohF61QIxkXYYP/RlNZu5S+ekeJMCOKF/dZWpJ
 cpg/Klr7KC8TnCU7WiemXG7PNMjY5jnMJE19MyGrPcNpNeiSLp7RMnWiGiSNmVNM
 76qy9dcHFwq9SfksQ0cdCp6nhH3paLepX1OSVhgIqAzrsrgLcUbjV/wjcmBq5Ths
 sRBQtEI68EasybL+77s3G3Oqe/F2JHAHbJkf8IWrC+yo6pvXzCEauycYamOyP1Bp
 owu9OIxIDXvm2BOVzogmlZX63Mz+7yXGINKIqG/j/5mPGALL4n/AIpCqxpdpPkcr
 o8Wr4A9QIma8Z8X6ayIOF7/KgQCdORq/8biufs1KcyIIcsJkifpRfmvPNDIeAW9D
 dO/jlVzTpDRLel7eOcmLZylFcfKiRnY5Y4YxwkJkIpMIj6Luztp75WtGhBvGw2gB
 zOUvyxRp42QLEyro/sP+2LfzaagdKv3skH5nKcpr72Bs/gvetoH3/poDGlGEHZKJ
 i4Quh+62PUIsXsMt0mdBpr3QeijKmYv7kL5mroJjAgE6yM9Ihy8tmrE+ICpvK/z3
 1j3/lbDRn84oR8SdBywgbN2vw9x4VOwn8IcLChzn0w+D/jFyWzqwlzzYhp1leRMg
 FsX7nifbEQe8Ci86sZQLEKP2sVgGe8JqEYzRidVRY6ALcW7ICROmuTCzZ4jT1xU9
 98FDjKXs5QlfNjD+yuq7
 =KviC
 -----END PGP SIGNATURE-----

Merge tag 'jfs-4.7' of git://github.com/kleikamp/linux-shaggy

Pull jfs updates from Dave Kleikamp:
 "Some jfs logging cleanups from Joe Perches"

* tag 'jfs-4.7' of git://github.com/kleikamp/linux-shaggy:
  jfs: Coalesce some formats
  jfs: Remove unnecessary line continuations and terminating newlines
  jfs: Remove terminating newlines from jfs_info, jfs_warn, jfs_err uses
2016-05-17 14:15:18 -07:00
Kees Cook cb6fd68fdd exec: clarify reasoning for euid/egid reset
This section of code initially looks redundant, but is required. This
improves the comment to explain more clearly why the reset is needed.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-17 13:56:53 -07:00
Steve French 897fba1172 remove directory incorrectly tries to set delete on close on non-empty directories
Wrong return code was being returned on SMB3 rmdir of
non-empty directory.

For SMB3 (unlike for cifs), we attempt to delete a directory by
set of delete on close flag on the open. Windows clients set
this flag via a set info (SET_FILE_DISPOSITION to set this flag)
which properly checks if the directory is empty.

With this patch on smb3 mounts we correctly return
 "DIRECTORY NOT EMPTY"
on attempts to remove a non-empty directory.

Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
2016-05-17 14:09:44 -05:00
Steve French 5a4f7e8e7f Update cifs.ko version to 2.09
Signed-off-by: Steven French <steve.french@primarydata.com>
2016-05-17 14:09:34 -05:00
Stefan Metzmacher 1a967d6c9b fs/cifs: correctly to anonymous authentication for the NTLM(v2) authentication
Only server which map unknown users to guest will allow
access using a non-null NTLMv2_Response.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913

Signed-off-by: Stefan Metzmacher <metze@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-05-17 14:09:34 -05:00
Stefan Metzmacher 777f69b8d2 fs/cifs: correctly to anonymous authentication for the NTLM(v1) authentication
Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913

Signed-off-by: Stefan Metzmacher <metze@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-05-17 14:09:34 -05:00
Stefan Metzmacher fa8f3a354b fs/cifs: correctly to anonymous authentication for the LANMAN authentication
Only server which map unknown users to guest will allow
access using a non-null LMChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913

Signed-off-by: Stefan Metzmacher <metze@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-05-17 14:09:34 -05:00
Stefan Metzmacher cfda35d982 fs/cifs: correctly to anonymous authentication via NTLMSSP
See [MS-NLMP] 3.2.5.1.2 Server Receives an AUTHENTICATE_MESSAGE from the Client:

   ...
   Set NullSession to FALSE
   If (AUTHENTICATE_MESSAGE.UserNameLen == 0 AND
      AUTHENTICATE_MESSAGE.NtChallengeResponse.Length == 0 AND
      (AUTHENTICATE_MESSAGE.LmChallengeResponse == Z(1)
       OR
       AUTHENTICATE_MESSAGE.LmChallengeResponse.Length == 0))
       -- Special case: client requested anonymous authentication
       Set NullSession to TRUE
   ...

Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-05-17 14:09:33 -05:00
Sachin Prabhu 11e31647c9 cifs: remove any preceding delimiter from prefix_path
We currently do not check if any delimiter exists before the prefix
path in cifs_compose_mount_options(). Consequently when building the
devname using cifs_build_devname() we can end up with multiple
delimiters separating the UNC and the prefix path.

An issue was reported by the customer mounting a folder within a DFS
share from a Netapp server which uses McAfee antivirus. We have
narrowed down the cause to the use of double backslashes in the file
name used to open the file. This was determined to be caused because of
additional delimiters as a result of the bug.

In addition to changes in cifs_build_devname(), we also fix
cifs_parse_devname() to ignore any preceding delimiter for the prefix
path.

The problem was originally reported on RHEL 6 in RHEL bz 1252721. This
is the upstream version of the fix. The fix was confirmed by looking at
the packet capture of a DFS mount.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-05-17 14:09:33 -05:00
Goldwyn Rodrigues 1f1735cb75 cifs: Use file_dentry()
CIFS may be used as lower layer of overlayfs and accessing f_path.dentry can
lead to a crash.

Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-05-17 14:09:33 -05:00
Linus Torvalds 7f427d3a60 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel filesystem directory handling update from Al Viro.

This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.

The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.

The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).

A more detailed explanation of the process from Al Viro:
 "The xattr work starts with some acl fixes, then switches ->getxattr to
  passing inode and dentry separately.  This is the point where the
  things start to get tricky - that got merged into the very beginning
  of the -rc3-based #work.lookups, to allow untangling the
  security_d_instantiate() mess.  The xattr work itself proceeds to
  switch a lot of filesystems to generic_...xattr(); no complications
  there.

  After that initial xattr work, the series then does the following:

   - untangle security_d_instantiate()

   - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
     that thing; one such place (in overlayfs) actually yields a trivial
     conflict with overlayfs fixes later in the cycle - overlayfs ended
     up switching to a variant of lookup_one_len_unlocked() sans the
     permission checks.  I would've dropped that commit (it gets
     overridden on merge from #ovl-fixes in #for-next; proper resolution
     is to use the variant in mainline fs/overlayfs/super.c), but I
     didn't want to rebase the damn thing - it was fairly late in the
     cycle...

   - some filesystems had managed to depend on lookup/lookup exclusion
     for *fs-internal* data structures in a way that would break if we
     relaxed the VFS exclusion.  Fixing hadn't been hard, fortunately.

   - core of that series - parallel lookup machinery, replacing
     ->i_mutex with rwsem, making lookup_slow() take it only shared.  At
     that point lookups happen in parallel; lookups on the same name
     wait for the in-progress one to be done with that dentry.

     Surprisingly little code, at that - almost all of it is in
     fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
     making it use the new primitive and actually switching to locking
     shared.

   - parallel readdir stuff - first of all, we provide the exclusion on
     per-struct file basis, same as we do for read() vs lseek() for
     regular files.  That takes care of most of the needed exclusion in
     readdir/readdir; however, these guys are trickier than lookups, so
     I went for switching them one-by-one.  To do that, a new method
     '->iterate_shared()' is added and filesystems are switched to it
     as they are either confirmed to be OK with shared lock on directory
     or fixed to be OK with that.  I hope to kill the original method
     come next cycle (almost all in-tree filesystems are switched
     already), but it's still not quite finished.

   - several filesystems get switched to parallel readdir.  The
     interesting part here is dealing with dcache preseeding by readdir;
     that needs minor adjustment to be safe with directory locked only
     shared.

     Most of the filesystems doing that got switched to in those
     commits.  Important exception: NFS.  Turns out that NFS folks, with
     their, er, insistence on VFS getting the fuck out of the way of the
     Smart Filesystem Code That Knows How And What To Lock(tm) have
     grown the locking of their own.  They had their own homegrown
     rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
     is the reader there).  Of course, with VFS getting the fuck out of
     the way, as requested, the actual smarts of the smart filesystem
     code etc. had become exposed...

   - do_last/lookup_open/atomic_open cleanups.  As the result, open()
     without O_CREAT locks the directory only shared.  Including the
     ->atomic_open() case.  Backmerge from #for-linus in the middle of
     that - atomic_open() fix got brought in.

   - then comes NFS switch to saner (VFS-based ;-) locking, killing the
     homegrown "lookup and readdir are writers" kinda-sorta rwsem.  All
     exclusion for sillyunlink/lookup is done by the parallel lookups
     mechanism.  Exclusion between sillyunlink and rmdir is a real rwsem
     now - rmdir being the writer.

     Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
     now.

   - the rest of the series consists of switching a lot of filesystems
     to parallel readdir; in a lot of cases ->llseek() gets simplified
     as well.  One backmerge in there (again, #for-linus - rockridge
     fix)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
  ext4: switch to ->iterate_shared()
  hfs: switch to ->iterate_shared()
  hfsplus: switch to ->iterate_shared()
  hostfs: switch to ->iterate_shared()
  hpfs: switch to ->iterate_shared()
  hpfs: handle allocation failures in hpfs_add_pos()
  gfs2: switch to ->iterate_shared()
  f2fs: switch to ->iterate_shared()
  afs: switch to ->iterate_shared()
  befs: switch to ->iterate_shared()
  befs: constify stuff a bit
  isofs: switch to ->iterate_shared()
  get_acorn_filename(): deobfuscate a bit
  btrfs: switch to ->iterate_shared()
  logfs: no need to lock directory in lseek
  switch ecryptfs to ->iterate_shared
  9p: switch to ->iterate_shared()
  fat: switch to ->iterate_shared()
  romfs, squashfs: switch to ->iterate_shared()
  more trivial ->iterate_shared conversions
  ...
2016-05-17 11:01:31 -07:00
Linus Torvalds 9a07a79684 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "API:

   - Crypto self tests can now be disabled at boot/run time.
   - Add async support to algif_aead.

  Algorithms:

   - A large number of fixes to MPI from Nicolai Stange.
   - Performance improvement for HMAC DRBG.

  Drivers:

   - Use generic crypto engine in omap-des.
   - Merge ppc4xx-rng and crypto4xx drivers.
   - Fix lockups in sun4i-ss driver by disabling IRQs.
   - Add DMA engine support to ccp.
   - Reenable talitos hash algorithms.
   - Add support for Hisilicon SoC RNG.
   - Add basic crypto driver for the MXC SCC.

  Others:

   - Do not allocate crypto hash tfm in NORECLAIM context in ecryptfs"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (77 commits)
  crypto: qat - change the adf_ctl_stop_devices to void
  crypto: caam - fix caam_jr_alloc() ret code
  crypto: vmx - comply with ABIs that specify vrsave as reserved.
  crypto: testmgr - Add a flag allowing the self-tests to be disabled at runtime.
  crypto: ccp - constify ccp_actions structure
  crypto: marvell/cesa - Use dma_pool_zalloc
  crypto: qat - make adf_vf_isr.c dependant on IOV config
  crypto: qat - Fix typo in comments
  lib: asn1_decoder - add MODULE_LICENSE("GPL")
  crypto: omap-sham - Use dma_request_chan() for requesting DMA channel
  crypto: omap-des - Use dma_request_chan() for requesting DMA channel
  crypto: omap-aes - Use dma_request_chan() for requesting DMA channel
  crypto: omap-des - Integrate with the crypto engine framework
  crypto: s5p-sss - fix incorrect usage of scatterlists api
  crypto: s5p-sss - Fix missed interrupts when working with 8 kB blocks
  crypto: s5p-sss - Use common BIT macro
  crypto: mxc-scc - fix unwinding in mxc_scc_crypto_register()
  crypto: mxc-scc - signedness bugs in mxc_scc_ablkcipher_req_init()
  crypto: talitos - fix ahash algorithms registration
  crypto: ccp - Ensure all dependencies are specified
  ...
2016-05-17 09:33:39 -07:00
Al Viro 0e0162bb8c Merge branch 'ovl-fixes' into for-linus
Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.
2016-05-17 02:17:59 -04:00
Jaegeuk Kim 10aa97c379 f2fs: manipulate dirty file inodes when DATA_FLUSH is set
It needs to maintain dirty file inodes only if DATA_FLUSH is set.
Otherwise, let's avoid its overhead.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16 15:32:03 -07:00
Sheng Yong 087968974f f2fs: add fault injection to sysfs
This patch introduces a new struct f2fs_fault_info and a global f2fs_fault
to save fault injection status. Fault injection entries are created in
/sys/fs/f2fs/fault_injection/ during initializing f2fs module.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16 15:32:02 -07:00
Yunlei He b951a4ec16 f2fs: no need inc dirty pages under inode lock
No need inc dirty pages under inode lock

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16 15:32:01 -07:00
Chao Yu 8975bdf482 f2fs: fix incorrect error path handling in f2fs_move_rehashed_dirents
Fix two bugs in error path of f2fs_move_rehashed_dirents:
 - release dir's inode page if fail to call kmalloc
 - recover i_current_depth if fail to converting

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16 15:32:01 -07:00
Chao Yu e4103849ba f2fs: fix i_current_depth during inline dentry conversion
With below steps, we will see that dentry page becoming unaccessable later.
This is because we forget updating i_current_depth in inode during inline
dentry conversion, after that, once we failed at somewhere, it will leave
i_current_depth as 0 in non-inline directory. Then, during ->lookup, the
current_depth value makes all dentry pages in first level invisible. Fix
it.

1) mount f2fs with inline_dentry option
2) mkdir dir
3) touch 180 files named [0-179] in dir
4) touch 180 in dir (fail after inline dir conversion)
5) ll dir

ls: cannot access /mnt/f2fs/dir/0: No such file or directory
ls: cannot access /mnt/f2fs/dir/1: No such file or directory
ls: cannot access /mnt/f2fs/dir/2: No such file or directory
ls: cannot access /mnt/f2fs/dir/3: No such file or directory
ls: cannot access /mnt/f2fs/dir/4: No such file or directory

drwxr-xr-x 2 root root 4096  may 13 21:47 ./
drwxr-xr-x 3 root root 4096  may 13 21:46 ../
-????????? ? ?    ?       ?             ? 0
-????????? ? ?    ?       ?             ? 1
-????????? ? ?    ?       ?             ? 10
-????????? ? ?    ?       ?             ? 100
-????????? ? ?    ?       ?             ? 101
-????????? ? ?    ?       ?             ? 102

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16 15:32:00 -07:00
Sheng Yong 99e3e858a4 f2fs: correct return value type of f2fs_fill_super
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16 15:31:59 -07:00
Linus Torvalds 49817c3343 Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Drop the unused EFI_SYSTEM_TABLES efi.flags bit and ensure the
     ARM/arm64 EFI System Table mapping is read-only (Ard Biesheuvel)

   - Add a comment to explain that one of the code paths in the x86/pat
     code is only executed for EFI boot (Matt Fleming)

   - Improve Secure Boot status checks on arm64 and handle unexpected
     errors (Linn Crosetto)

   - Remove the global EFI memory map variable 'memmap' as the same
     information is already available in efi::memmap (Matt Fleming)

   - Add EFI Memory Attribute table support for ARM/arm64 (Ard
     Biesheuvel)

   - Add EFI GOP framebuffer support for ARM/arm64 (Ard Biesheuvel)

   - Add EFI Bootloader Control driver for storing reboot(2) data in EFI
     variables for consumption by bootloaders (Jeremy Compostella)

   - Add Core EFI capsule support (Matt Fleming)

   - Add EFI capsule char driver (Kweh, Hock Leong)

   - Unify EFI memory map code for ARM and arm64 (Ard Biesheuvel)

   - Add generic EFI support for detecting when firmware corrupts CPU
     status register bits (like IRQ flags) when performing EFI runtime
     service calls (Mark Rutland)

  ... and other misc cleanups"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  efivarfs: Make efivarfs_file_ioctl() static
  efi: Merge boolean flag arguments
  efi/capsule: Move 'capsule' to the stack in efi_capsule_supported()
  efibc: Fix excessive stack footprint warning
  efi/capsule: Make efi_capsule_pending() lockless
  efi: Remove unnecessary (and buggy) .memmap initialization from the Xen EFI driver
  efi/runtime-wrappers: Remove ARCH_EFI_IRQ_FLAGS_MASK #ifdef
  x86/efi: Enable runtime call flag checking
  arm/efi: Enable runtime call flag checking
  arm64/efi: Enable runtime call flag checking
  efi/runtime-wrappers: Detect firmware IRQ flag corruption
  efi/runtime-wrappers: Remove redundant #ifdefs
  x86/efi: Move to generic {__,}efi_call_virt()
  arm/efi: Move to generic {__,}efi_call_virt()
  arm64/efi: Move to generic {__,}efi_call_virt()
  efi/runtime-wrappers: Add {__,}efi_call_virt() templates
  efi/arm-init: Reserve rather than unmap the memory map for ARM as well
  efi: Add misc char driver interface to update EFI firmware
  x86/efi: Force EFI reboot to process pending capsules
  efi: Add 'capsule' update support
  ...
2016-05-16 13:06:27 -07:00
George Spelvin 0fed3ac866 namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESS
The hash mixing between adding the next 64 bits of name
was just a bit weak.

Replaced with a still very fast but slightly more effective
mixing function.

Signed-off-by: George Spelvin <linux@horizon.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-16 11:35:08 -07:00
David Sterba 680834ca0a Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-20160516
# Conflicts:
#	include/uapi/linux/btrfs.h
2016-05-16 15:46:29 +02:00
David Sterba 36fac9e9ff Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-20160516 2016-05-16 15:46:26 +02:00
David Sterba 5ef64a3e75 Merge branch 'cleanups-4.7' into for-chris-4.7-20160516 2016-05-16 15:46:24 +02:00
Scott Talbert 4673272f43 btrfs: fix memory leak during RAID 5/6 device replacement
A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was never
freed anywhere.

Signed-off-by: Scott Talbert <scott.talbert@hgst.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-16 10:17:58 +02:00
David S. Miller 909b27f706 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.

The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.

Conflicts:
	drivers/net/wireless/intel/iwlwifi/mvm/tx.c
	net/ipv4/ip_gre.c
	net/netfilter/nf_conntrack_core.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-15 13:32:48 -04:00
Linus Torvalds 6ba5b85fd4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Overlayfs fixes from Miklos, assorted fixes from me.

  Stable fodder of varying severity, all sat in -next for a while"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ovl: ignore permissions on underlying lookup
  vfs: add lookup_hash() helper
  vfs: rename: check backing inode being equal
  vfs: add vfs_select_inode() helper
  get_rock_ridge_filename(): handle malformed NM entries
  ecryptfs: fix handling of directory opening
  atomic_open(): fix the handling of create_error
  fix the copy vs. map logics in blk_rq_map_user_iov()
  do_splice_to(): cap the size before passing to ->splice_read()
2016-05-14 11:59:43 -07:00
Linus Torvalds 1410b74e40 Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "During v4.6-rc1 cgroup namespace support was merged.  There is an
  issue where it's impossible to tell whether a given cgroup mount point
  is bind mounted or namespaced.  Serge has been working on the issue
  but it took longer than expected to resolve, so the late pull request.

  Given that it's a completely new feature and the patches don't touch
  anything else, the risk seems acceptable.  However, if this is too
  late, an alternative is plugging new cgroup ns creation for v4.6 and
  retrying for v4.7"

* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix compile warning
  kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
  cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
  kernfs_path_from_node_locked: don't overwrite nlen
2016-05-13 16:26:46 -07:00
Daniel Wagner e55d531244 crash_dump: Add vmcore_elf32_check_arch
parse_crash_elf{32|64}_headers will check the headers via the
elf_check_arch respectively vmcore_elf64_check_arch macro.

The MIPS architecture implements those two macros differently.
In order to make the differentiation more explicit, let's introduce
an vmcore_elf32_check_arch to allow the archs to overwrite it.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Suggested-by: Maciej W. Rozycki <macro@imgtec.com>
Reviewed-by: Maciej W. Rozycki <macro@imgtec.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/12535/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2016-05-13 14:01:59 +02:00
Andreas Gruenbacher c8b6056a50 jfs: Switch to generic xattr handlers
This is mostly the same as on other filesystems except for attribute
names with an "os2." prefix: for those, the prefix is not stored on
disk, and on-attribute names without a prefix have "os2." added.

As on several other filesystems, the underlying function for
setting/removing xattrs (__jfs_setxattr) removes attributes when the
value is NULL, so the set xattr handlers will work as expected.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 22:29:18 -04:00
Andreas Gruenbacher 6c8f980c75 jfs: Clean up xattr name mapping
Instead of stripping "os2." prefixes in __jfs_setxattr, make callers
strip them, as __jfs_getxattr already does.  With that change, use the
same name mapping function in jfs_{get,set,remove}xattr.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 22:29:18 -04:00
Al Viro 1a39ba99b5 gfs2: Switch to generic xattr handlers
Switch to the generic xattr handlers and take the necessary glocks at
the layer below. The following are the new xattr "entry points"; they
are called with the glock held already in the following cases:

  gfs2_xattr_get: From SELinux, during lookups.
  gfs2_xattr_set: The glock is never held.
  gfs2_get_acl: From gfs2_create_inode -> posix_acl_create and
                gfs2_setattr -> posix_acl_chmod.
  gfs2_set_acl: From gfs2_setattr -> posix_acl_chmod.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 22:28:05 -04:00
Filipe Manana 5f9a8a51d8 Btrfs: add semaphore to synchronize direct IO writes with fsync
Due to the optimization of lockless direct IO writes (the inode's i_mutex
is not held) introduced in commit 38851cc19a ("Btrfs: implement unlocked
dio write"), we started having races between such writes with concurrent
fsync operations that use the fast fsync path. These races were addressed
in the patches titled "Btrfs: fix race between fsync and lockless direct
IO writes" and "Btrfs: fix race between fsync and direct IO writes for
prealloc extents". The races happened because the direct IO path, like
every other write path, does create extent maps followed by the
corresponding ordered extents while the fast fsync path collected first
ordered extents and then it collected extent maps. This made it possible
to log file extent items (based on the collected extent maps) without
waiting for the corresponding ordered extents to complete (get their IO
done). The two fixes mentioned before added a solution that consists of
making the direct IO path create first the ordered extents and then the
extent maps, while the fsync path attempts to collect any new ordered
extents once it collects the extent maps. This was simple and did not
require adding any synchonization primitive to any data structure (struct
btrfs_inode for example) but it makes things more fragile for future
development endeavours and adds an exceptional approach compared to the
other write paths.

This change adds a read-write semaphore to the btrfs inode structure and
makes the direct IO path create the extent maps and the ordered extents
while holding read access on that semaphore, while the fast fsync path
collects extent maps and ordered extents while holding write access on
that semaphore. The logic for direct IO write path is encapsulated in a
new helper function that is used both for cow and nocow direct IO writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:36 +01:00
Filipe Manana f78c436c39 Btrfs: fix race between block group relocation and nocow writes
Relocation of a block group waits for all existing tasks flushing
dellaloc, starting direct IO writes and any ordered extents before
starting the relocation process. However for direct IO writes that end
up doing nocow (inode either has the flag nodatacow set or the write is
against a prealloc extent) we have a short time window that allows for a
race that makes relocation proceed without waiting for the direct IO
write to complete first, resulting in data loss after the relocation
finishes. This is illustrated by the following diagram:

           CPU 1                                     CPU 2

 btrfs_relocate_block_group(bg X)

                                               direct IO write starts against
                                               an extent in block group X
                                               using nocow mode (inode has the
                                               nodatacow flag or the write is
                                               for a prealloc extent)

                                               btrfs_direct_IO()
                                                 btrfs_get_blocks_direct()
                                                   --> can_nocow_extent() returns 1

   btrfs_inc_block_group_ro(bg X)
     --> turns block group into RO mode

   btrfs_wait_ordered_roots()
     --> returns and does not know about
         the DIO write happening at CPU 2
         (the task there has not created
          yet an ordered extent)

   relocate_block_group(bg X)
     --> rc->stage == MOVE_DATA_EXTENTS

     find_next_extent()
       --> returns extent that the DIO
           write is going to write to

     relocate_data_extent()

       relocate_file_extent_cluster()

         --> reads the extent from disk into
             pages belonging to the relocation
             inode and dirties them

                                                   --> creates DIO ordered extent

                                                 btrfs_submit_direct()
                                                   --> submits bio against a location
                                                       on disk obtained from an extent
                                                       map before the relocation started

   btrfs_wait_ordered_range()
     --> writes all the pages read before
         to disk (belonging to the
         relocation inode)

   relocation finishes

                                                 bio completes and wrote new data
                                                 to the old location of the block
                                                 group

So fix this by tracking the number of nocow writers for a block group and
make sure relocation waits for that number to go down to 0 before starting
to move the extents.

The same race can also happen with buffered writes in nocow mode since the
patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
when relocating", because we are no longer flushing all delalloc which
served as a synchonization mechanism (due to page locking) and ensured
the ordered extents for nocow buffered writes were created before we
called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
mode existed before that patch (no pages are locked or used during direct
IO) and that fixed only races with direct IO writes that do cow.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:34 +01:00
Filipe Manana 0b901916a0 Btrfs: fix race between fsync and direct IO writes for prealloc extents
When we do a direct IO write against a preallocated extent (fallocate)
that does not go beyond the i_size of the inode, we do the write operation
without holding the inode's i_mutex (an optimization that landed in
commit 38851cc19a ("Btrfs: implement unlocked dio write")). This allows
for a very tiny time window where a race can happen with a concurrent
fsync using the fast code path, as the direct IO write path creates first
a new extent map (no longer flagged as a prealloc extent) and then it
creates the ordered extent, while the fast fsync path first collects
ordered extents and then it collects extent maps. This allows for the
possibility of the fast fsync path to collect the new extent map without
collecting the new ordered extent, and therefore logging an extent item
based on the extent map without waiting for the ordered extent to be
created and complete. This can result in a situation where after a log
replay we end up with an extent not marked anymore as prealloc but it was
only partially written (or not written at all), exposing random, stale or
garbage data corresponding to the unwritten pages and without any
checksums in the csum tree covering the extent's range.

This is an extension of what was done in commit de0ee0edb2 ("Btrfs: fix
race between fsync and lockless direct IO writes").

So fix this by creating first the ordered extent and then the extent
map, so that this way if the fast fsync patch collects the new extent
map it also collects the corresponding ordered extent.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13 01:59:32 +01:00
Filipe Manana 5062af35c3 Btrfs: fix number of transaction units for renames with whiteout
When we do a rename with the whiteout flag, we need to create the whiteout
inode, which in the worst case requires 5 transaction units (1 inode item,
1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
number of transaction units from 11 to 16 if the whiteout flag is set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:30 +01:00
Filipe Manana 376e5a57bf Btrfs: pin logs earlier when doing a rename exchange operation
The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
which had a race fixed by my previous patch titled "Btrfs: pin log earlier
when renaming", and so it suffers from the same problem.

We pin the logs of the affected roots after we insert the new inode
references, leaving a time window where concurrent tasks logging the
inodes can end up logging both the new and old references, resulting
in log trees that when replayed can turn the metadata into inconsistent
states. This behaviour was added to btrfs_rename() in 2009 without any
explanation about why not pinning the logs earlier, just leaving a
comment about the posibility for the race. As of today it's perfectly
safe and sane to pin the logs before we start doing any of the steps
involved in the rename operation.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:28 +01:00
Filipe Manana 86e8aa0e77 Btrfs: unpin logs if rename exchange operation fails
If rename exchange operations fail at some point after we pinned any of
the logs, we end up aborting the current transaction but never unpin the
logs, which leaves concurrent tasks that are trying to sync the logs (as
part of an fsync request from user space) blocked forever and preventing
the filesystem from being unmountable.

Fix this by safely unpinning the log.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:26 +01:00
Filipe Manana c990161888 Btrfs: fix inode leak on failure to setup whiteout inode in rename
If we failed to fully setup the whiteout inode during a rename operation
with the whiteout flag, we ended up leaking the inode, not decrementing
its link count nor removing all its items from the fs/subvol tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:23 +01:00
Dan Fuhry cdd1fedf82 btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
behavior in the renameat2() syscall. This behavior is primarily used by
overlayfs. This patch adds support for these flags to btrfs, enabling it to
be used as a fully functional upper layer for overlayfs.

RENAME_EXCHANGE support was written by Davide Italiano originally
submitted on 2 April 2015.

Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Dan Fuhry <dfuhry@datto.com>
[ remove unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:21 +01:00
Filipe Manana c4aba95454 Btrfs: pin log earlier when renaming
We were pinning the log right after the first step in the rename operation
(inserting inode ref for the new name in the destination directory)
instead of doing it before. This behaviour was introduced in 2009 for some
reason that was not mentioned neither on the changelog nor any comment,
with the drawback of a small time window where concurrent log writers can
end up logging the new inode reference for the inode we are renaming while
the rename operation is in progress (so that we can end up with a log
containing both the new and old references). As of today there's no reason
to not pin the log before that first step anymore, so just fix this.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:19 +01:00
Filipe Manana 3dc9e8f767 Btrfs: unpin log if rename operation fails
If rename operations fail at some point after we pinned the log, we end
up aborting the current transaction but never unpin the log, which leaves
concurrent tasks that are trying to sync the log (as part of an fsync
request from user space) blocked forever and preventing the filesystem
from being unmountable.

Fix this by safely unpinning the log.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:18 +01:00
Filipe Manana 9cfa3e34e2 Btrfs: don't do unnecessary delalloc flushes when relocating
Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
__start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).

These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).

So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.

This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:

        CPU 1                                       CPU 2

 btrfs_relocate_block_group(bg X)

                                           starts direct IO write,
                                           target inode currently has no
                                           ordered extents ongoing nor
                                           dirty pages (delalloc regions),
                                           therefore the root for our inode
                                           is not in the list
                                           fs_info->ordered_roots

                                           btrfs_direct_IO()
                                             __blockdev_direct_IO()
                                               btrfs_get_blocks_direct()
                                                 btrfs_lock_extent_direct()
                                                   locks range in the io tree
                                                 btrfs_new_extent_direct()
                                                   btrfs_reserve_extent()
                                                     --> extent allocated
                                                         from bg X

   btrfs_inc_block_group_ro(bg X)

   btrfs_start_delalloc_roots()
     __start_delalloc_inodes()
       --> does nothing, no dealloc ranges
           in the inode's io tree so the
           inode's root is not in the list
           fs_info->delalloc_roots

   btrfs_wait_ordered_roots()
     --> does not find the inode's root in the
         list fs_info->ordered_roots

     --> ends up not waiting for the direct IO
         write started by the task at CPU 2

   relocate_block_group(rc->stage ==
     MOVE_DATA_EXTENTS)

     prepare_to_relocate()
       btrfs_commit_transaction()

     iterates the extent tree, using its
     commit root and moves extents into new
     locations

                                                   btrfs_add_ordered_extent_dio()
                                                     --> now a ordered extent is
                                                         created and added to the
                                                         list root->ordered_extents
                                                         and the root added to the
                                                         list fs_info->ordered_roots
                                                     --> this is too late and the
                                                         task at CPU 1 already
                                                         started the relocation

     btrfs_commit_transaction()

                                                   btrfs_finish_ordered_io()
                                                     btrfs_alloc_reserved_file_extent()
                                                       --> adds delayed data reference
                                                           for the extent allocated
                                                           from bg X

   relocate_block_group(rc->stage ==
     UPDATE_DATA_PTRS)

     prepare_to_relocate()
       btrfs_commit_transaction()
         --> delayed refs are run, so an extent
             item for the allocated extent from
             bg X is added to extent tree
         --> commit roots are switched, so the
             next scan in the extent tree will
             see the extent item

     sees the extent in the extent tree

When this happens the relocation produces the following warning when it
finishes:

[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
[ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
[ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
[ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
[ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
[ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
[ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
[ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
[ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
[ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---

This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:

[ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
[ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
[ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
[ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
[ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
[ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
[ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
[ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
[ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
[ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
[ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
[ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
[ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
[ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
[ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431]  RSP <ffff8802046878f0>
[ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-05-13 01:59:16 +01:00
Filipe Manana 578def7c50 Btrfs: don't wait for unrelated IO to finish before relocation
Before the relocation process of a block group starts, it sets the block
group to readonly mode, then flushes all delalloc writes and then finally
it waits for all ordered extents to complete. This last step includes
waiting for ordered extents destinated at extents allocated in other block
groups, making us waste unecessary time.

So improve this by waiting only for ordered extents that fall into the
block group's range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-05-13 01:59:14 +01:00
Filipe Manana 3f9749f6e9 Btrfs: fix empty symlink after creating symlink and fsync parent dir
If we create a symlink, fsync its parent directory, crash/power fail and
mount the filesystem, we end up with an empty symlink, which not only is
useless it's also not allowed in linux (the man page symlink(2) is well
explicit about that).  So we just need to make sure to fully log an inode
if it's a symlink, to ensure its inline extent gets logged, ensuring the
same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc.

Example reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir /mnt/testdir
  $ sync
  $ ln -s /mnt/foo /mnt/testdir/bar
  $ xfs_io -c fsync /mnt/testdir
  <power fail>
  $ mount /dev/sdb /mnt
  $ readlink /mnt/testdir/bar
  <empty string>

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:12 +01:00
Filipe Manana 657ed1aa48 Btrfs: fix for incorrect directory entries after fsync log replay
If we move a directory to a new parent and later log that parent and don't
explicitly log the old parent, when we replay the log we can end up with
entries for the moved directory in both the old and new parent directories.
Besides being ilegal to have directories with multiple hard links in linux,
it also resulted in the leaving the inode item with a link count of 1.
A similar issue also happens if we move a regular file - after the log tree
is replayed the file has a link in both the old and new parent directories,
when it should be only at the new directory.

Sample reproducer:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/x
  $ mkdir /mnt/y
  $ touch /mnt/x/foo
  $ mkdir /mnt/y/z
  $ sync
  $ ln /mnt/x/foo /mnt/x/bar
  $ mv /mnt/y/z /mnt/x/z
  < power fail >
  $ mount /dev/sdc /mnt
  $ ls -1Ri /mnt
  /mnt:
  257 x
  258 y

  /mnt/x:
  259 bar
  259 foo
  260 z

  /mnt/x/z:

  /mnt/y:
  260 z

  /mnt/y/z:

  $ umount /dev/sdc
  $ btrfs check /dev/sdc
  Checking filesystem on /dev/sdc
  UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 260 errors 2000, link count wrong
        unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0
        unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0
  (...)

Attempting to remove the directory becomes impossible:

  $ mount /dev/sdc /mnt
  $ rmdir /mnt/y/z
  $ ls -lh /mnt/y
  ls: cannot access /mnt/y/z: No such file or directory
  total 0
  d????????? ? ? ? ?            ? z
  $ rmdir /mnt/x/z
  rmdir: failed to remove ‘/mnt/x/z’: Stale file handle
  $ ls -lh /mnt/x
  ls: cannot access /mnt/x/z: Stale file handle
  total 0
  -rw-r--r-- 2 root root 0 Apr  6 18:06 bar
  -rw-r--r-- 2 root root 0 Apr  6 18:06 foo
  d????????? ? ?    ?    ?            ? z

So make sure that on rename we set the last_unlink_trans value for our
inode, even if it's a directory, to the value of the current transaction's
ID and that if the new parent directory is logged that we fallback to a
transaction commit.

A test case for fstests is being submitted as well.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13 01:59:11 +01:00
Al Viro ae05327a00 ext4: switch to ->iterate_shared()
Note that we need relax_dir() equivalent for directories
locked shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 20:36:01 -04:00
Al Viro 9717a91b01 hfs: switch to ->iterate_shared()
exact parallel of hfsplus analogue

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 20:13:50 -04:00
Al Viro 323ee8fc54 hfsplus: switch to ->iterate_shared()
We need to protect the list of hfsplus_readdir_data against parallel
insertions (in readdir) and removals (in release).  Add a spinlock
for that.  Note that it has nothing to do with protection of
hfsplus_readdir_data->key - we have an exclusion between hfsplus_readdir()
and hfsplus_delete_cat() on directory lock and between several
hfsplus_readdir() for the same struct file on ->f_pos_lock.  The spinlock
is strictly for list changes.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 20:08:40 -04:00
Al Viro 552a9d489f hostfs: switch to ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 19:49:30 -04:00
Al Viro 7d674b3195 hpfs: switch to ->iterate_shared()
NOTE: the only reason we can do that without ->i_rdir_offs races
is that hpfs_lock() serializes everything in there anyway.  It's
not that hard to get rid of, but not as part of this series...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 19:47:13 -04:00
Al Viro e82c314755 hpfs: handle allocation failures in hpfs_add_pos()
pr_err() is nice, but we'd better propagate the error
to caller and not proceed to violate the invariants
(namely, "every file with f_pos tied to directory block
should have its address visible in per-inode array").

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 19:35:57 -04:00
Junxiao Bi c25a1e0671 ocfs2: fix posix_acl_create deadlock
Commit 702e5bc68a ("ocfs2: use generic posix ACL infrastructure")
refactored code to use posix_acl_create.  The problem with this function
is that it is not mindful of the cluster wide inode lock making it
unsuitable for use with ocfs2 inode creation with ACLs.  For example,
when used in ocfs2_mknod, this function can cause deadlock as follows.
The parent dir inode lock is taken when calling posix_acl_create ->
get_acl -> ocfs2_iop_get_acl which takes the inode lock again.  This can
cause deadlock if there is a blocked remote lock request waiting for the
lock to be downconverted.  And same deadlock happened in ocfs2_reflink.
This fix is to revert back using ocfs2_init_acl.

Fixes: 702e5bc68a ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-12 15:52:50 -07:00
Junxiao Bi 5ee0fbd50f ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang
Commit 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
introduced this issue.  ocfs2_setattr called by chmod command holds
cluster wide inode lock when calling posix_acl_chmod.  This latter
function in turn calls ocfs2_iop_get_acl and ocfs2_iop_set_acl.  These
two are also called directly from vfs layer for getfacl/setfacl commands
and therefore acquire the cluster wide inode lock.  If a remote
conversion request comes after the first inode lock in ocfs2_setattr,
OCFS2_LOCK_BLOCKED will be set.  And this will cause the second call to
inode lock from the ocfs2_iop_get_acl() to block indefinetly.

The deleted version of ocfs2_acl_chmod() calls __posix_acl_chmod() which
does not call back into the filesystem.  Therefore, we restore
ocfs2_acl_chmod(), modify it slightly for locking as needed, and use that
instead.

Fixes: 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-12 15:52:50 -07:00
Al Viro 1d1bb236bc gfs2: switch to ->iterate_shared()
protected by glock and already used without locking the directory
by gfs2_get_name()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 17:00:20 -04:00
Omar Sandoval 2c4cb04300 coredump: only charge written data against RLIMIT_CORE
Commit 9b56d54380 ("dump_skip(): dump_seek() replacement taking
coredump_params") introduced a regression with regard to RLIMIT_CORE.
Previously, when a core dump was sparse, only the data that was actually
written out would count against the limit. Now, the sparse ranges are
also included, which leads to truncated core dumps when the actual disk
usage is still well below the limit. Restore the old behavior by only
counting what gets emitted and ignoring what gets skipped.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 16:55:50 -04:00
Omar Sandoval a008393951 coredump: get rid of coredump_params->written
cprm->written is redundant with cprm->file->f_pos, so use that instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12 16:55:50 -04:00
Serge E. Hallyn 3cc9b23c81 kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
Our caller expects 0 on success, not >0.

This fixes a bug in the patch

	cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces

where /sys does not show up in mountinfo, breaking criu.

Thanks for catching this, Andrei.

Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-12 11:03:51 -04:00
David Sterba 2c1984f244 btrfs: build fixup for qgroup_account_snapshot
The macro btrfs_std_error got renamed to btrfs_handle_fs_error in an
independent branch for the same merge target (4.7). To make the code
compilable for bisectability reasons, add a temporary stub.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-12 11:05:03 +02:00
Qu Wenruo 6426c7ad69 btrfs: qgroup: Fix qgroup accounting when creating snapshot
Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.

Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
inside btrfs_commit_transaction() just be commit_cowonly_roots().

However there is a exception at create_pending_snapshot(), which will
call btrfs_qgroup_account_extents() but no any commit root switch.

In case of creating a snapshot whose parent root is itself (create a
snapshot of fs tree), it will corrupt qgroup by the following trace:
(skipped unrelated data)
======
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
======

The problem here is in first qgroup_account_extent(), the
nr_new_roots of the extent is 1, which means its reference got
increased, and qgroup increased its rfer and excl.

But at second qgroup_account_extent(), its reference got decreased, but
between these two qgroup_account_extent(), there is no switch roots.
This leads to the same nr_old_roots, and this extent just got ignored by
qgroup, which means this extent is wrongly accounted.

Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
create_pending_snapshot(), with needed preparation.

Mark: I added a check at the top of qgroup_account_snapshot() to skip this
code if qgroups are turned off. xfstest btrfs/122 exposes this problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-12 10:47:31 +02:00
Chao Yu ab47036d8f f2fs: fix deadlock when flush inline data
Below backtrace info was reported by Yunlei He:

Call Trace:
 [<ffffffff817a9395>] schedule+0x35/0x80
 [<ffffffff817abb7d>] rwsem_down_read_failed+0xed/0x130
 [<ffffffff813c12a8>] call_rwsem_down_read_failed+0x18/0x
 [<ffffffff817ab1d0>] down_read+0x20/0x30
 [<ffffffffa02a1a12>] f2fs_evict_inode+0x242/0x3a0 [f2fs]
 [<ffffffff81217057>] evict+0xc7/0x1a0
 [<ffffffff81217cd6>] iput+0x196/0x200
 [<ffffffff812134f9>] __dentry_kill+0x179/0x1e0
 [<ffffffff812136f9>] dput+0x199/0x1f0
 [<ffffffff811fe77b>] __fput+0x18b/0x220
 [<ffffffff811fe84e>] ____fput+0xe/0x10
 [<ffffffff81097427>] task_work_run+0x77/0x90
 [<ffffffff81074d62>] exit_to_usermode_loop+0x73/0xa2
 [<ffffffff81003b7a>] do_syscall_64+0xfa/0x110
 [<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25

Call Trace:
 [<ffffffff817a9395>] schedule+0x35/0x80
 [<ffffffff81216dc3>] __wait_on_freeing_inode+0xa3/0xd0
 [<ffffffff810bc300>] ? autoremove_wake_function+0x40/0x4
 [<ffffffff8121771d>] find_inode_fast+0x7d/0xb0
 [<ffffffff8121794a>] ilookup+0x6a/0xd0
 [<ffffffffa02bc740>] sync_node_pages+0x210/0x650 [f2fs]
 [<ffffffff8122e690>] ? do_fsync+0x70/0x70
 [<ffffffffa02b085e>] block_operations+0x9e/0xf0 [f2fs]
 [<ffffffff8137b795>] ? bio_endio+0x55/0x60
 [<ffffffffa02b0942>] write_checkpoint+0x92/0xba0 [f2fs]
 [<ffffffff8117da57>] ? mempool_free_slab+0x17/0x20
 [<ffffffff8117de8b>] ? mempool_free+0x2b/0x80
 [<ffffffff8122e690>] ? do_fsync+0x70/0x70
 [<ffffffffa02a53e3>] f2fs_sync_fs+0x63/0xd0 [f2fs]
 [<ffffffff8129630f>] ? ext4_sync_fs+0xbf/0x190
 [<ffffffff8122e6b0>] sync_fs_one_sb+0x20/0x30
 [<ffffffff812002e9>] iterate_supers+0xb9/0x110
 [<ffffffff8122e7b5>] sys_sync+0x55/0x90
 [<ffffffff81003ae9>] do_syscall_64+0x69/0x110
 [<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25

With following excuting serials, we will set inline_node in inode page
after inode was unlinked, result in a deadloop described as below:
1. open file
2. write file
3. unlink file
4. write file
5. close file

Thread A				Thread B
 - dput
  - iput_final
   - inode->i_state |= I_FREEING
   - evict
    - f2fs_evict_inode
					 - f2fs_sync_fs
					  - write_checkpoint
					   - block_operations
					    - f2fs_lock_all (down_write(cp_rwsem))
     - f2fs_lock_op (down_read(cp_rwsem))
					    - sync_node_pages
					     - ilookup
					      - find_inode_fast
					       - __wait_on_freeing_inode
					         (wait on I_FREEING clear)

Here, we change to set inline_node flag only for linked inode for fixing.

Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: stable@vger.kernel.org # v4.6
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11 09:56:38 -07:00
Jaegeuk Kim 3b9b10f9ce f2fs: avoid f2fs_bug_on during recovery
We don't need to use f2fs_bug_on() to treat with any error case when allocating
a block during recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11 09:56:37 -07:00
Jaegeuk Kim 652be55162 f2fs: show # of orphan inodes
This adds debug information for # of orphan inodes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11 09:56:36 -07:00
Chao Yu 6e9619499f f2fs: support in batch fzero in dnode page
This patch tries to speedup fzero_range by making space preallocation and
address removal of blocks in one dnode page as in batch operation.

In virtual machine, with zram driver:

dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=4096
time xfs_io -f /mnt/f2fs/file -c "fzero 0 4096M"

Before:
real	0m3.276s
user	0m0.008s
sys	0m3.260s

After:
real	0m1.568s
user	0m0.000s
sys	0m1.564s

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: consider ENOSPC case]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11 09:56:36 -07:00
Chao Yu 46008c6d42 f2fs: support in batch multi blocks preallocation
This patch introduces reserve_new_blocks to make preallocation of multi
blocks as in batch operation, so it can avoid lots of redundant
operation, result in better performance.

In virtual machine, with rotational device:

time fallocate -l 32G /mnt/f2fs/file

Before:
real	0m4.584s
user	0m0.000s
sys	0m4.580s

After:
real	0m0.292s
user	0m0.000s
sys	0m0.272s

In x86, with SSD:

time fallocate -l 500G $MNT/testfile

Before : 24.758 s
After  :  1.604 s

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix bugs and add performance numbers measured in x86.]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11 09:56:35 -07:00
Chao Yu 0fac558b96 f2fs: make atomic/volatile operation exclusive
atomic/volatile ioctl interfaces are exposed to user like other file
operation interface, it needs to make them getting exclusion against
to each other to avoid potential conflict among these operations
in concurrent scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11 09:56:34 -07:00
Chao Yu 7fb17fe44b f2fs: use mnt_{want,drop}_write_file in ioctl
In interfaces of ioctl, mnt_{want,drop}_write_file should be used for:
- get exclusion against file system freezing which may used by lvm
  snapshot.
- do telling filesystem that a write is about to be performed on it, and
  make sure that the writes are permitted.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11 09:56:32 -07:00
Al Viro e4d35be584 Merge branch 'ovl-fixes' into for-linus 2016-05-11 00:00:29 -04:00
Miklos Szeredi 38b78a5f18 ovl: ignore permissions on underlying lookup
Generally permission checking is not necessary when overlayfs looks up a
dentry on one of the underlying layers, since search permission on base
directory was already checked in ovl_permission().

More specifically using lookup_one_len() causes a problem when the lower
directory lacks search permission for a specific user while the upper
directory does have search permission.  Since lookups are cached, this
causes inconsistency in behavior: success depends on who did the first
lookup.

So instead use lookup_hash() which doesn't do the permission check.

Reported-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-10 23:58:18 -04:00
Miklos Szeredi 3c9fe8cdff vfs: add lookup_hash() helper
Overlayfs needs lookup without inode_permission() and already has the name
hash (in form of dentry->d_name on overlayfs dentry).  It also doesn't
support filesystems with d_op->d_hash() so basically it only needs
the actual hashed lookup from lookup_one_len_unlocked()

So add a new helper that does unlocked lookup of a hashed name.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-10 23:56:28 -04:00
Miklos Szeredi 9409e22acd vfs: rename: check backing inode being equal
If a file is renamed to a hardlink of itself POSIX specifies that rename(2)
should do nothing and return success.

This condition is checked in vfs_rename().  However it won't detect hard
links on overlayfs where these are given separate inodes on the overlayfs
layer.

Overlayfs itself detects this condition and returns success without doing
anything, but then vfs_rename() will proceed as if this was a successful
rename (detach_mounts(), d_move()).

The correct thing to do is to detect this condition before even calling
into overlayfs.  This patch does this by calling vfs_select_inode() to get
the underlying inodes.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
2016-05-10 23:55:43 -04:00
Miklos Szeredi 54d5ca871e vfs: add vfs_select_inode() helper
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
2016-05-10 23:55:01 -04:00
Al Viro e77d0c63f0 f2fs: switch to ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-10 16:41:13 -04:00
Al Viro 29884eff1f afs: switch to ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-10 14:27:44 -04:00
Al Viro e23e9aa752 befs: switch to ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-10 14:24:57 -04:00
Al Viro 22341d8f33 befs: constify stuff a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-10 14:24:06 -04:00
Vincent Stehlé 72928f2476 Btrfs: fix fspath error deallocation
Make sure to deallocate fspath with vfree() in case of error in
init_ipath().

fspath is allocated with vmalloc() in init_data_container() since
commit 425d17a290 ("Btrfs: use larger limit for translation of logical to
inode").

Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 16:22:26 +02:00
David Sterba 523567168d btrfs: make find_workspace warn if there are no workspaces
Be verbose if there are no workspaces at all, ie. the module init time
preallocation failed.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:46:16 +02:00
David Sterba e721e49dd1 btrfs: make find_workspace always succeed
With just one preallocated workspace we can guarantee forward progress
even if there's no memory available for new workspaces. The cost is more
waiting but we also get rid of several error paths.

On average, there will be several idle workspaces, so the waiting
penalty won't be so bad.

In the worst case, all cpus will compete for one workspace until there's
some memory. Attempts to allocate a new one are done each time the
waiters are woken up.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:46:13 +02:00
David Sterba f77dd0d6b2 btrfs: preallocate compression workspaces
Preallocate one workspace for each compression type so we can guarantee
forward progress in the worst case. A failure cannot be a hard error as
we might not use compression at all on the filesystem. If we can't
allocate the workspaces later when need them, it might actually
deadlock, but in such situation the system has effectively not enough
memory to operate properly.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10 09:46:11 +02:00