Commit Graph

1427 Commits

Author SHA1 Message Date
Linus Torvalds e253d98f5b Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull nowait read support from Al Viro:
 "Support IOCB_NOWAIT for buffered reads and block devices"

* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block_dev: support RFW_NOWAIT on block device nodes
  fs: support RWF_NOWAIT for buffered reads
  fs: support IOCB_NOWAIT in generic_file_buffered_read
  fs: pass iocb to do_generic_file_read
2017-09-14 19:29:55 -07:00
Linus Torvalds 0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save a
  quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Linus Torvalds 581bfce969 Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more set_fs removal from Al Viro:
 "Christoph's 'use kernel_read and friends rather than open-coding
  set_fs()' series"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: unexport vfs_readv and vfs_writev
  fs: unexport vfs_read and vfs_write
  fs: unexport __vfs_read/__vfs_write
  lustre: switch to kernel_write
  gadget/f_mass_storage: stop messing with the address limit
  mconsole: switch to kernel_read
  btrfs: switch write_buf to kernel_write
  net/9p: switch p9_fd_read to kernel_write
  mm/nommu: switch do_mmap_private to kernel_read
  serial2002: switch serial2002_tty_write to kernel_{read/write}
  fs: make the buf argument to __kernel_write a void pointer
  fs: fix kernel_write prototype
  fs: fix kernel_read prototype
  fs: move kernel_read to fs/read_write.c
  fs: move kernel_write to fs/read_write.c
  autofs4: switch autofs4_write to __kernel_write
  ashmem: switch to ->read_iter
2017-09-14 18:13:32 -07:00
Linus Torvalds c353f88f3d Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This fixes d_ino correctness in readdir, which brings overlayfs on par
  with normal filesystems regarding inode number semantics, as long as
  all layers are on the same filesystem.

  There are also some bug fixes, one in particular (random ioctl's
  shouldn't be able to modify lower layers) that touches some vfs code,
  but of course no-op for non-overlay fs"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix false positive ESTALE on lookup
  ovl: don't allow writing ioctl on lower layer
  ovl: fix relatime for directories
  vfs: add flags to d_real()
  ovl: cleanup d_real for negative
  ovl: constant d_ino for non-merge dirs
  ovl: constant d_ino across copy up
  ovl: fix readdir error value
  ovl: check snprintf return
2017-09-13 09:11:44 -07:00
Ian Kent 42f4614821 autofs: fix AT_NO_AUTOMOUNT not being honored
The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT which
is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering of an
automount by the call.  But this flag is unconditionally cleared for all
stat family system calls except statx().

stat family system calls have always triggered mount requests for the
negative dentry case in follow_automount() which is intended but prevents
the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.

In order to handle the AT_NO_AUTOMOUNT for both system calls the negative
dentry case in follow_automount() needs to be changed to return ENOENT
when the LOOKUP_AUTOMOUNT flag is clear (and the other required flags are
clear).

AFAICT this change doesn't have any noticable side effects and may, in
some use cases (although I didn't see it in testing) prevent unnecessary
callbacks to the automount daemon.

It's also possible that a stat family call has been made with a path that
is in the process of being mounted by some other process.  But stat family
calls should return the automount state of the path as it is "now" so it
shouldn't wait for mount completion.

This is the same semantic as the positive dentry case already handled.

Link: http://lkml.kernel.org/r/150216641255.11652.4204561328197919771.stgit@pluto.themaw.net
Fixes: deccf497d8 ("Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()")
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Davidlohr Bueso f808c13fd3 lib/interval_tree: fast overlap detection
Allow interval trees to quickly check for overlaps to avoid unnecesary
tree lookups in interval_tree_iter_first().

As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available.  While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().

[jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
  Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:49 -07:00
Linus Torvalds ae8ac6b7db Merge branch 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota scaling updates from Jan Kara:
 "This contains changes to make the quota subsystem more scalable.

  Reportedly it improves number of files created per second on ext4
  filesystem on fast storage by about a factor of 2x"

* 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
  quota: Add lock annotations to struct members
  quota: Reduce contention on dq_data_lock
  fs: Provide __inode_get_bytes()
  quota: Inline dquot_[re]claim_reserved_space() into callsite
  quota: Inline inode_{incr,decr}_space() into callsites
  quota: Inline functions into their callsites
  ext4: Disable dirty list tracking of dquots when journalling quotas
  quota: Allow disabling tracking of dirty dquots in a list
  quota: Remove dq_wait_unused from dquot
  quota: Move locking into clear_dquot_dirty()
  quota: Do not dirty bad dquots
  quota: Fix possible corruption of dqi_flags
  quota: Propagate ->quota_read errors from v2_read_file_info()
  quota: Fix error codes in v2_read_file_info()
  quota: Push dqio_sem down to ->read_file_info()
  quota: Push dqio_sem down to ->write_file_info()
  quota: Push dqio_sem down to ->get_next_id()
  quota: Push dqio_sem down to ->release_dqblk()
  quota: Remove locking for writing to the old quota format
  quota: Do not acquire dqio_sem for dquot overwrites in v2 format
  ...
2017-09-07 15:19:35 -07:00
Linus Torvalds a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device remova. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Linus Torvalds d34fc1adf0 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - DAX updates

 - OCFS2

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
  mm,fork: introduce MADV_WIPEONFORK
  x86,mpx: make mpx depend on x86-64 to free up VMA flag
  mm: add /proc/pid/smaps_rollup
  mm: hugetlb: clear target sub-page last when clearing huge page
  mm: oom: let oom_reap_task and exit_mmap run concurrently
  swap: choose swap device according to numa node
  mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  z3fold: use per-cpu unbuddied lists
  mm, swap: don't use VMA based swap readahead if HDD is used as swap
  mm, swap: add sysfs interface for VMA based swap readahead
  mm, swap: VMA based swap readahead
  mm, swap: fix swap readahead marking
  mm, swap: add swap readahead hit statistics
  mm/vmalloc.c: don't reinvent the wheel but use existing llist API
  mm/vmstat.c: fix wrong comment
  selftests/memfd: add memfd_create hugetlbfs selftest
  mm/shmem: add hugetlbfs support to memfd_create()
  mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
  mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
  ...
2017-09-06 20:49:49 -07:00
Jeff Layton a446d6f9ce include/linux/fs.h: remove unneeded forward definition of mm_struct
Link: http://lkml.kernel.org/r/20170525102927.6163-1-jlayton@redhat.com
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:28 -07:00
Linus Torvalds ec3604c7a5 Writeback error handling fixes for v4.14
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZrTy3AAoJEAAOaEEZVoIVaucP/ApBAj2S5wzvlV1u6l8E6ae7
 ZeEEZfcWwzRYlKjZAkTWqj9XvGpDGO5gLq4wsZK2edFAq++/MJF8ZVtN4tdZ1kUZ
 DUvRodtVOrT08Kp9wZXGT7JOFrf6U/6gMcR6p0MuWnHndeKYvlpcFi9NPT4EC9/z
 Zm9V7gtlPdSOha7eaSjUS0+vLERkxqXLBW3Av9QUOBP/lbI3lqIroGKeHDYnVdya
 2P/k5EcRRJMyJP6TqyYxmmJl+UWjJFMLvnlUDBslHnD/u3mIUhw3JLHYBjn5dZRE
 Xjq56IDPoXDUvzlBhtn/Uqyx+/wtwsNsylpmKv6K5G1JfdeuSsPVsCey+A1cqV64
 LpE5896wf9TmnmI9LNyh6vDn925xPSGBiF45UEp5f9aO7jXeY0MaEZ8g+ENqFIDK
 v4gtZdS9FhYHV+/l4qEwYMKrqSbwKEs1r1FT+f4wnABby1ojfdA57ZPlp5PV2Vjp
 szTp88Zkb7cMvZwEnWwxWofcJNmgS7uNahvnQF3IJ4ITsioEkuyYR3K4ZQMaaaV9
 wCp6G0FhXZaK3OI7o9WiDwaO2elp9Hxc8bnqKpiBbHZkY0NLh7/++5VxpeNbTHFy
 AGijQiiKGNNyYqNj93wq9jpVdMNjB0pXrHRxfav8v7MtQ+WfbEoAENF4T7hN7iXn
 UuF6eSWEC5O1UCRUk1A+
 =LLY3
 -----END PGP SIGNATURE-----

Merge tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull writeback error handling updates from Jeff Layton:
 "This pile continues the work from last cycle on better tracking
  writeback errors. In v4.13 we added some basic errseq_t infrastructure
  and converted a few filesystems to use it.

  This set continues refining that infrastructure, adds documentation,
  and converts most of the other filesystems to use it. The main
  exception at this point is the NFS client"

* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  ecryptfs: convert to file_write_and_wait in ->fsync
  mm: remove optimizations based on i_size in mapping writeback waits
  fs: convert a pile of fsync routines to errseq_t based reporting
  gfs2: convert to errseq_t based writeback error reporting for fsync
  fs: convert sync_file_range to use errseq_t based error-tracking
  mm: add file_fdatawait_range and file_write_and_wait
  fuse: convert to errseq_t based error tracking for fsync
  mm: consolidate dax / non-dax checks for writeback
  Documentation: add some docs for errseq_t
  errseq: rename __errseq_set to errseq_set
2017-09-06 14:11:03 -07:00
Linus Torvalds 066dea8c30 File locking related changes for v4.14
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZrTzyAAoJEAAOaEEZVoIVj8wP/0sOxG+7vEEpe4uj2W52aq9T
 Y39/ZLfRTLm9SqgH61lkN+IyUsvDx+IP1ws2LBhp0IDRD9m40wdILhHZRWXJcRW2
 ApEfmXF+rxnZZ6725ixX9w4Ylab2ZeGmKbzaG4wIjxfddftewZkJvFQcb1LZDfWq
 1N0SF4KWoWN6t26Du5CHmYSj/Sz6YGrWGhF22u3mNfkGL+MmuKbz+kB3W+0q2NUF
 ZjkOIH9WcRiXgSlcHPBLre2EKHqHaNgb0s4Iofd3ZEe50v1NwY/vBMefxuwRdgKS
 kpLhIKIYMawrHn2rpV0jm12qdgCYj+t2kbVIUBDn3unBP2zYA0e/oo5HNIrroVlk
 Q6aGwmW0LN60rpd5qcRuNS1p1h2id2HpxEe98dsski6T8CVnj/nvu7EIxmWM02cf
 g2HeOd7bnl3+uu7SwSTkOVb6G7Kjn+Xufiz/n11mK6fl2jvOyWZZmDqhhjWAYJ8r
 t5mQVWJdEV12+6+A1WSv9DeS3TUgdYPCF8dzDtF+JVn3WEmxYHywH36Y3hKKz+BA
 gFEhnHvlyaVvpXCr8Y5BqNSfEfvZe/YUnmVReHpgBU/U4GJ17iQYk/g2vfmPLmsN
 IZ2OGCrDUc/LfdWc4llRyQBvlGT1KujaT0tbN7xnuWcS2qWdsfX4jDtDUH9E6pvK
 TB6Sw4Ike0ixamG8N8q/
 =VPMU
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "This pile just has a few file locking fixes from Ben Coddington. There
  are a couple of cleanup patches + an attempt to bring sanity to the
  l_pid value that is reported back to userland on an F_GETLK request.

  After a few gyrations, he came up with a way for filesystems to
  communicate to the VFS layer code whether the pid should be translated
  according to the namespace or presented as-is to userland"

* tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: restore a warn for leaked locks on close
  fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
  fs/locks: Use allocation rather than the stack in fcntl_getlk()
2017-09-06 13:43:26 -07:00
Linus Torvalds 5791577963 Updates for 4.14:
- Write unmount record for a ro mount to avoid unnecessary log replay
 - Clean up orphaned inodes when mounting fs readonly
 - Resubmit inode log items when buffer writeback fails to avoid umount hang
 - Fix log recovery corruption problems when log headers wrap around the end
 - Avoid infinite loop searching for free inodes when inode counters are wrong
 - Evict inodes involved with log redo so that we don't leak them later
 - Fix a potential race between reclaim and inode cluster freeing
 - Refactor the inode joining code w.r.t. transaction rolling & deferred ops
 - Fix a bug where the log doesn't properly deal with dirty buffers that
   are about to become ordered buffers
 - Fix the extent swap code to deal with making dirty buffers ordered properly
 - Consolidate page fault handlers
 - Refactor the incore extent manipulation functions to use the iext
   abstractions instead of directly modifying with extent data
 - Disable crashy chattr +/-x until we fix it
 - Don't allow us to set S_DAX for v2 inodes
 - Various cleanups
 - Clarify some documentation
 - Fix a problem where fsync and a log commit race to send the disk a
   flush command, resulting in a small window where power fail data loss
   could occur
 - Simplify some rmap operations in the fcollapse code
 - Fix some use-after-free problems in async writeback
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZrEAQAAoJEPh/dxk0SrTrxKEP/3y8sLWdy4fUdPpVkwZteXwc
 zGyYaLrmKRc5i6abBNtLCZoRJGfRdvVyPrhQ1q3mt8H//xuURgqgFFyjj3wAdsLf
 sDejIHhdsc8/VcuLLtCW3rEYg58hJ89hW7d1InCP0tvqWmljh9svhzXebtwUvNNF
 /2fHIUXUiAxLbgjv/N2i/smlLl0zdx6C2x1TlJmfwer0UMTAnlmbFWxCqmtUZwSl
 QSuGgn1wo3dkId9aFoNwQmSCFeYcxQlpaInJEzUiVQOA4dbphXHO9Bsx0eOkpDuz
 39waaX0fld8LEfIQGmUQ995UkAwfk/asjgDSApyXdkMayNWhi0KpRl1zXgCb8BbL
 m7vYJhIfJ399+jbNPe1+htn3I16AmpvAai9MNJidFclWwqFEuQEnxZccdtTIAiRv
 XuYiq9hN2NOwlwPUYfrZxfx34fdocRyHmGVs3i7P3/qPWd5Hx6+FpQTOngciS7MN
 6xnM8PbnrLadw3ooMDEKgWsN805BQALiwzDRggoAXG1Pm2SqFnLD/dAR4c7R3nR8
 vvYlfGHnd38aMlW73IALkkGJqZy/bHPFhrbvpjXyIG6SYwCjrWrO0chM0O8MCRrF
 MIW3rM5hYIE8aCkpJ2mxvcQalmSAlSPVKlmgvSK4S1Sz4kcywxskNhch8uNkb5uy
 WUHhrJz+wBjdjrDOU3aL
 =jBdo
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS updates from Darrick Wong:
 "Here are the changes for xfs for 4.14. Most of these are cleanups and
  fixes for bad behavior, as we're mostly focusing on improving
  reliablity this cycle (read: there's potentially a lot of stuff on the
  horizon for 4.15 so better to spend a few weeks killing other bugs
  now).

  Summary:

   - Write unmount record for a ro mount to avoid unnecessary log replay

   - Clean up orphaned inodes when mounting fs readonly

   - Resubmit inode log items when buffer writeback fails to avoid
     umount hang

   - Fix log recovery corruption problems when log headers wrap around
     the end

   - Avoid infinite loop searching for free inodes when inode counters
     are wrong

   - Evict inodes involved with log redo so that we don't leak them
     later

   - Fix a potential race between reclaim and inode cluster freeing

   - Refactor the inode joining code w.r.t. transaction rolling &
     deferred ops

   - Fix a bug where the log doesn't properly deal with dirty buffers
     that are about to become ordered buffers

   - Fix the extent swap code to deal with making dirty buffers ordered
     properly

   - Consolidate page fault handlers

   - Refactor the incore extent manipulation functions to use the iext
     abstractions instead of directly modifying with extent data

   - Disable crashy chattr +/-x until we fix it

   - Don't allow us to set S_DAX for v2 inodes

   - Various cleanups

   - Clarify some documentation

   - Fix a problem where fsync and a log commit race to send the disk a
     flush command, resulting in a small window where power fail data
     loss could occur

   - Simplify some rmap operations in the fcollapse code

   - Fix some use-after-free problems in async writeback"

* tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
  xfs: use kmem_free to free return value of kmem_zalloc
  xfs: open code end_buffer_async_write in xfs_finish_page_writeback
  xfs: don't set v3 xflags for v2 inodes
  xfs: fix compiler warnings
  fsmap: fix documentation of FMR_OF_LAST
  xfs: simplify the rmap code in xfs_bmse_merge
  xfs: remove unused flags arg from xfs_file_iomap_begin_delay
  xfs: fix incorrect log_flushed on fsync
  xfs: disable per-inode DAX flag
  xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leaves
  xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extent
  xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at
  xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents
  xfs: move some code around inside xfs_bmap_shift_extents
  xfs: use xfs_iext_get_extent in xfs_bmap_first_unused
  xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert
  xfs: add a xfs_iext_update_extent helper
  xfs: consolidate the various page fault handlers
  iomap: return VM_FAULT_* codes from iomap_page_mkwrite
  xfs: relog dirty buffers during swapext bmbt owner change
  ...
2017-09-06 12:19:23 -07:00
Linus Torvalds bafb0762cb Char/Misc drivers for 4.14-rc1
Here is the big char/misc driver update for 4.14-rc1.
 
 Lots of different stuff in here, it's been an active development cycle
 for some reason.  Highlights are:
   - updated binder driver, this brings binder up to date with what
     shipped in the Android O release, plus some more changes that
     happened since then that are in the Android development trees.
   - coresight updates and fixes
   - mux driver file renames to be a bit "nicer"
   - intel_th driver updates
   - normal set of hyper-v updates and changes
   - small fpga subsystem and driver updates
   - lots of const code changes all over the driver trees
   - extcon driver updates
   - fmc driver subsystem upadates
   - w1 subsystem minor reworks and new features and drivers added
   - spmi driver updates
 
 Plus a smattering of other minor driver updates and fixes.
 
 All of these have been in linux-next with no reported issues for a
 while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWa1+Ew8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yl26wCgquufNylfhxr65NbJrovduJYzRnUAniCivXg8
 bePIh/JI5WxWoHK+wEbY
 =hYWx
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big char/misc driver update for 4.14-rc1.

  Lots of different stuff in here, it's been an active development cycle
  for some reason. Highlights are:

   - updated binder driver, this brings binder up to date with what
     shipped in the Android O release, plus some more changes that
     happened since then that are in the Android development trees.

   - coresight updates and fixes

   - mux driver file renames to be a bit "nicer"

   - intel_th driver updates

   - normal set of hyper-v updates and changes

   - small fpga subsystem and driver updates

   - lots of const code changes all over the driver trees

   - extcon driver updates

   - fmc driver subsystem upadates

   - w1 subsystem minor reworks and new features and drivers added

   - spmi driver updates

  Plus a smattering of other minor driver updates and fixes.

  All of these have been in linux-next with no reported issues for a
  while"

* tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (244 commits)
  ANDROID: binder: don't queue async transactions to thread.
  ANDROID: binder: don't enqueue death notifications to thread todo.
  ANDROID: binder: Don't BUG_ON(!spin_is_locked()).
  ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl
  ANDROID: binder: push new transactions to waiting threads.
  ANDROID: binder: remove proc waitqueue
  android: binder: Add page usage in binder stats
  android: binder: fixup crash introduced by moving buffer hdr
  drivers: w1: add hwmon temp support for w1_therm
  drivers: w1: refactor w1_slave_show to make the temp reading functionality separate
  drivers: w1: add hwmon support structures
  eeprom: idt_89hpesx: Support both ACPI and OF probing
  mcb: Fix an error handling path in 'chameleon_parse_cells()'
  MCB: add support for SC31 to mcb-lpc
  mux: make device_type const
  char: virtio: constify attribute_group structures.
  Documentation/ABI: document the nvmem sysfs files
  lkdtm: fix spelling mistake: "incremeted" -> "incremented"
  perf: cs-etm: Fix ETMv4 CONFIGR entry in perf.data file
  nvmem: include linux/err.h from header
  ...
2017-09-05 11:08:17 -07:00
Christoph Hellwig 9725d4cef6 fs: unexport vfs_readv and vfs_writev
We've got no modular users left, and any potential modular user is better
of with iov_iter based variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:16 -04:00
Christoph Hellwig eb031849d5 fs: unexport __vfs_read/__vfs_write
No modular users left, and any new ones should use kernel_read/write
or iov_iter variants instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:16 -04:00
Christoph Hellwig 73e18f7c0b fs: make the buf argument to __kernel_write a void pointer
This matches kernel_read and kernel_write and avoids any need for casts in
the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig e13ec939e9 fs: fix kernel_write prototype
Make the position an in/out argument like all the other read/write
helpers and and make the buf argument a void pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig bdd1d2d3d2 fs: fix kernel_read prototype
Use proper ssize_t and size_t types for the return value and count
argument, move the offset last and make it an in/out argument like
all other read/write helpers, and make the buf argument a void pointer
to get rid of lots of casts in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig 91f9943e1c fs: support RWF_NOWAIT for buffered reads
This is based on the old idea and code from Milosz Tanski.  With the aio
nowait code it becomes mostly trivial now.  Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:04:23 -04:00
Miklos Szeredi 495e642939 vfs: add flags to d_real()
Add a separate flags argument (in addition to the open flags) to control
the behavior of d_real().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-04 21:42:22 +02:00
Darrick J. Wong 799ea9e9c5 xfs: evict all inodes involved with log redo item
When we introduced the bmap redo log items, we set MS_ACTIVE on the
mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
from being truncated prematurely during log recovery.  This also had the
effect of putting linked inodes on the lru instead of evicting them.

Unfortunately, we neglected to find all those unreferenced lru inodes
and evict them after finishing log recovery, which means that we leak
them if anything goes wrong in the rest of xfs_mountfs, because the lru
is only cleaned out on unmount.

Therefore, evict unreferenced inodes in the lru list immediately
after clearing MS_ACTIVE.

Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: viro@ZenIV.linux.org.uk
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig ddef7ed2b5 annotate RWF_... flags
[AV: added missing annotations in syscalls.h/compat.h]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-31 17:32:38 -04:00
Greg Kroah-Hartman 9749c37275 Merge 4.13-rc7 into char-misc-next
We want the binder fix in here as well for testing and merge issues.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-28 10:19:01 +02:00
Linus Torvalds 0cc3b0ec23 Clarify (and fix) MAX_LFS_FILESIZE macros
We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
filesystems (and other IO targets) that know they are 64-bit clean and
don't have any 32-bit limits in their IO path.

It turns out that our 32-bit value for that limit was bogus.  On 32-bit,
the VM layer is limited by the page cache to only 32-bit index values,
but our logic for that was confusing and actually wrong.  We used to
define that value to

	(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)

which is actually odd in several ways: it limits the index to 31 bits,
and then it limits files so that they can't have data in that last byte
of a page that has the highest 31-bit index (ie page index 0x7fffffff).

Neither of those limitations make sense.  The index is actually the full
32 bit unsigned value, and we can use that whole full page.  So the
maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".

However, we do wan tto avoid the maximum index, because we have code
that iterates over the page indexes, and we don't want that code to
overflow.  So the maximum size of a file on a 32-bit host should
actually be one page less than the full 32-bit index.

So the actual limit is ULONG_MAX << PAGE_SHIFT.  That means that we will
not actually be using the page of that last index (ULONG_MAX), but we
can grow a file up to that limit.

The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
volume.  It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
byte less), but the actual true VM limit is one page less than 16TiB.

This was invisible until commit c2a9737f45 ("vfs,mm: fix a dead loop
in truncate_inode_pages_range()"), which started applying that
MAX_LFS_FILESIZE limit to block devices too.

NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
actually just the offset type itself (loff_t), which is signed.  But for
clarity, on 64-bit, just use the maximum signed value, and don't make
people have to count the number of 'f' characters in the hex constant.

So just use LLONG_MAX for the 64-bit case.  That was what the value had
been before too, just written out as a hex constant.

Fixes: c2a9737f45 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Reported-and-tested-by: Doug Nazar <nazard@nazar.ca>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-27 12:12:25 -07:00
Christoph Hellwig c2ee070fb0 block: cache the partition index in struct block_device
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:53 -06:00
Jan Kara f4a8116a4c fs: Provide __inode_get_bytes()
Provide helper __inode_get_bytes() which assumes i_lock is already
acquired. Quota code will need this to be able to use i_lock to protect
consistency of quota accounting information and inode usage.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:06:03 +02:00
Jeff Layton ffb959bbdf mm: remove optimizations based on i_size in mapping writeback waits
Marcelo added this i_size based optimization with a patch in 2004
(commitid is from the linux-history tree):

    commit 765dad09b4ac101a32d87af2bb793c3060497d3c
    Author: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
    Date:   Tue Sep 7 17:51:17 2004 -0700

	small wait_on_page_writeback_range() optimization

	filemap_fdatawait() calls wait_on_page_writeback_range() with -1
	as "end" parameter.  This is not needed since we know the EOF
	from the inode.  Use that instead.

There may be races here, particularly with clustered or network
filesystems. It also seems like a bit of a layering violation since
we're operating on an address_space here, not an inode.

Finally, it's also questionable whether this optimization really helps
on workloads that we care about. Should we be optimizing for writeback
vs. truncate races in a codepath where we expect to wait anyway? It
doesn't seem worth the risk.

Remove this optimization from the filemap_fdatawait codepaths. This
means that filemap_fdatawait becomes a trivial wrapper around
filemap_fdatawait_range.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-08-01 08:39:29 -04:00
Jeff Layton a823e4589e mm: add file_fdatawait_range and file_write_and_wait
Necessary now for gfs2_fsync and sync_file_range, but there will
eventually be other callers.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-31 19:12:26 -04:00
Jeff Layton 80aafd50b6 Documentation: add some docs for errseq_t
...and fix up a few comments in the code.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-29 09:01:02 -04:00
Greg Kroah-Hartman 24a81a2c25 Merge 4.13-rc2 into char-misc-next
We want the char/misc driver fixes in here as well to handle future
changes.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-23 19:58:30 -07:00
Linus Torvalds e06fdaf40a Now that IPC and other changes have landed, enable manual markings for
randstruct plugin, including the task_struct.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZbRgGAAoJEIly9N/cbcAmk2AQAIL60aQ+9RIcFAXriFhnd7Z2
 x9Jqi9JNc8NgPFXx8GhE4J4eTZ5PwcjgXBpNRWY/laBkRyoBHn24ku09YxrJjmHz
 ZSUsP+/iO9lVeEfbmU9Tnk50afkfwx6bHXBwkiVGQWHtybNVUqA19JbqkHeg8ubx
 myKLGeUv5PPCodRIcBDD0+HaAANcsqtgbDpgmWU8s+IXWwvWCE2p7PuBw7v3HHgH
 qzlPDHYQCRDw+LWsSqPaHj+9mbRO18P/ydMoZHGH4Hl3YYNtty8ZbxnraI3A7zBL
 6mLUVcZ+/l88DqHc5I05T8MmLU1yl2VRxi8/jpMAkg9wkvZ5iNAtlEKIWU6eqsvk
 vaImNOkViLKlWKF+oUD1YdG16d8Segrc6m4MGdI021tb+LoGuUbkY7Tl4ee+3dl/
 9FM+jPv95HjJnyfRNGidh2TKTa9KJkh6DYM9aUnktMFy3ca1h/LuszOiN0LTDiHt
 k5xoFURk98XslJJyXM8FPwXCXiRivrXMZbg5ixNoS4aYSBLv7Cn1M6cPnSOs7UPh
 FqdNPXLRZ+vabSxvEg5+41Ioe0SHqACQIfaSsV5BfF2rrRRdaAxK4h7DBcI6owV2
 7ziBN1nBBq2onYGbARN6ApyCqLcchsKtQfiZ0iFsvW7ZawnkVOOObDTCgPl3tdkr
 403YXzphQVzJtpT5eRV6
 =ngAW
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull structure randomization updates from Kees Cook:
 "Now that IPC and other changes have landed, enable manual markings for
  randstruct plugin, including the task_struct.

  This is the rest of what was staged in -next for the gcc-plugins, and
  comes in three patches, largest first:

   - mark "easy" structs with __randomize_layout

   - mark task_struct with an optional anonymous struct to isolate the
     __randomize_layout section

   - mark structs to opt _out_ of automated marking (which will come
     later)

  And, FWIW, this continues to pass allmodconfig (normal and patched to
  enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
  s390 for me"

* tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  randstruct: opt-out externally exposed function pointer structs
  task_struct: Allow randomized layout
  randstruct: Mark various structs for randomization
2017-07-19 08:55:18 -07:00
Logan Gunthorpe 133d55cdb2 block: order /proc/devices by major number
Presently, the order of the block devices listed in /proc/devices is not
entirely sequential. If a block device has a major number greater than
BLKDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
module 255. For example, 511 appears after 1.

This patch cleans that up and prints each major number in the correct
order, regardless of where they are stored in the hash table.

In order to do this, we introduce BLKDEV_MAJOR_MAX as an artificial
limit (chosen to be 512). It will then print all devices in major
order number from 0 to the maximum.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17 15:42:20 +02:00
Logan Gunthorpe 8a932f73e5 char_dev: order /proc/devices by major number
Presently, the order of the char devices listed in /proc/devices is not
entirely sequential. If a char device has a major number greater than
CHRDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
module 255. For example, 511 appears after 1.

This patch cleans that up and prints each major number in the correct
order, regardless of where they are stored in the hash table.

In order to do this, we introduce CHRDEV_MAJOR_MAX as an artificial
limit (chosen to be 511). It will then print all devices in major
order number from 0 to the maximum.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17 15:28:50 +02:00
Logan Gunthorpe a5d31a3f81 char_dev: extend dynamic allocation of majors into a higher range
We've run into problems with running out of dynamicly assign char
device majors particullarly on automated test systems with
all-yes-configs. Roughly 40 dynamic assignments can be made with such
kernels at this time while space is reserved for only 20.

Currently, the kernel only prints a warning when dynamic allocation
overflows the reserved region. And when this happens drivers that have
fixed assignments can randomly fail depending on the order of
initialization of other drivers. Thus, adding a new char device can cause
unexpected failures in completely unrelated parts of the kernel.

This patch solves the problem by extending dynamic major number
allocations down from 511 once the 234-254 region fills up. Fixed
majors already exist above 255 so the infrastructure to support
high number majors is already in place. The patch reserves an
additional 128 major numbers which should hopefully last us a while.

Kernels that don't require more than 20 dynamic majors assigned (which
is pretty typical) should not be affected by this change.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Link: https://lkml.org/lkml/2017/6/4/107
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17 15:09:17 +02:00
David Howells e462ec50cb VFS: Differentiate mount flags (MS_*) from internal superblock flags
Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags.  s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

 (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
     MNT_NODEV) without passing this on to the filesystem, but some
     filesystems set such flags anyway.

 (2) The ->remount_fs() methods of some filesystems adjust the *flags
     argument by setting MS_* flags in it, such as MS_NOATIME - but these
     flags are then scrubbed by do_remount_sb() (only the occupants of
     MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
     MS_I_VERSION and MS_LAZYTIME)

I'm not sure what's the best way to solve all these cases.

Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:35 +01:00
David Howells 94e92e7ac9 vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
Add an sb_rdonly() function to query the MS_RDONLY flag on sb->s_flags
preparatory to providing an SB_RDONLY flag.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Benjamin Coddington 9d5b86ac13 fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
Since commit c69899a17c "NFSv4: Update of VFS byte range lock must be
atomic with the stateid update", NFSv4 has been inserting locks in rpciod
worker context.  The result is that the file_lock's fl_nspid is the
kworker's pid instead of the original userspace pid.

The fl_nspid is only used to represent the namespaced virtual pid number
when displaying locks or returning from F_GETLK.  There's no reason to set
it for every inserted lock, since we can usually just look it up from
fl_pid.  So, instead of looking up and holding struct pid for every lock,
let's just look up the virtual pid number from fl_pid when it is needed.
That means we can remove fl_nspid entirely.

The translaton and presentation of fl_pid should handle the following four
cases:

1 - F_GETLK on a remote file with a remote lock:
    In this case, the filesystem should determine the l_pid to return here.
    Filesystems should indicate that the fl_pid represents a non-local pid
    value that should not be translated by returning an fl_pid <= 0.

2 - F_GETLK on a local file with a remote lock:
    This should be the l_pid of the lock manager process, and translated.

3 - F_GETLK on a remote file with a local lock, and
4 - F_GETLK on a local file with a local lock:
    These should be the translated l_pid of the local locking process.

Fuse was already doing the correct thing by translating the pid into the
caller's namespace.  With this change we must update fuse to translate
to init's pid namespace, so that the locks API can then translate from
init's pid namespace into the pid namespace of the caller.

With this change, the locks API will expect that if a filesystem returns
a remote pid as opposed to a local pid for F_GETLK, that remote pid will
be <= 0.  This signifies that the pid is remote, and the locks API will
forego translating that pid into the pid namespace of the local calling
process.

Finally, we convert remote filesystems to present remote pids using
negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
pid returned for F_GETLK lock requests.

Since local pids will never be larger than PID_MAX_LIMIT (which is
currently defined as <= 4 million), but pid_t is an unsigned int, we
should have plenty of room to represent remote pids with negative
numbers if we assume that remote pid numbers are similarly limited.

If this is not the case, then we run the risk of having a remote pid
returned for which there is also a corresponding local pid.  This is a
problem we have now, but this patch should reduce the chances of that
occurring, while also returning those remote pid numbers, for whatever
that may be worth.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-16 10:28:22 -04:00
Linus Torvalds 78dcf73421 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data
2017-07-15 12:00:42 -07:00
Linus Torvalds 6b1c776d3e Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This work from Amir introduces the inodes index feature, which
  provides:

   - hardlinks are not broken on copy up

   - infrastructure for overlayfs NFS export

  This also fixes constant st_ino for samefs case for lower hardlinks"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (33 commits)
  ovl: mark parent impure and restore timestamp on ovl_link_up()
  ovl: document copying layers restrictions with inodes index
  ovl: cleanup orphan index entries
  ovl: persistent overlay inode nlink for indexed inodes
  ovl: implement index dir copy up
  ovl: move copy up lock out
  ovl: rearrange copy up
  ovl: add flag for upper in ovl_entry
  ovl: use struct copy_up_ctx as function argument
  ovl: base tmpfile in workdir too
  ovl: factor out ovl_copy_up_inode() helper
  ovl: extract helper to get temp file in copy up
  ovl: defer upper dir lock to tempfile link
  ovl: hash overlay non-dir inodes by copy up origin
  ovl: cleanup bad and stale index entries on mount
  ovl: lookup index entry for copy up origin
  ovl: verify index dir matches upper dir
  ovl: verify upper root dir matches lower root dir
  ovl: introduce the inodes index dir feature
  ovl: generalize ovl_create_workdir()
  ...
2017-07-12 09:28:55 -07:00
David Howells 1d278a8790 VFS: Kill off s_options and helpers
Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves.  This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:09:21 -04:00
Dan Williams baabda2614 mm: always enable thp for dax mappings
The madvise policy for transparent huge pages is meant to avoid unwanted
allocations of transparent huge pages.  It allows a policy of disabling
the extra memory pressure and effort to arrange for a huge page when it
is not needed.

DAX by definition never incurs this overhead since it is statically
allocated.  The policy choice makes even less sense for device-dax which
tries to guarantee a given tlb-fault size.  Specifically, the following
setting:

	echo never > /sys/kernel/mm/transparent_hugepage/enabled

...violates that guarantee and silently disables all device-dax
instances with a 2M or 1G alignment.  So, let's avoid that non-obvious
side effect by force enabling thp for dax mappings in all cases.

It is worth noting that the reason this uses vma_is_dax(), and the
resulting header include changes, is that previous attempts to add a
VM_DAX flag were NAKd.

Link: http://lkml.kernel.org/r/149739531127.20686.15813586620597484283.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:31 -07:00
Linus Torvalds 088737f44b Writeback error handling fixes (pile #2)
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZXhmCAAoJEAAOaEEZVoIVpRkP/1qlYn3pq6d5Kuz84pejOmlL
 5jbkS/cOmeTxeUU4+B1xG8Lx7bAk8PfSXQOADbSJGiZd0ug95tJxplFYIGJzR/tG
 aNMHeu/BVKKhUKORGuKR9rJKtwC839L/qao+yPBo5U3mU4L73rFWX8fxFuhSJ8HR
 hvkgBu3Hx6GY59CzxJ8iJzj+B+uPSFrNweAk0+0UeWkBgTzEdiGqaXBX4cHIkq/5
 hMoCG+xnmwHKbCBsQ5js+YJT+HedZ4lvfjOqGxgElUyjJ7Bkt/IFYOp8TUiu193T
 tA4UinDjN8A7FImmIBIftrECmrAC9HIGhGZroYkMKbb8ReDR2ikE5FhKEpuAGU3a
 BXBgX2mPQuArvZWM7qeJCkxV9QJ0u/8Ykbyzo30iPrICyrzbEvIubeB/mDA034+Z
 Z0/z8C3v7826F3zP/NyaQEojUgRq30McMOIS8GMnx15HJwRsRKlzjfy9Wm4tWhl0
 t3nH1jMqAZ7068s6rfh/oCwdgGOwr5o4hW/bnlITzxbjWQUOnZIe7KBxIezZJ2rv
 OcIwd5qE8PNtpagGj5oUbnjGOTkERAgsMfvPk5tjUNt28/qUlVs2V0aeo47dlcsh
 oYr8WMOIzw98Rl7Bo70mplLrqLD6nGl0LfXOyUlT4STgLWW4ksmLVuJjWIUxcO/0
 yKWjj9wfYRQ0vSUqhsI5
 =3Z93
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling updates from Jeff Layton:
 "This pile represents the bulk of the writeback error handling fixes
  that I have for this cycle. Some of the earlier patches in this pile
  may look trivial but they are prerequisites for later patches in the
  series.

  The aim of this set is to improve how we track and report writeback
  errors to userland. Most applications that care about data integrity
  will periodically call fsync/fdatasync/msync to ensure that their
  writes have made it to the backing store.

  For a very long time, we have tracked writeback errors using two flags
  in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
  writeback error occurs (via mapping_set_error) and are cleared as a
  side-effect of filemap_check_errors (as you noted yesterday). This
  model really sucks for userland.

  Only the first task to call fsync (or msync or fdatasync) will see the
  error. Any subsequent task calling fsync on a file will get back 0
  (unless another writeback error occurs in the interim). If I have
  several tasks writing to a file and calling fsync to ensure that their
  writes got stored, then I need to have them coordinate with one
  another. That's difficult enough, but in a world of containerized
  setups that coordination may even not be possible.

  But wait...it gets worse!

  The calls to filemap_check_errors can be buried pretty far down in the
  call stack, and there are internal callers of filemap_write_and_wait
  and the like that also end up clearing those errors. Many of those
  callers ignore the error return from that function or return it to
  userland at nonsensical times (e.g. truncate() or stat()). If I get
  back -EIO on a truncate, there is no reason to think that it was
  because some previous writeback failed, and a subsequent fsync() will
  (incorrectly) return 0.

  This pile aims to do three things:

   1) ensure that when a writeback error occurs that that error will be
      reported to userland on a subsequent fsync/fdatasync/msync call,
      regardless of what internal callers are doing

   2) report writeback errors on all file descriptions that were open at
      the time that the error occurred. This is a user-visible change,
      but I think most applications are written to assume this behavior
      anyway. Those that aren't are unlikely to be hurt by it.

   3) document what filesystems should do when there is a writeback
      error. Today, there is very little consistency between them, and a
      lot of cargo-cult copying. We need to make it very clear what
      filesystems should do in this situation.

  To achieve this, the set adds a new data type (errseq_t) and then
  builds new writeback error tracking infrastructure around that. Once
  all of that is in place, we change the filesystems to use the new
  infrastructure for reporting wb errors to userland.

  Note that this is just the initial foray into cleaning up this mess.
  There is a lot of work remaining here:

   1) convert the rest of the filesystems in a similar fashion. Once the
      initial set is in, then I think most other fs' will be fairly
      simple to convert. Hopefully most of those can in via individual
      filesystem trees.

   2) convert internal waiters on writeback to use errseq_t for
      detecting errors instead of relying on the AS_* flags. I have some
      draft patches for this for ext4, but they are not quite ready for
      prime time yet.

  This was a discussion topic this year at LSF/MM too. If you're
  interested in the gory details, LWN has some good articles about this:

      https://lwn.net/Articles/718734/
      https://lwn.net/Articles/724307/"

* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  btrfs: minimal conversion to errseq_t writeback error reporting on fsync
  xfs: minimal conversion to errseq_t writeback error reporting
  ext4: use errseq_t based error handling for reporting data writeback errors
  fs: convert __generic_file_fsync to use errseq_t based reporting
  block: convert to errseq_t based writeback error tracking
  dax: set errors in mapping when writeback fails
  Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: new infrastructure for writeback error handling and reporting
  lib: add errseq_t type and infrastructure for handling it
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
  jbd2: don't clear and reset errors after waiting on writeback
  buffer: set errors in mapping at the time that the error occurs
  fs: check for writeback errors after syncing out buffers in generic_file_fsync
  buffer: use mapping_set_error instead of setting the flag
  mm: fix mapping_set_error call in me_pagecache_dirty
2017-07-07 19:38:17 -07:00
Linus Torvalds 33198c165b Writeback error handling fixes (pile #1)
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZXXLdAAoJEAAOaEEZVoIVtBIP/2BMtyDB5IVaxUuYc9LiFxCZ
 Y6W4aYEBgPhrct6epV3pnV+SXuzov9F5QZWe1P+lB3e30JHvPhO52OUIT7gSbFbv
 kKCh+p7Q1vLqaKxONPQpJI5LjlB6e6GIekrI4woA2RWVw+6cUyP0oQTVhsSsgnj/
 /GMo2pAqlhR3vnn9cWG93vl+xnrtmckpwFe0g5Jhdp/cVQBrqwxG+1W9rEsJf0nx
 RN29E7+CyxI3x2KkVdmgsMQkpkM2ooopn//1QDmS3M2sbCrJrLSTRG8LBEcs8fi8
 pQZcgW6uHXDH2I0hews1vhJRA38TeXoQfj9OZoFGQcVpbP3ZnjASKioRoQiSsHyQ
 QRDxUw6C45tjWT0HZ1GaCDMuTMs0z2/zF/E7TaOX6zB2LS/NuIluoVAMkYVyXY3a
 L39flIddnDaga1ojL+tQK5hhSl9C66++/FsFa2FZ0hLkeXA5WDLhRy0ODW3NaYg8
 89pPJDfiocEiI7ULht2Bkk88zFe+K07bQRQ5eoFtSOAxOnWGJCbxn8G8dFZZDHnO
 XZe3gscbR3DCMJ+agb4V/YOyqCHAJMA/lcnP9v7P+QnrEXSV5yrblk1Gx442xMhv
 tANcCUI3nb/b2Ma3DW3iZS/iYmhmy/baBSV3n65K9NqtkkIbnqSXxk+5RJd5eKsS
 8Y5nyu+6mlcOOxBMkmRo
 =jRrj
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling fixes from Jeff Layton:
 "The main rationale for all of these changes is to tighten up writeback
  error reporting to userland. There are many ways now that writeback
  errors can be lost, such that fsync/fdatasync/msync return 0 when
  writeback actually failed.

  This pile contains a small set of cleanups and writeback error
  handling fixes that I was able to break off from the main pile (#2).

  Two of the patches in this pile are trivial. The exceptions are the
  patch to fix up error handling in write_one_page, and the patch to
  make JFS pay attention to write_one_page errors"

* tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove call_fsync helper function
  mm: clean up error handling in write_one_page
  JFS: do not ignore return code from write_one_page()
  mm: drop "wait" parameter from write_one_page()
2017-07-07 18:39:15 -07:00
Jeff Layton 5660e13d2f fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 07:02:25 -04:00
Jeff Layton 76341cabbd jbd2: don't clear and reset errors after waiting on writeback
Resetting this flag is almost certainly racy, and will be problematic
with some coming changes.

Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
Have jbd2 call it instead of filemap_fdatawait and don't attempt to
re-set the error flag if it fails.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06 07:02:22 -04:00
David Howells ee416bcdba VFS: Make get_filesystem() return the affected filesystem
Make get_filesystem() return a pointer to the filesystem on which it just
got a ref.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06 03:27:09 -04:00
Jeff Layton 0f41074a65 fs: remove call_fsync helper function
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-05 18:44:23 -04:00
Linus Torvalds 89fbf5384d Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull read/write updates from Al Viro:
 "Christoph's fs/read_write.c series - consolidation and cleanups"

* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  nfsd: remove nfsd_vfs_read
  nfsd: use vfs_iter_read/write
  fs: implement vfs_iter_write using do_iter_write
  fs: implement vfs_iter_read using do_iter_read
  fs: move more code into do_iter_read/do_iter_write
  fs: remove __do_readv_writev
  fs: remove do_compat_readv_writev
  fs: remove do_readv_writev
2017-07-05 14:35:57 -07:00
Linus Torvalds 3bad2f1c67 Merge branch 'work.misc-set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc user access cleanups from Al Viro:
 "The first pile is assorted getting rid of cargo-culted access_ok(),
  cargo-culted set_fs() and field-by-field copyouts.

  The same description applies to a lot of stuff in other branches -
  this is just the stuff that didn't fit into a more specific topical
  branch"

* 'work.misc-set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Switch flock copyin/copyout primitives to copy_{from,to}_user()
  fs/fcntl: return -ESRCH in f_setown when pid/pgid can't be found
  fs/fcntl: f_setown, avoid undefined behaviour
  fs/fcntl: f_setown, allow returning error
  lpfc debugfs: get rid of pointless access_ok()
  adb: get rid of pointless access_ok()
  isdn: get rid of pointless access_ok()
  compat statfs: switch to copy_to_user()
  fs/locks: don't mess with the address limit in compat_fcntl64
  nfsd_readlink(): switch to vfs_get_link()
  drbd: ->sendpage() never needed set_fs()
  fs/locks: pass kernel struct flock to fcntl_getlk/setlk
  fs: locks: Fix some troubles at kernel-doc comments
2017-07-05 13:13:32 -07:00
Amir Goldstein ad0af7104d vfs: introduce inode 'inuse' lock
Added an i_state flag I_INUSE and helpers to set/clear/test the bit.

The 'inuse' lock is an 'advisory' inode lock, that can be used to extend
exclusive create protection beyond parent->i_mutex lock among cooperating
users.

This is going to be used by overlayfs to get exclusive ownership on upper
and work dirs among overlayfs mounts.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-04 22:03:16 +02:00
Linus Torvalds 9bd42183b9 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
     debug checks earlier into the bootup. This turns silent and
     sporadically deadly bugs into nice, deterministic splats. Fix some
     of the splats that triggered. (Thomas Gleixner)

   - A round of restructuring and refactoring of the load-balancing and
     topology code (Peter Zijlstra)

   - Another round of consolidating ~20 of incremental scheduler code
     history: this time in terms of wait-queue nomenclature. (I didn't
     get much feedback on these renaming patches, and we can still
     easily change any names I might have misplaced, so if anyone hates
     a new name, please holler and I'll fix it.) (Ingo Molnar)

   - sched/numa improvements, fixes and updates (Rik van Riel)

   - Another round of x86/tsc scheduler clock code improvements, in hope
     of making it more robust (Peter Zijlstra)

   - Improve NOHZ behavior (Frederic Weisbecker)

   - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
     Bristot de Oliveira)

   - Simplify and optimize the topology setup code (Lauro Ramos
     Venancio)

   - Debloat and decouple scheduler code some more (Nicolas Pitre)

   - Simplify code by making better use of llist primitives (Byungchul
     Park)

   - ... plus other fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  sched/cputime: Refactor the cputime_adjust() code
  sched/debug: Expose the number of RT/DL tasks that can migrate
  sched/numa: Hide numa_wake_affine() from UP build
  sched/fair: Remove effective_load()
  sched/numa: Implement NUMA node level wake_affine()
  sched/fair: Simplify wake_affine() for the single socket case
  sched/numa: Override part of migrate_degrades_locality() when idle balancing
  sched/rt: Move RT related code from sched/core.c to sched/rt.c
  sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
  sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
  sched/fair: Spare idle load balancing on nohz_full CPUs
  nohz: Move idle balancer registration to the idle path
  sched/loadavg: Generalize "_idle" naming to "_nohz"
  sched/core: Drop the unused try_get_task_struct() helper function
  sched/fair: WARN() and refuse to set buddy when !se->on_rq
  sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
  sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
  sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
  sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
  sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
  ...
2017-07-03 13:08:04 -07:00
Kees Cook 3859a271a0 randstruct: Mark various structs for randomization
This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-30 12:00:51 -07:00
Christoph Hellwig abbb65899a fs: implement vfs_iter_write using do_iter_write
De-dupliate some code and allow for passing the flags argument to
vfs_iter_write.  Additionally it now properly updates timestamps.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:23 -04:00
Christoph Hellwig 18e9710ee5 fs: implement vfs_iter_read using do_iter_read
De-dupliate some code and allow for passing the flags argument to
vfs_iter_read.  Additional it properly updates atime now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:23 -04:00
Jens Axboe c75b1d9421 fs: add fcntl() interface for setting/getting write life time hints
Define a set of write life time hints:

RWH_WRITE_LIFE_NOT_SET	No hint information set
RWH_WRITE_LIFE_NONE	No hints about write life time
RWH_WRITE_LIFE_SHORT	Data written has a short life time
RWH_WRITE_LIFE_MEDIUM	Data written has a medium life time
RWH_WRITE_LIFE_LONG	Data written has a long life time
RWH_WRITE_LIFE_EXTREME	Data written has an extremely long life time

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Add an fcntl interface for querying these flags, and also for
setting them as well:

F_GET_RW_HINT		Returns the read/write hint set on the
			underlying inode.

F_SET_RW_HINT		Set one of the above write hints on the
			underlying inode.

F_GET_FILE_RW_HINT	Returns the read/write hint set on the
			file descriptor.

F_SET_FILE_RW_HINT	Set one of the above write hints on the
			file descriptor.

The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.

Sample program testing/implementing basic setting/getting of write
hints is below.

Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode has a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.

This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.

/*
 * writehint.c: get or set an inode write hint
 */
 #include <stdio.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdbool.h>
 #include <inttypes.h>

 #ifndef F_GET_RW_HINT
 #define F_LINUX_SPECIFIC_BASE	1024
 #define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
 #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
 #endif

static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
	uint64_t hint;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: file <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	if (argc > 2) {
		hint = atoi(argv[2]);
		ret = fcntl(fd, F_SET_RW_HINT, &hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	ret = fcntl(fd, F_GET_RW_HINT, &hint);
	if (ret < 0) {
		perror("fcntl: F_GET_RW_HINT");
		return 3;
	}

	printf("%s: hint %s\n", argv[1], str[hint]);
	close(fd);
	return 0;
}

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:05:22 -06:00
Goldwyn Rodrigues b745fafaf7 fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT
RWF_NOWAIT informs kernel to bail out if an AIO request will block
for reasons such as file allocations, or a writeback triggered,
or would block while allocating requests while performing
direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

FMODE_AIO_NOWAIT is a flag which identifies the file opened is capable
of returning -EAGAIN if the AIO call will block. This must be set by
supporting filesystems in the ->open() call.

Filesystems xfs, btrfs and ext4 would be supported in the following patches.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues 7fc9e47224 fs: Introduce filemap_range_has_page()
filemap_range_has_page() return true if the file's mapping has
a page within the range mentioned. This function will be used
to check if a write() call will cause a writeback of previous
writes.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues fdd2f5b7de fs: Separate out kiocb flags setup based on RWF_* flags
Also added RWF_SUPPORTED to encompass all flags.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Ingo Molnar 5dd43ce2f6 sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.

Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.

So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:

  include/linux/wait_bit.h  for types and APIs
  kernel/sched/wait_bit.c   for the implementation

Update all header dependencies.

This reduces the size of wait.h rather significantly, by about 30%.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:09 +02:00
Jiri Slaby 393cc3f511 fs/fcntl: f_setown, allow returning error
Allow f_setown to return an error value. We will fail in the next patch
with EINVAL for bad input to f_setown, so tile the path for the later
patch.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 08:46:36 -04:00
Christoph Hellwig fdd050b5b3 Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base 2017-06-13 11:45:14 +02:00
Christoph Hellwig 4055351cdb fs: remove the unused error argument to dio_end_io()
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig 85787090a2 fs: switch ->s_uuid to uuid_t
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers.  More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:12 +02:00
Christoph Hellwig a75d30c772 fs/locks: pass kernel struct flock to fcntl_getlk/setlk
This will make it easier to implement a sane compat fcntl syscall.

[ jlayton: fix undeclared identifiers in 32-bit fcntl64 syscall handler ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-05-27 06:07:19 -04:00
Deepa Dinamani 572e0ca9b9 time: delete current_fs_time()
All uses of the current_fs_time() function have been replaced by other
time interfaces.

And, its use cases can be fulfilled by current_time() or ktime_get_*
variants.

Link: http://lkml.kernel.org/r/1491613030-11599-13-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Linus Torvalds 73ccb023a2 NFS client updates for Linux 4.12
Highlights include:
 
 Stable bugfixes:
 - Fix use after free in write error path
 - Use GFP_NOIO for two allocations in writeback
 - Fix a hang in OPEN related to server reboot
 - Check the result of nfs4_pnfs_ds_connect
 - Fix an rcu lock leak
 
 Features:
 - Removal of the unmaintained and unused OSD pNFS layout
 - Cleanup and removal of lots of unnecessary dprintk()s
 - Cleanup and removal of some memory failure paths now that
   GFP_NOFS is guaranteed to never fail.
 - Remove the v3-only data server limitation on pNFS/flexfiles
 
 Bugfixes:
 - RPC/RDMA connection handling bugfixes
 - Copy offload: fixes to ensure the copied data is COMMITed to disk.
 - Readdir: switch back to using the ->iterate VFS interface
 - File locking fixes from Ben Coddington
 - Various use-after-free and deadlock issues in pNFS
 - Write path bugfixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIbBAABAgAGBQJZE0KiAAoJEGcL54qWCgDy/moP93wZ+cGnN5sC+nsirqj4eUiM
 BQKKonweNQIoYRwp5B9jLTxsUMIxRasV5W3BbqEm4PUtBYXfqQ7SfLv7RboKbd4M
 RJB9PS+sjx3Fxf65mhveKziwUFLvQCQ3+we0TpUga6+7SBiGlgPKBfisk7frC0nt
 BbYBuGaWXMPxO0BnR8adNwqiGINPDSzB+8sgjiT8zkZLm4lrew2eV7TDvwVOguD+
 S2vLPGhg1F9wu8aG731MgiSNaeCgsBP6I5D29fTTD7z1DCNMQXOoHcX8k4KwwIDB
 sHRR0tVBsg+1B7WdH4y41GQ03rn3o2DHeJB5cdYGaEu4lx7CecCzt0o0dfAkNizT
 5LxbQxIHPNYMeZmP2T0oD41zQyfjKqrdRSPnXi3dPD98NwaM1Lqv+Kzb/eXzupXp
 vJ7859PQCa3KjQ1IFhwdXTmh53J1c8SzEDpzz7WX0R0saRyxeIJsm30MmdPqKu7Z
 notjsXxrTmjIhC+0vFLey1kejFDh+b0gT6UIwoMdx39VL9AM6DVL7HsrU1kEwCdf
 f8otaLcm0WoUaseF+cMtfRNGEqCMxPywwz7mEKlGiVZgyAM8VfzH+s5j6/u6ncwS
 ASwRclwwPAZN97rzl0exZxuaRwFZd7oFT1zrviPWvv+0SUPuy258J6QpolUSavgi
 Qh7f3QR65K+QX9QbO1g=
 =7Nm2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:
   - Fix use after free in write error path
   - Use GFP_NOIO for two allocations in writeback
   - Fix a hang in OPEN related to server reboot
   - Check the result of nfs4_pnfs_ds_connect
   - Fix an rcu lock leak

  Features:
   - Removal of the unmaintained and unused OSD pNFS layout
   - Cleanup and removal of lots of unnecessary dprintk()s
   - Cleanup and removal of some memory failure paths now that GFP_NOFS
     is guaranteed to never fail.
   - Remove the v3-only data server limitation on pNFS/flexfiles

  Bugfixes:
   - RPC/RDMA connection handling bugfixes
   - Copy offload: fixes to ensure the copied data is COMMITed to disk.
   - Readdir: switch back to using the ->iterate VFS interface
   - File locking fixes from Ben Coddington
   - Various use-after-free and deadlock issues in pNFS
   - Write path bugfixes"

* tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (89 commits)
  pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled
  NFSv4.1: Work around a Linux server bug...
  NFS append COMMIT after synchronous COPY
  NFSv4: Fix exclusive create attributes encoding
  NFSv4: Fix an rcu lock leak
  nfs: use kmap/kunmap directly
  NFS: always treat the invocation of nfs_getattr as cache hit when noac is on
  Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew
  NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
  pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
  pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
  pNFS: Fix a deadlock when coalescing writes and returning the layout
  pNFS: Don't clear the layout return info if there are segments to return
  pNFS: Ensure we commit the layout if it has been invalidated
  pNFS: Don't send COMMITs to the DSes if the server invalidated our layout
  pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
  pNFS: Ensure we check layout validity before marking it for return
  NFS4.1 handle interrupted slot reuse from ERR_DELAY
  NFSv4: check return value of xdr_inline_decode
  nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
  ...
2017-05-10 13:03:38 -07:00
Linus Torvalds 11fbf53d66 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted bits and pieces from various people. No common topic in this
  pile, sorry"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/affs: add rename exchange
  fs/affs: add rename2 to prepare multiple methods
  Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
  fs: don't set *REFERENCED on single use objects
  fs: compat: Remove warning from COMPATIBLE_IOCTL
  remove pointless extern of atime_need_update_rcu()
  fs: completely ignore unknown open flags
  fs: add a VALID_OPEN_FLAGS
  fs: remove _submit_bh()
  fs: constify tree_descr arrays passed to simple_fill_super()
  fs: drop duplicate header percpu-rwsem.h
  fs/affs: bugfix: Write files greater than page size on OFS
  fs/affs: bugfix: enable writes on OFS disks
  fs/affs: remove node generation check
  fs/affs: import amigaffs.h
  fs/affs: bugfix: make symbolic links work again
2017-05-09 09:12:53 -07:00
Tetsuo Handa c718a97514 fs: semove set but not checked AOP_FLAG_UNINTERRUPTIBLE flag
Commit afddba49d1 ("fs: introduce write_begin, write_end, and
perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
checked in pagecache_write_begin(), but that check was removed by
4e02ed4b4a ("fs: remove prepare_write/commit_write").

Between these two commits, commit d9414774dc ("cifs: Convert cifs to
new aops.") added a check in cifs_write_begin(), but that check was soon
removed by commit a98ee8c1c7 ("[CIFS] fix regression in
cifs_write_begin/cifs_write_end").

Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere.  Let's
remove this flag.  This patch has no functionality changes.

Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:14 -07:00
David Howells deccf497d8 Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
stat/lstat/fstatat need to pass AT_NO_AUTOMOUNT to vfs_statx() as the
pre-statx code didn't set LOOKUP_AUTOMOUNT, even though fstatat() accepted
the AT_NO_AUTOMOUNT flag.

Fixes: a528d35e8b ("statx: Add a system call to make enhanced file info available")
Reported-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-04 19:52:03 -04:00
Linus Torvalds 5133cd7518 Merge branch 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
 "The branch contains mainly a rework of fsnotify infrastructure fixing
  a shortcoming that we have waited for response to fanotify permission
  events with SRCU read lock held and when the process consuming events
  was slow to respond the kernel has stalled.

  It also contains several cleanups of unnecessary indirections in
  fsnotify framework and a bugfix from Amir fixing leakage of kernel
  internal errno to userspace"

* 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits)
  fanotify: don't expose EOPENSTALE to userspace
  fsnotify: remove a stray unlock
  fsnotify: Move ->free_mark callback to fsnotify_ops
  fsnotify: Add group pointer in fsnotify_init_mark()
  fsnotify: Drop inode_mark.c
  fsnotify: Remove fsnotify_find_{inode|vfsmount}_mark()
  fsnotify: Remove fsnotify_detach_group_marks()
  fsnotify: Rename fsnotify_clear_marks_by_group_flags()
  fsnotify: Inline fsnotify_clear_{inode|vfsmount}_mark_group()
  fsnotify: Remove fsnotify_recalc_{inode|vfsmount}_mask()
  fsnotify: Remove fsnotify_set_mark_{,ignored_}mask_locked()
  fanotify: Release SRCU lock when waiting for userspace response
  fsnotify: Pass fsnotify_iter_info into handle_event handler
  fsnotify: Provide framework for dropping SRCU lock in ->handle_event
  fsnotify: Remove special handling of mark destruction on group shutdown
  fsnotify: Detach mark from object list when last reference is dropped
  fsnotify: Move queueing of mark for destruction into fsnotify_put_mark()
  inotify: Do not drop mark reference under idr_lock
  fsnotify: Free fsnotify_mark_connector when there is no mark attached
  fsnotify: Lock object list with connector lock
  ...
2017-05-03 11:05:15 -07:00
Eric Biggers cda37124f4 fs: constify tree_descr arrays passed to simple_fill_super()
simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory.  Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Geliang Tang a0c111b49b fs: drop duplicate header percpu-rwsem.h
Drop duplicate header percpu-rwsem.h from linux/fs.h.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Benjamin Coddington 50f2112cf7 locks: Set FL_CLOSE when removing flock locks on close()
Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks.
NFS will check for this flag to ensure an unlock is sent in a following
patch.

Fuse handles flock and posix locks differently for FL_CLOSE, and so
requires a fixup to retain the existing behavior for flock.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00
Jan Kara c1844d536d fs: Remove SB_I_DYNBDI flag
Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara fca39346a5 fs: Provide infrastructure for dynamic BDIs in filesystems
Provide helper functions for setting up dynamically allocated
backing_dev_info structures for filesystems and cleaning them up on
superblock destruction.

CC: linux-mtd@lists.infradead.org
CC: linux-nfs@vger.kernel.org
CC: Petr Vandrovec <petr@vandrovec.name>
CC: linux-nilfs@vger.kernel.org
CC: cluster-devel@redhat.com
CC: osd-dev@open-osd.org
CC: codalist@coda.cs.cmu.edu
CC: linux-afs@lists.infradead.org
CC: ecryptfs@vger.kernel.org
CC: linux-cifs@vger.kernel.org
CC: ceph-devel@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: v9fs-developer@lists.sourceforge.net
CC: lustre-devel@lists.lustre.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara 08991e83b7 fsnotify: Free fsnotify_mark_connector when there is no mark attached
Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 9dd813c15b fsnotify: Move mark list head from object into dedicated structure
Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.

So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.

This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:34 +02:00
Arnd Bergmann cbfd0c1001 include/linux/fs.h: fix unsigned enum warning with gcc-4.2
With arm-linux-gcc-4.2, almost every file we build in the kernel ends up
with this warning:

  include/linux/fs.h:2648: warning: comparison of unsigned expression < 0 is always false

Later versions don't have this problem, but it's easy enough to work
around.

Link: http://lkml.kernel.org/r/20161216105634.235457-12-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
David Howells a528d35e8b statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode.  This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included.  The
following have been included:

 (1) Make the fields a consistent size on all arches and make them large.

 (2) Spare space, request flags and information flags are provided for
     future expansion.

 (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
     __s64).

 (4) Creation time: The SMB protocol carries the creation time, which could
     be exported by Samba, which will in turn help CIFS make use of
     FS-Cache as that can be used for coherency data (stx_btime).

     This is also specified in NFSv4 as a recommended attribute and could
     be exported by NFSD [Steve French].

 (5) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
     Dilger] (AT_STATX_DONT_SYNC).

 (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
     its cached attributes are up to date [Trond Myklebust]
     (AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

 (7) Data version number: Could be used by userspace NFS servers [Aneesh
     Kumar].

     Can also be used to modify fill_post_wcc() in NFSD which retrieves
     i_version directly, but has just called vfs_getattr().  It could get
     it from the kstat struct if it used vfs_xgetattr() instead.

     (There's disagreement on the exact semantics of a single field, since
     not all filesystems do this the same way).

 (8) BSD stat compatibility: Including more fields from the BSD stat such
     as creation time (st_btime) and inode generation number (st_gen)
     [Jeremy Allison, Bernd Schubert].

 (9) Inode generation number: Useful for FUSE and userspace NFS servers
     [Bernd Schubert].

     (This was asked for but later deemed unnecessary with the
     open-by-handle capability available and caused disagreement as to
     whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

     (No particular data were offered, but things like last backup
     timestamp, the data version number and the DOS archive bit would come
     into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
     filesystem can now say it doesn't support a standard stat feature if
     that isn't available, so if, for instance, inode numbers or UIDs don't
     exist or are fabricated locally...

     (This requires a separate system call - I have an fsinfo() call idea
     for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
     struct xstat [Steve French].

     (Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
     granularity of each of the times (NFSv4 time_delta) [Steve French].

     (Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
     Note that the Linux IOC flags are a mess and filesystems such as Ext4
     define flags that aren't in linux/fs.h, so translation in the kernel
     may be a necessity (or, possibly, we provide the filesystem type too).

     (Some attributes are made available in stx_attributes, but the general
     feeling was that the IOC flags were to ext[234]-specific and shouldn't
     be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
     Michael Kerrisk].

     (Deferred, probably to fsinfo.  Finding out if there's an ACL or
     seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

     (A __reserved field has been left in the statx_timestamp struct for
     this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat().  There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

 (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
     respect.

 (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
     its attributes with the server - which might require data writeback to
     occur to get the timestamps correct.

 (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
     network filesystem.  The resulting values should be considered
     approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller.  The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat().  It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data.  This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};

The defined bits in request_mask and stx_mask are:

	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution.  Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does.  The following
attributes map to FS_*_FL flags and are the same numerical value:

	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

	KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

	STATX_ATTR_AUTOMOUNT		Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

 (0) stx_dev_*, stx_blksize.

     These are local system information and are always available.

 (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
     stx_size, stx_blocks.

     These will be returned whether the caller asks for them or not.  The
     corresponding bits in stx_mask will be set to indicate whether they
     actually have valid values.

     If the caller didn't ask for them, then they may be approximated.  For
     example, NFS won't waste any time updating them from the server,
     unless as a byproduct of updating something requested.

     If the values don't actually exist for the underlying object (such as
     UID or GID on a DOS file), then the bit won't be set in the stx_mask,
     even if the caller asked for the value.  In such a case, the returned
     value will be a fabrication.

     Note that there are instances where the type might not be valid, for
     instance Windows reparse points.

 (2) stx_rdev_*.

     This will be set only if stx_mode indicates we're looking at a
     blockdev or a chardev, otherwise will be 0.

 (3) stx_btime.

     Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

	samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output.  Firstly, an NFS directory that crosses to
another FSID.  Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:26           Inode: 1703937     Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000
	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

	[root@andromeda ~]# /tmp/test-statx /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:27           Inode: 2           Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-02 20:51:15 -05:00
Linus Torvalds 94e877d0fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile two from Al Viro:

 - orangefs fix

 - series of fs/namei.c cleanups from me

 - VFS stuff coming from overlayfs tree

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  orangefs: Use RCU for destroy_inode
  vfs: use helper for calling f_op->fsync()
  mm: use helper for calling f_op->mmap()
  vfs: use helpers for calling f_op->{read,write}_iter()
  vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
  vfs: extract common parts of {compat_,}do_readv_writev()
  vfs: wrap write f_ops with file_{start,end}_write()
  vfs: deny copy_file_range() for non regular files
  vfs: deny fallocate() on directory
  vfs: create vfs helper vfs_tmpfile()
  namei.c: split unlazy_walk()
  namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
  lookup_fast(): clean up the logics around the fallback to non-rcu mode
  namei: fold unlazy_link() into its sole caller
2017-03-02 15:20:00 -08:00
Al Viro 653a7746fa Merge remote-tracking branch 'ovl/for-viro' into for-linus
Overlayfs-related series from Miklos and Amir
2017-03-02 06:41:22 -05:00
Fabian Frederick 93407472a2 fs: add i_blocksize()
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
  Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Miklos Szeredi 0eb8af4916 vfs: use helper for calling f_op->fsync()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-20 16:51:23 +01:00
Miklos Szeredi f74ac01520 mm: use helper for calling f_op->mmap()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-20 16:51:23 +01:00
Miklos Szeredi bb7462b6fd vfs: use helpers for calling f_op->{read,write}_iter()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-20 16:51:23 +01:00
Amir Goldstein bfe219d373 vfs: wrap write f_ops with file_{start,end}_write()
Before calling write f_ops, call file_start_write() instead
of sb_start_write().

Replace {sb,file}_start_write() for {copy,clone}_file_range() and
for fallocate().

Beyond correct semantics, this avoids freeze protection to sb when
operating on special inodes, such as fallocate() on a blockdev.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-07 15:05:04 +01:00
Amir Goldstein af7bd4dc13 vfs: create vfs helper vfs_tmpfile()
Factor out some common vfs bits from do_tmpfile()
to be used by overlayfs for concurrent copy up.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-07 15:05:04 +01:00
Jan Kara b1d2dc5659 block: Make blk_get_backing_dev_info() safe without open bdev
Currenly blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be call called from flusher
worker or other writeback related functions without bdev being open
which leads to crashes such as:

[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 08:20:53 -07:00
Jan Kara f44f1ab5a2 block: Unhash block device inodes on gendisk destruction
Currently, block device inodes stay around after corresponding gendisk
hash died until memory reclaim finds them and frees them. Since we will
make block device inode pin the bdi, we want to free the block device
inode as soon as the device goes away so that bdi does not stay around
unnecessarily. Furthermore we need to avoid issues when new device with
the same major,minor pair gets created since reusing the bdi structure
would be rather difficult in this case.

Unhashing block device inode on gendisk destruction nicely deals with
these problems. Once last block device inode reference is dropped (which
may be directly in del_gendisk()), the inode gets evicted. Furthermore if
the major,minor pair gets reallocated, we are guaranteed to get new
block device inode even if old block device inode is not yet evicted and
thus we avoid issues with possible reuse of bdi.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 08:18:41 -07:00
Linus Torvalds e93b1cc8a8 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, fsnotify and ext2 updates from Jan Kara:
 "Changes to locking of some quota operations from dedicated quota mutex
  to s_umount semaphore, a fsnotify fix and a simple ext2 fix"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Fix bogus warning in dquot_disable()
  fsnotify: Fix possible use-after-free in inode iteration on umount
  ext2: reject inodes with negative size
  quota: Remove dqonoff_mutex
  ocfs2: Use s_umount for quota recovery protection
  quota: Remove dqonoff_mutex from dquot_scan_active()
  ocfs2: Protect periodic quota syncing with s_umount semaphore
  quota: Use s_umount protection for quota operations
  quota: Hold s_umount in exclusive mode when enabling / disabling quotas
  fs: Provide function to get superblock with exclusive s_umount
2016-12-19 08:23:53 -08:00
Linus Torvalds 231753ef78 Merge uncontroversial parts of branch 'readlink' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull partial readlink cleanups from Miklos Szeredi.

This is the uncontroversial part of the readlink cleanup patch-set that
simplifies the default readlink handling.

Miklos and Al are still discussing the rest of the series.

* git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  vfs: make generic_readlink() static
  vfs: remove ".readlink = generic_readlink" assignments
  vfs: default to generic_readlink()
  vfs: replace calling i_op->readlink with vfs_readlink()
  proc/self: use generic_readlink
  ecryptfs: use vfs_get_link()
  bad_inode: add missing i_op initializers
2016-12-17 19:16:12 -08:00
Linus Torvalds 0110c350c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "In this pile:

   - autofs-namespace series
   - dedupe stuff
   - more struct path constification"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  ocfs2: charge quota for reflinked blocks
  ocfs2: fix bad pointer cast
  ocfs2: always unlock when completing dio writes
  ocfs2: don't eat io errors during _dio_end_io_write
  ocfs2: budget for extent tree splits when adding refcount flag
  ocfs2: prohibit refcounted swapfiles
  ocfs2: add newlines to some error messages
  ocfs2: convert inode refcount test to a helper
  simple_write_end(): don't zero in short copy into uptodate
  exofs: don't mess with simple_write_{begin,end}
  9p: saner ->write_end() on failing copy into non-uptodate page
  fix gfs2_stuffed_write_end() on short copies
  fix ceph_write_end()
  nfs_write_end(): fix handling of short copies
  vfs: refactor clone/dedupe_file_range common functions
  fs: try to clone files first in vfs_copy_file_range
  vfs: misc struct path constification
  namespace.c: constify struct path passed to a bunch of primitives
  quota: constify struct path in quota_on
  ...
2016-12-17 18:44:00 -08:00
Al Viro 3c55d6bcfe Merge remote-tracking branch 'djwong/ocfs2-vfs-reflink-6' into for-linus 2016-12-16 16:21:05 -05:00
Linus Torvalds ff0f962ca3 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This update contains:

   - try to clone on copy-up

   - allow renaming a directory

   - split source into managable chunks

   - misc cleanups and fixes

  It does not contain the read-only fd data inconsistency fix, which Al
  didn't like. I'll leave that to the next year..."

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits)
  ovl: fix reStructuredText syntax errors in documentation
  ovl: fix return value of ovl_fill_super
  ovl: clean up kstat usage
  ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
  ovl: create directories inside merged parent opaque
  ovl: opaque cleanup
  ovl: show redirect_dir mount option
  ovl: allow setting max size of redirect
  ovl: allow redirect_dir to default to "on"
  ovl: check for emptiness of redirect dir
  ovl: redirect on rename-dir
  ovl: lookup redirects
  ovl: consolidate lookup for underlying layers
  ovl: fix nested overlayfs mount
  ovl: check namelen
  ovl: split super.c
  ovl: use d_is_dir()
  ovl: simplify lookup
  ovl: check lower existence of rename target
  ovl: rename: simplify handling of lower/merged directory
  ...
2016-12-16 10:58:12 -08:00
Amir Goldstein 031a072a0b vfs: call vfs_clone_file_range() under freeze protection
Move sb_start_write()/sb_end_write() out of the vfs helper and up into the
ioctl handler.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:54 +01:00
Linus Torvalds 36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Darrick J. Wong 876bec6f9b vfs: refactor clone/dedupe_file_range common functions
Hoist both the XFS reflink inode state and preparation code and the XFS
file blocks compare functions into the VFS so that ocfs2 can take
advantage of it for reflink and dedupe.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-09 16:18:30 -08:00
Miklos Szeredi d16744ec8a vfs: make generic_readlink() static
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09 16:45:04 +01:00
Miklos Szeredi 76fca90e9f vfs: default to generic_readlink()
If i_op->readlink is NULL, but i_op->get_link is set then vfs_readlink()
defaults to calling generic_readlink().

The IOP_DEFAULT_READLINK flag indicates that the above conditions are met
and the default action can be taken.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09 16:45:04 +01:00