Commit Graph

69521 Commits

Author SHA1 Message Date
Jens Axboe 728f13e730 io-wq: remove nr_process accounting
We're now just using fork like we would from userspace, so there's no
need to try and impose extra restrictions or accounting on the user
side of things. That's already being done for us. That also means we
don't have to pass in the user_struct anymore, that's correctly inherited
through ->creds on fork.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:33:26 -07:00
Jens Axboe 1c0aa1fae1 io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
A few reasons to do this:

- The naming of the manager and worker have changed. That's a user visible
  change, so makes sense to flag it.

- Opening certain files that use ->signal (like /proc/self or /dev/tty)
  now works, and the flag tells the application upfront that this is the
  case.

- Related to the above, using signalfd will now work as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:32:11 -07:00
Jens Axboe 2587890b5e Revert "proc: don't allow async path resolution of /proc/self components"
This reverts commit 8d4c3e76e3.

No longer needed, as the io-wq worker threads have the right identity.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:32:11 -07:00
Jens Axboe 9e8d9e829c Revert "proc: don't allow async path resolution of /proc/thread-self components"
This reverts commit 0d4370cfe3.

No longer needed, as the io-wq worker threads have the right identity.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:32:11 -07:00
Pavel Begunkov e5547d2c5e io_uring: fix locked_free_list caches_free()
Don't forget to zero locked_free_nr, it's not a disaster but makes it
attempting to flush it with extra locking when there is nothing in the
list. Also, don't traverse a potentially long list freeing requests
under spinlock, splice the list and do it afterwards.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 19:18:54 -07:00
Jens Axboe 7c977a58dc io_uring: don't attempt IO reissue from the ring exit path
If we're exiting the ring, just let the IO fail with -EAGAIN as nobody
will care anyway. It's not the right context to reissue from.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 19:18:13 -07:00
Jens Axboe 37d1e2e364 io_uring: move SQPOLL thread io-wq forked worker
Don't use a kthread for SQPOLL, use a forked worker just like the io-wq
workers. With that done, we can drop the various context grabbing we do
for SQPOLL, it already has everything it needs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 16:44:42 -07:00
Linus Torvalds f6e1e1d1e1 Changes in gfs2:
* Log space and revoke accounting rework to fix some failed asserts.
 * Local resource group glock sharing for better local performance.
 * Add support for version 1802 filesystems: trusted xattr support and
   '-o rgrplvb' mounts by default.
 * Actually synchronize on the inode glock's FREEING bit during withdraw
   ("gfs2: fix glock confusion in function signal_our_withdraw").
 * Fix parallel recovery of multiple journals ("gfs2: keep bios separate
   for each journal").
 * Various other bug fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmA1TmwUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpDZhAArnFj5AhWMI2+DD5o05EILdgDSpwh
 JWYT1pfRqR1OZrs7ZZ7tGZB4H6oytYfJ+4mg9Kk7CE7oJKcBh695IPZoIWv8+BCC
 WIgQGJytCFp4tuDNw11HZ0ahgW4zXPyJTt6jidZ5jVkux31JrUS7fVqSsD2vIPqA
 iQMcJIH+NLTlYbNt4d5T/ngaoRcx7m18RWkcxf6Y+/DBnnwIe4ZDpZmkWVykuncv
 OFSvXK8vKyLWGnvH/MIsywfYeU5rj/0AIu66JhVILQ4v5kGYIigwY3quXP2SoITM
 Z0+N5Gj/N4OWSscRS86zyqhnRucrjDkNP2+oGSzJWgtSXE/KplyfInAmQWzhIPRM
 n7T0boTp+gOTzGq7ELCzj44KICLG76WgDwaR2bLHuQ2/ppVrHNltZqncP2iwynN6
 glfST/eHBUBu1qTYLaOAfkUBlhpKDXu0YPcXX7lH6M0JqyvkRUFfuBAU9dic9D9K
 zsxplHGJrZnE9QFWWbS3aOviPlSHaXfkZF0Xv7QCLyuPRhu+e/qfcAoeVhxSd4+e
 I0grs/TxM61jyju9SmqnM7P+8qYS55naYH1V+6iNCU5dax8MvdxNZuneBQIa07U+
 Y84JPQvTBZDUE0gZ8fUzZtnYS7RqyiG7BL+T4W5Ph7LgxXbgQD7CWerYpg7fBm/j
 HEpjKqrS96zfTyk=
 =45VG
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - Log space and revoke accounting rework to fix some failed asserts.

 - Local resource group glock sharing for better local performance.

 - Add support for version 1802 filesystems: trusted xattr support and
   '-o rgrplvb' mounts by default.

 - Actually synchronize on the inode glock's FREEING bit during withdraw
   ("gfs2: fix glock confusion in function signal_our_withdraw").

 - Fix parallel recovery of multiple journals ("gfs2: keep bios separate
   for each journal").

 - Various other bug fixes.

* tag 'gfs2-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (49 commits)
  gfs2: Don't get stuck with I/O plugged in gfs2_ail1_flush
  gfs2: Per-revoke accounting in transactions
  gfs2: Rework the log space allocation logic
  gfs2: Minor calc_reserved cleanup
  gfs2: Use resource group glock sharing
  gfs2: Allow node-wide exclusive glock sharing
  gfs2: Add local resource group locking
  gfs2: Add per-reservation reserved block accounting
  gfs2: Rename rs_{free -> requested} and rd_{reserved -> requested}
  gfs2: Check for active reservation in gfs2_release
  gfs2: Don't search for unreserved space twice
  gfs2: Only pass reservation down to gfs2_rbm_find
  gfs2: Also reflect single-block allocations in rgd->rd_extfail_pt
  gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end
  gfs2: Add trusted xattr support
  gfs2: Enable rgrplvb for sb_fs_format 1802
  gfs2: Don't skip dlm unlock if glock has an lvb
  gfs2: Lock imbalance on error path in gfs2_recover_one
  gfs2: Move function gfs2_ail_empty_tr
  gfs2: Get rid of current_tail()
  ...
2021-02-23 14:04:04 -08:00
Linus Torvalds 7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Bob Peterson 17d7768408 gfs2: Don't get stuck with I/O plugged in gfs2_ail1_flush
In gfs2_ail1_flush, we're using I/O plugging to give the block layer a
better chance of merging I/O requests.  If we're too aggressive here, we
can end up waiting on I/O to complete while still plugged.  Fix that in
a way similar to writeback_sb_inodes, except that we can't use
blk_flush_plug because blk_flush_plug_list is not exported.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-23 19:01:42 +01:00
Andreas Gruenbacher 803074ad77 Merge branches 'rgrp-glock-sharing' and 'gfs2-revoke' from https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git
Merge the resource group glock sharing feature and the revoke accounting rework.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-23 18:54:22 +01:00
Tetsuo Handa 9c7d83ae6b pstore: Fix warning in pstore_kill_sb()
syzbot is hitting WARN_ON(pstore_sb != sb) at pstore_kill_sb() [1], for the
assumption that pstore_sb != NULL is wrong because pstore_fill_super() will
not assign pstore_sb = sb when new_inode() for d_make_root() returned NULL
(due to memory allocation fault injection).

Since mount_single() calls pstore_kill_sb() when pstore_fill_super()
failed, pstore_kill_sb() needs to be aware of such failure path.

[1] https://syzkaller.appspot.com/bug?id=6abacb8da5137cb47a416f2bef95719ed60508a0

Reported-by: syzbot <syzbot+d0cf0ad6513e9a1da5df@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210214031307.57903-1-penguin-kernel@I-love.SAKURA.ne.jp
2021-02-23 09:27:20 -08:00
Al Viro 6f24784f00 whack-a-mole: don't open-code iminor/imajor
several instances creeped back into the tree...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-02-23 10:25:29 -05:00
Al Viro 9652c73246 9p: fix misuse of sscanf() in v9fs_stat2inode()
1) sscanf() return value needs to be checked, damnit
2) sscanf() is perfectly capable of checking for fixed prefix,
no need for that %13s + strncmp with constant string.
3) st->extension is a valid string; no need for voodoo with
str*cpy() there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-02-23 10:25:28 -05:00
Steve French f1a08655cc cifs: minor simplification to smb2_is_network_name_deleted
Trivial change to clarify code in smb2_is_network_name_deleted

Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-23 04:16:41 -06:00
Rohith Surabattula 9e550b0852 TCON Reconnect during STATUS_NETWORK_NAME_DELETED
When server returns error STATUS_NETWORK_NAME_DELETED, TCON
must be marked for reconnect. So, subsequent IO does the tree
connect again.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-23 04:16:00 -06:00
Steve French 23bda5e651 cifs: cleanup a few le16 vs. le32 uses in cifsacl.c
Cleanup some minor sparse warnings in cifsacl.c

Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-22 21:20:44 -06:00
Shyam Prasad N bc3e9dd9d1 cifs: Change SIDs in ACEs while transferring file ownership.
With cifsacl, when a file/dir ownership is transferred (chown/chgrp),
the ACEs in the DACL for that file will need to replace the old owner
SIDs with the new owner SID.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-22 21:20:44 -06:00
Shyam Prasad N f506550889 cifs: Retain old ACEs when converting between mode bits and ACL.
When cifsacl mount option is used, retain the ACEs which
should not be modified during chmod. Following is the approach taken:

1. Retain all explicit (non-inherited) ACEs, unless the SID is one
of owner/group/everyone/authenticated-users. We're going to set new
ACEs for these SIDs anyways.
2. At the end of the list of explicit ACEs, place the new list of
ACEs obtained by necessary conversion/encoding.
3. Once the converted/encoded ACEs are set, copy all the remaining
ACEs (inherited) into the new ACL.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-22 21:20:44 -06:00
Shyam Prasad N c12ead71e8 cifs: Fix cifsacl ACE mask for group and others.
A two line fix which I made while testing my prev fix with
cifsacl mode conversions seem to have gone missing in the final fix
that was submitted. This is that fix.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-22 21:20:44 -06:00
Steve French 40f077a02b cifs: clarify hostname vs ip address in /proc/fs/cifs/DebugData
/proc/fs/cifs/DebugData called the ip address for server sessions
"Name" which is confusing since it is not a hostname. Change
this field name to "Address" and for the list of servers add
new field "Hostname" which is populated from the hostname used
to connect to the server.  See below. And also don't print
[NONE] when the interface list is empty as it is not clear
what 'NONE' referred to.

Servers:
1) ConnectionId: 0x1 Hostname: localhost
Number of credits: 389 Dialect 0x311
TCP status: 1 Instance: 1
Local Users To Server: 1 SecMode: 0x1 Req On Wire: 0
In Send: 0 In MaxReq Wait: 0

	Sessions:
	1) Address: 127.0.0.1
...

Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-22 21:20:44 -06:00
Steve French b438fcf128 cifs: change confusing field serverName (to ip_addr)
ses->serverName is not the server name, but the string form
of the ip address of the server.  Change the name to ip_addr
to avoid confusion (and fix the array length to match
maximum length of ipv6 address).

Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-22 21:20:43 -06:00
Linus Torvalds e913a8cdc2 Fixes around VM_FPNMAP and follow_pfn
- replace mm/frame_vector.c by get_user_pages in misc/habana and
   drm/exynos drivers, then move that into media as it's sole user
 - close race in generic_access_phys
 - s390 pci ioctl fix of this series landed in 5.11 already
 - properly revoke iomem mappings (/dev/mem, pci files)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAmAzgywACgkQTA9ye/CY
 qnFPbA//RUHB5bD7vwnEglfJhonKSi/Vt3dNQwUI+pCFK8muWvvPyTkGXKjjT2dI
 uAOY2F23wymtIexV3fNLgnMez7kMcupOLkdxJic4GiO+HJn1jnkshdX7/dGtUW7O
 G3yfnf/D27i912tT3j6PN7dVnasAYYtndCgImM027Zigzn4ibY+02tnzd5XTj1F8
 yq8Swx88oqF8v10HxfpF3RLShqT3S17mFmd9dTv0GkZX497Pe75O44XcXzkD33Bj
 wasH2Tz8gMEQx6TNAGlJe13dzDHReh2cG0z2r+6PTA6KnaMMxbEIImHNuhWOmHb/
 nf8Jpu9uMOLzB+3hG3TzISTDBhAgPfoJ8Ov40VJCWMtCVBnyMyPJr28Oobb8Dj3V
 SXvjSVlLeobOLt+E9vAS+Rmas07LCGBdNP9sexxV7S/sveSQ5W+FptaQW03EghwA
 nBYEUC68WqpX99lJCFPmv5zmy5xkecjpU6mLHZljtV1ORzktqWZdVhmC8njHMAMY
 Hi/emnPxEX1FpOD38rr7F9KUUSsy4t/ZaCgVaLcxCcbglCHXSHC41R09p9TBRSJo
 G6Lksjyj4aa+UL5dZDAtLY0shg0bv2u93dGQNaDAC+uzj6D0ErBBzDK570zBKjp/
 75+nqezJlD0d7I6rOl6FwiEYeSrYXJxYEveKVUr8CnH6sfeBlwo=
 =lQoR
 -----END PGP SIGNATURE-----

Merge tag 'topic/iomem-mmap-vs-gup-2021-02-22' of git://anongit.freedesktop.org/drm/drm

Pull follow_pfn() updates from Daniel Vetter:
 "Fixes around VM_FPNMAP and follow_pfn:

   - replace mm/frame_vector.c by get_user_pages in misc/habana and
     drm/exynos drivers, then move that into media as it's sole user

   - close race in generic_access_phys

   - s390 pci ioctl fix of this series landed in 5.11 already

   - properly revoke iomem mappings (/dev/mem, pci files)"

* tag 'topic/iomem-mmap-vs-gup-2021-02-22' of git://anongit.freedesktop.org/drm/drm:
  PCI: Revoke mappings like devmem
  PCI: Also set up legacy files only after sysfs init
  sysfs: Support zapping of binary attr mmaps
  resource: Move devmem revoke code to resource framework
  /dev/mem: Only set filp->f_mapping
  PCI: Obey iomem restrictions for procfs mmap
  mm: Close race in generic_access_phys
  media: videobuf2: Move frame_vector into media subsystem
  mm/frame-vector: Use FOLL_LONGTERM
  misc/habana: Use FOLL_LONGTERM for userptr
  misc/habana: Stop using frame_vector helpers
  drm/exynos: Use FOLL_LONGTERM for g2d cmdlists
  drm/exynos: Stop using frame_vector helpers
2021-02-22 17:45:02 -08:00
Linus Torvalds 4b5f9254e4 kconfig for kcmp syscall
drm userspaces uses this, systemd uses this, makes sense to pull it
 out from the checkpoint-restore bundle. Kees reviewed this from
 security pov and is happy with the final version.
 
 LWN coverage: https://lwn.net/Articles/845448/
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAmAzaXIACgkQTA9ye/CY
 qnH5FQ//eL/7a/PDICuCRIN2p2aQwHoe9d12q+01RolAgce6F9mR9SFiKGSCR+t7
 daw4G/BaGxSYzvz1IqWbXDMhN87jAXV/IGs9k4OkSIcbnDmMY78EKMZB1c1t7AZo
 zmeAuQvmTAcBogTwC6IE9N1JwhH3fmudq4p8zZ4zLojJNSPjrwCvF/xQI/Yaw52V
 CTfni8mrjYJ+pZ1qn9XP3IceAFEEI27ubZj2DJU+5xpRJAdIAobo0XbVOf8XQ0uc
 /BRLyXFS66EDsY1wWHT6y6UXDNZgbLic0olC1aielaBJh+Wq6bQHgephxpasU5y7
 cZX7XTX2N1q8j8NmgzWLYRgERqtXv0CPHKdimTs8SaUcPDGhxcnwPR6hmdQEC+i6
 IjntWMERjfuyD+s6qVuc7s8WS7+Ry9OxgdVskHASqGpBvsSliXN1o02Am6WUuGsB
 HZxTjCe967FyL4LGU0YjobMTUUSWfYQkOBKABlvYUySNZ0ZHnSygHIWiWjC6b89A
 KmXiHJoocNfDlKwX6bf3OWQ+dGGFu2wo5wYzldIiqYJVidp50xdOosdRE1R6WwuG
 IOLCdNKdqDgtig+90/fFZ06liXZvqUdDafWgUs/g6lLquFrcq5aAIiSdR6PcPKB0
 MwfWcCglLtYrxgDHvNaqnW18yRQq2TGbe+A65aXzLPp45pKP8Hk=
 =uiSj
 -----END PGP SIGNATURE-----

Merge tag 'topic/kcmp-kconfig-2021-02-22' of git://anongit.freedesktop.org/drm/drm

Pull kcmp kconfig update from Daniel Vetter:
 "Make the kcmp syscall available independently of checkpoint/restore.

  drm userspaces uses this, systemd uses this, so makes sense to pull it
  out from the checkpoint-restore bundle.

  Kees reviewed this from security pov and is happy with the final
  version"

Link: https://lwn.net/Articles/845448/

* tag 'topic/kcmp-kconfig-2021-02-22' of git://anongit.freedesktop.org/drm/drm:
  kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE
2021-02-22 17:15:30 -08:00
Linus Torvalds 7c70f3a748 Optimization:
- Cork the socket while there are queued replies
 
 Fixes:
 
 - DRC shutdown ordering
 - svc_rdma_accept() lockdep splat
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmAsA80ACgkQM2qzM29m
 f5erXA/+MrR3ZtwK2eaTITu13TzzTrMURbp/n0wCCW/Ls1YMb6bn9ggtBwu2W5Cn
 Vb0RO9OLcmoI6CjqPh0CTUvvZspMYOAX4W1jQecKt2ml075APdlqUcv9YWPUQqVJ
 qTg8HxDymvHvY3I3FcBxhzofmGzF8AOmQZJw9uI5Wt/ivBfqGWcAGlxyRmB3mdsm
 cJRK0Sy7QMn2LefMcpMEeSbPA049/NZNRp6fcXnpPQFer42thoosYsNhTlAJfCXC
 C5S0z3/T6rpuJucV9la/WkpUA0YhWbPEHWNdAB5tzSqmoEo4LpzJzjv7uyQU4oue
 QlmChIz9qasgTI/BnCkBIzPD99S4UQcXjX0BnNinkQ77e6+b/vdAR+T+NLHJdkAf
 +7Xz6T9aZNaz2R49CjYl6/kG0rlNkjUzyURRYs/9zEBhogMPH/N4T7Z2M+ljCkeb
 tc3OaFDXZ2rfr7EKBGsfnEKINM1gpYipzILkr8GSHUMZLzOB/64upKySaJVjCGXj
 7Sf1w+vJUWwYc+FqFvbaR4ybr01VIfdsecpn1TtY870zG1JzimzAHVZk1/xC9+CX
 J+lVOXbjawDl1Et3V3fWq6Y7mhAWves/NKPcbSug9sFc4qRHEmPbAq/RRtlsjQcn
 foMr5R8qd8OwEamVypZ2nIFxq4q3b742AS8lZhaK+DyZKq3oLac=
 =+R4U
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull more nfsd updates from Chuck Lever:
 "Here are a few additional NFSD commits for the merge window:

 Optimization:
   - Cork the socket while there are queued replies

  Fixes:
   - DRC shutdown ordering
   - svc_rdma_accept() lockdep splat"

* tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Further clean up svc_tcp_sendmsg()
  SUNRPC: Remove redundant socket flags from svc_tcp_sendmsg()
  SUNRPC: Use TCP_CORK to optimise send performance on the server
  svcrdma: Hold private mutex while invoking rdma_accept()
  nfsd: register pernet ops last, unregister first
2021-02-22 13:29:55 -08:00
Linus Torvalds 20bf195e93 With netfs helper library and fscache rework delayed, just a few cap
handling improvements to avoid grabbing mmap_lock in some code paths
 and deal with capsnaps better and a mount option cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmAzuGwTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzixPqB/9kQxU8IkCF0wOm+dm0tBW3PjYxBFuz
 HryHU6WJHDbX9/enH6PgMj6ZpRwxgzDq8xUpmRKVeaPflej9PnfQyH/On+vQWRUX
 WyWyBx0QqbrKYvYK0cCjHzVC5kbtBA8C/1OSSs5EkJIh518RBMkeru9pYL7+TI5x
 zeQVXzOJB2Bz7y8Odd2RjlkAkix/J1m0LIggRaoWrTygz93PKXfjzhDpa4KC4WZj
 W6LjnYPpYjo34poKx/3N3ZSgGP+Y3F7ZDeNfSnPB2WKs7vzcYUCpWXBSHnHTz+lK
 H2O5GdmxQ6BFp4SZvYtf5e78igH/m/QmzAYGW2EmmKttOcyrb2282snb
 =8MQu
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.12-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "With netfs helper library and fscache rework delayed, just a few cap
  handling improvements to avoid grabbing mmap_lock in some code paths
  and deal with capsnaps better and a mount option cleanup"

* tag 'ceph-for-5.12-rc1' of git://github.com/ceph/ceph-client:
  ceph: defer flushing the capsnap if the Fb is used
  libceph: remove osdtimeout option entirely
  libceph: deprecate [no]cephx_require_signatures options
  ceph: allow queueing cap/snap handling after putting cap references
  ceph: clean up inode work queueing
  ceph: fix flush_snap logic after putting caps
2021-02-22 13:27:51 -08:00
Linus Torvalds 9fe1904626 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmAzo34ACgkQnJ2qBz9k
 QNmsUwf8DFq8Uu2PI2BFOzHkEG6F3y+/KPpja9k08q3A1NSM28RYBaFeWc9wGImZ
 jtu3k1+8TiK51OkYGxa5LeIKpaMZrylEGXhdYTyfBJiJSHrjApWiq1jsCvtxk/xt
 3pjI9+OItwmZVo/INYAWS8+QdweX87PkaZtKi0//pqgFdnsjMCKDUxkCIB3IEigk
 I7orTiBpTSgP3iwcuRhchyyCFjIeoW+L2nbNuv8CYjXo9WIAF5ypQx+r1T2f1Ieu
 Vt9u41gwRUYfn3a5YdKMJZgAkcv7a4QYP4+tbSnD9Wl3jtorCBgTC6EDUyGNWqdr
 lqRIJ0jp1ET387J/YAGCGFsdz1AIjw==
 =YTNN
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull isofs, udf, and quota updates from Jan Kara:
 "Several udf, isofs, and quota fixes"

* tag 'fs_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  parser: Fix kernel-doc markups
  udf: handle large user and group ID
  isofs: handle large user and group ID
  parser: add unsigned int parser
  udf: fix silent AED tagLocation corruption
  isofs: release buffer head before return
  quota: Fix memory leak when handling corrupted quota file
2021-02-22 13:25:37 -08:00
Linus Torvalds db99038542 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmAzolwACgkQnJ2qBz9k
 QNnwAQgAhw7PYZgRGnhm/VEDBD1EiPqNIhV3+EuUcNHlrNERx0jPN3bcoXmJD7FE
 PCCwbsYtQyqjYFipuzvBnUur5s7CxrwyDhvE8bgYdOB43Gy94awwvwF+JbMnBaPj
 gZSvArKD7ISAUpt560jtH5KedNAZnDkPITePME2GQsOpZ9SHHjsJEhSheTaHk0t1
 03Odx6gK5CcRvRD4KQYTa/hvZH95dVHSdakgFODAUoyfR65KMLhBihNOVHZsEVEZ
 S99j0YBY15nxS8ygo+Iz3Y3KOzy9G1xRAzk3wSeDGzhNRfzYP/IIZWWY/KWowmvH
 afx0pa0KiYjgqDpDjsyYPOJ2Ul4cPA==
 =gXlh
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify update from Jan Kara:
 "Make inotify groups be charged against appropriate memcgs"

* tag 'fsnotify_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  inotify, memcg: account inotify instances to kmemcg
2021-02-22 13:23:29 -08:00
Linus Torvalds d61c6a58ae \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmAzoWUACgkQnJ2qBz9k
 QNnFgQgAlng0JOzeCQvLpwweqFl0FCxYbOsZXC1xDyvfX3TiA6A6oiOR4tx3uhQN
 cOQmJXaiMn4oCXjD1j6WZwGfy23yx0XchaoFK9jy2IqodaB/zUjkiWYYqt0G3XIX
 ud35mxjLAGS12BCD0c+vHy2RMsUFl5ep+5aBHRHZJJhCcYbl7e5ctXZ3xB1Q0mgI
 r639gD8JhH3ICdu9W0NaMvqOrVhJFNmhSGATKL/N96+oKub2x2ycYE4L2OXegxy3
 mnFf26LjA8jt7K+KfHloTvkC6D4HVnnvKFvKiIbGKafiWhAE7q57ZO6BPCMajGue
 3UHIhWGmwKXRU72+nW6N+089GbcO/g==
 =1e+z
 -----END PGP SIGNATURE-----

Merge tag 'lazytime_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull lazytime updates from Jan Kara:
 "Cleanups of the lazytime handling in the writeback code making rules
  for calling ->dirty_inode() filesystem handlers saner"

* tag 'lazytime_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: simplify i_state checks in __ext4_update_other_inode_time()
  gfs2: don't worry about I_DIRTY_TIME in gfs2_fsync()
  fs: improve comments for writeback_single_inode()
  fs: drop redundant check from __writeback_single_inode()
  fs: clean up __mark_inode_dirty() a bit
  fs: pass only I_DIRTY_INODE flags to ->dirty_inode
  fs: don't call ->dirty_inode for lazytime timestamp updates
  fat: only specify I_DIRTY_TIME when needed in fat_update_time()
  fs: only specify I_DIRTY_TIME when needed in generic_update_time()
  fs: correctly document the inode dirty flags
2021-02-22 13:17:39 -08:00
Linus Torvalds c63dca9e23 Description for this pull request:
- Improve file deletion performance with dirsync mount option.
 - fix shift-out-of-bounds in exfat_fill_super() generated by syzkaller.
 -----BEGIN PGP SIGNATURE-----
 
 iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmAzA7MYHG5hbWphZS5q
 ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIYKMP/1i4uV1KebeQTkIiNlPMae+H
 hq2TgkrOLp7gKY+G8VoSrYIG9U5fEGh+bo6UwIXQpGDrnFb9AWAE0YoyCUV0sTGS
 6CQmmjl1Cq4gwmBs2yjkbmGa54d2NRExZc2zMewDxnWRQjIvqJftxh1qowjwN13j
 arfjvWGJ61Je/S1jQ0vS0+nC5f4Mcy+60JyAQB4gxX5327o1ZVohs16jYTTZuS4+
 PsARwM+q3wYQ/THeTf01E/a77IfsS49EtR1aTo8/9fCfQCxKdYT0wkJQQ/J0yhTL
 l6LMp1PNHV3iePRPZWAF15n8eB2dXGfcaathgtecY+9arKJVDnAoALt91gxN1bxE
 ZZZWkOlviN4FmdQvKNtVaBhdAxNqybe1P0yLvH1Vm2nY0oyY+qgQ79A/xNWcBvhg
 Q6Y1fV08KLWRPsafoZVntb3pd/DT1ekeedrFuqPe3c6kTziFn1GmUhb+6DGCEoFs
 2sbuxATOJCdNL4GHLt0CRNQYU2Dm6g0yuQ3qodu4wcco9yUgiNXe2sM6NpMXaTH7
 XUlgsaxl8zO5TKrVfre93gIP48HRf2/gZu8ejXs2rUEIzhSy3rsIMyt2DhTUe9Db
 CW2SMGQADG9Aiod4gRaunvMrTE3S7NMXLsOG5K6eThar8fOnqQ/F0NHi5DUWuFdw
 17G5gwimXbZxW0GhTvWH
 =kzJV
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat updates from Namjae Jeon:

 - improve file deletion performance with dirsync mount option

 - fix shift-out-of-bounds in exfat_fill_super() reported by syzkaller

* tag 'exfat-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: improve performance of exfat_free_cluster when using dirsync mount option
  exfat: fix shift-out-of-bounds in exfat_fill_super()
2021-02-22 13:15:50 -08:00
Linus Torvalds 0f3d950ddd zonefs changes for 5.12-rc1
Two patches in this pull request:
 * A fix that did not make it in time for 5.11, to correct the file size
   initialization of full sequential zone, from Shin'ichiro
 * Add file operation tracepoints to help with debugging, from Johannes
 
 Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYDLicAAKCRDdoc3SxdoY
 dpN4AQCh1EbZ3NvHRJ4bMWnNQ3OsdkTWYix7ur3C/5Ft7oQbKQEAge6cUgIjEkrD
 u3znsZSYE/RM+LxrhE1RquTwERkSeQk=
 =StDj
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs updates from Damien Le Moal:
 "Two changes:

   - A fix that did not make it in time for 5.11, to correct the file
     size initialization of full sequential zone, from Shin'ichiro

   - Add file operation tracepoints to help with debugging, from
     Johannes"

* tag 'zonefs-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: Fix file size of zones in full condition
  zonefs: add tracepoints for file operations
2021-02-22 13:13:51 -08:00
Linus Torvalds 250a25e7a1 Merge branch 'work.audit' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull RCU-safe common_lsm_audit() from Al Viro:
 "Make common_lsm_audit() non-blocking and usable from RCU pathwalk
  context.

  We don't really need to grab/drop dentry in there - rcu_read_lock() is
  enough. There's a couple of followups using that to simplify the
  logics in selinux, but those hadn't soaked in -next yet, so they'll
  have to go in next window"

* 'work.audit' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make dump_common_audit_data() safe to be called from RCU pathwalk
  new helper: d_find_alias_rcu()
2021-02-22 13:05:30 -08:00
Linus Torvalds 205f92d7f2 Merge branch 'work.d_name' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull d_name whack-a-mole from Al Viro:
 "A bunch of places that play with ->d_name in printks instead of using
  proper formats..."

* 'work.d_name' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  orangefs_file_mmap(): use %pD
  cifs_debug: use %pd instead of messing with ->d_name
  erofs: use %pd instead of messing with ->d_name
  cramfs: use %pD instead of messing with file_dentry()->d_name
2021-02-22 13:03:30 -08:00
Andreas Gruenbacher 2129b42888 gfs2: Per-revoke accounting in transactions
In the log, revokes are stored as a revoke descriptor (struct
gfs2_log_descriptor), followed by zero or more additional revoke blocks
(struct gfs2_meta_header).  On filesystems with a blocksize of 4k, the
revoke descriptor contains up to 503 revokes, and the metadata blocks
contain up to 509 revokes each.  We've so far been reserving space for
revokes in transactions in block granularity, so a lot more space than
necessary was being allocated and then released again.

This patch switches to assigning revokes to transactions individually
instead.  Initially, space for the revoke descriptor is reserved and
handed out to transactions.  When more revokes than that are reserved,
additional revoke blocks are added.  When the log is flushed, the space
for the additional revoke blocks is released, but we keep the space for
the revoke descriptor block allocated.

Transactions may still reserve more revokes than they will actually need
in the end, but now we won't overshoot the target as much, and by only
returning the space for excess revokes at log flush time, we further
reduce the amount of contention between processes.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-22 21:16:23 +01:00
Andreas Gruenbacher fe3e397668 gfs2: Rework the log space allocation logic
The current log space allocation logic is hard to understand or extend.
The principle it that when the log is flushed, we may or may not have a
transaction active that has space allocated in the log.  To deal with
that, we set aside a magical number of blocks to be used in case we
don't have an active transaction.  It isn't clear that the pool will
always be big enough.  In addition, we can't return unused log space at
the end of a transaction, so the number of blocks allocated must exactly
match the number of blocks used.

Simplify this as follows:
 * When transactions are allocated or merged, always reserve enough
   blocks to flush the transaction (err on the safe side).
 * In gfs2_log_flush, return any allocated blocks that haven't been used.
 * Maintain a pool of spare blocks big enough to do one log flush, as
   before.
 * In gfs2_log_flush, when we have no active transaction, allocate a
   suitable number of blocks.  For that, use the spare pool when
   called from logd, and leave the pool alone otherwise.  This means
   that when the log is almost full, logd will still be able to do one
   more log flush, which will result in more log space becoming
   available.

This will make the log space allocator code easier to work with in
the future.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-22 21:16:22 +01:00
Andreas Gruenbacher 71b219f4e5 gfs2: Minor calc_reserved cleanup
No functional change.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-22 21:16:22 +01:00
Linus Torvalds 0e63a5c6ba It has been a relatively quiet cycle in docsland.
- As promised, the minimum Sphinx version to build the docs is now 1.7,
    and we have dropped support for Python 2 entirely.  That allowed the
    removal of a bunch of compatibility code.
 
  - A set of treewide warning fixups from Mauro that I applied after it
    became clear nobody else was going to deal with them.
 
  - The automarkup mechanism can now create cross-references from relative
    paths to RST files.
 
  - More translations, typo fixes, and warning fixes.
 
 No conflicts with any other tree as far as I know.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmAq4EUPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YTIAH/1I5MlVQwuvNKjwCAEdmltQgHv6SmXSpDkrp
 SGuviWVXxqz8dTyo7C2R12qE/7nP8zGAmclNdX78ynl5qWaj05lQsjBgMYSoQO/F
 +akyLQSL8/8SQrtDPPBcboPuIz9DzkX51kkQthvCf0puJi0ScKVHO9Sk9SKUgDoK
 cnCE9VwpGL7YX/ee2wt91UYREijgJ9P7eQ6rqKvUZ5Itu9ikfu9vQU41GR9tOXDK
 MQK+k38pWdl8wRgTgA0pkVhMf1G732bxTTicvFHXcyqmCkh7++m2+ysT8O+SBBMX
 e5BbP0yysSqThjwFHOW5PWM1AWD5iVz+pnwJwEaJ4K76tJJOw9M=
 =bcDk
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.12' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "It has been a relatively quiet cycle in docsland.

   - As promised, the minimum Sphinx version to build the docs is now
     1.7, and we have dropped support for Python 2 entirely. That
     allowed the removal of a bunch of compatibility code.

   - A set of treewide warning fixups from Mauro that I applied after it
     became clear nobody else was going to deal with them.

   - The automarkup mechanism can now create cross-references from
     relative paths to RST files.

   - More translations, typo fixes, and warning fixes"

* tag 'docs-5.12' of git://git.lwn.net/linux: (75 commits)
  docs: kernel-hacking: be more civil
  docs: Remove the Microsoft rhetoric
  Documentation/admin-guide: kernel-parameters: Update nohlt section
  doc/admin-guide: fix spelling mistake: "perfomance" -> "performance"
  docs: Document cross-referencing using relative path
  docs: Enable usage of relative paths to docs on automarkup
  docs: thermal: fix spelling mistakes
  Documentation: admin-guide: Update kvm/xen config option
  docs: Make syscalls' helpers naming consistent
  coding-style.rst: Avoid comma statements
  Documentation: /proc/loadavg: add 3 more field descriptions
  Documentation/submitting-patches: Add blurb about backtraces in commit messages
  Docs: drop Python 2 support
  Move our minimum Sphinx version to 1.7
  Documentation: input: define ABS_PRESSURE/ABS_MT_PRESSURE resolution as grams
  scripts/kernel-doc: add internal hyperlink to DOC: sections
  Update Documentation/admin-guide/sysctl/fs.rst
  docs: Update DTB format references
  docs: zh_CN: add iio index.rst translation
  docs/zh_CN: add iio ep93xx_adc.rst translation
  ...
2021-02-22 10:57:46 -08:00
Johannes Thumshirn 6e37d24599 btrfs: zoned: fix deadlock on log sync
Lockdep with fstests test case btrfs/041 detected a unsafe locking
scenario when we allocate the log node on a zoned filesystem.

btrfs/041
 ============================================
 WARNING: possible recursive locking detected
 5.11.0-rc7+ #939 Not tainted
 --------------------------------------------
 xfs_io/698 is trying to acquire lock:
 ffff88810cd673a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x3d1/0xee0 [btrfs]

 but task is already holding lock:
 ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&root->log_mutex);
   lock(&root->log_mutex);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by xfs_io/698:
  #0: ffff88810cd66620 (sb_internal){.+.+}-{0:0}, at: btrfs_sync_file+0x2c3/0x570 [btrfs]
  #1: ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]

 stack backtrace:
 CPU: 0 PID: 698 Comm: xfs_io Not tainted 5.11.0-rc7+ #939
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
 Call Trace:
  dump_stack+0x77/0x97
  __lock_acquire.cold+0xb9/0x32a
  lock_acquire+0xb5/0x400
  ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
  __mutex_lock+0x7b/0x8d0
  ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
  ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
  ? find_first_extent_bit+0x9f/0x100 [btrfs]
  ? __mutex_unlock_slowpath+0x35/0x270
  btrfs_sync_log+0x3d1/0xee0 [btrfs]
  btrfs_sync_file+0x3a8/0x570 [btrfs]
  __x64_sys_fsync+0x34/0x60
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens, because we are taking the ->log_mutex albeit it has already
been locked.

Also while at it, fix the bogus unlock of the tree_log_mutex in the error
handling.

Fixes: 3ddebf27fc ("btrfs: zoned: reorder log node allocation on zoned filesystem")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:08:48 +01:00
Josef Bacik 95c85fba1f btrfs: avoid double put of block group when emptying cluster
It's wrong calling btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different than the block group the cluster represents. As this means the
cluster doesn't have a reference to the passed block group. This results
in double put and a use-after-free bug.

Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.

Fixes: fa9c0d795f ("Btrfs: rework allocation clustering")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:45 +01:00
Filipe Manana 3660d0bcdb btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
When using the NO_HOLES feature, if we clone a file range that spans only
a hole into a range that is at or beyond the current i_size of the
destination file, we end up not setting the full sync runtime flag on the
inode. As a result, if we then fsync the destination file and have a power
failure, after log replay we can end up exposing stale data instead of
having a hole for that range.

The conditions for this to happen are the following:

1) We have a file with a size of, for example, 1280K;

2) There is a written (non-prealloc) extent for the file range from 1024K
   to 1280K with a length of 256K;

3) This particular file extent layout is durably persisted, so that the
   existing superblock persisted on disk points to a subvolume root where
   the file has that exact file extent layout and state;

4) The file is truncated to a smaller size, to an offset lower than the
   start offset of its last extent, for example to 800K. The truncate sets
   the full sync runtime flag on the inode;

6) Fsync the file to log it and clear the full sync runtime flag;

7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
   into the file with a destination offset that starts at or beyond the
   256K file extent item we had - for example to offset 1024K;

8) Since the clone operation does not find extents in the source range,
   we end up in the if branch at the bottom of btrfs_clone() where we
   punch a hole for the file range starting at offset 1024K by calling
   btrfs_replace_file_extents(). There we end up not setting the full
   sync flag on the inode, because we don't know we are being called in
   a clone context (and not fallocate's punch hole operation), and
   neither do we create an extent map to represent a hole because the
   requested range is beyond eof;

9) A further fsync to the file will be a fast fsync, since the clone
   operation did not set the full sync flag, and therefore it relies on
   modified extent maps to correctly log the file layout. But since
   it does not find any extent map marking the range from 1024K (the
   previous eof) to the new eof, it does not log a file extent item
   for that range representing the hole;

10) After a power failure no hole for the range starting at 1024K is
   punched and we end up exposing stale data from the old 256K extent.

Turning this into exact steps:

  $ mkfs.btrfs -f -O no-holes /dev/sdi
  $ mount /dev/sdi /mnt

  # Create our test file with 3 extents of 256K and a 256K hole at offset
  # 256K. The file has a size of 1280K.
  $ xfs_io -f -s \
              -c "pwrite -S 0xab -b 256K 0 256K" \
              -c "pwrite -S 0xcd -b 256K 512K 256K" \
              -c "pwrite -S 0xef -b 256K 768K 256K" \
              -c "pwrite -S 0x73 -b 256K 1024K 256K" \
              /mnt/sdi/foobar

  # Make sure it's durably persisted. We want the last committed super
  # block to point to this particular file extent layout.
  sync

  # Now truncate our file to a smaller size, falling within a position of
  # the second extent. This sets the full sync runtime flag on the inode.
  # Then fsync the file to log it and clear the full sync flag from the
  # inode. The third extent is no longer part of the file and therefore
  # it is not logged.
  $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar

  # Now do a clone operation that only clones the hole and sets back the
  # file size to match the size it had before the truncate operation
  # (1280K).
  $ xfs_io \
        -c "reflink /mnt/foobar 256K 1024K 256K" \
        -c "fsync" \
        /mnt/foobar

  # File data before power failure:
  $ od -A d -t x1 /mnt/foobar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
  *
  0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
  *
  0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  1310720

  <power fail>

  # Mount the fs again to replay the log tree.
  $ mount /dev/sdi /mnt

  # File data after power failure:
  $ od -A d -t x1 /mnt/foobar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
  *
  0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
  *
  0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
  *
  1310720

The range from 1024K to 1280K should correspond to a hole but instead it
points to stale data, to the 256K extent that should not exist after the
truncate operation.

The issue does not exists when not using NO_HOLES, because for that case
we use file extent items to represent holes, these are found and copied
during the loop that iterates over extents at btrfs_clone(), and that
causes btrfs_replace_file_extents() to be called with a non-NULL
extent_info argument and therefore set the full sync runtime flag on the
inode.

So fix this by making the code that deals with a trailing hole during
cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
range starts at or beyond the current i_size.

A test case for fstests will follow soon.

Backporting notes: for kernel 5.4 the change goes to ioctl.c into
btrfs_clone before the last call to btrfs_punch_hole_range.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:45 +01:00
Josef Bacik 1119a72e22 btrfs: tree-checker: do not error out if extent ref hash doesn't match
The tree checker checks the extent ref hash at read and write time to
make sure we do not corrupt the file system.  Generally extent
references go inline, but if we have enough of them we need to make an
item, which looks like

key.objectid	= <bytenr>
key.type	= <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
key.offset	= hash(tree, owner, offset)

However if key.offset collide with an unrelated extent reference we'll
simply key.offset++ until we get something that doesn't collide.
Obviously this doesn't match at tree checker time, and thus we error
while writing out the transaction.  This is relatively easy to
reproduce, simply do something like the following

  xfs_io -f -c "pwrite 0 1M" file
  offset=2

  for i in {0..10000}
  do
	  xfs_io -c "reflink file 0 ${offset}M 1M" file
	  offset=$(( offset + 2 ))
  done

  xfs_io -c "reflink file 0 17999258914816 1M" file
  xfs_io -c "reflink file 0 35998517829632 1M" file
  xfs_io -c "reflink file 0 53752752058368 1M" file

  btrfs filesystem sync

And the sync will error out because we'll abort the transaction.  The
magic values above are used because they generate hash collisions with
the first file in the main subvol.

The fix for this is to remove the hash value check from tree checker, as
we have no idea which offset ours should belong to.

Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
Fixes: 0785a9aacf ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:44 +01:00
Filipe Manana dd0734f2a8 btrfs: fix race between swap file activation and snapshot creation
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.

However this is racy because when a task started activation of a swap
file, another task might have started already snapshot creation and might
have seen the counter for the number of swap files as zero. This means
that after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back into COW mode, which should never
happen for active swap files.

Basically what can happen is:

1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
   There it sees that root->nr_swapfiles has a value of 0 so it continues;

2) Task B enters btrfs_swap_activate(). It is not aware that another task
   started snapshot creation but it did not finish yet. It increments
   root->nr_swapfiles from 0 to 1;

3) Task B checks that the file meets all requirements to be an active
   swap file - it has NOCOW set, there are no snapshots for the inode's
   root at the moment, no file holes, no reflinked extents, etc;

4) Task B returns success and now the file is an active swap file;

5) Task A commits the transaction to create the snapshot and finishes.
   The swap file's extents are now shared between the original root and
   the snapshot;

6) A write into an extent of the swap file is attempted - there is a
   snapshot of the file's root, so we fall back to COW mode and therefore
   the physical location of the extent changes on disk.

So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.

Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:35 +01:00
Filipe Manana 195a49eaf6 btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.

However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.

Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.

Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:15 +01:00
Filipe Manana 20903032cd btrfs: avoid checking for RO block group twice during nocow writeback
During the nocow writeback path, we currently iterate the rbtree of block
groups twice: once for checking if the target block group is RO with the
call to btrfs_extent_readonly()), and once again for getting a nocow
reference on the block group with a call to btrfs_inc_nocow_writers().

Since btrfs_inc_nocow_writers() already returns false when the target
block group is RO, remove the call to btrfs_extent_readonly(). Not only
we avoid searching the blocks group rbtree twice, it also helps reduce
contention on the lock that protects it (specially since it is a spin
lock and not a read-write lock). That may make a noticeable difference
on very large filesystems, with thousands of allocated block groups.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:55 +01:00
Nikolay Borisov 3c17916510 btrfs: fix race between extent freeing/allocation when using bitmaps
During allocation the allocator will try to allocate an extent using
cluster policy. Once the current cluster is exhausted it will remove the
entry under btrfs_free_cluster::lock and subsequently acquire
btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry
and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem
because there exists a race condition between removing the entry under
one lock and doing the necessary accounting holding a different lock
since extent freeing only uses the 2nd lock. This can result in the
following situation:

T1:                                    T2:
btrfs_alloc_from_cluster               insert_into_bitmap <holds tree_lock>
 if (entry->bytes == 0)                   if (block_group && !list_empty(&block_group->cluster_list)) {
    rb_erase(entry)

 spin_unlock(&cluster->lock);
   (total_bitmaps is still 4)           spin_lock(&cluster->lock);
                                         <doesn't find entry in cluster->root>
 spin_lock(&ctl->tree_lock);             <goes to new_bitmap label, adds
<blocked since T2 holds tree_lock>       <a new entry and calls add_new_bitmap>
					    recalculate_thresholds  <crashes,
                                              due to total_bitmaps
					      becoming 5 and triggering
					      an ASSERT>

To fix this ensure that once depleted, the cluster entry is deleted when
both cluster lock and tree locks are held in the allocator (T1), this
ensures that even if there is a race with a concurrent
insert_into_bitmap call it will correctly find the entry in the cluster
and add the new space to it.

CC: <stable@vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:31 +01:00
Qu Wenruo 04d4ba4c90 btrfs: make check_compressed_csum() to be subpage compatible
Currently check_compressed_csum() completely relies on sectorsize ==
PAGE_SIZE to do checksum verification for compressed extents.

To make it subpage compatible, this patch will:
- Do extra calculation for the csum range
  Since we have multiple sectors inside a page, we need to only hash
  the range we want, not the full page anymore.

- Do sector-by-sector hash inside the page

With this patch and previous conversion on
btrfs_submit_compressed_read(), now we can read subpage compressed
extents properly, and do proper csum verification.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:27 +01:00
Qu Wenruo be6a13613f btrfs: make btrfs_submit_compressed_read() subpage compatible
For compressed read, we always submit page read using page size.  This
doesn't work well with subpage, as for subpage one page can contain
several sectors.  Such submission will read range out of what we want,
and cause problems.

Thankfully to make it subpage compatible, we only need to change how the
last page of the compressed extent is read.

Instead of always adding a full page to the compressed read bio, if we're
at the last page, calculate the size using compressed length, so that we
only add part of the range into the compressed read bio.

Since we are here, also change the PAGE_SIZE used in
lookup_extent_mapping() to sectorsize.
This modification won't cause any functional change, as
lookup_extent_mapping() can handle the case where the search range is
larger than found extent range.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:25 +01:00
Ira Weiny d70cef0d46 btrfs: fix raid6 qstripe kmap
When a qstripe is required an extra page is allocated and mapped.  There
were 3 problems:

1) There is no corresponding call of kunmap() for the qstripe page.
2) There is no reason to map the qstripe page more than once if the
   number of bits set in rbio->dbitmap is greater than one.
3) There is no reason to map the parity page and unmap it each time
   through the loop.

The page memory can continue to be reused with a single mapping on each
iteration by raid6_call.gen_syndrome() without remapping.  So map the
page for the duration of the loop.

Similarly, improve the algorithm by mapping the parity page just 1 time.

Fixes: 5a6ac9eacb ("Btrfs, raid56: support parity scrub on raid56")
CC: stable@vger.kernel.org # 4.4.x: c17af96554a8: btrfs: raid56: simplify tracking of Q stripe presence
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:21 +01:00
Pavel Begunkov 8e5c66c485 io_uring: clear request count when freeing caches
BUG: KASAN: double-free or invalid-free in io_req_caches_free.constprop.0+0x3ce/0x530 fs/io_uring.c:8709

Workqueue: events_unbound io_ring_exit_work
Call Trace:
 [...]
 __cache_free mm/slab.c:3424 [inline]
 kmem_cache_free_bulk+0x4b/0x1b0 mm/slab.c:3744
 io_req_caches_free.constprop.0+0x3ce/0x530 fs/io_uring.c:8709
 io_ring_ctx_free fs/io_uring.c:8764 [inline]
 io_ring_exit_work+0x518/0x6b0 fs/io_uring.c:8846
 process_one_work+0x98d/0x1600 kernel/workqueue.c:2275
 worker_thread+0x64c/0x1120 kernel/workqueue.c:2421
 kthread+0x3b1/0x4a0 kernel/kthread.c:292
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

Freed by task 11900:
 [...]
 kmem_cache_free_bulk+0x4b/0x1b0 mm/slab.c:3744
 io_req_caches_free.constprop.0+0x3ce/0x530 fs/io_uring.c:8709
 io_uring_flush+0x483/0x6e0 fs/io_uring.c:9237
 filp_close+0xb4/0x170 fs/open.c:1286
 close_files fs/file.c:403 [inline]
 put_files_struct fs/file.c:418 [inline]
 put_files_struct+0x1d0/0x350 fs/file.c:415
 exit_files+0x7e/0xa0 fs/file.c:435
 do_exit+0xc27/0x2ae0 kernel/exit.c:820
 do_group_exit+0x125/0x310 kernel/exit.c:922
 [...]

io_req_caches_free() doesn't zero submit_state->free_reqs, so io_uring
considers just freed requests to be good and sound and will reuse or
double free them. Zero the counter.

Reported-by: syzbot+30b4936dcdb3aafa4fb4@syzkaller.appspotmail.com
Fixes: 41be53e94f ("io_uring: kill cached requests from exiting task closing the ring")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22 06:31:31 -07:00
Hyeongseok Kim f728760aa9 exfat: improve performance of exfat_free_cluster when using dirsync mount option
There are stressful update of cluster allocation bitmap when using
dirsync mount option which is doing sync buffer on every cluster bit
clearing. This could result in performance degradation when deleting
big size file.
Fix to update only when the bitmap buffer index is changed would make
less disk access, improving performance especially for truncate operation.

Testing with Samsung 256GB sdcard, mounted with dirsync option
(mount -t exfat /dev/block/mmcblk0p1 /temp/mount -o dirsync)

Remove 4GB file, blktrace result.
[Before] : 39 secs.
Total (blktrace):
 Reads Queued:      0,        0KiB   Writes Queued:      32775,    16387KiB
 Read Dispatches:   0,        0KiB   Write Dispatches:   32775,    16387KiB
 Reads Requeued:    0                Writes Requeued:        0
 Reads Completed:   0,        0KiB   Writes Completed:   32775,    16387KiB
 Read Merges:       0,        0KiB   Write Merges:           0,        0KiB
 IO unplugs:        2                Timer unplugs:          0

[After] : 1 sec.
Total (blktrace):
 Reads Queued:      0,        0KiB   Writes Queued:         13,        6KiB
 Read Dispatches:   0,        0KiB   Write Dispatches:      13,        6KiB
 Reads Requeued:    0                Writes Requeued:        0
 Reads Completed:   0,        0KiB   Writes Completed:      13,        6KiB
 Read Merges:       0,        0KiB   Write Merges:           0,        0KiB
 IO unplugs:        1                Timer unplugs:          0

Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2021-02-22 09:55:14 +09:00
Namjae Jeon 78c276f549 exfat: fix shift-out-of-bounds in exfat_fill_super()
syzbot reported a warning which could cause shift-out-of-bounds issue.

Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 exfat_read_boot_sector fs/exfat/super.c:471 [inline]
 __exfat_fill_super fs/exfat/super.c:556 [inline]
 exfat_fill_super+0x2acb/0x2d00 fs/exfat/super.c:624
 get_tree_bdev+0x406/0x630 fs/super.c:1291
 vfs_get_tree+0x86/0x270 fs/super.c:1496
 do_new_mount fs/namespace.c:2881 [inline]
 path_mount+0x1937/0x2c50 fs/namespace.c:3211
 do_mount fs/namespace.c:3224 [inline]
 __do_sys_mount fs/namespace.c:3432 [inline]
 __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3409
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

exfat specification describe sect_per_clus_bits field of boot sector
could be at most 25 - sect_size_bits and at least 0. And sect_size_bits
can also affect this calculation, It also needs validation.
This patch add validation for sect_per_clus_bits and sect_size_bits
field of boot sector.

Fixes: 719c1e1829 ("exfat: add super block operations")
Cc: stable@vger.kernel.org # v5.9+
Reported-by: syzbot+da4fe66aaadd3c2e2d1c@syzkaller.appspotmail.com
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2021-02-22 09:55:13 +09:00
Linus Torvalds d1fec2214b selinux/stable-5.12 PR 20210215
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmAqwVUUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXP1nw//bbmtBhpaG+RnmPrSGZgy3gqbB3gU
 ggJ5UKNvYclrej2dur3EHXPEB0YWDv2D2OgChfTAu+T7sc2sBF3bz9qAu1a556mV
 JdfID8aoUwSk+oN7AKcwbdua+wLhXppAnYSKaknR+tjmWzvVKBDkrOovl52oR6L8
 Wx3YHCy7yPO79wqGqoWLCI7aI8ByfovoyOf6Xr/sPl+gMuBvbFoJeO1Pa9YNoI0z
 noGT1h6vLjgyvegqMX5lCkh1sUlcOsmXkAksw1FyEAfJfr0MPLLkVoTaBAook5NO
 X7VEhv845CjfIfoCXDdIHzriDWHp3tEDMSQaLwU3QSjfsbyNVh4ggwuHZYqrR9dL
 DerCa+89XYdCldrBzBeRs3Qd/6bZtHpd62pHDgn+NwMdjEckCHh41t2f2odD+Rdy
 2Fv+50C3m+7JjUawKhzgWR3BYJhafiKKUiWc2GBm1cBSr7+vSKokDG27gJmtNCoE
 TedSlQTPyi47zjZMnf/laSqGEUG9xz79xAiDPDP5yuxbDvN5andRYHmhI4thbGcq
 5DsVx5DDWaXtJxRVlsTgTeyvjdp61Rbvj8jvbbD/St+8PNsbpFOerbjSaidovfJK
 Y0YrkL/sKcGkM8HbQCcl1DKd4l1EfDIKUch078LQJHetuHh4L89U+r5uqZRsgZYD
 /EWeEw56llrepMQ=
 =5fVL
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20210215' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:
 "We've got a good handful of patches for SELinux this time around; with
  everything passing the selinux-testsuite and applying cleanly to your
  tree as of a few minutes ago. The highlights are:

   - Add support for labeling anonymous inodes, and extend this new
     support to userfaultfd.

   - Fallback to SELinux genfs file labeling if the filesystem does not
     have xattr support. This is useful for virtiofs which can vary in
     its xattr support depending on the backing filesystem.

   - Classify and handle MPTCP the same as TCP in SELinux.

   - Ensure consistent behavior between inode_getxattr and
     inode_listsecurity when the SELinux policy is not loaded. This
     fixes a known problem with overlayfs.

   - A couple of patches to prune some unused variables from the SELinux
     code, mark private variables as static, and mark other variables as
     __ro_after_init or __read_mostly"

* tag 'selinux-pr-20210215' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  fs: anon_inodes: rephrase to appropriate kernel-doc
  userfaultfd: use secure anon inodes for userfaultfd
  selinux: teach SELinux about anonymous inodes
  fs: add LSM-supporting anon-inode interface
  security: add inode_init_security_anon() LSM hook
  selinux: fall back to SECURITY_FS_USE_GENFS if no xattr support
  selinux: mark selinux_xfrm_refcount as __read_mostly
  selinux: mark some global variables __ro_after_init
  selinux: make selinuxfs_mount static
  selinux: drop the unnecessary aurule_callback variable
  selinux: remove unused global variables
  selinux: fix inconsistency between inode_getxattr and inode_listsecurity
  selinux: handle MPTCP consistently with TCP
2021-02-21 16:54:54 -08:00
Jens Axboe 843bbfd49f io-wq: make io_wq_fork_thread() available to other users
We want to use this in io_uring proper as well, for the SQPOLL thread.
Rename it from fork_thread() to io_wq_fork_thread(), and make it
available through the io-wq.h header.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe bf1daa4bfc io-wq: only remove worker from free_list, if it was there
If the worker isn't on the free_list, don't attempt to delete it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 4379bf8bd7 io_uring: remove io_identity
We are no longer grabbing state, so no need to maintain an IO identity
that we COW if there are changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 44526bedc2 io_uring: remove any grabbing of context
The async workers are siblings of the task itself, so by definition we
have all the state that we need. Remove any of the state grabbing that
we have, and requests flagging what they need.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe c6d77d92b7 io-wq: worker idling always returns false
Remove the bool return, and the checking for it in the caller.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 3bfe610669 io-wq: fork worker threads from original task
Instead of using regular kthread kernel threads, create kernel threads
that are like a real thread that the task would create. This ensures that
we get all the context that we need, without having to carry that state
around. This greatly reduces the code complexity, and the risk of missing
state for a given request type.

With the move away from kthread, we can also dump everything related to
assigned state to the new threads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 958234d5ec io-wq: don't pass 'wqe' needlessly around
Just grab it from the worker itself, which we're already passing in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 5aa75ed5b9 io_uring: tie async worker side to the task context
Move it outside of the io_ring_ctx, and tie it to the io_uring task
context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 3b094e727d io-wq: get rid of wq->use_refs
We don't support attach anymore, so doesn't make sense to carry the
use_refs reference count. Get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe d25e3a3de0 io_uring: disable io-wq attaching
Moving towards making the io_wq per ring per task, so we can't really
share it between rings. Which is fine, since we've now dropped some
of that fat from it.

Retain compatibility with how attaching works, so that any attempt to
attach to an fd that doesn't exist, or isn't an io_uring fd, will fail
like it did before.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 1cbd9c2bcf io-wq: don't create any IO workers upfront
When the manager thread starts up, it creates a worker per node for
the given context. Just let these get created dynamically, like we do
for adding further workers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 7c25c0d16e io_uring: remove the need for relying on an io-wq fallback worker
We hit this case when the task is exiting, and we need somewhere to
do background cleanup of requests. Instead of relying on the io-wq
task manager to do this work for us, just stuff it somewhere where
we can safely run it ourselves directly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:25:22 -07:00
Jens Axboe 2713154906 Merge branch 'for-5.12/io_uring' into io_uring-worker.v3
* for-5.12/io_uring: (21 commits)
  io_uring: run task_work on io_uring_register()
  io_uring: fix leaving invalid req->flags
  io_uring: wait potential ->release() on resurrect
  io_uring: keep generic rsrc infra generic
  io_uring: zero ref_node after killing it
  io_uring: make the !CONFIG_NET helpers a bit more robust
  io_uring: don't hold uring_lock when calling io_run_task_work*
  io_uring: fail io-wq submission from a task_work
  io_uring: don't take uring_lock during iowq cancel
  io_uring: fail links more in io_submit_sqe()
  io_uring: don't do async setup for links' heads
  io_uring: do io_*_prep() early in io_submit_sqe()
  io_uring: split sqe-prep and async setup
  io_uring: don't submit link on error
  io_uring: move req link into submit_state
  io_uring: move io_init_req() into io_submit_sqe()
  io_uring: move io_init_req()'s definition
  io_uring: don't duplicate ->file check in sfr
  io_uring: keep io_*_prep() naming consistent
  io_uring: kill fictitious submit iteration index
  ...
2021-02-21 17:22:53 -07:00
Pavel Begunkov b6c23dd5a4 io_uring: run task_work on io_uring_register()
Do run task_work before io_uring_register(), that might make a first
quiesce round much nicer. We generally do that for any syscall invocation
to avoid spurious -EINTR/-ERESTARTSYS, for task_work that we generate.
This patch brings io_uring_register() inline with the two other io_uring
syscalls.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21 17:18:56 -07:00
Linus Torvalds 66f73fb3fa This pull request contains changes (actually just fixes) for UBIFS
JFFS2:
 - Fix for use-after-free in jffs2_sum_write_data()
 - Fix for out-of-bounds access in jffs2_zlib_compress()
 
 UBI:
 - Remove dead/useless code
 
 UBIFS:
 - Fix for a memory leak in ubifs_init_authentication()
 - Fix for high stack usage
 - Fix for a off-by-one error in xattrs code
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmAyuh8WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wT+bD/9Q2Ar9yMX9drPyAnyb3vudE8c8
 l0RdNLyBSL87pskpszNZR2+o8Yi3vjlbGWq5i97JsP/7UOb4Gc/MfXPYJteP1xUN
 S46EZwgcZa4XCgMSSdMk/NZl7bVdbwjvcGjw1CA4RdPkwt8s2jwYdS+hPrHu6o87
 3xkP7kWShs/2KhUyvodZgAu6SYcTW+OjiKwdAIKxa1Ak9YUMGzsSHqCbl19he5MG
 hMxFZIqRZ2zZUfFeYXffVApJI8eBEKVud2qtNA/A6eGsy5Wx3c4F/bxG/uWdoJPp
 n5CmFRc6UGh8teA43aag5BnLv8sR9bC1Kf3lQX4nwfpBSzE7LwIMN7SVpL0JH5vT
 dJdwn37JYL/RQjmjTk++O/sSgeg9jJWMG+VOSmuKWPgP6xAYEVXqWg9njuV3wl9W
 NFBoybP82IyVHcthOcTrY8dx0F7A4q+3PkMy+7cikO2fYKVvJjdKgTp4pcVnGCR3
 IadXzNRdYrLPvYwf25D2AyETwQQxcmh/Ox7ZOhkUXuFQ/KnhU0yqbO3cTTB1A/mO
 jY2SPtXXeUZwgGpGc8Lyr8/KGZ6tJX/3jswwmg+XvdegBLRogqty8XOcwUuJszCh
 1kDAKs2LJ6UaMYyhV6Jxr4c+wgHoKJG+voY+oTkrUP4Lt0hQVELCylEkX2uJo60Y
 x4Gic/YbRUwnfjlAcg==
 =xorv
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull JFFS2/UBIFS and UBI updates from Richard Weinberger:
 "JFFS2:
   - Fix for use-after-free in jffs2_sum_write_data()
   - Fix for out-of-bounds access in jffs2_zlib_compress()

  UBI:
   - Remove dead/useless code

  UBIFS:
   - Fix for a memory leak in ubifs_init_authentication()
   - Fix for high stack usage
   - Fix for a off-by-one error in xattrs code"

* tag 'for-linus-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubifs: Fix error return code in alloc_wbufs()
  jffs2: check the validity of dstlen in jffs2_zlib_compress()
  ubifs: Fix off-by-one error
  ubifs: replay: Fix high stack usage, again
  ubifs: Fix memleak in ubifs_init_authentication
  jffs2: fix use after free in jffs2_sum_write_data()
  ubi: eba: Delete useless kfree code
  ubi: remove dead code in validate_vid_hdr()
2021-02-21 13:57:08 -08:00
Linus Torvalds 04471d3f18 This pull request contains the following changes for UML:
- Many cleanups and fixes for our virtio code
 - Add support for a pseudo RTC
 - Fix for a possible jailbreak
 - Minor fixes (spelling, header files)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmAyu0oWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wd5jD/9sb/5xYhXCSfTdPS/eIrWvBQoc
 B8rxLfRpYW1Yvzz4R60/vKe8/td5I1/AvlprLp/1AJeawl49vCbSOqwdjn+58Uqb
 rlagZ2Ikilfn5lVIsxPf8fjbleonvBe8qVA30gJKgCYdYuAcXLs404jZ8MMvwZ0g
 t4G7BUc7bq19+UVF06kwefzC64c1WgPiHTmO6DT6RcXoFKq/x6i1FN4QnMEoiKQi
 SsficYHo7FsIhJZKtgTfujzEInLyMMuZ9mTJU/3wwUveLWArG0NRtIttC6FPvhi4
 xx5RlTfC/Jzoqi9Qo14bLqV6KcOU/J7Oi4bDpYyhNggF/QfhnNgT8MGPwx5f+Gso
 8OJg3MsZw70480EBH7/xLSdhZ3ih178Rr/asmiJkwLOYm5zJ22yqtx/jXQBlGOz3
 FHPgTMJcRMzosGqSrhl+KxFdrK1uSLbcFZS3Sp8PUGdtgPPu19Proz2SPdHzt1Mj
 QJY30nRKKUoTLnRYnQV3VSa7uZXGVAK+HtkpRl/ubTKbGcSF8rdl4fYhOPnmAsKQ
 F4HBXwqKBht7eKN2BsNNTLz86OFBopn8eFqq8XxwOqF9O7DZitU0sOboWJyMUY2u
 /QzKxtSAUnNg6Ab+whKhAvkwktJ7IrVJh1JENWDy0pRoCGdF135ajic0bpFDCjqV
 ohOT/Ol+p4/ClLgxiw==
 =e5Qa
 -----END PGP SIGNATURE-----

Merge tag 'for-linux-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML updates from Richard Weinberger:

 - Many cleanups and fixes for our virtio code

 - Add support for a pseudo RTC

 - Fix for a possible jailbreak

 - Minor fixes (spelling, header files)

* tag 'for-linux-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: irq.h: include <asm-generic/irq.h>
  um: io.h: include <linux/types.h>
  um: add a pseudo RTC
  um: remove process stub VMA
  um: rework userspace stubs to not hard-code stub location
  um: separate child and parent errors in clone stub
  um: defer killing userspace on page table update failures
  um: mm: check more comprehensively for stub changes
  um: print register names in wait_for_stub
  um: hostfs: use a kmem cache for inodes
  mm: Remove arch_remap() and mm-arch-hooks.h
  um: fix spelling mistake in Kconfig "privleges" -> "privileges"
  um: virtio: allow devices to be configured for wakeup
  um: time-travel: rework interrupt handling in ext mode
  um: virtio: disable VQs during suspend
  um: virtio: fix handling of messages without payload
  um: virtio: clean up a comment
2021-02-21 13:53:00 -08:00
Linus Torvalds df24212a49 s390 updates for the 5.12 merge window
- Convert to using the generic entry infrastructure.
 
 - Add vdso time namespace support.
 
 - Switch s390 and alpha to 64-bit ino_t. As discussed here
   lkml.kernel.org/r/YCV7QiyoweJwvN+m@osiris
 
 - Get rid of expensive stck (store clock) usages where possible. Utilize
   cpu alternatives to patch stckf when supported.
 
 - Make tod_clock usage less error prone by converting it to a union and
   rework code which is using it.
 
 - Machine check handler fixes and cleanups.
 
 - Drop couple of minor inline asm optimizations to fix clang build.
 
 - Default configs changes notably to make libvirt happy.
 
 - Various changes to rework and improve qdio code.
 
 - Other small various fixes and improvements all over the code.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmAyzcwACgkQjYWKoQLX
 FBjjMwgAmeY3oMkj93bnUF/OnbYTJQ0ZHmlyeboKt7SnFyvNpOVGyRfl7+fPHsNu
 +t9QZQk0f7fSxbcC04gz0ZMw1YbTjWihgZJsN6s+qtrRsv/kVqKr7kvhFrcs8uSZ
 rLiwIRWGVAbprnJZWCNqaGpKkOM0wPYZ5W3Mtnoxe4nTM2LwSu2RWI8ibTGYLQPy
 FybKos2hYOFBTGQdrxmg1zAvpE8DJg4qQNLhYvnmHd8Bw/FNBmoyhx8rS8z06NmS
 dWMk7pfvQaslIIaFC3Yo7/sJVa/JJH33FlBonc+MSO8OZz5O6vG4bk9ZHq6DfHUH
 V1I38xiBdYdSXDq8QqT3N9d+CtjeMQ==
 =Lt/v
 -----END PGP SIGNATURE-----

Merge tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Convert to using the generic entry infrastructure.

 - Add vdso time namespace support.

 - Switch s390 and alpha to 64-bit ino_t. As discussed at

     https://lore.kernel.org/linux-mm/YCV7QiyoweJwvN+m@osiris/

 - Get rid of expensive stck (store clock) usages where possible.
   Utilize cpu alternatives to patch stckf when supported.

 - Make tod_clock usage less error prone by converting it to a union and
   rework code which is using it.

 - Machine check handler fixes and cleanups.

 - Drop couple of minor inline asm optimizations to fix clang build.

 - Default configs changes notably to make libvirt happy.

 - Various changes to rework and improve qdio code.

 - Other small various fixes and improvements all over the code.

* tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (68 commits)
  s390/qdio: remove 'merge_pending' mechanism
  s390/qdio: improve handling of PENDING buffers for QEBSM devices
  s390/qdio: rework q->qdio_error indication
  s390/qdio: inline qdio_kick_handler()
  s390/time: remove get_tod_clock_ext()
  s390/crypto: use store_tod_clock_ext()
  s390/hypfs: use store_tod_clock_ext()
  s390/debug: use union tod_clock
  s390/kvm: use union tod_clock
  s390/vdso: use union tod_clock
  s390/time: convert tod_clock_base to union
  s390/time: introduce new store_tod_clock_ext()
  s390/time: rename store_tod_clock_ext() and use union tod_clock
  s390/time: introduce union tod_clock
  s390,alpha: switch to 64-bit ino_t
  s390: split cleanup_sie
  s390: use r13 in cleanup_sie as temp register
  s390: fix kernel asce loading when sie is interrupted
  s390: add stack for machine check handler
  s390: use WRITE_ONCE when re-allocating async stack
  ...
2021-02-21 13:40:06 -08:00
Linus Torvalds 3e10585335 x86:
- Support for userspace to emulate Xen hypercalls
 - Raise the maximum number of user memslots
 - Scalability improvements for the new MMU.  Instead of the complex
   "fast page fault" logic that is used in mmu.c, tdp_mmu.c uses an
   rwlock so that page faults are concurrent, but the code that can run
   against page faults is limited.  Right now only page faults take the
   lock for reading; in the future this will be extended to some
   cases of page table destruction.  I hope to switch the default MMU
   around 5.12-rc3 (some testing was delayed due to Chinese New Year).
 - Cleanups for MAXPHYADDR checks
 - Use static calls for vendor-specific callbacks
 - On AMD, use VMLOAD/VMSAVE to save and restore host state
 - Stop using deprecated jump label APIs
 - Workaround for AMD erratum that made nested virtualization unreliable
 - Support for LBR emulation in the guest
 - Support for communicating bus lock vmexits to userspace
 - Add support for SEV attestation command
 - Miscellaneous cleanups
 
 PPC:
 - Support for second data watchpoint on POWER10
 - Remove some complex workarounds for buggy early versions of POWER9
 - Guest entry/exit fixes
 
 ARM64
 - Make the nVHE EL2 object relocatable
 - Cleanups for concurrent translation faults hitting the same page
 - Support for the standard TRNG hypervisor call
 - A bunch of small PMU/Debug fixes
 - Simplification of the early init hypercall handling
 
 Non-KVM changes (with acks):
 - Detection of contended rwlocks (implemented only for qrwlocks,
   because KVM only needs it for x86)
 - Allow __DISABLE_EXPORTS from assembly code
 - Provide a saner follow_pfn replacements for modules
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmApSRgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOc7wf9FnlinKoTFaSk7oeuuhF/CoCVwSFs
 Z9+A2sNI99tWHQxFR6dyDkEFeQoXnqSxfLHtUVIdH/JnTg0FkEvFz3NK+0PzY1PF
 PnGNbSoyhP58mSBG4gbBAxdF3ZJZMB8GBgYPeR62PvMX2dYbcHqVBNhlf6W4MQK4
 5mAUuAnbf19O5N267sND+sIg3wwJYwOZpRZB7PlwvfKAGKf18gdBz5dQ/6Ej+apf
 P7GODZITjqM5Iho7SDm/sYJlZprFZT81KqffwJQHWFMEcxFgwzrnYPx7J3gFwRTR
 eeh9E61eCBDyCTPpHROLuNTVBqrAioCqXLdKOtO5gKvZI3zmomvAsZ8uXQ==
 =uFZU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "x86:

   - Support for userspace to emulate Xen hypercalls

   - Raise the maximum number of user memslots

   - Scalability improvements for the new MMU.

     Instead of the complex "fast page fault" logic that is used in
     mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent,
     but the code that can run against page faults is limited. Right now
     only page faults take the lock for reading; in the future this will
     be extended to some cases of page table destruction. I hope to
     switch the default MMU around 5.12-rc3 (some testing was delayed
     due to Chinese New Year).

   - Cleanups for MAXPHYADDR checks

   - Use static calls for vendor-specific callbacks

   - On AMD, use VMLOAD/VMSAVE to save and restore host state

   - Stop using deprecated jump label APIs

   - Workaround for AMD erratum that made nested virtualization
     unreliable

   - Support for LBR emulation in the guest

   - Support for communicating bus lock vmexits to userspace

   - Add support for SEV attestation command

   - Miscellaneous cleanups

  PPC:

   - Support for second data watchpoint on POWER10

   - Remove some complex workarounds for buggy early versions of POWER9

   - Guest entry/exit fixes

  ARM64:

   - Make the nVHE EL2 object relocatable

   - Cleanups for concurrent translation faults hitting the same page

   - Support for the standard TRNG hypervisor call

   - A bunch of small PMU/Debug fixes

   - Simplification of the early init hypercall handling

  Non-KVM changes (with acks):

   - Detection of contended rwlocks (implemented only for qrwlocks,
     because KVM only needs it for x86)

   - Allow __DISABLE_EXPORTS from assembly code

   - Provide a saner follow_pfn replacements for modules"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits)
  KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes
  KVM: selftests: Don't bother mapping GVA for Xen shinfo test
  KVM: selftests: Fix hex vs. decimal snafu in Xen test
  KVM: selftests: Fix size of memslots created by Xen tests
  KVM: selftests: Ignore recently added Xen tests' build output
  KVM: selftests: Add missing header file needed by xAPIC IPI tests
  KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c
  KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static
  locking/arch: Move qrwlock.h include after qspinlock.h
  KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests
  KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries
  KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2
  KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
  KVM: PPC: remove unneeded semicolon
  KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB
  KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest
  KVM: PPC: Book3S HV: Fix radix guest SLB side channel
  KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support
  KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR
  KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR
  ...
2021-02-21 13:31:43 -08:00
Linus Torvalds 08179b47e1 Merge branch 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:

 - Optimize parisc page table locks by using the existing
   page_table_lock

 - Export argv0-preserve flag in binfmt_misc for usage in qemu-user

 - Fix interrupt table (IVT) checksum so firmware will call crash
   handler (HPMC)

 - Increase IRQ stack to 64kb on 64-bit kernel

 - Switch to common devmem_is_allowed() implementation

 - Minor fix to get_whan()

* 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  binfmt_misc: pass binfmt_misc flags to the interpreter
  parisc: Optimize per-pagetable spinlocks
  parisc: Replace test_ti_thread_flag() with test_tsk_thread_flag()
  parisc: Bump 64-bit IRQ stack size to 64 KB
  parisc: Fix IVT checksum calculation wrt HPMC
  parisc: Use the generic devmem_is_allowed()
  parisc: Drop out of get_whan() if task is running again
2021-02-21 13:20:41 -08:00
Linus Torvalds 99ca0edb41 arm64 updates for 5.12
- vDSO build improvements including support for building with BSD.
 
  - Cleanup to the AMU support code and initialisation rework to support
    cpufreq drivers built as modules.
 
  - Removal of synthetic frame record from exception stack when entering
    the kernel from EL0.
 
  - Add support for the TRNG firmware call introduced by Arm spec
    DEN0098.
 
  - Cleanup and refactoring across the board.
 
  - Avoid calling arch_get_random_seed_long() from
    add_interrupt_randomness()
 
  - Perf and PMU updates including support for Cortex-A78 and the v8.3
    SPE extensions.
 
  - Significant steps along the road to leaving the MMU enabled during
    kexec relocation.
 
  - Faultaround changes to initialise prefaulted PTEs as 'old' when
    hardware access-flag updates are supported, which drastically
    improves vmscan performance.
 
  - CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55
    (#1024718)
 
  - Preparatory work for yielding the vector unit at a finer granularity
    in the crypto code, which in turn will one day allow us to defer
    softirq processing when it is in use.
 
  - Support for overriding CPU ID register fields on the command-line.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmAmwZcQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLA1B/0XMwWUhmJ4ZPK4sr28YWHNGLuCFHDgkMKU
 dEmS806OF9d0J7fTczGsKdS4IKtXWko67Z0UGiPIStwfm0itSW2Zgbo9KZeDPqPI
 fH0s23nQKxUMyNW7b9p4cTV3YuGVMZSBoMug2jU2DEDpSqeGBk09NPi6inERBCz/
 qZxcqXTKxXbtOY56eJmq09UlFZiwfONubzuCrrUH7LU8ZBSInM/6Q4us/oVm4zYI
 Pnv996mtL4UxRqq/KoU9+cQ1zsI01kt9/coHwfCYvSpZEVAnTWtfECsJ690tr3mF
 TSKQLvOzxbDtU+HcbkNVKW0A38EIO1xXr8yXW9SJx6BJBkyb24xo
 =IwMb
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:

 - vDSO build improvements including support for building with BSD.

 - Cleanup to the AMU support code and initialisation rework to support
   cpufreq drivers built as modules.

 - Removal of synthetic frame record from exception stack when entering
   the kernel from EL0.

 - Add support for the TRNG firmware call introduced by Arm spec
   DEN0098.

 - Cleanup and refactoring across the board.

 - Avoid calling arch_get_random_seed_long() from
   add_interrupt_randomness()

 - Perf and PMU updates including support for Cortex-A78 and the v8.3
   SPE extensions.

 - Significant steps along the road to leaving the MMU enabled during
   kexec relocation.

 - Faultaround changes to initialise prefaulted PTEs as 'old' when
   hardware access-flag updates are supported, which drastically
   improves vmscan performance.

 - CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55
   (#1024718)

 - Preparatory work for yielding the vector unit at a finer granularity
   in the crypto code, which in turn will one day allow us to defer
   softirq processing when it is in use.

 - Support for overriding CPU ID register fields on the command-line.

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits)
  drivers/perf: Replace spin_lock_irqsave to spin_lock
  mm: filemap: Fix microblaze build failure with 'mmu_defconfig'
  arm64: Make CPU_BIG_ENDIAN depend on ld.bfd or ld.lld 13.0.0+
  arm64: cpufeatures: Allow disabling of Pointer Auth from the command-line
  arm64: Defer enabling pointer authentication on boot core
  arm64: cpufeatures: Allow disabling of BTI from the command-line
  arm64: Move "nokaslr" over to the early cpufeature infrastructure
  KVM: arm64: Document HVC_VHE_RESTART stub hypercall
  arm64: Make kvm-arm.mode={nvhe, protected} an alias of id_aa64mmfr1.vh=0
  arm64: Add an aliasing facility for the idreg override
  arm64: Honor VHE being disabled from the command-line
  arm64: Allow ID_AA64MMFR1_EL1.VH to be overridden from the command line
  arm64: cpufeature: Add an early command-line cpufeature override facility
  arm64: Extract early FDT mapping from kaslr_early_init()
  arm64: cpufeature: Use IDreg override in __read_sysreg_by_encoding()
  arm64: cpufeature: Add global feature override facility
  arm64: Move SCTLR_EL1 initialisation to EL-agnostic code
  arm64: Simplify init_el2_state to be non-VHE only
  arm64: Move VHE-specific SPE setup to mutate_to_vhe()
  arm64: Drop early setting of MDSCR_EL2.TPMS
  ...
2021-02-21 13:08:42 -08:00
Linus Torvalds 7b15c27e2f These changes fix MM (soft-)dirty bit management in the procfs code & clean up the API.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmAtAgsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gOnA/9GKJblgi88Qb23YwGKp0OfCMdLfx8FJa+
 dq0AB0jgzc8v2J8IITSs/qo/8o25IE9IPTjTfItn0E0jxz7Y8J16urb/vyWX6O2s
 jb4riT5fIRCXvhv9DooxSQerZePaOJXbHYa2BBk8yqNJPGbd5kr0SUGn3BQnBQhR
 0yAfqjrzBLMGzzSO+kK0nhGQH8BJZgYu94CHNnUZJtWcIb2ZC6lzZ7Lz0zi6ueRJ
 81JblV4NCOC9uy9I9odOwESu2TIxT9afq1C/6COyrKYx3sWY6xPOGQTxYZe1LITE
 lb57/95qc7SOIj7Y3aL4YRSVRYRihEU31qlAltwP4fEnz49qdHJOR1HQmjKVG8xs
 Uaa6kCYFeTKmh4SRRr8ZR/hUkebrFUT+9+db6LmBs/i4Kt09T+ZurXC4jqmUHMFn
 2nYCDH6RX153V1YwcHGkr4OWaUVWZwAZl+t0zIo7o7wQdkoAD75ydecW2R3nLMN7
 p1ofGPXmT8Wh4en8LngBawO/4bBuunezh4L3vpz0/EU3viK5+DRsyNKf+d+Tti28
 XCe7ID0GDGq7nIzSZxuyIxmAbWJxjI+7gWT2WUudrJxJ2rUUxPQms8GsQD54IMh5
 UILv9GMBNuV8iA/2c3B52ff5iFl7kp+SxVS3MRC6zTudIV9VV6bb7WpFb8FLOhsH
 3sEo0qDFab4=
 =qcXO
 -----END PGP SIGNATURE-----

Merge tag 'core-mm-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull tlb gather updates from Ingo Molnar:
 "Theses fix MM (soft-)dirty bit management in the procfs code & clean
  up the TLB gather API"

* tag 'core-mm-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ldt: Use tlb_gather_mmu_fullmm() when freeing LDT page-tables
  tlb: arch: Remove empty __tlb_remove_tlb_entry() stubs
  tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu()
  tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm()
  tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()
  mm: proc: Invalidate TLB after clearing soft-dirty page state
2021-02-21 12:19:56 -08:00
Linus Torvalds 5bbb336ba7 for-5.12/io_uring-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtYbYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppeWD/4xKhzBCGZWOkdycaaPhsUTOjNNIPmCBhlz
 QQj4KFSEuJNKACUg53Ak0oECJTaH5976kjKkKs7Z+hzmkEwboLBI4erkcT9MGC3M
 mPx349qBq9X3sYaFrUJF3h0sjRr+wa60nWQ01oVH8HkfI4bCNCHoqo5jDvMPWsYT
 ksFbUm8YWEZmi0K2yXFWXuJIN2bVBd72a8CrvtF3ksdEMYxbWWTOAcrhYJ4H5/U7
 BQjWIxiIVsAoJohcXWq/Swh8cgvgb5uJVpNUU8VEFob/jI3Gc3YojIToISB6soUL
 DNhDJLeyZjuXfE1Ej+ySas9bpdG4LgxzsDBl9lFl9EQkSo1c3h/lEx85aeixAZla
 QfjTOVUabzdPzvZ9H1yDQISxjVLy2PotnhVMy/rSSrnDKlowtNB9iEzd6cpzFzxU
 fxomz1d6+w8rZY9jaRIAcMNa6bEOuYmcP9V8rIzGeg3Mm3jqL7H/JgJu5s2YbjpN
 InmTNu4cwLeTO65DzqVxF8UGbZ2tHbMm5pNeVBYxuY1adRgJFlIOP5kYlNlyiY+D
 Bt41CRuK3hqpYfXh7nSK8U4BKEhMikTCS0W4aKL5EzLZ20rxjgTlaHZiOBqd9vep
 1tqNjPIvL2jWfF+5shwAZbupj3WKbuVqi4S2jXljv+Wkmk4ZVLSX3fQZv2I7JTHM
 I2qa59PB4A==
 =8MX/
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Highlights from this cycles are things like request recycling and
  task_work optimizations, which net us anywhere from 10-20% of speedups
  on workloads that mostly are inline.

  This work was originally done to put io_uring under memcg, which adds
  considerable overhead. But it's a really nice win as well. Also worth
  highlighting is the LOOKUP_CACHED work in the VFS, and using it in
  io_uring. Greatly speeds up the fast path for file opens.

  Summary:

   - Put io_uring under memcg protection. We accounted just the rings
     themselves under rlimit memlock before, now we account everything.

   - Request cache recycling, persistent across invocations (Pavel, me)

   - First part of a cleanup/improvement to buffer registration (Bijan)

   - SQPOLL fixes (Hao)

   - File registration NULL pointer fixup (Dan)

   - LOOKUP_CACHED support for io_uring

   - Disable /proc/thread-self/ for io_uring, like we do for /proc/self

   - Add Pavel to the io_uring MAINTAINERS entry

   - Tons of code cleanups and optimizations (Pavel)

   - Support for skip entries in file registration (Noah)"

* tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block: (103 commits)
  io_uring: tctx->task_lock should be IRQ safe
  proc: don't allow async path resolution of /proc/thread-self components
  io_uring: kill cached requests from exiting task closing the ring
  io_uring: add helper to free all request caches
  io_uring: allow task match to be passed to io_req_cache_free()
  io-wq: clear out worker ->fs and ->files
  io_uring: optimise io_init_req() flags setting
  io_uring: clean io_req_find_next() fast check
  io_uring: don't check PF_EXITING from syscall
  io_uring: don't split out consume out of SQE get
  io_uring: save ctx put/get for task_work submit
  io_uring: don't duplicate io_req_task_queue()
  io_uring: optimise SQPOLL mm/files grabbing
  io_uring: optimise out unlikely link queue
  io_uring: take compl state from submit state
  io_uring: inline io_complete_rw_common()
  io_uring: move res check out of io_rw_reissue()
  io_uring: simplify iopoll reissuing
  io_uring: clean up io_req_free_batch_finish()
  io_uring: move submit side state closer in the ring
  ...
2021-02-21 11:10:39 -08:00
Linus Torvalds 582cd91f69 for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
 EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
 RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
 Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
 dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
 ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
 rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
 ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
 Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
 wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
 VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
 WC22RGC2MA==
 =os1E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Another nice round of removing more code than what is added, mostly
  due to Christoph's relentless pursuit of tech debt removal/cleanups.
  This pull request contains:

   - Two series of BFQ improvements (Paolo, Jan, Jia)

   - Block iov_iter improvements (Pavel)

   - bsg error path fix (Pan)

   - blk-mq scheduler improvements (Jan)

   - -EBUSY discard fix (Jan)

   - bvec allocation improvements (Ming, Christoph)

   - bio allocation and init improvements (Christoph)

   - Store bdev pointer in bio instead of gendisk + partno (Christoph)

   - Block trace point cleanups (Christoph)

   - hard read-only vs read-only split (Christoph)

   - Block based swap cleanups (Christoph)

   - Zoned write granularity support (Damien)

   - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...
2021-02-21 11:02:48 -08:00
Linus Torvalds 24880bef41 Remove oprofile and dcookies support
The "oprofile" user-space tools don't use the kernel OPROFILE support any more,
 and haven't in a long time. User-space has been converted to the perf
 interfaces.
 
 The dcookies stuff is only used by the oprofile code. Now that oprofile's
 support is getting removed from the kernel, there is no need for dcookies as
 well.
 
 Remove kernel's old oprofile and dcookies support.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJgJMEVAAoJENK5HDyugRIcL8YP/jkmXH5CZT80ntcqrJGWKcG7
 lWbach7uNeQteht7B1ZPKvojxizTkmfrN2sClX0B2hbGkc5TiWUQ2ZSnvnfWDZ8+
 z2qQcEB11G/ReL2vvRk1fJlWdAOyUfrPee/44AkemnLRv+Niw/8PqnGd87yDQGsK
 qy5E1XXfbjUq6Y/uMiLOX3+21I6w6o2Q6I3NNXC93s0wS3awqnft8n0XBC7iAPBj
 eowRJxpdRU2Vcuj8UOzzOI7gQlwdjwYImyLPbRy/V8NawC8a+FHrPrf5/GCYlVzl
 7TGFBsDQSmzvrBChUfoGz1Rq/VZ1a357p5rhRqemfUrdkjW+vyzelnD8I1W/hb2o
 SmBXoPoyl3+UkFHNyJI0mI7obaV+2PzyXMV0JIQUj+IiX/mfeFv0nF4XfZD2IkRt
 6xhaYj775Zrx32iBdGZIvvLg5Gh9ZkZmR5vJ7Fi/EIZFe6Z+bZnPKUROnAgS/o0z
 +UkSygOhgo/1XbqrzZVk1iweWeu+EUMbY4YQv2qVnFhpvsq4ieThcUGQpWcxGjjH
 WP8O0n1yq1slsnpUtxhiTsm46ENajx9zZp6Iv6Ws+NM0RUqjND8BdF1co9WGD3LS
 cnZMFBs4Bg/V1HICL/D4s6L7t1ofrEXIgJH1y3iF0HeECq03mU4CgA/qly9Aebqg
 UxPF3oNlVOPlds9FzsU2
 =I2Ac
 -----END PGP SIGNATURE-----

Merge tag 'oprofile-removal-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux

Pull oprofile and dcookies removal from Viresh Kumar:
 "Remove oprofile and dcookies support

  The 'oprofile' user-space tools don't use the kernel OPROFILE support
  any more, and haven't in a long time. User-space has been converted to
  the perf interfaces.

  The dcookies stuff is only used by the oprofile code. Now that
  oprofile's support is getting removed from the kernel, there is no
  need for dcookies as well.

  Remove kernel's old oprofile and dcookies support"

* tag 'oprofile-removal-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux:
  fs: Remove dcookies support
  drivers: Remove CONFIG_OPROFILE support
  arch: xtensa: Remove CONFIG_OPROFILE support
  arch: x86: Remove CONFIG_OPROFILE support
  arch: sparc: Remove CONFIG_OPROFILE support
  arch: sh: Remove CONFIG_OPROFILE support
  arch: s390: Remove CONFIG_OPROFILE support
  arch: powerpc: Remove oprofile
  arch: powerpc: Stop building and using oprofile
  arch: parisc: Remove CONFIG_OPROFILE support
  arch: mips: Remove CONFIG_OPROFILE support
  arch: microblaze: Remove CONFIG_OPROFILE support
  arch: ia64: Remove rest of perfmon support
  arch: ia64: Remove CONFIG_OPROFILE support
  arch: hexagon: Don't select HAVE_OPROFILE
  arch: arc: Remove CONFIG_OPROFILE support
  arch: arm: Remove CONFIG_OPROFILE support
  arch: alpha: Remove CONFIG_OPROFILE support
2021-02-21 10:40:34 -08:00
Linus Torvalds b52bb135aa New code for 5.12:
- Fix an ABBA deadlock when renaming files on overlayfs.
 - Make sure that we can't overflow the inode extent counters when adding
   to or removing extents from a file.
 - Make directory sgid inheritance work the same way as all the other
   filesystems.
 - Don't drain the buffer cache on freeze and ro remount, which should
   reduce the amount of time if read-only workloads are continuing
   during the freeze.
 - Fix a bug where symlink size isn't reported to the vfs in ecryptfs.
 - Disentangle log cleaning from log covering.  This refactoring sets us
   up for future changes to the log, though for now it simply means that
   we can use covering for freezes, and cleaning becomes something we
   only do at unmount.
 - Speed up file fsyncs by reducing iolock cycling.
 - Fix delalloc blocks leaking when changing the project id fails because
   of input validation errors in FSSETXATTR.
 - Fix oversized quota reservation when converting unwritten extents
   during a DAX write.
 - Create a transaction allocation helper function to standardize the
   idiom of allocating a transaction, reserving blocks, locking inodes,
   and reserving quota.  Replace all the open-coded logic for file
   creation, file ownership changes, and file modifications to use them.
 - Actually shut down the fs if the incore quota reservations get
   corrupted.
 - Fix background block garbage collection scans to not block and to
   actually clean out CoW staging extents properly.
 - Run block gc scans when we run low on project quota.
 - Use the standardized transaction allocation helpers to make it so that
   ENOSPC and EDQUOT errors during reservation will back out, invoke the
   block gc scanner, and try again.  This is preparation for introducing
   background inode garbage collection in the next cycle.
 - Combine speculative post-EOF block garbage collection with speculative
   copy on write block garbage collection.
 - Enable multithreaded quotacheck.
 - Allow sysadmins to tweak the CPU affinities and maximum concurrency
   levels of quotacheck and background blockgc worker pools.
 - Expose the inode btree counter feature in the fs geometry ioctl.
 - Cleanups of the growfs code in preparation for starting work on
   filesystem shrinking.
 - Fix all the bloody gcc warnings that the maintainer knows about. :P
 - Fix a RST syntax error.
 - Don't trigger bmbt corruption assertions after the fs shuts down.
 - Restore behavior of forcing SIGBUS on a shut down filesystem when
   someone triggers a mmap write fault (or really, any buffered write).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAlX/UACgkQ+H93GTRK
 tOta+RAAiGqLKxeY07HH7F98pRJ86j6lU0zmc5i5UCOGMvZd8hLKDdThzggsjqO6
 rrUSc7Ppg7MQt1JdXLSdZw2N6Ksb9yy6chufj+j3Dq1JQfSL4YvBO/LlXmZmFE6d
 80Qbqq6HFSRWb6JzCMr3knhC+FJovAGhFgZYZGBZ817A/FXacTg9/A5Ow8SX81WX
 42s517QOmegAn7YhC3xcPZp5iavjbMd7Y9v7izpuo4FBB9AY7NYyb5wVhvffILfS
 /SMLQPw3T/tccRJuVJ8TfLA9R+B9+LaGmQ5tn/AtdwN+Lv7ykinzGKYLagkdlTmE
 onGkEIwrebEgq9phT47eX7ixiEt7oWQiQGZukXLVn7mL/0WPVI2pbYi/M1BNpi8i
 UftOEVroav+m4h0DF3duOE7rLGuBIEdjPuuAs85QhZ6UTusBjwxp1gOJbjuN0Up9
 9hBGTtYQIRhWxHkxWKAeuYzIbtMxC2S2XGxnW4cNOxbE7GxwfxBw0KP/38ZP4iYQ
 LKt6JVX+iFDQ+lH8JA6DD7+j+m7W37Alu89OPmpW2nYpFyisFDY+1dEIFvPw9roZ
 BtbKlZzS2O2zD67/tTVh+ZcPoEcPfp156GDCrgfgdIdiBvQtGbyOLB/WQC6wSU1L
 2PLt1inFBx5wNrIEMFMHT1hsduRihNMM+eLn6LV5XIK2RmSCT+I=
 =CaLz
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "There's a lot going on this time, which seems about right for this
  drama-filled year.

  Community developers added some code to speed up freezing when
  read-only workloads are still running, refactored the logging code,
  added checks to prevent file extent counter overflow, reduced iolock
  cycling to speed up fsync and gc scans, and started the slow march
  towards supporting filesystem shrinking.

  There's a huge refactoring of the internal speculative preallocation
  garbage collection code which fixes a bunch of bugs, makes the gc
  scheduling per-AG and hence multithreaded, and standardizes the retry
  logic when we try to reserve space or quota, can't, and want to
  trigger a gc scan. We also enable multithreaded quotacheck to reduce
  mount times further. This is also preparation for background file gc,
  which may or may not land for 5.13.

  We also fixed some deadlocks in the rename code, fixed a quota
  accounting leak when FSSETXATTR fails, restored the behavior that
  write faults to an mmap'd region actually cause a SIGBUS, fixed a bug
  where sgid directory inheritance wasn't quite working properly, and
  fixed a bug where symlinks weren't working properly in ecryptfs. We
  also now advertise the inode btree counters feature that was
  introduced two cycles ago.

  Summary:

   - Fix an ABBA deadlock when renaming files on overlayfs.

   - Make sure that we can't overflow the inode extent counters when
     adding to or removing extents from a file.

   - Make directory sgid inheritance work the same way as all the other
     filesystems.

   - Don't drain the buffer cache on freeze and ro remount, which should
     reduce the amount of time if read-only workloads are continuing
     during the freeze.

   - Fix a bug where symlink size isn't reported to the vfs in ecryptfs.

   - Disentangle log cleaning from log covering. This refactoring sets
     us up for future changes to the log, though for now it simply means
     that we can use covering for freezes, and cleaning becomes
     something we only do at unmount.

   - Speed up file fsyncs by reducing iolock cycling.

   - Fix delalloc blocks leaking when changing the project id fails
     because of input validation errors in FSSETXATTR.

   - Fix oversized quota reservation when converting unwritten extents
     during a DAX write.

   - Create a transaction allocation helper function to standardize the
     idiom of allocating a transaction, reserving blocks, locking
     inodes, and reserving quota. Replace all the open-coded logic for
     file creation, file ownership changes, and file modifications to
     use them.

   - Actually shut down the fs if the incore quota reservations get
     corrupted.

   - Fix background block garbage collection scans to not block and to
     actually clean out CoW staging extents properly.

   - Run block gc scans when we run low on project quota.

   - Use the standardized transaction allocation helpers to make it so
     that ENOSPC and EDQUOT errors during reservation will back out,
     invoke the block gc scanner, and try again. This is preparation for
     introducing background inode garbage collection in the next cycle.

   - Combine speculative post-EOF block garbage collection with
     speculative copy on write block garbage collection.

   - Enable multithreaded quotacheck.

   - Allow sysadmins to tweak the CPU affinities and maximum concurrency
     levels of quotacheck and background blockgc worker pools.

   - Expose the inode btree counter feature in the fs geometry ioctl.

   - Cleanups of the growfs code in preparation for starting work on
     filesystem shrinking.

   - Fix all the bloody gcc warnings that the maintainer knows about. :P

   - Fix a RST syntax error.

   - Don't trigger bmbt corruption assertions after the fs shuts down.

   - Restore behavior of forcing SIGBUS on a shut down filesystem when
     someone triggers a mmap write fault (or really, any buffered
     write)"

* tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (85 commits)
  xfs: consider shutdown in bmapbt cursor delete assert
  xfs: fix boolreturn.cocci warnings
  xfs: restore shutdown check in mapped write fault path
  xfs: fix rst syntax error in admin guide
  xfs: fix incorrect root dquot corruption error when switching group/project quota types
  xfs: get rid of xfs_growfs_{data,log}_t
  xfs: rename `new' to `delta' in xfs_growfs_data_private()
  libxfs: expose inobtcount in xfs geometry
  xfs: don't bounce the iolock between free_{eof,cow}blocks
  xfs: expose the blockgc workqueue knobs publicly
  xfs: parallelize block preallocation garbage collection
  xfs: rename block gc start and stop functions
  xfs: only walk the incore inode tree once per blockgc scan
  xfs: consolidate the eofblocks and cowblocks workers
  xfs: consolidate incore inode radix tree posteof/cowblocks tags
  xfs: remove trivial eof/cowblocks functions
  xfs: hide xfs_icache_free_cowblocks
  xfs: hide xfs_icache_free_eofblocks
  xfs: relocate the eofb/cowb workqueue functions
  xfs: set WQ_SYSFS on all workqueues in debug mode
  ...
2021-02-21 10:34:36 -08:00
Linus Torvalds 4f016a316f New code for 5.12:
- Adjust the final parameter of iomap_dio_rw.
 - Add a new flag to request that iomap directio writes return EAGAIN if
   the write is not a pure overwrite within EOF; this will be used to
   reduce lock contention with unaligned direct writes on XFS.
 - Amend XFS' directio code to eliminate exclusive locking for unaligned
   direct writes if the circumstances permit
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAZgQAACgkQ+H93GTRK
 tOtNqw/+KPff1NjQVK2k361R0+LjlEHfe2nxh7+kS10IiR5nbBz4Fu+GwEosZKq+
 H9ficBbZ0wIveV+5CEt2xZLEJFC4LZUpNPVVrUf8XPLKiVexP/U3wtKzmv9Z7D5J
 5walMWQycVeR+ycomynV36giqekvARL7KCQG5By2ITfSNxfnb/wvKhn1d61ZDOF6
 f4xzq7F6+cEOrSZt2LcFzGSfsTl6oakYMAomPU57sqGmw7MHRqoPTErbdh2HnVJy
 yQ47eiZgSKWKA+Qm+VvHHePYCYnu0nvA2rbNerjTN70hnO8rK9S0Vle6Sp5CUqAX
 sXOy8zxOLYKqyM4S/QkIN2TGIyWg+CHiakVLZGF3Q4AUDDYfpD0cHvAe9N3v9euL
 qt8ypT8dz2C3qiTg5E31xy033wlAP0wg3FZiLAqEjL5o3fzD+qbplTiSmYbMV2Fb
 xuu7a2T6u1MHaIn1IhaL0cB49Fzn+5EMyp6BlAucAOakyuqJCyJiXokdk0Looy5e
 jUshvcwWcmHMpI/YYYY6t56KV6tl2exGq5sySY5U6dr8/r5lwc0SI+TrYFG0jTR8
 59DGd5CkKgdBFcuys+eaZDXgr7A4ymkVE+pE0QNDz9UwNP20tLb3dQNlhgxchUgu
 NgPaFgQkoNM3HmQNyU2wX/t1aFlC/doqSkb/96UWQSxq6IrajMU=
 =AR07
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "The big change in this cycle is some new code to make it possible for
  XFS to try unaligned directio overwrites without taking locks. If the
  block is fully written and within EOF (i.e. doesn't require any
  further fs intervention) then we can let the unlocked write proceed.
  If not, we fall back to synchronizing direct writes.

  Summary:

   - Adjust the final parameter of iomap_dio_rw.

   - Add a new flag to request that iomap directio writes return EAGAIN
     if the write is not a pure overwrite within EOF; this will be used
     to reduce lock contention with unaligned direct writes on XFS.

   - Amend XFS' directio code to eliminate exclusive locking for
     unaligned direct writes if the circumstances permit"

* tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: reduce exclusive locking on unaligned dio
  xfs: split the unaligned DIO write code out
  xfs: improve the reflink_bounce_dio_write tracepoint
  xfs: simplify the read/write tracepoints
  xfs: remove the buffered I/O fallback assert
  xfs: cleanup the read/write helper naming
  xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
  xfs: factor out a xfs_ilock_iocb helper
  iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
  iomap: pass a flags argument to iomap_dio_rw
  iomap: rename the flags variable in __iomap_dio_rw
2021-02-21 10:29:20 -08:00
Linus Torvalds f02361639a pstore update for v5.12-rc1
- Fix a CONFIG typo (Jiri Bohac)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmAuz1gACgkQiXL039xt
 wCbHmA/8Dweb4NELHwpMb+7CS0YTtQSkDsZT8PeKnIl3UaoZKQdTcolCm/20wkDx
 Pvh25zu9bCwut9bYtAiMSKTsfvOIJAxLiPpLNtyuukx/MtGo6SUVaxRZ2Cm1HL+c
 uQBCgBXnMGznJPOiU0O6/AH9QFr+HQMuXjRHuIBabNOnat93s4sSSefbSAaCUlFS
 TOTCEIUpTS3w3s0KUGuVsQ8M/OLg/in5JZ+Hbvfq7dejHi/5lQlBDkd8juqzapOL
 BCW4s45NIrnXOFqImFnDx7M6vXB8nt/D8Wf/70Qv7tQgqbFjU7qz0r4HEwGAlghG
 DYr5kr5R58tmqyYPuswJKgxdJi8DvmVI9XMdmST2XRIkgQpFfF9Fi5JrNHKXPDHV
 mFIC58ts17BmAE32jEglYgndYg86YkX+uNYJ/9mZ5s0LEYHzQE1AlNwz7+zlKMQJ
 ByiSs+TRDEWvintB1BXtba5W/IE1uqkieZp4NFmwIddvTGj9DCz0ST3eLhyJX7Jd
 RBR3HH83SOR1T1Nis+rIMSMurgSzMo+XH4w2G0+Bo3KAujqyZAnxu/TczTeEtvR1
 5fbpVG7KBtKOLsCPHGAzOP5YRj/Y1M7toAIANZpgStn5rqgQIvdS2JiWFfYs4MWU
 fdGAO+mTfuXpfZwUt8IVFZ0cC56z6mJ8RRUEHhk+M0N7CdD2Jps=
 =zSj6
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fix from Kees Cook:
 "Fix a CONFIG typo (Jiri Bohac)"

* tag 'pstore-v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: Fix typo in compression option name
2021-02-21 10:27:13 -08:00
Linus Torvalds f7b36dc5cb fsverity updates for 5.12
Add an ioctl which allows reading fs-verity metadata from a file.
 
 This is useful when a file with fs-verity enabled needs to be served
 somewhere, and the other end wants to do its own fs-verity compatible
 verification of the file.  See the commit messages for details.
 
 This new ioctl has been tested using new xfstests I've written for it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYCv/2hQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK6/7AQDRmmnV+G34yGPCWfu8tyjdYvWPyak2
 IA/I+eM6S/F+4QEAkbX6rOwYVhLHN9KSOYyNhJiBchm6xq83J+R8BYh/Kw0=
 =FPNK
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity updates from Eric Biggers:
 "Add an ioctl which allows reading fs-verity metadata from a file.

  This is useful when a file with fs-verity enabled needs to be served
  somewhere, and the other end wants to do its own fs-verity compatible
  verification of the file. See the commit messages for details.

  This new ioctl has been tested using new xfstests I've written for it"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fs-verity: support reading signature with ioctl
  fs-verity: support reading descriptor with ioctl
  fs-verity: support reading Merkle tree with ioctl
  fs-verity: add FS_IOC_READ_VERITY_METADATA ioctl
  fs-verity: don't pass whole descriptor to fsverity_verify_signature()
  fs-verity: factor out fsverity_get_descriptor()
2021-02-21 10:25:24 -08:00
Linus Torvalds 99f1a5872b Highlights:
- Update NFSv2 and NFSv3 XDR decoding functions
 - Further improve support for re-exporting NFS mounts
 - Convert NFSD stats to per-CPU counters
 - Add batch Receive posting to the server's RPC/RDMA transport
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmAYVsAACgkQM2qzM29m
 f5f1Lg/+IBC7Bhnnc8jNr4nv4IntCwwKdx2VzSzQszbN/kkhLZK89u36nZyqp0RB
 Vg3olyS5DseEisMMx0rI0KkHBz7pz+kXVdOGvve8fHBZvewnJ/FpxNZPChG4aMDc
 mfjHLvDHO0/GoUqSftrBrjSEJ2jHoNdDcmvzgdAlugTuLOjGX3HhmKa3ZYVTNgFn
 kDmFMaEHjS3pb3LqNDHNIYYpNnvtIukxHUh9weDvr+AH8Rmt/WVfjDc26xBS0FQu
 jDJUk9AP06VYgZx0dLKp4In8GJYwz9DNjNrWm91+RyJml9AWrFswdBHHcfi0W/Yy
 GipkBZGYE6ZblyMlITZCB4etyHQsq7qLuqicTlcXjL/Fdkd7xlT8DwFlZ8LjpyCU
 LeHTI2cGzRSJ/JjL2hvhPvT3gR5hln/qk17jSP7V4S6psZAqAEvw/Xa/+MDJhB/b
 vnzltFPvEgZc59Q/SJLbaWZLHy1q0enbrOBLMZDmUlk911/tgAuflHJM60N8o732
 vkfy05pvZlrV0cFY546pQd7zTKZcAOYPVHHoP25wPa2ibKBu6eQ6kZEi5zu+tVK3
 CkvqIhePFspBMQ6GOPKixTiFV4KFoO1HBtk+JEeMkiHXHk1xATCWbg1m7wkaagsq
 NNS/qFkLRnftGYpFViBaxTFBGxiBOSbsTIS/zfj5L7JOpW4FRD4=
 =02xw
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:

 - Update NFSv2 and NFSv3 XDR decoding functions

 - Further improve support for re-exporting NFS mounts

 - Convert NFSD stats to per-CPU counters

 - Add batch Receive posting to the server's RPC/RDMA transport

* tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (65 commits)
  nfsd: skip some unnecessary stats in the v4 case
  nfs: use change attribute for NFS re-exports
  NFSv4_2: SSC helper should use its own config.
  nfsd: cstate->session->se_client -> cstate->clp
  nfsd: simplify nfsd4_check_open_reclaim
  nfsd: remove unused set_client argument
  nfsd: find_cpntf_state cleanup
  nfsd: refactor set_client
  nfsd: rename lookup_clientid->set_client
  nfsd: simplify nfsd_renew
  nfsd: simplify process_lock
  nfsd4: simplify process_lookup1
  SUNRPC: Correct a comment
  svcrdma: DMA-sync the receive buffer in svc_rdma_recvfrom()
  svcrdma: Reduce Receive doorbell rate
  svcrdma: Deprecate stat variables that are no longer used
  svcrdma: Restore read and write stats
  svcrdma: Convert rdma_stat_sq_starve to a per-CPU counter
  svcrdma: Convert rdma_stat_recv to a per-CPU counter
  svcrdma: Refactor svc_rdma_init() and svc_rdma_clean_up()
  ...
2021-02-21 10:22:20 -08:00
Linus Torvalds 681e2abe21 Changes since last update:
- fix shift-out-of-bounds of crafted blkszbits generated by syzkaller;
 
  - ensure initialized fields can only be observed after bit is set.
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYC5qFBUcaHNpYW5na2Fv
 QHJlZGhhdC5jb20ACgkQOTcx3B+15gT4PwD/W8BGqC3/uBC6qGJuNkRteFmaIDvB
 EplXizcZ+6ennkkBAIbbEsFx8K3TM/tg45YqV+ebjRbsH4NG1owVqb8ZAc0M
 =Ni8F
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "This contains a somewhat important but rarely reproduced fix reported
  month ago for platforms which have weak memory model (e.g. arm64).

  The root cause is that test_bit/set_bit atomic operations are actually
  implemented in relaxed forms, and uninitialized fields governed by an
  atomic bit could be observed in advance due to memory reordering thus
  memory barrier pairs should be used.

  There is also a trivial fix of crafted blkszbits generated by
  syzkaller.

  Summary:

   - fix shift-out-of-bounds of crafted blkszbits generated by syzkaller

   - ensure initialized fields can only be observed after bit is set"

* tag 'erofs-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: initialized fields can only be observed after bit is set
  erofs: fix shift-out-of-bounds of blkszbits
2021-02-21 10:19:34 -08:00
Linus Torvalds 8b42fe123b f2fs-for-5.12-rc1
We've added two major features: 1) compression level and 2) checkpoint_merge, in
 this round. 1) compression level expands 'compress_algorithm' mount option to
 accept parameter as format of <algorithm>:<level>, by this way, it gives a way
 to allow user to do more specified config on lz4 and zstd compression level,
 then f2fs compression can provide higher compress ratio. 2) checkpoint_merge
 creates a kernel daemon and makes it to merge concurrent checkpoint requests as
 much as possible to eliminate redundant checkpoint issues. Plus, we can
 eliminate the sluggish issue caused by slow checkpoint operation when the
 checkpoint is done in a process context in a cgroup having low i/o budget and
 cpu shares.
 
 Enhancement:
  - add compress level for lz4 and zstd in mount option
  - checkpoint_merge mount option
  - deprecate f2fs_trace_io
 
 Bug fix:
  - flush data when enabling checkpoint back
  - handle corner cases of mount options
  - missing ACL update and lock for I_LINKABLE flag
  - attach FIEMAP_EXTENT_MERGED in f2fs_fiemap
  - fix potential deadlock in compression flow
  - fix wrong submit_io condition
 
 As usual, we've cleaned up many code flows and fixed minor bugs.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmAtdrAACgkQQBSofoJI
 UNLvLg//XWERjTZ3tfHHLtNcIkNCd2WaKXwpanTXJsn0kVUc6H5m8lqkutn5Vh/z
 ZAtQE89aqwbw/FPQQl6jEA/aHhXAnCBbXS0Rjx7QFwlqs+772H10VLvdNXewgvJB
 r/u7CIlxbmu3p6ZLSG/a8uJe3CMimJe4lrswjnFlLYgKiho40tcQL8qfQEtkNQSF
 +MV2npS7ka4x/PenFykVbTI0OcwOpblpgkpjgfl5A9bcOsGbli+1qzcasbcX9z9k
 20TwZqk5q7rZHVDjvtYERSyS9mmn3fzEJStK4sdZ6uk+EKxyC+KNHrv9cKwemTCm
 ZATR/YBJKeYhjYppyYLLTRp5eL08PBNgE15SmnkVRjMcAiFxM689WfShrIVhBaf1
 dRr9DxAMLuFSiwFuLBLE/8yMwed38RH9e0RrfQRVjj8Zs2kHcUdwD1WqyDg7omS8
 NuH776LhJSsSVgC8ZKTacQgX8l2NvsjAigeBj/6v4o0lzr1msn2ADpQ9Bww9Iqtt
 lv/09350ww78UV+ipLlVSHw4rl8sebatMUSHtmF4SP7U7Jqv2MaGhNAteWlCklmV
 0cTzjEueiuvmrmkiphTHtl1fHHDVCE0xtScpoylchPVd8bal0pVq4XbZLmGsQwDt
 9V9qOebt2xLmx9EXDyqdRWRbDrtE0FG/AZiN8Q0VcJSzUI/ATx8=
 =+/7T
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "We've added two major features: 1) compression level and 2)
  checkpoint_merge, in this round.

  Compression level expands 'compress_algorithm' mount option to accept
  parameter as format of <algorithm>:<level>, by this way, it gives a
  way to allow user to do more specified config on lz4 and zstd
  compression level, then f2fs compression can provide higher compress
  ratio.

  checkpoint_merge creates a kernel daemon and makes it to merge
  concurrent checkpoint requests as much as possible to eliminate
  redundant checkpoint issues. Plus, we can eliminate the sluggish issue
  caused by slow checkpoint operation when the checkpoint is done in a
  process context in a cgroup having low i/o budget and cpu shares.

  Enhancements:
   - add compress level for lz4 and zstd in mount option
   - checkpoint_merge mount option
   - deprecate f2fs_trace_io

  Bug fixes:
   - flush data when enabling checkpoint back
   - handle corner cases of mount options
   - missing ACL update and lock for I_LINKABLE flag
   - attach FIEMAP_EXTENT_MERGED in f2fs_fiemap
   - fix potential deadlock in compression flow
   - fix wrong submit_io condition

  As usual, we've cleaned up many code flows and fixed minor bugs"

* tag 'f2fs-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (32 commits)
  Documentation: f2fs: fix typo s/automaic/automatic
  f2fs: give a warning only for readonly partition
  f2fs: don't grab superblock freeze for flush/ckpt thread
  f2fs: add ckpt_thread_ioprio sysfs node
  f2fs: introduce checkpoint_merge mount option
  f2fs: relocate inline conversion from mmap() to mkwrite()
  f2fs: fix a wrong condition in __submit_bio
  f2fs: remove unnecessary initialization in xattr.c
  f2fs: fix to avoid inconsistent quota data
  f2fs: flush data when enabling checkpoint back
  f2fs: deprecate f2fs_trace_io
  f2fs: Remove readahead collision detection
  f2fs: remove unused stat_{inc, dec}_atomic_write
  f2fs: introduce sb_status sysfs node
  f2fs: fix to use per-inode maxbytes
  f2fs: compress: fix potential deadlock
  libfs: unexport generic_ci_d_compare() and generic_ci_d_hash()
  f2fs: fix to set/clear I_LINKABLE under i_lock
  f2fs: fix null page reference in redirty_blocks
  f2fs: clean up post-read processing
  ...
2021-02-21 10:09:32 -08:00
Linus Torvalds 6f3952cbe0 for-5.12-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyGEACgkQxWXV+ddt
 WDuU6BAAhfI5BndMm6a1LooMsBHTR7Mh/aFXZEKX7vCDRnrkr+WiihDFhXu4tH3y
 arRsdwMnJCnta2/JMI5xCZZRg9Bsb/Sa0qWoR9sDBVoGRMnE1DS5YHQyv0bfJYk0
 qYOW/jorBV1n/hL19+WbDFajwajP86uGtlDKV7cJ/C3lIogQma7zQ7ygwxbDcZqm
 ZQVHg7ooM4P1t7EV0eDlatxn0Sm8KFkxXD7dbu37qDLWr3Aw8N4IwT7I9h4b+/tg
 hL4dqMPxX6AyRiI0VBsqKnmcRWtT9cN7yw0+J+/JK5KuaFFx3qyZZ+EQu1jAGZDt
 2m432YKya8LQfyBuSe8uoCIcczhGoD0EPIhspecDMfWTvxdo+AeTJZzZzj3u1y+v
 3pih+gBN1sa8vRVSX08mIBF/k0pPfxRu7gIjvl4wl18bm3Khq5VJ93ImP7DNroNg
 bKiUG35K+kvXGBNaLY71zZfO6aLMddK73aDudSbYOS8XcbKhor1G8j5o5/EkcVQA
 wio4Gw5BmfVeRuXOl2h1aEXThk+469s0DR7MiMiAA6917cUjQiFUgFOaogR0XY3S
 8ffX+S50AFW834J0eIGHPLmzi70WwSSXCS2q+zl87PPRK5+jCp9ZzWGi9MGG1qdh
 fp7XVMkzHVSKGK5GXB+ICUfzkShxfTCh+EbxcXIulONxsEdADsc=
 =0O6r
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This brings updates of space handling, performance improvements or bug
  fixes. The subpage block size and zoned mode features have reached
  state where they're usable but with limitations.

  Performance or related:

   - do not block on deleted block group mutex in the cleaner, avoids
     some long stalls

   - improved flushing: make it work better with ticket space
     reservations and avoid excessive transaction commits in some
     scenarios, slightly improves throughput for random write load

   - preemptive background flushing: separate the logic from ticket
     reservations, improve the accounting and decisions when to flush in
     low space conditions

   - less lock contention related to running delayed refs, let just one
     thread do the flushing when there are many inside transaction
     commit

   - dbench workload improvements: avoid unnecessary work when logging
     inodes, fewer fallbacks to transaction commit and thus less waiting
     for it (+7% throughput, -20% latency)

  Core:

   - subpage block size
      - currently read-only support
      - refactor and generalize code where sectorsize is assumed to be
        page size, add the subpage handling everywhere
      - the read-write support is on the way, page sizes are still
        limited to 4K or 64K

   - zoned mode, first working version but with limitations
      - SMR/ZBC/ZNS friendly allocation mode, utilizing the "no fixed
        location for structures" and chunked allocation
      - superblock as the only fixed data structure needs special
        handling, uses 2 consecutive zones as a ring buffer
      - tree-log support with a dedicated block group to avoid unordered
        writes
      - emulated zones on non-zoned devices
      - not yet working
      - all non-single block group profiles, requires more zone write
        pointer synchronization between the multiple block groups
      - fitrim due to dependency on space cache, can be implemented

  Fixes:

   - ref-verify: proper tree owner and node level tracking

   - fix pinned byte accounting, causing some early ENOSPC now more
     likely due to other changes in delayed refs

  Other:

   - error handling fixes and improvements

   - more error injection points

   - more function documentation

   - more and updated tracepoints

   - subset of W=1 checked by default

   - update comments to allow more automatic kdoc parameter checks"

* tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (144 commits)
  btrfs: zoned: enable to mount ZONED incompat flag
  btrfs: zoned: deal with holes writing out tree-log pages
  btrfs: zoned: reorder log node allocation on zoned filesystem
  btrfs: zoned: serialize log transaction on zoned filesystems
  btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
  btrfs: split alloc_log_tree()
  btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
  btrfs: zoned: enable relocation on a zoned filesystem
  btrfs: zoned: support dev-replace in zoned filesystems
  btrfs: zoned: implement copying for zoned device-replace
  btrfs: zoned: implement cloning for zoned device-replace
  btrfs: zoned: mark block groups to copy for device-replace
  btrfs: zoned: do not use async metadata checksum on zoned filesystems
  btrfs: zoned: wait for existing extents before truncating
  btrfs: zoned: serialize metadata IO
  btrfs: zoned: introduce dedicated data write path for zoned filesystems
  btrfs: zoned: enable zone append writing for direct IO
  btrfs: zoned: use ZONE_APPEND write for zoned mode
  btrfs: save irq flags when looking up an ordered extent
  btrfs: zoned: cache if block group is on a sequential zone
  ...
2021-02-21 10:00:39 -08:00
Linus Torvalds f9d58de231 affs-for-5.12-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyNYACgkQxWXV+ddt
 WDt9Ow/9Fsulw3gwsgTzuhlM08Ax7uJWhYSvq7hdg4kfrwCgsR7gI6BalIErstOz
 R8pxiRwXLI6C3muQGUHVTYa7t9IkYqhYfE6hTNtFYlpomVwZPm0URkwAnbwkL+VK
 rL94bimLtsbvkdMI17rHSvQ5wEEvEUGZBF2Jvy3s2sx3P1tt6nFHFf51alIKY+Lv
 u4J3/8otevd+nGRKeMahUOV2v4ssTTcASGLPudvRAIj3g+nAjM/ODTeopN7SBvnd
 b708r4e5HsPXCSW+aN2E2IwrwOiNrcezSgQsl6xtUobvBcTjeFoEGnbgNK8FTepr
 GaE2sJnHhH2+ZhSph21iMONVFY34hJJwl26SrixjfhGh+88QsgHD91dypkPfPKMn
 2TLiCpmPg95UCBmElSJubgqOAC2KT/rwTN4dob7G+mFwEKSza2Oqc4dBVrB5rWiW
 bYyexkobZt83ybwgL1ySiyA3t9GZiuDpORylE1rXB28KfQbHDaCwOgtc4qV6TJbr
 z4F9ya+Yoop3/1M1xbknuA9AtPykmnAjxK96NKEeAiWpzCrcnP0PFQ4Vh1tHRQoY
 yhE3mEaAHgMbEa9N+9gO8RyJSzqlqvneA2kgoTQoFfcUWoGdgzk6d1dWJmvZuUT1
 I3K+K48E+2Cwq0aewCPUv44z8N/NmgDK+vDRR/U3cXG6RlJUkJM=
 =Eo74
 -----END PGP SIGNATURE-----

Merge tag 'affs-for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull AFFS fix from David Sterba:
 "One minor fix for error handling in rename exchange"

* tag 'affs-for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  fs/affs: release old buffer head on error path
2021-02-21 09:59:09 -08:00
Linus Torvalds d88e8b67a6 A few jfs fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAmAqmbYACgkQNqiEXrVA
 jGSf9A/7BatjusxdjAQ3pF8xROzVl3+xZR/uG7NZFvhh7X8/Gm+/DIPH4MIpI99s
 gaQLCOPaBz3s7ZAK+LYMyJ6/Ko0e1tgBWXCVNdm24bc5ETNbT68NNWiqsKw+HG/q
 iVf/2n2PIgeDWjwkXfuOnYK6vpisl6l5gst8d2aIorPHk2oE9qTvylxmTBg114dP
 6gJEyNnokrqo9oVPoEGwFsDIOigM0QSrreiBtzb5+8nWxd366VoOh8zznehPjGAs
 C2MiKxQYOTub7AcyKdnuOwrWjmWiHHhkXq2w33QVZKSVU2m0Uoa7XkA75n+PE6GT
 ypxUopxNZmQu3WB7BzkoZB6zsNHdyCbp9RdFtzLO2o1eKj8B2yvSTrp8TmSd8ReM
 4Wi3CjVjVQcGyFgbng6071h5eRfXpxuFg4blGscFnttkHGaKNGtmklhie2qQAPiJ
 ToV1bdam7CuvlMsOSX+DSFM7ZZbnLFlvcD8eDAztMKPWim1qgkMiY6tumSLPAGrj
 9N02IIET9Iixx0BE9/HeauU3/0CTbgNwRBqBTqwBBYH9RTER4B3/+4ouWM7aLsNJ
 ky/d4IB+QGXgVTbNj+FCo2dyCc3tLy/TZvY/uIq7QBNEqTuLmGwGl71BuZIvWYEV
 hM4oHmV//ncgBFDM8a+cWp+saDkI2CRVJTAn/pd1vIPZRWgVwgM=
 =ESGZ
 -----END PGP SIGNATURE-----

Merge tag 'jfs-5.12' of git://github.com/kleikamp/linux-shaggy

Pull jfs updates from David Kleikamp:
 "A few jfs fixes"

* tag 'jfs-5.12' of git://github.com/kleikamp/linux-shaggy:
  fs/jfs: fix potential integer overflow on shift of a int
  jfs: turn diLog(), dataLog() and txLog() into void functions
  JFS: more checks for invalid superblock
2021-02-21 09:57:30 -08:00
Linus Torvalds 961a9b512d fcntl() fix for v5.12
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmAqXhITHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFc2CEAC2WgxNFYXUINTo8FzmgYquLrVfj04X
 ecXUJwOJBUQjg+F46OENufh0uREI9DmwlW9RWQAwiVBecLK24vz0WBhKOi/88JhG
 8S1I2YL3zIBbnOyBKwAiuK7y3uAQswvKRFRzaY7+aFxVvagDO2YC0l4QCdg3WDp/
 n9es8OksUR04ztMYLn6qT1xHb1pWXUmHeYiGzmhgWBwyPygs5OxSP+y2qmDkj08l
 o64f3GdUZivF6tT7m7rBDrx9pzUha8oqEw8+LDgiUEaq7ZeMVxHSuFVNHW7fCWVH
 ICLfeZPUEZgdMD0w2v5+z/jpy8H4tm2bWNtOWxba1uQoUj5cHrPVuYXSSU1rt5SP
 +yHCSyr4eEfR211d7j/U+v/O+WwJCFHRxzE9PdUpi6VlMnuTVkBhrbSGMtBiQRv7
 UUwXN3JLRPO63d1D2rfpqxMspZpp5e70TpWKXYLQ69Fl1j0GcF1eLfnKsHPZld8C
 Uqfa+CUwRDJKEpnprVn0BOHUlWoPHu4pUIz/gf52pN2v+mTAziZA7WHdxR30V8Pm
 H1VAhRX+rPNXsjHzc9TuQK+IsaDenKHRyBOrteBS0TT1hBLF+pe0ocOVgMSP+H3w
 p0BL3bVf6gToKRZMnT5+L5GA0Zp1PIQCODyjfSRxQGtNNumnGr/vmZsGka0j3gIW
 JO6I+6fsEr0TEg==
 =hsmB
 -----END PGP SIGNATURE-----

Merge tag 'locks-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull fcntl fix from Jeff Layton.

* tag 'locks-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fcntl: make F_GETOWN(EX) return 0 on dead owner task
2021-02-21 09:54:02 -08:00
Linus Torvalds c57b1f0a5f Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull namei updates from Al Viro:
 "Most of that pile is LOOKUP_CACHED series; the rest is a couple of
  misc cleanups in the general area...

  There's a minor bisect hazard in the end of series, and normally I
  would've just folded the fix into the previous commit, but this branch
  is shared with Jens' tree, with stuff on top of it in there, so that
  would've required rebases outside of vfs.git"

* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix handling of nd->depth on LOOKUP_CACHED failures in try_to_unlazy*
  fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
  fs: add support for LOOKUP_CACHED
  saner calling conventions for unlazy_child()
  fs: make unlazy_walk() error handling consistent
  fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast()
  do_tmpfile(): don't mess with finish_open()
2021-02-21 09:42:18 -08:00
Linus Torvalds 591fd30eee Merge branch 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ELF compat updates from Al Viro:
 "Sanitizing ELF compat support, especially for triarch architectures:

   - X32 handling cleaned up

   - MIPS64 uses compat_binfmt_elf.c both for O32 and N32 now

   - Kconfig side of things regularized

  Eventually I hope to have compat_binfmt_elf.c killed, with both native
  and compat built from fs/binfmt_elf.c, with -DELF_BITS={64,32} passed
  by kbuild, but that's a separate story - not included here"

* 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  get rid of COMPAT_ELF_EXEC_PAGESIZE
  compat_binfmt_elf: don't bother with undef of ELF_ARCH
  Kconfig: regularize selection of CONFIG_BINFMT_ELF
  mips compat: switch to compat_binfmt_elf.c
  mips: don't bother with ELF_CORE_EFLAGS
  mips compat: don't bother with ELF_ET_DYN_BASE
  mips: KVM_GUEST makes no sense for 64bit builds...
  mips: kill unused definitions in binfmt_elf[on]32.c
  mips binfmt_elf*32.c: use elfcore-compat.h
  x32: make X32, !IA32_EMULATION setups able to execute x32 binaries
  [amd64] clean PRSTATUS_SIZE/SET_PR_FPVALID up properly
  elf_prstatus: collect the common part (everything before pr_reg) into a struct
  binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID
2021-02-21 09:29:23 -08:00
Linus Torvalds 054560e961 Merge branch 'work.sendfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sendfile updates from Al Viro:
 "Make sendfile() to pipe destination do the right thing, should make
  'fs/pipe: allow sendfile() to pipe again' redundant"

* 'work.sendfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  teach sendfile(2) to handle send-to-pipe directly
  take the guts of file-to-pipe splice into a helper function
  do_splice_to(): move the logics for limiting the read length in
2021-02-21 09:25:32 -08:00
Linus Torvalds 584ce3c9b4 SoC platform removal
There are a lot of platforms that have not seen any interesting code
 changes in the past five years or more.
 
 I made a list and asked around which ones are no longer in use [1], and
 received confirmation about six ARM platforms and the TI C6x architecture
 that have all reached the end of their life upstream, with no known
 users remaining:
 
  - efm32 -- added in 2011, first Cortex-M, no notable changes after 2013
  - picoxcell -- added in 2011, abandoned after 2012 acquisition
  - prima2 -- added in 20111, no notable changes since 2015
  - tango -- added in 2015, sporadic changes until 2017, but abandoned
  - u300 -- added in 2009, no notable changes since 2013
  - zx --added in 2015 for both 32, 2017 for 64 bit, no notable changes
  - arch/c6x -- added in 2011, but work stalled soon after that
 
 A number of other platforms on the original list turned out to still
 have users. In some cases there are out-of-tree patches and users
 that plan to contribute them in the future, in other cases the code
 is complete and works reliably.
 
 [1] https://lore.kernel.org/lkml/CAK8P3a2DZ8xQp7R=H=wewHnT2=a_=M53QsZOueMVEf7tOZLKNg@mail.gmail.com/
 
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmApiR8ACgkQYKtH/8kJ
 Uifl7A//RZVyxUSlbD/StS6oEOmkZH8j0L7yeYOKkSHGZI+6Dqxo6rooKymbeflk
 jJvDVQqLcrclT/7rWsKesdN8aW+ilfWrby5nDsWivsROrTw3DdvZgkjh7KYz7tA/
 OxygKQu4W9I+ywJltR4ykTUxXohjU+duHPuZJawQk64xE3Q0MWxJlQQ2kHJYVJRu
 /rWgNDQaI2d8HFhhEVsn4PC0RLWfUuBevKEuRYqZwM/oB/HuYjY+uTUGe2RhlgWb
 sbcoD93JP2MghSypq33/UtEl4Uk7Wpdv2bshTTv8DL5ToltY7wD8qIIh+aSJk9hP
 0FG3NTia7e9dqQQR2bskspGxP73iIuSN1exAbm/Ten5sysy6IsESmzqZRxXv+7Z1
 q1Oyc4wYaotJPAxMOE00RMLiRa5domI8V6Y10I5uyOcmpRvwWK2WfCOE7D3WSQ5M
 i1JiqLnC5JtJ0vyVBeRKo99zZImeXXrmS0n+fcARGtcKwAqKSvKxFcLTmkj3KqHv
 L4Xgy5f83QrMZWmldX7IiwWjTar2geBM7pFgG/z3R6JqkaxWiDHxyok6j1WUCE7b
 MViRZ8wT7JC5sIkHuwXZ4jvAXPqHq6J1rmJreU6N/jzmv/PTQoUnQ3C/MbDNhuv8
 NDVSRgrPcd/T0BrBkzIWk3t+Oh6ikDgflWsWkqIRFG0vCNx+KdM=
 =pf3b
 -----END PGP SIGNATURE-----

Merge tag 'arm-platform-removal-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull ARM SoC platform removals from Arnd Bergmann:
 "There are a lot of platforms that have not seen any interesting code
  changes in the past five years or more.

  I made a list and asked around which ones are no longer in use, and
  received confirmation about six ARM platforms and the TI C6x
  architecture that have all reached the end of their life upstream,
  with no known users remaining:

   - efm32 - added in 2011, first Cortex-M, no notable changes after 2013

   - picoxcell - added in 2011, abandoned after 2012 acquisition

   - prima2 - added in 20111, no notable changes since 2015

   - tango - added in 2015, sporadic changes until 2017, but abandoned

   - u300 - added in 2009, no notable changes since 2013

   - zx - added in 2015 for both 32, 2017 for 64 bit, no notable changes

   - arch/c6x - added in 2011, but work stalled soon after that

  A number of other platforms on the original list turned out to still
  have users. In some cases there are out-of-tree patches and users that
  plan to contribute them in the future, in other cases the code is
  complete and works reliably"

Link: https://lore.kernel.org/lkml/CAK8P3a2DZ8xQp7R=H=wewHnT2=a_=M53QsZOueMVEf7tOZLKNg@mail.gmail.com/

* tag 'arm-platform-removal-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  ARM: remove u300 platform
  ARM: remove tango platform
  ARM: remove zte zx platform
  ARM: remove sirf prima2/atlas platforms
  c6x: remove architecture
  MAINTAINERS: Remove deleted platform efm32
  ARM: drop efm32 platform
  ARM: Remove PicoXcell platform support
  ARM: dts: Remove PicoXcell platforms
2021-02-20 18:16:30 -08:00
Pavel Begunkov ebf4a5db69 io_uring: fix leaving invalid req->flags
sqe->flags are subset of req flags, so incorrectly copied may span into
in-kernel flags and wreck havoc, e.g. by setting REQ_F_INFLIGHT.

Fixes: 5be9ad1e42 ("io_uring: optimise io_init_req() flags setting")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-20 19:02:45 -07:00
Pavel Begunkov 88f171ab77 io_uring: wait potential ->release() on resurrect
There is a short window where percpu_refs are already turned zero, but
we try to do resurrect(). Play nicer and wait for ->release() to happen
in this case and proceed as everything is ok. One downside for ctx refs
is that we can ignore signal_pending() on a rare occasion, but someone
else should check for it later if needed.

Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-20 19:02:45 -07:00
Pavel Begunkov f2303b1f82 io_uring: keep generic rsrc infra generic
io_rsrc_ref_quiesce() is a generic resource function, though now it
was wired to allocate and initialise ref nodes with file-specific
callbacks/etc. Keep it sane by passing in as a parameters everything we
need for initialisations, otherwise it will hurt us badly one day.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-20 19:02:45 -07:00
Pavel Begunkov e6cb007c45 io_uring: zero ref_node after killing it
After a rsrc/files reference node's refs are killed, it must never be
used. And that's how it works, it either assigns a new node or kills the
whole data table.

Let's explicitly NULL it, that shouldn't be necessary, but if something
would go wrong I'd rather catch a NULL dereference to using a dangling
pointer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-20 19:02:45 -07:00
Jens Axboe 99a1008164 io_uring: make the !CONFIG_NET helpers a bit more robust
With the prep and prep async split, we now have potentially 3 helpers
that need to be defined for !CONFIG_NET. Add some helpers to do just
that.

Fixes the following compile error on !CONFIG_NET:

fs/io_uring.c:6171:10: error: implicit declaration of function
'io_sendmsg_prep_async'; did you mean 'io_req_prep_async'?
[-Werror=implicit-function-declaration]
   return io_sendmsg_prep_async(req);
             ^~~~~~~~~~~~~~~~~~~~~
	     io_req_prep_async

Fixes: 93642ef884 ("io_uring: split sqe-prep and async setup")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-20 19:02:45 -07:00
Hao Xu 8bad28d8a3 io_uring: don't hold uring_lock when calling io_run_task_work*
Abaci reported the below issue:
[  141.400455] hrtimer: interrupt took 205853 ns
[  189.869316] process 'usr/local/ilogtail/ilogtail_0.16.26' started with executable stack
[  250.188042]
[  250.188327] ============================================
[  250.189015] WARNING: possible recursive locking detected
[  250.189732] 5.11.0-rc4 #1 Not tainted
[  250.190267] --------------------------------------------
[  250.190917] a.out/7363 is trying to acquire lock:
[  250.191506] ffff888114dbcbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __io_req_task_submit+0x29/0xa0
[  250.192599]
[  250.192599] but task is already holding lock:
[  250.193309] ffff888114dbfbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_register+0xad/0x210
[  250.194426]
[  250.194426] other info that might help us debug this:
[  250.195238]  Possible unsafe locking scenario:
[  250.195238]
[  250.196019]        CPU0
[  250.196411]        ----
[  250.196803]   lock(&ctx->uring_lock);
[  250.197420]   lock(&ctx->uring_lock);
[  250.197966]
[  250.197966]  *** DEADLOCK ***
[  250.197966]
[  250.198837]  May be due to missing lock nesting notation
[  250.198837]
[  250.199780] 1 lock held by a.out/7363:
[  250.200373]  #0: ffff888114dbfbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_register+0xad/0x210
[  250.201645]
[  250.201645] stack backtrace:
[  250.202298] CPU: 0 PID: 7363 Comm: a.out Not tainted 5.11.0-rc4 #1
[  250.203144] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  250.203887] Call Trace:
[  250.204302]  dump_stack+0xac/0xe3
[  250.204804]  __lock_acquire+0xab6/0x13a0
[  250.205392]  lock_acquire+0x2c3/0x390
[  250.205928]  ? __io_req_task_submit+0x29/0xa0
[  250.206541]  __mutex_lock+0xae/0x9f0
[  250.207071]  ? __io_req_task_submit+0x29/0xa0
[  250.207745]  ? 0xffffffffa0006083
[  250.208248]  ? __io_req_task_submit+0x29/0xa0
[  250.208845]  ? __io_req_task_submit+0x29/0xa0
[  250.209452]  ? __io_req_task_submit+0x5/0xa0
[  250.210083]  __io_req_task_submit+0x29/0xa0
[  250.210687]  io_async_task_func+0x23d/0x4c0
[  250.211278]  task_work_run+0x89/0xd0
[  250.211884]  io_run_task_work_sig+0x50/0xc0
[  250.212464]  io_sqe_files_unregister+0xb2/0x1f0
[  250.213109]  __io_uring_register+0x115a/0x1750
[  250.213718]  ? __x64_sys_io_uring_register+0xad/0x210
[  250.214395]  ? __fget_files+0x15a/0x260
[  250.214956]  __x64_sys_io_uring_register+0xbe/0x210
[  250.215620]  ? trace_hardirqs_on+0x46/0x110
[  250.216205]  do_syscall_64+0x2d/0x40
[  250.216731]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  250.217455] RIP: 0033:0x7f0fa17e5239
[  250.218034] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[  250.220343] RSP: 002b:00007f0fa1eeac48 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
[  250.221360] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0fa17e5239
[  250.222272] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000008
[  250.223185] RBP: 00007f0fa1eeae20 R08: 0000000000000000 R09: 0000000000000000
[  250.224091] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  250.224999] R13: 0000000000021000 R14: 0000000000000000 R15: 00007f0fa1eeb700

This is caused by calling io_run_task_work_sig() to do work under
uring_lock while the caller io_sqe_files_unregister() already held
uring_lock.
To fix this issue, briefly drop uring_lock when calling
io_run_task_work_sig(), and there are two things to concern:

- hold uring_lock in io_ring_ctx_free() around io_sqe_files_unregister()
    this is for consistency of lock/unlock.
- add new fixed rsrc ref node before dropping uring_lock
    it's not safe to do io_uring_enter-->percpu_ref_get() with a dying one.
- check if rsrc_data->refs is dying to avoid parallel io_sqe_files_unregister

Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 1ffc54220c ("io_uring: fix io_sqe_files_unregister() hangs")
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: fixes from Pavel folded in]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-20 19:02:12 -07:00
Pavel Begunkov a3df769899 io_uring: fail io-wq submission from a task_work
In case of failure io_wq_submit_work() needs to post an CQE and so
potentially take uring_lock. The safest way to deal with it is to do
that from under task_work where we can safely take the lock.

Also, as io_iopoll_check() holds the lock tight and releases it
reluctantly, it will play nicer in the furuter with notifying an
iopolling task about new such pending failed requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-20 19:01:35 -07:00
Al Viro eacd9aa8ce fix handling of nd->depth on LOOKUP_CACHED failures in try_to_unlazy*
After switching to non-RCU mode, we want nd->depth to match the number
of entries in nd->stack[] that need eventual path_put().
legitimize_links() takes care of that on failures; unfortunately,
failure exits added for LOOKUP_CACHED do not.

We could add the logics for that into those failure exits, both in
try_to_unlazy() and in try_to_unlazy_next(), but since both checks
are immediately followed by legitimize_links() and there's no calls
of legitimize_links() other than those two...  It's easier to
move the check (and required handling of nd->depth on failure) into
legitimize_links() itself.

[caught by Jens: ... and since we are zeroing ->depth here, we need
to do drop_links() first]

Fixes: 6c6ec2b0a3 "fs: add support for LOOKUP_CACHED"
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-02-20 12:33:12 -05:00
YueHaibing af982da9a6 cifs: Fix inconsistent IS_ERR and PTR_ERR
Fix inconsistent IS_ERR and PTR_ERR in cifs_find_swn_reg(). The proper
pointer to be passed as argument to PTR_ERR() is share_name.

This bug was detected with the help of Coccinelle.

Fixes: bf80e5d425 ("cifs: Send witness register and unregister commands to userspace daemon")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-19 21:29:10 -06:00
Pavel Begunkov 792bb6eb86 io_uring: don't take uring_lock during iowq cancel
[   97.866748] a.out/2890 is trying to acquire lock:
[   97.867829] ffff8881046763e8 (&ctx->uring_lock){+.+.}-{3:3}, at:
io_wq_submit_work+0x155/0x240
[   97.869735]
[   97.869735] but task is already holding lock:
[   97.871033] ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at:
__x64_sys_io_uring_enter+0x3f0/0x5b0
[   97.873074]
[   97.873074] other info that might help us debug this:
[   97.874520]  Possible unsafe locking scenario:
[   97.874520]
[   97.875845]        CPU0
[   97.876440]        ----
[   97.877048]   lock(&ctx->uring_lock);
[   97.877961]   lock(&ctx->uring_lock);
[   97.878881]
[   97.878881]  *** DEADLOCK ***
[   97.878881]
[   97.880341]  May be due to missing lock nesting notation
[   97.880341]
[   97.881952] 1 lock held by a.out/2890:
[   97.882873]  #0: ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at:
__x64_sys_io_uring_enter+0x3f0/0x5b0
[   97.885108]
[   97.885108] stack backtrace:
[   97.890457] Call Trace:
[   97.891121]  dump_stack+0xac/0xe3
[   97.891972]  __lock_acquire+0xab6/0x13a0
[   97.892940]  lock_acquire+0x2c3/0x390
[   97.894894]  __mutex_lock+0xae/0x9f0
[   97.901101]  io_wq_submit_work+0x155/0x240
[   97.902112]  io_wq_cancel_cb+0x162/0x490
[   97.904126]  io_async_find_and_cancel+0x3b/0x140
[   97.905247]  io_issue_sqe+0x86d/0x13e0
[   97.909122]  __io_queue_sqe+0x10b/0x550
[   97.913971]  io_queue_sqe+0x235/0x470
[   97.914894]  io_submit_sqes+0xcce/0xf10
[   97.917872]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[   97.921424]  do_syscall_64+0x2d/0x40
[   97.922329]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

While holding uring_lock, e.g. from inline execution, async cancel
request may attempt cancellations through io_wq_submit_work, which may
try to grab a lock. Delay it to task_work, so we do it from a clean
context and don't have to worry about locking.

Cc: <stable@vger.kernel.org> # 5.5+
Fixes: c07e671951 ("io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()")
Reported-by: Abaci <abaci@linux.alibaba.com>
Reported-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 16:15:31 -07:00
Jiri Bohac 19d8e9149c pstore: Fix typo in compression option name
Both pstore_compress() and decompress_record() use a mistyped config
option name ("PSTORE_COMPRESSION" instead of "PSTORE_COMPRESS"). As
a result compression and decompression of pstore records was always
disabled.

Use the correct config option name.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: fd49e03280 ("pstore: Fix linking when crypto API disabled")
Acked-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210218111547.johvp5klpv3xrpnn@dwarf.suse.cz
2021-02-18 12:27:49 -08:00
Pavel Begunkov de59bc104c io_uring: fail links more in io_submit_sqe()
Instead of marking a link with REQ_F_FAIL_LINK on an error and delaying
its failing to the caller, do it eagerly right when after getting an
error in io_submit_sqe(). This renders FAIL_LINK checks in
io_queue_link_head() useless and we can skip it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov 1ee43ba8d2 io_uring: don't do async setup for links' heads
Now, as we can do async setup without holding an SQE, we can skip doing
io_req_defer_prep() for link heads, it will be tried to be executed
inline and follows all the rules of the non-linked requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov be7053b7d0 io_uring: do io_*_prep() early in io_submit_sqe()
Now as preparations are split from async setup, we can do the first one
pretty early not spilling it across multiple call sites. And after it's
done SQE is not needed anymore and we can save on passing it deeply into
the submission stack.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov 93642ef884 io_uring: split sqe-prep and async setup
There are two kinds of opcode-specific preparations we do. The first is
just initialising req with what is always needed for an opcode and
reading all non-generic SQE fields. And the second is copying some of
the stuff like iovec preparing to punt a request to somewhere async,
e.g. to io-wq or for draining. For requests that have tried an inline
execution but still needing to be punted, the second prep type is done
by the opcode handler itself.

Currently, we don't explicitly split those preparation steps, but
combining both of them into io_*_prep(), altering the behaviour by
allocating ->async_data. That's pretty messy and hard to follow and also
gets in the way of some optimisations.

Split the steps, leave the first type as where it is now, and put the
second into a new io_req_prep_async() helper. It may make us to do opcode
switch twice, but it's worth it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov cf10960426 io_uring: don't submit link on error
If we get an error in io_init_req() for a request that would have been
linked, we break the submission but still issue a partially composed
link, that's nasty, fail it instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov a1ab7b35db io_uring: move req link into submit_state
Move struct io_submit_link into submit_state, which is a part of a
submission state and so belongs to it. It saves us from explicitly
passing it, and init/deinit is now nicely hidden in
io_submit_state_[start,end].

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov a6b8cadcea io_uring: move io_init_req() into io_submit_sqe()
Behaves identically, just move io_init_req() call into the beginning of
io_submit_sqes(). That looks better unloads io_submit_sqes().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov b16fed66bc io_uring: move io_init_req()'s definition
A preparation patch, symbol to symbol move io_init_req() +
io_check_restriction() a bit up. The submission path is pretty settled
down, so don't worry about backports and move the functions instead of
relying on forward declarations in the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov 441960f3b9 io_uring: don't duplicate ->file check in sfr
IORING_OP_SYNC_FILE_RANGE is marked as .needs_file, so the common path
will take care of assigning and validating req->file, no need to
duplicate it in io_sfr_prep().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov 1155c76a24 io_uring: keep io_*_prep() naming consistent
Follow io_*_prep() naming pattern, there are only fsync and sfr that
don't do that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Pavel Begunkov 46c4e16a86 io_uring: kill fictitious submit iteration index
@i and @submitted are very much coupled together, and there is no need
to keep them both. Remove @i, it doesn't change generated binary but
helps to keep a single source of truth.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-18 13:13:18 -07:00
Greg Kroah-Hartman 56348560d4 debugfs: do not attempt to create a new file before the filesystem is initalized
Some subsystems want to add debugfs files at early boot, way before
debugfs is initialized.  This seems to work somehow as the vfs layer
will not allow it to happen, but let's be explicit and test to ensure we
are properly up and running before allowing files to be created.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: stable <stable@vger.kernel.org>
Reported-by: Michael Walle <michael@walle.cc>
Reported-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210218100818.3622317-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-18 16:23:49 +01:00
Greg Kroah-Hartman bc6de804d3 debugfs: be more robust at handling improper input in debugfs_lookup()
debugfs_lookup() doesn't like it if it is passed an illegal name
pointer, or if the filesystem isn't even initialized yet.  If either of
these happen, it will crash the system, so fix it up by properly testing
for valid input and that we are up and running before trying to find a
file in the filesystem.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: stable <stable@vger.kernel.org>
Reported-by: Michael Walle <michael@walle.cc>
Tested-by: Michael Walle <michael@walle.cc>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210218100818.3622317-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-18 16:23:46 +01:00
Shin'ichiro Kawasaki 059c01039c zonefs: Fix file size of zones in full condition
Per ZBC/ZAC/ZNS specifications, write pointers may not have valid values
when zones are in full condition. However, when zonefs mounts a zoned
block device, zonefs refers write pointers to set file size even when
the zones are in full condition. This results in wrong file size. To fix
this, refer maximum file size in place of write pointers for zones in
full condition.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Cc: <stable@vger.kernel.org> # 5.6+
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-02-18 08:36:40 +09:00
Pavel Begunkov fe1cdd5586 io_uring: fix read memory leak
Don't forget to free iovec read inline completion and bunch of other
cases that do "goto done" before setting up an async context.

Fixes: 5ea5dd4584 ("io_uring: inline io_read()'s iovec freeing")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-17 14:27:51 -07:00
Trond Myklebust 7ae017c732 NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-02-17 15:36:03 -05:00
Bob Peterson 4fc7ec31c3 gfs2: Use resource group glock sharing
This patch takes advantage of the new glock holder sharing feature for
resource groups.  We have already introduced local resource group
locking in a previous patch, so competing accesses of local processes
are already under control.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:30:28 +01:00
Bob Peterson 06e908cd9e gfs2: Allow node-wide exclusive glock sharing
Introduce a new LM_FLAG_NODE_SCOPE glock holder flag: when taking a
glock in LM_ST_EXCLUSIVE (EX) mode and with the LM_FLAG_NODE_SCOPE flag
set, the exclusive lock is shared among all local processes who are
holding the glock in EX mode and have the LM_FLAG_NODE_SCOPE flag set.
From the point of view of other nodes, the lock is still held
exclusively.

A future patch will start using this flag to improve performance with
rgrp sharing.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:30:28 +01:00
Andreas Gruenbacher 9e514605c7 gfs2: Add local resource group locking
Prepare for treating resource group glocks as exclusive among nodes but
shared among all tasks running on a node: introduce another layer of
node-specific locking that the local tasks can use to coordinate their
accesses.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:30:28 +01:00
Andreas Gruenbacher 725d0e9d46 gfs2: Add per-reservation reserved block accounting
Add a rs_reserved field to struct gfs2_blkreserv to keep track of the number of
blocks reserved by this particular reservation, and a rd_reserved field to
struct gfs2_rgrpd to keep track of the total number of reserved blocks in the
resource group.  Those blocks are exclusively reserved, as opposed to the
rs_requested / rd_requested blocks which are tracked in the reservation tree
(rd_rstree) and which can be stolen if necessary.

When making a reservation with gfs2_inplace_reserve, rs_reserved is set to
somewhere between ap->min_target and ap->target depending on the number of free
blocks in the resource group.  When allocating blocks with gfs2_alloc_blocks,
rs_reserved is decremented accordingly.  Eventually, any reserved but not
consumed blocks are returned to the resource group by gfs2_inplace_release.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:30:26 +01:00
Andreas Gruenbacher 07974d2a2a gfs2: Rename rs_{free -> requested} and rd_{reserved -> requested}
We keep track of what we've so far been referring to as reservations in
rd_rstree: the nodes in that tree indicate where in a resource group we'd
like to allocate the next couple of blocks for a particular inode.  Local
processes take those as hints, but they may still "steal" blocks from those
extents, so when actually allocating a block, we must double check in the
bitmap whether that block is actually still free.  Likewise, other cluster
nodes may "steal" such blocks as well.

One of the following patches introduces resource group glock sharing, i.e.,
sharing of an exclusively locked resource group glock among local processes to
speed up allocations.  To make that work, we'll need to keep track of how many
blocks we've actually reserved for each inode, so we end up with two different
kinds of reservations.

Distinguish these two kinds by referring to blocks which are reserved but may
still be "stolen" as "requested".  This rename also makes it more obvious that
rs_requested and rd_requested are strongly related.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:26:06 +01:00
Andreas Gruenbacher 0ec9b9ea4f gfs2: Check for active reservation in gfs2_release
In gfs2_release, check if the inode has an active reservation to avoid
unnecessary lock taking.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:26:05 +01:00
Andreas Gruenbacher b2598965dc gfs2: Don't search for unreserved space twice
If gfs2_inplace_reserve has chosen a resource group but it couldn't make a
reservation there, there are too many other reservations in that resource
group.  In that case, don't even try to respect existing reservations in
gfs2_alloc_blocks.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:26:05 +01:00
Andreas Gruenbacher 3d39fcd16d gfs2: Only pass reservation down to gfs2_rbm_find
Only pass the current reservation down to gfs2_rbm_find rather than the entire
inode; we don't need any of the other information.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:26:05 +01:00
Andreas Gruenbacher f38e998fbb gfs2: Also reflect single-block allocations in rgd->rd_extfail_pt
Pass a non-NULL minext to gfs2_rbm_find even for single-block allocations.  In
gfs2_rbm_find, also set rgd->rd_extfail_pt when a single-block allocation
fails in a resource group: there is no reason for treating that case
differently.  In gfs2_reservation_check_and_update, only check how many free
blocks we have if more than one block is requested; we already know there's at
least one free block.

In addition, when allocating N blocks fails in gfs2_rbm_find, we need to set
rd_extfail_pt to N - 1 rather than N:  rd_extfail_pt defines the biggest
allocation that might still succeed.

Finally, reset rd_extfail_pt when updating the resource group statistics in
update_rgrp_lvb, as we already do in gfs2_rgrp_bh_get.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-17 19:26:05 +01:00
Shyam Prasad N 03e9bb1a0b cifs: Reformat DebugData and index connections by conn_id.
Reformat the output of /proc/fs/cifs/DebugData to print the
conn_id for each connection. Also reordered and numbered the data
into a more reader-friendly format.

This is what the new format looks like:
$ cat /proc/fs/cifs/DebugData
Display Internal CIFS Data Structures for Debugging
---------------------------------------------------
CIFS Version 2.30
Features: DFS,FSCACHE,STATS,DEBUG,ALLOW_INSECURE_LEGACY,WEAK_PW_HASH,CIFS_POSIX,UPCALL(SPNEGO),XATTR,ACL
CIFSMaxBufSize: 16384
Active VFS Requests: 0

Servers:
1) ConnectionId: 0x1
Number of credits: 371 Dialect 0x300
TCP status: 1 Instance: 1
Local Users To Server: 1 SecMode: 0x1 Req On Wire: 0 In Send: 0 In MaxReq Wait: 0

        Sessions:
        1) Name: 10.10.10.10 Uses: 1 Capability: 0x300077     Session Status: 1
        Security type: RawNTLMSSP  SessionId: 0x785560000019
        User: 1000 Cred User: 0

        Shares:
        0) IPC: \\10.10.10.10\IPC$ Mounts: 1 DevInfo: 0x0 Attributes: 0x0
        PathComponentMax: 0 Status: 1 type: 0 Serial Number: 0x0
        Share Capabilities: None        Share Flags: 0x30
        tid: 0x1        Maximal Access: 0x11f01ff

        1) \\10.10.10.10\shyam_test2 Mounts: 1 DevInfo: 0x20020 Attributes: 0xc706ff
        PathComponentMax: 255 Status: 1 type: DISK Serial Number: 0xd4723975
        Share Capabilities: None Aligned, Partition Aligned,    Share Flags: 0x0
        tid: 0x5        Optimal sector size: 0x1000     Maximal Access: 0x1f01ff

        MIDs:

        Server interfaces: 3
        1)      Speed: 10000000000 bps
                Capabilities: rss
                IPv4: 10.10.10.1

        2)      Speed: 10000000000 bps
                Capabilities: rss
                IPv6: fe80:0000:0000:0000:18b4:0000:0000:0000

        3)      Speed: 1000000000 bps
                Capabilities: rss
                IPv4: 10.10.10.10
                [CONNECTED]

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-16 16:27:41 -06:00
Shyam Prasad N 6d82c27ae5 cifs: Identify a connection by a conn_id.
Introduced a new field conn_id in TCP_Server_Info structure.
This is a non-persistent unique identifier maintained by the client
for a connection to a file server. For this, a global counter named
tcpSesNextId is maintained. On allocating a new TCP_Server_Info,
this counter is incremented and assigned.

Changed the dynamic tracepoints related to reconnects and
crediting to be more informative (with conn_id printed).
Debugging a crediting issue helped me understand the
important things to print here.

Always call dynamic tracepoints outside the scope of spinlocks.
To do this, copy out the credits and in_flight fields of the
server struct before dropping the lock.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-16 15:48:02 -06:00
Shyam Prasad N 7de0394801 cifs: Fix in error types returned for out-of-credit situations.
For failure by timeout waiting for credits, changed the error
returned to the app with EBUSY, instead of ENOTSUPP. This is done
because this situation is possible even in non-buggy cases. i.e.
overloaded server can return 0 credits until done with outstanding
requests. And this feels like a better error to return to the app.

For cases of zero credits found even when there are no requests
in flight, replaced ENOTSUPP with EDEADLK, since we're avoiding
deadlock here by returning error.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-16 15:40:13 -06:00
Shyam Prasad N 0f56db8314 cifs: New optype for session operations.
We used to share the CIFS_NEG_OP flag between negotiate and
session authentication. There was an assumption in the code that
CIFS_NEG_OP is used by negotiate only. So introcuded CIFS_SESS_OP
and used it for session setup optypes.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-16 15:35:57 -06:00
Trond Myklebust 6c17260ca4 NFS: Set the stable writes flag when initialising the super block
We need to wait for outstanding writes on the page to complete before we
can update it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-02-16 16:11:14 -05:00
Trond Myklebust a0492339fc NFS: Add mount options supporting eager writes
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-02-16 16:11:14 -05:00
Trond Myklebust ed7bcdb374 NFS: Add support for eager writes
Support eager writing to the server, meaning that we write the data to
cache on the server, and wait for that to complete. This ensures that we
see ENOSPC errors immediately.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-02-16 16:11:14 -05:00
Steve French 201023c5b2 cifs: fix trivial typo
Typo: exiting --> existing

Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-16 15:03:01 -06:00
Jens Axboe 0b81e80c81 io_uring: tctx->task_lock should be IRQ safe
We add task_work from any context, hence we need to ensure that we can
tolerate it being from IRQ context as well.

Fixes: 7cbf1722d5 ("io_uring: provide FIFO ordering for task_work")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-16 11:11:20 -07:00
Xiubo Li 558b4510f6 ceph: defer flushing the capsnap if the Fb is used
If the Fb cap is used it means the current inode is flushing the
dirty data to OSD, just defer flushing the capsnap.

URL: https://tracker.ceph.com/issues/48640
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:52 +01:00
Jeff Layton a8810cdc00 ceph: allow queueing cap/snap handling after putting cap references
Testing with the fscache overhaul has triggered some lockdep warnings
about circular lock dependencies involving page_mkwrite and the
mmap_lock. It'd be better to do the "real work" without the mmap lock
being held.

Change the skip_checking_caps parameter in __ceph_put_cap_refs to an
enum, and use that to determine whether to queue check_caps, do it
synchronously or not at all. Change ceph_page_mkwrite to do a
ceph_put_cap_refs_async().

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Jeff Layton 64f28c627a ceph: clean up inode work queueing
Add a generic function for taking an inode reference, setting the I_WORK
bit and queueing i_work. Turn the ceph_queue_* functions into static
inline wrappers that pass in the right bit.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Jeff Layton 64f36da562 ceph: fix flush_snap logic after putting caps
A primary reason for skipping ceph_check_caps after putting the
references was to avoid the locking in ceph_check_caps during a
reconnect. __ceph_put_cap_refs can still call ceph_flush_snaps in that
case though, and that takes many of the same inconvenient locks.

Fix the logic in __ceph_put_cap_refs to skip flushing snaps when the
skip_checking_caps flag is set.

Fixes: e64f44a884 ("ceph: skip checking caps when session reconnecting and releasing reqs")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Chris Wilson bfe3911a91 kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE
Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fd (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.

Rasmus Villemoes also pointed out that systemd uses SYS_kcmp to
deduplicate the per-service file descriptor store.

Note that some distributions such as Ubuntu are already enabling
CHECKPOINT_RESTORE in their configs and so, by extension, SYS_kcmp.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/3046
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: stable@vger.kernel.org
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> # DRM depends on kcmp
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> # systemd uses kcmp
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210205220012.1983-1-chris@chris-wilson.co.uk
2021-02-16 09:59:41 +01:00
Johannes Thumshirn 62ab1aadcc zonefs: add tracepoints for file operations
Add tracepoints for file I/O operations to aid in debugging of I/O errors
with zonefs.

The added tracepoints are in:
- zonefs_zone_mgmt() for tracing zone management operations
- zonefs_iomap_begin() for tracing regular file I/O
- zonefs_file_dio_append() for tracing zone-append operations

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-02-16 09:59:54 +09:00
Jens Axboe 0d4370cfe3 proc: don't allow async path resolution of /proc/thread-self components
If this is attempted by an io-wq kthread, then return -EOPNOTSUPP as we
don't currently support that. Once we can get task_pid_ptr() doing the
right thing, then this can go away again.

Use PF_IO_WORKER for this to speciically target the io_uring workers.
Modify the /proc/self/ check to use PF_IO_WORKER as well.

Cc: stable@vger.kernel.org
Fixes: 8d4c3e76e3 ("proc: don't allow async path resolution of /proc/self components")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-15 11:02:16 -07:00
Laurent Vivier 2347961b11 binfmt_misc: pass binfmt_misc flags to the interpreter
It can be useful to the interpreter to know which flags are in use.

For instance, knowing if the preserve-argv[0] is in use would
allow to skip the pathname argument.

This patch uses an unused auxiliary vector, AT_FLAGS, to add a
flag to inform interpreter if the preserve-argv[0] is enabled.

Note by Helge Deller:
The real-world user of this patch is qemu-user, which needs to know
if it has to preserve the argv[0]. See Debian bug #970460.

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: YunQiang Su <ysu@wavecomp.com>
URL: http://bugs.debian.org/970460
Signed-off-by: Helge Deller <deller@gmx.de>
2021-02-15 18:28:30 +01:00
Steve French 6dffa4c220 smb3: negotiate current dialect (SMB3.1.1) when version 3 or greater requested
SMB3.1.1 is the newest, and preferred dialect, and is included in
the requested dialect list by default (ie if no vers= is specified
on mount) but it should also be requested if SMB3 or later is requested
(vers=3 instead of a specific dialect: vers=2.1, vers=3.02 or vers=3.0).

Currently specifying "vers=3" only requests smb3.0 and smb3.02 but this
patch fixes it to also request smb3.1.1 dialect, as it is the newest
and most secure dialect and is a "version 3 or later" dialect (the intent
of "vers=3").

Signed-off-by: Steve French <stfrench@microsoft.com>
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-15 10:33:34 -06:00
J. Bruce Fields bd5ae9288d nfsd: register pernet ops last, unregister first
These pernet operations may depend on stuff set up or torn down in the
module init/exit functions.  And they may be called at any time in
between.  So it makes more sense for them to be the last to be
registered in the init function, and the first to be unregistered in the
exit function.

In particular, without this, the drc slab is being destroyed before all
the per-net drcs are shut down, resulting in an "Objects remaining in
nfsd_drc on __kmem_cache_shutdown()" warning in exit_nfsd.

Reported-by: Zhi Li <yieli@redhat.com>
Fixes: 3ba75830ce "nfsd4: drc containerization"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-02-15 10:45:00 -05:00
Wang ShaoBo 42119dbe57 ubifs: Fix error return code in alloc_wbufs()
Fix to return PTR_ERR() error code from the error handling case instead
fo 0 in function alloc_wbufs(), as done elsewhere in this function.

Fixes: 6a98bc4614 ("ubifs: Add authentication nodes to journal")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-13 22:58:44 +01:00
Linus Torvalds e42ee56fe5 for-5.11-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAmlkAACgkQxWXV+ddt
 WDuwNxAAiBAhEwPllzyU86p4RMMip5pa24zu11HkTya65yGk6EFuj4zTlx/L5Fn6
 JOjxwlPqaTItER1PYJ5HRdIy1Y2E4eWEiDLolvmvDCPZrfKRKhBU1MZbgXwDbp+Z
 pwaJGIm5ZaXDGyuFge3bKA48BERfqxRBO3qIOZ0tzgsUFLlZ2d9EdDc99093/J6k
 QzIijXQjFnvnB2MNawN1b/KQ63xqXLo2hemKcKIFCxJHm9eaet/qwGHl5iuR5ScY
 bOGCWvLSkCXceartDur3msOZXur09YLyfeYmE9dj1FN3aNu97sW8VivWRrs3aglK
 if51iYrrjKSnDr4SOK28S5UYdgeStb/qWWtosdcMsQVBo0t7iCnGT2psGaQCkdfG
 FChqbs2uXlbJrojlelV6xbaU3S2D2MtSz5mF+I2G5MpQbj1jkhYE9ZTUQeibcd7o
 l+edn/VJvVK4X0NAX8pIWJ4nFY1HqUTyfn28IQ7ymBhyyUloIoazvSkBuSWy6iy0
 9aPpohOKjCw8Y3MbgcIfIEJhdK+aIKF8ZPh52+zcXQzf1OtSryVarLHsNXWm9vJ8
 tHsRHCzrbLFdAXZccT6YlerzPs4+PVf44UknDbFCg7sLcG04NIGGrMXOtTHwgEZL
 BEywTjAMlMDjrEXouxYAPNPnEg/NlvQGZYRvBnxrtZE4G2fxJ7o=
 =7w6G
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A regression fix caused by a refactoring in 5.11.

  A corrupted superblock wouldn't be detected by checksum verification
  due to wrongly placed initialization of the checksum length, thus
  making memcmp always work"

* tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: initialize fs_info::csum_size earlier in open_ctree
2021-02-13 11:55:29 -08:00
Heiko Carstens 96c0a6a72d s390,alpha: switch to 64-bit ino_t
s390 and alpha are the only 64 bit architectures with a 32-bit ino_t.
Since this is quite unusual this causes bugs from time to time.

See e.g. commit ebce3eb2f7 ("ceph: fix inode number handling on
arches with 32-bit ino_t") for an example.

This (obviously) also prevents s390 and alpha to use 64-bit ino_t for
tmpfs. See commit b85a7a8bb5 ("tmpfs: disallow CONFIG_TMPFS_INODE64
on s390").

Therefore switch both s390 and alpha to 64-bit ino_t. This should only
have an effect on the ustat system call. To prevent ABI breakage
define struct ustat compatible to the old layout and change
sys_ustat() accordingly.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-13 17:17:53 +01:00
Jens Axboe 41be53e94f io_uring: kill cached requests from exiting task closing the ring
Be nice and prune these upfront, in case the ring is being shared and
one of the tasks is going away. This is a bit more important now that
we account the allocations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-13 09:11:04 -07:00
Jens Axboe 9a4fdbd8ee io_uring: add helper to free all request caches
We have three different ones, put it in a helper for easy calling. This
is in preparation for doing it outside of ring freeing as well. With
that in mind, also ensure that we do the proper locking for safe calling
from a context where the ring it still live.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-13 09:09:44 -07:00
Jens Axboe 68e68ee6e3 io_uring: allow task match to be passed to io_req_cache_free()
No changes in this patch, just allows a caller to pass in a targeted
task that we must match for freeing requests in the cache.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-13 09:00:02 -07:00
Linus Torvalds 7989807dc0 4 small smb3 fixes to the new mount API
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmAm8mYACgkQiiy9cAdy
 T1FDMgwArdtgQtRL9hSOINBl19/OM9GSLszUtI/EctfZGgnGoNysIq4/pIvv7uqE
 egVohlVwJI4niGguU7AABj5vrthLsAbmzKi+e2N6kAcYtzpXeLvjkXVfyq/Bld66
 oe7sYWMjH1TFEc64gejW7nYcxOsg93HQtAvNyyoAS3SlOOWsl2LI/AQiw3BXXVBo
 Jb1AkfpdBGglUW7esYVZUyVwCI/ZzYVEA0YTCpaGX+EIfdCWXm3ArNPP7E9gHOLb
 3HUbNP6W5QKwgYL1tPX3s7AFEtj0+PxuREgB6mSTFOkWRRfZJUTma1AEfa9MUWGA
 KOnQKiIzIsmaOQGP/BumcrPr/7kgeqYEFZ2exNT8kVw6ETEHP1+A4j61KZI0mduz
 rgnQx21gPzVcDo0tfO8SjGSt3vzuRA+vkyZO4eB/nmTqJ4YVqX+E6E4vophKOUk2
 ELqk0fUlX3uspqocZCor5nrLA0EadNV6P/LfFiRyGUTt+tOcOeYmyAdK7dWY1JTf
 wsd20mCY
 =vJKS
 -----END PGP SIGNATURE-----

Merge tag '5.11-rc7-smb3-github' of git://github.com/smfrench/smb3-kernel

Pull cifs fixes from Steve French:
 "Four small smb3 fixes to the new mount API (including a particularly
  important one for DFS links).

  These were found in testing this week of additional DFS scenarios, and
  a user testing of an apache container problem"

* tag '5.11-rc7-smb3-github' of git://github.com/smfrench/smb3-kernel:
  cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath.
  cifs: In the new mount api we get the full devname as source=
  cifs: do not disable noperm if multiuser mount option is not provided
  cifs: fix dfs-links
2021-02-12 14:45:39 -08:00
Jaegeuk Kim 938a184265 f2fs: give a warning only for readonly partition
Let's allow mounting readonly partition. We're able to recovery later once we
have it as read-write back.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-12 14:09:54 -08:00
Jens Axboe e06aa2e94f io-wq: clear out worker ->fs and ->files
By default, kernel threads have init_fs and init_files assigned. In the
past, this has triggered security problems, as commands that don't ask
for (and hence don't get assigned) fs/files from the originating task
can then attempt path resolution etc with access to parts of the system
they should not be able to.

Rather than add checks in the fs code for misuse, just set these to
NULL. If we do attempt to use them, then the resulting code will oops
rather than provide access to something that it should not permit.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 14:02:54 -07:00
Yang Yang 90ada91f46 jffs2: check the validity of dstlen in jffs2_zlib_compress()
KASAN reports a BUG when download file in jffs2 filesystem.It is
because when dstlen == 1, cpage_out will write array out of bounds.
Actually, data will not be compressed in jffs2_zlib_compress() if
data's length less than 4.

[  393.799778] BUG: KASAN: slab-out-of-bounds in jffs2_rtime_compress+0x214/0x2f0 at addr ffff800062e3b281
[  393.809166] Write of size 1 by task tftp/2918
[  393.813526] CPU: 3 PID: 2918 Comm: tftp Tainted: G    B           4.9.115-rt93-EMBSYS-CGEL-6.1.R6-dirty #1
[  393.823173] Hardware name: LS1043A RDB Board (DT)
[  393.827870] Call trace:
[  393.830322] [<ffff20000808c700>] dump_backtrace+0x0/0x2f0
[  393.835721] [<ffff20000808ca04>] show_stack+0x14/0x20
[  393.840774] [<ffff2000086ef700>] dump_stack+0x90/0xb0
[  393.845829] [<ffff20000827b19c>] kasan_object_err+0x24/0x80
[  393.851402] [<ffff20000827b404>] kasan_report_error+0x1b4/0x4d8
[  393.857323] [<ffff20000827bae8>] kasan_report+0x38/0x40
[  393.862548] [<ffff200008279d44>] __asan_store1+0x4c/0x58
[  393.867859] [<ffff2000084ce2ec>] jffs2_rtime_compress+0x214/0x2f0
[  393.873955] [<ffff2000084bb3b0>] jffs2_selected_compress+0x178/0x2a0
[  393.880308] [<ffff2000084bb530>] jffs2_compress+0x58/0x478
[  393.885796] [<ffff2000084c5b34>] jffs2_write_inode_range+0x13c/0x450
[  393.892150] [<ffff2000084be0b8>] jffs2_write_end+0x2a8/0x4a0
[  393.897811] [<ffff2000081f3008>] generic_perform_write+0x1c0/0x280
[  393.903990] [<ffff2000081f5074>] __generic_file_write_iter+0x1c4/0x228
[  393.910517] [<ffff2000081f5210>] generic_file_write_iter+0x138/0x288
[  393.916870] [<ffff20000829ec1c>] __vfs_write+0x1b4/0x238
[  393.922181] [<ffff20000829ff00>] vfs_write+0xd0/0x238
[  393.927232] [<ffff2000082a1ba8>] SyS_write+0xa0/0x110
[  393.932283] [<ffff20000808429c>] __sys_trace_return+0x0/0x4
[  393.937851] Object at ffff800062e3b280, in cache kmalloc-64 size: 64
[  393.944197] Allocated:
[  393.946552] PID = 2918
[  393.948913]  save_stack_trace_tsk+0x0/0x220
[  393.953096]  save_stack_trace+0x18/0x20
[  393.956932]  kasan_kmalloc+0xd8/0x188
[  393.960594]  __kmalloc+0x144/0x238
[  393.963994]  jffs2_selected_compress+0x48/0x2a0
[  393.968524]  jffs2_compress+0x58/0x478
[  393.972273]  jffs2_write_inode_range+0x13c/0x450
[  393.976889]  jffs2_write_end+0x2a8/0x4a0
[  393.980810]  generic_perform_write+0x1c0/0x280
[  393.985251]  __generic_file_write_iter+0x1c4/0x228
[  393.990040]  generic_file_write_iter+0x138/0x288
[  393.994655]  __vfs_write+0x1b4/0x238
[  393.998228]  vfs_write+0xd0/0x238
[  394.001543]  SyS_write+0xa0/0x110
[  394.004856]  __sys_trace_return+0x0/0x4
[  394.008684] Freed:
[  394.010691] PID = 2918
[  394.013051]  save_stack_trace_tsk+0x0/0x220
[  394.017233]  save_stack_trace+0x18/0x20
[  394.021069]  kasan_slab_free+0x88/0x188
[  394.024902]  kfree+0x6c/0x1d8
[  394.027868]  jffs2_sum_write_sumnode+0x2c4/0x880
[  394.032486]  jffs2_do_reserve_space+0x198/0x598
[  394.037016]  jffs2_reserve_space+0x3f8/0x4d8
[  394.041286]  jffs2_write_inode_range+0xf0/0x450
[  394.045816]  jffs2_write_end+0x2a8/0x4a0
[  394.049737]  generic_perform_write+0x1c0/0x280
[  394.054179]  __generic_file_write_iter+0x1c4/0x228
[  394.058968]  generic_file_write_iter+0x138/0x288
[  394.063583]  __vfs_write+0x1b4/0x238
[  394.067157]  vfs_write+0xd0/0x238
[  394.070470]  SyS_write+0xa0/0x110
[  394.073783]  __sys_trace_return+0x0/0x4
[  394.077612] Memory state around the buggy address:
[  394.082404]  ffff800062e3b180: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[  394.089623]  ffff800062e3b200: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[  394.096842] >ffff800062e3b280: 01 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  394.104056]                    ^
[  394.107283]  ffff800062e3b300: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[  394.114502]  ffff800062e3b380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[  394.121718] ==================================================================

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12 21:53:23 +01:00
Sascha Hauer d984bcf576 ubifs: Fix off-by-one error
An inode is allowed to have ubifs_xattr_max_cnt() xattrs, so we must
complain only when an inode has more xattrs, having exactly
ubifs_xattr_max_cnt() xattrs is fine.
With this the maximum number of xattrs can be created without hitting
the "has too many xattrs" warning when removing it.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12 21:53:23 +01:00
Arnd Bergmann 410b6de702 ubifs: replay: Fix high stack usage, again
An earlier commit moved out some functions to not be inlined by gcc, but
after some other rework to remove one of those, clang started inlining
the other one and ran into the same problem as gcc did before:

fs/ubifs/replay.c:1174:5: error: stack frame size of 1152 bytes in function 'ubifs_replay_journal' [-Werror,-Wframe-larger-than=]

Mark the function as noinline_for_stack to ensure it doesn't happen
again.

Fixes: f80df38512 ("ubifs: use crypto_shash_tfm_digest()")
Fixes: eb66eff663 ("ubifs: replay: Fix high stack usage")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12 21:53:22 +01:00
Dinghao Liu 11b8ab3836 ubifs: Fix memleak in ubifs_init_authentication
When crypto_shash_digestsize() fails, c->hmac_tfm
has not been freed before returning, which leads
to memleak.

Fixes: 49525e5eec ("ubifs: Add helper functions for authentication support")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12 21:53:22 +01:00
Tom Rix 19646447ad jffs2: fix use after free in jffs2_sum_write_data()
clang static analysis reports this problem

fs/jffs2/summary.c:794:31: warning: Use of memory after it is freed
                c->summary->sum_list_head = temp->u.next;
                                            ^~~~~~~~~~~~

In jffs2_sum_write_data(), in a loop summary data is handles a node at
a time.  When it has written out the node it is removed the summary list,
and the node is deleted.  In the corner case when a
JFFS2_FEATURE_RWCOMPAT_COPY is seen, a call is made to
jffs2_sum_disable_collecting().  jffs2_sum_disable_collecting() deletes
the whole list which conflicts with the loop's deleting the list by parts.

To preserve the old behavior of stopping the write midway, bail out of
the loop after disabling summary collection.

Fixes: 6171586a7a ("[JFFS2] Correct handling of JFFS2_FEATURE_RWCOMPAT_COPY nodes.")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12 21:53:22 +01:00
Johannes Berg a15f1e41fb um: hostfs: use a kmem cache for inodes
This collects all of them together and makes it possible to
e.g. exclude it from slub debugging.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12 21:28:55 +01:00
Linus Torvalds c6d8570e4d io_uring-5.11-2021-02-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAmi4wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplrtD/9BTWebvU17q/9grBrZsIyfCB9UOoPzOSwd
 4dY1trOnFk91vAVM2I8RLkUSYS6UHPSEE3Xa+jVkNOBWy0W/8RJtQpqQTVjoY8zn
 jMrvPHTyW5tc/rniG5FNsCilyDEOBZY4pIpJqwXpj/Ez14D+mb+A3nTCTdwat4sU
 zdSVLcGJU2O7RJx1yLiJLvqct1dbZI8axSd/gEOCVhxKgP6UVoYfjkjYm9hCU4et
 y7OzRfvPTb0N03EVoA6XeVke7sDK9cJLySbcwiGczCmPkKEJmFOyJP2xWSlE5Z91
 UMYeg4pOSg3tHYvPFuUShjzaYJTKAvzObHomyPjRCve+847AcqPpdoHaYQASUXvF
 ORs8vXkgjyd9lyrBa+8oqWYvXYj/3M05qPO2LhfSyDbwzEzBAmaAyf0JIr9mvem+
 7mgJ6R7uTCqPt0FXzxIfNnWSq/Rtiyuw+DP/y2sgYMDRjg70hyFhhud9K67hMplP
 wc1UAp9vD3PalTQG3fHIJycWIXd6A/RxBM+KbXdIyi6aqd6iHgf2Plz5CI2Orz7W
 sMPlG2IYfwwKDyNf9LE+sXmrDbfM3wdSQGr3BXmMXBRNWicxD6P4IM8FbsVemrt1
 QZ77zt7xPtGY3CMabYPAycbVxdf52TofcTvp3Z5gEWUEaQw1j681wsBF/V3Btnk/
 704EKKZraw==
 =y7ZL
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.11-2021-02-12' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "Revert of a patch from this release that caused a regression"

* tag 'io_uring-5.11-2021-02-12' of git://git.kernel.dk/linux-block:
  Revert "io_uring: don't take fs for recvmsg/sendmsg"
2021-02-12 11:48:02 -08:00
Pavel Begunkov 5be9ad1e42 io_uring: optimise io_init_req() flags setting
Invalid req->flags are tolerated by free/put well, avoid this dancing
needlessly presetting it to zero, and then not even resetting but
modifying it, i.e. "|=".

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 11:49:50 -07:00
Pavel Begunkov cdbff98223 io_uring: clean io_req_find_next() fast check
Indirectly io_req_find_next() is called for every request, optimise the
check by testing flags as it was long before -- __io_req_find_next()
tolerates false-positives well (i.e. link==NULL), and those should be
really rare.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 11:49:49 -07:00
Pavel Begunkov dc0eced5d9 io_uring: don't check PF_EXITING from syscall
io_sq_thread_acquire_mm_files() can find a PF_EXITING task only when
it's called from task_work context. Don't check it in all other cases,
that are when we're in io_uring_enter().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 11:49:48 -07:00
Paolo Bonzini 8c6e67bec3 KVM/arm64 updates for Linux 5.12
- Make the nVHE EL2 object relocatable, resulting in much more
   maintainable code
 - Handle concurrent translation faults hitting the same page
   in a more elegant way
 - Support for the standard TRNG hypervisor call
 - A bunch of small PMU/Debug fixes
 - Allow the disabling of symbol export from assembly code
 - Simplification of the early init hypercall handling
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmAmjqEPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDoUEQAIrJ7YF4v4gz06a0HG9+b6fbmykHyxlG7jfm
 trvctfaiKzOybKoY5odPpNFzhbYOOdXXqYipyTHGwBYtGSy9G/9SjMKSUrfln2Ni
 lr1wBqapr9TE+SVKoR8pWWuZxGGbHVa7brNuMbMsMi1wwAsM2/n70H9PXrdq3QiK
 Ge1DWLso2oEfhtTwqNKa4dwB2MHjBhBFhhq+Nq5pslm6mmxJaYqz7pyBmw/C+2cc
 oU/6kpAa1yPAauptWXtYXJYOMHihxgEa1IdK3Gl0hUyFyu96xVkwH/KFsj+bRs23
 QGGCSdy4313hzaoGaSOTK22R98Aeg0wI9a6tcCBvVVjTAztnlu1FPtUZr8e/F7uc
 +r8xVJUJFiywt3Zktf/D7YDK9LuMMqFnj0BkI4U9nIBY59XZRNhENsBCmjru5lnL
 iXa5cuta03H4emfssIChLpgn0XHFas6t5dFXBPGbXyw0qsQchTw98iQX9LVxefUK
 rOUGPIN4nE9ESRIZe0SPlAVeCtNP8cLH7+0YG9MJ1QeDVYaUsnvy9Ln/ox+514mR
 5y2KJ6y7xnLB136SKCzPDDloYtz7BDiJq6a/RPiXKGheKoxy+N+BSe58yWCqFZYE
 Fx/cGUr7oSg39U7gCboog6BDp5e2CXBfbRllg6P47bZFfdPNwzNEzHvk49VltMxx
 Rl2W05bk
 =6EwV
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for Linux 5.12

- Make the nVHE EL2 object relocatable, resulting in much more
  maintainable code
- Handle concurrent translation faults hitting the same page
  in a more elegant way
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Allow the disabling of symbol export from assembly code
- Simplification of the early init hypercall handling
2021-02-12 11:23:44 -05:00
Su Yue 83c68bbcb6 btrfs: initialize fs_info::csum_size earlier in open_ctree
User reported that btrfs-progs misc-tests/028-superblock-recover fails:

      [TEST/misc]   028-superblock-recover
  unexpected success: mounted fs with corrupted superblock
  test failed for case 028-superblock-recover

The test case expects that a broken image with bad superblock will be
rejected to be mounted. However, the test image just passed csum check
of superblock and was successfully mounted.

Commit 55fc29bed8 ("btrfs: use cached value of fs_info::csum_size
everywhere") replaces all calls to btrfs_super_csum_size by
fs_info::csum_size. The calls include the place where fs_info->csum_size
is not initialized. So btrfs_check_super_csum() passes because memcmp()
with len 0 always returns 0.

Fix it by caching csum size in btrfs_fs_info::csum_size once we know the
csum type in superblock is valid in open_ctree().

Link: https://github.com/kdave/btrfs-progs/issues/250
Fixes: 55fc29bed8 ("btrfs: use cached value of fs_info::csum_size everywhere")
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-12 14:48:24 +01:00
Pavel Begunkov 4fccfcbb73 io_uring: don't split out consume out of SQE get
Remove io_consume_sqe() and inline it back into io_get_sqe(). It
requires req dealloc on error, but in exchange we get cleaner
io_submit_sqes() and better locality for cached_sq_head.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 05:30:36 -07:00
Pavel Begunkov 04fc6c802d io_uring: save ctx put/get for task_work submit
Do a little trick in io_ring_ctx_free() briefly taking uring_lock, that
will wait for everyone currently holding it, so we can skip pinning ctx
with ctx->refs for __io_req_task_submit(), which is executed and loses
its refs/reqs while holding the lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 05:30:25 -07:00
Pavel Begunkov 921b9054e0 io_uring: don't duplicate io_req_task_queue()
Don't hand code io_req_task_queue() inside of io_async_buf_func(), just
call it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 05:30:25 -07:00
Pavel Begunkov 4e32635834 io_uring: optimise SQPOLL mm/files grabbing
There are two reasons for this. First is to optimise
io_sq_thread_acquire_mm_files() for non-SQPOLL case, which currently do
too many checks and function calls in the hot path, e.g. in
io_init_req().

The second is to not grab mm/files when there are not needed. As
__io_queue_sqe() issues only one request now, we can reuse
io_sq_thread_acquire_mm_files() instead of unconditional acquire
mm/files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 05:30:25 -07:00
Pavel Begunkov d3d7298d05 io_uring: optimise out unlikely link queue
__io_queue_sqe() tries to issue as much requests of a link as it can,
and uses io_put_req_find_next() to extract a next one, targeting inline
completed requests. As now __io_queue_sqe() is always used together with
struct io_comp_state, it leaves next propagation only a small window and
only for async reqs, that doesn't justify its existence.

Remove it, make __io_queue_sqe() to issue only a head request. It
simplifies the code and will allow other optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 05:30:25 -07:00
Pavel Begunkov bd75904590 io_uring: take compl state from submit state
Completion and submission states are now coupled together, it's weird to
get one from argument and another from ctx, do it consistently for
io_req_free_batch(). It's also faster as we already have @state cached
in registers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-12 05:30:25 -07:00
Daniel Latypov 0a76945fd1 ext4: add .kunitconfig fragment to enable ext4-specific tests
As of [1], we no longer want EXT4_KUNIT_TESTS and others to `select`
their deps. This means it can get harder to get all the right things
selected as we gain more tests w/ more deps over time.

This patch (and [2]) proposes we store kunitconfig fragments in-tree to
represent sets of tests. (N.B. right now we only have one ext4 test).

There's still a discussion to be had about how to have a hierarchy of
these files (e.g. if one wanted to test all of fs/, not just fs/ext4).

But this fragment would likely be a leaf node and isn't blocked on
deciding if we want `import` statements and the like.

Usage
=====

Before [2] (on its way to being merged):
  $ cp fs/ext4/.kunitconfig .kunit/
  $ ./tools/testing/kunit/kunit.py run

After [2]:
  $ ./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig

".kunitconfig" vs "kunitconfig"
===============================

See also: commit 14ee5cfd45 ("kunit: Rename 'kunitconfig' to '.kunitconfig'").
* The bit about .gitignore exluding it by default is now a con, however.
* But there are a lot of directories with files that begin with "k" and
  so this could cause some annoyance w/ tab completion*
* This is the name kunit.py expects right now, so some people are used
  to .kunitconfig over "kunitconfig"

[1] https://lore.kernel.org/linux-ext4/20210122110234.2825685-1-geert@linux-m68k.org/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/commit/?h=kunit&id=243180f5924ed27ea417db39feb7f9691777688e

* 372/5556 directories isn't too much, but still not a small number:
$ find -type f -name 'k*' | xargs dirname | sort -u | wc -l
372

Signed-off-by: Daniel Latypov <dlatypov@google.com>
Link: https://lore.kernel.org/r/20210210013206.136227-1-dlatypov@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-11 23:16:30 -05:00
Geert Uytterhoeven 302fdadeaf ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
user may not want to enable.  Fix this by making the test depend on
EXT4_FS instead.

Fixes: 1cbeab1b24 ("ext4: add kunit test for decoding extended timestamps")
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20210122110234.2825685-1-geert@linux-m68k.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-11 23:12:59 -05:00
Pavel Begunkov 2f8e45f16c io_uring: inline io_complete_rw_common()
__io_complete_rw() casts request to kiocb for it to be immediately
container_of()'ed by io_complete_rw_common(). And the last function's name
doesn't do a great job of illuminating its purposes, so just inline it in
its only user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11 11:42:19 -07:00
Pavel Begunkov 23faba36ce io_uring: move res check out of io_rw_reissue()
We pass return code into io_rw_reissue() only to be able to check if it's
-EAGAIN. That's not the cleanest approach and may prevent inlining of the
non-EAGAIN fast path, so do it at call sites.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11 11:41:49 -07:00
Pavel Begunkov f161340d9e io_uring: simplify iopoll reissuing
Don't stash -EAGAIN'ed iopoll requests into a list to reissue it later,
do it eagerly. It removes overhead on keeping and checking that list,
and allows in case of failure for these requests to be completed through
normal iopoll completion path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11 11:40:42 -07:00
Pavel Begunkov 6e833d538b io_uring: clean up io_req_free_batch_finish()
io_req_free_batch_finish() is final and does not permit struct req_batch
to be reused without re-init. To be more consistent don't clear ->task
there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11 11:40:40 -07:00
Jens Axboe 3c1a2ead91 io_uring: move submit side state closer in the ring
We recently added the submit side req cache, but it was placed at the
end of the struct. Move it near the other submission state for better
memory placement, and reshuffle a few other members at the same time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11 10:48:03 -07:00
Colin Ian King 4208c398aa fs/jfs: fix potential integer overflow on shift of a int
The left shift of int 32 bit integer constant 1 is evaluated using 32 bit
arithmetic and then assigned to a signed 64 bit integer. In the case where
l2nb is 32 or more this can lead to an overflow.  Avoid this by shifting
the value 1LL instead.

Addresses-Coverity: ("Uninitentional integer overflow")
Fixes: b40c2e665c ("fs/jfs: TRIM support for JFS Filesystem")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2021-02-11 11:25:54 -06:00
Shyam Prasad N a738c93fb1 cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath.
While debugging another issue today, Steve and I noticed that if a
subdir for a file share is already mounted on the client, any new
mount of any other subdir (or the file share root) of the same share
results in sharing the cifs superblock, which e.g. can result in
incorrect device name.

While setting prefix path for the root of a cifs_sb,
CIFS_MOUNT_USE_PREFIX_PATH flag should also be set.
Without it, prepath is not even considered in some places,
and output of "mount" and various /proc/<>/*mount* related
options can be missing part of the device name.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-11 11:08:32 -06:00
Ronnie Sahlberg af1a3d2ba9 cifs: In the new mount api we get the full devname as source=
so we no longer need to handle or parse the UNC= and prefixpath=
options that mount.cifs are generating.

This also fixes a bug in the mount command option where the devname
would be truncated into just //server/share because we were looking
at the truncated UNC value and not the full path.

I.e.  in the mount command output the devive //server/share/path
would show up as just //server/share

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-11 10:58:08 -06:00
Brian Foster 1cd738b13a xfs: consider shutdown in bmapbt cursor delete assert
The assert in xfs_btree_del_cursor() checks that the bmapbt block
allocation field has been handled correctly before the cursor is
freed. This field is used for accurate calculation of indirect block
reservation requirements (for delayed allocations), for example.
generic/019 reproduces a scenario where this assert fails because
the filesystem has shutdown while in the middle of a bmbt record
insertion. This occurs after a bmbt block has been allocated via the
cursor but before the higher level bmap function (i.e.
xfs_bmap_add_extent_hole_real()) completes and resets the field.

Update the assert to accommodate the transient state if the
filesystem has shutdown. While here, clean up the indentation and
comments in the function.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-11 08:46:38 -08:00
Jens Axboe e68a3ff8c3 io_uring: assign file_slot prior to calling io_sqe_file_register()
We use the assigned slot in io_sqe_file_register(), and a previous
patch moved the assignment to after we have called it. This isn't
super pretty, and will get cleaned up in the future. For now, fix
the regression by restoring the previous assignment/clear of the
file_slot.

Fixes: ea64ec02b3 ("io_uring: deduplicate file table slot calculation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-11 07:45:08 -07:00
Gao Xiang ce06312918 erofs: initialized fields can only be observed after bit is set
Currently, although set_bit() & test_bit() pairs are used as a fast-
path for initialized configurations. However, these atomic ops are
actually relaxed forms. Instead, load-acquire & store-release form is
needed to make sure uninitialized fields won't be observed in advance
here (yet no such corresponding bitops so use full barriers instead.)

Link: https://lore.kernel.org/r/20210209130618.15838-1-hsiangkao@aol.com
Fixes: 62dc45979f ("staging: erofs: fix race of initializing xattrs of a inode at the same time")
Fixes: 152a333a58 ("staging: erofs: add compacted compression indexes support")
Cc: <stable@vger.kernel.org> # 5.3+
Reported-by: Huang Jianan <huangjianan@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-02-11 11:55:28 +08:00
Gao Xiang bde545295b erofs: fix shift-out-of-bounds of blkszbits
syzbot generated a crafted bitszbits which can be shifted
out-of-bounds[1]. So directly print unsupported blkszbits
instead of blksize.

[1] https://lore.kernel.org/r/000000000000c72ddd05b9444d2f@google.com

Link: https://lore.kernel.org/r/20210120013016.14071-1-hsiangkao@aol.com
Reported-by: syzbot+c68f467cd7c45860e8d4@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-02-11 11:54:57 +08:00
kernel test robot 8646b982ba xfs: fix boolreturn.cocci warnings
fs/xfs/xfs_log.c:1062:9-10: WARNING: return of 0/1 in function 'xfs_log_need_covered' with return type bool

 Return statements in functions returning bool should use
 true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

Fixes: 37444fc4cc ("xfs: lift writable fs check up into log worker task")
CC: Brian Foster <bfoster@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-10 17:28:13 -08:00
Brian Foster e4826691cc xfs: restore shutdown check in mapped write fault path
XFS triggers an iomap warning in the write fault path due to a
!PageUptodate() page if a write fault happens to occur on a page
that recently failed writeback. The iomap writeback error handling
code can clear the Uptodate flag if no portion of the page is
submitted for I/O. This is reproduced by fstest generic/019, which
combines various forms of I/O with simulated disk failures that
inevitably lead to filesystem shutdown (which then unconditionally
fails page writeback).

This is a regression introduced by commit f150b42343 ("xfs: split
the iomap ops for buffered vs direct writes") due to the removal of
a shutdown check and explicit error return in the ->iomap_begin()
path used by the write fault path. The explicit error return
historically translated to a SIGBUS, but now carries on with iomap
processing where it complains about the unexpected state. Restore
the shutdown check to xfs_buffered_write_iomap_begin() to restore
historical behavior.

Fixes: f150b42343 ("xfs: split the iomap ops for buffered vs direct writes")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-10 17:27:20 -08:00
Colin Ian King 4a245479c2 io_uring: remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 13:28:41 -07:00
Pavel Begunkov 34343786ec io_uring: unpark SQPOLL thread for cancelation
We park SQPOLL task before going into io_uring_cancel_files(), so the
task won't run task_works including those that might be important for
the cancellation passes. In this case it's io_poll_remove_one(), which
frees requests via io_put_req_deferred().

Unpark it for while waiting, it's ok as we disable submissions
beforehand, so no new requests will be generated.

INFO: task syz-executor893:8493 blocked for more than 143 seconds.
Call Trace:
 context_switch kernel/sched/core.c:4327 [inline]
 __schedule+0x90c/0x21a0 kernel/sched/core.c:5078
 schedule+0xcf/0x270 kernel/sched/core.c:5157
 io_uring_cancel_files fs/io_uring.c:8912 [inline]
 io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
 __io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
 io_uring_files_cancel include/linux/io_uring.h:51 [inline]
 do_exit+0x2fe/0x2ae0 kernel/exit.c:780
 do_group_exit+0x125/0x310 kernel/exit.c:922
 __do_sys_exit_group kernel/exit.c:933 [inline]
 __se_sys_exit_group kernel/exit.c:931 [inline]
 __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 13:28:41 -07:00
Jens Axboe 92c75f7594 Revert "io_uring: don't take fs for recvmsg/sendmsg"
This reverts commit 10cad2c40d.

Petr reports that with this commit in place, io_uring fails the chroot
test (CVE-202-29373). We do need to retain ->fs for send/recvmsg, so
revert this commit.

Reported-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 12:37:58 -07:00
Joachim Henke a35d8f016e nilfs2: make splice write available again
Since 5.10, splice() or sendfile() to NILFS2 return EINVAL.  This was
caused by commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops").

This patch initializes the splice_write field in file_operations, like
most file systems do, to restore the functionality.

Link: https://lkml.kernel.org/r/1612784101-14353-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Joachim Henke <joachim.henke@t-systems.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>	[5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-10 11:19:58 -08:00
Damien Le Moal 0f1ba5f5d8 zonefs: use zone write granularity as block size
Zoned block devices have different granularity constraints for write
operations into sequential zones. E.g. ZBC and ZAC devices require that
writes be aligned to the device physical block size while NVMe ZNS
devices allow logical block size aligned write operations. To correctly
handle such difference, use the device zone write granularity limit to
set the block size of a zonefs volume, thus allowing the smallest
possible write unit for all zoned device types.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:44:41 -07:00
Jens Axboe 26bfa89e25 io_uring: place ring SQ/CQ arrays under memcg memory limits
Instead of imposing rlimit memlock limits for the rings themselves,
ensure that we account them properly under memcg with __GFP_ACCOUNT.
We retain rlimit memlock for registered buffers, this is just for the
ring arrays themselves.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:33:15 -07:00
Jens Axboe 91f245d5d5 io_uring: enable kmemcg account for io_uring requests
This puts io_uring under the memory cgroups accounting and limits for
requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:33:15 -07:00
Jens Axboe c7dae4ba46 io_uring: enable req cache for IRQ driven IO
This is the last class of requests that cannot utilize the req alloc
cache. Add a per-ctx req cache that is protected by the completion_lock,
and refill our submit side cache when it gets over our batch count.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:33:12 -07:00
Hao Xu ed670c3f90 io_uring: fix possible deadlock in io_uring_poll
Abaci reported follow issue:

[   30.615891] ======================================================
[   30.616648] WARNING: possible circular locking dependency detected
[   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
[   30.618035] ------------------------------------------------------
[   30.618914] a.out/1128 is trying to acquire lock:
[   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
[   30.620505]
[   30.620505] but task is already holding lock:
[   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.622349]
[   30.622349] which lock already depends on the new lock.
[   30.622349]
[   30.623289]
[   30.623289] the existing dependency chain (in reverse order) is:
[   30.624243]
[   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
[   30.625263]        lock_acquire+0x2c7/0x390
[   30.625868]        __mutex_lock+0xae/0x9f0
[   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
[   30.627278]        io_uring_poll+0xcb/0xd0
[   30.627890]        ep_item_poll.isra.14+0x4e/0x90
[   30.628531]        do_epoll_ctl+0xb7e/0x1120
[   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
[   30.629770]        do_syscall_64+0x2d/0x40
[   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.631187]
[   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
[   30.631985]        check_prevs_add+0x226/0xb00
[   30.632584]        __lock_acquire+0x1237/0x13a0
[   30.633207]        lock_acquire+0x2c7/0x390
[   30.633740]        __mutex_lock+0xae/0x9f0
[   30.634258]        __ep_eventpoll_poll+0x9f/0x220
[   30.634879]        __io_arm_poll_handler+0xbf/0x220
[   30.635462]        io_issue_sqe+0xa6b/0x13e0
[   30.635982]        __io_queue_sqe+0x10b/0x550
[   30.636648]        io_queue_sqe+0x235/0x470
[   30.637281]        io_submit_sqes+0xcce/0xf10
[   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.638465]        do_syscall_64+0x2d/0x40
[   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.639643]
[   30.639643] other info that might help us debug this:
[   30.639643]
[   30.640618]  Possible unsafe locking scenario:
[   30.640618]
[   30.641402]        CPU0                    CPU1
[   30.641938]        ----                    ----
[   30.642664]   lock(&ctx->uring_lock);
[   30.643425]                                lock(&ep->mtx);
[   30.644498]                                lock(&ctx->uring_lock);
[   30.645668]   lock(&ep->mtx);
[   30.646321]
[   30.646321]  *** DEADLOCK ***
[   30.646321]
[   30.647642] 1 lock held by a.out/1128:
[   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.649954]
[   30.649954] stack backtrace:
[   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
[   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   30.652290] Call Trace:
[   30.652688]  dump_stack+0xac/0xe3
[   30.653164]  check_noncircular+0x11e/0x130
[   30.653747]  ? check_prevs_add+0x226/0xb00
[   30.654303]  check_prevs_add+0x226/0xb00
[   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
[   30.655564]  __lock_acquire+0x1237/0x13a0
[   30.656262]  lock_acquire+0x2c7/0x390
[   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.658014]  __mutex_lock+0xae/0x9f0
[   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.659112]  ? mark_held_locks+0x5a/0x80
[   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
[   30.660885]  ? trace_hardirqs_on+0x46/0x110
[   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.662696]  __ep_eventpoll_poll+0x9f/0x220
[   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
[   30.663875]  __io_arm_poll_handler+0xbf/0x220
[   30.664463]  io_issue_sqe+0xa6b/0x13e0
[   30.664984]  ? __lock_acquire+0x782/0x13a0
[   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.666170]  ? __io_queue_sqe+0x10b/0x550
[   30.666725]  __io_queue_sqe+0x10b/0x550
[   30.667252]  ? __fget_files+0x131/0x260
[   30.667791]  ? io_req_prep+0xd8/0x1090
[   30.668316]  ? io_queue_sqe+0x235/0x470
[   30.668868]  io_queue_sqe+0x235/0x470
[   30.669398]  io_submit_sqes+0xcce/0xf10
[   30.669931]  ? xa_load+0xe4/0x1c0
[   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
[   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
[   30.672380]  do_syscall_64+0x2d/0x40
[   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.673503] RIP: 0033:0x7fd89c813239
[   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
[   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
[   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
[   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
[   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
[   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000

This might happen if we do epoll_wait on a uring fd while reading/writing
the former epoll fd in a sqe in the former uring instance.
So let's don't flush cqring overflow list, just do a simple check.

Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 6c503150ae ("io_uring: patch up IOPOLL overflow_flush sync")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:44 -07:00
Pavel Begunkov e5d1bc0a91 io_uring: defer flushing cached reqs
Awhile there are requests in the allocation cache -- use them, only if
those ended go for the stashed memory in comp.free_list. As list
manipulation are generally heavy and are not good for caches, flush them
all or as much as can in one go.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: return success/failure from io_flush_cached_reqs()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov c5eef2b944 io_uring: take comp_state from ctx
__io_queue_sqe() is always called with a non-NULL comp_state, which is
taken directly from context. Don't pass it around but infer from ctx.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Jens Axboe 65453d1efb io_uring: enable req cache for task_work items
task_work is run without utilizing the req alloc cache, so any deferred
items don't get to take advantage of either the alloc or free side of it.
With task_work now being wrapped by io_uring, we can use the ctx
completion state to both use the req cache and the completion flush
batching.

With this, the only request type that cannot take advantage of the req
cache is IRQ driven IO for regular files / block devices. Anything else,
including IOPOLL polled IO to those same tyes, will take advantage of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Jens Axboe 7cbf1722d5 io_uring: provide FIFO ordering for task_work
task_work is a LIFO list, due to how it's implemented as a lockless
list. For long chains of task_work, this can be problematic as the
first entry added is the last one processed. Similarly, we'd waste
a lot of CPU cycles reversing this list.

Wrap the task_work so we have a single task_work entry per task per
ctx, and use that to run it in the right order.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Jens Axboe 1b4c351f6e io_uring: use persistent request cache
Now that we have the submit_state in the ring itself, we can have io_kiocb
allocations that are persistent across invocations. This reduces the time
spent doing slab allocations and frees.

[sil: rebased]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov 6ff119a6e4 io_uring: feed reqs back into alloc cache
Make io_req_free_batch(), which is used for inline executed requests and
IOPOLL, to return requests back into the allocation cache, so avoid
most of kmalloc()/kfree() for those cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov bf019da7fc io_uring: persistent req cache
Don't free batch-allocated requests across syscalls.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov 9ae7246321 io_uring: count ctx refs separately from reqs
Currently batch free handles request memory freeing and ctx ref putting
together. Separate them and use different counters, that will be needed
for reusing reqs memory.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov 3893f39f22 io_uring: remove fallback_req
Remove fallback_req for now, it gets in the way of other changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov 905c172f32 io_uring: submit-completion free batching
io_submit_flush_completions() does completion batching, but may also use
free batching as iopoll does. The main beneficiaries should be buffered
reads/writes and send/recv.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov 6dd0be1e24 io_uring: replace list with array for compl batch
Reincarnation of an old patch that replaces a list in struct
io_compl_batch with an array. It's needed to avoid hooking requests via
their compl.list, because it won't be always available in the future.

It's also nice to split io_submit_flush_completions() to avoid free
under locks and remove unlock/lock with a long comment describing when
it can be done.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov 5087275dba io_uring: don't reinit submit state every time
As now submit_state is retained across syscalls, we can save ourself
from initialising it from ground up for each io_submit_sqes(). Set some
fields during ctx allocation, and just keep them always consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: remove unnecessary zeroing of ctx members]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov ba88ff112b io_uring: remove ctx from comp_state
completion state is closely bound to ctx, we don't need to store ctx
inside as we always have it around to pass to flush.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:43 -07:00
Pavel Begunkov 258b29a93b io_uring: don't keep submit_state on stack
struct io_submit_state is quite big (168 bytes) and going to grow. It's
better to not keep it on stack as it is now. Move it to context, it's
always protected by uring_lock, so it's fine to have only one instance
of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:42 -07:00
Pavel Begunkov 889fca7328 io_uring: don't propagate io_comp_state
There is no reason to drag io_comp_state into opcode handlers, we just
need a flag and the actual work will be done in __io_queue_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 07:28:38 -07:00
Andreas Gruenbacher 7009fa9cd9 gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end
When starting an iomap write, gfs2_quota_lock_check -> gfs2_quota_lock
-> gfs2_quota_hold is called from gfs2_iomap_begin.  At the end of the
write, before unlocking the quotas, punch_hole -> gfs2_quota_hold can be
called again in gfs2_iomap_end, which is incorrect and leads to a failed
assertion.  Instead, move the call to gfs2_quota_unlock before the call
to punch_hole to fix that.

Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-10 09:51:06 +01:00
Ronnie Sahlberg a0f85e38a3 cifs: do not disable noperm if multiuser mount option is not provided
Fixes small regression in implementation of new mount API.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reported-by: Hyunchul Lee <hyc.lee@gmail.com>
Tested-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-09 20:47:05 -06:00
Pavel Begunkov 61e9820304 io_uring: make op handlers always take issue flags
Make opcode handler interfaces a bit more consistent by always passing
in issue flags. Bulky but pretty easy and mechanical change.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-09 19:15:14 -07:00
Pavel Begunkov 45d189c606 io_uring: replace force_nonblock with flags
Replace bool force_nonblock with flags. It has a long standing goal of
differentiating context from which we execute. Currently we have some
subtle places where some invariants, like holding of uring_lock, are
subtly inferred.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-09 19:15:13 -07:00
Seth Forshee ad69c389ec tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha
As with s390, alpha is a 64-bit architecture with a 32-bit ino_t.  With
CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and
display "inode64" in the mount options, whereas passing "inode64" in the
mount options will fail.  This leads to erroneous behaviours such as
this:

  # mkdir mnt
  # mount -t tmpfs nodev mnt
  # mount -o remount,rw mnt
  mount: /home/ubuntu/mnt: mount point not mounted or bad option.

Prevent CONFIG_TMPFS_INODE64 from being selected on alpha.

Link: https://lkml.kernel.org/r/20210208215726.608197-1-seth.forshee@canonical.com
Fixes: ea3271f719 ("tmpfs: support 64-bit inums per-sb")
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: <stable@vger.kernel.org>	[5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09 17:26:44 -08:00
Seth Forshee b85a7a8bb5 tmpfs: disallow CONFIG_TMPFS_INODE64 on s390
Currently there is an assumption in tmpfs that 64-bit architectures also
have a 64-bit ino_t.  This is not true on s390 which has a 32-bit ino_t.
With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers
and display "inode64" in the mount options, but passing the "inode64"
mount option will fail.  This leads to the following behavior:

  # mkdir mnt
  # mount -t tmpfs nodev mnt
  # mount -o remount,rw mnt
  mount: /home/ubuntu/mnt: mount point not mounted or bad option.

As mount sees "inode64" in the mount options and thus passes it in the
options for the remount.

So prevent CONFIG_TMPFS_INODE64 from being selected on s390.

Link: https://lkml.kernel.org/r/20210205230620.518245-1-seth.forshee@canonical.com
Fixes: ea3271f719 ("tmpfs: support 64-bit inums per-sb")
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: <stable@vger.kernel.org>	[5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09 17:26:44 -08:00
Phillip Lougher 506220d2ba squashfs: add more sanity checks in xattr id lookup
Sysbot has reported a warning where a kmalloc() attempt exceeds the
maximum limit.  This has been identified as corruption of the xattr_ids
count when reading the xattr id lookup table.

This patch adds a number of additional sanity checks to detect this
corruption and others.

1. It checks for a corrupted xattr index read from the inode.  This could
   be because the metadata block is uncompressed, or because the
   "compression" bit has been corrupted (turning a compressed block
   into an uncompressed block).  This would cause an out of bounds read.

2. It checks against corruption of the xattr_ids count.  This can either
   lead to the above kmalloc failure, or a smaller than expected
   table to be read.

3. It checks the contents of the index table for corruption.

[phillip@squashfs.org.uk: fix checkpatch issue]
  Link: https://lkml.kernel.org/r/270245655.754655.1612770082682@webmail.123-reg.co.uk

Link: https://lkml.kernel.org/r/20210204130249.4495-5-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: syzbot+2ccea6339d368360800d@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09 17:26:44 -08:00
Phillip Lougher eabac19e40 squashfs: add more sanity checks in inode lookup
Sysbot has reported an "slab-out-of-bounds read" error which has been
identified as being caused by a corrupted "ino_num" value read from the
inode.  This could be because the metadata block is uncompressed, or
because the "compression" bit has been corrupted (turning a compressed
block into an uncompressed block).

This patch adds additional sanity checks to detect this, and the
following corruption.

1. It checks against corruption of the inodes count.  This can either
   lead to a larger table to be read, or a smaller than expected
   table to be read.

   In the case of a too large inodes count, this would often have been
   trapped by the existing sanity checks, but this patch introduces
   a more exact check, which can identify too small values.

2. It checks the contents of the index table for corruption.

[phillip@squashfs.org.uk: fix checkpatch issue]
  Link: https://lkml.kernel.org/r/527909353.754618.1612769948607@webmail.123-reg.co.uk

Link: https://lkml.kernel.org/r/20210204130249.4495-4-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: syzbot+04419e3ff19d2970ea28@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09 17:26:44 -08:00
Phillip Lougher f37aa4c736 squashfs: add more sanity checks in id lookup
Sysbot has reported a number of "slab-out-of-bounds reads" and
"use-after-free read" errors which has been identified as being caused
by a corrupted index value read from the inode.  This could be because
the metadata block is uncompressed, or because the "compression" bit has
been corrupted (turning a compressed block into an uncompressed block).

This patch adds additional sanity checks to detect this, and the
following corruption.

1. It checks against corruption of the ids count.  This can either
   lead to a larger table to be read, or a smaller than expected
   table to be read.

   In the case of a too large ids count, this would often have been
   trapped by the existing sanity checks, but this patch introduces
   a more exact check, which can identify too small values.

2. It checks the contents of the index table for corruption.

Link: https://lkml.kernel.org/r/20210204130249.4495-3-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: syzbot+b06d57ba83f604522af2@syzkaller.appspotmail.com
Reported-by: syzbot+c021ba012da41ee9807c@syzkaller.appspotmail.com
Reported-by: syzbot+5024636e8b5fd19f0f19@syzkaller.appspotmail.com
Reported-by: syzbot+bcbc661df46657d0fa4f@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09 17:26:44 -08:00
Phillip Lougher e812cbbbbb squashfs: avoid out of bounds writes in decompressors
Patch series "Squashfs: fix BIO migration regression and add sanity checks".

Patch [1/4] fixes a regression introduced by the "migrate from
ll_rw_block usage to BIO" patch, which has produced a number of
Sysbot/Syzkaller reports.

Patches [2/4], [3/4], and [4/4] fix a number of filesystem corruption
issues which have produced Sysbot reports in the id, inode and xattr
lookup code.

Each patch has been tested against the Sysbot reproducers using the
given kernel configuration.  They have the appropriate "Reported-by:"
lines added.

Additionally, all of the reproducer filesystems are indirectly fixed by
patch [4/4] due to the fact they all have xattr corruption which is now
detected there.

Additional testing with other configurations and architectures (32bit,
big endian), and normal filesystems has also been done to trap any
inadvertent regressions caused by the additional sanity checks.

This patch (of 4):

This is a regression introduced by the patch "migrate from ll_rw_block
usage to BIO".

Sysbot/Syskaller has reported a number of "out of bounds writes" and
"unable to handle kernel paging request in squashfs_decompress" errors
which have been identified as a regression introduced by the above
patch.

Specifically, the patch removed the following sanity check

        if (length < 0 || length > output->length ||
		(index + length) > msblk->bytes_used)

This check did two things:

1. It ensured any reads were not beyond the end of the filesystem

2. It ensured that the "length" field read from the filesystem
   was within the expected maximum length.  Without this any
   corrupted values can over-run allocated buffers.

Link: https://lkml.kernel.org/r/20210204130249.4495-1-phillip@squashfs.org.uk
Link: https://lkml.kernel.org/r/20210204130249.4495-2-phillip@squashfs.org.uk
Fixes: 93e72b3c61 ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: syzbot+6fba78f99b9afd4b5634@syzkaller.appspotmail.com
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Philippe Liard <pliard@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-09 17:26:44 -08:00
Ronnie Sahlberg abd4af47d3 cifs: fix dfs-links
This fixes a regression following dfs links that was introduced in the
patch series for the new mount api.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-09 10:59:52 -06:00
Pan Bian 70779b8973 fs/affs: release old buffer head on error path
The reference count of the old buffer head should be decremented on path
that fails to get the new buffer head.

Fixes: 6b4657667b ("fs/affs: add rename exchange")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 17:11:03 +01:00
Trond Myklebust 848fdd6239 NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2021-02-09 09:08:22 -05:00
Paolo Bonzini 9fd6dad126 mm: provide a saner PTE walking API for modules
Currently, the follow_pfn function is exported for modules but
follow_pte is not.  However, follow_pfn is very easy to misuse,
because it does not provide protections (so most of its callers
assume the page is writable!) and because it returns after having
already unlocked the page table lock.

Provide instead a simplified version of follow_pte that does
not have the pmdpp and range arguments.  The older version
survives as follow_invalidate_pte() for use by fs/dax.c.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09 07:05:44 -05:00
Naohiro Aota 9d294a685f btrfs: zoned: enable to mount ZONED incompat flag
This final patch adds the ZONED incompat flag to the supported flags
and enables to mount ZONED flagged file system.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:52:24 +01:00
Naohiro Aota b528f46713 btrfs: zoned: deal with holes writing out tree-log pages
Since the zoned filesystem requires sequential write out of metadata, we
cannot proceed with a hole in tree-log pages. When such a hole exists,
btree_write_cache_pages() will return -EAGAIN. This happens when someone,
e.g., a concurrent transaction commit, writes a dirty extent in this
tree-log commit.

If we are not going to wait for the extents, we can hope the concurrent
writing fills the hole for us. So, we can ignore the error in this case and
hope the next write will succeed.

If we want to wait for them and got the error, we cannot wait for them
because it will cause a deadlock. So, let's bail out to a full commit in
this case.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:52:24 +01:00
Naohiro Aota 3ddebf27fc btrfs: zoned: reorder log node allocation on zoned filesystem
This is the 3/3 patch to enable tree-log on zoned filesystems.

The allocation order of nodes of "fs_info->log_root_tree" and nodes of
"root->log_root" is not the same as the writing order of them. So, the
writing causes unaligned write errors.

Reorder the allocation of them by delaying allocation of the root node of
"fs_info->log_root_tree," so that the node buffers can go out sequentially
to devices.

Cc: Filipe Manana <fdmanana@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:48:41 +01:00
Naohiro Aota fa1a0f42a0 btrfs: zoned: serialize log transaction on zoned filesystems
This is the 2/3 patch to enable tree-log on zoned filesystems.

Since we can start more than one log transactions per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at
the time of a log transaction commit. The nodes of the global log root
tree (fs_info->log_root_tree), also have the same problem with mixed
allocation.

Serializes log transactions by waiting for a committing transaction when
someone tries to start a new transaction, to avoid the mixed allocation
problem. We must also wait for running log transactions from another
subvolume, but there is no easy way to detect which subvolume root is
running a log transaction. So, this patch forbids starting a new log
transaction when other subvolumes already allocated the global log root
tree.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:48:37 +01:00
Naohiro Aota 40ab3be102 btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
This is the 1/3 patch to enable tree log on zoned filesystems.

The tree-log feature does not work on a zoned filesystem as is. Blocks for
a tree-log tree are allocated mixed with other metadata blocks and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which has a different timing than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes that zoned filesystems must avoid.

Introduce a dedicated block group for tree-log blocks, so that tree-log
blocks and other metadata blocks can be separate write streams.  As a
result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group and assigns
"treelog_bg" on-demand on tree-log block allocation time.

This commit extends the zoned block allocator to use the block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:08 +01:00
Naohiro Aota 6ab6ebb760 btrfs: split alloc_log_tree()
This is a preparation patch for the next patch. Split alloc_log_tree()
into two parts. The first one allocating the tree structure, remains in
alloc_log_tree() and the second part allocating the tree node, which is
moved into btrfs_alloc_log_tree_node().

Also export the latter part is to be used in the next patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota f7ef5287a6 btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.

We can consider three methods to repair an IO failure in zoned filesystems:

(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
    the new extent
(3) Relocate the corresponding block group

Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.

Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.

Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.

This commit also supports repairing in the scrub process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 32430c6148 btrfs: zoned: enable relocation on a zoned filesystem
Currently fallocate() is disabled on a zoned filesystem. Since current
relocation process relies on preallocation to move file data extents, it
must be handled differently.

On a zoned filesystem, we just truncate the inode to the size that we
wanted to pre-allocate. Then, we flush dirty pages on the file before
finishing the relocation process. run_delalloc_zoned() will handle all
the allocations and submit IOs to the underlying layers.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 7db1c5d14d btrfs: zoned: support dev-replace in zoned filesystems
This is 4/4 patch to implement device-replace on zoned filesystems.

Even after the copying is done, the write pointers of the source device
and the destination device may not be synchronized. For example, when
the last allocated extent is freed before device-replace process, the
extent is not copied, leaving a hole there.

Synchronize the write pointers by writing zeroes to the destination
device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota de17addce7 btrfs: zoned: implement copying for zoned device-replace
This is 3/4 patch to implement device-replace on zoned filesystems.

This commit implements copying. To do this, it tracks the write pointer
during the device replace process. As device-replace's copy process is
smart enough to only copy used extents on the source device, we have to
fill the gap to honor the sequential write requirement in the target
device.

The device-replace process on zoned filesystems must copy or clone all
the extents in the source device exactly once. So, we need to ensure
allocations started just before the dev-replace process to have their
corresponding extent information in the B-trees.
finish_extent_writes_for_zoned() implements that functionality, which
basically is the removed code in the commit 042528f8d8 ("Btrfs: fix
block group remaining RO forever after error during device replace").

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 6143c23ccc btrfs: zoned: implement cloning for zoned device-replace
This is 2/4 patch to implement device replace for zoned filesystems.

In zoned mode, a block group must be either copied (from the source
device to the target device) or cloned (to both devices).

Implement the cloning part. If a block group targeted by an IO is marked
to copy, we should not clone the IO to the destination device, because
the block group is eventually copied by the replace process.

This commit also handles cloning of device reset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 78ce9fc269 btrfs: zoned: mark block groups to copy for device-replace
This is the 1/4 patch to support device-replace on zoned filesystems.

We have two types of IOs during the device replace process. One is an IO
to "copy" (by the scrub functions) all the device extents from the source
device to the destination device. The other one is an IO to "clone" (by
handle_ops_on_dev_replace()) new incoming write IOs from users to the
source device into the target device.

Cloning incoming IOs can break the sequential write rule in on target
device. When a write is mapped in the middle of a block group, the IO is
directed to the middle of a target device zone, which breaks the
sequential write requirement.

However, the cloning function cannot be disabled since incoming IOs
targeting already copied device extents must be cloned so that the IO is
executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not yet copied region. Since we have a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned nor copied.

So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as a
target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.

Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
is the last stripe in the block group. With the last stripe copied,
the to_copy flag is finally disabled. Afterwards we can safely clone
incoming IOs on this block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 4eef29ef63 btrfs: zoned: do not use async metadata checksum on zoned filesystems
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize
the metadata write IOs.

Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by async checksum workers as these workers are per CPU
and not per zone.

To preserve write bio ordering, we disable async metadata checksum on a
zoned filesystem. This does not result in lower performance with HDDs as
a single CPU core is fast enough to do checksum for a single zone write
stream with the maximum possible bandwidth of the device. If multiple
zones are being written simultaneously, HDD seek overhead lowers the
achievable maximum bandwidth, resulting again in a per zone checksum
serialization not affecting the performance.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 24c0a7227f btrfs: zoned: wait for existing extents before truncating
When truncating a file, file buffers which have already been allocated
but not yet written may be truncated. Truncating these buffers could
cause breakage of a sequential write pattern in a block group if the
truncated blocks are for example followed by blocks allocated to another
file. To avoid this problem, always wait for write out of all unwritten
buffers before proceeding with the truncate execution.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 0bc09ca129 btrfs: zoned: serialize metadata IO
We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.

We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.

Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.

Furthermore, this adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which can break when
attempting to write back blocks in an unfinished transaction. If the
writing out failed because of a hole and the write out is for data
integrity (WB_SYNC_ALL), it returns EAGAIN.

A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 42c0110009 btrfs: zoned: introduce dedicated data write path for zoned filesystems
If more than one IO is issued for one file extent, these IO can be
written to separate regions on a device. Since we cannot map one file
extent to such a separate area on a zoned filesystem, we need to follow
the "one IO == one ordered extent" rule.

The normal buffered, uncompressed and not pre-allocated write path (used
by cow_file_range()) sometimes does not follow this rule. It can write a
part of an ordered extent when specified a region to write e.g., when
its called from fdatasync().

Introduce a dedicated (uncompressed buffered) data write path for zoned
filesystems, that will COW the region and write it at once.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Naohiro Aota 544d24f9de btrfs: zoned: enable zone append writing for direct IO
Likewise to buffered IO, enable zone append writing for direct IO when
its used on a zoned block device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Naohiro Aota d8e3fb106f btrfs: zoned: use ZONE_APPEND write for zoned mode
Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.

Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.

Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().

And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Johannes Thumshirn 24533f6a9a btrfs: save irq flags when looking up an ordered extent
A following patch will add another caller of
btrfs_lookup_ordered_extent(), but from a bio's endio context.

btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally
disables interrupts. Change this to spin_lock_irqsave() so interrupts
aren't disabled and re-enabled unconditionally.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Johannes Thumshirn 08f455593f btrfs: zoned: cache if block group is on a sequential zone
On a zoned filesystem, cache if a block group is on a sequential write
only zone.

On sequential write only zones, we can use REQ_OP_ZONE_APPEND for
writing data, therefore provide btrfs_use_zone_append() to figure out if
IO is targeting a sequential write only zone and we can use
REQ_OP_ZONE_APPEND for data writing.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota 138082f366 btrfs: extend btrfs_rmap_block for specifying a device
btrfs_rmap_block currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.

Extend the function to match to a specified device. The old functionality
of querying all devices is left intact by specifying NULL as target
device.

A block_device instead of a btrfs_device is passed into btrfs_rmap_block,
as this function is intended to reverse-map the result of a bio, which
only has a block_device.

Also export the function for later use.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Johannes Thumshirn cacb2cea46 btrfs: zoned: check if bio spans across an ordered extent
To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.

Ensure that constructing bio does not span more than an ordered extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota d22002fd37 btrfs: zoned: split ordered extent when bio is sent
For a zone append write, the device decides the location the data is being
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.

Implement splitting of an ordered extent and extent map on bio submission
to adhere to the rule.

extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
corresponding ordered extent so that the ordered extent's region fits into
one bio and the corresponding device limits.

Several sanity checks need to be done in extract_ordered_extent() e.g.

- We cannot split once end_bio'd ordered extent because we cannot divide
  ordered->bytes_left for the split ones
- We do not expect a compressed ordered extent
- We should not have checksum list because we omit the list splitting.
  Since the function is called before btrfs_wq_submit_bio() or
  btrfs_csum_one_bio(), this should be always ensured.

We also need to split an extent map by creating a new one. If not,
unpin_extent_cache() complains about the difference between the start of
the extent map and the file's logical offset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00