Commit Graph

68373 Commits

Author SHA1 Message Date
Jeff Layton 0f51a98361 ceph: don't reach into request header for readdir info
We already have a pointer to the argument struct in req->r_args. Use that
instead of groveling around in the ceph_mds_request_head.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Xiubo Li 968cd14edc ceph: set osdmap epoch for setxattr
When setting the file/dir layout, it may need data pool info. So
in mds server, it needs to check the osdmap. At present, if mds
doesn't find the data pool specified, it will try to get the latest
osdmap. Now if pass the osd epoch for setxattr, the mds server can
only check this epoch of osdmap.

URL: https://tracker.ceph.com/issues/48504
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Colin Ian King 4a756db2a1 ceph: remove redundant assignment to variable i
The variable i is being initialized with a value that is never read
and it is being updated later with a new value in a for-loop.  The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Luis Henriques dd980fc0d5 ceph: add ceph.caps vxattr
Add a new vxattr that allows userspace to list the caps for a specific
directory or file.

[ jlayton: change format delimiter to '/' ]

Signed-off-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Jeff Layton bca9fc14c7 ceph: when filling trace, call ceph_get_inode outside of mutexes
Geng Jichao reported a rather complex deadlock involving several
moving parts:

1) readahead is issued against an inode and some of its pages are locked
   while the read is in flight

2) the same inode is evicted from the cache, and this task gets stuck
   waiting for the page lock because of the above readahead

3) another task is processing a reply trace, and looks up the inode
   being evicted while holding the s_mutex. That ends up waiting for the
   eviction to complete

4) a write reply for an unrelated inode is then processed in the
   ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer
   caps, and that gets stuck waiting on the s_mutex held by 3.

The reply to "1" is stuck behind the write reply in "4", so we deadlock
at that point.

This patch changes the trace processing to call ceph_get_inode outside
of the s_mutex and snap_rwsem, which should break the cycle above.

[ idryomov: break unnecessarily long lines ]

URL: https://tracker.ceph.com/issues/47998
Reported-by: Geng Jichao <gengjichao@jd.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Luis Henriques 6646ea1c8e Revert "ceph: allow rename operation under different quota realms"
This reverts commit dffdcd7145.

When doing a rename across quota realms, there's a corner case that isn't
handled correctly.  Here's a testcase:

  mkdir files limit
  truncate files/file -s 10G
  setfattr limit -n ceph.quota.max_bytes -v 1000000
  mv files limit/

The above will succeed because ftruncate(2) won't immediately notify the
MDSs with the new file size, and thus the quota realms stats won't be
updated.

Since the possible fixes for this issue would have a huge performance impact,
the solution for now is to simply revert to returning -EXDEV when doing a cross
quota realms rename.

URL: https://tracker.ceph.com/issues/48203
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton 68cbb8056a ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Luis Henriques ccd1acdf1c ceph: downgrade warning from mdsmap decode to debug
While the MDS cluster is unstable and changing state the client may get
mdsmap updates that will trigger warnings:

  [144692.478400] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489552] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489580] ceph: mdsmap_decode got incorrect state(up:standby-replay)

This patch downgrades these warnings to debug, as they may flood the logs
if the cluster is unstable for a while.

Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Luis Henriques e5cafce3ad ceph: fix race in concurrent __ceph_remove_cap invocations
A NULL pointer dereference may occur in __ceph_remove_cap with some of the
callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
remove_session_caps_cb. Those callers hold the session->s_mutex, so they
are prevented from concurrent execution, but ceph_evict_inode does not.

Since the callers of this function hold the i_ceph_lock, the fix is simply
a matter of returning immediately if caps->ci is NULL.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/43272
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton 4a357f5069 ceph: pass down the flags to grab_cache_page_write_begin
write_begin operations are passed a flags parameter that we need to
mirror here, so that we don't (e.g.) recurse back into filesystem code
inappropriately.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Xiubo Li 5a9e2f5d55 ceph: add ceph.{cluster_fsid/client_id} vxattrs
These two vxattrs will only exist in local client side, with which
we can easily know which mountpoint the file belongs to and also
they can help locate the debugfs path quickly.

URL: https://tracker.ceph.com/issues/48057
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Xiubo Li 247b1f19db ceph: add status debugfs file
This will help list some useful client side info, like the client
entity address/name and blocklisted status, etc.

URL: https://tracker.ceph.com/issues/48057
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton 04fabb1199 ceph: ensure we have Fs caps when fetching dir link count
The link count for a directory is defined as inode->i_subdirs + 2,
(for "." and ".."). i_subdirs is only populated when Fs caps are held.
Ensure we grab Fs caps when fetching the link count for a directory.

[ idryomov: break unnecessarily long line ]

URL: https://tracker.ceph.com/issues/48125
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Xiubo Li 8ba3b8c7fb ceph: send dentry lease metrics to MDS daemon
For the old ceph version, if it received this one metric message
containing the dentry lease metric info, it will just ignore it.

URL: https://tracker.ceph.com/issues/43423
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton 81048c00d1 ceph: acquire Fs caps when getting dir stats
We only update the inode's dirstats when we have Fs caps from the MDS.

Declare a new VXATTR_FLAG_DIRSTAT that we set on all dirstats, and have
the vxattr handling code acquire those caps when it's set.

URL: https://tracker.ceph.com/issues/48104
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton 06a1ad438b ceph: fix up some warnings on W=1 builds
Convert some decodes into unused variables into skips, and fix up some
non-kerneldoc comment headers to not start with "/**".

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton 4ae3713fe4 ceph: queue MDS requests to REJECTED sessions when CLEANRECOVER is set
Ilya noticed that the first access to a blacklisted mount would often
get back -EACCES, but then subsequent calls would be OK. The problem is
in __do_request. If the session is marked as REJECTED, a hard error is
returned instead of waiting for a new session to come into being.

When the session is REJECTED and the mount was done with
recover_session=clean, queue the request to the waiting_for_map queue,
which will be awoken after tearing down the old session. We can only
do this for sync requests though, so check for async ones first and
just let the callers redrive a sync request.

URL: https://tracker.ceph.com/issues/47385
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton dbeec07bc8 ceph: remove timeout on allowing reconnect after blocklisting
30 minutes is a long time to wait, and this makes it difficult to test
the feature by manually blocklisting clients. Remove the timeout
infrastructure and just allow the client to reconnect at will.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:46 +01:00
Jeff Layton 50c9132ddf ceph: add new RECOVER mount_state when recovering session
When recovering a session (a'la recover_session=clean), we want to do
all of the operations that we do on a forced umount, but changing the
mount state to SHUTDOWN is can cause queued MDS requests to fail when
the session comes back. Most of those can idle until the session is
recovered in this situation.

Reserve SHUTDOWN state for forced umount, and make a new RECOVER state
for the forced reconnect situation. Change several tests for equality with
SHUTDOWN to test for that or RECOVER.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:46 +01:00
Jeff Layton aa5c791053 ceph: make fsc->mount_state an int
This field is an unsigned long currently, which is a bit of a waste on
most arches since this just holds an enum. Make it (signed) int instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:46 +01:00
Jeff Layton dc167e38a0 ceph: don't WARN when removing caps due to blocklisting
We expect to remove dirty caps when the client is blocklisted. Don't
throw a warning in that case.

[ idryomov: break unnecessarily long line ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:46 +01:00
Linus Torvalds 405f868f13 - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in the end
(Gabriel Krisman Bertazi)
 
 - All kinds of minor cleanups all over the tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XgtoACgkQEsHwGGHe
 VUqGuA/9GqN2zNQdhgRvAQ+FLZiOYK9MfXcoayfMq8T61VRPDBWaQRfVYKmfmEjS
 0l5OnYgZQ9n6vzqFy6pmgc/ix8Jr553dZp5NCamcOqjCTcuO/LwRRh+ZBeFSBTPi
 r2qFYKKRYvM7nbyUMm4WqvAakxJ18xsjNbIslr9Aqe8WtHBKKX3MOu8SOpFtGyXz
 aEc4rhsS45iZa5gTXhvOn73tr3yHGWU1rzyyAAAmDGTgAxRwsTna8v16C4+v+Bua
 Zg18Wiutj8ZjtFpzKJtGWGZoSBap3Jw2Ys64g42MBQUE56KY/99tQVo/SvbYvvlf
 PHWLH0f3rPNJ6J2qeKwhtNzPlEAH/6e416A1/6TVwsK+8pdfGmkfaQh2iDHLhJ5i
 CSwF61H44ZaE3pc1tHHbC5ALvydPlup7D4MKgztfq0mZ3OoV2Vg7dtyyr+Ybz72b
 G+Kl/tmyacQTXo0FiYbZKETo3/VfTdBXGyVax1rHkx3pt8zvhFg3kxb1TT/l/CoM
 eSTx53PtTdVtbGOq1CjnUm0FKlbh4+kLoNuo9DYKeXUQBs8PWOCZmL3wXmm4cqlZ
 mDZVWvll7CjToY8izzcE/AG279cWkgcL5Tcg7W7CR66+egfDdpuqOZ4tv4TyzoWq
 0J7WeNj+TAo98b7RA0Ux8LOlszRxS2ykuI6uB2MgwCaRMbbaQao=
 =lLiH
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:
 "Another branch with a nicely negative diffstat, just the way I
  like 'em:

   - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in
     the end (Gabriel Krisman Bertazi)

   - All kinds of minor cleanups all over the tree"

* tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/ia32_signal: Propagate __user annotation properly
  x86/alternative: Update text_poke_bp() kernel-doc comment
  x86/PCI: Make a kernel-doc comment a normal one
  x86/asm: Drop unused RDPID macro
  x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
  x86/head64: Remove duplicate include
  x86/mm: Declare 'start' variable where it is used
  x86/head/64: Remove unused GET_CR2_INTO() macro
  x86/boot: Remove unused finalize_identity_maps()
  x86/uaccess: Document copy_from_user_nmi()
  x86/dumpstack: Make show_trace_log_lvl() static
  x86/mtrr: Fix a kernel-doc markup
  x86/setup: Remove unused MCA variables
  x86, libnvdimm/test: Remove COPY_MC_TEST
  x86: Reclaim TIF_IA32 and TIF_X32
  x86/mm: Convert mmu context ia32_compat into a proper flags field
  x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages()
  elf: Expose ELF header on arch_setup_additional_pages()
  x86/elf: Use e_machine to select start_thread for x32
  elf: Expose ELF header in compat_start_thread()
  ...
2020-12-14 13:45:26 -08:00
Linus Torvalds 9e4b0d55d8 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add speed testing on 1420-byte blocks for networking

  Algorithms:
   - Improve performance of chacha on ARM for network packets
   - Improve performance of aegis128 on ARM for network packets

  Drivers:
   - Add support for Keem Bay OCS AES/SM4
   - Add support for QAT 4xxx devices
   - Enable crypto-engine retry mechanism in caam
   - Enable support for crypto engine on sdm845 in qce
   - Add HiSilicon PRNG driver support"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits)
  crypto: qat - add capability detection logic in qat_4xxx
  crypto: qat - add AES-XTS support for QAT GEN4 devices
  crypto: qat - add AES-CTR support for QAT GEN4 devices
  crypto: atmel-i2c - select CONFIG_BITREVERSE
  crypto: hisilicon/trng - replace atomic_add_return()
  crypto: keembay - Add support for Keem Bay OCS AES/SM4
  dt-bindings: Add Keem Bay OCS AES bindings
  crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd
  crypto: seed - remove trailing semicolon in macro definition
  crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg
  crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg
  crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
  crypto: cpt - Fix sparse warnings in cptpf
  hwrng: ks-sa - Add dependency on IOMEM and OF
  crypto: lib/blake2s - Move selftest prototype into header file
  crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata
  crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
  crypto: ccree - rework cache parameters handling
  crypto: cavium - Use dma_set_mask_and_coherent to simplify code
  crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code
  ...
2020-12-14 12:18:19 -08:00
Linus Torvalds 51895d58c7 fsverity updates for 5.11
Some cleanups for fs-verity:
 
 - Rename some names that have been causing confusion.
 
 - Move structs needed for file signing to the UAPI header.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCX9bedBQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK7s6AQD4xMyyPTAQbznYlxPzn+2d8H3y+qik
 xbX7JwATnyOM4AEApDMGPvokvmEJO4z/ionUunR20i//JIyOklxy8w4whQ0=
 =XO4L
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity updates from Eric Biggers:
 "Some cleanups for fs-verity:

   - Rename some names that have been causing confusion

   - Move structs needed for file signing to the UAPI header"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fs-verity: move structs needed for file signing to UAPI header
  fs-verity: rename "file measurement" to "file digest"
  fs-verity: rename fsverity_signed_digest to fsverity_formatted_digest
  fs-verity: remove filenames from file comments
2020-12-14 12:10:32 -08:00
Linus Torvalds 7c7fdaf6ad fscrypt updates for 5.11
This release there are some fixes for longstanding problems, as well as
 some cleanups:
 
 - Fix a race condition where a duplicate filename could be created in an
   encrypted directory if a syscall that creates a new filename raced
   with the directory's encryption key being added.
 
 - Allow deleting files that use an unsupported encryption policy.
 
 - Simplify the locking for 'struct fscrypt_master_key'.
 
 - Remove kernel-internal constants from the UAPI header.
 
 As usual, all these patches have been in linux-next with no reported
 issues, and I've tested them with xfstests.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCX9bcDxQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK/HRAP95FGQqS47rIEh4LrvS7cohMJxb5NiX
 KokAyB88GgmzLQD/c4Xh+iYOxxhFX5NRgruuoec876DrzsuNbEt7WNJ6CQc=
 =CBoc
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
 "This release there are some fixes for longstanding problems, as well
  as some cleanups:

   - Fix a race condition where a duplicate filename could be created in
     an encrypted directory if a syscall that creates a new filename
     raced with the directory's encryption key being added.

   - Allow deleting files that use an unsupported encryption policy.

   - Simplify the locking for 'struct fscrypt_master_key'.

   - Remove kernel-internal constants from the UAPI header.

  As usual, all these patches have been in linux-next with no reported
  issues, and I've tested them with xfstests"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: allow deleting files with unsupported encryption policy
  fscrypt: unexport fscrypt_get_encryption_info()
  fscrypt: move fscrypt_require_key() to fscrypt_private.h
  fscrypt: move body of fscrypt_prepare_setattr() out-of-line
  fscrypt: introduce fscrypt_prepare_readdir()
  ext4: don't call fscrypt_get_encryption_info() from dx_show_leaf()
  ubifs: remove ubifs_dir_open()
  f2fs: remove f2fs_dir_open()
  ext4: remove ext4_dir_open()
  fscrypt: simplify master key locking
  fscrypt: remove unnecessary calls to fscrypt_require_key()
  ubifs: prevent creating duplicate encrypted filenames
  f2fs: prevent creating duplicate encrypted filenames
  ext4: prevent creating duplicate encrypted filenames
  fscrypt: add fscrypt_is_nokey_name()
  fscrypt: remove kernel-internal constants from UAPI header
2020-12-14 12:06:54 -08:00
Ronnie Sahlberg 5c4b642141 cifs: fix uninitialized variable in smb3_fs_context_parse_param
Addresses an issue noted by the kernel test robot

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-12-14 13:40:43 -06:00
Ronnie Sahlberg 1cb6c3d62c cifs: update mnt_cifs_flags during reconfigure
Many mount flags (e.g. for noperm, noxattr, nobrl,
cifsacl, mfsymlinks and more) can be updated now.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 13:37:43 -06:00
Ronnie Sahlberg 2d39f50c2b cifs: move update of flags into a separate function
This function will set/clear flags that can be changed during mount or remount

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:28:25 -06:00
Ronnie Sahlberg 51acd208bd cifs: remove ctx argument from cifs_setup_cifs_sb
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Ronnie Sahlberg 531f03bc6d cifs: do not allow changing posix_paths during remount
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Ronnie Sahlberg 7c7ee628f8 cifs: uncomplicate printing the iocharset parameter
There is no need to load the default nls to check if the iocharset argument
was specified or not since we have it in cifs_sb->ctx

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Ronnie Sahlberg 6fd4ea88b5 cifs: don't create a temp nls in cifs_setup_ipc
just use the one that is already available in ctx

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Ronnie Sahlberg 387ec58f33 cifs: simplify handling of cifs_sb/ctx->local_nls
Only load/unload local_nls from cifs_sb and just make the ctx
contain a pointer to cifs_sb->ctx.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Ronnie Sahlberg 9ccecae8d1 cifs: we do not allow changing username/password/unc/... during remount
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Ronnie Sahlberg d6a7878340 cifs: add initial reconfigure support
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Ronnie Sahlberg 522aa3b575 cifs: move [brw]size from cifs_sb to cifs_sb->ctx
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Ronnie Sahlberg c741cba2cd cifs: move cifs_cleanup_volume_info[_content] to fs_context.c
and rename it to smb3_cleanup_fs_context[_content]

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:26:30 -06:00
Dmitry Osipenko 427c4f004e cifs: Add missing sentinel to smb3_fs_parameters
Add missing sentinel to smb3_fs_parameters. This fixes ARM32 kernel
crashing once CIFS is registered.

 Unable to handle kernel paging request at virtual address 33626d73
...
 (strcmp) from (fs_validate_description)
 (fs_validate_description) from (register_filesystem)
 (register_filesystem) from (init_cifs [cifs])
 (init_cifs [cifs]) from (do_one_initcall)
 (do_one_initcall) from (do_init_module)
 (do_init_module) from (load_module)
 (load_module) from (sys_finit_module)
 (sys_finit_module) from (ret_fast_syscal)

Fixes: e07724d1cf38 ("cifs: switch to new mount api")
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:21:33 -06:00
Samuel Cabrero 121d947d4f cifs: Handle witness client move notification
This message is sent to tell a client to close its current connection
and connect to the specified address.

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:18:55 -06:00
Ronnie Sahlberg af1e40d9ac cifs: remove actimeo from cifs_sb
Can now be accessed via the ctx

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:23 -06:00
Ronnie Sahlberg 8401e93678 cifs: remove [gu]id/backup[gu]id/file_mode/dir_mode from cifs_sb
We can already access these from cifs_sb->ctx so we no longer need
a local copy in cifs_sb.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:23 -06:00
Steve French ee0dce4926 cifs: remove some minor warnings pointed out by kernel test robot
Correct some trivial warnings caused when new file unc.c
was created. For example:

   In file included from fs/cifs/unc.c:11:
>> fs/cifs/cifsproto.h:44:28: warning: 'struct TCP_Server_Info' declared inside parameter list will not be visible outside of this definition or declaration
      44 | extern int smb_send(struct TCP_Server_Info *, struct smb_hdr *,

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:23 -06:00
Steve French 607dfc79c3 cifs: remove various function description warnings
When compiling with W=1 I noticed various functions that
did not follow proper style in describing (in the comments)
the parameters passed in to the function. For example:

fs/cifs/inode.c:2236: warning: Function parameter or member 'mode' not described in 'cifs_wait_bit_killable'

I did not address the style warnings in two of the six files
(connect.c and misc.c) in order to reduce risk of merge
conflict with pending patches. We can update those later.

Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:23 -06:00
Samuel Cabrero 7d6535b720 cifs: Simplify reconnect code when dfs upcall is enabled
Some witness notifications, like client move, tell the client to
reconnect to a specific IP address. In this situation the DFS failover
code path has to be skipped so clean up as much as possible the
cifs_reconnect() code.

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:23 -06:00
Samuel Cabrero 21077c62e1 cifs: Send witness register messages to userspace daemon in echo task
If the daemon starts after mounting a share, or if it crashes, this
provides a mechanism to register again.

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:23 -06:00
Samuel Cabrero 20fab0da2f cifs: Add witness information to debug data dump
+ Indicate if witness feature is supported
+ Indicate if witness is used when dumping tcons
+ Dumps witness registrations. Example:
  Witness registrations:
  Id: 1 Refs: 1 Network name: 'fs.fover.ad'(y) Share name: 'share1'(y) \
    Ip address: 192.168.103.200(n)

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Samuel Cabrero fed979a7e0 cifs: Set witness notification handler for messages from userspace daemon
+ Set a handler for the witness notification messages received from the
  userspace daemon.

+ Handle the resource state change notification. When the resource
  becomes unavailable or available set the tcp status to
  CifsNeedReconnect for all channels.

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Samuel Cabrero bf80e5d425 cifs: Send witness register and unregister commands to userspace daemon
+ Define the generic netlink family commands and message attributes to
  communicate with the userspace daemon

+ The register and unregister commands are sent when connecting or
  disconnecting a tree. The witness registration keeps a pointer to
  the tcon and has the same lifetime.

+ Each registration has an id allocated by an IDR. This id is sent to the
  userspace daemon in the register command, and will be included in the
  notification messages from the userspace daemon to retrieve from the
  IDR the matching registration.

+ The authentication information is bundled in the register message.
  If kerberos is used the message just carries a flag.

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Steve French e68f4a7bf0 cifs: minor updates to Kconfig
Correct references to fs/cifs/README which has been replaced by
Documentation/filesystems/admin-guide/cifs/usage.rst, and also
correct a typo.

Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Samuel Cabrero 0ac4e2919a cifs: add witness mount option and data structs
Add 'witness' mount option to register for witness notifications.

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Samuel Cabrero 06f08dab3c cifs: Register generic netlink family
Register a new generic netlink family to talk to the witness service
userspace daemon.

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Steve French 047092ffe2 cifs: cleanup misc.c
misc.c was getting a little large, move two of the UNC parsing relating
functions to a new C file unc.c which makes the coding of the
upcoming witness protocol patch series a little cleaner as well.

Suggested-by: Rafal Szczesniak <rafal@elbingbrewery.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Steve French bc04499477 cifs: minor kernel style fixes for comments
Trivial fix for a few comments which didn't follow kernel style

Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Samuel Cabrero e73a42e07a cifs: Make extract_sharename function public
Move the function to misc.c

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Samuel Cabrero a87e67254b cifs: Make extract_hostname function public
Move the function to misc.c and give it a public header.

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-14 09:16:22 -06:00
Miklos Szeredi 459c7c565a ovl: unprivieged mounts
Enable unprivileged user namespace mounts of overlayfs.  Overlayfs's
permission model (*) ensures that the mounter itself cannot gain additional
privileges by the act of creating an overlayfs mount.

This feature request is coming from the "rootless" container crowd.

(*) Documentation/filesystems/overlayfs.txt#Permission model

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14 15:26:14 +01:00
Miklos Szeredi 87b2c60c61 ovl: do not get metacopy for userxattr
When looking up an inode on the lower layer for which the mounter lacks
read permisison the metacopy check will fail.  This causes the lookup to
fail as well, even though the directory is readable.

So ignore EACCES for the "userxattr" case and assume no metacopy for the
unreadable file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14 15:26:14 +01:00
Miklos Szeredi b6650dab40 ovl: do not fail because of O_NOATIME
In case the file cannot be opened with O_NOATIME because of lack of
capabilities, then clear O_NOATIME instead of failing.

Remove WARN_ON(), since it would now trigger if O_NOATIME was cleared.
Noticed by Amir Goldstein.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14 15:26:14 +01:00
Miklos Szeredi 6939f977c5 ovl: do not fail when setting origin xattr
Comment above call already says this, but only EOPNOTSUPP is ignored, other
failures are not.

For example setting "user.*" will fail with EPERM on symlink/special.

Ignore this error as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14 15:26:14 +01:00
Miklos Szeredi 2d2f2d7322 ovl: user xattr
Optionally allow using "user.overlay." namespace instead of
"trusted.overlay."

This is necessary for overlayfs to be able to be mounted in an unprivileged
namepsace.

Make the option explicit, since it makes the filesystem format be
incompatible.

Disable redirect_dir and metacopy options, because these would allow
privilege escalation through direct manipulation of the
"user.overlay.redirect" or "user.overlay.metacopy" xattrs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
2020-12-14 15:26:14 +01:00
Miklos Szeredi 82a763e61e ovl: simplify file splice
generic_file_splice_read() and iter_file_splice_write() will call back into
f_op->iter_read() and f_op->iter_write() respectively.  These already do
the real file lookup and cred override.  So the code in ovl_splice_read()
and ovl_splice_write() is redundant.

In addition the ovl_file_accessed() call in ovl_splice_write() is
incorrect, though probably harmless.

Fix by calling generic_file_splice_read() and iter_file_splice_write()
directly.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14 15:26:14 +01:00
Miklos Szeredi 89bdfaf93d ovl: make ioctl() safe
ovl_ioctl_set_flags() does a capability check using flags, but then the
real ioctl double-fetches flags and uses potentially different value.

The "Check the capability before cred override" comment misleading: user
can skip this check by presenting benign flags first and then overwriting
them to non-benign flags.

Just remove the cred override for now, hoping this doesn't cause a
regression.

The proper solution is to create a new setxflags i_op (patches are in the
works).

Xfstests don't show a regression.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: dab5ca8fd9 ("ovl: add lsattr/chattr support")
Cc: <stable@vger.kernel.org> # v4.19
2020-12-14 15:26:14 +01:00
Miklos Szeredi c846af050f ovl: check privs before decoding file handle
CAP_DAC_READ_SEARCH is required by open_by_handle_at(2) so check it in
ovl_decode_real_fh() as well to prevent privilege escalation for
unprivileged overlay mounts.

[Amir] If the mounter is not capable in init ns, ovl_check_origin() and
ovl_verify_index() will not function as expected and this will break index
and nfs export features.  So check capability in ovl_can_decode_fh(), to
auto disable those features.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14 15:26:14 +01:00
Miklos Szeredi 3078d85c9a vfs: verify source area in vfs_dedupe_file_range_one()
Call remap_verify_area() on the source file as well as the destination.

When called from vfs_dedupe_file_range() the check as already been
performed, but not so if called from layered fs (overlayfs, etc...)

Could ommit the redundant check in vfs_dedupe_file_range(), but leave for
now to get error early (for fear of breaking backward compatibility).

This call shouldn't be performance sensitive.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14 15:26:13 +01:00
Miklos Szeredi 7c03e2cda4 vfs: move cap_convert_nscap() call into vfs_setxattr()
cap_convert_nscap() does permission checking as well as conversion of the
xattr value conditionally based on fs's user-ns.

This is needed by overlayfs and probably other layered fs (ecryptfs) and is
what vfs_foo() is supposed to do anyway.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
2020-12-14 15:26:13 +01:00
Trond Myklebust 5c3485bb12 NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
We have no way of tracking server READ_PLUS support in pNFS for now, so
just disable it.

Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:08 -05:00
Trond Myklebust 7aedc687c9 NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
If the server returns more data than we have buffer space for, then
we need to truncate and exit early.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:08 -05:00
Trond Myklebust 503b934a75 NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
Expanding the READ_PLUS extents can cause the read buffer to overflow.
If it does, then don't error, but just exit early.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:08 -05:00
Trond Myklebust dac3b1059b NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
If a hole extends beyond the READ_PLUS read buffer, then we want to fill
just the remaining buffer with zeros. Also ignore eof...

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:08 -05:00
Trond Myklebust 82f98c8b11 NFSv4.2: decode_read_plus_hole() needs to check the extent offset
The server is allowed to return a hole extent with an offset that starts
before the offset supplied in the READ_PLUS argument. Ensure that we
support that case too.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust 5c4afe2ab6 NFSv4.2: decode_read_plus_data() must skip padding after data segment
All XDR opaque object sizes are 32-bit aligned, and a data segment is no
exception.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust 1ee6310119 NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Geliang Tang 1f70ea7009 NFSv4.1: use BITS_PER_LONG macro in nfs4session.h
Use the existing BITS_PER_LONG macro instead of calculating the value.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Frank van der Linden a1f26739cc NFSv4.2: improve page handling for GETXATTR
XDRBUF_SPARSE_PAGES can cause problems for the RDMA transport,
and it's easy enough to allocate enough pages for the request
up front, so do that.

Also, since we've allocated the pages anyway, use the full
page aligned length for the receive buffer. This will allow
caching of valid replies that are too large for the caller,
but that still fit in the allocated pages.

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Ronnie Sahlberg a2a52a8a36 cifs: get rid of cifs_sb->mountdata
as we now have a full smb3_fs_context as part of the cifs superblock
we no longer need a local copy of the mount options and can just
reference the copy in the smb3_fs_context.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Ronnie Sahlberg d17abdf756 cifs: add an smb3_fs_context to cifs_sb
and populate it during mount in cifs_smb3_do_mount()

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Ronnie Sahlberg 4deb075985 cifs: remove the devname argument to cifs_compose_mount_options
none of the callers use this argument any more.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Ronnie Sahlberg 24e0a1eff9 cifs: switch to new mount api
See Documentation/filesystems/mount_api.rst for details on new mount API

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Ronnie Sahlberg 66e7b09c73 cifs: move cifs_parse_devname to fs_context.c
Also rename the function from cifs_ to smb3_

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Ronnie Sahlberg 15c7d09af2 cifs: move the enum for cifs parameters into fs_context.h
No change to logic, just moving the enum of cifs mount parms into a header

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Ronnie Sahlberg 837e3a1bbf cifs: rename dup_vol to smb3_fs_context_dup and move it into fs_context.c
Continue restructuring needed for support of new mount API

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Ronnie Sahlberg 3fa1c6d1b8 cifs: rename smb_vol as smb3_fs_context and move it to fs_context.h
Harmonize and change all such variables to 'ctx', where possible.
No changes to actual logic.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Steve French 7955f105af SMB3.1.1: do not log warning message if server doesn't populate salt
In the negotiate protocol preauth context, the server is not required
to populate the salt (although it is done by most servers) so do
not warn on mount.

We retain the checks (warn) that the preauth context is the minimum
size and that the salt does not exceed DataLength of the SMB response.
Although we use the defaults in the case that the preauth context
response is invalid, these checks may be useful in the future
as servers add support for additional mechanisms.

CC: Stable <stable@vger.kernel.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Steve French 145024e3e4 SMB3.1.1: update comments clarifying SPNEGO info in negprot response
Trivial changes to clarify confusing comment about
SPNEGO blog (and also one length comparisons in negotiate
context parsing).

Suggested-by: Tom Talpey <tom@talpey.com>
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Shyam Prasad N f2156d35c9 cifs: Enable sticky bit with cifsacl mount option.
For the cifsacl mount option, we did not support sticky bits.
With this patch, we do support it, by setting the DELETE_CHILD perm
on the directory only for the owner user. When sticky bit is not
enabled, allow DELETE_CHILD perm for everyone.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Shyam Prasad N 0f22053e81 cifs: Fix unix perm bits to cifsacl conversion for "other" bits.
With the "cifsacl" mount option, the mode bits set on the file/dir
is converted to corresponding ACEs in DACL. However, only the
ALLOWED ACEs were being set for "owner" and "group" SIDs. Since
owner is a subset of group, and group is a subset of
everyone/world SID, in order to properly emulate unix perm groups,
we need to add DENIED ACEs. If we don't do that, "owner" and "group"
SIDs could get more access rights than they should. Which is what
was happening. This fixes it.

We try to keep the "preferred" order of ACEs, i.e. DENYs followed
by ALLOWs. However, for a small subset of cases we cannot
maintain the preferred order. In that case, we'll end up with the
DENY ACE for group after the ALLOW for the owner.

If owner SID == group SID, use the more restrictive
among the two perm bits and convert them to ACEs.

Also, for reverse mapping, i.e. to convert ACL to unix perm bits,
for the "others" bits, we needed to add the masked bits of the
owner and group masks to others mask.

Updated version of patch fixes a problem noted by the kernel
test robot.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Steve French bc7c4129d4 SMB3.1.1: remove confusing mount warning when no SPNEGO info on negprot rsp
Azure does not send an SPNEGO blob in the negotiate protocol response,
so we shouldn't assume that it is there when validating the location
of the first negotiate context.  This avoids the potential confusing
mount warning:

   CIFS: Invalid negotiate context offset

CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Steve French ebcd6de987 SMB3: avoid confusing warning message on mount to Azure
Mounts to Azure cause an unneeded warning message in dmesg
   "CIFS: VFS: parse_server_interfaces: incomplete interface info"

Azure rounds up the size (by 8 additional bytes, to a
16 byte boundary) of the structure returned on the query
of the server interfaces at mount time.  This is permissible
even though different than other servers so do not log a warning
if query network interfaces response is only rounded up by 8
bytes or fewer.

CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Gustavo A. R. Silva 21ac58f495 cifs: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
warnings by explicitly adding multiple break/goto statements instead of
just letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-13 19:12:07 -06:00
Zhihao Cheng b80a974b8c ubifs: ubifs_dump_node: Dump all branches of the index node
An index node can have up to c->fanout branches, all branches should be
displayed while dumping index node.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 22:12:42 +01:00
Zhihao Cheng bf6dab7a6c ubifs: ubifs_dump_sleb: Remove unused function
Function ubifs_dump_sleb() is defined but unused, it can be removed.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 22:12:38 +01:00
Zhihao Cheng a33e30a0e0 ubifs: Pass node length in all node dumping callers
Function ubifs_dump_node() has been modified to avoid memory oob
accessing while dumping node, node length (corresponding to the
size of allocated memory for node) should be passed into all node
dumping callers.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 22:12:32 +01:00
Zhihao Cheng c8be097530 Revert "ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len"
This reverts commit acc5af3efa ("ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len")

No need to avoid memory oob in dumping for data node alone. Later, node
length will be passed into function 'ubifs_dump_node()' which replaces
all node dumping places.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 22:11:51 +01:00
Zhihao Cheng c4c0d19d39 ubifs: Limit dumping length by size of memory which is allocated for the node
To prevent memory out-of-bounds accessing in ubifs_dump_node(), actual
dumping length should be restricted by another condition(size of memory
which is allocated for the node).

This patch handles following situations (These situations may be caused
by bit flipping due to hardware error, writing bypass ubifs, unknown
bugs in ubifs, etc.):
1. bad node_len: Dumping data according to 'ch->len' which may exceed
   the size of memory allocated for node.
2. bad node content: Some kinds of node can record additional data, eg.
   index node and orphan node, make sure the size of additional data
   not beyond the node length.
3. node_type changes: Read data according to type A, but expected type
   B, before that, node is allocated according to type B's size. Length
   of type A node is greater than type B node.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 22:11:36 +01:00
Chengsong Ke 32f6ccc743 ubifs: Remove the redundant return in dbg_check_nondata_nodes_order
There is a redundant return in dbg_check_nondata_nodes_order,
which will be never reached. In addition, error code should be
returned instead of zero in this branch.

Signed-off-by: Chengsong Ke <kechengsong@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 22:11:23 +01:00
Jamie Iles a61df3c413 jffs2: Fix NULL pointer dereference in rp_size fs option parsing
syzkaller found the following JFFS2 splat:

  Unable to handle kernel paging request at virtual address dfffa00000000001
  Mem abort info:
    ESR = 0x96000004
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
  [dfffa00000000001] address between user and kernel address ranges
  Internal error: Oops: 96000004 [#1] SMP
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 0 PID: 12745 Comm: syz-executor.5 Tainted: G S                5.9.0-rc8+ #98
  Hardware name: linux,dummy-virt (DT)
  pstate: 20400005 (nzCv daif +PAN -UAO BTYPE=--)
  pc : jffs2_parse_param+0x138/0x308 fs/jffs2/super.c:206
  lr : jffs2_parse_param+0x108/0x308 fs/jffs2/super.c:205
  sp : ffff000022a57910
  x29: ffff000022a57910 x28: 0000000000000000
  x27: ffff000057634008 x26: 000000000000d800
  x25: 000000000000d800 x24: ffff0000271a9000
  x23: ffffa0001adb5dc0 x22: ffff000023fdcf00
  x21: 1fffe0000454af2c x20: ffff000024cc9400
  x19: 0000000000000000 x18: 0000000000000000
  x17: 0000000000000000 x16: ffffa000102dbdd0
  x15: 0000000000000000 x14: ffffa000109e44bc
  x13: ffffa00010a3a26c x12: ffff80000476e0b3
  x11: 1fffe0000476e0b2 x10: ffff80000476e0b2
  x9 : ffffa00010a3ad60 x8 : ffff000023b70593
  x7 : 0000000000000003 x6 : 00000000f1f1f1f1
  x5 : ffff000023fdcf00 x4 : 0000000000000002
  x3 : ffffa00010000000 x2 : 0000000000000001
  x1 : dfffa00000000000 x0 : 0000000000000008
  Call trace:
   jffs2_parse_param+0x138/0x308 fs/jffs2/super.c:206
   vfs_parse_fs_param+0x234/0x4e8 fs/fs_context.c:117
   vfs_parse_fs_string+0xe8/0x148 fs/fs_context.c:161
   generic_parse_monolithic+0x17c/0x208 fs/fs_context.c:201
   parse_monolithic_mount_data+0x7c/0xa8 fs/fs_context.c:649
   do_new_mount fs/namespace.c:2871 [inline]
   path_mount+0x548/0x1da8 fs/namespace.c:3192
   do_mount+0x124/0x138 fs/namespace.c:3205
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount fs/namespace.c:3390 [inline]
   __arm64_sys_mount+0x164/0x238 fs/namespace.c:3390
   __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
   invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
   el0_svc_common.constprop.0+0x15c/0x598 arch/arm64/kernel/syscall.c:149
   do_el0_svc+0x60/0x150 arch/arm64/kernel/syscall.c:195
   el0_svc+0x34/0xb0 arch/arm64/kernel/entry-common.c:226
   el0_sync_handler+0xc8/0x5b4 arch/arm64/kernel/entry-common.c:236
   el0_sync+0x15c/0x180 arch/arm64/kernel/entry.S:663
  Code: d2d40001 f2fbffe1 91002260 d343fc02 (38e16841)
  ---[ end trace 4edf690313deda44 ]---

This is because since ec10a24f10, the option parsing happens before
fill_super and so the MTD device isn't associated with the filesystem.
Defer the size check until there is a valid association.

Fixes: ec10a24f10 ("vfs: Convert jffs2 to use the new mount API")
Cc: <stable@vger.kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jamie Iles <jamie@nuviainc.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:57:21 +01:00
Fangping Liang 89f40d0a96 ubifs: Fixed print foramt mismatch in ubifs
fs/ubifs/super.c: function mount_ubifs:
the format specifier "lld" need arg type "long long",
but the according arg "old_idx_sz" has type
"unsigned long long"

Signed-off-by: Fangping Liang <liangfangping@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:57:21 +01:00
Tom Rix 22bdb8b6fd jffs2: remove trailing semicolon in macro definition
The macro use will already have a semicolon.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:57:20 +01:00
Wang ShaoBo 3cded66330 ubifs: Fix error return code in ubifs_init_authentication()
Fix to return PTR_ERR() error code from the error handling case where
ubifs_hash_get_desc() failed instead of 0 in ubifs_init_authentication(),
as done elsewhere in this function.

Fixes: 49525e5eec ("ubifs: Add helper functions for authentication support")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:57:20 +01:00
Richard Weinberger 20f1431160 ubifs: wbuf: Don't leak kernel memory to flash
Write buffers use a kmalloc()'ed buffer, they can leak
up to seven bytes of kernel memory to flash if writes are not
aligned.
So use ubifs_pad() to fill these gaps with padding bytes.
This was never a problem while scanning because the scanner logic
manually aligns node lengths and skips over these gaps.

Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:57:20 +01:00
Chengsong Ke f212400783 ubifs: Fix the printing type of c->big_lpt
Ubifs uses %d to print c->big_lpt, but c->big_lpt is a variable
of type unsigned int and should be printed with %u.

Signed-off-by: Chengsong Ke <kechengsong@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:57:10 +01:00
lizhe cd3ed3c73a jffs2: Allow setting rp_size to zero during remounting
Set rp_size to zero will be ignore during remounting.

The method to identify whether we input a remounting option of
rp_size is to check if the rp_size input is zero. It can not work
well if we pass "rp_size=0".

This patch add a bool variable "set_rp_size" to fix this problem.

Reported-by: Jubin Zhong <zhongjubin@huawei.com>
Signed-off-by: lizhe <lizhe67@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:56:24 +01:00
lizhe 08cd274f9b jffs2: Fix ignoring mounting options problem during remounting
The jffs2 mount options will be ignored when remounting jffs2.
It can be easily reproduced with the steps listed below.
1. mount -t jffs2 -o compr=none /dev/mtdblockx /mnt
2. mount -o remount compr=zlib /mnt

Since ec10a24f10, the option parsing happens before fill_super and
then pass fc, which contains the options parsing results, to function
jffs2_reconfigure during remounting. But function jffs2_reconfigure do
not update c->mount_opts.

This patch add a function jffs2_update_mount_opts to fix this problem.

By the way, I notice that tmpfs use the same way to update remounting
options. If it is necessary to unify them?

Cc: <stable@vger.kernel.org>
Fixes: ec10a24f10 ("vfs: Convert jffs2 to use the new mount API")
Signed-off-by: lizhe <lizhe67@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:55:59 +01:00
Zhe Li 9afc9a8a49 jffs2: Fix GC exit abnormally
The log of this problem is:
jffs2: Error garbage collecting node at 0x***!
jffs2: No space for garbage collection. Aborting GC thread

This is because GC believe that it do nothing, so it abort.

After going over the image of jffs2, I find a scene that
can trigger this problem stably.
The scene is: there is a normal dirent node at summary-area,
but abnormal at corresponding not-summary-area with error
name_crc.

The reason that GC exit abnormally is because it find that
abnormal dirent node to GC, but when it goes to function
jffs2_add_fd_to_list, it cannot meet the condition listed
below:

if ((*prev)->nhash == new->nhash && !strcmp((*prev)->name, new->name))

So no node is marked obsolete, statistical information of
erase_block do not change, which cause GC exit abnormally.

The root cause of this problem is: we do not check the
name_crc of the abnormal dirent node with summary is enabled.

Noticed that in function jffs2_scan_dirent_node, we use
function jffs2_scan_dirty_space to deal with the dirent
node with error name_crc. So this patch add a checking
code in function read_direntry to ensure the correctness
of dirent node. If checked failed, the dirent node will
be marked obsolete so GC will pass this node and this
problem will be fixed.

Cc: <stable@vger.kernel.org>
Signed-off-by: Zhe Li <lizhe67@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:55:39 +01:00
Chengguang Xu 2976c19c95 ubifs: Code cleanup by removing ifdef macro surrounding
Define ubifs_listxattr and ubifs_xattr_handlers to NULL
when CONFIG_UBIFS_FS_XATTR is not enabled, then we can
remove many ugly ifdef macros in the code.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:51:59 +01:00
Randy Dunlap 8fdaaf4cf3 jffs2: Fix if/else empty body warnings
When debug (print) macros are not enabled, change them to use the
no_printk() macro instead of <nothing>. This fixes gcc warnings when
-Wextra is used:

../fs/jffs2/nodelist.c:255:37: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body]
../fs/jffs2/nodelist.c:278:38: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body]
../fs/jffs2/nodelist.c:558:52: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body]
../fs/jffs2/xattr.c:1247:58: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
../fs/jffs2/xattr.c:1281:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]

Builds without warnings on all 3 levels of CONFIG_JFFS2_FS_DEBUG.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:51:54 +01:00
Randy Dunlap b8f1da98a2 ubifs: Delete duplicated words + other fixes
Delete repeated words in fs/ubifs/.
{negative, is, of, and, one, it}
where "it it" was changed to "if it".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
To: linux-fsdevel@vger.kernel.org
Cc: Richard Weinberger <richard@nod.at>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13 21:51:46 +01:00
Zheng Yongjun 1189686e54 fs/xfs: convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:49:47 -08:00
Christoph Hellwig 5d24ec4c7d xfs: open code updating i_mode in xfs_set_acl
Rather than going through the big and hairy xfs_setattr_nonsize function,
just open code a transactional i_mode and i_ctime update.  This allows
to mark xfs_setattr_nonsize and remove the flags argument to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:49:38 -08:00
Christoph Hellwig 26f88363ec xfs: remove xfs_vn_setattr_nonsize
Merge xfs_vn_setattr_nonsize into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:48:25 -08:00
Gao Xiang 3937493c50 xfs: kill ialloced in xfs_dialloc()
It's enough to just use return code, and get rid of an argument.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:48:25 -08:00
Dave Chinner 8d822dc38a xfs: spilt xfs_dialloc() into 2 functions
This patch explicitly separates free inode chunk allocation and
inode allocation into two individual high level operations.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:48:25 -08:00
Dave Chinner f3bf6e0f11 xfs: move xfs_dialloc_roll() into xfs_dialloc()
Get rid of the confusing ialloc_context and failure handling around
xfs_dialloc() by moving xfs_dialloc_roll() into xfs_dialloc().

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:48:24 -08:00
Dave Chinner 1abcf26101 xfs: move on-disk inode allocation out of xfs_ialloc()
So xfs_ialloc() will only address in-core inode allocation then,
Also, rename xfs_ialloc() to xfs_dir_ialloc_init() in order to
keep everything in xfs_inode.c under the same namespace.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:48:24 -08:00
Dave Chinner aececc9f8d xfs: introduce xfs_dialloc_roll()
Introduce a helper to make the on-disk inode allocation rolling
logic clearer in preparation of the following cleanup.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:48:24 -08:00
Gao Xiang 15574ebbff xfs: convert noroom, okalloc in xfs_dialloc() to bool
Boolean is preferred for such use.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12 10:48:24 -08:00
Linus Torvalds 31d00f6eb1 io_uring-5.10-2020-12-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/UMOwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmI3EAC22QohUftfWAJeks142J9wFKQgYfNZpZHs
 iJR07NTYUSghp9rpfGVNGY/ZonUoKds0VSbDZpRU8TTjglGf9GguzYE4iOMedtG3
 MA9diNOnAAK0gwZrBx1uB882X4pf1qYGChMe9fXXJ4mQsHpG84q00GnJI0cPjkKn
 PaYRVugr8AxCzG4dHHL0ckQtsEHWaPwe7i5u4UkoOC3IsrVjRri6wYAa98xDuznv
 fqssoMp2cSOcVHsfypQAVZbg0heR3Zo2OuXE4uw6yjB7w5j8tuLIV9XKtqwiRn2n
 57rdnGfY6vSfJb9EPm6fwBRnve2cI2Xbr30x/XWSIx+QcUOFOsBfdblaJikX8cSJ
 FG5c0AgBeRo1heWC4FRNIFStAubm0RPJAPqNm9p6T0DzO4hC3xTTdkl04B5bA4Ms
 4tNWhk2fP8YRKhC0gRcw7zRHAl/qjuMjrUor5L68jxGlmFR7h+uMky0AsS4KZKNZ
 o1+HFmc/m+3ZBNzL7NoXkJzWFaKzZVf0iuEqJP3Xqy91hCA85pvh7DGAlQpZBppM
 C0eNeJ59rWvfSuUS9bLlLfoMO3Ab5BsE4IEji7K2lBNZsCXQpXHVv4xJxgKJRja9
 t+SuBgYpNOfHrIPSY0LaJh93MNzoP22TP8qF8U5gWQ0Bus/mVKE9S1EXSUAvfHPm
 IyOfTB9M6g==
 =wjOp
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-12-11' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two fixes in here, fixing issues introduced in this merge window"

* tag 'io_uring-5.10-2020-12-11' of git://git.kernel.dk/linux-block:
  io_uring: fix file leak on error path of io ctx creation
  io_uring: fix mis-seting personality's creds
2020-12-12 09:45:01 -08:00
Jens Axboe 355fb9e2b7 io_uring: remove 'twa_signal_ok' deadlock work-around
The TIF_NOTIFY_SIGNAL based implementation of TWA_SIGNAL is always safe
to use, regardless of context, as we won't be recursing into the signal
lock. So now that all archs are using that, we can drop this deadlock
work-around as it's always safe to use TWA_SIGNAL.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12 09:17:38 -07:00
Jens Axboe 792ee0f6db io_uring: JOBCTL_TASK_WORK is no longer used by task_work
Remove the dead code, TWA_SIGNAL will never set JOBCTL_TASK_WORK at
this point.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12 09:17:38 -07:00
Jakub Kicinski 46d5e62dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().

strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.

Conflicts:
	net/netfilter/nf_tables_api.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 22:29:38 -08:00
Linus Torvalds 782598ecea zonefs fixes for 5.10-rc7
A single patch in this pull request to fix a BIO and page reference
 leak when writing sequential zone files.
 
 Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCX9NY0gAKCRDdoc3SxdoY
 dga9AQCq+hAOyA1FeCkWwBG0gywIIVhPscNphe1bH576JoaIZAEA86DsGB9BED7H
 yaezfh5W1lr63ctxtSVdyo+x5172fgY=
 =Te7s
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fix from Damien Le Moal:
 "A single patch in this pull request to fix a BIO and page reference
  leak when writing sequential zone files"

* tag 'zonefs-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: fix page reference and BIO leak
2020-12-11 14:22:42 -08:00
Miles Chen 40d6366e9d proc: use untagged_addr() for pagemap_read addresses
When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag.
To fix it, we should untag the userspace pointers in pagemap_read().

I tested with 5.10-rc4 and the issue remains.

Explanation from Catalin in [1]:

 "Arguably, that's a user-space bug since tagged file offsets were never
  supported. In this case it's not even a tag at bit 56 as per the arm64
  tagged address ABI but rather down to bit 47. You could say that the
  problem is caused by the C library (malloc()) or whoever created the
  tagged vaddr and passed it to this function. It's not a kernel
  regression as we've never supported it.

  Now, pagemap is a special case where the offset is usually not
  generated as a classic file offset but rather derived by shifting a
  user virtual address. I guess we can make a concession for pagemap
  (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"

My test code is based on [2]:

A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8

userspace program:

  uint64 OsLayer::VirtualToPhysical(void *vaddr) {
	uint64 frame, paddr, pfnmask, pagemask;
	int pagesize = sysconf(_SC_PAGESIZE);
	off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
	int fd = open(kPagemapPath, O_RDONLY);
	...

	if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
		int err = errno;
		string errtxt = ErrorString(err);
		if (fd >= 0)
			close(fd);
		return 0;
	}
  ...
  }

kernel fs/proc/task_mmu.c:

  static ssize_t pagemap_read(struct file *file, char __user *buf,
		size_t count, loff_t *ppos)
  {
	...
	src = *ppos;
	svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
	start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
	end_vaddr = mm->task_size;

	/* watch out for wraparound */
	// svpfn == 0xb400007662f54
	// (mm->task_size >> PAGE) == 0x8000000
	if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
		start_vaddr = end_vaddr;

	ret = 0;
	while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
		int len;
		unsigned long end;
		...
	}
	...
  }

[1] https://lore.kernel.org/patchwork/patch/1343258/
[2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158

Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
Cc: <stable@vger.kernel.org>	[5.4-]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-11 14:02:14 -08:00
Amir Goldstein fecc455978 fsnotify: fix events reported to watching parent and child
fsnotify_parent() used to send two separate events to backends when a
parent inode is watching children and the child inode is also watching.
In an attempt to avoid duplicate events in fanotify, we unified the two
backend callbacks to a single callback and handled the reporting of the
two separate events for the relevant backends (inotify and dnotify).
However the handling is buggy and can result in inotify and dnotify
listeners receiving events of the type they never asked for or spurious
events.

The problem is the unified event callback with two inode marks (parent and
child) is called when any of the parent and child inodes are watched and
interested in the event, but the parent inode's mark that is interested
in the event on the child is not necessarily the one we are currently
reporting to (it could belong to a different group).

So before reporting the parent or child event flavor to backend we need
to check that the mark is really interested in that event flavor.

The semantics of INODE and CHILD marks were hard to follow and made the
logic more complicated than it should have been.  Replace it with INODE
and PARENT marks semantics to hopefully make the logic more clear.

Thanks to Hugh Dickins for spotting a bug in the earlier version of this
patch.

Fixes: 497b0c5a7c ("fsnotify: send event to parent and child with single callback")
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201202120713.702387-4-amir73il@gmail.com
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-12-11 11:40:43 +01:00
Linus Torvalds 6840a3dcc2 More NFS Client Bugfixes for Linux 5.10
- Bugfixes:
   - Fix array overflow when flexfiles mirroring is enabled
   - Fix rpcrdma_inline_fixup() crash with new LISTXATTRS
   - Fix 5 second delay when doing inter-server copy
   - Disable READ_PLUS by default
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl/Sl70ACgkQ18tUv7Cl
 QOuGdw/9GMmrLF26brO3q3lSTD9xdljJGQi0q1DEF2amNUL9cPgxB0xZB9lyUPqs
 AbpLxsCgfJ+rN3sbOnDdz8Ae/wiWS1PkeJWw4cu0/kIIjyPD2Uu4jWuIwxpFvP1v
 lFfBHeEdsNgqS5t9t9cF9u30PjPKt9SHG/64/8GY92zlrKDRUaJUyvW1zUKo4ne4
 m3JxZ1rCxNkBYV+odrQIYE9gqFI6OBWpVsUeIXjBWWDZ/+zK9rGiO5fxZ8oe+yT2
 cmrR/vOznoPgUGaH380HrCbXcIKwim2mn76t1MH9fScZQc1ocSWckrZRRdOVWlJ+
 228ImDPACLFaTqpgi8ZhFmuD3Mwbz4AtPLce52lvZLuMAPVMoFhR8xrm/VpJFSim
 pNjwsl/j9qlGSAe5XdGW+X/DzmHPVd6uhfzOFAd7ZErt7rU0gjyJTWzrmayoB3N1
 GF5g4LaK1dYYj7SG8d8DrEV2CGyYle2UP4aDm8qJ/ZLQhd10RHeVt4BHfSKwY05l
 03Gim7wlH/ercYfgU6wrgmEG2SWjuFJr+j1v//aqcmJVhA7FvTKygGpGyNK16pNL
 9dZ81wUHKHzVt5C/ruG+3tARvnobvmB4ngQAQmtu0aXe9c+Kr2XlaXUAusHPK+Me
 pVqtmaKBgddUtJwFms8JF6dePiSBRQgpwyvluqNi1D7ntRhS+Ik=
 =EXrS
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Here are a handful more bugfixes for 5.10.

  Unfortunately, we found some problems with the new READ_PLUS operation
  that aren't easy to fix. We've decided to disable this codepath
  through a Kconfig option for now, but a series of patches going into
  5.11 will clean up the code and fix the issues at the same time. This
  seemed like the best way to go about it.

  Summary:

   - Fix array overflow when flexfiles mirroring is enabled

   - Fix rpcrdma_inline_fixup() crash with new LISTXATTRS

   - Fix 5 second delay when doing inter-server copy

   - Disable READ_PLUS by default"

* tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Disable READ_PLUS by default
  NFSv4.2: Fix 5 seconds delay when doing inter server copy
  NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
  pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled
2020-12-10 15:36:09 -08:00
Al Viro 1a97d899ec Make sure that make_create_in_sticky() never sees uninitialized value of dir_mode
make sure nd->dir_mode is always initialized after success exit from
link_path_walk(); in case of empty path it did not happen.

Reported-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Tested-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10 17:33:17 -05:00
Hao Li 77573fa310 fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set
If DCACHE_REFERENCED is set, fast_dput() will return true, and then
retain_dentry() have no chance to check DCACHE_DONTCACHE. As a result,
the dentry won't be killed and the corresponding inode can't be evicted.
In the following example, the DAX policy can't take effects unless we
do a drop_caches manually.

  # DCACHE_LRU_LIST will be set
  echo abcdefg > test.txt

  # DCACHE_REFERENCED will be set and DCACHE_DONTCACHE can't do anything
  xfs_io -c 'chattr +x' test.txt

  # Drop caches to make DAX changing take effects
  echo 2 > /proc/sys/vm/drop_caches

What this patch does is preventing fast_dput() from returning true if
DCACHE_DONTCACHE is set. Then retain_dentry() will detect the
DCACHE_DONTCACHE and will return false. As a result, the dentry will be
killed and the inode will be evicted. In this way, if we change per-file
DAX policy, it will take effects automatically after this file is closed
by all processes.

I also add some comments to make the code more clear.

Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10 17:33:17 -05:00
Hao Li 88149082bb fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode()
If generic_drop_inode() returns true, it means iput_final() can evict
this inode regardless of whether it is dirty or not. If we check
I_DONTCACHE in generic_drop_inode(), any inode with this bit set will be
evicted unconditionally. This is not the desired behavior because
I_DONTCACHE only means the inode shouldn't be cached on the LRU list.
As for whether we need to evict this inode, this is what
generic_drop_inode() should do. This patch corrects the usage of
I_DONTCACHE.

This patch was proposed in [1].

[1]: https://lore.kernel.org/linux-fsdevel/20200831003407.GE12096@dread.disaster.area/

Fixes: dae2f8ed79 ("fs: Lift XFS_IDONTCACHE to the VFS layer")
Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10 17:33:17 -05:00
Eric Biggers edf7ddbf1c fs/namespace.c: WARN if mnt_count has become negative
Missing calls to mntget() (or equivalently, too many calls to mntput())
are hard to detect because mntput() delays freeing mounts using
task_work_add(), then again using call_rcu().  As a result, mnt_count
can often be decremented to -1 without getting a KASAN use-after-free
report.  Such cases are still bugs though, and they point to real
use-after-frees being possible.

For an example of this, see the bug fixed by commit 1b0b9cc8d3
("vfs: fsmount: add missing mntget()"), discussed at
https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u.
This bug *should* have been trivial to find.  But actually, it wasn't
found until syzkaller happened to use fchdir() to manipulate the
reference count just right for the bug to be noticeable.

Address this by making mntput_no_expire() issue a WARN if mnt_count has
become negative.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10 17:33:17 -05:00
Anna Schumaker 21e31401fc NFS: Disable READ_PLUS by default
We've been seeing failures with xfstests generic/091 and generic/263
when using READ_PLUS. I've made some progress on these issues, and the
tests fail later on but still don't pass. Let's disable READ_PLUS by
default until we can work out what is going on.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-10 16:48:03 -05:00
Dai Ngo fe8eb820e3 NFSv4.2: Fix 5 seconds delay when doing inter server copy
Since commit b4868b44c5 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.

Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
until after the call to update_open_stateid, to indicate this is the 1st
open. This fix is part of a 2 patches, the other patch is the fix in the
source server to return the stateid for COPY_NOTIFY request with seqid 1
instead of 0.

Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-10 16:48:03 -05:00
Chuck Lever 1c87b85162 NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
By switching to an XFS-backed export, I am able to reproduce the
ibcomp worker crash on my client with xfstests generic/013.

For the failing LISTXATTRS operation, xdr_inline_pages() is called
with page_len=12 and buflen=128.

- When ->send_request() is called, rpcrdma_marshal_req() does not
  set up a Reply chunk because buflen is smaller than the inline
  threshold. Thus rpcrdma_convert_iovs() does not get invoked at
  all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
  on the receive buffer.

- During reply processing, rpcrdma_inline_fixup() tries to copy
  received data into rq_rcv_buf->pages because page_len is positive.
  But there are no receive pages because rpcrdma_marshal_req() never
  allocated them.

The result is that the ibcomp worker faults and dies. Sometimes that
causes a visible crash, and sometimes it results in a transport hang
without other symptoms.

RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
should eventually be fixed or replaced. However, my preference is
that upper-layer operations should explicitly allocate their receive
buffers (using GFP_KERNEL) when possible, rather than relying on
XDRBUF_SPARSE_PAGES.

Reported-by: Olga kornievskaia <kolga@netapp.com>
Suggested-by: Olga kornievskaia <kolga@netapp.com>
Fixes: c10a75145f ("NFSv4.2: add the extended attribute proc functions.")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga kornievskaia <kolga@netapp.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Tested-by: Olga kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-10 16:48:03 -05:00
Eric W. Biederman f7cfd871ae exec: Transform exec_update_mutex into a rw_semaphore
Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex.  The problematic lock ordering found by lockdep
was:

   perf_event_open  (exec_update_mutex -> ovl_i_mutex)
   chown            (ovl_i_mutex       -> sb_writes)
   sendfile         (sb_writes         -> p->lock)
     by reading from a proc file and writing to overlayfs
   proc_pid_syscall (p->lock           -> exec_update_mutex)

While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same.  They are all readers.  The only writer is
exec.

There is no reason for readers to block on each other.  So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.

Cc: Jann Horn <jannh@google.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Fixes: eea9673250 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
[0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 13:13:32 -06:00
Eric W. Biederman 9ee1206dcf exec: Move io_uring_task_cancel after the point of no return
Now that unshare_files happens in begin_new_exec after the point of no
return, io_uring_task_cancel can also happen later.

Effectively this means io_uring activities for a task are only canceled
when exec succeeds.

Link: https://lkml.kernel.org/r/878saih2op.fsf@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:57:13 -06:00
Eric W. Biederman c39ab6de22 coredump: Document coredump code exclusively used by cell spufs
Oleg Nesterov recently asked[1] why is there an unshare_files in
do_coredump.  After digging through all of the callers of lookup_fd it
turns out that it is
arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
that needs the unshare_files in do_coredump.

Looking at the history[2] this code was also the only piece of coredump code
that required the unshare_files when the unshare_files was added.

Looking at that code it turns out that cell is also the only
architecture that implements elf_coredump_extra_notes_size and
elf_coredump_extra_notes_write.

I looked at the gdb repo[3] support for cell has been removed[4] in binutils
2.34.  Geoff Levand reports he is still getting questions on how to
run modern kernels on the PS3, from people using 3rd party firmware so
this code is not dead.  According to Wikipedia the last PS3 shipped in
Japan sometime in 2017.  So it will probably be a little while before
everyone's hardware dies.

Add some comments briefly documenting the coredump code that exists
only to support cell spufs to make it easier to understand the
coredump code.  Eventually the hardware will be dead, or their won't
be userspace tools, or the coredump code will be refactored and it
will be too difficult to update a dead architecture and these comments
make it easy to tell where to pull to remove cell spufs support.

[1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
[2] 179e037fc1 ("do_coredump(): make sure that descriptor table isn't shared")
[3] git://sourceware.org/git/binutils-gdb.git
[4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:57:08 -06:00
Eric W. Biederman fa67bf885e file: Remove get_files_struct
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Now that get_files_struct has no more users and can not cause the
problems for posix file locking and fget_light remove get_files_struct
so that it does not gain any new users.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-13-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-24-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Eric W. Biederman 9fe83c43e7 file: Rename __close_fd_get_file close_fd_get_file
The function close_fd_get_file is explicitly a variant of
__close_fd[1].  Now that __close_fd has been renamed close_fd, rename
close_fd_get_file to be consistent with close_fd.

When __alloc_fd, __close_fd and __fd_install were introduced the
double underscore indicated that the function took a struct
files_struct parameter.  The function __close_fd_get_file never has so
the naming has always been inconsistent.  This just cleans things up
so there are not any lingering mentions or references __close_fd left
in the code.

[1] 80cd795630 ("binder: fix use-after-free due to ksys_close() during fdget()")
Link: https://lkml.kernel.org/r/20201120231441.29911-23-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Eric W. Biederman 1572bfdf21 file: Replace ksys_close with close_fd
Now that ksys_close is exactly identical to close_fd replace
the one caller of ksys_close with close_fd.

[1] https://lkml.kernel.org/r/20200818112020.GA17080@infradead.org
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20201120231441.29911-22-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Eric W. Biederman 8760c909f5 file: Rename __close_fd to close_fd and remove the files parameter
The function __close_fd was added to support binder[1].  Now that
binder has been fixed to no longer need __close_fd[2] all calls
to __close_fd pass current->files.

Therefore transform the files parameter into a local variable
initialized to current->files, and rename __close_fd to close_fd to
reflect this change, and keep it in sync with the similar changes to
__alloc_fd, and __fd_install.

This removes the need for callers to care about the extra care that
needs to be take if anything except current->files is passed, by
limiting the callers to only operation on current->files.

[1] 483ce1d4b8 ("take descriptor-related part of close() to file.c")
[2] 44d8047f1d ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Eric W. Biederman aa384d10f3 file: Merge __alloc_fd into alloc_fd
The function __alloc_fd was added to support binder[1].  With binder
fixed[2] there are no more users.

As alloc_fd just calls __alloc_fd with "files=current->files",
merge them together by transforming the files parameter into a
local variable initialized to current->files.

[1] dcfadfa4ec ("new helper: __alloc_fd()")
[2] 44d8047f1d ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-16-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-20-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Eric W. Biederman e06b53c22f file: In f_dupfd read RLIMIT_NOFILE once.
Simplify the code, and remove the chance of races by reading
RLIMIT_NOFILE only once in f_dupfd.

Pass the read value of RLIMIT_NOFILE into alloc_fd which is the other
location the rlimit was read in f_dupfd.  As f_dupfd is the only
caller of alloc_fd this changing alloc_fd is trivially safe.

Further this causes alloc_fd to take all of the same arguments as
__alloc_fd except for the files_struct argument.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-15-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-19-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Eric W. Biederman d74ba04d91 file: Merge __fd_install into fd_install
The function __fd_install was added to support binder[1].  With binder
fixed[2] there are no more users.

As fd_install just calls __fd_install with "files=current->files",
merge them together by transforming the files parameter into a
local variable initialized to current->files.

[1] f869e8a7f7 ("expose a low-level variant of fd_install() for binder")
[2] 44d8047f1d ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1:https://lkml.kernel.org/r/20200817220425.9389-14-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-18-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:58 -06:00
Eric W. Biederman 775e0656b2 proc/fd: In fdinfo seq_show don't use get_files_struct
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Instead hold task_lock for the duration that task->files needs to be
stable in seq_show.  The task_lock was already taken in
get_files_struct, and so skipping get_files_struct performs less work
overall, and avoids the problems with the files_struct reference
count.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-12-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-17-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:58 -06:00
Eric W. Biederman 5b17b61870 proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Using task_lookup_next_fd_rcu simplifies proc_readfd_common, by moving
the checking for the maximum file descritor into the generic code, and
by remvoing the need for capturing and releasing a reference on
files_struct.

As task_lookup_fd_rcu may update the fd ctx->pos has been changed
to be the fd +2 after task_lookup_fd_rcu returns.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Andy Lavr <andy.lavr@gmail.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-15-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:58 -06:00
Eric W. Biederman e9a53aeb5e file: Implement task_lookup_next_fd_rcu
As a companion to fget_task and task_lookup_fd_rcu implement
task_lookup_next_fd_rcu that will return the struct file for the first
file descriptor number that is equal or greater than the fd argument
value, or NULL if there is no such struct file.

This allows file descriptors of foreign processes to be iterated
through safely, without needed to increment the count on files_struct.

Some concern[1] has been expressed that this function takes the task_lock
for each iteration and thus for each file descriptor.  This place
where this function will be called in a commonly used code path is for
listing /proc/<pid>/fd.  I did some small benchmarks and did not see
any measurable performance differences.  For ordinary users ls is
likely to stat each of the directory entries and tid_fd_mode called
from tid_fd_revalidae has always taken the task lock for each file
descriptor.  So this does not look like it will be a big change in
practice.

At some point is will probably be worth changing put_files_struct to
free files_struct after an rcu grace period so that task_lock won't be
needed at all.

[1] https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com
v1: https://lkml.kernel.org/r/20200817220425.9389-9-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-14-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:58 -06:00
Eric W. Biederman 64eb661fda proc/fd: In tid_fd_mode use task_lookup_fd_rcu
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Instead of manually coding finding the files struct for a task and
then calling files_lookup_fd_rcu, use the helper task_lookup_fd_rcu
that combines those to steps.   Making the code simpler and removing
the need to get a reference on a files_struct.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-7-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-12-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:40:14 -06:00
Eric W. Biederman 3a879fb380 file: Implement task_lookup_fd_rcu
As a companion to lookup_fd_rcu implement task_lookup_fd_rcu for
querying an arbitrary process about a specific file.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200818103713.aw46m7vprsy4vlve@wittgenstein
Link: https://lkml.kernel.org/r/20201120231441.29911-11-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:40:10 -06:00
Eric W. Biederman 460b4f812a file: Rename fcheck lookup_fd_rcu
Also remove the confusing comment about checking if a fd exists.  I
could not find one instance in the entire kernel that still matches
the description or the reason for the name fcheck.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20201120231441.29911-10-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:40:07 -06:00
Eric W. Biederman f36c294327 file: Replace fcheck_files with files_lookup_fd_rcu
This change renames fcheck_files to files_lookup_fd_rcu.  All of the
remaining callers take the rcu_read_lock before calling this function
so the _rcu suffix is appropriate.  This change also tightens up the
debug check to verify that all callers hold the rcu_read_lock.

All callers that used to call files_check with the files->file_lock
held have now been changed to call files_lookup_fd_locked.

This change of name has helped remind me of which locks and which
guarantees are in place helping me to catch bugs later in the
patchset.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20201120231441.29911-9-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:40:03 -06:00
Eric W. Biederman 120ce2b0cd file: Factor files_lookup_fd_locked out of fcheck_files
To make it easy to tell where files->file_lock protection is being
used when looking up a file create files_lookup_fd_locked.  Only allow
this function to be called with the file_lock held.

Update the callers of fcheck and fcheck_files that are called with the
files->file_lock held to call files_lookup_fd_locked instead.

Hopefully this makes it easier to quickly understand what is going on.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20201120231441.29911-8-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:39:59 -06:00
Eric W. Biederman bebf684bf3 file: Rename __fcheck_files to files_lookup_fd_raw
The function fcheck despite it's comment is poorly named
as it has no callers that only check it's return value.
All of fcheck's callers use the returned file descriptor.
The same is true for fcheck_files and __fcheck_files.

A new less confusing name is needed.  In addition the names
of these functions are confusing as they do not report
the kind of locks that are needed to be held when these
functions are called making error prone to use them.

To remedy this I am making the base functio name lookup_fd
and will and prefixes and sufficies to indicate the rest
of the context.

Name the function (previously called __fcheck_files) that proceeds
from a struct files_struct, looks up the struct file of a file
descriptor, and requires it's callers to verify all of the appropriate
locks are held files_lookup_fd_raw.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20201120231441.29911-7-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:39:54 -06:00
Eric W. Biederman 439be32656 proc/fd: In proc_fd_link use fget_task
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Simplifying proc_fd_link is a little bit tricky.  It is necessary to
know that there is a reference to fd_f	 ile while path_get is running.
This reference can either be guaranteed to exist either by locking the
fdtable as the code currently does or by taking a reference on the
file in question.

Use fget_task to remove the need for get_files_struct and
to take a reference to file in question.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-8-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-6-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:39:48 -06:00
Eric W. Biederman 950db38ff2 exec: Remove reset_files_struct
Now that exec no longer needs to restore the previous value of current->files
on error there are no more callers of reset_files_struct so remove it.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-3-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-3-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:39:36 -06:00
Eric W. Biederman 1f702603e7 exec: Simplify unshare_files
Now that exec no longer needs to return the unshared files to their
previous value there is no reason to return displaced.

Instead when unshare_fd creates a copy of the file table, call
put_files_struct before returning from unshare_files.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:39:32 -06:00
Eric W. Biederman b604350128 exec: Move unshare_files to fix posix file locking during exec
Many moons ago the binfmts were doing some very questionable things
with file descriptors and an unsharing of the file descriptor table
was added to make things better[1][2].  The helper steal_lockss was
added to avoid breaking the userspace programs[3][4][6].

Unfortunately it turned out that steal_locks did not work for network
file systems[5], so it was removed to see if anyone would
complain[7][8].  It was thought at the time that NPTL would not be
affected as the unshare_files happened after the other threads were
killed[8].  Unfortunately because there was an unshare_files in
binfmt_elf.c before the threads were killed this analysis was
incorrect.

This unshare_files in binfmt_elf.c resulted in the unshares_files
happening whenever threads were present.  Which led to unshare_files
being moved to the start of do_execve[9].

Later the problems were rediscovered and the suggested approach was to
readd steal_locks under a different name[10].  I happened to be
reviewing patches and I noticed that this approach was a step
backwards[11].

I proposed simply moving unshare_files[12] and it was pointed
out that moving unshare_files without auditing the code was
also unsafe[13].

There were then several attempts to solve this[14][15][16] and I even
posted this set of changes[17].  Unfortunately because auditing all of
execve is time consuming this change did not make it in at the time.

Well now that I am cleaning up exec I have made the time to read
through all of the binfmts and the only playing with file descriptors
is either the security modules closing them in
security_bprm_committing_creds or is in the generic code in fs/exec.c.
None of it happens before begin_new_exec is called.

So move unshare_files into begin_new_exec, after the point of no
return.  If memory is very very very low and the application calling
exec is sharing file descriptor tables between processes we might fail
past the point of no return.  Which is unfortunate but no different
than any of the other places where we allocate memory after the point
of no return.

This movement allows another process that shares the file table, or
another thread of the same process and that closes files or changes
their close on exec behavior and races with execve to cause some
unexpected things to happen.  There is only one time of check to time
of use race and it is just there so that execve fails instead of
an interpreter failing when it tries to open the file it is supposed
to be interpreting.   Failing later if userspace is being silly is
not a problem.

With this change it the following discription from the removal
of steal_locks[8] finally becomes true.

    Apps using NPTL are not affected, since all other threads are killed before
    execve.

    Apps using LinuxThreads are only affected if they

      - have multiple threads during exec (LinuxThreads doesn't kill other
        threads, the app may do it with pthread_kill_other_threads_np())
      - rely on POSIX locks being inherited across exec

    Both conditions are documented, but not their interaction.

    Apps using clone() natively are affected if they

      - use clone(CLONE_FILES)
      - rely on POSIX locks being inherited across exec

I have investigated some paths to make it possible to solve this
without moving unshare_files but they all look more complicated[18].

Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Reported-by: Jeff Layton <jlayton@redhat.com>
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
[1] 02cda956de0b ("[PATCH] unshare_files"
[2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper")
[3] 088f5d7244de ("[PATCH] add steal_locks helper")
[4] 02c541ec8ffa ("[PATCH] use new steal_locks helper")
[5] https://lkml.kernel.org/r/E1FLIlF-0007zR-00@dorka.pomaz.szeredi.hu
[6] https://lkml.kernel.org/r/0060321191605.GB15997@sorel.sous-sol.org
[7] https://lkml.kernel.org/r/E1FLwjC-0000kJ-00@dorka.pomaz.szeredi.hu
[8] c89681ed7d ("[PATCH] remove steal_locks()")
[9] fd8328be87 ("[PATCH] sanitize handling of shared descriptor tables in failing execve()")
[10] https://lkml.kernel.org/r/20180317142520.30520-1-jlayton@kernel.org
[11] https://lkml.kernel.org/r/87r2nwqk73.fsf@xmission.com
[12] https://lkml.kernel.org/r/87bmfgvg8w.fsf@xmission.com
[13] https://lkml.kernel.org/r/20180322111424.GE30522@ZenIV.linux.org.uk
[14] https://lkml.kernel.org/r/20180827174722.3723-1-jlayton@kernel.org
[15] https://lkml.kernel.org/r/20180830172423.21964-1-jlayton@kernel.org
[16] https://lkml.kernel.org/r/20180914105310.6454-1-jlayton@kernel.org
[17] https://lkml.kernel.org/r/87a7ohs5ow.fsf@xmission.com
[18] https://lkml.kernel.org/r/87pn8c1uj6.fsf_-_@x220.int.ebiederm.org
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-1-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-1-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:39:24 -06:00
Eric W. Biederman 878f12dbb8 exec: Don't open code get_close_on_exec
Al Viro pointed out that using the phrase "close_on_exec(fd,
rcu_dereference_raw(current->files->fdt))" instead of wrapping it in
rcu_read_lock(), rcu_read_unlock() is a very questionable
optimization[1].

Once wrapped with rcu_read_lock()/rcu_read_unlock() that phrase
becomes equivalent the helper function get_close_on_exec so
simplify the code and make it more robust by simply using
get_close_on_exec.

[1] https://lkml.kernel.org/r/20201207222214.GA4115853@ZenIV.linux.org.uk
Suggested-by: Al Viro <viro@ftp.linux.org.uk>
Link: https://lkml.kernel.org/r/87k0tqr6zi.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:39:00 -06:00
Chao Yu 75e91c8889 f2fs: compress: fix compression chksum
This patch addresses minor issues in compression chksum.

Fixes: b28f047b28 ("f2fs: compress: support chksum")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-10 09:13:53 -08:00
Chao Yu e584bbe821 f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
syzbot reported a bug which could cause shift-out-of-bounds issue,
fix it.

Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
 sanity_check_raw_super fs/f2fs/super.c:2812 [inline]
 read_raw_super_block fs/f2fs/super.c:3267 [inline]
 f2fs_fill_super.cold+0x16c9/0x16f6 fs/f2fs/super.c:3519
 mount_bdev+0x34d/0x410 fs/super.c:1366
 legacy_get_tree+0x105/0x220 fs/fs_context.c:592
 vfs_get_tree+0x89/0x2f0 fs/super.c:1496
 do_new_mount fs/namespace.c:2896 [inline]
 path_mount+0x12ae/0x1e70 fs/namespace.c:3227
 do_mount fs/namespace.c:3240 [inline]
 __do_sys_mount fs/namespace.c:3448 [inline]
 __se_sys_mount fs/namespace.c:3425 [inline]
 __x64_sys_mount+0x27f/0x300 fs/namespace.c:3425
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: syzbot+ca9a785f8ac472085994@syzkaller.appspotmail.com
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-10 09:13:19 -08:00
Miklos Szeredi 5d069dbe8a fuse: fix bad inode
Jan Kara's analysis of the syzbot report (edited):

  The reproducer opens a directory on FUSE filesystem, it then attaches
  dnotify mark to the open directory.  After that a fuse_do_getattr() call
  finds that attributes returned by the server are inconsistent, and calls
  make_bad_inode() which, among other things does:

          inode->i_mode = S_IFREG;

  This then confuses dnotify which doesn't tear down its structures
  properly and eventually crashes.

Avoid calling make_bad_inode() on a live inode: switch to a private flag on
the fuse inode.  Also add the test to ops which the bad_inode_ops would
have caught.

This bug goes back to the initial merge of fuse in 2.6.14...

Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
2020-12-10 15:33:14 +01:00
Trond Myklebust fa94a951bf NFSv4.2: Fix up the get/listxattr calls to rpc_prepare_reply_pages()
Ensure that both getxattr and listxattr page array are correctly
aligned, and that getxattr correctly accounts for the page padding word.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-10 09:01:53 -05:00
Damien Le Moal 6bea0225a4 zonefs: fix page reference and BIO leak
In zonefs_file_dio_append(), the pages obtained using
bio_iov_iter_get_pages() are not released on completion of the
REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails.
Furthermore, a call to bio_put() is missing when
bio_iov_iter_get_pages() fails.

Fix these resource leaks by adding BIO resource release code (bio_put()i
and bio_release_pages()) at the end of the function after the BIO
execution and add a jump to this resource cleanup code in case of
bio_iov_iter_get_pages() failure.

While at it, also fix the call to task_io_account_write() to be passed
the correct BIO size instead of bio_iov_iter_get_pages() return value.

Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: 02ef12a663 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-12-10 15:14:19 +09:00
Huang Jianan d8b3df8b10 erofs: avoid using generic_block_bmap
Surprisingly, `block' in sector_t indicates the number of
i_blkbits-sized blocks rather than sectors for bmap.

In addition, considering buffer_head limits mapped size to 32-bits,
should avoid using generic_block_bmap.

Link: https://lore.kernel.org/r/20201209115740.18802-1-huangjianan@oppo.com
Fixes: 9da681e017 ("staging: erofs: support bmap")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
[ Gao Xiang: slightly update the commit message description. ]
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-10 11:07:40 +08:00
Chunguang Xu 41fca96e63 ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set()
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-7-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-09 14:22:56 -05:00
Chunguang Xu 8041ac642a ext4: update ext4_data_block_valid related comments
Since ext4_data_block_valid() has been renamed to ext4_inode_block_valid(),
the related comments need to be updated.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-5-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-09 14:11:27 -05:00
Pavel Begunkov 59850d226e io_uring: fix io_cqring_events()'s noflush
Checking !list_empty(&ctx->cq_overflow_list) around noflush in
io_cqring_events() is racy, because if it fails but a request overflowed
just after that, io_cqring_overflow_flush() still will be called.

Remove the second check, it shouldn't be a problem for performance,
because there is cq_check_overflow bit check just above.

Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:02 -07:00
Pavel Begunkov 634578f800 io_uring: fix racy IOPOLL flush overflow
It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without
hodling uring_lock, because it does synchronisation differently. Make
sure we have it.

As for io_ring_exit_work(), we don't even need it there because
io_ring_ctx_wait_and_kill() already set force flag making all overflowed
requests to be dropped.

Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 31bff9a51b io_uring: fix racy IOPOLL completions
IOPOLL allows buffer remove/provide requests, but they doesn't
synchronise by rules of IOPOLL, namely it have to hold uring_lock.

Cc: <stable@vger.kernel.org> # 5.7+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Xiaoguang Wang dad1b1242f io_uring: always let io_iopoll_complete() complete polled io
Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring():
[   95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0
[   95.505921]
[   95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G    B
W         5.10.0-rc5+ #1
[   95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   95.508248] Call Trace:
[   95.508683]  dump_stack+0x107/0x163
[   95.509323]  ? io_commit_cqring+0x3ec/0x8e0
[   95.509982]  print_address_description.constprop.0+0x3e/0x60
[   95.510814]  ? vprintk_func+0x98/0x140
[   95.511399]  ? io_commit_cqring+0x3ec/0x8e0
[   95.512036]  ? io_commit_cqring+0x3ec/0x8e0
[   95.512733]  kasan_report_invalid_free+0x51/0x80
[   95.513431]  ? io_commit_cqring+0x3ec/0x8e0
[   95.514047]  __kasan_slab_free+0x141/0x160
[   95.514699]  kfree+0xd1/0x390
[   95.515182]  io_commit_cqring+0x3ec/0x8e0
[   95.515799]  __io_req_complete.part.0+0x64/0x90
[   95.516483]  io_wq_submit_work+0x1fa/0x260
[   95.517117]  io_worker_handle_work+0xeac/0x1c00
[   95.517828]  io_wqe_worker+0xc94/0x11a0
[   95.518438]  ? io_worker_handle_work+0x1c00/0x1c00
[   95.519151]  ? __kthread_parkme+0x11d/0x1d0
[   95.519806]  ? io_worker_handle_work+0x1c00/0x1c00
[   95.520512]  ? io_worker_handle_work+0x1c00/0x1c00
[   95.521211]  kthread+0x396/0x470
[   95.521727]  ? _raw_spin_unlock_irq+0x24/0x30
[   95.522380]  ? kthread_mod_delayed_work+0x180/0x180
[   95.523108]  ret_from_fork+0x22/0x30
[   95.523684]
[   95.523985] Allocated by task 4035:
[   95.524543]  kasan_save_stack+0x1b/0x40
[   95.525136]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[   95.525882]  kmem_cache_alloc_trace+0x17b/0x310
[   95.533930]  io_queue_sqe+0x225/0xcb0
[   95.534505]  io_submit_sqes+0x1768/0x25f0
[   95.535164]  __x64_sys_io_uring_enter+0x89e/0xd10
[   95.535900]  do_syscall_64+0x33/0x40
[   95.536465]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   95.537199]
[   95.537505] Freed by task 4035:
[   95.538003]  kasan_save_stack+0x1b/0x40
[   95.538599]  kasan_set_track+0x1c/0x30
[   95.539177]  kasan_set_free_info+0x1b/0x30
[   95.539798]  __kasan_slab_free+0x112/0x160
[   95.540427]  kfree+0xd1/0x390
[   95.540910]  io_commit_cqring+0x3ec/0x8e0
[   95.541516]  io_iopoll_complete+0x914/0x1390
[   95.542150]  io_do_iopoll+0x580/0x700
[   95.542724]  io_iopoll_try_reap_events.part.0+0x108/0x200
[   95.543512]  io_ring_ctx_wait_and_kill+0x118/0x340
[   95.544206]  io_uring_release+0x43/0x50
[   95.544791]  __fput+0x28d/0x940
[   95.545291]  task_work_run+0xea/0x1b0
[   95.545873]  do_exit+0xb6a/0x2c60
[   95.546400]  do_group_exit+0x12a/0x320
[   95.546967]  __x64_sys_exit_group+0x3f/0x50
[   95.547605]  do_syscall_64+0x33/0x40
[   95.548155]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason is that once we got a non EAGAIN error in io_wq_submit_work(),
we'll complete req by calling io_req_complete(), which will hold completion_lock
to call io_commit_cqring(), but for polled io, io_iopoll_complete() won't
hold completion_lock to call io_commit_cqring(), then there maybe concurrent
access to ctx->defer_list, double free may happen.

To fix this bug, we always let io_iopoll_complete() complete polled io.

Cc: <stable@vger.kernel.org> # 5.5+
Reported-by: Abaci Fuzz <abaci@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 9c8e11b36c io_uring: add timeout update
Support timeout updates through IORING_OP_TIMEOUT_REMOVE with passed in
IORING_TIMEOUT_UPDATE. Updates doesn't support offset timeout mode.
Oirignal timeout.off will be ignored as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: remove now unused 'ret' variable]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov fbd15848f3 io_uring: restructure io_timeout_cancel()
Add io_timeout_extract() helper, which searches and disarms timeouts,
but doesn't complete them. No functional changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov bee749b187 io_uring: fix files cancellation
io_uring_cancel_files()'s task check condition mistakenly got flipped.

1. There can't be a request in the inflight list without
IO_WQ_WORK_FILES, kill this check to keep the whole condition simpler.
2. Also, don't call the function for files==NULL to not do such a check,
all that staff is already handled well by its counter part,
__io_uring_cancel_task_requests().

With that just flip the task check.

Also, it iowq-cancels all request of current task there, don't forget to
set right ->files into struct io_task_cancel.

Fixes: c1973b38bf639 ("io_uring: cancel only requests of current task")
Reported-by: syzbot+c0d52d0b3c0c3ffb9525@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Jens Axboe ac0648a56c io_uring: use bottom half safe lock for fixed file data
io_file_data_ref_zero() can be invoked from soft-irq from the RCU core,
hence we need to ensure that the file_data lock is bottom half safe. Use
the _bh() variants when grabbing this lock.

Reported-by: syzbot+1f4ba1e5520762c523c6@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov bd5bbda72f io_uring: fix miscounting ios_left
io_req_init() doesn't decrement state->ios_left if a request doesn't
need ->file, it just returns before that on if(!needs_file). That's
not really a problem but may cause overhead for an additional fput().
Also inline and kill io_req_set_file() as it's of no use anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 6e1271e60c io_uring: change submit file state invariant
Keep submit state invariant of whether there are file refs left based on
state->nr_refs instead of (state->file==NULL), and always check against
the first one. It's easier to track and allows to remove 1 if. It also
automatically leaves struct submit_state in a consistent state after
io_submit_state_end(), that's not used yet but nice.

btw rename has_refs to file_refs for more clarity.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Xiaoguang Wang 65b2b21348 io_uring: check kthread stopped flag when sq thread is unparked
syzbot reports following issue:
INFO: task syz-executor.2:12399 can't die for more than 143 seconds.
task:syz-executor.2  state:D stack:28744 pid:12399 ppid:  8504 flags:0x00004004
Call Trace:
 context_switch kernel/sched/core.c:3773 [inline]
 __schedule+0x893/0x2170 kernel/sched/core.c:4522
 schedule+0xcf/0x270 kernel/sched/core.c:4600
 schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847
 do_wait_for_common kernel/sched/completion.c:85 [inline]
 __wait_for_common kernel/sched/completion.c:106 [inline]
 wait_for_common kernel/sched/completion.c:117 [inline]
 wait_for_completion+0x163/0x260 kernel/sched/completion.c:138
 kthread_stop+0x17a/0x720 kernel/kthread.c:596
 io_put_sq_data fs/io_uring.c:7193 [inline]
 io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290
 io_finish_async fs/io_uring.c:7297 [inline]
 io_sq_offload_create fs/io_uring.c:8015 [inline]
 io_uring_create fs/io_uring.c:9433 [inline]
 io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: Unable to access opcode bytes at RIP 0x45de8f.
RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5
RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c
INFO: task syz-executor.2:12399 blocked for more than 143 seconds.
      Not tainted 5.10.0-rc3-next-20201110-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Currently we don't have a reproducer yet, but seems that there is a
race in current codes:
=> io_put_sq_data
      ctx_list is empty now.       |
==> kthread_park(sqd->thread);     |
                                   | T1: sq thread is parked now.
==> kthread_stop(sqd->thread);     |
    KTHREAD_SHOULD_STOP is set now.|
===> kthread_unpark(k);            |
                                   | T2: sq thread is now unparkd, run again.
                                   |
                                   | T3: sq thread is now preempted out.
                                   |
===> wake_up_process(k);           |
                                   |
                                   | T4: Since sqd ctx_list is empty, needs_sched will be true,
                                   | then sq thread sets task state to TASK_INTERRUPTIBLE,
                                   | and schedule, now sq thread will never be waken up.
===> wait_for_completion           |

I have artificially used mdelay() to simulate above race, will get same
stack like this syzbot report, but to be honest, I'm not sure this code
race triggers syzbot report.

To fix this possible code race, when sq thread is unparked, need to check
whether sq thread has been stopped.

Reported-by: syzbot+03beeb595f074db9cfd1@syzkaller.appspotmail.com
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 36f72fe279 io_uring: share fixed_file_refs b/w multiple rsrcs
Double fixed files for splice/tee are done in a nasty way, it takes 2
ref_node refs, and during the second time it blindly overrides
req->fixed_file_refs hoping that it haven't changed. That works because
all that is done under iouring_lock in a single go but is error-prone.

Bind everything explicitly to a single ref_node and take only one ref,
with current ref_node ordering it's guaranteed to keep all files valid
awhile the request is inflight.

That's mainly a cleanup + preparation for generic resource handling,
but also saves pcpu_ref get/put for splice/tee with 2 fixed files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov c98de08c99 io_uring: replace inflight_wait with tctx->wait
As tasks now cancel only theirs requests, and inflight_wait is awaited
only in io_uring_cancel_files(), which should be called with ->in_idle
set, instead of keeping a separate inflight_wait use tctx->wait.

That will add some spurious wakeups but actually is safer from point of
not hanging the task.

e.g.
task1                   | IRQ
                        | *start* io_complete_rw_common(link)
                        |        link: req1 -> req2 -> req3(with files)
*cancel_files()         |
io_wq_cancel(), etc.    |
                        | put_req(link), adds to io-wq req2
schedule()              |

So, task1 will never try to cancel req2 or req3. If req2 is
long-standing (e.g. read(empty_pipe)), this may hang.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 10cad2c40d io_uring: don't take fs for recvmsg/sendmsg
We don't even allow not plain data msg_control, which is disallowed in
__sys_{send,revb}msg_sock(). So no need in fs for IORING_OP_SENDMSG and
IORING_OP_RECVMSG. fs->lock is less contanged not as much as before, but
there are cases that can be, e.g. IOSQE_ASYNC.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Xiaoguang Wang 2e9dbe902d io_uring: only wake up sq thread while current task is in io worker context
If IORING_SETUP_SQPOLL is enabled, sqes are either handled in sq thread
task context or in io worker task context. If current task context is sq
thread, we don't need to check whether should wake up sq thread.

io_iopoll_req_issued() calls wq_has_sleeper(), which has smp_mb() memory
barrier, before this patch, perf shows obvious overhead:
  Samples: 481K of event 'cycles', Event count (approx.): 299807382878
  Overhead  Comma  Shared Object     Symbol
     3.69%  :9630  [kernel.vmlinux]  [k] io_issue_sqe

With this patch, perf shows:
  Samples: 482K of event 'cycles', Event count (approx.): 299929547283
  Overhead  Comma  Shared Object     Symbol
     0.70%  :4015  [kernel.vmlinux]  [k] io_issue_sqe

It shows some obvious improvements.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Xiaoguang Wang 906a3c6f9c io_uring: don't acquire uring_lock twice
Both IOPOLL and sqes handling need to acquire uring_lock, combine
them together, then we just need to acquire uring_lock once.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Xiaoguang Wang a0d9205f7d io_uring: initialize 'timeout' properly in io_sq_thread()
Some static checker reports below warning:
    fs/io_uring.c:6939 io_sq_thread()
    error: uninitialized symbol 'timeout'.

This is a false positive, but let's just initialize 'timeout' to make
sure we don't trip over this.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Xiaoguang Wang 0836924634 io_uring: refactor io_sq_thread() handling
There are some issues about current io_sq_thread() implementation:
  1. The prepare_to_wait() usage in __io_sq_thread() is weird. If
multiple ctxs share one same poll thread, one ctx will put poll thread
in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't
need to change task's stat at all. I think only if all ctxs don't have
work to do, we can do it.
  2. We use round-robin strategy to make multiple ctxs share one same
poll thread, but there are various condition in __io_sq_thread(), which
seems complicated and may affect round-robin strategy.

To improve above issues, I take below actions:
  1. If multiple ctxs share one same poll thread, only if all all ctxs
don't have work to do, we can call prepare_to_wait() and schedule() to
make poll thread enter sleep state.
  2. To make round-robin strategy more straight, I simplify
__io_sq_thread() a bit, it just does io poll and sqes submit work once,
does not check various condition.
  3. For multiple ctxs share one same poll thread, we choose the biggest
sq_thread_idle among these ctxs as timeout condition, and will update
it when ctx is in or out.
  4. Not need to check EBUSY especially, if io_submit_sqes() returns
EBUSY, IORING_SQ_CQ_OVERFLOW should be set, helper in liburing should
be aware of cq overflow and enters kernel to flush work.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov f6edbabb83 io_uring: always batch cancel in *cancel_files()
Instead of iterating over each request and cancelling it individually in
io_uring_cancel_files(), try to cancel all matching requests and use
->inflight_list only to check if there anything left.

In many cases it should be faster, and we can reuse a lot of code from
task cancellation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 6b81928d4c io_uring: pass files into kill timeouts/poll
Make io_poll_remove_all() and io_kill_timeouts() to match against files
as well. A preparation patch, effectively not used by now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov b52fda00dd io_uring: don't iterate io_uring_cancel_files()
io_uring_cancel_files() guarantees to cancel all matching requests,
that's not necessary to do that in a loop. Move it up in the callchain
into io_uring_cancel_task_requests().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov df9923f967 io_uring: cancel only requests of current task
io_uring_cancel_files() cancels all request that match files regardless
of task. There is no real need in that, cancel only requests of the
specified task. That also handles SQPOLL case as it already changes task
to it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 08d2363464 io_uring: add a {task,files} pair matching helper
Add io_match_task() that matches both task and files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 06de5f5973 io_uring: simplify io_task_match()
If IORING_SETUP_SQPOLL is set all requests belong to the corresponding
SQPOLL task, so skip task checking in that case and always match.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 2846c481c9 io_uring: inline io_import_iovec()
Inline io_import_iovec() and leave only its former __io_import_iovec()
renamed to the original name. That makes it more obious what is reused in
io_read/write().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 632546c4b5 io_uring: remove duplicated io_size from rw
io_size and iov_count in io_read() and io_write() hold the same value,
kill the last one.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
David Laight 10fc72e433 fs/io_uring Don't use the return value from import_iovec().
This is the only code that relies on import_iovec() returning
iter.count on success.
This allows a better interface to import_iovec().

Signed-off-by: David Laight <david.laight@aculab.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 1a38ffc9cb io_uring: NULL files dereference by SQPOLL
SQPOLL task may find sqo_task->files == NULL and
__io_sq_thread_acquire_files() would leave it unset, so following
fget_many() and others try to dereference NULL and fault. Propagate
an error files are missing.

[  118.962785] BUG: kernel NULL pointer dereference, address:
	0000000000000020
[  118.963812] #PF: supervisor read access in kernel mode
[  118.964534] #PF: error_code(0x0000) - not-present page
[  118.969029] RIP: 0010:__fget_files+0xb/0x80
[  119.005409] Call Trace:
[  119.005651]  fget_many+0x2b/0x30
[  119.005964]  io_file_get+0xcf/0x180
[  119.006315]  io_submit_sqes+0x3a4/0x950
[  119.007481]  io_sq_thread+0x1de/0x6a0
[  119.007828]  kthread+0x114/0x150
[  119.008963]  ret_from_fork+0x22/0x30

Reported-by: Josef Grieb <josef.grieb@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Hao Xu c73ebb685f io_uring: add timeout support for io_uring_enter()
Now users who want to get woken when waiting for events should submit a
timeout command first. It is not safe for applications that split SQ and
CQ handling between two threads, such as mysql. Users should synchronize
the two threads explicitly to protect SQ and that will impact the
performance.

This patch adds support for timeout to existing io_uring_enter(). To
avoid overloading arguments, it introduces a new parameter structure
which contains sigmask and timeout.

I have tested the workloads with one thread submiting nop requests
while the other reaping the cqe with timeout. It shows 1.8~2x faster
when the iodepth is 16.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: various cleanups/fixes, and name change to SIG_IS_DATA]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Jens Axboe 27926b683d io_uring: only plug when appropriate
We unconditionally call blk_start_plug() when starting the IO
submission, but we only really should do that if we have more than 1
request to submit AND we're potentially dealing with block based storage
underneath. For any other type of request, it's just a waste of time to
do so.

Add a ->plug bit to io_op_def and set it for read/write requests. We
could make this more precise and check the file itself as well, but it
doesn't matter that much and would quickly become more expensive.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Pavel Begunkov 0415767e7f io_uring: rearrange io_kiocb fields for better caching
We've got extra 8 bytes in the 2nd cacheline, put ->fixed_file_refs
there, so inline execution path mostly doesn't touch the 3rd cacheline
for fixed_file requests as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Pavel Begunkov f2f87370bb io_uring: link requests with singly linked list
Singly linked list for keeping linked requests is enough, because we
almost always operate on the head and traverse forward with the
exception of linked timeouts going 1 hop backwards.

Replace ->link_list with a handmade singly linked list. Also kill
REQ_F_LINK_HEAD in favour of checking a newly added ->list for NULL
directly.

That saves 8B in io_kiocb, is not as heavy as list fixup, makes better
use of cache by not touching a previous request (i.e. last request of
the link) each time on list modification and optimises cache use further
in the following patch, and actually makes travesal easier removing in
the end some lines. Also, keeping invariant in ->list instead of having
REQ_F_LINK_HEAD is less error-prone.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Pavel Begunkov 90cd7e4249 io_uring: track link timeout's master explicitly
In preparation for converting singly linked lists for chaining requests,
make linked timeouts save requests that they're responsible for and not
count on doubly linked list for back referencing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Pavel Begunkov 863e05604a io_uring: track link's head and tail during submit
Explicitly save not only a link's head in io_submit_sqe[s]() but the
tail as well. That's in preparation for keeping linked requests in a
singly linked list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Pavel Begunkov 018043be1f io_uring: split poll and poll_remove structs
Don't use a single struct for polls and poll remove requests, they have
totally different layouts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Jens Axboe 14a1143b68 io_uring: add support for IORING_OP_UNLINKAT
IORING_OP_UNLINKAT behaves like unlinkat(2) and takes the same flags
and arguments.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Jens Axboe 80a261fd00 io_uring: add support for IORING_OP_RENAMEAT
IORING_OP_RENAMEAT behaves like renameat2(), and takes the same flags
etc.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00