Commit Graph

81738 Commits

Author SHA1 Message Date
Jeff Layton 2005f5b9c3 lockd: fix races in client GRANTED_MSG wait logic
After the wait for a grant is done (for whatever reason), nlmclnt_block
updates the status of the nlm_rqst with the status of the block. At the
point it does this, however, the block is still queued its status could
change at any time.

This is particularly a problem when the waiting task is signaled during
the wait. We can end up giving up on the lock just before the GRANTED_MSG
callback comes in, and accept it even though the lock request gets back
an error, leaving a dangling lock on the server.

Since the nlm_wait never lives beyond the end of nlmclnt_lock, put it on
the stack and add functions to allow us to enqueue and dequeue the
block. Enqueue it just before the lock/wait loop, and dequeue it
just after we exit the loop instead of waiting until the end of
the function. Also, scrape the status at the time that we dequeue it to
ensure that it's final.

Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton f0aa4852e6 lockd: move struct nlm_wait to lockd.h
The next patch needs struct nlm_wait in fs/lockd/clntproc.c, so move
the definition to a shared header file. As an added clean-up, drop
the unused b_reclaim field.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton bfca7a6f0c lockd: purge resources held on behalf of nlm clients when shutting down
It's easily possible for the server to have an outstanding lock when we
go to shut down. When that happens, we often get a warning like this in
the kernel log:

    lockd: couldn't shutdown host module for net f0000000!

This is because the shutdown procedures skip removing any hosts that
still have outstanding resources (locks). Eventually, things seem to get
cleaned up anyway, but the log message is unsettling, and server
shutdown doesn't seem to be working the way it was intended.

Ensure that we tear down any resources held on behalf of a client when
tearing one down for server shutdown.

Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Chuck Lever c4c649ab41 NFSD: Convert filecache to rhltable
While we were converting the nfs4_file hashtable to use the kernel's
resizable hashtable data structure, Neil Brown observed that the
list variant (rhltable) would be better for managing nfsd_file items
as well. The nfsd_file hash table will contain multiple entries for
the same inode -- these should be kept together on a list. And, it
could be possible for exotic or malicious client behavior to cause
the hash table to resize itself on every insertion.

A nice simplification is that rhltable_lookup() can return a list
that contains only nfsd_file items that match a given inode, which
enables us to eliminate specialized hash table helper functions and
use the default functions provided by the rhashtable implementation).

Since we are now storing nfsd_file items for the same inode on a
single list, that effectively reduces the number of hash entries
that have to be tracked in the hash table. The mininum bucket count
is therefore lowered.

Light testing with fstests generic/531 show no regressions.

Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Jeff Layton dcb779fcd4 nfsd: allow reaping files still under writeback
On most filesystems, there is no reason to delay reaping an nfsd_file
just because its underlying inode is still under writeback. nfsd just
relies on client activity or the local flusher threads to do writeback.

The main exception is NFS, which flushes all of its dirty data on last
close. Add a new EXPORT_OP_FLUSH_ON_CLOSE flag to allow filesystems to
signal that they do this, and only skip closing files under writeback on
such filesystems.

Also, remove a redundant NULL file pointer check in
nfsd_file_check_writeback, and clean up nfs's export op flag
definitions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Jeff Layton 972cc0e092 nfsd: update comment over __nfsd_file_cache_purge
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Jeff Layton b2ff1bd71d nfsd: don't take/put an extra reference when putting a file
The last thing that filp_close does is an fput, so don't bother taking
and putting the extra reference.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Jeff Layton b680cb9b73 nfsd: add some comments to nfsd_file_do_acquire
David Howells mentioned that he found this bit of code confusing, so
sprinkle in some comments to clarify.

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Jeff Layton c6593366c0 nfsd: don't kill nfsd_files because of lease break error
An error from break_lease is non-fatal, so we needn't destroy the
nfsd_file in that case. Just put the reference like we normally would
and return the error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Jeff Layton d69b8dbfd0 nfsd: simplify test_bit return in NFSD_FILE_KEY_FULL comparator
test_bit returns bool, so we can just compare the result of that to the
key->gc value without the "!!".

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Jeff Layton 6c31e4c988 nfsd: NFSD_FILE_KEY_INODE only needs to find GC'ed entries
Since v4 files are expected to be long-lived, there's little value in
closing them out of the cache when there is conflicting access.

Change the comparator to also match the gc value in the key. Change both
of the current users of that key to set the gc value in the key to
"true".

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Jeff Layton b8bea9f6cd nfsd: don't open-code clear_and_wake_up_bit
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Paolo Abeni c248b27cfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-26 10:17:46 +02:00
Jens Axboe afed6271f5 pipe: set FMODE_NOWAIT on pipes
Pipes themselves do not hold the the pipe lock across IO, and hence are
safe for RWF_NOWAIT/IOCB_NOWAIT usage. The "contract" for NOWAIT is
really "should not do IO under this lock", not strictly that we cannot
block or that the below code is in any way atomic. Pipes fulfil that
criteria.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-25 14:08:59 -06:00
Jens Axboe 0f99fc513d splice: clear FMODE_NOWAIT on file if splice/vmsplice is used
In preparation for pipes setting FMODE_NOWAIT on pipes to indicate that
RWF_NOWAIT/IOCB_NOWAIT is fully supported, have splice and vmsplice
clear that file flag. Splice holds the pipe lock around IO and cannot
easily be refactored to avoid that, as splice and pipes are inherently
tied together.

By clearing FMODE_NOWAIT if splice is being used on a pipe, other users
of the pipe will know that the pipe is no longer safe for RWF_NOWAIT
and friends.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-25 14:08:55 -06:00
Linus Torvalds 736b378b29 slab changes for 6.4
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmRCSGEACgkQu+CwddJF
 iJpA2wgAkwMP++Znd8JU3iQ4N53lv18euNuEMLTOY+jk7zXHvsRX8KyzLmsohUKO
 SSGVi1Om785AidOsJhARJawW7AWYuJ5l7ri+FyskTwrTUcMC4UZ/IT2tB22lRsXi
 0f3lgbdArZbj7aq7AVO9N7bh9rgVUHa/RHIwXzMp0sc9nekne9t+FFv7tyRnr7cc
 SMp/FdMZqbt9pVf0Uwud1BpdgER7QqQaSfaxITL7D2oJTePRZVWiXerrr4hMcQl1
 s6kgUgKdlaYmIx2N8eP1Nmp7undtwHo1C8dLLWKGCEuEAaXIxtXUtaUWFFmBDzH9
 Fv6qswNFcfwiLNPsY+xi9iA+vlGKAg==
 =T0EM
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:
 "The main change is naturally the SLOB removal. Since its deprecation
  in 6.2 I've seen no complaints so hopefully SLUB_(TINY) works well for
  everyone and we can proceed.

  Besides the code cleanup, the main immediate benefit will be allowing
  kfree() family of function to work on kmem_cache_alloc() objects,
  which was incompatible with SLOB. This includes kfree_rcu() which had
  no kmem_cache_free_rcu() counterpart yet and now it shouldn't be
  necessary anymore.

  Besides that, there are several small code and comment improvements
  from Thomas, Thorsten and Vernon"

* tag 'slab-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: document kfree() as allowed for kmem_cache_alloc() objects
  mm/slob: remove slob.c
  mm/slab: remove CONFIG_SLOB code from slab common code
  mm, pagemap: remove SLOB and SLQB from comments and documentation
  mm, page_flags: remove PG_slob_free
  mm/slob: remove CONFIG_SLOB
  mm/slub: fix help comment of SLUB_DEBUG
  mm: slub: make kobj_type structure constant
  slab: Adjust comment after refactoring of gfp.h
2023-04-25 13:00:41 -07:00
Linus Torvalds 53b5e72b9d asm-generic updates for 6.4
These are various cleanups, fixing a number of uapi header files to no
 longer reference CONFIG_* symbols, and one patch that introduces the
 new CONFIG_HAS_IOPORT symbol for architectures that provide working
 inb()/outb() macros, as a preparation for adding driver dependencies
 on those in the following release.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmRG8IkACgkQYKtH/8kJ
 Uid15Q/9E/neIIEqEk6IvtyhUicrJiIZUM0rGoYtWXiz75ggk6Kx9+3I+j8zIQ/E
 kf2TzAG7q9Md7nfTDFLr4FSr0IcNDj+VG4nYxUyDHdKGcARO+g9Kpdvscxip3lgU
 Rw5w74Gyd30u4iUKGS39OYuxcCgl9LaFjMA9Gh402Oiaoh+OYLmgQS9h/goUD5KN
 Nd+AoFvkdbnHl0/SpxthLRyL5rFEATBmAY7apYViPyMvfjS3gfDJwXJR9jkKgi6X
 Qs4t8Op8BA3h84dCuo6VcFqgAJs2Wiq3nyTSUnkF8NxJ2RFTpeiVgfsLOzXHeDgz
 SKDB4Lp14o3mlyZyj00MWq1uMJRRetUgNiVb6iHOoKQ/E4demBdh+mhIFRybjM5B
 XNTWFcg9PWFCMa4W9jnLfZBc881X4+7T+qUF8I0W/1AbRJUmyGj8HO6jLceC4yGD
 UYLn5oFPM6OWXHp6DqJrCr9Yw8h6fuviQZFEbl/ARlgVGt+J4KbYweJYk8DzfX6t
 PZIj8LskOqyIpRuC2oDA1PHxkaJ1/z+N5oRBHq1uicSh4fxY5HW7HnyzgF08+R3k
 cf+fjAhC3TfGusHkBwQKQJvpxrxZjPuvYXDZ0GxTvNKJRB8eMeiTm1n41E5oTVwQ
 swSblSCjZj/fMVVPXLcjxEW4SBNWRxa9Lz3tIPXb3RheU10Lfy8=
 =H3k4
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "These are various cleanups, fixing a number of uapi header files to no
  longer reference CONFIG_* symbols, and one patch that introduces the
  new CONFIG_HAS_IOPORT symbol for architectures that provide working
  inb()/outb() macros, as a preparation for adding driver dependencies
  on those in the following release"

* tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  Kconfig: introduce HAS_IOPORT option and select it as necessary
  scripts: Update the CONFIG_* ignore list in headers_install.sh
  pktcdvd: Remove CONFIG_CDROM_PKTCDVD_WCACHE from uapi header
  Move bp_type_idx to include/linux/hw_breakpoint.h
  Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
  Move COMPAT_ATM_ADDPARTY to net/atm/svc.c
2023-04-25 12:22:11 -07:00
Bob Peterson 644f6bf762 gfs2: gfs2_ail_empty_gl no log flush on error
Before this patch, function gfs2_ail_empty_gl called gfs2_log_flush even
in cases where it encountered an error. It should probably skip the log
flush and leave the file system in an inconsistent state, letting a
subsequent withdraw force the journal to be replayed to reestablish
metadata consistency.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-04-25 11:07:16 +02:00
Bob Peterson b97e583caa gfs2: Issue message when revokes cannot be written
Before this patch, function gfs2_ail_empty_gl would silently return an
error to the caller. This would get silently set into sd_log_error which
would cause a withdraw, but there was no indication why the file system
was withdrawn. This patch adds a fs_err to log the appropriate error
message.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-04-25 11:06:30 +02:00
Bob Peterson 68ca088dc1 gfs2: Perform second log flush in gfs2_make_fs_ro
Before this patch, function gfs2_make_fs_ro called gfs2_log_flush once to
finalize the log. However, if there's dirty metadata, log flushes tend
to sync the metadata and formulate revokes. Before this patch, those
revokes may not be written out to the journal immediately, which meant
unresolved glocks could still have revokes in their ail lists. When the
glock worker runs, it tries to transition the glock, but the unresolved
revokes in the ail still need to be written, so it tries to start a
transaction. It's impossible to start a transaction because at that
point, the SDF_JOURNAL_LIVE flag has been cleared by gfs2_make_fs_ro.
That causes the glock worker to fail, unable to write the revokes. The
calling sequence looked something like this:

gfs2_make_fs_ro
   gfs2_log_flush - with GFS2_LOG_HEAD_FLUSH_SHUTDOWN flag set
	if (flags & GFS2_LOG_HEAD_FLUSH_SHUTDOWN)
		clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
...meanwhile...
glock_work_func
   do_xmote
      rgrp_go_sync (or possibly inode_go_sync)
         ...
         gfs2_ail_empty_gl
            __gfs2_trans_begin
               if (unlikely(!test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))) {
               ...
                  return -EROFS;

The previous patch in the series ("gfs2: return errors from
gfs2_ail_empty_gl") now causes the transaction error to no longer be
ignored, so it causes a warning from MOST of the xfstests:

WARNING: CPU: 11 PID: X at fs/gfs2/super.c:603 gfs2_put_super [gfs2]

which corresponds to:

WARN_ON(gfs2_withdrawing(sdp));

The withdraw was triggered silently from do_xmote by:

	if (unlikely(sdp->sd_log_error && !gfs2_withdrawn(sdp)))
		gfs2_withdraw_delayed(sdp);

This patch adds a second log_flush to gfs2_make_fs_ro: one to sync the
data and one to sync any outstanding revokes and finalize the journal.
Note that both of these log flushes need to be "special," in other
words, not GFS2_LOG_HEAD_FLUSH_NORMAL.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-04-25 11:01:41 +02:00
Bob Peterson 24ab158298 gfs2: return errors from gfs2_ail_empty_gl
Before this patch, function gfs2_ail_empty_gl did not return errors it
encountered from __gfs2_trans_begin. Those errors usually came from the
fact that the file system was made read-only, often due to unmount
(but theoretically could be due to -o remount,ro), which prevented
the transaction from starting.

The inability to start a transaction prevented its revokes from being
properly written to the journal for glocks during unmount (and
transition to ro).

That meant glocks could be unlocked without the metadata properly
revoked in the journal. So other nodes could grab the glock thinking
that their lvb values were correct, but in fact corresponded to the
glock without its revokes properly synced. That presented as lvb
mismatch errors.

This patch allows gfs2_ail_empty_gl to return the error properly to
the caller.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-04-25 11:00:21 +02:00
Linus Torvalds 181b69dd6e misc pile
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYC7AAKCRBZ7Krx/gZQ
 66NCAP9khf7D5Clz5BsUjlsYoXpBPDaSGhVnpAR8mRQ4/Y8eRQD/dyBVt0FuU72Y
 j1w/foMeP3195Bfdp7UwwZLSU+hbxgw=
 =M/uq
 -----END PGP SIGNATURE-----

Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs pile from Al Viro.

Random minor cleanups.

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Fix description of vfs_tmpfile()
  sysv: switch to put_and_unmap_page()
  fs/sysv: Don't round down address for kunmap_flush_on_unmap()
2023-04-24 19:38:34 -07:00
Linus Torvalds 11b32219cb legacy direct-io cleanup
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYDPAAKCRBZ7Krx/gZQ
 65qvAQC60ak71gAqOSktR93GXYrQ5Q8hbpug7t3lzDQE9huGqgEAl0Zvr8d5ir3j
 y0X5U5Yl6bcUSQDd4VY76C+53yIKZQs=
 =rKXE
 -----END PGP SIGNATURE-----

Merge tag 'pull-old-dio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull legacy dio cleanup from Al Viro.

* tag 'pull-old-dio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  __blockdev_direct_IO(): get rid of submit_io callback
2023-04-24 19:28:49 -07:00
Linus Torvalds 0e497ad525 write_one_page series
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYC0gAKCRBZ7Krx/gZQ
 6+N5AQCERtd5I2RLm7URX0kurwOthe7o+DX4Lj7y/mcjZV2N4gEAu9VrgHBIuLev
 +KuGGY0VnKwCtcgGcGGNBrfrRjyKugs=
 =gDk0
 -----END PGP SIGNATURE-----

Merge tag 'pull-write-one-page' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs write_one_page removal from Al Viro:
 "write_one_page series"

* tag 'pull-write-one-page' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mm,jfs: move write_one_page/folio_write_one to jfs
  ocfs2: don't use write_one_page in ocfs2_duplicate_clusters_by_page
  ufs: don't flush page immediately for DIRSYNC directories
2023-04-24 19:20:27 -07:00
Linus Torvalds ef36b9afc2 fget() to fdget() conversions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYCQAAKCRBZ7Krx/gZQ
 64FdAQDZ2hTDyZEWPt486dWYPYpiKyaGFXSXDGo7wgP0fiwxXQEA/mROKb6JqYw6
 27mZ9A7qluT8r3AfTTQ0D+Yse/dr4AM=
 =GA9W
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fget updates from Al Viro:
 "fget() to fdget() conversions"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fuse_dev_ioctl(): switch to fdget()
  cgroup_get_from_fd(): switch to fdget_raw()
  bpf: switch to fdget_raw()
  build_mount_idmapped(): switch to fdget()
  kill the last remaining user of proc_ns_fget()
  SVM-SEV: convert the rest of fget() uses to fdget() in there
  convert sgx_set_attribute() to fdget()/fdput()
  convert setns(2) to fdget()/fdput()
2023-04-24 19:14:20 -07:00
Linus Torvalds 61d325dcbc Changes since last update:
- Add sub-page block size support for uncompressed files;
 
  - Support flattened block device for multi-blob images to be attached
    into virtual machines (including cloud servers) and bare metals;
 
  - Support long xattr name prefixes to optimize images with common
    xattr namespaces (e.g. files with overlayfs xattrs) use cases;
 
  - Various minor cleanups & fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZETCNREceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBJCMAP9VkAPycbbqa6qWUASdyh/HGyuLJTHSfmsJ
 zO4y6hBgOwD9GXg55sY8ycvcOx9ayaUt5V5f9zhs4wdGcoPhj5fWzgA=
 =nUva
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "In this cycle, sub-page block support for uncompressed files is
  available. It's mainly used to enable original signing ('golden')
  4k-block images on arm64 with 16/64k pages. In addition, end users
  could also use this feature to build a manifest to directly refer to
  golden tar data.

  Besides, long xattr name prefix support is also introduced in this
  cycle to avoid too many xattrs with the same prefix (e.g. overlayfs
  xattrs). It's useful for erofs + overlayfs combination (like Composefs
  model): the image size is reduced by ~14% and runtime performance is
  also slightly improved.

  Others are random fixes and cleanups as usual.

  Summary:

   - Add sub-page block size support for uncompressed files

   - Support flattened block device for multi-blob images to be attached
     into virtual machines (including cloud servers) and bare metals

   - Support long xattr name prefixes to optimize images with common
     xattr namespaces (e.g. files with overlayfs xattrs) use cases

   - Various minor cleanups & fixes"

* tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: cleanup i_format-related stuffs
  erofs: sunset erofs_dbg()
  erofs: fix potential overflow calculating xattr_isize
  erofs: get rid of z_erofs_fill_inode()
  erofs: enable long extended attribute name prefixes
  erofs: handle long xattr name prefixes properly
  erofs: add helpers to load long xattr name prefixes
  erofs: introduce on-disk format for long xattr name prefixes
  erofs: move packed inode out of the compression part
  erofs: keep meta inode into erofs_buf
  erofs: initialize packed inode after root inode is assigned
  erofs: stop parsing non-compact HEAD index if clusterofs is invalid
  erofs: don't warn ztailpacking feature anymore
  erofs: simplify erofs_xattr_generic_get()
  erofs: rename init_inode_xattrs with erofs_ prefix
  erofs: move several xattr helpers into xattr.c
  erofs: tidy up EROFS on-disk naming
  erofs: support flattened block device for multi-blob images
  erofs: set block size to the on-disk block size
  erofs: avoid hardcoded blocksize for subpage block support
2023-04-24 14:25:39 -07:00
Linus Torvalds 97adb49f05 v6.4/vfs.open
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEn8AAKCRCRxhvAZXjc
 okJoAQC1bjJp4SQw7VQhTuyv0Ak67PACwKPNUPQyHcqMV5s5DAD/fcnMjq7+UieH
 TEk/zRBGWWI8m0wb51MMO+VVM2GeXwI=
 =EKj/
 -----END PGP SIGNATURE-----

Merge tag 'v6.4/vfs.open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs open fixlet from Christian Brauner:
 "EINVAL ist keinmal: This contains the changes to make O_DIRECTORY when
  specified together with O_CREAT an invalid request.

  The wider background is that a regression report about the behavior of
  O_DIRECTORY | O_CREAT was sent to fsdevel about a behavior that was
  changed multiple years and LTS releases earlier during v5.7
  development.

  This has also been covered in

        https://lwn.net/Articles/926782/

  which provides an excellent summary of the discussion"

* tag 'v6.4/vfs.open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  open: return EINVAL for O_DIRECTORY | O_CREAT
2023-04-24 14:06:41 -07:00
Linus Torvalds e2eff52ce5 v6.4/vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEJxiQAKCRCRxhvAZXjc
 oj4jAP4pt/SuHdZQzv9k2j+UA4hpDmwt7UUWhLxtycQrt9oEoQD/U9hCwnYY9Ksj
 5TSZiK/OGyoslmF1zfAC+1R2crgr3A0=
 =PpBS
 -----END PGP SIGNATURE-----

Merge tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains a pile of various smaller fixes. Most of them aren't
  very interesting so this just highlights things worth mentioning:

   - Various filesystems contained the same little helper to convert
     from the mode of a dentry to the DT_* type of that dentry.

     They have now all been switched to rely on the generic
     fs_umode_to_dtype() helper. All custom helpers are removed (Jeff)

   - Fsnotify now reports ACCESS and MODIFY events for splice
     (Chung-Chiang Cheng)

   - After converting timerfd a long time ago to rely on
     wait_event_interruptible_*() apis, convert eventfd as well. This
     removes the complex open-coded wait code (Wen Yang)

   - Simplify sysctl registration for devpts, avoiding the declaration
     of two tables. Instead, just use a prefixed path with
     register_sysctl() (Luis)

   - The setattr_should_drop_sgid() helper is now exported so NFS can
     use it. By switching NFS to this helper an NFS setgid inheritance
     bug is fixed (me)"

* tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode()
  pnode: pass mountpoint directly
  eventfd: use wait_event_interruptible_locked_irq() helper
  splice: report related fsnotify events
  fs: consolidate duplicate dt_type helpers
  nfs: use vfs setgid helper
  Update relatime comments to include equality
  fs/buffer: Remove redundant assignment to err
  fs_context: drop the unused lsm_flags member
  fs/namespace: fnic: Switch to use %ptTd
  Documentation: update idmappings.rst
  devpts: simplify two-level sysctl registration for pty_kern_table
  eventpoll: align comment with nested epoll limitation
2023-04-24 13:39:58 -07:00
Linus Torvalds 7bcff5a396 v6.4/vfs.acl
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc
 otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt
 ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM=
 =pER1
 -----END PGP SIGNATURE-----

Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull acl updates from Christian Brauner:
 "After finishing the introduction of the new posix acl api last cycle
  the generic POSIX ACL xattr handlers are still around in the
  filesystems xattr handlers for two reasons:

   (1) Because a few filesystems rely on the ->list() method of the
       generic POSIX ACL xattr handlers in their ->listxattr() inode
       operation.

   (2) POSIX ACLs are only available if IOP_XATTR is raised. The
       IOP_XATTR flag is raised in inode_init_always() based on whether
       the sb->s_xattr pointer is non-NULL. IOW, the registered xattr
       handlers of the filesystem are used to raise IOP_XATTR. Removing
       the generic POSIX ACL xattr handlers from all filesystems would
       risk regressing filesystems that only implement POSIX ACL support
       and no other xattrs (nfs3 comes to mind).

  This contains the work to decouple POSIX ACLs from the IOP_XATTR flag
  as they don't depend on xattr handlers anymore. So it's now possible
  to remove the generic POSIX ACL xattr handlers from the sb->s_xattr
  list of all filesystems. This is a crucial step as the generic POSIX
  ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs
  don't depend on the xattr infrastructure anymore.

  Adressing problem (1) will require more long-term work. It would be
  best to get rid of the ->list() method of xattr handlers completely at
  some point.

  For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX
  ACL xattr handler is kept around so they can continue to use
  array-based xattr handler indexing.

  This update does simplify the ->listxattr() implementation of all
  these filesystems however"

* tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  acl: don't depend on IOP_XATTR
  ovl: check for ->listxattr() support
  reiserfs: rework priv inode handling
  fs: rename generic posix acl handlers
  reiserfs: rework ->listxattr() implementation
  fs: simplify ->listxattr() implementation
  fs: drop unused posix acl handlers
  xattr: remove unused argument
  xattr: add listxattr helper
  xattr: simplify listxattr helpers
2023-04-24 13:35:23 -07:00
Linus Torvalds ec40758b31 v6.4/pidfd.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEt8gAKCRCRxhvAZXjc
 oppuAQDu9kwAQWAl0KzlpjQkrEDAEuyHRy6SCpo1kPPD5f3rigD+INZb3fi2QXmK
 ZL/c6XtII9ah/8i2zfzAgH9Q2ZZu0gk=
 =xcAX
 -----END PGP SIGNATURE-----

Merge tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd updates from Christian Brauner:
 "This adds a new pidfd_prepare() helper which allows the caller to
  reserve a pidfd number and allocates a new pidfd file that stashes the
  provided struct pid.

  It should be avoided installing a file descriptor into a task's file
  descriptor table just to close it again via close_fd() in case an
  error occurs. The fd has been visible to userspace and might already
  be in use. Instead, a file descriptor should be reserved but not
  installed into the caller's file descriptor table.

  If another failure path is hit then the reserved file descriptor and
  file can just be put without any userspace visible side-effects. And
  if all failure paths are cleared the file descriptor and file can be
  installed into the task's file descriptor table.

  This helper is now used in all places that open coded this
  functionality before. For example, this is currently done during
  copy_process() and fanotify used pidfd_create(), which returns a pidfd
  that has already been made visibile in the caller's file descriptor
  table, but then closed it using close_fd().

  In one of the next merge windows there is also new functionality
  coming to unix domain sockets that will have to rely on
  pidfd_prepare()"

* tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fanotify: use pidfd_prepare()
  fork: use pidfd_prepare()
  pid: add pidfd_prepare()
2023-04-24 13:03:42 -07:00
Linus Torvalds c23f28975a Commit volume in documentation is relatively low this time, but there is
still a fair amount going on, including:
 
 - Reorganizing the architecture-specific documentation under
   Documentation/arch.  This makes the structure match the source directory
   and helps to clean up the mess that is the top-level Documentation
   directory a bit.  This work creates the new directory and moves x86 and
   most of the less-active architectures there.  The current plan is to move
   the rest of the architectures in 6.5, with the patches going through the
   appropriate subsystem trees.
 
 - Some more Spanish translations and maintenance of the Italian
   translation.
 
 - A new "Kernel contribution maturity model" document from Ted.
 
 - A new tutorial on quickly building a trimmed kernel from Thorsten.
 
 Plus the usual set of updates and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmRGze0PHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Y/VsH/RyWqinorRVFZmHqRJMRhR0j7hE2pAgK5prE
 dGXYVtHHNQ+25thNaqhZTOLYFbSX6ii2NG7sLRXmyOTGIZrhUCFFXCHkuq4ZUypR
 gJpMUiKQVT4dhln3gIZ0k09NSr60gz8UTcq895N9UFpUdY1SCDhbCcLc4uXTRajq
 NrdgFaHWRkPb+gBRbXOExYm75DmCC6Ny5AyGo2rXfItV//ETjWIJVQpJhlxKrpMZ
 3LgpdYSLhEFFnFGnXJ+EAPJ7gXDi2Tg5DuPbkvJyFOTouF3j4h8lSS9l+refMljN
 xNRessv+boge/JAQidS6u8F2m2ESSqSxisv/0irgtKIMJwXaoX4=
 =1//8
 -----END PGP SIGNATURE-----

Merge tag 'docs-6.4' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "Commit volume in documentation is relatively low this time, but there
  is still a fair amount going on, including:

   - Reorganize the architecture-specific documentation under
     Documentation/arch

     This makes the structure match the source directory and helps to
     clean up the mess that is the top-level Documentation directory a
     bit. This work creates the new directory and moves x86 and most of
     the less-active architectures there.

     The current plan is to move the rest of the architectures in 6.5,
     with the patches going through the appropriate subsystem trees.

   - Some more Spanish translations and maintenance of the Italian
     translation

   - A new "Kernel contribution maturity model" document from Ted

   - A new tutorial on quickly building a trimmed kernel from Thorsten

  Plus the usual set of updates and fixes"

* tag 'docs-6.4' of git://git.lwn.net/linux: (47 commits)
  media: Adjust column width for pdfdocs
  media: Fix building pdfdocs
  docs: clk: add documentation to log which clocks have been disabled
  docs: trace: Fix typo in ftrace.rst
  Documentation/process: always CC responsible lists
  docs: kmemleak: adjust to config renaming
  ELF: document some de-facto PT_* ABI quirks
  Documentation: arm: remove stih415/stih416 related entries
  docs: turn off "smart quotes" in the HTML build
  Documentation: firmware: Clarify firmware path usage
  docs/mm: Physical Memory: Fix grammar
  Documentation: Add document for false sharing
  dma-api-howto: typo fix
  docs: move m68k architecture documentation under Documentation/arch/
  docs: move parisc documentation under Documentation/arch/
  docs: move ia64 architecture docs under Documentation/arch/
  docs: Move arc architecture docs under Documentation/arch/
  docs: move nios2 documentation under Documentation/arch/
  docs: move openrisc documentation under Documentation/arch/
  docs: move superh documentation under Documentation/arch/
  ...
2023-04-24 12:35:49 -07:00
Linus Torvalds 5dfb75e842 RCU Changes for 6.4:
o  MAINTAINERS files additions and changes.
  o  Fix hotplug warning in nohz code.
  o  Tick dependency changes by Zqiang.
  o  Lazy-RCU shrinker fixes by Zqiang.
  o  rcu-tasks stall reporting improvements by Neeraj.
  o  Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
     name for robustness.
  o  Documentation Updates:
  o  Significant changes to srcu_struct size.
  o  Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
  o  rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
  o  Other misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
 5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
 96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
 eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
 yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
 XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
 +Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
 cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
 CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
 93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
 4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
 ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
 =Rgbj
 -----END PGP SIGNATURE-----

Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux

Pull RCU updates from Joel Fernandes:

 - Updates and additions to MAINTAINERS files, with Boqun being added to
   the RCU entry and Zqiang being added as an RCU reviewer.

   I have also transitioned from reviewer to maintainer; however, Paul
   will be taking over sending RCU pull-requests for the next merge
   window.

 - Resolution of hotplug warning in nohz code, achieved by fixing
   cpu_is_hotpluggable() through interaction with the nohz subsystem.

   Tick dependency modifications by Zqiang, focusing on fixing usage of
   the TICK_DEP_BIT_RCU_EXP bitmask.

 - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
   kernels, fixed by Zqiang.

 - Improvements to rcu-tasks stall reporting by Neeraj.

 - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
   increased robustness, affecting several components like mac802154,
   drbd, vmw_vmci, tracing, and more.

   A report by Eric Dumazet showed that the API could be unknowingly
   used in an atomic context, so we'd rather make sure they know what
   they're asking for by being explicit:

      https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/

 - Documentation updates, including corrections to spelling,
   clarifications in comments, and improvements to the srcu_size_state
   comments.

 - Better srcu_struct cache locality for readers, by adjusting the size
   of srcu_struct in support of SRCU usage by Christoph Hellwig.

 - Teach lockdep to detect deadlocks between srcu_read_lock() vs
   synchronize_srcu() contributed by Boqun.

   Previously lockdep could not detect such deadlocks, now it can.

 - Integration of rcutorture and rcu-related tools, targeted for v6.4
   from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
   module parameter, and more

 - Miscellaneous changes, various code cleanups and comment improvements

* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
  checkpatch: Error out if deprecated RCU API used
  mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
  rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
  ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
  net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
  rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
  rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
  rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
  rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
  rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
  rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
  rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
  rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
  rcu/trace: use strscpy() to instead of strncpy()
  tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
  ...
2023-04-24 12:16:14 -07:00
Linus Torvalds 08e30833f8 lsm/stable-6.4 PR 20230420
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmRBolwUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMy/w//YOB9EJ7hpAGouq0Il+SyWdLQP1Bw
 dOaJ5Xs0zDUQJsloqLpkk83aKocHXnl2jIE0mYVhfX2tdd2odKv/qKFcSPCBx1pf
 STRHsDBkNfi9wldAWZ6y92WZk9l0lqwdP/sJ4TMsrJLEnkeOBwcwAA4zzPRVu+dN
 aJQkSCj/5hF7r7/BvpfO+78O2h3dC42L6SepHrjnc/btSZ4qW4dPMJfTD7zT6r5Y
 tVRD/IZ+f7cakKulnWvOIXNGR45CTdE6TiPd9mxkbA2I86wvEec6jLIYtpPoEmtU
 +vENXjKDAX+Af3DyIC0rZECBFoAjLR0Myi75i74Haug0nxPyPqcjDKKYpfKwYxT0
 CH1LHx4rHUbUvXz4tbLuEiNEb5ZX+P5Rpklev8aijvQ/3iVjdzkg74a4QDZcHi8K
 1V/uKSBcC6De3789KmwEYIQu35cXqbT5TscuK4Hf8fdHcPZGRvjps12JSkuRhrIQ
 B5vJ4AZ3O5CWXO9u/n9czssnQ0WHSFFy1/OEpsVgXLpYMwP4xIr0q+C3n1Efnxnp
 HjoqE1N8bgsV4hYzwZwX3z490Vo4V3S6cpYp40UoeiJ0bJup5WuBselOSnZozyLQ
 hxxNHXFY8QtwQ0Ik4rTHfttwa28DE6qF+zh6mJDdgdbLfmlBGn3EaW9cwJrCiQ6X
 pZ6R6SdwFdyj7Uk=
 =JtiD
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20230420' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Move the LSM hook comment blocks into security/security.c

   For many years the LSM hook comment blocks were located in a very odd
   place, include/linux/lsm_hooks.h, where they lived on their own,
   disconnected from both the function prototypes and definitions.

   In keeping with current kernel conventions, this moves all of these
   comment blocks to the top of the function definitions, transforming
   them into the kdoc format in the process. This should make it much
   easier to maintain these comments, which are the main source of LSM
   hook documentation.

   For the most part the comment contents were left as-is, although some
   glaring errors were corrected. Expect additional edits in the future
   as we slowly update and correct the comment blocks.

   This is the bulk of the diffstat.

 - Introduce LSM_ORDER_LAST

   Similar to how LSM_ORDER_FIRST is used to specify LSMs which should
   be ordered before "normal" LSMs, the LSM_ORDER_LAST is used to
   specify LSMs which should be ordered after "normal" LSMs.

   This is one of the prerequisites for transitioning IMA/EVM to a
   proper LSM.

 - Remove the security_old_inode_init_security() hook

   The security_old_inode_init_security() LSM hook only allows for a
   single xattr which is problematic both for LSM stacking and the
   IMA/EVM-as-a-LSM effort. This finishes the conversion over to the
   security_inode_init_security() hook and removes the single-xattr LSM
   hook.

 - Fix a reiserfs problem with security xattrs

   During the security_old_inode_init_security() removal work it became
   clear that reiserfs wasn't handling security xattrs properly so we
   fixed it.

* tag 'lsm-pr-20230420' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (32 commits)
  reiserfs: Add security prefix to xattr name in reiserfs_security_write()
  security: Remove security_old_inode_init_security()
  ocfs2: Switch to security_inode_init_security()
  reiserfs: Switch to security_inode_init_security()
  security: Remove integrity from the LSM list in Kconfig
  Revert "integrity: double check iint_cache was initialized"
  security: Introduce LSM_ORDER_LAST and set it for the integrity LSM
  device_cgroup: Fix typo in devcgroup_css_alloc description
  lsm: fix a badly named parameter in security_get_getsecurity()
  lsm: fix doc warnings in the LSM hook comments
  lsm: styling fixes to security/security.c
  lsm: move the remaining LSM hook comments to security/security.c
  lsm: move the io_uring hook comments to security/security.c
  lsm: move the perf hook comments to security/security.c
  lsm: move the bpf hook comments to security/security.c
  lsm: move the audit hook comments to security/security.c
  lsm: move the binder hook comments to security/security.c
  lsm: move the sysv hook comments to security/security.c
  lsm: move the key hook comments to security/security.c
  lsm: move the xfrm hook comments to security/security.c
  ...
2023-04-24 11:21:50 -07:00
Qi Han 8375be2b64 f2fs: remove unnessary comment in __may_age_extent_tree
This comment make no sense and is in the wrong place, so let's
remove it.

Signed-off-by: Qi Han <hanqi@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-24 11:03:10 -07:00
Daeho Jeong 994b442b66 f2fs: allocate node blocks for atomic write block replacement
When a node block is missing for atomic write block replacement, we need
to allocate it in advance of the replacement.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-24 11:03:10 -07:00
Daeho Jeong 591fc34e1f f2fs: use cow inode data when updating atomic write
Need to use cow inode data content instead of the one in the original
inode, when we try to write the already updated atomic write files.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-24 11:03:10 -07:00
Jaegeuk Kim 2e2c6e9b72 f2fs: remove power-of-two limitation of zoned device
In f2fs, there's no reason to force po2.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-24 11:03:10 -07:00
Linus Torvalds b9dff2195f iter-ubuf.2-2023-04-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvdsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpg4oD/457EJ21Fm36NuyT/S0Cr8ok9Tdk7t9BeBh
 V/9CYThoXr5aqAox0Vq23FF+Rhzm81GzwYERN4493LBblliNeNOo2IaXF9/7qrUW
 11v9Bkug2J3k3hRGtEa6Zl0EpMu+FRLsNpchjFS2KPuOq+iMDxrvwuy50kidWg7n
 r25e4UwpExVO9fIoUSmzgWVfRHOTuj9yiG/UsaH2+2BRXerIX0Q1tyElwmcGh25M
 Ad2hN+yDnuIbNA5gNUpnzY32Dp0zjAsquc//QOvq9mltcNTElokB8idGliismvyd
 8qF0lkwQwewOBT/sSD5EY3K0Qd8IJu425bvT/yPUDScHz1chxHUoxo5eisIr2M9l
 5AL5KHAf7Zzs8ZuV+IYPzZ5qM6a/vF3mHUisKRNKYVhF46Nmd4cBratfXwWb1MxV
 clQM2qr0TLOYli9mOeTXph3hg/rBVqKqf90boAZoN8b2tWBKlMykpqRadbepjrgx
 bmBSwwAF99NxIHEjU3U5DMdUloCSiMZIfMfDxQrPNDrfWAW4xJs5Ym0VeOjEotTt
 oFEs1fr6c3Mn7KEuPPfOtnDxvs51IP/B8+gDgMt/edf+wHiCU1Zm31u2gxt2dsKh
 g73Y92i5SHjIf36H5szBTeioyMy1E1VA9HF14xWz2eKdQ+wxQ9VNWoctcJ85k3F4
 6AZDYRIrWA==
 =EaE9
 -----END PGP SIGNATURE-----

Merge tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux

Pull ITER_UBUF updates from Jens Axboe:
 "This turns singe vector imports into ITER_UBUF, rather than
  ITER_IOVEC.

  The former is more trivial to iterate and advance, and hence a bit
  more efficient. From some very unscientific testing, ~60% of all iovec
  imports are single vector"

* tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux:
  iov_iter: Mark copy_compat_iovec_from_user() noinline
  iov_iter: import single vector iovecs as ITER_UBUF
  iov_iter: convert import_single_range() to ITER_UBUF
  iov_iter: overlay struct iovec and ubuf/len
  iov_iter: set nr_segs = 1 for ITER_UBUF
  iov_iter: remove iov_iter_iovec()
  iov_iter: add iter_iov_addr() and iter_iov_len() helpers
  ALSA: pcm: check for user backed iterator, not specific iterator type
  IB/qib: check for user backed iterator, not specific iterator type
  IB/hfi1: check for user backed iterator, not specific iterator type
  iov_iter: add iter_iovec() helper
  block: ensure bio_alloc_map_data() deals with ITER_UBUF correctly
2023-04-24 10:29:28 -07:00
Namjae Jeon 74d7970feb ksmbd: fix racy issue from using ->d_parent and ->d_name
Al pointed out that ksmbd has racy issue from using ->d_parent and ->d_name
in ksmbd_vfs_unlink and smb2_vfs_rename(). and use new lock_rename_child()
to lock stable parent while underlying rename racy.
Introduce vfs_path_parent_lookup helper to avoid out of share access and
export vfs functions like the following ones to use
vfs_path_parent_lookup().
 - rename __lookup_hash() to lookup_one_qstr_excl().
 - export lookup_one_qstr_excl().
 - export getname_kernel() and putname().

vfs_path_parent_lookup() is used for parent lookup of destination file
using absolute pathname given from FILE_RENAME_INFORMATION request.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-24 00:09:20 -05:00
David Disseldorp af36c51e0e ksmbd: remove unused compression negotiate ctx packing
build_compression_ctxt() is currently unreachable due to
conn.compress_algorithm remaining zero (SMB3_COMPRESS_NONE).

It appears to have been broken in a couple of subtle ways over the
years:
- prior to d6c9ad23b4 ("ksmbd: use the common definitions for
  NEGOTIATE_PROTOCOL") smb2_compression_ctx.DataLength was set to 8,
  which didn't account for the single CompressionAlgorithms flexible
  array member.
- post d6c9ad23b4 smb2_compression_capabilities_context
  CompressionAlgorithms is a three member array, while
  CompressionAlgorithmCount is set to indicate only one member.
  assemble_neg_contexts() ctxt_size is also incorrectly incremented by
  sizeof(struct smb2_compression_capabilities_context) + 2, which
  assumes one flexible array member.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-24 00:09:10 -05:00
David Disseldorp a12a07a85a ksmbd: avoid duplicate negotiate ctx offset increments
Both pneg_ctxt and ctxt_size change in unison, with each adding the
length of the previously added context, rounded up to an eight byte
boundary.
Drop pneg_ctxt increments and instead use the ctxt_size offset when
passing output pointers to per-context helper functions. This slightly
simplifies offset tracking and shaves off a few text bytes.
Before (x86-64 gcc 7.5):
   text    data     bss     dec     hex filename
 213234    8677     672  222583   36577 ksmbd.ko

After:
   text    data     bss     dec     hex filename
 213218    8677     672  222567   36567 ksmbd.ko

Signed-off-by: David Disseldorp <ddiss@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-24 00:09:07 -05:00
David Disseldorp 34e8ccf9ce ksmbd: set NegotiateContextCount once instead of every inc
There are no early returns, so marshalling the incremented
NegotiateContextCount with every context is unnecessary.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-24 00:09:05 -05:00
Steve French 42bc6793e4 lock_rename_child() (for ksmbd folks)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYEIwAKCRBZ7Krx/gZQ
 60Y+AP9UHVj3Kc4F5FnX+z/lfRwMMP6+hO3Gb1nGr9jddCpthgEAnsPGC3iitMSa
 vsDw5G+gAsb4qyvJv1U0JJ2VkveZPgA=
 =4UtI
 -----END PGP SIGNATURE-----

Merge tag 'pull-lock_rename_child' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into ksmbd-for-next

lock_rename_child() (for ksmbd folks)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-24 00:07:14 -05:00
Volker Lendecke 919e57c314 cifs: Avoid a cast in add_lease_context()
We have the correctly-typed struct smb2_create_req * available in the
caller.

Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by Ralph Boehme <slow@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-23 21:16:57 -05:00
Volker Lendecke d2ec43b515 cifs: Simplify SMB2_open_init()
Reduce code duplication by calculating req->CreateContextsLength in
one place.

This is the last reference to "req" in the add_*_context functions,
remove that parameter.

Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-23 20:31:44 -05:00
Volker Lendecke 2a8d1387ed cifs: Simplify SMB2_open_init()
Reduce code duplication by stitching together create contexts in one
place.

Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-23 20:31:44 -05:00
Volker Lendecke 5ec629e037 cifs: Simplify SMB2_open_init()
We can point to the create contexts in just one place, we don't have
to do this in every add_*_context routine.

Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-23 20:31:44 -05:00
Zhihao Cheng b5fda08ef2 ubifs: Fix memleak when insert_old_idx() failed
Following process will cause a memleak for copied up znode:

dirty_cow_znode
  zn = copy_znode(c, znode);
  err = insert_old_idx(c, zbr->lnum, zbr->offs);
  if (unlikely(err))
     return ERR_PTR(err);   // No one refers to zn.

Fetch a reproducer in [Link].

Function copy_znode() is split into 2 parts: resource allocation
and znode replacement, insert_old_idx() is split in similar way,
so resource cleanup could be done in error handling path without
corrupting metadata(mem & disk).
It's okay that old index inserting is put behind of add_idx_dirt(),
old index is used in layout_leb_in_gaps(), so the two processes do
not depend on each other.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216705
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-04-23 23:36:38 +02:00
Zhihao Cheng 7d01cb27f6 Revert "ubifs: dirty_cow_znode: Fix memleak in error handling path"
This reverts commit 122deabfe1 (ubifs: dirty_cow_znode: Fix memleak
in error handling path).
After commit 122deabfe1 applied, if insert_old_idx() failed, old
index neither exists in TNC nor in old-index tree. Which means that
old index node could be overwritten in layout_leb_in_gaps(), then
ubifs image will be corrupted in power-cut.

Fixes: 122deabfe1 (ubifs: dirty_cow_znode: Fix memleak ... path)
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-04-23 23:36:19 +02:00
Linus Torvalds 84ebdb8e0d three small smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRDSFEACgkQiiy9cAdy
 T1FUiQwAr0/Di4S4U5XDZNDdW9/4qFqI95ciShsIFb6cXtMQeauTImOp5zDEaW2x
 mn31RZSIMDK+Ljximh4W2phH+5k47K51tVXetWCVOHIPBc1h4AUAz5XmhChtirJ9
 seqhPx2wIi+4TH3gvKkc8xxmZ6HMfZ+F7nuc72eZ1rnO91VoLRywpACItKO2Q8Vu
 XSGMgjiaP0KshyxIcBwKXxHjnaOBF6nBBENU/sHkNZNRngzi/+Xcr371eijbs8Ay
 hk7RPnH04uOYwkFs+9gFpSyYtkdn9iTaPScYWE9UDTeZl8xuQJnGlGXQ1CEZHfkh
 OBzb2/UCX77ggnuFyukP8e5ZsrdGEaotjYgmsb7s7BO9eqAcBRp31SKhIbvIZvTG
 IBLBtWFNSvdoKk1EUvu2iKHxcGcZ3EFjLTmx5pwcAwmBLp+UM/i+IcxvS+b2uPif
 mVIzsXPRI6yi9uJ7PPqtBlkhNrwNk4cfs1hkeXOqN7XGBaQjx89wydMGJDF9otse
 BsNDDXmF
 =KXCn
 -----END PGP SIGNATURE-----

Merge tag '6.3-rc7-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small smb3 client fixes:

   - two important fixes for unbuffered read regression with the
     iov_iter changes (e.g. read soon after mount in some multichannel
     scenarios)

   - DFS prefix path fix (also for stable)"

* tag '6.3-rc7-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Reapply lost fix from commit 30b2b2196d
  cifs: Fix unbuffered read
  cifs: avoid dup prefix path in dfs_get_automount_devname()
2023-04-22 09:18:35 -07:00
David Howells e0416e7d33 rxrpc: Fix potential race in error handling in afs_make_call()
If the rxrpc call set up by afs_make_call() receives an error whilst it is
transmitting the request, there's the possibility that it may get to the
point the rxrpc call is ended (after the error_kill_call label) just as the
call is queued for async processing.

This could manifest itself as call->rxcall being seen as NULL in
afs_deliver_to_call() when it tries to lock the call.

Fix this by splitting rxrpc_kernel_end_call() into a function to shut down
an rxrpc call and a function to release the caller's reference and calling
the latter only when we get to afs_put_call().

Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing+fedora36_64checkkafs-build-306@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-22 15:16:39 +01:00
Arnd Bergmann 09d49eb90f ocfs2: reduce ioctl stack usage
On 32-bit architectures with KASAN_STACK enabled, the total stack usage of
the ocfs2_ioctl function grows beyond the warning limit:

fs/ocfs2/ioctl.c: In function 'ocfs2_ioctl':
fs/ocfs2/ioctl.c:934:1: error: the frame size of 1448 bytes is larger than 1400 bytes [-Werror=frame-larger-than=]

Move each of the variables into a basic block, and mark
ocfs2_info_handle() as noinline_for_stack, in order to have the variable
share stack slots.

Link: https://lkml.kernel.org/r/20230417205631.1956027-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:54:34 -07:00
Chunguang Wu 522dc4e5f5 fs/proc: add Kthread flag to /proc/$pid/status
The command `ps -ef ` and `top -c` mark kernel thread by '[' and ']', but
sometimes the result is not correct.  The task->flags in /proc/$pid/stat
is good, but we need remember the value of PF_KTHREAD is 0x00200000 and
convert dec to hex.  If we have no binary program and shell script which
read /proc/$pid/stat, we can know it directly by `cat /proc/$pid/status`.

Link: https://lkml.kernel.org/r/20230416052404.2920-1-fullspring2018@gmail.com
Signed-off-by: Chunguang Wu <fullspring2018@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:54:34 -07:00
Linus Torvalds 6b008640db mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
Instead of having callers care about the mmap_min_addr logic for the
lowest valid mapping address (and some of them getting it wrong), just
move the logic into vm_unmapped_area() itself.  One less thing for various
architecture cases (and generic helpers) to worry about.

We should really try to make much more of this be common code, but baby
steps..

Without this, vm_unmapped_area() could return an address below
mmap_min_addr (because some caller forgot about that).  That then causes
the mmap machinery to think it has found a workable address, but then
later security_mmap_addr(addr) is unhappy about it and the mmap() returns
with a nonsensical error (EPERM).

The proper action is to either return ENOMEM (if the virtual address space
is exhausted), or try to find another address (ie do a bottom-up search
for free addresses after the top-down one failed).

See commit 2afc745f3e ("mm: ensure get_unmapped_area() returns higher
address than mmap_min_addr"), which fixed this for one call site (the
generic arch_get_unmapped_area_topdown() fallback) but left other cases
alone.

Link: https://lkml.kernel.org/r/20230418214009.1142926-1-Liam.Howlett@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:05 -07:00
Stefan Roesch d21077fbc2 mm: add new KSM process and sysfs knobs
This adds the general_profit KSM sysfs knob and the process profit metric
knobs to ksm_stat.

1) expose general_profit metric

   The documentation mentions a general profit metric, however this
   metric is not calculated.  In addition the formula depends on the size
   of internal structures, which makes it more difficult for an
   administrator to make the calculation.  Adding the metric for a better
   user experience.

2) document general_profit sysfs knob

3) calculate ksm process profit metric

   The ksm documentation mentions the process profit metric and how to
   calculate it.  This adds the calculation of the metric.

4) mm: expose ksm process profit metric in ksm_stat

   This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
   The documentation mentions the formula for the ksm process profit
   metric, however it does not calculate it.  In addition the formula
   depends on the size of internal structures.  So it makes sense to
   expose it.

5) document new procfs ksm knobs

Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:03 -07:00
Pankaj Raghav c6c8c3e7b4 fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer do not support large folios as there are many assumptions on the
folio size to be the host page size.  This conversion is one step towards
removing that assumption.  Also this conversion will reduce calls to
compound_head() if folio_create_buffers() calls
folio_create_empty_buffers().

Link: https://lkml.kernel.org/r/20230417123618.22094-5-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:01 -07:00
Pankaj Raghav 8e2e17560b fs/buffer: add folio_create_empty_buffers helper
Folio version of create_empty_buffers().  This is required to convert
create_page_buffers() to folio_create_buffers() later in the series.

It removes several calls to compound_head() as it works directly on folio
compared to create_empty_buffers().  Hence, create_empty_buffers() has
been modified to call folio_create_empty_buffers().

Link: https://lkml.kernel.org/r/20230417123618.22094-4-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:01 -07:00
Pankaj Raghav c71124a8af buffer: add folio_alloc_buffers() helper
Folio version of alloc_page_buffers() helper.  This is required to convert
create_page_buffers() to folio_create_buffers() later in the series.

alloc_page_buffers() has been modified to call folio_alloc_buffers() which
adds one call to compound_head() but folio_alloc_buffers() removes one
call to compound_head() compared to the existing alloc_page_buffers()
implementation.

Link: https://lkml.kernel.org/r/20230417123618.22094-3-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:01 -07:00
Pankaj Raghav 465e5e6a16 fs/buffer: add folio_set_bh helper
Patch series "convert create_page_buffers to folio_create_buffers".

One of the first kernel panic we hit when we try to increase the block
size > 4k is inside create_page_buffers()[1].  Even though buffer.c
function do not support large folios (folios > PAGE_SIZE) at the moment,
these changes are required when we want to remove that constraint.


This patch (of 4):

The folio version of set_bh_page().  This is required to convert
create_page_buffers() to folio_create_buffers() later in the series.

Link: https://lkml.kernel.org/r/20230417123618.22094-1-p.raghav@samsung.com
Link: https://lkml.kernel.org/r/20230417123618.22094-2-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:01 -07:00
Benjamin Coddington e025f0a73f NFS: Cleanup unused rpc_clnt variable
The root rpc_clnt is not used here, clean it up.

Fixes: 4dc73c6791 ("NFSv4: keep state manager thread active if swap is enabled")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-21 17:00:55 -04:00
Mårten Lindahl 3a36d20e01 ubifs: Fix memory leak in do_rename
If renaming a file in an encrypted directory, function
fscrypt_setup_filename allocates memory for a file name. This name is
never used, and before returning to the caller the memory for it is not
freed.

When running kmemleak on it we see that it is registered as a leak. The
report below is triggered by a simple program 'rename' that renames a
file in an encrypted directory:

  unreferenced object 0xffff888101502840 (size 32):
    comm "rename", pid 9404, jiffies 4302582475 (age 435.735s)
    backtrace:
      __kmem_cache_alloc_node
      __kmalloc
      fscrypt_setup_filename
      do_rename
      ubifs_rename
      vfs_rename
      do_renameat2

To fix this we can remove the call to fscrypt_setup_filename as it's not
needed.

Fixes: 278d9a2436 ("ubifs: Rename whiteout atomically")
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-04-21 22:38:51 +02:00
Mårten Lindahl 1fb815b38b ubifs: Free memory for tmpfile name
When opening a ubifs tmpfile on an encrypted directory, function
fscrypt_setup_filename allocates memory for the name that is to be
stored in the directory entry, but after the name has been copied to the
directory entry inode, the memory is not freed.

When running kmemleak on it we see that it is registered as a leak. The
report below is triggered by a simple program 'tmpfile' just opening a
tmpfile:

  unreferenced object 0xffff88810178f380 (size 32):
    comm "tmpfile", pid 509, jiffies 4294934744 (age 1524.742s)
    backtrace:
      __kmem_cache_alloc_node
      __kmalloc
      fscrypt_setup_filename
      ubifs_tmpfile
      vfs_tmpfile
      path_openat

Free this memory after it has been copied to the inode.

Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-04-21 22:37:37 +02:00
Tom Rix c5733ae6dc NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
smatch reports
fs/nfs/fscache.c:260:10: warning: symbol
  'nfs_netfs_debug_id' was not declared. Should it be static?

This variable is only used in its defining file, so it should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-21 16:36:21 -04:00
Yangtao Li c477d83c26 ubifs: Remove return in compr_exit()
It's redundant, let's remove it.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-04-21 22:31:03 +02:00
Linus Torvalds c337b23f32 for-6.3-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRCmhgACgkQxWXV+ddt
 WDsHrA/+KaCgixD0Z/8f2tMu2Kd5KQ6vQGMlydZzr0OvTYh3skAjTbAfTGAUHiXF
 6qZOpYYEilE+xhdcTegB4fV1OPJQw8+rvRrPps9ugZEShQhHlUbIuuiSCtrILKmK
 424wkllNc7NDbz5CHbbBpNNGdc6Xgyr3zy4nKZf/Sezmj+aK/nRL/JmazzUaEnxM
 NC8hBq+Nrpz0ucyStiLp4jfdp5geo4hcfpXVEBuH2ZpzhBPV4usLBWwsEj6uBcTy
 mpvMNHTFw/8H/k9w6GS+E/hrU5Rs5tWHTlEIz+xD1kK8DoPoE1arcgdLCzS0yC81
 8MyjB2qgMp3XutVlQGwyWAalY04UfzKvQ4yUYwTKT24pToc0TmQq8YV2Sy7c7SeA
 SDy+Ev1wgteeaPskhS9vMbJvnKVSzOMovt0oNR6VoPivXZ0OjVRDkC3fT2l497JL
 jZB3H7JaUGxJ/du1kUQkhL2c6YnjkWsqbl1YoOUBilNXkY/Mbz8NCZZdLJia0Q41
 P14w4aeD8HAYBNkOvSrDwfBQB5fR31GQq3QH/dGfJ4i41eJlNAposcOWQkV115Ib
 eILV3kFxJNSCpUI7eaE2biacGxJLdiWPQDv5Oo5AETyqcoiFqjCDerZWCTgH54H2
 YzzJiY/1BH8RgYbrCUyoPmyGOhoovYSVG9gLK3nXk1jqWltJgD0=
 =mGL5
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two patches fixing the problem with aync discard.

  The default settings had a low IOPS limit and processing a large batch
  to discard would take a long time. On laptops this can cause increased
  power consumption due to disk activity.

  As async discard has been on by default since 6.2 this likely affects
  a lot of users.

  Summary:

   - increase the default IOPS limit 10x which reportedly helped

   - setting the sysfs IOPS value to 0 now does not throttle anymore
     allowing the discards to be processed at full speed. Previously
     there was an arbitrary 6 hour target for processing the pending
     batch"

* tag 'for-6.3-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: reinterpret async discard iops_limit=0 as no delay
  btrfs: set default discard iops_limit to 1000
2023-04-21 10:47:21 -07:00
Alexander Aring 7a40f1f18a fs: dlm: stop unnecessarily filling zero ms_extra bytes
Commit 7175e131eb ("fs: dlm: fix invalid derefence of sb_lvbptr")
fixes an issue when the lkb->lkb_lvbptr set to an dangled pointer and an
followed memcpy() would fail. It was fixed by an additional check of
DLM_LKF_VALBLK flag. The mentioned commit forgot to add an additional check
if DLM_LKF_VALBLK is set for the additional amount of LVB data allocated
in a dlm message. This patch is changing the message allocation to check
additionally if DLM_LKF_VALBLK is set otherwise a dangled lkb->lkb_lvbptr
pointer would allocated zero LVB message data which not gets filled with
actual data.

This patch is however only a cleanup to reduce the amount of zero bytes
transmitted over network as receive_lvb() will only evaluates message LVB
data if DLM_LKF_VALBLK is set.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-04-21 11:46:47 -05:00
Ritesh Harjani (IBM) 3fd41721cd iomap: Add DIO tracepoints
Add trace_iomap_dio_rw_begin, trace_iomap_dio_rw_queued and
trace_iomap_dio_complete tracepoint.
trace_iomap_dio_rw_queued is mostly only to know that the request was
queued and -EIOCBQUEUED was returned. It is mostly trace_iomap_dio_rw_begin
& trace_iomap_dio_complete which has all the details.

<example output log>
      a.out-2073  [006]   134.225717: iomap_dio_rw_begin:   dev 7:7 ino 0xe size 0x0 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT|WRITE dio_flags DIO_FORCE_WAIT aio 1
      a.out-2073  [006]   134.226234: iomap_dio_complete:   dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT|WRITE aio 1 error 0 ret 4096
      a.out-2074  [006]   136.225975: iomap_dio_rw_begin:   dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT dio_flags  aio 1
      a.out-2074  [006]   136.226173: iomap_dio_rw_queued:  dev 7:7 ino 0xe size 0x1000 offset 0x1000 length 0x0
ksoftirqd/3-31    [003]   136.226389: iomap_dio_complete:   dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT aio 1 error 0 ret 4096
      a.out-2075  [003]   141.674969: iomap_dio_rw_begin:   dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT|WRITE dio_flags  aio 1
      a.out-2075  [003]   141.676085: iomap_dio_rw_queued:  dev 7:7 ino 0xe size 0x1000 offset 0x1000 length 0x0
kworker/2:0-27    [002]   141.676432: iomap_dio_complete:   dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT|WRITE aio 1 error 0 ret 4096
      a.out-2077  [006]   143.443746: iomap_dio_rw_begin:   dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT dio_flags  aio 1
      a.out-2077  [006]   143.443866: iomap_dio_rw_queued:  dev 7:7 ino 0xe size 0x1000 offset 0x1000 length 0x0
ksoftirqd/5-41    [005]   143.444134: iomap_dio_complete:   dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT aio 1 error 0 ret 4096
      a.out-2078  [007]   146.716833: iomap_dio_rw_begin:   dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT dio_flags  aio 0
      a.out-2078  [007]   146.717639: iomap_dio_complete:   dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT aio 0 error 0 ret 4096
      a.out-2079  [006]   148.972605: iomap_dio_rw_begin:   dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT dio_flags  aio 0
      a.out-2079  [006]   148.973099: iomap_dio_complete:   dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT aio 0 error 0 ret 4096

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: line up strings all prettylike]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-04-21 08:54:47 -07:00
Ritesh Harjani (IBM) d3bff1fc50 iomap: Remove IOMAP_DIO_NOSYNC unused dio flag
IOMAP_DIO_NOSYNC earlier was added for use in btrfs. But it seems for
aio dsync writes this is not useful anyway. For aio dsync case, we
we queue the request and return -EIOCBQUEUED. Now, since IOMAP_DIO_NOSYNC
doesn't let iomap_dio_complete() to call generic_write_sync(),
hence we may lose the sync write.

Hence kill this flag as it is not in use by any FS now.

Tested-by: Disha Goel <disgoel@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-04-21 08:54:47 -07:00
Al Viro 4a892c0fe4 fuse_dev_ioctl(): switch to fdget()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-20 22:55:35 -04:00
Al Viro 96e85e95dc build_mount_idmapped(): switch to fdget()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-20 22:55:35 -04:00
Al Viro 38e1240862 kill the last remaining user of proc_ns_fget()
lookups by descriptor are better off closer to syscall surface...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-20 22:55:35 -04:00
Al Viro 9bc37e0482 fs: introduce lock_rename_child() helper
Pass the dentry of a source file and the dentry of a destination directory
to lock parent inodes for rename. As soon as this function returns,
->d_parent of the source file dentry is stable and inodes are properly
locked for calling vfs-rename. This helper is needed for ksmbd server.
rename request of SMB protocol has to rename an opened file, no matter
which directory it's in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-20 22:37:05 -04:00
Namjae Jeon 211db0ac9e ksmbd: remove internal.h include
Since vfs_path_lookup is exported, It should not be internal.
Move vfs_path_lookup prototype in internal.h to linux/namei.h.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-20 22:36:43 -04:00
Jakub Kicinski 681c5b51dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Adjacent changes:

net/mptcp/protocol.h
  63740448a3 ("mptcp: fix accept vs worker race")
  2a6a870e44 ("mptcp: stops worker on unaccepted sockets at listener close")
  ddb1a072f8 ("mptcp: move first subflow allocation at mpc access time")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 16:29:51 -07:00
Boris Burkov ef9cddfe57 btrfs: reinterpret async discard iops_limit=0 as no delay
Currently, a limit of 0 results in a hard coded metering over 6 hours.
Since the default is a set limit, I suspect no one truly depends on this
rather arbitrary setting. Repurpose it for an arguably more useful
"unlimited" mode, where the delay is 0.

Note that if block groups are too new, or go fully empty, there is still
a delay associated with those conditions. Those delays implement
heuristics for not trimming a region we are relatively likely to fully
overwrite soon.

CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-21 00:28:23 +02:00
Boris Burkov e9f59429b8 btrfs: set default discard iops_limit to 1000
Previously, the default was a relatively conservative 10. This results
in a 100ms delay, so with ~300 discards in a commit, it takes the full
30s till the next commit to finish the discards. On a workstation, this
results in the disk never going idle, wasting power/battery, etc.

Set the default to 1000, which results in using the smallest possible
delay, currently, which is 1ms. This has shown to not pathologically
keep the disk busy by the original reporter.

Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Neal Gompa <neal@gompa.dev
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-21 00:28:20 +02:00
Johannes Berg 8c6174503c um: hostfs: define our own API boundary
Instead of exporting the set of functions provided by
glibc that are needed for hostfs_user.c, just build that
into the kernel image whenever hostfs is built, and then
export _those_ functions cleanly, to be independent of
the libc implementation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2023-04-20 23:04:40 +02:00
Wu Bo 5584785080 f2fs: allocate trace path buffer from names_cache
It would be better to use the dedicated slab to store path.

Signed-off-by: Wu Bo <bo.wu@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-20 09:38:12 -07:00
Josh Triplett 519fe1bae7 ext4: Add a uapi header for ext4 userspace APIs
Create a uapi header include/uapi/linux/ext4.h, move the ioctls and
associated data structures to the uapi header, and include it from
fs/ext4/ext4.h.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/680175260970d977d16b5cc7e7606483ec99eb63.1680402881.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-19 23:39:42 -04:00
wuchi 17809d3cf8 ext4: remove useless conditional branch code
It's ok because the code will be optimized by the compiler, just
try to simple the code.

Signed-off-by: wuchi <wuchi.zero@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230401075303.45206-1-wuchi.zero@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-19 23:39:08 -04:00
Tom Rix 8ae56b4e82 ext4: remove unneeded check of nr_to_submit
cppcheck reports
fs/ext4/page-io.c:516:51: style:
  Condition 'nr_to_submit' is always true [knownConditionTrueFalse]
 if (fscrypt_inode_uses_fs_layer_crypto(inode) && nr_to_submit) {
                                                  ^
This earlier check to bail, makes this check unncessary
	/* Nothing to submit? Just unlock the page... */
	if (!nr_to_submit)
		return 0;

Signed-off-by: Tom Rix <trix@redhat.com>
Fixes: dff4ac75ee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230316204831.2472537-1-trix@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-19 23:38:33 -04:00
Linus Torvalds cb0856346a 22 hotfixes.
19 are cc:stable and the remainder address issues which were introduced
 during this merge cycle, or aren't considered suitable for -stable
 backporting.
 
 19 are for MM and the remainder are for other subsystems.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEB7GgAKCRDdBJ7gKXxA
 jl4zAP9LxKisY8L29qrZG/SKoYbMMSM33ASOGZJRAuRRaOYL6QEAvS14pg/c22rL
 4GCZbzvENY4xPRbz/6kc/s2Jnuww4wA=
 =Kh/V
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "22 hotfixes.

  19 are cc:stable and the remainder address issues which were
  introduced during this merge cycle, or aren't considered suitable for
  -stable backporting.

  19 are for MM and the remainder are for other subsystems"

* tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
  nilfs2: initialize unused bytes in segment summary blocks
  mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
  mm/mmap: regression fix for unmapped_area{_topdown}
  maple_tree: fix mas_empty_area() search
  maple_tree: make maple state reusable after mas_empty_area_rev()
  mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
  mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
  tools/Makefile: do missed s/vm/mm/
  mm: fix memory leak on mm_init error handling
  mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
  Revert "userfaultfd: don't fail on unrecognized features"
  writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
  maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug
  tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used
  mailmap: update jtoppins' entry to reference correct email
  mm/mempolicy: fix use-after-free of VMA iterator
  mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
  mm/mprotect: fix do_mprotect_pkey() return on error
  mm/khugepaged: check again on anon uffd-wp during isolation
  ...
2023-04-19 17:55:45 -07:00
Dave Chinner 422d56536f xfs: fix duplicate includes
Header files were already included, just not in the normal order.
Remove the duplicates, preserving normal order. Also move xfs_ag.h
include to before the scrub internal includes which are normally
last in the include list.

Fixes: d5c88131db ("xfs: allow queued AG intents to drain before scrubbing")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-20 08:18:34 +10:00
David Howells 023fc150a3 cifs: Reapply lost fix from commit 30b2b2196d
Reapply the fix from:

   30b2b2196d ("cifs: do not include page data when checking signature")

that got lost in the iteratorisation of the cifs driver.

Fixes: d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reported-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@cjr.nz>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Bharath S M <bharathsm@microsoft.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-18 21:26:09 -05:00
David Howells ac13692844 cifs: Fix unbuffered read
If read() is done in an unbuffered manner, such that, say,
cifs_strict_readv() goes through cifs_user_readv() and thence
__cifs_readv(), it doesn't recognise the EOF and keeps indicating to
userspace that it returning full buffers of data.

This is due to ctx->iter being advanced in cifs_send_async_read() as the
buffer is split up amongst a number of rdata objects.  The iterator count
is then used in collect_uncached_read_data() in the non-DIO case to set the
total length read - and thus the return value of sys_read().  But since the
iterator normally gets used up completely during splitting, ctx->total_len
gets overridden to the full amount.

However, prior to that in collect_uncached_read_data(), we've gone through
the list of rdatas and added up the amount of data we actually received
(which we then throw away).

Fix this by removing the bit that overrides the amount read in the non-DIO
case and just going with the total added up in the aforementioned loop.

This was observed by mounting a cifs share with multiple channels, e.g.:

	mount //192.168.6.1/test /test/ -o user=shares,pass=...,max_channels=6

and then reading a 1MiB file on the share:

	strace cat /xfstest.test/1M  >/dev/null

Through strace, the same data can be seen being read again and again.

Fixes: d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
cc: Jérôme Glisse <jglisse@redhat.com>
cc: Long Li <longli@microsoft.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-18 21:22:08 -05:00
Davidlohr Bueso d4cb626d6f epoll: rename global epmutex
As of 4f04cbaf128 ("epoll: use refcount to reduce ep_mutex contention"),
this lock is now specific to nesting cases - inserting an epoll fd onto
another epoll fd.  Rename the lock to be less generic.

Link: https://lkml.kernel.org/r/20230411234159.20421-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:39:35 -07:00
Heiko Carstens 1be2edb25c proc/stat: remove arch_idle_time()
The last (only) architecture specific arch_idle_time() implementation was
removed with commit be76ea6144 ("s390/idle: remove arch_cpu_idle_time()
and corresponding code").

Therefore remove the now dead code in fs/proc/stat.c as well.

Link: https://lkml.kernel.org/r/20230405143452.2677172-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:39:33 -07:00
Yosry Ahmed c7b23b68e2 mm: vmscan: refactor updating current->reclaim_state
During reclaim, we keep track of pages reclaimed from other means than
LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
which we stash a pointer to in current task_struct.

However, we keep track of more than just reclaimed slab pages through
this.  We also use it for clean file pages dropped through pruned inodes,
and xfs buffer pages freed.  Rename reclaimed_slab to reclaimed, and add a
helper function that wraps updating it through current, so that future
changes to this logic are contained within include/linux/swap.h.

Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:30:10 -07:00
Aneesh Kumar K.V 0b376f1e0f mm/hugetlb_vmemmap: rename ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
Now we use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP config option to
indicate devdax and hugetlb vmemmap optimization support.  Hence rename
that to a generic ARCH_WANT_OPTIMIZE_VMEMMAP

Link: https://lkml.kernel.org/r/20230412050025.84346-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Tarun Sahu <tsahu@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:30:09 -07:00
Pankaj Raghav 09a607c9cd mpage: use folios in bio end_io handler
Use folios in the bio end_io handler.  This conversion does the
appropriate handling on the folios in the respective end_io callback and
removes the call to page_endio(), which is soon to be removed.

Link: https://lkml.kernel.org/r/20230411122920.30134-4-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:30:02 -07:00
Pankaj Raghav f0d6ca46d6 mpage: split submit_bio and bio end_io handler for reads and writes
Split the submit_bio() and bio end_io handler for reads and writes similar
to other aops.

This is a prep patch before we convert end_io handlers to use folios.

Link: https://lkml.kernel.org/r/20230411122920.30134-3-p.raghav@samsung.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:30:01 -07:00
Pankaj Raghav cd01049d9c orangefs: use folios in orangefs_readahead
Patch series "remove page_endio()", v3.

It was decided to remove the page_endio() as per the previous RFC
discussion[1] of this series and move that functionality into the caller
itself.  One of the side benefit of doing that is the callers have been
modified to directly work on folios as page_endio() already worked on
folios.

As Christoph is doing ZRAM cleanups[4] which will get rid of page_endio()
function usage, I removed the final patch that removes page_endio()[5].  I
will send it separately after rc-1 once the zram cleanups are merged.

mpage changes were tested with a simple boot testing and running a fio
workload on ext2 filesystem.  orangefs was tested by Mike Marshall (No
code changes since he tested).


This patch (of 3):

Convert orangefs_readahead() from using struct page to struct folio.  This
conversion removes the call to page_endio() which is soon to be removed,
and simplifies the final page handling.

The page error flags is not required to be set in the error case as
orangefs doesn't depend on them.

Link: https://lkml.kernel.org/r/20230411122920.30134-1-p.raghav@samsung.com
Link: https://lkml.kernel.org/r/20230411122920.30134-2-p.raghav@samsung.com
Link: https://lore.kernel.org/linux-mm/ZBHcl8Pz2ULb4RGD@infradead.org/ [1]
Link: https://lore.kernel.org/linux-mm/20230322135013.197076-1-p.raghav@samsung.com/ [2]
Link: https://lore.kernel.org/linux-mm/8adb0770-6124-e11f-2551-6582db27ed32@samsung.com/ [3]
Link: https://lore.kernel.org/linux-block/20230404150536.2142108-1-hch@lst.de/T/#t [4]
Link: https://lore.kernel.org/lkml/20230403132221.94921-6-p.raghav@samsung.com/ [5]
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mike Marshall <hubcap@omnibond.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:30:01 -07:00
Steven Price b4aca54792 smaps: fix defined but not used smaps_shmem_walk_ops
When !CONFIG_SHMEM smaps_shmem_walk_ops is defined but not used,
triggering a compiler warning.  To avoid the warning remove the #ifdef
around the usage.  This has no effect because shmem_mapping() is a stub
returning false when !CONFIG_SHMEM so the code will be compiled out,
however we now need to also provide a stub for shmem_swap_usage().

Link: https://lkml.kernel.org/r/20230405103819.151246-1-steven.price@arm.com
Fixes: 7b86ac3371 ("pagewalk: separate function pointers from iterator data")
Signed-off-by: Steven Price <steven.price@arm.com>
Reported-by: kernel test robot <lkp@intel.com>
  Link: https://lore.kernel.org/oe-kbuild-all/202304031749.UiyJpxzF-lkp@intel.com/
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:29:54 -07:00
Andrew Morton f8f238ffe5 sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-18 14:53:49 -07:00
Ryusuke Konishi ef832747a8 nilfs2: initialize unused bytes in segment summary blocks
Syzbot still reports uninit-value in nilfs_add_checksums_on_logs() for
KMSAN enabled kernels after applying commit 7397031622 ("nilfs2:
initialize "struct nilfs_binfo_dat"->bi_pad field").

This is because the unused bytes at the end of each block in segment
summaries are not initialized.  So this fixes the issue by padding the
unused bytes with null bytes.

Link: https://lkml.kernel.org/r/20230417173513.12598-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+048585f3f4227bb2b49b@syzkaller.appspotmail.com
  Link: https://syzkaller.appspot.com/bug?extid=048585f3f4227bb2b49b
Cc: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 14:22:14 -07:00
Yangtao Li c1660d88a0 f2fs: add has_enough_free_secs()
Replace !has_not_enough_free_secs w/ has_enough_free_secs.
BTW avoid nested 'if' statements in f2fs_balance_fs().

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-18 09:05:54 -07:00
Jaegeuk Kim bd90c5cd33 f2fs: relax sanity check if checkpoint is corrupted
1. extent_cache
 - let's drop the largest extent_cache
2. invalidate_block
 - don't show the warnings

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-18 09:05:54 -07:00
Jaegeuk Kim 2d3f197bad f2fs: refactor f2fs_gc to call checkpoint in urgent condition
The major change is to call checkpoint, if there's not enough space while having
some prefree segments in FG_GC case.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-18 09:05:43 -07:00
Markus Elfring 55534c094f gfs2: Move variable assignment behind a null pointer check in inode_go_dump
Since commit 27a2660f1e ("gfs2: Dump nrpages for inodes and their
glocks"), inode_go_dump() computes the address of inode within ip before
checking if ip is NULL.  This isn't a bug by itself, but it can give
rise to bugs later.  Avoid that by checking if ip is NULL first.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-04-18 14:58:32 +02:00
Bob Peterson 130cf5269c gfs2: Use gfs2_holder_initialized for jindex
Before this patch function init_journal() used a local variable jindex to
keep track of whether it needed to dequeue the jindex holder when errors
were found. It also uselessly set the variable just before returning from
the function. This patch simplifies the code by eliminatinng the local
variable in favor of using function gfs2_holder_initialized.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-04-18 14:46:16 +02:00
Bob Peterson 7d1b37787f gfs2: Eliminate gfs2_trim_blocks
Function gfs2_trim_blocks is not referenced. Eliminate it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-04-18 14:40:12 +02:00
Chao Yu 635a52da86 f2fs: remove folio_detach_private() in .invalidate_folio and .release_folio
We have maintain PagePrivate and page_private and page reference
w/ {set,clear}_page_private_*, it doesn't need to call
folio_detach_private() in the end of .invalidate_folio and
.release_folio, remove it and use f2fs_bug_on instead.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-17 14:49:40 -07:00
Yangtao Li 33560f8020 f2fs: remove bulk remove_proc_entry() and unnecessary kobject_del()
Convert to use remove_proc_subtree() and kill kobject_del() directly.
kobject_put() actually covers kobject removal automatically, which is
single stage removal.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-04-17 14:49:30 -07:00
Josh Poimboeuf f372463124 btrfs: mark btrfs_assertfail() __noreturn
Fixes a bunch of warnings including:

  vmlinux.o: warning: objtool: select_reloc_root+0x314: unreachable instruction
  vmlinux.o: warning: objtool: finish_inode_if_needed+0x15b1: unreachable instruction
  vmlinux.o: warning: objtool: get_bio_sector_nr+0x259: unreachable instruction
  vmlinux.o: warning: objtool: raid_wait_read_end_io+0xc26: unreachable instruction
  vmlinux.o: warning: objtool: raid56_parity_alloc_scrub_rbio+0x37b: unreachable instruction
  ...

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202302210709.IlXfgMpX-lkp@intel.com/
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Genjian Zhang 8ba7d5f5ba btrfs: fix uninitialized variable warnings
There are some warnings on older compilers (gcc 10, 7) or non-x86_64
architectures (aarch64).  As btrfs wants to enable -Wmaybe-uninitialized
by default, fix the warnings even though it's not necessary on recent
compilers (gcc 12+).

../fs/btrfs/volumes.c: In function ‘btrfs_init_new_device’:
../fs/btrfs/volumes.c:2703:3: error: ‘seed_devices’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 2703 |   btrfs_setup_sprout(fs_info, seed_devices);
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

../fs/btrfs/send.c: In function ‘get_cur_inode_state’:
../include/linux/compiler.h:70:32: error: ‘right_gen’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
   70 |   (__if_trace.miss_hit[1]++,1) :  \
      |                                ^
../fs/btrfs/send.c:1878:6: note: ‘right_gen’ was declared here
 1878 |  u64 right_gen;
      |      ^~~~~~~~~

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Filipe Manana 5d3e4f1d51 btrfs: use log root when iterating over index keys when logging directory
When logging dir dentries of a directory, we iterate over the subvolume
tree to find dir index keys on leaves modified in the current transaction.
This however is heavy on locking, since btrfs_search_forward() may often
keep locks on extent buffers for quite a while when walking the tree to
find a suitable leaf modified in the current transaction and with a key
not smaller than then the provided minimum key. That means it will block
other tasks trying to access the subvolume tree, which may be common fs
operations like creating, renaming, linking, unlinking, reflinking files,
etc.

A better solution is to iterate the log tree, since it's much smaller than
a subvolume tree and just use plain btrfs_search_slot() (or the wrapper
btrfs_for_each_slot()) and only contains dir index keys added in the
current transaction.

The following bonnie++ test on a non-debug kernel (with Debian's default
kernel config) on a 20G null block device, was used to measure the impact:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/nullb0
   MNT=/mnt/nullb0

   NR_DIRECTORIES=20
   NR_FILES=20480  # must be a multiple of 1024
   DATASET_SIZE=$(( (8 * 1024 * 1024 * 1024) / 1048576 )) # 8 GiB as megabytes
   DIRECTORY_SIZE=$(( DATASET_SIZE / NR_FILES ))
   NR_FILES=$(( NR_FILES / 1024 ))

   umount $DEV &> /dev/null
   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   bonnie++ -u root -d $MNT \
       -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
       -r 0 -s $DATASET_SIZE -b

   umount $MNT

Before patchset:

   Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
   Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   debian0          8G  376k  99  1.1g  98  939m  92 1527k  99  3.2g  99  9060 256
   Latency             24920us     207us     680ms    5594us     171us    2891us
   Version 2.00a       ------Sequential Create------ --------Random Create--------
   debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                 files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 20/20 20480  96 +++++ +++ 20480  95 20480  99 +++++ +++ 20480  97
   Latency              8708us     137us    5128us    6743us      60us   19712us

After patchset:

   Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
   Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
   debian0          8G  384k  99  1.2g  99  971m  91 1533k  99  3.3g  99  9180 309
   Latency             24930us     125us     661ms    5587us      46us    2020us
   Version 2.00a       ------Sequential Create------ --------Random Create--------
   debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                 files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 20/20 20480  90 +++++ +++ 20480  99 20480  99 +++++ +++ 20480  97
   Latency              7030us      61us    1246us    4942us      56us   16855us

The patchset consists of this patch plus a previous one that has the
following subject:

   "btrfs: avoid iterating over all indexes when logging directory"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Filipe Manana fa4b8cb173 btrfs: avoid iterating over all indexes when logging directory
When logging a directory, after copying all directory index items from the
subvolume tree to the log tree, we iterate over the subvolume tree to find
all dir index items that are located in leaves COWed (or created) in the
current transaction. If we keep logging a directory several times during
the same transaction, we end up iterating over the same dir index items
everytime we log the directory, wasting time and adding extra lock
contention on the subvolume tree.

So just keep track of the last logged dir index offset in order to start
the search for that index (+1) the next time the directory is logged, as
dir index values (key offsets) come from a monotonically increasing
counter.

The following test measures the difference before and after this change:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  umount $DEV &> /dev/null
  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  # Time values in milliseconds.
  declare -a fsync_times
  # Total number of files added to the test directory.
  num_files=1000000
  # Fsync directory after every N files are added.
  fsync_period=100

  mkdir $MNT/testdir

  fsync_total_time=0
  for ((i = 1; i <= $num_files; i++)); do
        echo -n > $MNT/testdir/file_$i

        if [ $((i % fsync_period)) -eq 0 ]; then
                start=$(date +%s%N)
                xfs_io -c "fsync" $MNT/testdir
                end=$(date +%s%N)
                fsync_total_time=$((fsync_total_time + (end - start)))
                fsync_times[i]=$(( (end - start) / 1000000 ))
                echo -n -e "Progress $i / $num_files\r"
        fi
  done

  echo -e "\nHistogram of directory fsync duration in ms:\n"

  printf '%s\n' "${fsync_times[@]}" | \
     perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);'

  fsync_total_time=$((fsync_total_time / 1000000))
  echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n"
  echo

  umount $MNT

The test was run on a non-debug kernel (Debian's default kernel config)
against a 15G null block device.

Result before this change:

   Histogram of directory fsync duration in ms:

   Count: 10000
   Range:  3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751
   Percentiles:  90th: 71.000; 95th: 77.000; 99th: 81.000
      3.000 -    5.278:  1423 #################################
      5.278 -    8.854:  1173 ###########################
      8.854 -   14.467:   591 ##############
     14.467 -   23.277:  1025 #######################
     23.277 -   37.105:  1422 #################################
     37.105 -   58.809:  2036 ###############################################
     58.809 -   92.876:  2316 #####################################################
     92.876 -  146.346:     6 |
    146.346 -  230.271:     6 |
    230.271 -  362.000:     2 |

   Total time spent in fsync: 350527 ms

Result after this change:

   Histogram of directory fsync duration in ms:

   Count: 10000
   Range:  3.000 - 1088.000; Mean:  8.704; Median:  8.000; Stddev: 12.576
   Percentiles:  90th: 12.000; 95th: 14.000; 99th: 17.000
      3.000 -    6.007:  3222 #################################
      6.007 -   11.276:  5197 #####################################################
     11.276 -   20.506:  1551 ################
     20.506 -   36.674:    24 |
     36.674 -  201.552:     1 |
    201.552 -  353.841:     4 |
    353.841 - 1088.000:     1 |

   Total time spent in fsync: 92114 ms

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Qu Wenruo 8eb3dd17ea btrfs: dev-replace: error out if we have unrepaired metadata error during
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:

 BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
 BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
 BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
 BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
 BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
 BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
 BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished

This can lead to unexpected problems for the resulting filesystem.

[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.

And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).

This behavior is there from the very beginning.

[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.

Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.

The new dmesg would look like this:

 BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
 BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
 BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
 BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
 BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
 BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
 BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Filipe Manana 524f14bb11 btrfs: remove pointless loop at btrfs_get_next_valid_item()
It's pointless to have a while loop at btrfs_get_next_valid_item(), as if
the slot on the current leaf is beyond the last item, we call
btrfs_next_leaf(), which leaves us at a valid slot of the next leaf (or
a valid slot in the current leaf if after releasing the path an item gets
pushed from the next leaf to the current leaf).

So just call btrfs_next_leaf() if the current slot on the current leaf is
beyond the last item.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Qu Wenruo 604e6681e1 btrfs: scrub: reject unsupported scrub flags
Since the introduction of scrub interface, the only flag that we support
is BTRFS_SCRUB_READONLY.  Thus there is no sanity checks, if there are
some undefined flags passed in, we just ignore them.

This is problematic if we want to introduce new scrub flags, as we have
no way to determine if such flags are supported.

Address the problem by introducing a check for the flags, and if
unsupported flags are set, return -EOPNOTSUPP to inform the user space.

This check should be backported for all supported kernels before any new
scrub flags are introduced.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Boris Burkov f263a7c3a5 btrfs: reinterpret async discard iops_limit=0 as no delay
Currently, a limit of 0 results in a hard coded metering over 6 hours.
Since the default is a set limit, I suspect no one truly depends on this
rather arbitrary setting. Repurpose it for an arguably more useful
"unlimited" mode, where the delay is 0.

Note that if block groups are too new, or go fully empty, there is still
a delay associated with those conditions. Those delays implement
heuristics for not trimming a region we are relatively likely to fully
overwrite soon.

CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Boris Burkov cfe3445a58 btrfs: set default discard iops_limit to 1000
Previously, the default was a relatively conservative 10. This results
in a 100ms delay, so with ~300 discards in a commit, it takes the full
30s till the next commit to finish the discards. On a workstation, this
results in the disk never going idle, wasting power/battery, etc.

Set the default to 1000, which results in using the smallest possible
delay, currently, which is 1ms. This has shown to not pathologically
keep the disk busy by the original reporter.

Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Neal Gompa <neal@gompa.dev
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:18 +02:00
Qu Wenruo aca43fe839 btrfs: remove unused raid56 functions which were dedicated for scrub
Since the scrub rework, the following RAID56 functions are no longer
called:

- raid56_add_scrub_pages()
- raid56_alloc_missing_rbio()
- raid56_submit_missing_rbio()

Those functions are all utilized by scrub to handle missing device cases
for RAID56.

However the new scrub code handle them in a completely different way:

- If it's data stripe, go recovery path through btrfs_submit_bio()
- If it's P/Q stripe, it would be handled through
  raid56_parity_submit_scrub_rbio()
  And that function would handle dev-replace and repair properly.

Thus we can safely remove those functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:18 +02:00
Qu Wenruo 13a62fd997 btrfs: scrub: remove scrub_bio structure
Since scrub path has been fully moved to scrub_stripe based facilities,
no more scrub_bio would be submitted.
Thus we can remove it completely, this involves:

- SCRUB_SECTORS_PER_BIO macro
- SCRUB_BIOS_PER_SCTX macro
- SCRUB_MAX_PAGES macro
- BTRFS_MAX_MIRRORS macro
- scrub_bio structure
- scrub_ctx::bios member
- scrub_ctx::curr member
- scrub_ctx::bios_in_flight member
- scrub_ctx::workers_pending member
- scrub_ctx::list_lock member
- scrub_ctx::list_wait member

- function scrub_bio_end_io_worker()
- function scrub_pending_bio_inc()
- function scrub_pending_bio_dec()
- function scrub_throttle()
- function scrub_submit()

- function scrub_find_csum()
- function drop_csum_range()

- Some unnecessary flush and scrub pauses

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo 001e3fc263 btrfs: scrub: remove scrub_block and scrub_sector structures
Those two structures are used to represent a bunch of sectors for scrub,
but now they are fully replaced by scrub_stripe in one go, so we can
remove them. This involves:

- structure scrub_block
- structure scrub_sector

- structure scrub_page_private
- function attach_scrub_page_private()
- function detach_scrub_page_private()
  Now we no longer need to use page::private to handle subpage.

- function alloc_scrub_block()
- function alloc_scrub_sector()
- function scrub_sector_get_page()
- function scrub_sector_get_page_offset()
- function scrub_sector_get_kaddr()
- function bio_add_scrub_sector()

- function scrub_checksum_data()
- function scrub_checksum_tree_block()
- function scrub_checksum_super()
- function scrub_check_fsid()
- function scrub_block_get()
- function scrub_block_put()
- function scrub_sector_get()
- function scrub_sector_put()
- function scrub_bio_end_io()
- function scrub_block_complete()
- function scrub_add_sector_to_rd_bio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo e9255d6c40 btrfs: scrub: remove the old scrub recheck code
The old scrub code has different entrance to verify the content, and
since we have removed the writeback path, now we can start removing the
re-check part, including:

- scrub_recover structure
- scrub_sector::recover member
- function scrub_setup_recheck_block()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_group_good_copy()
- function scrub_repair_sector_from_good_copy()
- function scrub_is_page_on_raid56()

- function full_stripe_lock()
- function search_full_stripe_lock()
- function get_full_stripe_logical()
- function insert_full_stripe_lock()
- function lock_full_stripe()
- function unlock_full_stripe()
- btrfs_block_group::full_stripe_locks_root member
- btrfs_full_stripe_locks_tree structure
  This infrastructure is to ensure RAID56 scrub is properly handling
  recovery and P/Q scrub correctly.

  This is no longer needed, before P/Q scrub we will wait for all
  the involved data stripes to be scrubbed first, and RAID56 code has
  internal lock to ensure no race in the same full stripe.

- function scrub_print_warning()
- function scrub_get_recover()
- function scrub_put_recover()
- function scrub_handle_errored_block()
- function scrub_setup_recheck_block()
- function scrub_bio_wait_endio()
- function scrub_submit_raid56_bio_wait()
- function scrub_recheck_block_on_raid56()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_from_good_copy()
- function scrub_repair_sector_from_good_copy()

And two more functions exported temporarily for later cleanup:

- alloc_scrub_sector()
- alloc_scrub_block()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo 16f9399349 btrfs: scrub: remove the old writeback infrastructure
Since the whole scrub path has been switched to scrub_stripe based
solution, the old writeback path can be removed completely, which
involves:

- scrub_ctx::wr_curr_bio member
- scrub_ctx::flush_all_writes member
- function scrub_write_block_to_dev_replace()
- function scrub_write_sector_to_dev_replace()
- function scrub_add_sector_to_wr_bio()
- function scrub_wr_submit()
- function scrub_wr_bio_end_io()
- function scrub_wr_bio_end_io_worker()

And one more function needs to be exported temporarily:

- scrub_sector_get()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo 5dc96f8d5d btrfs: scrub: remove scrub_parity structure
The structure scrub_parity is used to indicate that some extents are
scrubbed for the purpose of RAID56 P/Q scrubbing.

Since the whole RAID56 P/Q scrubbing path has been replaced with new
scrub_stripe infrastructure, and we no longer need to use scrub_parity
to modify the behavior of data stripes, we can remove it completely.

This removal involves:

- scrub_parity_workers
  Now only one worker would be utilized, scrub_workers, to do the read
  and repair.
  All writeback would happen at the main scrub thread.

- scrub_block::sparity member
- scrub_parity structure
- function scrub_parity_get()
- function scrub_parity_put()
- function scrub_free_parity()

- function __scrub_mark_bitmap()
- function scrub_parity_mark_sectors_error()
- function scrub_parity_mark_sectors_data()
  These helpers are no longer needed, scrub_stripe has its bitmaps and
  we can use bitmap helpers to get the error/data status.

- scrub_parity_bio_endio()
- scrub_parity_check_and_repair()
- function scrub_sectors_for_parity()
- function scrub_extent_for_parity()
- function scrub_raid56_data_stripe_for_parity()
- function scrub_raid56_parity()
  The new code would reuse the scrub read-repair and writeback path.
  Just skip the dev-replace phase.
  And scrub_stripe infrastructure allows us to submit and wait for those
  data stripes before scrubbing P/Q, without extra infrastructure.

The following two functions are temporarily exported for later cleanup:

- scrub_find_csum()
- scrub_add_sector_to_rd_bio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo 1009254bf2 btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub
Implement the only missing part for scrub: RAID56 P/Q stripe scrub.

The workflow is pretty straightforward for the new function,
scrub_raid56_parity_stripe():

- Go through the regular scrub path for each data stripe

- Wait for the verification and repair to finish

- Writeback the repaired sectors to data stripes

- Make sure all stripes are properly repaired
  If we have sectors unrepaired, we cannot continue, or we could further
  corrupt the P/Q stripe.

- Submit the rbio for P/Q stripe
  The dev-replace would be handled inside
  raid56_parity_submit_scrub_rbio() path.

- Wait for the above bio to finish

Although the old code is no longer used, we still keep the declaration,
as the cleanup can be several times larger than this patch itself.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo e02ee89baa btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure
Switch scrub_simple_mirror() to the new scrub_stripe infrastructure.

Since scrub_simple_mirror() is the core part of scrub (only RAID56
P/Q stripes don't utilize it), we can get rid of a big chunk of code,
mostly scrub_extent(), scrub_sectors() and directly called functions.

There is a functionality change:

- Scrub speed throttle now only affects read on the scrubbing device
  Writes (for repair and replace), and reads from other mirrors won't
  be limited by the set limits.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo 54765392a1 btrfs: scrub: introduce helper to queue a stripe for scrub
The new helper, queue_scrub_stripe(), would try to queue a stripe for
scrub.  If all stripes are already in use, we will submit all the
existing ones and wait for them to finish.

Currently we would queue up to 8 stripes, to enlarge the blocksize to
512KiB to improve the performance. Sectors repaired on zoned need to be
relocated instead of in-place fix.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo 0096580713 btrfs: scrub: introduce error reporting functionality for scrub_stripe
The new helper, scrub_stripe_report_errors(), will report the result of
the scrub to system log.

The main reporting is done by introducing a new helper,
scrub_print_common_warning(), which is mostly the same content from
scrub_print_wanring(), but without the need for a scrub_block.

Since we're reporting the errors, it's the perfect time to update the
scrub stats too.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:24 +02:00
Qu Wenruo 058e09e6fe btrfs: scrub: introduce a writeback helper for scrub_stripe
Add a new helper, scrub_write_sectors(), to submit write bios for
specified sectors to the target disk.

There are several differences compared to read path:

- Utilize btrfs_submit_scrub_write()
  Now we still rely on the @mirror_num based writeback, but the
  requirement is also a little different than regular writeback or read,
  thus we have to call btrfs_submit_scrub_write().

- We cannot write the full stripe back
  We can only write the sectors we have.  There will be two call sites
  later, one for repaired sectors, one for all utilized sectors of
  dev-replace.

  Thus the callers should specify their own write_bitmap.

This function only submit the bios, will not wait for them unless for
zoned case.

Caller must explicitly wait for the IO to finish.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo 9ecb5ef543 btrfs: scrub: introduce the main read repair worker for scrub_stripe
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:

- Wait for the previous submitted read IO to finish

- Verify the contents of the stripe

- Go through the remaining mirrors, using as large blocksize as possible
  At this stage, we just read out all the failed sectors from each
  mirror and re-verify.
  If no more failed sector, we can exit.

- Go through all mirrors again, sector-by-sector
  This time, we read sector by sector, this is to address cases where
  one bad sector mismatches the drive's internal checksum, and cause the
  whole read range to fail.

  We put this recovery method as the last resort, as sector-by-sector
  reading is slow, and reading from other mirrors may have already fixed
  the errors.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo 97cf8f3754 btrfs: scrub: introduce a helper to verify one scrub_stripe
The new helper, scrub_verify_stripe(), shares the same main workflow of
the old scrub code.

The major differences are:

- How pages/page_offset is grabbed
  Everything can be grabbed from scrub_stripe easily.

- When error report happens
  Currently the helper only verifies the sectors, not really doing any
  error reporting.
  The error reporting would be done after we have done the repair.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo a3ddbaebc7 btrfs: scrub: introduce a helper to verify one metadata block
The new helper, scrub_verify_one_metadata(), is almost the same as
scrub_checksum_tree_block().

The difference is in how we grab the pages from other structures.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo b979547513 btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe
The new helper will search the extent tree to find the first extent of a
logical range, then fill the sectors array by two loops:

- Loop 1 to fill common bits and metadata generation

- Loop 2 to fill csum data (only for data bgs)
  This loop will use the new btrfs_lookup_csums_bitmap() to fill
  the full csum buffer, and set scrub_sector_verification::csum.

With all the needed info filled by this function, later we only need to
submit and verify the stripe.

Here we temporarily export the helper to avoid warning on unused static
function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo 2af2aaf982 btrfs: scrub: introduce structure for new BTRFS_STRIPE_LEN based interface
This patch introduces the following structures:

- scrub_sector_verification
  Contains all the needed info to verify one sector (data or metadata).

- scrub_stripe
  Contains all needed members (mostly bitmap based) to scrub one stripe
  (with a length of BTRFS_STRIPE_LEN).

The basic idea is, we keep the existing per-device scrub behavior, but
merge all the scrub_bio/scrub_bio into one generic structure, and read
the full BTRFS_STRIPE_LEN stripe on the first try.

This means we will read some sectors which are not scrub target, but
that's fine. At dev-replace time we only writeback the utilized and good
sectors, and for read-repair we only writeback the repaired sectors.

With every read submitted in BTRFS_STRIPE_LEN, the need for complex bio
form shaping would be gone.
Although to get the same performance of the old scrub behavior, we would
need to submit the initial read for two stripes at once.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo 4886ff7b50 btrfs: introduce a new helper to submit write bio for repair
Both scrub and read-repair are utilizing a special repair writes that:

- Only writes back to a single device
  Even for read-repair on RAID56, we only update the corrupted data
  stripe itself, not triggering the full RMW path.

- Requires a valid @mirror_num
  For RAID56 case, only @mirror_num == 1 is valid.
  For non-RAID56 cases, we need @mirror_num to locate our stripe.

- No data csum generation needed

These two call sites still have some differences though:

- Read-repair goes plain bio
  It doesn't need a full btrfs_bio, and goes submit_bio_wait().

- New scrub repair would go btrfs_bio
  To simplify both read and write path.

So here this patch would:

- Introduce a common helper, btrfs_map_repair_block()
  Due to the single device nature, we can use an on-stack
  btrfs_io_stripe to pass device and its physical bytenr.

- Introduce a new interface, btrfs_submit_repair_bio(), for later scrub
  code
  This is for the incoming scrub code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo 4317ff0056 btrfs: introduce btrfs_bio::fs_info member
Currently we're doing a lot of work for btrfs_bio:

- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for data read bios

However for the incoming scrub patches, we don't want this extra
functionality at all, just plain logical + mirror -> physical mapping
ability.

Thus here we do the following changes:

- Introduce btrfs_bio::fs_info
  This is for the new scrub specific btrfs_bio, which would not populate
  btrfs_bio::inode.
  Thus we need such new member to grab a fs_info

  This new member will always be populated.

- Replace @inode argument with @fs_info for btrfs_bio_init() and its
  caller
  Since @inode is no longer a mandatory member, replace it with
  @fs_info, and let involved users populate @inode.

- Skip checksum verification and generation if @bbio->inode is NULL

- Add extra ASSERT()s
  To make sure:

  * bbio->inode is properly set for involved read repair path
  * if @file_offset is set, bbio->inode is also populated

- Grab @fs_info from @bbio directly
  We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
  NULL. This involves:

  * btrfs_simple_end_io()
  * should_async_write()
  * btrfs_wq_submit_bio()
  * btrfs_use_zone_append()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Qu Wenruo 2a2dc22f7e btrfs: scrub: use dedicated super block verification function to scrub one super block
There is really no need to go through the super complex scrub_sectors()
to just handle super blocks.  Introduce a dedicated function to handle
super block scrubbing.

This new function will introduce a behavior change, instead of using the
complex but concurrent scrub_bio system, here we just go submit-and-wait.

There is really not much sense to care the performance of super block
scrubbing. It only has 3 super blocks at most, and they are all
scattered around the devices already.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Anand Jain f0bb5474cf btrfs: remove redundant release of btrfs_device::alloc_state
Commit 321f69f86a ("btrfs: reset device back to allocation state when
removing") included adding extent_io_tree_release(&device->alloc_state)
to btrfs_close_one_device(), which had already been called in
btrfs_free_device().

The alloc_state tree (IO_TREE_DEVICE_ALLOC_STATE), is created in
btrfs_alloc_device() and released in btrfs_close_one_device(). Therefore,
the additional call to extent_io_tree_release(&device->alloc_state) in
btrfs_free_device() is unnecessary and can be removed.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Anand Jain 1f16033c99 btrfs: warn for any missed cleanup at btrfs_close_one_device
During my recent search for the root cause of a reported bug, I realized
that it's a good idea to issue a warning for missed cleanup instead of
using debug-only assertions. Since most installations run with debug off,
missed cleanups and premature calls to close could go unnoticed. However,
these issues are serious enough to warrant reporting and fixing.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Christoph Hellwig 6e7a367e1a btrfs: don't print the crc32c implementation at module load time
Btrfs can use various different checksumming algorithms, and prints
the one used for a given file system at mount time.  Don't bother
printing the crc32c implementation at module load time, the information
is available in /sys/fs/btrfs/FSID/checksum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig e6b430f817 btrfs: tree-log: factor out a clean_log_buffer helper
The tree-log code has three almost identical copies for the accounting on
an extent_buffer that doesn't need to be written any more.  The only
difference is that walk_down_log_tree passed the bytenr used to find the
buffer instead of extent_buffer.start and calculates the length using the
nodesize, while the other two callers look at the extent_buffer.len
field that must always be equivalent to the nodesize.

Factor the code into a common helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig 2c275afeb6 block: make blkcg_punt_bio_submit optional
Guard all the code to punt bios to a per-cgroup submission helper by a
new CONFIG_BLK_CGROUP_PUNT_BIO symbol that is selected by btrfs.
This way non-btrfs kernel builds don't need to have this code.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig 3480373ebd btrfs, block: move REQ_CGROUP_PUNT to btrfs
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path.  To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly.  Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig 0a0596fbbe btrfs, mm: remove the punt_to_cgroup field in struct writeback_control
punt_to_cgroup is only used by extent_write_locked_range, but that
function also directly controls the bio flags for the actual submission.
Remove th punt_to_cgroup field, and just set REQ_CGROUP_PUNT directly
in extent_write_locked_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig 896d7c1a90 btrfs: also use kthread_associate_blkcg for uncompressible ranges
submit_one_async_extent needs to use submit_one_async_extent no matter
if the range it handles ends up beeing compressed or not as the deadlock
risk due to cgroup thottling is the same.  Call kthread_associate_blkcg
earlier to cover submit_uncompressed_range case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig e43a6210b7 btrfs: don't free the async_extent in submit_uncompressed_range
Let submit_one_async_extent, which is the only caller of
submit_uncompressed_range handle freeing of the async_extent in one
central place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig 05d06a5c9d btrfs: move kthread_associate_blkcg out of btrfs_submit_compressed_write
btrfs_submit_compressed_write should not have to care if it is called
from a helper thread or not.  Move the kthread_associate_blkcg handling
into submit_one_async_extent, as that is the one caller that needs it.
Also move the assignment of REQ_CGROUP_PUNT into cow_file_range_async,
as that is the routine that sets up the helper thread offload.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Filipe Manana 0f69d1f4d6 btrfs: correctly calculate delayed ref bytes when starting transaction
When starting a transaction, we are assuming the number of bytes used for
each delayed ref update matches the number of bytes used for each item
update, that is the return value of:

   btrfs_calc_insert_metadata_size(fs_info, num_items)

However that is not correct when we are using the free space tree, as we
need to multiply that value by 2, since delayed ref updates need to modify
the free space tree besides the extent tree.

So fix this by using btrfs_calc_delayed_ref_bytes() to get the correct
number of bytes used for delayed ref updates.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Filipe Manana e4773b57b8 btrfs: make btrfs_block_rsv_full() check more boolean when starting transaction
When starting a transaction we are comparing the result of a call to
btrfs_block_rsv_full() with 0, but the function returns a boolean. While
in practice it is not incorrect, as 0 is equivalent to false, it makes it
a bit odd and less readable. So update the check to not compare against 0
and instead use the logical not (!) operator.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Boris Burkov b73a6fd1b1 btrfs: split partial dio bios before submit
If an application is doing direct io to a btrfs file and experiences a
page fault reading from the write buffer, iomap will issue a partial
bio, and allow the fs to keep going. However, there was a subtle bug in
this code path in the btrfs dio iomap implementation that led to the
partial write ending up as a gap in the file's extents and to be read
back as zeros.

The sequence of events in a partial write, lightly summarized and
trimmed down for brevity is as follows:

==== WRITING TASK ====
 btrfs_direct_write
 __iomap_dio_write
 iomap_iter
 btrfs_dio_iomap_begin # create full ordered extent
 iomap_dio_bio_iter
 bio_iov_iter_get_pages # page fault; partial read
 submit_bio # partial bio
 iomap_iter
 btrfs_dio_iomap_end
 btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
				# submit to finish_ordered_fn wq
 fault_in_iov_iter_readable # btrfs_direct_write detects partial write
 __iomap_dio_write
 iomap_iter
 btrfs_dio_iomap_begin # create second partial ordered extent
 iomap_dio_bio_iter
 bio_iov_iter_get_pages # read all of remainder
 submit_bio # partial bio with all of remainder
 iomap_iter
 btrfs_dio_iomap_end # nothing exciting to do with ordered io

==== DIO ENDIO ====
== FIRST PARTIAL BIO ==
 btrfs_dio_end_io
 btrfs_mark_ordered_io_finished # bytes_left > 0
			        # don't submit to finish_ordered_fn wq
== SECOND PARTIAL BIO ==
 btrfs_dio_end_io
 btrfs_mark_ordered_io_finished # bytes_left == 0
			        # submit to finish_ordered_fn wq

==== BTRFS FINISH ORDERED WQ ====
== FIRST PARTIAL BIO ==
 btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
		         # BTRFS_ORDERED_IOERR, just drops the
		         # ordered_extent
==SECOND PARTIAL BIO==
 btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
		         # extents, csums, etc...

The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is completing an ordered_extent that results in crucial
functionality like writing out a file extent for the range.

More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
Thus, the finish io work doesn't result in file extents, csums, etc.
In the aftermath, such a file behaves as though it has a hole in it,
instead of the purportedly written data.

We considered a few options for fixing the bug:

  1. treat the partial bio as if we had truncated the file, which would
     result in properly finishing it.
  2. split the ordered extent when submitting a partial bio.
  3. cache the ordered extent across calls to __iomap_dio_rw in
     iter->private, so that we could reuse it and correctly apply
     several bios to it.

I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the benefit of also not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, that turned out to re-introduce a
deadlock which this code discarding the ordered_extent between faults
was meant to fix in the first place. (Link to an explanation of the
deadlock below.)

Therefore, go with fix 2, which requires a bit more setup work but fixes
the corruption without introducing the deadlock, which is fundamentally
caused by the ordered extent existing when we attempt to fault in a
range that overlaps with it.

Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if it
is, extract the ordered_extent that matches the bio exactly out of the
larger ordered_extent. Keep the remaining ordered_extent around in dio_data
for cancellation in iomap_end.

Thanks to Josef, Christoph, and Filipe with their help figuring out the
bug and the fix.

Fixes: 51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
Link: https://pastebin.com/3SDaH8C6
Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: refactored the ordered_extent extraction ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Boris Burkov f0f5329a00 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
NOCOW writes just overwrite an existing extent map, which thus should
not be split in btrfs_extract_ordered_extent.  The NOCOW case can't
currently happen as btrfs_extract_ordered_extent is only used on zoned
devices that do not support NOCOW writes, but this will change soon.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: split from a larger patch, wrote a commit log ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig 7edd339c8a btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
To prepare for a new caller that already has the ordered_extent
available, change btrfs_extract_ordered_extent to take an argument
for it.  Add a wrapper for the bio case that still has to do the
lookup (for now).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig 2e38a84bc6 btrfs: simplify extent map splitting and rename split_zoned_em
split_zoned_em is only ever asked to split out the beginning of an extent
map.  Change it to only take a len to split out instead of a pre and post
region.

Also rename the function to split_extent_map as there is nothing zoned
device specific about it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig f0792b792d btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
The function btrfs_clone_ordered_extent is very specific to the usage in
btrfs_split_ordered_extent.  Now that only a single call to
btrfs_clone_ordered_extent is left, just fold it into
btrfs_split_ordered_extent to make the operation more clear.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig 8f4af4b8e1 btrfs: sink parameter len to btrfs_split_ordered_extent
btrfs_split_ordered_extent is only ever asked to split out the beginning
of an ordered_extent (i.e. post == 0).  Change it to only take a len to
split out, and switch it to allocate the new extent for the beginning,
as that helps with callers that want to keep a pointer to the
ordered_extent that it is stealing from.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig 11d33ab6c1 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
btrfs_extract_ordered_extent is always used to split an ordered_extent
and extent_map into two parts, so it doesn't need to deal with a three
way split.

Simplify it by only allowing for a single split point, and always split
out the beginning of the extent, as that is what we'll later need to
be able to hold on to a reference to the original ordered_extent that
the first part is split off for submission.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00