Commit Graph

8099 Commits

Author SHA1 Message Date
Darrick J. Wong 397b2d7e0f xfs: flush dirty data and drain directios before scrubbing cow fork
When we're scrubbing the COW fork, we need to take MMAPLOCK_EXCL to
prevent page_mkwrite from modifying any inode state.  The ILOCK should
suffice to avoid confusing online fsck, but let's take the same locks
that we do everywhere else.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:14:43 +10:00
Darrick J. Wong 8e698ee72c xfs: set bnobt/cntbt numrecs correctly when formatting new AGs
Through generic/300, I discovered that mkfs.xfs creates corrupt
filesystems when given these parameters:

# mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported
Filesystems formatted with --unsupported are not supported!!
meta-data=/dev/sda               isize=512    agcount=8, agsize=16352 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=130816, imaxpct=25
         =                       sunit=32     swidth=128 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=8192, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 blks
Discarding blocks...Done.
# xfs_repair -n /dev/sda
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 16:30:50: zeroing log - 16320 of 16320 blocks done
        - scan filesystem freespace and inode maps...
agf_freeblks 25, counted 0 in ag 4
sb_fdblocks 8823, counted 8798

The root cause of this problem is the numrecs handling in
xfs_freesp_init_recs, which is used to initialize a new AG.  Prior to
calling the function, we set up the new bnobt block with numrecs == 1
and rely on _freesp_init_recs to format that new record.  If the last
record created has a blockcount of zero, then it sets numrecs = 0.

That last bit isn't correct if the AG contains the log, the start of the
log is not immediately after the initial blocks due to stripe alignment,
and the end of the log is perfectly aligned with the end of the AG.  For
this case, we actually formatted a single bnobt record to handle the
free space before the start of the (stripe aligned) log, and incremented
arec to try to format a second record.  That second record turned out to
be unnecessary, so what we really want is to leave numrecs at 1.

The numrecs handling itself is overly complicated because a different
function sets numrecs == 1.  Change the bnobt creation code to start
with numrecs set to zero and only increment it after successfully
formatting a free space extent into the btree block.

Fixes: f327a00745 ("xfs: account for log space when formatting new AGs")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:14:36 +10:00
Darrick J. Wong b82a5c42a5 xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof
xfs/170 on a filesystem with su=128k,sw=4 produces this splat:

BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 1 PID: 4022907 Comm: dd Tainted: G        W          6.3.0-xfsx #2 6ebeeffbe9577d32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-bu
RIP: 0010:xfs_perag_rele+0x10/0x70 [xfs]
RSP: 0018:ffffc90001e43858 EFLAGS: 00010217
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
RDX: ffffffffa054e717 RSI: 0000000000000005 RDI: 0000000000000000
RBP: ffff888194eea000 R08: 0000000000000000 R09: 0000000000000037
R10: ffff888100ac1cb0 R11: 0000000000000018 R12: 0000000000000000
R13: ffffc90001e43a38 R14: ffff888194eea000 R15: ffff888194eea000
FS:  00007f93d1a0e740(0000) GS:ffff88843fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000018a34f000 CR4: 00000000003506e0
Call Trace:
 <TASK>
 xfs_bmap_btalloc+0x1a7/0x5d0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_bmapi_allocate+0xee/0x470 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_bmapi_write+0x539/0x9e0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_iomap_write_direct+0x1bb/0x2b0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_direct_write_iomap_begin+0x51c/0x710 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 iomap_iter+0x132/0x2f0
 __iomap_dio_rw+0x2f8/0x840
 iomap_dio_rw+0xe/0x30
 xfs_file_dio_write_aligned+0xad/0x180 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_file_write_iter+0xfb/0x190 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 vfs_write+0x2eb/0x410
 ksys_write+0x65/0xe0
 do_syscall_64+0x2b/0x80

This crash occurs under the "out_low_space" label.  We grabbed a perag
reference, passed it via args->pag into xfs_bmap_btalloc_at_eof, and
afterwards args->pag is NULL.  Fix the second function not to clobber
args->pag if the caller had passed one in.

Fixes: 8584332709 ("xfs: factor xfs_bmap_btalloc()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:14:27 +10:00
Linus Torvalds 56c455b38d xfs: New code for 6.4
o Added detailed design documentation for the upcoming online repair feature
 o major update to online scrub to complete the reverse mapping cross-referencing
   infrastructure enabling us to fully validate allocated metadata against owner
   records. This is the last piece of scrub infrastructure needed before we can
   start merging online repair functionality.
 o Fixes for the ascii-ci hashing issues
 o deprecation of the ascii-ci functionality
 o on-disk format verification bug fixes
 o various random bug fixes for syzbot and other bug reports
 
 Signed-off-by: Dave Chinner <david@fromorbit.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmRMTZ8UHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h3XtA//bZYjsYRU3hzyGLKee++5t/zbiqZB
 KWw8zuPEdEsSAPphK4DQYO7XPWetgFh8iBU39M8TM0+g5YacjzBLGADjQiEv7naN
 IxSoElQQzZbvMcUPOnuRaoopn0v7pbWIDRo3hKWaRDKCrnMGOnTvDFuC/VX0RAbn
 GzPimbuvaYJPXTnWTwsKeAuVYP4HLdTh2R1gUMjyY80Ed08hxhCzrXSvjEtuxOOy
 tDk50wJUhgx7UTgFBsXug1wXLCYwDFvAUjpsBKnmq+vSl0MpI3TdCetmSQbuvAeu
 gvkRyBMOcqcY5rlozcKPpyXwy7I0ftXOY4xpUSW8H9tAx0oVImkC69DsAjotQV0r
 r6vEtcw7LgmaS9kbA6G2Z4JfKEHuf2d/6OI4onZh25b5SWq7+qFBPo67AoFg8UQf
 bKSf3QQNVLTyWqpRf8Z3XOEBygYGsDUuxrm2AA5Aar4t4T3y5oAKFKkf4ZAlAYxH
 KViQsq0qVcoQ4k4txZgU7XQrftKyu2csqxqtKDozH7FutxscchZEwvjdQ6jnS2+L
 2Qlf6On8edfEkPKzF7/1cgxUXCXuTqakFVetChXjZ1/ZFt9LUYphvESdaolJ8Aqz
 lwEy5UrbC2oMrBDT7qESLWs3U66mPhRaaFfuLUJRyhHN3Y0tVVA2mgNzyD6oBQVy
 ffIbZ3+1QEPOaOQ=
 =lBJy
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Dave Chinner:
 "This consists mainly of online scrub functionality and the design
  documentation for the upcoming online repair functionality built on
  top of the scrub code:

   - Added detailed design documentation for the upcoming online repair
     feature

   - major update to online scrub to complete the reverse mapping
     cross-referencing infrastructure enabling us to fully validate
     allocated metadata against owner records. This is the last piece of
     scrub infrastructure needed before we can start merging online
     repair functionality.

   - Fixes for the ascii-ci hashing issues

   - deprecation of the ascii-ci functionality

   - on-disk format verification bug fixes

   - various random bug fixes for syzbot and other bug reports"

* tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (107 commits)
  xfs: fix livelock in delayed allocation at ENOSPC
  xfs: Extend table marker on deprecated mount options table
  xfs: fix duplicate includes
  xfs: fix BUG_ON in xfs_getbmap()
  xfs: verify buffer contents when we skip log replay
  xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done
  xfs: remove WARN when dquot cache insertion fails
  xfs: don't consider future format versions valid
  xfs: deprecate the ascii-ci feature
  xfs: test the ascii case-insensitive hash
  xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation
  xfs: cross-reference rmap records with refcount btrees
  xfs: cross-reference rmap records with inode btrees
  xfs: cross-reference rmap records with free space btrees
  xfs: cross-reference rmap records with ag btrees
  xfs: introduce bitmap type for AG blocks
  xfs: convert xbitmap to interval tree
  xfs: drop the _safe behavior from the xbitmap foreach macro
  xfs: don't load local xattr values during scrub
  xfs: remove the for_each_xbitmap_ helpers
  ...
2023-04-29 10:44:27 -07:00
Linus Torvalds 7fa8a8ee94 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
 
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 
 - zsmalloc performance improvements from Sergey Senozhatsky.
 
 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.
 
 - VFS rationalizations from Christoph Hellwig:
 
   - removal of most of the callers of write_one_page().
 
   - make __filemap_get_folio()'s return value more useful
 
 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing.  Use `mount -o noswap'.
 
 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.
 
 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).
 
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.
 
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
   than exclusive, and has fixed a bunch of errors which were caused by its
   unintuitive meaning.
 
 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.
 
 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.
 
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 
 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.
 
 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.
 
 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.
 
 - Lorenzo has also contributed further cleanups of vma_merge().
 
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.
 
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.
 
 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.
 
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.
 
 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.
 
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 
 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.
 
 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.
 
 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.
 
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 
 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.
 
 - Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
 
 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.
 
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 
 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
 jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
 =J+Dj
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page recalim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-27 19:42:02 -07:00
Linus Torvalds 888d3c9f7f sysctl-6.4-rc1
This pull request goes with only a few sysctl moves from the
 kernel/sysctl.c file, the rest of the work has been put towards
 deprecating two API calls which incur recursion and prevent us
 from simplifying the registration process / saving memory per
 move. Most of the changes have been soaking on linux-next since
 v6.3-rc3.
 
 I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
 feedback that we should see if we could *save* memory with these
 moves instead of incurring more memory. We currently incur more
 memory since when we move a syctl from kernel/sysclt.c out to its
 own file we end up having to add a new empty sysctl used to register
 it. To achieve saving memory we want to allow syctls to be passed
 without requiring the end element being empty, and just have our
 registration process rely on ARRAY_SIZE(). Without this, supporting
 both styles of sysctls would make the sysctl registration pretty
 brittle, hard to read and maintain as can be seen from Meng Tang's
 efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE()
 for all sysctl registrations also implies doing the work to deprecate
 two API calls which use recursion in order to support sysctl
 declarations with subdirectories.
 
 And so during this development cycle quite a bit of effort went into
 this deprecation effort. I've annotated the following two APIs are
 deprecated and in few kernel releases we should be good to remove them:
 
   * register_sysctl_table()
   * register_sysctl_paths()
 
 During this merge window we should be able to deprecate and unexport
 register_sysctl_paths(), we can probably do that towards the end
 of this merge window.
 
 Deprecating register_sysctl_table() will take a bit more time but
 this pull request goes with a few example of how to do this.
 
 As it turns out each of the conversions to move away from either of
 these two API calls *also* saves memory. And so long term, all these
 changes *will* prove to have saved a bit of memory on boot.
 
 The way I see it then is if remove a user of one deprecated call, it
 gives us enough savings to move one kernel/sysctl.c out from the
 generic arrays as we end up with about the same amount of bytes.
 
 Since deprecating register_sysctl_table() and register_sysctl_paths()
 does not require maintainer coordination except the final unexport
 you'll see quite a bit of these changes from other pull requests, I've
 just kept the stragglers after rc3.
 
 Most of these changes have been soaking on linux-next since around rc3.
 
 [0] https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRHAjQSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinTzgQAI/uKHKi0VlUR1l2Psl0XbseUVueuyj3
 ZDxSJpbVUmsoDf2MlLjzB8mYE3ricnNTDbLr7qOyA6pXdM1N0mY5LQmRVRu8/ffd
 2T1hQ5pl7YnJdWP5dPhcF9Y+jnu1tjX1MW5DS4fzllwK7FnD86HuIruGq52RAPS/
 /FH+BD9eodLWWXk6A/o2GFqoWxPKQI0GLxEYWa7Hg7yt8E/3PQL9QsRzn8i6U+HW
 BrN/+G3YD1VCCzXu0UAeXnm+i1Z7CdvqNdZuSkvE3DObiZ5WpOS+/i7FrDB7zdiu
 zAbHaifHnDPtcK3w2ZodbLAAwEWD/mG4iwIjE2kgIMVYxBv7TFDBRREXAWYAevIT
 UUuZnWDQsGaWdjywrebaUycEfd6dytKyan0fTXgMFkcoWRjejhitfdM2iZDdQROg
 q453p4HqOw4vTrhy4ov4zOX7J3EFiBzpZdl+SmLqcXk+jbLVb/Q9snUWz1AFtHBl
 gHoP5bS82uVktGG3MsObjgTzYYMQjO9YGIrVuW1VP9uWs8WaoWx6M9FQJIIhtwE+
 h6wG2s7CjuFWnS0/IxWmDOn91QyUn1w7ohiz9TuvYj/5GLSBpBDGCJHsNB5T2WS1
 qbQRaZ2Kg3j9TeyWfXxdlxBx7bt3ni+J/IXDY0zom2sTpGHKl8D2g5AzmEXJDTpl
 kd7Z3gsmwhDh
 =0U0W
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "This only does a few sysctl moves from the kernel/sysctl.c file, the
  rest of the work has been put towards deprecating two API calls which
  incur recursion and prevent us from simplifying the registration
  process / saving memory per move. Most of the changes have been
  soaking on linux-next since v6.3-rc3.

  I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
  feedback that we should see if we could *save* memory with these moves
  instead of incurring more memory. We currently incur more memory since
  when we move a syctl from kernel/sysclt.c out to its own file we end
  up having to add a new empty sysctl used to register it. To achieve
  saving memory we want to allow syctls to be passed without requiring
  the end element being empty, and just have our registration process
  rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls
  would make the sysctl registration pretty brittle, hard to read and
  maintain as can be seen from Meng Tang's efforts to do just this [0].
  Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations
  also implies doing the work to deprecate two API calls which use
  recursion in order to support sysctl declarations with subdirectories.

  And so during this development cycle quite a bit of effort went into
  this deprecation effort. I've annotated the following two APIs are
  deprecated and in few kernel releases we should be good to remove
  them:

   - register_sysctl_table()
   - register_sysctl_paths()

  During this merge window we should be able to deprecate and unexport
  register_sysctl_paths(), we can probably do that towards the end of
  this merge window.

  Deprecating register_sysctl_table() will take a bit more time but this
  pull request goes with a few example of how to do this.

  As it turns out each of the conversions to move away from either of
  these two API calls *also* saves memory. And so long term, all these
  changes *will* prove to have saved a bit of memory on boot.

  The way I see it then is if remove a user of one deprecated call, it
  gives us enough savings to move one kernel/sysctl.c out from the
  generic arrays as we end up with about the same amount of bytes.

  Since deprecating register_sysctl_table() and register_sysctl_paths()
  does not require maintainer coordination except the final unexport
  you'll see quite a bit of these changes from other pull requests, I've
  just kept the stragglers after rc3"

Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0]

* tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits)
  fs: fix sysctls.c built
  mm: compaction: remove incorrect #ifdef checks
  mm: compaction: move compaction sysctl to its own file
  mm: memory-failure: Move memory failure sysctls to its own file
  arm: simplify two-level sysctl registration for ctl_isa_vars
  ia64: simplify one-level sysctl registration for kdump_ctl_table
  utsname: simplify one-level sysctl registration for uts_kern_table
  ntfs: simplfy one-level sysctl registration for ntfs_sysctls
  coda: simplify one-level sysctl registration for coda_table
  fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls
  xfs: simplify two-level sysctl registration for xfs_table
  nfs: simplify two-level sysctl registration for nfs_cb_sysctls
  nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
  lockd: simplify two-level sysctl registration for nlm_sysctls
  proc_sysctl: enhance documentation
  xen: simplify sysctl registration for balloon
  md: simplify sysctl registration
  hv: simplify sysctl registration
  scsi: simplify sysctl registration with register_sysctl()
  csky: simplify alignment sysctl registration
  ...
2023-04-27 16:52:33 -07:00
Dave Chinner 9419092fb2 xfs: fix livelock in delayed allocation at ENOSPC
On a filesystem with a non-zero stripe unit and a large sequential
write, delayed allocation will set a minimum allocation length of
the stripe unit. If allocation fails because there are no extents
long enough for an aligned minlen allocation, it is supposed to
fall back to unaligned allocation which allows single block extents
to be allocated.

When the allocator code was rewritting in the 6.3 cycle, this
fallback was broken - the old code used args->fsbno as the both the
allocation target and the allocation result, the new code passes the
target as a separate parameter. The conversion didn't handle the
aligned->unaligned fallback path correctly - it reset args->fsbno to
the target fsbno on failure which broke allocation failure detection
in the high level code and so it never fell back to unaligned
allocations.

This resulted in a loop in writeback trying to allocate an aligned
block, getting a false positive success, trying to insert the result
in the BMBT. This did nothing because the extent already was in the
BMBT (merge results in an unchanged extent) and so it returned the
prior extent to the conversion code as the current iomap.

Because the iomap returned didn't cover the offset we tried to map,
xfs_convert_blocks() then retries the allocation, which fails in the
same way and now we have a livelock.

Reported-and-tested-by: Brian Foster <bfoster@redhat.com>
Fixes: 8584332709 ("xfs: factor xfs_bmap_btalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-04-27 09:02:11 +10:00
Linus Torvalds 5b9a7bb72f for-6.4/io_uring-2023-04-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvawQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiKTEACvp0jm3Lyhxb8RMsx5T6Ko0pFH3DIymiL4
 xpoZmAUflOjD0c+99FwHRQqKKXuo3OelhW+YOm0N6OOAt6JMSGmKpZh0UNJx+Fgj
 wMwiQ0X3Y5SaAsr5ZpXM+G1BV7ajihsMpu8/a718ERB3U3cLDz2qJfnzJh+E5Ip5
 pYB4vS3+/FAER2MYQ7IPeovch2wWYtxDPztOxNX6SORu3OvpWiz1GR/+8u0tqj50
 ROq97Jwjh5Tl4zP356EUSj/Vkfdr2yb+NlLbun8My5x8tYftZjnrNQ/+qeJNLwB8
 tWTrg166ox/VX3aYZruAgUPv0IyGPZg7qZV5R72ChBK3VhIbOOLOCm/V6dhvl/XH
 vu2FG7J8WylWHmc+OU8u7TeSJdrwxTLs4e2IFUBK9ymAYFp0Q9S924fgvSYsFvVB
 iNn58SPRIbuA4SPtRfCd7pENtZW/QKfBC5CYK+pjsZVX40c9dbe40foVu4t2/EAo
 gi9+gSWEUVRRW2osxjaHXh78cW63g0j9bNfS6n1Vy32Oo5Mwm7n+bVWqCU5bCBXI
 MpPOk6AgME3UPwFzGzSmx+PVw8VacPxYP1NF8RFTCwj7OowFnrolJtruDmKJgXWY
 BN41EDo41k/C5mEu16Jr9rAkHeVhHaNZ+JhyDrzv8llJ/rv+4zEJw9SrhnpufmOX
 +YERd/ndAw==
 =Erfk
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Cleanup of the io-wq per-node mapping, notably getting rid of it so
   we just have a single io_wq entry per ring (Breno)

 - Followup to the above, move accounting to io_wq as well and
   completely drop struct io_wqe (Gabriel)

 - Enable KASAN for the internal io_uring caches (Breno)

 - Add support for multishot timeouts. Some applications use timeouts to
   wake someone waiting on completion entries, and this makes it a bit
   easier to just have a recurring timer rather than needing to rearm it
   every time (David)

 - Support archs that have shared cache coloring between userspace and
   the kernel, and hence have strict address requirements for mmap'ing
   the ring into userspace. This should only be parisc/hppa. (Helge, me)

 - XFS has supported O_DIRECT writes without needing to lock the inode
   exclusively for a long time, and ext4 now supports it as well. This
   is true for the common cases of not extending the file size. Flag the
   fs as having that feature, and utilize that to avoid serializing
   those writes in io_uring (me)

 - Enable completion batching for uring commands (me)

 - Revert patch adding io_uring restriction to what can be GUP mapped or
   not. This does not belong in io_uring, as io_uring isn't really
   special in this regard. Since this is also getting in the way of
   cleanups and improvements to the GUP code, get rid of if (me)

 - A few series greatly reducing the complexity of registered resources,
   like buffers or files. Not only does this clean up the code a lot,
   the simplified code is also a LOT more efficient (Pavel)

 - Series optimizing how we wait for events and run task_work related to
   it (Pavel)

 - Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel)

 - Misc cleanups and improvements (Pavel, me)

* tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits)
  Revert "io_uring/rsrc: disallow multi-source reg buffers"
  io_uring: add support for multishot timeouts
  io_uring/rsrc: disassociate nodes and rsrc_data
  io_uring/rsrc: devirtualise rsrc put callbacks
  io_uring/rsrc: pass node to io_rsrc_put_work()
  io_uring/rsrc: inline io_rsrc_put_work()
  io_uring/rsrc: add empty flag in rsrc_node
  io_uring/rsrc: merge nodes and io_rsrc_put
  io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal
  io_uring/rsrc: remove unused io_rsrc_node::llist
  io_uring/rsrc: refactor io_queue_rsrc_removal
  io_uring/rsrc: simplify single file node switching
  io_uring/rsrc: clean up __io_sqe_buffers_update()
  io_uring/rsrc: inline switch_start fast path
  io_uring/rsrc: remove rsrc_data refs
  io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce
  io_uring/rsrc: use wq for quiescing
  io_uring/rsrc: refactor io_rsrc_ref_quiesce
  io_uring/rsrc: remove io_rsrc_node::done
  io_uring/rsrc: use nospec'ed indexes
  ...
2023-04-26 12:40:31 -07:00
Linus Torvalds 7bcff5a396 v6.4/vfs.acl
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc
 otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt
 ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM=
 =pER1
 -----END PGP SIGNATURE-----

Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull acl updates from Christian Brauner:
 "After finishing the introduction of the new posix acl api last cycle
  the generic POSIX ACL xattr handlers are still around in the
  filesystems xattr handlers for two reasons:

   (1) Because a few filesystems rely on the ->list() method of the
       generic POSIX ACL xattr handlers in their ->listxattr() inode
       operation.

   (2) POSIX ACLs are only available if IOP_XATTR is raised. The
       IOP_XATTR flag is raised in inode_init_always() based on whether
       the sb->s_xattr pointer is non-NULL. IOW, the registered xattr
       handlers of the filesystem are used to raise IOP_XATTR. Removing
       the generic POSIX ACL xattr handlers from all filesystems would
       risk regressing filesystems that only implement POSIX ACL support
       and no other xattrs (nfs3 comes to mind).

  This contains the work to decouple POSIX ACLs from the IOP_XATTR flag
  as they don't depend on xattr handlers anymore. So it's now possible
  to remove the generic POSIX ACL xattr handlers from the sb->s_xattr
  list of all filesystems. This is a crucial step as the generic POSIX
  ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs
  don't depend on the xattr infrastructure anymore.

  Adressing problem (1) will require more long-term work. It would be
  best to get rid of the ->list() method of xattr handlers completely at
  some point.

  For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX
  ACL xattr handler is kept around so they can continue to use
  array-based xattr handler indexing.

  This update does simplify the ->listxattr() implementation of all
  these filesystems however"

* tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  acl: don't depend on IOP_XATTR
  ovl: check for ->listxattr() support
  reiserfs: rework priv inode handling
  fs: rename generic posix acl handlers
  reiserfs: rework ->listxattr() implementation
  fs: simplify ->listxattr() implementation
  fs: drop unused posix acl handlers
  xattr: remove unused argument
  xattr: add listxattr helper
  xattr: simplify listxattr helpers
2023-04-24 13:35:23 -07:00
Dave Chinner 422d56536f xfs: fix duplicate includes
Header files were already included, just not in the normal order.
Remove the duplicates, preserving normal order. Also move xfs_ag.h
include to before the scrub internal includes which are normally
last in the include list.

Fixes: d5c88131db ("xfs: allow queued AG intents to drain before scrubbing")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-20 08:18:34 +10:00
Yosry Ahmed c7b23b68e2 mm: vmscan: refactor updating current->reclaim_state
During reclaim, we keep track of pages reclaimed from other means than
LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
which we stash a pointer to in current task_struct.

However, we keep track of more than just reclaimed slab pages through
this.  We also use it for clean file pages dropped through pruned inodes,
and xfs buffer pages freed.  Rename reclaimed_slab to reclaimed, and add a
helper function that wraps updating it through current, so that future
changes to this logic are contained within include/linux/swap.h.

Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:30:10 -07:00
Dave Chinner 798352cb25 xfs: fix ascii-ci problems, then kill it [v2]
Last week, I was fiddling around with the metadump name obfuscation code
 while writing a debugger command to generate directories full of names
 that all have the same hash name.  I had a few questions about how well
 all that worked with ascii-ci mode, and discovered a nasty discrepancy
 between the kernel and glibc's implementations of the tolower()
 function.
 
 I discovered that I could create a directory that is large enough to
 require separate leaf index blocks.  The hashes stored in the dabtree
 use the ascii-ci specific hash function, which uses a library function
 to convert the name to lowercase before hashing.  If the kernel and C
 library's versions of tolower do not behave exactly identically,
 xfs_ascii_ci_hashname will not produce the same results for the same
 inputs.  xfs_repair will deem the leaf information corrupt and rebuild
 the directory.  After that, lookups in the kernel will fail because the
 hash index doesn't work.
 
 The kernel's tolower function will convert extended ascii uppercase
 letters (e.g. A-with-umlaut) to extended ascii lowercase letters (e.g.
 a-with-umlaut), whereas glibc's will only do that if you force LANG to
 ascii.  Tiny embedded libc implementations just plain won't do it at
 all, and the result is a mess.  Stabilize the behavior of the hash
 function by encoding the name transformation function in libxfs, add it
 to the selftest, and fix all the userspace tools, none of which handle
 this transformation correctly.
 
 The v1 series generated a /lot/ of discussion, in which several things
 became very clear: (1) Linus is not enamored of case folding of any
 kind; (2) Dave and Christoph don't seem to agree on whether the feature
 is supposed to work for 7-bit ascii or latin1; (3) it trashes UTF8
 encoded names if those happen to show up; and (4) I don't want to
 maintain this mess any longer than I have to.  Kill it in 2030.
 
 v2: rename the functions to make it clear we're moving away from the
 letters t, o, l, o, w, e, and r; and deprecate the whole feature once
 we've fixed the bugs and added tests.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdqwAKCRBKO3ySh0YR
 pi33AQC4MFCz0uP1aF64zRgE+wtU2YBGw5cGps7nWIljVptbkAEAubfoY88wAop8
 /KHIgZ8pHIb7ooPrYKpPZL5m0udtMw8=
 =3Up6
 -----END PGP SIGNATURE-----

Merge tag 'fix-asciici-bugs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: fix ascii-ci problems, then kill it [v2]

Last week, I was fiddling around with the metadump name obfuscation code
while writing a debugger command to generate directories full of names
that all have the same hash name.  I had a few questions about how well
all that worked with ascii-ci mode, and discovered a nasty discrepancy
between the kernel and glibc's implementations of the tolower()
function.

I discovered that I could create a directory that is large enough to
require separate leaf index blocks.  The hashes stored in the dabtree
use the ascii-ci specific hash function, which uses a library function
to convert the name to lowercase before hashing.  If the kernel and C
library's versions of tolower do not behave exactly identically,
xfs_ascii_ci_hashname will not produce the same results for the same
inputs.  xfs_repair will deem the leaf information corrupt and rebuild
the directory.  After that, lookups in the kernel will fail because the
hash index doesn't work.

The kernel's tolower function will convert extended ascii uppercase
letters (e.g. A-with-umlaut) to extended ascii lowercase letters (e.g.
a-with-umlaut), whereas glibc's will only do that if you force LANG to
ascii.  Tiny embedded libc implementations just plain won't do it at
all, and the result is a mess.  Stabilize the behavior of the hash
function by encoding the name transformation function in libxfs, add it
to the selftest, and fix all the userspace tools, none of which handle
this transformation correctly.

The v1 series generated a /lot/ of discussion, in which several things
became very clear: (1) Linus is not enamored of case folding of any
kind; (2) Dave and Christoph don't seem to agree on whether the feature
is supposed to work for 7-bit ascii or latin1; (3) it trashes UTF8
encoded names if those happen to show up; and (4) I don't want to
maintain this mess any longer than I have to.  Kill it in 2030.

v2: rename the functions to make it clear we're moving away from the
letters t, o, l, o, w, e, and r; and deprecate the whole feature once
we've fixed the bugs and added tests.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:11:43 +10:00
Dave Chinner b89116c2fb xfs: strengthen rmapbt scrubbing [v24.5]
This series strengthens space allocation record cross referencing by
 using AG block bitmaps to compute the difference between space used
 according to the rmap records and the primary metadata, and reports
 cross-referencing errors for any discrepancies.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdqgAKCRBKO3ySh0YR
 pgQ8AP4uxtGQY7RXaJ6aKXTwqarZO9vwZyQiL/nFo1onGCHv4wD/cIwn7hvJmbnF
 7avZMCsYJlVFNAJUFZuACmz0v4PwWQ8=
 =KqDr
 -----END PGP SIGNATURE-----

Merge tag 'scrub-strengthen-rmap-checking-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: strengthen rmapbt scrubbing [v24.5]

This series strengthens space allocation record cross referencing by
using AG block bitmaps to compute the difference between space used
according to the rmap records and the primary metadata, and reports
cross-referencing errors for any discrepancies.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:11:31 +10:00
Dave Chinner 43223ef72e xfs: rework online fsck incore bitmap [v24.5]
In this series, we make some changes to the incore bitmap code: First,
 we shorten the prefix to 'xbitmap'.  Then, we rework some utility
 functions for later use by online repair and clarify how the walk
 functions are supposed to be used.
 
 Finally, we use all these new pieces to convert the incore bitmap to use
 an interval tree instead of linked lists.  This lifts the limitation
 that callers had to be careful not to set a range that was already set;
 and gets us ready for the btree rebuilder functions needing to be able
 to set bits in a bitmap and generate maximal contiguous extents for the
 set ranges.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdqQAKCRBKO3ySh0YR
 pmoIAQDszM48pGVRa7zh942ACCa+CbPABFyKDaVePwhccnX2/QD/a9gi4lmEoPVw
 915wIjWbk5rzAjZPp2k3IwYkEf4v2wI=
 =QCWU
 -----END PGP SIGNATURE-----

Merge tag 'repair-bitmap-rework-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: rework online fsck incore bitmap [v24.5]

In this series, we make some changes to the incore bitmap code: First,
we shorten the prefix to 'xbitmap'.  Then, we rework some utility
functions for later use by online repair and clarify how the walk
functions are supposed to be used.

Finally, we use all these new pieces to convert the incore bitmap to use
an interval tree instead of linked lists.  This lifts the limitation
that callers had to be careful not to set a range that was already set;
and gets us ready for the btree rebuilder functions needing to be able
to set bits in a bitmap and generate maximal contiguous extents for the
set ranges.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:11:20 +10:00
Dave Chinner bb09d76599 xfs: clean up memory management in xattr scrub [v24.5]
Currently, the extended attribute scrubber uses a single VLA to store
 all the context information needed in various parts of the scrubber
 code.  This includes xattr leaf block space usage bitmaps, and the value
 buffer used to check the correctness of remote xattr value block
 headers.  We try to minimize the insanity through the use of helper
 functions, but this is a memory management nightmare.  Clean this up by
 making the bitmap and value pointers explicit members of struct
 xchk_xattr_buf.
 
 Second, strengthen the xattr checking by teaching it to look for overlapping
 data structures in the shortform attr data.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdqAAKCRBKO3ySh0YR
 pnfxAQDGPPb0yT/FD7qnVSSdmUgQPc3Q8OPam/pOrIvXYABn9AD+I9zr64zoR/cD
 xn2Z7nkcLV4wK+ofOO8beGqK6R/oBwI=
 =/OO/
 -----END PGP SIGNATURE-----

Merge tag 'scrub-fix-xattr-memory-mgmt-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: clean up memory management in xattr scrub [v24.5]

Currently, the extended attribute scrubber uses a single VLA to store
all the context information needed in various parts of the scrubber
code.  This includes xattr leaf block space usage bitmaps, and the value
buffer used to check the correctness of remote xattr value block
headers.  We try to minimize the insanity through the use of helper
functions, but this is a memory management nightmare.  Clean this up by
making the bitmap and value pointers explicit members of struct
xchk_xattr_buf.

Second, strengthen the xattr checking by teaching it to look for overlapping
data structures in the shortform attr data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:11:10 +10:00
Dave Chinner b9fcf89f6b xfs: detect mergeable and overlapping btree records [v24.5]
While I was doing differential fuzz analysis between xfs_scrub and
 xfs_repair, I noticed that xfs_repair was only partially effective at
 detecting btree records that can be merged, and xfs_scrub totally didn't
 notice at all.
 
 For every interval btree type except for the bmbt, there should never
 exist two adjacent records with adjacent keyspaces because the
 blockcount field is always large enough to span the entire keyspace of
 the domain.  This is because the free space, rmap, and refcount btrees
 have a blockcount field large enough to store the maximum AG length, and
 there can never be an allocation larger than an AG.
 
 The bmbt is a different story due to its ondisk encoding where the
 blockcount is only 21 bits wide.  Because AGs can span up to 2^31 blocks
 and the RT volume can span up to 2^52 blocks, a preallocation of 2^22
 blocks will be expressed as two records of 2^21 length.  We don't
 opportunistically combine records when doing bmbt operations, which is
 why the fsck tools have never complained about this scenario.
 
 Offline repair is partially effective at detecting mergeable records
 because I taught it to do that for the rmap and refcount btrees.  This
 series enhances the free space, rmap, and refcount scrubbers to detect
 mergeable records.  For the bmbt, it will flag the file as being
 eligible for an optimization to shrink the size of the data structure.
 
 The last patch in this set also enhances the rmap scrubber to detect
 records that overlap incorrectly.  This check is done automatically for
 non-overlapping btree types, but we have to do it separately for the
 rmapbt because there are constraints on which allocation types are
 allowed to overlap.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdqAAKCRBKO3ySh0YR
 pj1kAP97i2gcFebCwLC+YTDFQoiGVYTf2Mo/bHcPopP/ptXFSAD9HYeG8xM0T3aB
 EGrmyaXSPcO+l/txMGTx8ACCJjxOFwg=
 =G9gi
 -----END PGP SIGNATURE-----

Merge tag 'scrub-detect-mergeable-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: detect mergeable and overlapping btree records [v24.5]

While I was doing differential fuzz analysis between xfs_scrub and
xfs_repair, I noticed that xfs_repair was only partially effective at
detecting btree records that can be merged, and xfs_scrub totally didn't
notice at all.

For every interval btree type except for the bmbt, there should never
exist two adjacent records with adjacent keyspaces because the
blockcount field is always large enough to span the entire keyspace of
the domain.  This is because the free space, rmap, and refcount btrees
have a blockcount field large enough to store the maximum AG length, and
there can never be an allocation larger than an AG.

The bmbt is a different story due to its ondisk encoding where the
blockcount is only 21 bits wide.  Because AGs can span up to 2^31 blocks
and the RT volume can span up to 2^52 blocks, a preallocation of 2^22
blocks will be expressed as two records of 2^21 length.  We don't
opportunistically combine records when doing bmbt operations, which is
why the fsck tools have never complained about this scenario.

Offline repair is partially effective at detecting mergeable records
because I taught it to do that for the rmap and refcount btrees.  This
series enhances the free space, rmap, and refcount scrubbers to detect
mergeable records.  For the bmbt, it will flag the file as being
eligible for an optimization to shrink the size of the data structure.

The last patch in this set also enhances the rmap scrubber to detect
records that overlap incorrectly.  This check is done automatically for
non-overlapping btree types, but we have to do it separately for the
rmapbt because there are constraints on which allocation types are
allowed to overlap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:11:02 +10:00
Dave Chinner d808a8e6b9 xfs: merge bmap records for faster scrubs [v24.5]
I started looking into performance problems with the data fork scrubber
 in generic/333, and noticed a few things that needed improving.  First,
 due to design reasons, it's possible for file forks btrees to contain
 multiple contiguous mappings to the same physical space.  Instead of
 checking each ondisk mapping individually, it's much faster to combine
 them when possible and check the combined mapping because that's fewer
 trips through the rmap btree, and we can drop this check-around
 behavior that it does when an rmapbt lookup produces a record that
 starts before or ends after a particular bmbt mapping.
 
 Second, I noticed that the bmbt scrubber decides to walk every reverse
 mapping in the filesystem if the file fork is in btree format.  This is
 very costly, and only necessary if the inode repair code had to zap a
 fork to convince iget to work.  Constraining the full-rmap scan to this
 one case means we can skip it for normal files, which drives the runtime
 of this test from 8 hours down to 45 minutes (observed with realtime
 reflink and rebuild-all mode.)
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDdPcQAKCRBKO3ySh0YR
 pl1UAPoDtMaFrsLvz7clh31S6Yi+X8oCB/iJZXWl7HXaNsIjUQEA253GuiOj80Rz
 IHYo3t0KPYTm2Mc/7kBFQcctFbisDwE=
 =zFQ+
 -----END PGP SIGNATURE-----

Merge tag 'scrub-merge-bmap-records-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: merge bmap records for faster scrubs [v24.5]

I started looking into performance problems with the data fork scrubber
in generic/333, and noticed a few things that needed improving.  First,
due to design reasons, it's possible for file forks btrees to contain
multiple contiguous mappings to the same physical space.  Instead of
checking each ondisk mapping individually, it's much faster to combine
them when possible and check the combined mapping because that's fewer
trips through the rmap btree, and we can drop this check-around
behavior that it does when an rmapbt lookup produces a record that
starts before or ends after a particular bmbt mapping.

Second, I noticed that the bmbt scrubber decides to walk every reverse
mapping in the filesystem if the file fork is in btree format.  This is
very costly, and only necessary if the inode repair code had to zap a
fork to convince iget to work.  Constraining the full-rmap scan to this
one case means we can skip it for normal files, which drives the runtime
of this test from 8 hours down to 45 minutes (observed with realtime
reflink and rebuild-all mode.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:10:53 +10:00
Dave Chinner 1e7912349e xfs: fix iget/irele usage in online fsck [v24.5]
This patchset fixes a handful of problems relating to how we get and
 release incore inodes in the online scrub code.  The first patch fixes
 how we handle DONTCACHE -- our reasons for setting (or clearing it)
 depend entirely on the runtime environment at irele time.  Hence we can
 refactor iget and irele to use our own wrappers that set that context
 appropriately.
 
 The second patch fixes a race between the iget call in the inode core
 scrubber and other writer threads that are allocating or freeing inodes
 in the same AG by changing the behavior of xchk_iget (and the inode core
 scrub setup function) to return either an incore inode or the AGI buffer
 so that we can be sure that the inode cannot disappear on us.
 
 The final patch elides MMAPLOCK from scrub paths when possible.  It did
 not fit anywhere else.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDdPcAAKCRBKO3ySh0YR
 piKpAP43yHHymFsXykAj23/YTBumdWYvAYu/n3IDLHe8Tt/UTgEAuiJj/MWfi3aR
 35JL0XbhANm9i72vfwRJJl0WkCwwwAI=
 =zQ5R
 -----END PGP SIGNATURE-----

Merge tag 'scrub-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: fix iget/irele usage in online fsck [v24.5]

This patchset fixes a handful of problems relating to how we get and
release incore inodes in the online scrub code.  The first patch fixes
how we handle DONTCACHE -- our reasons for setting (or clearing it)
depend entirely on the runtime environment at irele time.  Hence we can
refactor iget and irele to use our own wrappers that set that context
appropriately.

The second patch fixes a race between the iget call in the inode core
scrubber and other writer threads that are allocating or freeing inodes
in the same AG by changing the behavior of xchk_iget (and the inode core
scrub setup function) to return either an incore inode or the AGI buffer
so that we can be sure that the inode cannot disappear on us.

The final patch elides MMAPLOCK from scrub paths when possible.  It did
not fit anywhere else.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:10:41 +10:00
Dave Chinner a44667226d xfs: fix bugs in parent pointer checking [v24.5]
Jan Kara pointed out that the VFS doesn't take i_rwsem of a child
 subdirectory that is being moved from one parent to another.  Upon
 deeper analysis, I realized that this was the source of a very hard to
 trigger false corruption report in the parent pointer checking code.
 
 Now that we've refactored how directory walks work in scrub, we can also
 get rid of all the unnecessary and broken locking to make parent pointer
 scrubbing work properly.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDdPcAAKCRBKO3ySh0YR
 po7/AQD2P0QjtwTDUPPMy9AttJjXybZ0Xc/kKjrPk7AXyIrmDAEA758BIpxI7Wwv
 9OrAWI2NG8allL4sIM7c4Y+fguHu8gg=
 =f+Wc
 -----END PGP SIGNATURE-----

Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: fix bugs in parent pointer checking [v24.5]

Jan Kara pointed out that the VFS doesn't take i_rwsem of a child
subdirectory that is being moved from one parent to another.  Upon
deeper analysis, I realized that this was the source of a very hard to
trigger false corruption report in the parent pointer checking code.

Now that we've refactored how directory walks work in scrub, we can also
get rid of all the unnecessary and broken locking to make parent pointer
scrubbing work properly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:10:32 +10:00
Dave Chinner f697c2cc15 xfs: fix iget usage in directory scrub [v24.5]
In this series, we fix some problems with how the directory scrubber
 grabs child inodes.  First, we want to reduce EDEADLOCK returns by
 replacing fixed-iteration loops with interruptible trylock loops.
 Second, we add UNTRUSTED to the child iget call so that we can detect a
 dirent that points to an unallocated inode.  Third, we fix a bug where
 we weren't checking the inode pointed to by dotdot entries at all.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDdPbwAKCRBKO3ySh0YR
 pjuKAQDHfHFTFVo2Svy/R/E4l/OqWWpO6ffHzESIE1Dzmb8flgEAoaPgXrMJwFg6
 lNoEJRc9DZ+K+rNvHtgZhxIKfFOgSw4=
 =xSO3
 -----END PGP SIGNATURE-----

Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: fix iget usage in directory scrub [v24.5]

In this series, we fix some problems with how the directory scrubber
grabs child inodes.  First, we want to reduce EDEADLOCK returns by
replacing fixed-iteration loops with interruptible trylock loops.
Second, we add UNTRUSTED to the child iget call so that we can detect a
dirent that points to an unallocated inode.  Third, we fix a bug where
we weren't checking the inode pointed to by dotdot entries at all.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:10:21 +10:00
Dave Chinner b1bdab2526 xfs: detect incorrect gaps in rmap btree [v24.5]
Following in the theme of the last two patchsets, this one strengthens
 the rmap btree record checking so that scrub can count the number of
 space records that map to a given owner and that do not map to a given
 owner.  This enables us to determine exclusive ownership of space that
 can't be shared.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdpAAKCRBKO3ySh0YR
 pk9sAQDXcVPG4a2TvTd+j95UkkPovjYjTJekTTlJL/Xo91rAxgD/fEx3I8A8vNes
 dxVeyT/CwiYOPRYxFE3g3UdJGbaeHQA=
 =ux+s
 -----END PGP SIGNATURE-----

Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: detect incorrect gaps in rmap btree [v24.5]

Following in the theme of the last two patchsets, this one strengthens
the rmap btree record checking so that scrub can count the number of
space records that map to a given owner and that do not map to a given
owner.  This enables us to determine exclusive ownership of space that
can't be shared.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:10:13 +10:00
Dave Chinner f1121b995c xfs: detect incorrect gaps in inode btree [v24.5]
This series continues the corrections for a couple of problems I found
 in the inode btree scrubber.  The first problem is that we don't
 directly check the inobt records have a direct correspondence with the
 finobt records, and vice versa.  The second problem occurs on
 filesystems with sparse inode chunks -- the cross-referencing we do
 detects sparseness, but it doesn't actually check the consistency
 between the inobt hole records and the rmap data.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdowAKCRBKO3ySh0YR
 pt2WAQDcHbg0JDyGDcTiSyYqTlT2xxzeaxtMRg75fWYpIRa2dQEAuatGejdp56in
 AbH6jSmtS9f4M0wcy5JhHyHzZdZjcgc=
 =1G5P
 -----END PGP SIGNATURE-----

Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: detect incorrect gaps in inode btree [v24.5]

This series continues the corrections for a couple of problems I found
in the inode btree scrubber.  The first problem is that we don't
directly check the inobt records have a direct correspondence with the
finobt records, and vice versa.  The second problem occurs on
filesystems with sparse inode chunks -- the cross-referencing we do
detects sparseness, but it doesn't actually check the consistency
between the inobt hole records and the rmap data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:10:05 +10:00
Dave Chinner e7cef2fe44 xfs: detect incorrect gaps in refcount btree [v24.5]
The next few patchsets address a deficiency in scrub that I found while
 QAing the refcount btree scrubber.  If there's a gap between refcount
 records, we need to cross-reference that gap with the reverse mappings
 to ensure that there are no overlapping records in the rmap btree.  If
 we find any, then the refcount btree is not consistent.  This is not a
 property that is specific to the refcount btree; they all need to have
 this sort of keyspace scanning logic to detect inconsistencies.
 
 To do this accurately, we need to be able to scan the keyspace of a
 btree (which we already do) to be able to tell the caller if the
 keyspace is empty, sparse, or fully covered by records.  The first few
 patches add the keyspace scanner to the generic btree code, along with
 the ability to mask off parts of btree keys because when we scan the
 rmapbt, we only care about space usage, not the owners.
 
 The final patch closes the scanning gap in the refcountbt scanner.
 
 v23.1: create helpers for the key extraction and comparison functions,
        improve documentation, and eliminate the ->mask_key indirect
        calls
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdogAKCRBKO3ySh0YR
 pjDDAQC88qzAvA3j2JP8ZC9mnK89LsYpkOEX2i6HV2m4LWYdWgD/fWdGnp0BFoQj
 is+V82X6oRhWi8SRnjOX28Mk8gCdDA8=
 =fzga
 -----END PGP SIGNATURE-----

Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: detect incorrect gaps in refcount btree [v24.5]

The next few patchsets address a deficiency in scrub that I found while
QAing the refcount btree scrubber.  If there's a gap between refcount
records, we need to cross-reference that gap with the reverse mappings
to ensure that there are no overlapping records in the rmap btree.  If
we find any, then the refcount btree is not consistent.  This is not a
property that is specific to the refcount btree; they all need to have
this sort of keyspace scanning logic to detect inconsistencies.

To do this accurately, we need to be able to scan the keyspace of a
btree (which we already do) to be able to tell the caller if the
keyspace is empty, sparse, or fully covered by records.  The first few
patches add the keyspace scanner to the generic btree code, along with
the ability to mask off parts of btree keys because when we scan the
rmapbt, we only care about space usage, not the owners.

The final patch closes the scanning gap in the refcountbt scanner.

v23.1: create helpers for the key extraction and comparison functions,
       improve documentation, and eliminate the ->mask_key indirect
       calls

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:09:55 +10:00
Dave Chinner 6858c88701 xfs: enhance btree key scrubbing [v24.5]
This series fixes the scrub btree block checker to ensure that the keys
 in the parent block accurately represent the block, and check the
 ordering of all interior key records.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdogAKCRBKO3ySh0YR
 pszmAP94RPiNPhUac1Eh6cCOGrt7Cor84dyZmWugqDI+nGo5JAD/Qfy5gFwAK7WV
 vQZZPSf/AdLTEEJFA11NVqBW0TukWgE=
 =xIq/
 -----END PGP SIGNATURE-----

Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: enhance btree key scrubbing [v24.5]

This series fixes the scrub btree block checker to ensure that the keys
in the parent block accurately represent the block, and check the
ordering of all interior key records.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:09:46 +10:00
Dave Chinner 1ee7550532 xfs: fix rmap btree key flag handling [v24.5]
This series fixes numerous flag handling bugs in the rmapbt key code.
 The most serious transgression is that key comparisons completely strip
 out all flag bits from rm_offset, including the ones that participate in
 record lookups.  The second problem is that for years we've been letting
 the unwritten flag (which is an attribute of a specific record and not
 part of the record key) escape from leaf records into key records.
 
 The solution to the second problem is to filter attribute flags when
 creating keys from records, and the solution to the first problem is to
 preserve *only* the flags used for key lookups.  The ATTR and BMBT flags
 are a part of the lookup key, and the UNWRITTEN flag is a record
 attribute.
 
 This has worked for years without generating user complaints because
 ATTR and BMBT extents cannot be shared, so key comparisons succeed
 solely on rm_startblock.  Only file data fork extents can be shared, and
 those records never set any of the three flag bits, so comparisons that
 dig into rm_owner and rm_offset work just fine.
 
 A filesystem written with an unpatched kernel and mounted on a patched
 kernel will work correctly because the ATTR/BMBT flags have been
 conveyed into keys correctly all along, and we still ignore the
 UNWRITTEN flag in any key record.  This was what doomed my previous
 attempt to correct this problem in 2019.
 
 A filesystem written with a patched kernel and mounted on an unpatched
 kernel will also work correctly because unpatched kernels ignore all
 flags.
 
 With this patchset applied, the scrub code gains the ability to detect
 rmap btrees with incorrectly set attr and bmbt flags in the key records.
 After three years of testing, I haven't encountered any problems.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdoQAKCRBKO3ySh0YR
 prUmAP9WiaLPxeMAnQiQcaZyqyAhaiqbwNoLkDMx0+1+SKDPCwD7BU6tPQpT039i
 mrDag3g2x4N7g/e89N29SQp8EDGuQQQ=
 =Chkt
 -----END PGP SIGNATURE-----

Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: fix rmap btree key flag handling [v24.5]

This series fixes numerous flag handling bugs in the rmapbt key code.
The most serious transgression is that key comparisons completely strip
out all flag bits from rm_offset, including the ones that participate in
record lookups.  The second problem is that for years we've been letting
the unwritten flag (which is an attribute of a specific record and not
part of the record key) escape from leaf records into key records.

The solution to the second problem is to filter attribute flags when
creating keys from records, and the solution to the first problem is to
preserve *only* the flags used for key lookups.  The ATTR and BMBT flags
are a part of the lookup key, and the UNWRITTEN flag is a record
attribute.

This has worked for years without generating user complaints because
ATTR and BMBT extents cannot be shared, so key comparisons succeed
solely on rm_startblock.  Only file data fork extents can be shared, and
those records never set any of the three flag bits, so comparisons that
dig into rm_owner and rm_offset work just fine.

A filesystem written with an unpatched kernel and mounted on a patched
kernel will work correctly because the ATTR/BMBT flags have been
conveyed into keys correctly all along, and we still ignore the
UNWRITTEN flag in any key record.  This was what doomed my previous
attempt to correct this problem in 2019.

A filesystem written with a patched kernel and mounted on an unpatched
kernel will also work correctly because unpatched kernels ignore all
flags.

With this patchset applied, the scrub code gains the ability to detect
rmap btrees with incorrectly set attr and bmbt flags in the key records.
After three years of testing, I haven't encountered any problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:09:36 +10:00
Dave Chinner b764ea207f xfs: hoist scrub record checks into libxfs [v24.5]
There are a few things about btree records that scrub checked but the
 libxfs _get_rec functions didn't.  Move these bits into libxfs so that
 everyone can benefit.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdoAAKCRBKO3ySh0YR
 pvbPAP9zGtY7B15ORWk9wcHELUoPgDhNZR39ye7MfxWNCBZJxgD6A8SzZpbZc5Gh
 9a1/ImUDZ0ekFnAdx0dVRA+gnrO4Vwo=
 =197l
 -----END PGP SIGNATURE-----

Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: hoist scrub record checks into libxfs [v24.5]

There are a few things about btree records that scrub checked but the
libxfs _get_rec functions didn't.  Move these bits into libxfs so that
everyone can benefit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:09:27 +10:00
Dave Chinner 01822a74ca xfs: standardize btree record checking code [v24.5]
While I was cleaning things up for 6.1, I noticed that the btree
 _query_range and _query_all functions don't perform the same checking
 that the _get_rec functions perform.  In fact, they don't perform /any/
 sanity checking, which means that callers aren't warned about impossible
 records.
 
 Therefore, hoist the record validation and complaint logging code into
 separate functions, and call them from any place where we convert an
 ondisk record into an incore record.  For online scrub, we can replace
 checking code with a call to the record checking functions in libxfs,
 thereby reducing the size of the codebase.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnwAKCRBKO3ySh0YR
 ppn6AQCOWjqsq7klLAQdvEDm3O8v4k94geKdn4Ruvbptwa2iUQD/WAJ5LwKnEPuQ
 +eB5AfzsziMQMNX7DtUwncaDJm1RBgY=
 =ys9Z
 -----END PGP SIGNATURE-----

Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: standardize btree record checking code [v24.5]

While I was cleaning things up for 6.1, I noticed that the btree
_query_range and _query_all functions don't perform the same checking
that the _get_rec functions perform.  In fact, they don't perform /any/
sanity checking, which means that callers aren't warned about impossible
records.

Therefore, hoist the record validation and complaint logging code into
separate functions, and call them from any place where we convert an
ondisk record into an incore record.  For online scrub, we can replace
checking code with a call to the record checking functions in libxfs,
thereby reducing the size of the codebase.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:09:18 +10:00
Dave Chinner b634abac59 xfs: drain deferred work items when scrubbing [v24.5]
The design doc for XFS online fsck contains a long discussion of the
 eventual consistency models in use for XFS metadata.  In that chapter,
 we note that it is possible for scrub to collide with a chain of
 deferred space metadata updates, and proposes a lightweight solution:
 The use of a pending-intents counter so that scrub can wait for the
 system to drain all chains.
 
 This patchset implements that scrub drain.  The first patch implements
 the basic mechanism, and the subsequent patches reduce the runtime
 overhead by converting the implementation to use sloppy counters and
 introducing jump labels to avoid walking into scrub hooks when it isn't
 running.  This last paradigm repeats elsewhere in this megaseries.
 
 v23.1: make intent items take an active ref to the perag structure and
        document why we bump and drop the intent counts when we do
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnwAKCRBKO3ySh0YR
 poQmAQDAu0YNxoRGok7H/RGfQQHWBReSkLXT9RKGzjWn4G51EQD8DA/CpuqsC3yU
 uJ55vGAb8jSCBFJITVF1/i8B9sfpngw=
 =Nz0X
 -----END PGP SIGNATURE-----

Merge tag 'scrub-drain-intents-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: drain deferred work items when scrubbing [v24.5]

The design doc for XFS online fsck contains a long discussion of the
eventual consistency models in use for XFS metadata.  In that chapter,
we note that it is possible for scrub to collide with a chain of
deferred space metadata updates, and proposes a lightweight solution:
The use of a pending-intents counter so that scrub can wait for the
system to drain all chains.

This patchset implements that scrub drain.  The first patch implements
the basic mechanism, and the subsequent patches reduce the runtime
overhead by converting the implementation to use sloppy counters and
introducing jump labels to avoid walking into scrub hooks when it isn't
running.  This last paradigm repeats elsewhere in this megaseries.

v23.1: make intent items take an active ref to the perag structure and
       document why we bump and drop the intent counts when we do

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:09:09 +10:00
Dave Chinner 793f5c2cca xfs_scrub: fix licensing and copyright notices [v24.5]
Fix various attribution problems in the xfs_scrub source code, such as
 the author's contact information, out of date SPDX tags, and a rough
 estimate of when the feature was under heavy development.  The most
 egregious parts are the files that are missing license information
 completely.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdngAKCRBKO3ySh0YR
 pjGfAP9QqnMMXWlre680TuB1pWXKzHviCdKHlxTeJtYTAycNoAEAiA0JMyahF4kT
 p8EEGUvw0cg0RcZA4PUF/vCsxvCJfws=
 =NYxO
 -----END PGP SIGNATURE-----

Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs_scrub: fix licensing and copyright notices [v24.5]

Fix various attribution problems in the xfs_scrub source code, such as
the author's contact information, out of date SPDX tags, and a rough
estimate of when the feature was under heavy development.  The most
egregious parts are the files that are missing license information
completely.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:08:59 +10:00
Dave Chinner 1e5ffdc57d xfs: pass perag references around when possible [v24.5]
Avoid the cost of perag radix tree lookups by passing around active perag
 references when possible.
 
 v24.2: rework some of the naming and whatnot so there's less opencoding
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnQAKCRBKO3ySh0YR
 po/YAPsEFygm4/bQZBtOf0HFmVDtTXYAEujJeiXKbmEqzlMxpQEAhuCqFaTQ+Pnr
 zpg1egeIcaw6dNTW4f2slcATaQgG0gM=
 =8HsC
 -----END PGP SIGNATURE-----

Merge tag 'pass-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: pass perag references around when possible [v24.5]

Avoid the cost of perag radix tree lookups by passing around active perag
references when possible.

v24.2: rework some of the naming and whatnot so there's less opencoding

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:08:49 +10:00
Dave Chinner 826053db98 xfs: make intent items take a perag reference [v24.5]
Now that we've cleaned up some code warts in the deferred work item
 processing code, let's make intent items take an active perag reference
 from their creation until they are finally freed by the defer ops
 machinery.  This change facilitates the scrub drain in the next patchset
 and will make it easier for the future AG removal code to detect a busy
 AG in need of quiescing.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnAAKCRBKO3ySh0YR
 poBzAP9+tx/LNTZeLtmjj/d7tVLMm2/f8LPyhDmkF85JWnjknwEAnLQxkqRMfF9i
 ah3ACAZ30o+Mp7Qe6tnYVIdOSD2xCAM=
 =mRAy
 -----END PGP SIGNATURE-----

Merge tag 'intents-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: make intent items take a perag reference [v24.5]

Now that we've cleaned up some code warts in the deferred work item
processing code, let's make intent items take an active perag reference
from their creation until they are finally freed by the defer ops
machinery.  This change facilitates the scrub drain in the next patchset
and will make it easier for the future AG removal code to detect a busy
AG in need of quiescing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-14 07:08:38 +10:00
Luis Chamberlain f5d2b92c85 xfs: simplify two-level sysctl registration for xfs_table
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().

Simplify this registration.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-13 11:49:35 -07:00
Ye Bin 8ee81ed581 xfs: fix BUG_ON in xfs_getbmap()
There's issue as follows:
XFS: Assertion failed: (bmv->bmv_iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c, line: 329
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 14612 Comm: xfs_io Not tainted 6.3.0-rc2-next-20230315-00006-g2729d23ddb3b-dirty #422
RIP: 0010:assfail+0x96/0xa0
RSP: 0018:ffffc9000fa178c0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff888179a18000
RDX: 0000000000000000 RSI: ffff888179a18000 RDI: 0000000000000002
RBP: 0000000000000000 R08: ffffffff8321aab6 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed1105f85139 R12: ffffffff8aacc4c0
R13: 0000000000000149 R14: ffff888269f58000 R15: 000000000000000c
FS:  00007f42f27a4740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000b92388 CR3: 000000024f006000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 xfs_getbmap+0x1a5b/0x1e40
 xfs_ioc_getbmap+0x1fd/0x5b0
 xfs_file_ioctl+0x2cb/0x1d50
 __x64_sys_ioctl+0x197/0x210
 do_syscall_64+0x39/0xb0
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Above issue may happen as follows:
         ThreadA                       ThreadB
do_shared_fault
 __do_fault
  xfs_filemap_fault
   __xfs_filemap_fault
    filemap_fault
                             xfs_ioc_getbmap -> Without BMV_IF_DELALLOC flag
			      xfs_getbmap
			       xfs_ilock(ip, XFS_IOLOCK_SHARED);
			       filemap_write_and_wait
 do_page_mkwrite
  xfs_filemap_page_mkwrite
   __xfs_filemap_fault
    xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
    iomap_page_mkwrite
     ...
     xfs_buffered_write_iomap_begin
      xfs_bmapi_reserve_delalloc -> Allocate delay extent
                              xfs_ilock_data_map_shared(ip)
	                      xfs_getbmap_report_one
			       ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0)
	                        -> trigger BUG_ON

As xfs_filemap_page_mkwrite() only hold XFS_MMAPLOCK_SHARED lock, there's
small window mkwrite can produce delay extent after file write in xfs_getbmap().
To solve above issue, just skip delalloc extents.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-12 15:49:44 +10:00
Darrick J. Wong 22ed903eee xfs: verify buffer contents when we skip log replay
syzbot detected a crash during log recovery:

XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
XFS (loop0): Starting recovery (logdev: internal)
==================================================================
BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074

CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
 print_address_description+0x74/0x340 mm/kasan/report.c:306
 print_report+0x107/0x1f0 mm/kasan/report.c:417
 kasan_report+0xcd/0x100 mm/kasan/report.c:517
 xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
 xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913
 xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713
 xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953
 xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946
 xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930
 xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493
 xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829
 xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933
 xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666
 get_tree_bdev+0x400/0x620 fs/super.c:1282
 vfs_get_tree+0x88/0x270 fs/super.c:1489
 do_new_mount+0x289/0xad0 fs/namespace.c:3145
 do_mount fs/namespace.c:3488 [inline]
 __do_sys_mount fs/namespace.c:3697 [inline]
 __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89fa3f4aca
Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca
RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10
RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d
R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004
R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50
 </TASK>

The fuzzed image contains an AGF with an obviously garbage
agf_refcount_level value of 32, and a dirty log with a buffer log item
for that AGF.  The ondisk AGF has a higher LSN than the recovered log
item.  xlog_recover_buf_commit_pass2 reads the buffer, compares the
LSNs, and decides to skip replay because the ondisk buffer appears to be
newer.

Unfortunately, the ondisk buffer is corrupt, but recovery just read the
buffer with no buffer ops specified:

	error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno,
			buf_f->blf_len, buf_flags, &bp, NULL);

Skipping the buffer leaves its contents in memory unverified.  This sets
us up for a kernel crash because xfs_refcount_recover_cow_leftovers
reads the buffer (which is still around in XBF_DONE state, so no read
verification) and creates a refcountbt cursor of height 32.  This is
impossible so we run off the end of the cursor object and crash.

Fix this by invoking the verifier on all skipped buffers and aborting
log recovery if the ondisk buffer is corrupt.  It might be smarter to
force replay the log item atop the buffer and then see if it'll pass the
write verifier (like ext4 does) but for now let's go with the
conservative option where we stop immediately.

Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-12 15:49:23 +10:00
Darrick J. Wong c95356ca88 xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done
While fuzzing the data fork extent count on a btree-format directory
with xfs/375, I observed the following (excerpted) splat:

XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
------------[ cut here ]------------
WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
Call Trace:
 <TASK>
 xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 __x64_sys_ioctl+0x82/0xa0
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

The cause of this is a race condition in xfs_ilock_data_map_shared,
which performs an unlocked access to the data fork to guess which lock
mode it needs:

Thread 0                          Thread 1

xfs_need_iread_extents
<observe no iext tree>
xfs_ilock(..., ILOCK_EXCL)
xfs_iread_extents
<observe no iext tree>
<check ILOCK_EXCL>
<load bmbt extents into iext>
<notice iext size doesn't
 match nextents>
                                  xfs_need_iread_extents
                                  <observe iext tree>
                                  xfs_ilock(..., ILOCK_SHARED)
<tear down iext tree>
xfs_iunlock(..., ILOCK_EXCL)
                                  xfs_iread_extents
                                  <observe no iext tree>
                                  <check ILOCK_EXCL>
                                  *BOOM*

Fix this race by adding a flag to the xfs_ifork structure to indicate
that we have not yet read in the extent records and changing the
predicate to look at the flag state, not if_height.  The memory barrier
ensures that the flag will not be set until the very end of the
function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-12 15:49:10 +10:00
Dave Chinner 4b827b3f30 xfs: remove WARN when dquot cache insertion fails
It just creates unnecessary bot noise these days.

Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-12 15:48:59 +10:00
Dave Chinner aa88019851 xfs: don't consider future format versions valid
In commit fe08cc5044 we reworked the valid superblock version
checks. If it is a V5 filesystem, it is always valid, then we
checked if the version was less than V4 (reject) and then checked
feature fields in the V4 flags to determine if it was valid.

What we missed was that if the version is not V4 at this point,
we shoudl reject the fs. i.e. the check current treats V6+
filesystems as if it was a v4 filesystem. Fix this.

cc: stable@vger.kernel.org
Fixes: fe08cc5044 ("xfs: open code sb verifier feature checks")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-12 15:48:50 +10:00
Darrick J. Wong 7ba83850ca xfs: deprecate the ascii-ci feature
This feature is a mess -- the hash function has been broken for the
entire 15 years of its existence if you create names with extended ascii
bytes; metadump name obfuscation has silently failed for just as long;
and the feature clashes horribly with the UTF8 encodings that most
systems use today.  There is exactly one fstest for this feature.

In other words, this feature is crap.  Let's deprecate it now so we can
remove it from the codebase in 2030.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-04-11 19:05:19 -07:00
Darrick J. Wong 6db09a8d03 xfs: test the ascii case-insensitive hash
Now that we've made kernel and userspace use the same tolower code for
computing directory index hashes, add that to the selftest code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-04-11 19:05:05 -07:00
Darrick J. Wong a9248538fa xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation
Back in the old days, the "ascii-ci" feature was created to implement
case-insensitive directory entry lookups for latin1-encoded names and
remove the large overhead of Samba's case-insensitive lookup code.  UTF8
names were not allowed, but nobody explicitly wrote in the documentation
that this was only expected to work if the system used latin1 names.
The kernel tolower function was selected to prepare names for hashed
lookups.

There's a major discrepancy in the function that computes directory entry
hashes for filesystems that have ASCII case-insensitive lookups enabled.
The root of this is that the kernel and glibc's tolower implementations
have differing behavior for extended ASCII accented characters.  I wrote
a program to spit out characters for which the tolower() return value is
different from the input:

glibc tolower:
65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N
79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z

kernel tolower:
65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N
79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z 192:À 193:Á
194:Â 195:Ã 196:Ä 197:Å 198:Æ 199:Ç 200:È 201:É 202:Ê 203:Ë 204:Ì 205:Í
206:Î 207:Ï 208:Ð 209:Ñ 210:Ò 211:Ó 212:Ô 213:Õ 214:Ö 215:× 216:Ø 217:Ù
218:Ú 219:Û 220:Ü 221:Ý 222:Þ

Which means that the kernel and userspace do not agree on the hash value
for a directory filename that contains those higher values.  The hash
values are written into the leaf index block of directories that are
larger than two blocks in size, which means that xfs_repair will flag
these directories as having corrupted hash indexes and rewrite the index
with hash values that the kernel now will not recognize.

Because the ascii-ci feature is not frequently enabled and the kernel
touches filesystems far more frequently than xfs_repair does, fix this
by encoding the kernel's toupper predicate and tolower functions into
libxfs.  Give the new functions less provocative names to make it really
obvious that this is a pre-hash name preparation function, and nothing
else.  This change makes userspace's behavior consistent with the
kernel.

Found by auditing obfuscate_name in xfs_metadump as part of working on
parent pointers, wondering how it could possibly work correctly with ci
filesystems, writing a test tool to create a directory with
hash-colliding names, and watching xfs_repair flag it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-04-11 19:05:04 -07:00
Darrick J. Wong 4f5e304248 xfs: cross-reference rmap records with refcount btrees
Strengthen the rmap btree record checker a little more by comparing
OWN_REFCBT reverse mappings against the refcount btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:39 -07:00
Darrick J. Wong 0abe6fc53b xfs: cross-reference rmap records with inode btrees
Strengthen the rmap btree record checker a little more by comparing
OWN_INOBT reverse mappings against the inode btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:39 -07:00
Darrick J. Wong 3a3108ea8c xfs: cross-reference rmap records with free space btrees
Strengthen the rmap btree record checker a little more by comparing
OWN_AG reverse mappings against the free space btrees, the rmap btree,
and the AGFL.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:38 -07:00
Darrick J. Wong fed050f345 xfs: cross-reference rmap records with ag btrees
Strengthen the rmap btree record checker a little more by comparing
OWN_FS and OWN_LOG reverse mappings against the AG headers and internal
logs, respectively.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:38 -07:00
Darrick J. Wong a47bd1e0e6 xfs: introduce bitmap type for AG blocks
Create a typechecked bitmap for extents within an AG.  Online repair
uses bitmaps to store various different types of numbers, so let's make
it obvious when we're storing xfs_agblock_t (and later xfs_fsblock_t)
versus anything else.

In subsequent patches, we're going to use agblock bitmaps to enhance the
rmapbt checker to look for discrepancies between the rmapbt records and
AG metadata block usage.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:37 -07:00
Darrick J. Wong 6772fcc889 xfs: convert xbitmap to interval tree
Convert the xbitmap code to use interval trees instead of linked lists.
This reduces the amount of coding required to handle the disunion
operation and in the future will make it easier to set bits in arbitrary
order yet later be able to extract maximally sized extents, which we'll
need for rebuilding certain structures.  We define our own interval tree
type so that it can deal with 64-bit indices even on 32-bit machines.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:36 -07:00
Darrick J. Wong 7296a6d6fb xfs: drop the _safe behavior from the xbitmap foreach macro
It's not safe to edit bitmap intervals while we're iterating them with
for_each_xbitmap_extent.  None of the existing callers actually need
that ability anyway, so drop the safe variable.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:36 -07:00
Darrick J. Wong 44af6c7e59 xfs: don't load local xattr values during scrub
Local extended attributes store their values within the same leaf block.
There's no header for the values themselves, nor are they separately
checksummed.  Hence we can save a bit of time in the attr scrubber by
not wasting time retrieving the values.

Regrettably, shortform attributes do not set XFS_ATTR_LOCAL so this
offers us no advantage there, but at least there are very few attrs in
that case.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:35 -07:00
Darrick J. Wong 178b48d588 xfs: remove the for_each_xbitmap_ helpers
Remove the for_each_xbitmap_ macros in favor of proper iterator
functions.  We'll soon be switching this data structure over to an
interval tree implementation, which means that we can't allow callers to
modify the bitmap during iteration without telling us.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:35 -07:00
Darrick J. Wong 674f0d0dc6 xfs: only allocate free space bitmap for xattr scrub if needed
The free space bitmap is only required if we're going to check the
bestfree space at the end of an xattr leaf block.  Therefore, we can
reduce the memory requirements of this scrubber if we can determine that
the xattr is in short format.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:34 -07:00
Darrick J. Wong 6cee51e6d0 xfs: clean up xattr scrub initialization
Clean up local variable initialization and error returns in xchk_xattr.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:34 -07:00
Darrick J. Wong ae0506eba7 xfs: check used space of shortform xattr structures
Make sure that the records used inside a shortform xattr structure do
not overlap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:33 -07:00
Darrick J. Wong 5b02a3e839 xfs: move xattr scrub buffer allocation to top level function
Move the xchk_setup_xattr_buf call from xchk_xattr_block to xchk_xattr,
since we only need to set up the leaf block bitmaps once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:32 -07:00
Darrick J. Wong f58977edc0 xfs: remove flags argument from xchk_setup_xattr_buf
All callers pass XCHK_GFP_FLAGS as the flags argument to
xchk_setup_xattr_buf, so get rid of the argument.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:32 -07:00
Darrick J. Wong b996c9a806 xfs: split valuebuf from xchk_xattr_buf.buf
Move the xattr value buffer from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:31 -07:00
Darrick J. Wong 80069284b5 xfs: split usedmap from xchk_xattr_buf.buf
Move the used space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:31 -07:00
Darrick J. Wong 91781ff549 xfs: split freemap from xchk_xattr_buf.buf
Move the free space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.  This is the start of removing the complex overloaded
memory buffer that is the source of weird memory misuse bugs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:30 -07:00
Darrick J. Wong 4cb7602520 xfs: remove unnecessary dstmap in xattr scrubber
Replace bitmap_and with bitmap_intersects in the xattr leaf block
scrubber, since we only care if there's overlap between the used space
bitmap and the free space bitmap.  This means we don't need dstmap any
more, and can thus reduce the memory requirements.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:30 -07:00
Darrick J. Wong ee366fe4f5 xfs: don't shadow @leaf in xchk_xattr_block
Don't shadow the leaf variable here, because it's misleading to have one
place in the codebase where two variables with different types have the
same name.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:29 -07:00
Darrick J. Wong c12ad41468 xfs: xattr scrub should ensure one namespace bit per name
Check that each extended attribute exists in only one namespace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:29 -07:00
Darrick J. Wong 1c1646afc9 xfs: check for reverse mapping records that could be merged
Enhance the rmap scrubber to flag adjacent records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:28 -07:00
Darrick J. Wong 29ab991b4f xfs: check overlapping rmap btree records
The rmap btree scrubber doesn't contain sufficient checking for records
that cannot overlap but do anyway.  For the other btrees, this is
enforced by the inorder checks in xchk_btree_rec, but the rmap btree is
special because it allows overlapping records to handle shared data
extents.

Therefore, enhance the rmap btree record check function to compare each
record against the previous one so that we can detect overlapping rmap
records for space allocations that do not allow sharing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:27 -07:00
Darrick J. Wong db0502b39c xfs: flag refcount btree records that could be merged
Complain if we encounter refcount btree records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:27 -07:00
Darrick J. Wong 1e59fdb7d6 xfs: don't call xchk_bmap_check_rmaps for btree-format file forks
The logic at the end of xchk_bmap_want_check_rmaps tries to detect a
file fork that has been zapped by what will become the online inode
repair code.  Zapped forks are in FMT_EXTENTS with zero extents, and
some sort of hint that there's supposed to be data somewhere in the
filesystem.

Unfortunately, the inverted logic here is confusing and has the effect
that we always call xchk_bmap_check_rmaps for FMT_BTREE forks.  This is
horribly inefficient and unnecessary, so invert the logic to get rid of
this performance problem.  This has caused 8h delays in generic/333 and
generic/334.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:26 -07:00
Darrick J. Wong d5784ae827 xfs: flag free space btree records that could be merged
Complain if we encounter free space btree records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:26 -07:00
Darrick J. Wong e8882f69b9 xfs: split the xchk_bmap_check_rmaps into a predicate
This function has two parts: the second part scans every reverse mapping
record for this file fork to make sure that there's a corresponding
mapping in the fork, and the first part decides if we even want to do
that.

Split the first part into a separate predicate so that we can make more
changes to it in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:25 -07:00
Darrick J. Wong 336642f792 xfs: alert the user about data/attr fork mappings that could be merged
If the data or attr forks have mappings that could be merged, let the
user know that the structure could be optimized.  This isn't a
filesystem corruption since the regular filesystem does not try to be
smart about merging bmbt records.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:25 -07:00
Darrick J. Wong c0d5a92f7a xfs: split xchk_bmap_xref_rmap into two functions
There's more special-cased functionality than not in this function.
Split it into two so that each can be far more cohesive.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:24 -07:00
Darrick J. Wong 634d4a79e7 xfs: accumulate iextent records when checking bmap
Currently, the bmap scrubber checks file fork mappings individually.  In
the case that the file uses multiple mappings to a single contiguous
piece of space, the scrubber repeatedly locks the AG to check the
existence of a reverse mapping that overlaps this file mapping.  If the
reverse mapping starts before or ends after the mapping we're checking,
it will also crawl around in the bmbt checking correspondence for
adjacent extents.

This is not very time efficient because it does the crawling while
holding the AGF buffer, and checks the middle mappings multiple times.
Instead, create a custom iextent record iterator function that combines
multiple adjacent allocated mappings into one large incore bmbt record.
This is feasible because the incore bmbt record length is 64-bits wide.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:24 -07:00
Darrick J. Wong 971ee3a670 xfs: change bmap scrubber to store the previous mapping
Convert the inode data/attr/cow fork scrubber to remember the entire
previous mapping, not just the next expected offset.  No behavior
changes here, but this will enable some better checking in subsequent
patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:23 -07:00
Darrick J. Wong 1fc7a0597d xfs: don't take the MMAPLOCK when scrubbing file metadata
The MMAPLOCK stabilizes mappings in a file's pagecache.  Therefore, we
do not need it to check directories, symlinks, extended attributes, or
file-based metadata.  Reduce its usage to the one case that requires it,
which is when we want to scrub the data fork of a regular file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:22 -07:00
Darrick J. Wong 38bb131084 xfs: retain the AGI when we can't iget an inode to scrub the core
xchk_get_inode is not quite the right function to be calling from the
inode scrubber setup function.  The common get_inode function either
gets an inode and installs it in the scrub context, or it returns an
error code explaining what happened.  This is acceptable for most file
scrubbers because it is not in their scope to fix corruptions in the
inode core and fork areas that cause iget to fail.

Dealing with these problems is within the scope of the inode scrubber,
however.  If iget fails with EFSCORRUPTED, we need to xchk_inode to flag
that as corruption.  Since we can't get our hands on an incore inode, we
need to hold the AGI to prevent inode allocation activity so that
nothing changes in the inode metadata.

Looking ahead to the inode core repair patches, we will also need to
hold the AGI buffer into xrep_inode so that we can make modifications to
the xfs_dinode structure without any other thread swooping in to
allocate or free the inode.

Adapt the xchk_get_inode into xchk_setup_inode since this is a one-off
use case where the error codes we check for are a little different, and
the return state is much different from the common function.

xchk_setup_inode prepares to check or repair an inode record, so it must
continue the scrub operation even if the inode/inobt verifiers cause
xfs_iget to return EFSCORRUPTED.  This is done by attaching the locked
AGI buffer to the scrub transaction and returning 0 to move on to the
actual scrub.  (Later, the online inode repair code will also want the
xfs_imap structure so that it can reset the ondisk xfs_dinode
structure.)

xchk_get_inode retrieves an inode on behalf of a scrubber that operates
on an incore inode -- data/attr/cow forks, directories, xattrs,
symlinks, parent pointers, etc.  If the inode/inobt verifiers fail and
xfs_iget returns EFSCORRUPTED, we want to exit to userspace (because the
caller should be fix the inode first) and drop everything we acquired
along the way.

A behavior common to both functions is that it's possible that xfs_scrub
asked for a scrub-by-handle concurrent with the inode being freed or the
passed-in inumber is invalid.  In this case, we call xfs_imap to see if
the inobt index thinks the inode is allocated, and return ENOENT
("nothing to check here") to userspace if this is not the case.  The
imap lookup is why both functions call xchk_iget_agi.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:22 -07:00
Darrick J. Wong 46e0dd8965 xfs: rename xchk_get_inode -> xchk_iget_for_scrubbing
Dave Chinner suggested renaming this function to make more obvious what
it does.  The function returns an incore inode to callers that want to
scrub a metadata structure that hangs off an inode.  If the iget fails
with EINVAL, it will single-step the loading process to distinguish
between actually free inodes or impossible inumbers (ENOENT);
discrepancies between the inobt freemask and the free status in the
inode record (EFSCORRUPTED).  Any other negative errno is returned
unchanged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:21 -07:00
Darrick J. Wong 302436c27c xfs: fix an inode lookup race in xchk_get_inode
In commit d658e, we tried to improve the robustnes of xchk_get_inode in
the face of EINVAL returns from iget by calling xfs_imap to see if the
inobt itself thinks that the inode is allocated.  Unfortunately, that
commit didn't consider the possibility that the inode gets allocated
after iget but before imap.  In this case, the imap call will succeed,
but we turn that into a corruption error and tell userspace the inode is
corrupt.

Avoid this false corruption report by grabbing the AGI header and
retrying the iget before calling imap.  If the iget succeeds, we can
proceed with the usual scrub-by-handle code.  Fix all the incorrect
comments too, since unreadable/corrupt inodes no longer result in EINVAL
returns.

Fixes: d658e72b4a ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:21 -07:00
Darrick J. Wong 0916056eba xfs: fix parent pointer scrub racing with subdirectory reparenting
Jan Kara pointed out that rename() doesn't lock a subdirectory that is
being moved from one parent to another, even though the move requires an
update to the subdirectory's dotdot entry.  This means that it's *not*
sufficient to hold a directory's IOLOCK to stabilize the dotdot entry.
We must hold the ILOCK of both the child and the alleged parent, and
there's no use in holding the parent's IOLOCK.

With that in mind, we can get rid of all the messy code that tries to
grab the parent's IOLOCK, which means we don't need to let go of the
ILOCK of the directory whose parent we are checking.  We still have to
use nonblocking mode to take the ILOCK of the alleged parent, so the
revalidation loop has to stay.

However, we can remove the retry counter, since threads aren't supposed
to hold the ILOCK for long periods of time.  Remove the inverted ilock
helper from the common code since nobody uses it.  Remove the entire
source of -EDEADLOCK-based "retry harder" scrub executions.

Link: https://lore.kernel.org/linux-xfs/20230117123735.un7wbamlbdihninm@quack3/
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:20 -07:00
Darrick J. Wong a03297a0ca xfs: manage inode DONTCACHE status at irele time
Right now, there are statements scattered all over the online fsck
codebase about how we can't use XFS_IGET_DONTCACHE because of concerns
about scrub's unusual practice of releasing inodes with transactions
held.

However, iget is the wrong place to handle this -- the DONTCACHE state
doesn't matter at all until we try to *release* the inode, and here we
get things wrong in multiple ways:

First, if we /do/ have a transaction, we must NOT drop the inode,
because the inode could have dirty pages, dropping the inode will
trigger writeback, and writeback can trigger a nested transaction.

Second, if the inode already had an active reference and the DONTCACHE
flag set, the icache hit when scrub grabs another ref will not clear
DONTCACHE.  This is sort of by design, since DONTCACHE is now used to
initiate cache drops so that sysadmins can change a file's access mode
between pagecache and DAX.

Third, if we do actually have the last active reference to the inode, we
can set DONTCACHE to avoid polluting the cache.  This is the /one/ case
where we actually want that flag.

Create an xchk_irele helper to encode all that logic and switch the
online fsck code to use it.  Since this now means that nearly all
scrubbers use the same xfs_iget flags, we can wrap them too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:20 -07:00
Darrick J. Wong b049962c0f xfs: simplify xchk_parent_validate
This function is unnecessarily long because it contains code to
revalidate a dotdot entry after cycling locks to try to confirm a
subdirectory parent pointer.  Shorten the codebase by making the
parent's lookup call do double duty as the revalidation code.

This weakeans the efficacy of this scrub function temporarily, but the
next patch will resolve this as part of fixing an unhandled race that is
the result of the VFS rename locking model not working the way Darrick
thought it did.

Rename this stupid 'dnum' variable too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:19 -07:00
Darrick J. Wong cbab28f4c0 xfs: remove xchk_parent_count_parent_dentries
This helper is now trivial, so get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:19 -07:00
Darrick J. Wong 6bb9209cee xfs: always check the existence of a dirent's child inode
When we're scrubbing directory entries, we always need to iget the child
inode to make sure that the inode pointer points to a valid inode.  The
original directory scrub code (commit a5c4) only set us up to do this
for ftype=1 filesystems, which is not sufficient; and then commit 4b80
made it worse by exempting the dot and dotdot entries.

Sorta-fixes: a5c46e5e89 ("xfs: scrub directory metadata")
Sorta-fixes: 4b80ac6445 ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:18 -07:00
Darrick J. Wong d9a94480f9 xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED
In commit 4b80ac6445, we tried to strengthen the directory scrubber by
using the iget call to detect directory entries that point to
unallocated inodes.  Unfortunately, that commit neglected to pass
XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first.
If the inode number points to something that isn't even an inode
cluster, iget will throw corruption errors and return -EFSCORRUPTED,
which means that we fail to mark the directory corrupt.

Fixes: 4b80ac6445 ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:17 -07:00
Darrick J. Wong 4c233b5c4f xfs: streamline the directory iteration code for scrub
Currently, online scrub reuses the xfs_readdir code to walk every entry
in a directory.  This isn't awesome for performance, since we end up
cycling the directory ILOCK needlessly and coding around the particular
quirks of the VFS dir_context interface.

Create a streamlined version of readdir that keeps the ILOCK (since the
walk function isn't going to copy stuff to userspace), skips a whole lot
of directory walk cursor checks (since we start at 0 and walk to the
end) and has a sane way to return error codes.

Note: Porting the dotdot checking code is left for a subsequent patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:17 -07:00
Darrick J. Wong 30f8ee5e7e xfs: ensure that single-owner file blocks are not owned by others
For any file fork mapping that can only have a single owner, make sure
that there are no other rmap owners for that mapping.  This patch
requires the more detailed checking provided by xfs_rmap_count_owners so
that we can know how many rmap records for a given range of space had a
matching owner, how many had a non-matching owner, and how many
conflicted with the records that have a matching owner.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:16 -07:00
Darrick J. Wong 9dceccc582 xfs: use the directory name hash function for dir scrubbing
The directory code has a directory-specific hash computation function
that includes a modified hash function for case-insensitive lookups.
Hence we must use that function (and not the raw da_hashname) when
checking the dabtree structure.

Found by accidentally breaking xfs/188 to create an abnormally huge
case-insensitive directory and watching scrub break.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:16 -07:00
Darrick J. Wong efc0845f5d xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results
Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill
scan results because for a given range of inode numbers, we might have
no indexed inodes at all; the entire region might be allocated ondisk
inodes; or there might be a mix of the two.

Unfortunately, sparse inodes adds to the complexity, because each inode
record can have holes, which means that we cannot use the generic btree
_scan_keyfill function because we must look for holes in individual
records to decide the result.  On the plus side, online fsck can now
detect sub-chunk discrepancies in the inobt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:15 -07:00
Darrick J. Wong 69115f775f xfs: teach scrub to check for sole ownership of metadata objects
Strengthen online scrub's checking even further by enabling us to check
that a range of blocks are owned solely by a given owner.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:15 -07:00
Darrick J. Wong bc0f3b5546 xfs: directly cross-reference the inode btrees with each other
Improve the cross-referencing of the two inode btrees by directly
checking the free and hole state of each inode with the other btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:14 -07:00
Darrick J. Wong c01868b60e xfs: clean up broken eearly-exit code in the inode btree scrubber
Corrupt inode chunks should cause us to exit early after setting the
CORRUPT flag on the scrub state.  While we're at it, collapse trivial
helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:13 -07:00
Darrick J. Wong cc1207662d xfs: remove pointless shadow variable from xfs_difree_inobt
In xfs_difree_inobt, the pag passed in was previously used to look up
the AGI buffer.  There's no need to extract it again, so remove the
shadow variable and shut up -Wshadow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:13 -07:00
Darrick J. Wong 7ac14fa2bd xfs: ensure that all metadata and data blocks are not cow staging extents
Make sure that all filesystem metadata blocks and file data blocks are
not also marked as CoW staging extents.  The extra checking added here
was inspired by an actual VM host filesystem corruption incident due to
bugs in the CoW handling of 4.x kernels.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:12 -07:00
Darrick J. Wong 7ad9ea6398 xfs: check the reference counts of gaps in the refcount btree
Gaps in the reference count btree are also significant -- for these
regions, there must not be any overlapping reverse mappings.  We don't
currently check this, so make the refcount scrubber more complete.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:12 -07:00
Darrick J. Wong 4a200a0978 xfs: implement masked btree key comparisons for _has_records scans
For keyspace fullness scans, we want to be able to mask off the parts of
the key that we don't care about.  For most btree types we /do/ want the
full keyspace, but for checking that a given space usage also has a full
complement of rmapbt records (even if different/multiple owners) we need
this masking so that we only track sparseness of rm_startblock, not the
whole keyspace (which is extremely sparse).

Augment the ->diff_two_keys and ->keys_contiguous helpers to take a
third union xfs_btree_key argument, and wire up xfs_rmap_has_records to
pass this through.  This third "mask" argument should contain a nonzero
value in each structure field that should be used in the key comparisons
done during the scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:11 -07:00
Darrick J. Wong 6abc7aef85 xfs: replace xfs_btree_has_record with a general keyspace scanner
The current implementation of xfs_btree_has_record returns true if it
finds /any/ record within the given range.  Unfortunately, that's not
sufficient for scrub.  We want to be able to tell if a range of keyspace
for a btree is devoid of records, is totally mapped to records, or is
somewhere in between.  By forcing this to be a boolean, we conflated
sparseness and fullness, which caused scrub to return incorrect results.
Fix the API so that we can tell the caller which of those three is the
current state.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:10 -07:00
Darrick J. Wong bd7e795108 xfs: refactor ->diff_two_keys callsites
Create wrapper functions around ->diff_two_keys so that we don't have to
remember what the return values mean, and adjust some of the code
comments to reflect the longtime code behavior.  We're going to
introduce more uses of ->diff_two_keys in the next patch, so reduce the
cognitive load for readers by doing this refactoring now.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:10 -07:00
Darrick J. Wong 2bea8df0a5 xfs: always scrub record/key order of interior records
In commit d47fef9342, we removed the firstrec and firstkey fields of
struct xchk_btree because Christoph thought they were unnecessary
because we could use the record index in the btree cursor.  This is
incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor
position within a specific btree block, not within the entire level.

The end result is that scrub no longer detects situations where the
rightmost record of a block is identical to the leftmost record of that
block's right sibling.  Fix this regression by reintroducing record
validity booleans so that order checking skips *only* the leftmost
record/key in each level.

Fixes: d47fef9342 ("xfs: don't track firstrec/firstkey separately in xchk_btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:09 -07:00
Darrick J. Wong ee5fe8ff6d xfs: refactor converting btree irec to btree key
We keep doing these conversions to support btree queries, so refactor
this into a helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:09 -07:00
Darrick J. Wong c99f99fa3e xfs: check btree keys reflect the child block
When scrub is checking a non-root btree block, it should make sure that
the keys in the parent btree block accurately capture the keyspace that
the child block stores.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:08 -07:00
Darrick J. Wong 38384569a2 xfs: detect unwritten bit set in rmapbt node block keys
In the last patch, we changed the rmapbt code to remove the UNWRITTEN
bit when creating an rmapbt key from an rmapbt record, and we changed
the rmapbt key comparison code to start considering the ATTR and BMBT
flags during lookup.  This brought the behavior of the rmapbt
implementation in line with its specification.

However, there may exist filesystems that have the unwritten bit still
set in the rmapbt keys.  We should detect these situations and flag the
rmapbt as one that would benefit from optimization.  Eventually, online
repair will be able to do something in response to this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:07 -07:00
Darrick J. Wong 08c987deca xfs: fix rm_offset flag handling in rmap keys
Keys for extent interval records in the reverse mapping btree are
supposed to be computed as follows:

(physical block, owner, fork, is_btree, offset)

This provides users the ability to look up a reverse mapping from a file
block mapping record -- start with the physical block; then if there are
multiple records for the same block, move on to the owner; then the
inode fork type; and so on to the file offset.

Unfortunately, the code that creates rmap lookup keys from rmap records
forgot to mask off the record attribute flags, leading to ondisk keys
that look like this:

(physical block, owner, fork, is_btree, unwritten state, offset)

Fortunately, this has all worked ok for the past six years because the
key comparison functions incorrectly ignore the fork/bmbt/unwritten
information that's encoded in the on-disk offset.  This means that
lookup comparisons are only done with:

(physical block, owner, offset)

Queries can (theoretically) return incorrect results because of this
omission.  On consistent filesystems this isn't an issue because xattr
and bmbt blocks cannot be shared and hence the comparisons succeed
purely on the contents of the rm_startblock field.  For the one case
where we support sharing (written data fork blocks) all flag bits are
zero, so the omission in the comparison has no ill effects.

Unfortunately, this bug prevents scrub from detecting incorrect fork and
bmbt flag bits in the rmap btree, so we really do need to fix the
compare code.  Old filesystems with the unwritten bit erroneously set in
the rmap key struct will work fine on new kernels since we still ignore
the unwritten bit.  New filesystems on older kernels will work fine
since the old kernels never paid attention to the unwritten bit.

A previous version of this patch forgot to keep the (un)written state
flag masked during the comparison and caused a major regression in
5.9.x since unwritten extent conversion can update an rmap record
without requiring key updates.

Note that blocks cannot go directly from data fork to attr fork without
being deallocated and reallocated, nor can they be added to or removed
from a bmbt without a free/alloc cycle, so this should not cause any
regressions.

Found by fuzzing keys[1].attrfork = ones on xfs/371.

Fixes: 4b8ed67794 ("xfs: add rmap btree operations")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:07 -07:00
Darrick J. Wong de1a9ce225 xfs: hoist inode record alignment checks from scrub
Move the inobt record alignment checks from xchk_iallocbt_rec into
xfs_inobt_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:06 -07:00
Darrick J. Wong e774b2ea0b xfs: hoist rmap record flag checks from scrub
Move the rmap record flag checks from xchk_rmapbt_rec into
xfs_rmap_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:06 -07:00
Darrick J. Wong 6a3bd8fcf9 xfs: complain about bad file mapping records in the ondisk bmbt
Similar to what we've just done for the other btrees, create a function
to log corrupt bmbt records and call it whenever we encounter a bad
record in the ondisk btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:05 -07:00
Darrick J. Wong 7d7d6d2fd0 xfs: hoist rmap record flag checks from scrub
Move the rmap record flag checks from xchk_rmapbt_rec into
xfs_rmap_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:05 -07:00
Darrick J. Wong ee12eaaa43 xfs: complain about bad records in query_range helpers
For every btree type except for the bmbt, refactor the code that
complains about bad records into a helper and make the ->query_range
helpers call it so that corruptions found via that avenue are logged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:04 -07:00
Darrick J. Wong 69010fe3ac xfs: standardize ondisk to incore conversion for bmap btrees
Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent
and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:04 -07:00
Darrick J. Wong c4e34172da xfs: standardize ondisk to incore conversion for rmap btrees
Create a xfs_rmap_check_irec function to detect corruption in btree
records.  Fix all xfs_rmap_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:03 -07:00
Darrick J. Wong 39ab26d59f xfs: return a failure address from xfs_rmap_irec_offset_unpack
Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED.
Change this function to return the code address of a failed conversion
in preparation for the next patch, which standardizes localized record
checking and reporting code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:02 -07:00
Darrick J. Wong 2b30cc0bf0 xfs: standardize ondisk to incore conversion for refcount btrees
Create a xfs_refcount_check_irec function to detect corruption in btree
records.  Fix all xfs_refcount_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:02 -07:00
Darrick J. Wong 366a0b8d49 xfs: standardize ondisk to incore conversion for inode btrees
Create a xfs_inobt_check_irec function to detect corruption in btree
records.  Fix all xfs_inobt_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:01 -07:00
Darrick J. Wong 35e3b9a117 xfs: standardize ondisk to incore conversion for free space btrees
Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to
an incore record, and a xfs_alloc_check_irec function to detect
corruption.  Replace all the open-coded logic with calls to the new
helpers and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:01 -07:00
Darrick J. Wong 88accf1722 xfs: scrub should use ECHRNG to signal that the drain is needed
In the previous patch, we added jump labels to the intent drain code so
that regular filesystem operations need not pay the price of checking
for someone (scrub) waiting on intents to drain from some part of the
filesystem when that someone isn't running.

However, I observed that xfs/285 now spends a lot more time pushing the
AIL from the inode btree scrubber than it used to.  This is because the
inobt scrubber will try push the AIL to try to get logged inode cores
written to the filesystem when it sees a weird discrepancy between the
ondisk inode and the inobt records.  This AIL push is triggered when the
setup function sees TRY_HARDER is set; and the requisite EDEADLOCK
return is initiated when the discrepancy is seen.

The solution to this performance slow down is to use a different result
code (ECHRNG) for scrub code to signal that it needs to wait for
deferred intent work items to drain out of some part of the filesystem.
When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that
setup functions will activate the jump label.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 19:00:00 -07:00
Darrick J. Wong 466c525d6d xfs: minimize overhead of drain wakeups by using jump labels
To reduce the runtime overhead even further when online fsck isn't
running, use a static branch key to decide if we call wake_up on the
drain.  For compilers that support jump labels, the call to wake_up is
replaced by a nop sled when nobody is waiting for intents to drain.

From my initial microbenchmarking, every transition of the static key
between the on and off states takes about 22000ns to complete; this is
paid entirely by the xfs_scrub process.  When the static key is off
(which it should be when fsck isn't running), the nop sled adds an
overhead of approximately 0.36ns to runtime code.  The post-atomic
lockless waiter check adds about 0.03ns, which is basically free.

For the few compilers that don't support jump labels, runtime code pays
the cost of calling wake_up on an empty waitqueue, which was observed to
be about 30ns.  However, most architectures that have sufficient memory
and CPU capacity to run XFS also support jump labels, so this is not
much of a worry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:59 -07:00
Darrick J. Wong 3f64c718d0 xfs: clean up scrub context if scrub setup returns -EDEADLOCK
It has been a longstanding convention that online scrub and repair
functions can return -EDEADLOCK to signal that they weren't able to
obtain some necessary resource.  When this happens, the scrub framework
is supposed to release all resources attached to the scrub context, set
the TRY_HARDER flag in the scrub context flags, and try again.  In this
context, individual scrub functions are supposed to take all the
resources they (incorrectly) speculated were not necessary.

We're about to make it so that the functions that lock and wait for a
filesystem AG can also return EDEADLOCK to signal that we need to try
again with the drain waiters enabled.  Therefore, refactor
xfs_scrub_metadata to support this behavior for ->setup() functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:59 -07:00
Darrick J. Wong d5c88131db xfs: allow queued AG intents to drain before scrubbing
When a writer thread executes a chain of log intent items, the AG header
buffer locks will cycle during a transaction roll to get from one intent
item to the next in a chain.  Although scrub takes all AG header buffer
locks, this isn't sufficient to guard against scrub checking an AG while
that writer thread is in the middle of finishing a chain because there's
no higher level locking primitive guarding allocation groups.

When there's a collision, cross-referencing between data structures
(e.g. rmapbt and refcountbt) yields false corruption events; if repair
is running, this results in incorrect repairs, which is catastrophic.

Fix this by adding to the perag structure the count of active intents
and make scrub wait until it has both AG header buffer locks and the
intent counter reaches zero.

One quirk of the drain code is that deferred bmap updates also bump and
drop the intent counter.  A fundamental decision made during the design
phase of the reverse mapping feature is that updates to the rmapbt
records are always made by the same code that updates the primary
metadata.  In other words, callers of bmapi functions expect that the
bmapi functions will queue deferred rmap updates.

Some parts of the reflink code queue deferred refcount (CUI) and bmap
(BUI) updates in the same head transaction, but the deferred work
manager completely finishes the CUI before the BUI work is started.  As
a result, the CUI drops the intent count long before the deferred rmap
(RUI) update even has a chance to bump the intent count.  The only way
to keep the intent count elevated between the CUI and RUI is for the BUI
to bump the counter until the RUI has been created.

A second quirk of the intent drain code is that deferred work items must
increment the intent counter as soon as the work item is added to the
transaction.  When a BUI completes and queues an RUI, the RUI must
increment the counter before the BUI decrements it.  The only way to
accomplish this is to require that the counter be bumped as soon as the
deferred work item is created in memory.

In the next patches we'll improve on this facility, but this patch
provides the basic functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:58 -07:00
Darrick J. Wong 9014890304 xfs: add a tracepoint to report incorrect extent refcounts
Add a new tracepoint so that I can see exactly what and where we failed
the refcount check.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:58 -07:00
Darrick J. Wong ecc73f8a58 xfs: update copyright years for scrub/ files
Update the copyright years in the scrub/ source code files.  This isn't
required, but it's helpful to remind myself just how long it's taken to
develop this feature.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:57 -07:00
Darrick J. Wong 739a2fe042 xfs: fix author and spdx headers on scrub/ files
Fix the spdx tags to match current practice, and update the author
contact information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:56 -07:00
Darrick J. Wong 9b2e5a234c xfs: create traced helper to get extra perag references
There are a few places in the XFS codebase where a caller has either an
active or a passive reference to a perag structure and wants to give
a passive reference to some other piece of code.  Btree cursor creation
and inode walks are good examples of this.  Replace the open-coded logic
with a helper to do this.

The new function adds a few safeguards -- it checks that there's at
least one reference to the perag structure passed in, and it records the
refcount bump in the ftrace information.  This makes it much easier to
debug perag refcounting problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:55 -07:00
Darrick J. Wong 00e7b3bac1 xfs: give xfs_refcount_intent its own perag reference
Give the xfs_refcount_intent a passive reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Any space being modified by a
refcount intent is already allocated, so we need to be able to operate
even if the AG is being shrunk or offlined.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:55 -07:00
Darrick J. Wong c13418e8eb xfs: give xfs_rmap_intent its own perag reference
Give the xfs_rmap_intent a passive reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  The space we're (reverse) mapping
is already allocated, so we need to be able to operate even if the AG is
being shrunk or offlined.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:54 -07:00
Darrick J. Wong f6b384631e xfs: give xfs_extfree_intent its own perag reference
Give the xfs_extfree_intent an passive reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  The space being freed must already
be allocated, so we need to able to run even if the AG is being offlined
or shrunk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:54 -07:00
Darrick J. Wong b2ccab3199 xfs: pass per-ag references to xfs_free_extent
Pass a reference to the per-AG structure to xfs_free_extent.  Most
callers already have one, so we can eliminate unnecessary lookups.  The
one exception to this is the EFI code, which the next patch will fix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:53 -07:00
Darrick J. Wong 774a99b47b xfs: give xfs_bmap_intent its own perag reference
Give the xfs_bmap_intent an active reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Later, shrink will use these
passive references to know if an AG is quiesced or not.

The reason why we take a passive ref for a file mapping operation is
simple: we're committing to some sort of action involving space in an
AG, so we want to indicate our interest in that AG.  The space is
already allocated, so we need to be able to operate on AGs that are
offline or being shrunk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11 18:59:53 -07:00
Matthew Wilcox (Oracle) 945ea457b5 xfs: remove xfs_filemap_map_pages() wrapper
Patch series "Prevent ->map_pages from sleeping", v2.

In preparation for a larger patch series which will handle (some, easy)
page faults protected only by RCU, change the two filesystems which have
sleeping locks to not take them and hold the RCU lock around calls to
->map_page to prevent other filesystems from adding sleeping locks.


This patch (of 3):

XFS doesn't actually need to be holding the XFS_MMAPLOCK_SHARED to do
this.  filemap_map_pages() cannot bring new folios into the page cache
and the folio lock is taken during filemap_map_pages() which provides
sufficient protection against a truncation or hole punch.

Link: https://lkml.kernel.org/r/20230327174515.1811532-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230327174515.1811532-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 19:42:59 -07:00
Jens Axboe d8aeb44a9a fs: add FMODE_DIO_PARALLEL_WRITE flag
Some filesystems support multiple threads writing to the same file with
O_DIRECT without requiring exclusive access to it. io_uring can use this
hint to avoid serializing dio writes to this inode, instead allowing them
to run in parallel.

XFS and ext4 both fall into this category, so set the flag for both of
them.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:20 -06:00
Darrick J. Wong 4dfb02d5ca xfs: fix mismerged tracepoints
At some point in between sending this patch to the list and merging it
into for-next, the tracepoints got all mixed up because I've
over-reliant on automated tools not sucking.  The end result is that the
tracepoints are all wrong, so fix them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-24 13:16:01 -07:00
Darrick J. Wong e2e63b071b xfs: clear incore AGFL_RESET state if it's not needed
Prior to commit 7ac2ff8bb3, when we loaded the incore perag structure
with information from the AGF header, we would set or clear the
pagf_agfl_reset field based on whether or not the AGFL list was
misaligned within the block.  IOWs, it's an incore state bit that's
supposed to cache something in the ondisk metadata.  Therefore, the code
still needs to support clearing the incore bit if (somehow) the AGFL
were to correct itself.

It turns out that xfs_repair does exactly this -- phase 4 loads the AGF
to scan the rmapbt for corrupt records, which can set NEEDS_AGFL_RESET.
The scan unsets AGF_INIT but doesn't unset NEEDS_AGFL_RESET.  Phase 5
totally rewrites the AGFL and fixes the alignment problem, didn't clear
NEEDS_AGFL_RESET historically, and reloads the perag state to fix the
freelist.  This results in the AGFL being reset based on stale data,
which then causes the new AGFL blocks to be leaked.  A subsequent
xfs_repair -n then complains about the leaks.

One could argue that phase 5 ought to clear this bit directly when it
reloads the perag AGF data after rewriting the AGFL, but libxfs used to
handle this for us, so it should go back to doing that.

Found by fuzzing flfirst = ones in xfs/352.

Fixes: 7ac2ff8bb3 ("xfs: perags need atomic operational state")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-24 08:40:02 -07:00
Darrick J. Wong fcde88af6a xfs: pass the correct cursor to xfs_iomap_prealloc_size
In xfs_buffered_write_iomap_begin, @icur is the iext cursor for the data
fork and @ccur is the cursor for the cow fork.  Pass in whichever cursor
corresponds to allocfork, because otherwise the xfs_iext_prev_extent
call can use the data fork cursor to walk off the end of the cow fork
structure.  Best case it returns the wrong results, worst case it does
this:

stack segment: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 3141909 Comm: fsstress Tainted: G        W          6.3.0-rc2-xfsx #6.3.0-rc2 7bf5cc2e98997627cae5c930d890aba3aeec65dd
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
RIP: 0010:xfs_iext_prev+0x71/0x150 [xfs]
RSP: 0018:ffffc90002233aa8 EFLAGS: 00010297
RAX: 000000000000000f RBX: 000000000000000e RCX: 000000000000000c
RDX: 0000000000000002 RSI: 000000000000000e RDI: ffff8883d0019ba0
RBP: 989642409af8a7a7 R08: ffffea0000000001 R09: 0000000000000002
R10: 0000000000000000 R11: 000000000000000c R12: ffffc90002233b00
R13: ffff8883d0019ba0 R14: 989642409af8a6bf R15: 000ffffffffe0000
FS:  00007fdf8115f740(0000) GS:ffff88843fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdf8115e000 CR3: 0000000357256000 CR4: 00000000003506e0
Call Trace:
 <TASK>
 xfs_iomap_prealloc_size.constprop.0.isra.0+0x1a6/0x410 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c]
 xfs_buffered_write_iomap_begin+0xa87/0xc60 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c]
 iomap_iter+0x132/0x2f0
 iomap_file_buffered_write+0x92/0x330
 xfs_file_buffered_write+0xb1/0x330 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c]
 vfs_write+0x2eb/0x410
 ksys_write+0x65/0xe0
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Found by xfs/538 in alwayscow mode, but this doesn't seem particular to
that test.

Fixes: 590b16516e ("xfs: refactor xfs_iomap_prealloc_size")
Actually-Fixes: 66ae56a53f ("xfs: introduce an always_cow mode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-24 08:40:01 -07:00
Darrick J. Wong 3cfb9290da xfs: test dir/attr hash when loading module
Back in the 6.2-rc1 days, Eric Whitney reported a fstests regression in
ext4 against generic/454.  The cause of this test failure was the
unfortunate combination of setting an xattr name containing UTF8 encoded
emoji, an xattr hash function that accepted a char pointer with no
explicit signedness, signed type extension of those chars to an int, and
the 6.2 build tools maintainers deciding to mandate -funsigned-char
across the board.  As a result, the ondisk extended attribute structure
written out by 6.1 and 6.2 were not the same.

This discrepancy, in fact, had been noticeable if a filesystem with such
an xattr were moved between any two architectures that don't employ the
same signedness of a raw "char" declaration.  The only reason anyone
noticed is that x86 gcc defaults to signed, and no such -funsigned-char
update was made to e2fsprogs, so e2fsck immediately started reporting
data corruption.

After a day and a half of discussing how to handle this use case (xattrs
with bit 7 set anywhere in the name) without breaking existing users,
Linus merged his own patch and didn't tell the maintainer.  None of the
ext4 developers realized this until AUTOSEL announced that the commit
had been backported to stable.

In the end, this problem could have been detected much earlier if there
had been any useful tests of hash function(s) in use inside ext4 to make
sure that they always produce the same outputs given the same inputs.

The XFS dirent/xattr name hash takes a uint8_t*, so I don't think it's
vulnerable to this problem.  However, let's avoid all this drama by
adding our own self test to check that the da hash produces the same
outputs for a static pile of inputs on various platforms.  This enables
us to fix any breakage that may result in a controlled fashion.  The
buffer and test data are identical to the patches submitted to xfsprogs.

Link: https://lore.kernel.org/linux-ext4/Y8bpkm3jA3bDm3eL@debian-BULLSEYE-live-builder-AMD64/
Link: https://lore.kernel.org/linux-xfs/ZBUKCRR7xvIqPrpX@destitution/T/#md38272cc684e2c0d61494435ccbb91f022e8dee4
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-19 09:55:49 -07:00
Darrick J. Wong e6fbb7167e xfs: add tracepoints for each of the externally visible allocators
There are now five separate space allocator interfaces exposed to the
rest of XFS for five different strategies to find space.  Add
tracepoints for each of them so that I can tell from a trace dump
exactly which ones got called and what happened underneath them.  Add a
sixth so it's more obvious if an allocation actually happened.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-19 09:55:49 -07:00
Darrick J. Wong 9eb775968b xfs: walk all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_ags
Callers of xfs_alloc_vextent_iterate_ags that pass in the TRYLOCK flag
want us to perform a non-blocking scan of the AGs for free space.  There
are no ordering constraints for non-blocking AGF lock acquisition, so
the scan can freely start over at AG 0 even when minimum_agno > 0.

This manifests fairly reliably on xfs/294 on 6.3-rc2 with the parent
pointer patchset applied and the realtime volume enabled.  I observed
the following sequence as part of an xfs_dir_createname call:

0. Fragment the free space, then allocate nearly all the free space in
   all AGs except AG 0.

1. Create a directory in AG 2 and let it grow for a while.

2. Try to allocate 2 blocks to expand the dirent part of a directory.
   The space will be allocated out of AG 0, but the allocation will not
   be contiguous.  This (I think) activates the LOWMODE allocator.

3. The bmapi call decides to convert from extents to bmbt format and
   tries to allocate 1 block.  This allocation request calls
   xfs_alloc_vextent_start_ag with the inode number, which starts the
   scan at AG 2.  We ignore AG 0 (with all its free space) and instead
   scrape AG 2 and 3 for more space.  We find one block, but this now
   kicks t_highest_agno to 3.

4. The createname call decides it needs to split the dabtree.  It tries
   to allocate even more space with xfs_alloc_vextent_start_ag, but now
   we're constrained to AG 3, and we don't find the space.  The
   createname returns ENOSPC and the filesystem shuts down.

This change fixes the problem by making the trylock scan wrap around to
AG 0 if it doesn't like the AGs that it finds.  Since the current
transaction itself holds AGF 0, the trylock of AGF 0 will succeed, and
we take space from the AG that has plenty.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-19 09:55:48 -07:00
Darrick J. Wong 6de4b1ab47 xfs: try to idiot-proof the allocators
In porting his development branch to 6.3-rc1, yours truly has
repeatedly screwed up the args->pag being fed to the xfs_alloc_vextent*
functions.  Add some debugging assertions to test the preconditions
required of the callers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-16 09:32:18 -07:00
Christian Brauner 0c95c025a0
fs: drop unused posix acl handlers
Remove struct posix_acl_{access,default}_handler for all filesystems
that don't depend on the xattr handler in their inode->i_op->listxattr()
method in any way. There's nothing more to do than to simply remove the
handler. It's been effectively unused ever since we introduced the new
posix acl api.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-06 09:57:12 +01:00
Dave Chinner 8ac5b996bf xfs: fix off-by-one-block in xfs_discard_folio()
The recent writeback corruption fixes changed the code in
xfs_discard_folio() to calculate a byte range to for punching
delalloc extents. A mistake was made in using round_up(pos) for the
end offset, because when pos points at the first byte of a block, it
does not get rounded up to point to the end byte of the block. hence
the punch range is short, and this leads to unexpected behaviour in
certain cases in xfs_bmap_punch_delalloc_range.

e.g. pos = 0 means we call xfs_bmap_punch_delalloc_range(0,0), so
there is no previous extent and it rounds up the punch to the end of
the delalloc extent it found at offset 0, not the end of the range
given to xfs_bmap_punch_delalloc_range().

Fix this by handling the zero block offset case correctly.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217030
Link: https://lore.kernel.org/linux-xfs/Y+vOfaxIWX1c%2Fyy9@bfoster/
Fixes: 7348b32233 ("xfs: xfs_bmap_punch_delalloc_range() should take a byte range")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Found-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-05 15:13:23 -08:00
Dave Chinner 0c7273e494 xfs: quotacheck failure can race with background inode inactivation
The background inode inactivation can attached dquots to inodes, but
this can race with a foreground quotacheck failure that leads to
disabling quotas and freeing the mp->m_quotainfo structure. The
background inode inactivation then tries to allocate a quota, tries
to dereference mp->m_quotainfo, and crashes like so:

XFS (loop1): Quotacheck: Unsuccessful (Error -5): Disabling quotas.
xfs filesystem being mounted at /root/syzkaller.qCVHXV/0/file0 supports timestamps until 2038 (0x7fffffff)
BUG: kernel NULL pointer dereference, address: 00000000000002a8
....
CPU: 0 PID: 161 Comm: kworker/0:4 Not tainted 6.2.0-c9c3395d5e3d #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: xfs-inodegc/loop1 xfs_inodegc_worker
RIP: 0010:xfs_dquot_alloc+0x95/0x1e0
....
Call Trace:
 <TASK>
 xfs_qm_dqread+0x46/0x440
 xfs_qm_dqget_inode+0x154/0x500
 xfs_qm_dqattach_one+0x142/0x3c0
 xfs_qm_dqattach_locked+0x14a/0x170
 xfs_qm_dqattach+0x52/0x80
 xfs_inactive+0x186/0x340
 xfs_inodegc_worker+0xd3/0x430
 process_one_work+0x3b1/0x960
 worker_thread+0x52/0x660
 kthread+0x161/0x1a0
 ret_from_fork+0x29/0x50
 </TASK>
....

Prevent this race by flushing all the queued background inode
inactivations pending before purging all the cached dquots when
quotacheck fails.

Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-05 15:13:22 -08:00
Linus Torvalds c0927a7a53 New code for 6.3-rc1, part 2:
* Fix a deadlock in the free space allocator due to the AG-walking
    algorithm forgetting to follow AG-order locking rules.
  * Make the inode allocator prefer existing free inodes instead of
    failing to allocate new inode chunks when free space is low.
  * Set minleft correctly when setting allocator parameters for bmap
    changes.
  * Fix uninitialized variable access in the getfsmap code.
  * Make a distinction between active and passive per-AG structure
    references.  For now, active references are taken to perform some
    work in an AG on behalf of a high level operation; passive references
    are used by lower level code to finish operations started by other
    threads.  Eventually this will become part of online shrink.
  * Split out all the different allocator strategies into separate
    functions to move us away from design antipattern of filling out a
    huge structure for various differentish things and issuing a single
    function multiplexing call.
  * Various cleanups in the filestreams allocator code, which we might
    very well want to deprecate instead of continuing.
  * Fix a bug with the agi rotor code that was introduced earlier in this
    series.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCY/zgqgAKCRBKO3ySh0YR
 plIkAQDIscqdqXGH01gF19/ncqG2GUaXY+/zeOReuk1Iv3VEVgD+MVXf+QvHk7LD
 /LTWNl2K6NQmE/9RtaBt0aFNDzvIAgU=
 =k7r8
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull moar xfs updates from Darrick Wong:
 "This contains a fix for a deadlock in the allocator. It continues the
  slow march towards being able to offline AGs, and it refactors the
  interface to the xfs allocator to be less indirection happy.

  Summary:

   - Fix a deadlock in the free space allocator due to the AG-walking
     algorithm forgetting to follow AG-order locking rules

   - Make the inode allocator prefer existing free inodes instead of
     failing to allocate new inode chunks when free space is low

   - Set minleft correctly when setting allocator parameters for bmap
     changes

   - Fix uninitialized variable access in the getfsmap code

   - Make a distinction between active and passive per-AG structure
     references. For now, active references are taken to perform some
     work in an AG on behalf of a high level operation; passive
     references are used by lower level code to finish operations
     started by other threads. Eventually this will become part of
     online shrink

   - Split out all the different allocator strategies into separate
     functions to move us away from design antipattern of filling out a
     huge structure for various differentish things and issuing a single
     function multiplexing call

   - Various cleanups in the filestreams allocator code, which we might
     very well want to deprecate instead of continuing

   - Fix a bug with the agi rotor code that was introduced earlier in
     this series"

* tag 'xfs-6.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
  xfs: restore old agirotor behavior
  xfs: fix uninitialized variable access
  xfs: refactor the filestreams allocator pick functions
  xfs: return a referenced perag from filestreams allocator
  xfs: pass perag to filestreams tracing
  xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
  xfs: track an active perag reference in filestreams
  xfs: factor out MRU hit case in xfs_filestream_select_ag
  xfs: remove xfs_filestream_select_ag() longest extent check
  xfs: merge new filestream AG selection into xfs_filestream_select_ag()
  xfs: merge filestream AG lookup into xfs_filestream_select_ag()
  xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
  xfs: use xfs_bmap_longest_free_extent() in filestreams
  xfs: get rid of notinit from xfs_bmap_longest_free_extent
  xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
  xfs: convert trim to use for_each_perag_range
  xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
  xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
  xfs: fold xfs_alloc_ag_vextent() into callers
  xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
  ...
2023-02-28 16:08:30 -08:00
Darrick J. Wong 6e2985c938 xfs: restore old agirotor behavior
Prior to the removal of xfs_ialloc_next_ag, we would increment the agi
rotor and return the *old* value.  atomic_inc_return returns the new
value, which causes mkfs to allocate the root directory in AG 1.  Put
back the old behavior (at least for mkfs) by subtracting 1 here.

Fixes: 20a5eab49d ("xfs: convert xfs_ialloc_next_ag() to an atomic")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-27 08:53:45 -08:00
Linus Torvalds 3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Linus Torvalds 28e335208c New code for 6.3-rc1:
* Eliminate repeated boxing and unboxing of log item parameters.
  * Clean up some confusing variable names in the log item code.
  * Fix a deadlock when doing unwritten extent conversion that causes a
    bmbt split when there are sustained memory shortages and the worker
    pool runs out of worker threads.
  * Fix the panic_mask debug knob not being able to trigger on verifier
    errors.
  * Constify kobj_type objects.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCY+Z6BwAKCRBKO3ySh0YR
 pkQJAQCjkzXqZuj8WH/g22S01smT51QhmX+1ubLdzMYSvRvrKQD+MlH74EcgurQD
 GhgCWJh6dBTx1nICKpCXYgVD9Glvowc=
 =J2Xw
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.3-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "There's a couple of bug fixes, some cleanups for inconsistent variable
  names and reduction of struct boxing and unboxing in the logging code.

  More work is pending, which will begin reworking allocation group
  lifetimes and finally replace confusing indirect calls to the
  allocator with actual ... function calls. But I want to let that
  experience another week of testing.

  Summary:

   - Eliminate repeated boxing and unboxing of log item parameters

   - Clean up some confusing variable names in the log item code

   - Fix a deadlock when doing unwritten extent conversion that causes a
     bmbt split when there are sustained memory shortages and the worker
     pool runs out of worker threads

   - Fix the panic_mask debug knob not being able to trigger on verifier
     errors

   - Constify kobj_type objects"

* tag 'xfs-6.3-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: revert commit 8954c44ff4
  xfs: make kobj_type structures constant
  xfs: allow setting full range of panic tags
  xfs: don't use BMBT btree split workers for IO completion
  xfs: fix confusing variable names in xfs_refcount_item.c
  xfs: pass refcount intent directly through the log intent code
  xfs: fix confusing variable names in xfs_rmap_item.c
  xfs: pass rmap space mapping directly through the log intent code
  xfs: fix confusing xfs_extent_item variable names
  xfs: pass xfs_extent_free_item directly through the log intent code
  xfs: fix confusing variable names in xfs_bmap_item.c
  xfs: pass the xfs_bmbt_irec directly through the log intent code
  xfs: use strscpy() to instead of strncpy()
2023-02-22 13:55:51 -08:00
Linus Torvalds d151e8bea1 New code for 6.3:
- Change when the iomap page_done function is called so that we still
    have a locked folio in the success case.  This fixes a writeback race
    in gfs2.
  - Change when the iomap page_prepare function is called so that gfs2
    can recover from OOM scenarios more gracefully.
  - Rename the iomap page_ops to folio_ops, since they operate on folios
    now.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCY8g/FwAKCRBKO3ySh0YR
 pi19AQDCatxkzguJGV9BY52Bf8iDxCgdL34RatKXAzkZC3Y6UQEAsNdb88rkWkNK
 qPlXgsZm9cNlFb8c7mFvA9JAL9IPxgE=
 =ubh6
 -----END PGP SIGNATURE-----

Merge tag 'iomap-6.3-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "This is mostly rearranging things to make life easier for gfs2,
  nothing all that mindblowing for this release.

   - Change when the iomap page_done function is called so that we still
     have a locked folio in the success case. This fixes a writeback
     race in gfs2

   - Change when the iomap page_prepare function is called so that gfs2
     can recover from OOM scenarios more gracefully

   - Rename the iomap page_ops to folio_ops, since they operate on
     folios now"

* tag 'iomap-6.3-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: Rename page_ops to folio_ops
  iomap: Rename page_prepare handler to get_folio
  iomap: Add __iomap_get_folio helper
  iomap/gfs2: Get page in page_prepare handler
  iomap: Add iomap_get_folio helper
  iomap: Rename page_done handler to put_folio
  iomap/gfs2: Unlock and put folio in page_done handler
  iomap: Add __iomap_put_folio helper
2023-02-22 13:50:13 -08:00
Linus Torvalds 05e6295f7b fs.idmapped.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
 orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
 Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
 =+BG5
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs idmapping updates from Christian Brauner:

 - Last cycle we introduced the dedicated struct mnt_idmap type for
   mount idmapping and the required infrastucture in 256c8aed2b ("fs:
   introduce dedicated idmap type for mounts"). As promised in last
   cycle's pull request message this converts everything to rely on
   struct mnt_idmap.

   Currently we still pass around the plain namespace that was attached
   to a mount. This is in general pretty convenient but it makes it easy
   to conflate namespaces that are relevant on the filesystem with
   namespaces that are relevant on the mount level. Especially for
   non-vfs developers without detailed knowledge in this area this was a
   potential source for bugs.

   This finishes the conversion. Instead of passing the plain namespace
   around this updates all places that currently take a pointer to a
   mnt_userns with a pointer to struct mnt_idmap.

   Now that the conversion is done all helpers down to the really
   low-level helpers only accept a struct mnt_idmap argument instead of
   two namespace arguments.

   Conflating mount and other idmappings will now cause the compiler to
   complain loudly thus eliminating the possibility of any bugs. This
   makes it impossible for filesystem developers to mix up mount and
   filesystem idmappings as they are two distinct types and require
   distinct helpers that cannot be used interchangeably.

   Everything associated with struct mnt_idmap is moved into a single
   separate file. With that change no code can poke around in struct
   mnt_idmap. It can only be interacted with through dedicated helpers.
   That means all filesystems are and all of the vfs is completely
   oblivious to the actual implementation of idmappings.

   We are now also able to extend struct mnt_idmap as we see fit. For
   example, we can decouple it completely from namespaces for users that
   don't require or don't want to use them at all. We can also extend
   the concept of idmappings so we can cover filesystem specific
   requirements.

   In combination with the vfs{g,u}id_t work we finished in v6.2 this
   makes this feature substantially more robust and thus difficult to
   implement wrong by a given filesystem and also protects the vfs.

 - Enable idmapped mounts for tmpfs and fulfill a longstanding request.

   A long-standing request from users had been to make it possible to
   create idmapped mounts for tmpfs. For example, to share the host's
   tmpfs mount between multiple sandboxes. This is a prerequisite for
   some advanced Kubernetes cases. Systemd also has a range of use-cases
   to increase service isolation. And there are more users of this.

   However, with all of the other work going on this was way down on the
   priority list but luckily someone other than ourselves picked this
   up.

   As usual the patch is tiny as all the infrastructure work had been
   done multiple kernel releases ago. In addition to all the tests that
   we already have I requested that Rodrigo add a dedicated tmpfs
   testsuite for idmapped mounts to xfstests. It is to be included into
   xfstests during the v6.3 development cycle. This should add a slew of
   additional tests.

* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
  shmem: support idmapped mounts for tmpfs
  fs: move mnt_idmap
  fs: port vfs{g,u}id helpers to mnt_idmap
  fs: port fs{g,u}id helpers to mnt_idmap
  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
  fs: port i_{g,u}id_{needs_}update() to mnt_idmap
  quota: port to mnt_idmap
  fs: port privilege checking helpers to mnt_idmap
  fs: port inode_owner_or_capable() to mnt_idmap
  fs: port inode_init_owner() to mnt_idmap
  fs: port acl to mnt_idmap
  fs: port xattr to mnt_idmap
  fs: port ->permission() to pass mnt_idmap
  fs: port ->fileattr_set() to pass mnt_idmap
  fs: port ->set_acl() to pass mnt_idmap
  fs: port ->get_acl() to pass mnt_idmap
  fs: port ->tmpfile() to pass mnt_idmap
  fs: port ->rename() to pass mnt_idmap
  fs: port ->mknod() to pass mnt_idmap
  fs: port ->mkdir() to pass mnt_idmap
  ...
2023-02-20 11:53:11 -08:00
Darrick J. Wong 60b730a40c xfs: fix uninitialized variable access
If the end position of a GETFSMAP query overlaps an allocated space and
we're using the free space info to generate fsmap info, the akeys
information gets fed into the fsmap formatter with bad results.
Zero-init the space.

Reported-by: syzbot+090ae72d552e6bd93cfe@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-14 17:51:35 -08:00
Dave Chinner bd4f5d09cc xfs: refactor the filestreams allocator pick functions
Now that the filestreams allocator is largely rewritten,
restructure the main entry point and pick function to seperate out
the different operations cleanly. The MRU lookup function should not
handle the start AG selection on MRU lookup failure, and nor should
the pick function handle building the association that is inserted
into the MRU.

This leaves the filestreams allocator fairly clean and easy to
understand, returning to the caller with an active perag reference
and a target block to allocate at.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:56 +11:00
Dave Chinner f8f1ed1ab3 xfs: return a referenced perag from filestreams allocator
Now that the filestreams AG selection tracks active perags, we need
to return an active perag to the core allocator code. This is
because the file allocation the filestreams code will run are AG
specific allocations and so need to pin the AG until the allocations
complete.

We cannot rely on the filestreams item reference to do this - the
filestreams association can be torn down at any time, hence we
need to have a separate reference for the allocation process to pin
the AG after it has been selected.

This means there is some perag juggling in allocation failure
fallback paths as they will do all AG scans in the case the AG
specific allocation fails. Hence we need to track the perag
reference that the filestream allocator returned to make sure we
don't leak it on repeated allocation failure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:56 +11:00
Dave Chinner 571e259282 xfs: pass perag to filestreams tracing
Pass perags instead of raw ag numbers, avoiding the need for the
special peek function for the tracing code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:56 +11:00
Dave Chinner eb70aa2d8e xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
xfs_filestream_pick_ag() is now ready to rework to use
for_each_perag_wrap() for iterating the perags during the AG
selection scan.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner 3054face13 xfs: track an active perag reference in filestreams
Rather than just track the agno of the reference, track a referenced
perag pointer instead. This will allow active filestreams to prevent
AGs from going away until the filestreams have been torn down.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner f38b46bbfa xfs: factor out MRU hit case in xfs_filestream_select_ag
Because it now stands out like a sore thumb. Factoring out this case
starts the process of simplifying xfs_filestream_select_ag() again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner 3e43877a9d xfs: remove xfs_filestream_select_ag() longest extent check
Picking a new AG checks the longest free extent in the AG is valid,
so there's no need to repeat the check in
xfs_filestream_select_ag(). Remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner ba34de8def xfs: merge new filestream AG selection into xfs_filestream_select_ag()
This is largely a wrapper around xfs_filestream_pick_ag() that
repeats a lot of the lookups that we just merged back into
xfs_filestream_select_ag() from the lookup code. Merge the
xfs_filestream_new_ag() code back into _select_ag() to get rid
of all the unnecessary logic.

Indeed, this makes it obvious that if we have no parent inode,
the filestreams allocator always selects AG 0 regardless of whether
it is fit for purpose or not.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner a52dc2ad36 xfs: merge filestream AG lookup into xfs_filestream_select_ag()
The lookup currently either returns the cached filestream AG or it
calls xfs_filestreams_select_lengths() to looks up a new AG. This
has verify the AG that is selected, so we end up doing "select a new
AG loop in a couple of places when only one really is needed.  Merge
the initial lookup functionality with the length selection so that
we only need to do a single pick loop on lookup or verification
failure.

This undoes a lot of the factoring that enabled the selection to be
moved over to the filestreams code. It makes
xfs_filestream_select_ag() an awful messier, but it has to be made
worse before it can get better in future patches...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner 8f7747ad8c xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
xfs_bmap_btalloc_filestreams() calls two filestreams functions to
select the AG to allocate from. Both those functions end up in
the same selection function that iterates all AGs multiple times.
Worst case, xfs_bmap_btalloc_filestreams() can iterate all AGs 4
times just to select the initial AG to allocate in.

Move the AG selection to fs/xfs/xfs_filestreams.c as a single
interface so that the inefficient AG interation is contained
entirely within the filestreams code. This will allow the
implementation to be simplified and made more efficient in future
patches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner 05cf492a8d xfs: use xfs_bmap_longest_free_extent() in filestreams
The code in xfs_bmap_longest_free_extent() is open coded in
xfs_filestream_pick_ag(). Export xfs_bmap_longest_free_extent and
call it from the filestreams code instead.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner 6b637ad0c7 xfs: get rid of notinit from xfs_bmap_longest_free_extent
It is only set if reading the AGF gets a EAGAIN error. Just return
the EAGAIN error and handle that error in the callers.

This means we can remove the not_init parameter from
xfs_bmap_select_minlen(), too, because the use of not_init there is
pessimistic. If we can't read the agf, it won't increase blen.

The only time we actually care whether we checked all the AGFs for
contiguous free space is when the best length is less than the
minimum allocation length. If not_init is set, then we ignore blen
and set the minimum alloc length to the absolute minimum, not the
best length we know already is present.

However, if blen is less than the minimum we're going to ignore it
anyway, regardless of whether we scanned all the AGFs or not.  Hence
not_init can go away, because we only use if blen is good from
the scanned AGs otherwise we ignore it altogether and use minlen.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:55 +11:00
Dave Chinner 89563e7dc0 xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
There's many if (filestreams) {} else {} branches in this function.
Split it out into a filestreams specific function so that we can
then work directly on cleaning up the filestreams code without
impacting the rest of the allocation algorithms.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner 35bf2b1abc xfs: convert trim to use for_each_perag_range
To convert it to using active perag references and hence make it
shrink safe.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner 3432ef6111 xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
Now that the AG iteration code in the core allocation code has been
cleaned up, we can easily convert it to use a for_each_perag..()
variant to use active references and skip AGs that it can't get
active references on.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner 8b81356825 xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
All of the allocation functions now extract the minimum allowed AG
from the transaction and then use it in some way. The allocation
functions that are restricted to a single AG all check if the
AG requested can be allocated from and return an error if so. These
all set args->agno appropriately.

All the allocation functions that iterate AGs use it to calculate
the scan start AG. args->agno is not set until the iterator starts
walking AGs.

Hence we can easily set up a conditional check against the minimum
AG allowed in xfs_alloc_vextent_check_args() based on whether
args->agno contains NULLAGNUMBER or not and move all the repeated
setup code to xfs_alloc_vextent_check_args(), further simplifying
the allocation functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner 230e8fe846 xfs: fold xfs_alloc_ag_vextent() into callers
We don't need the multiplexing xfs_alloc_ag_vextent() provided
anymore - we can just call the exact/near/size variants directly.
This allows us to remove args->type completely and stop using
args->fsbno as an input to the allocator algorithms.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner e4d1742607 xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
Move it from xfs_alloc_ag_vextent() so we can get rid of that layer.
Rename xfs_alloc_vextent_set_fsbno() to xfs_alloc_vextent_finish()
to indicate that it's function is finishing off the allocation that
we've run now that it contains much more functionality.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner 74b9aa6319 xfs: introduce xfs_alloc_vextent_prepare()
Now that we have wrapper functions for each type of allocation we
can ask for, we can start unravelling xfs_alloc_ag_vextent(). That
is essentially just a prepare stage, the allocation multiplexer
and a post-allocation accounting step is the allocation proceeded.

The current xfs_alloc_vextent*() wrappers all have a prepare stage,
the allocation operation and a post-allocation accounting step.

We can consolidate this by moving the AG alloc prep code into the
wrapper functions, the accounting code in the wrapper accounting
functions, and cut out the multiplexer layer entirely.

This patch consolidates the AG preparation stage.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner 5f36b2ce79 xfs: introduce xfs_alloc_vextent_exact_bno()
Two of the callers to xfs_alloc_vextent_this_ag() actually want
exact block number allocation, not anywhere-in-ag allocation. Split
this out from _this_ag() as a first class citizen so no external
extent allocation code needs to care about args->type anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner db4710fd12 xfs: introduce xfs_alloc_vextent_near_bno()
The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO
allocations. We can replace that function with a new
xfs_alloc_vextent_near_bno() function that does this explicitly.

We also multiplex NEAR_BNO allocations through
xfs_alloc_vextent_this_ag via args->type. Replace all of these with
direct calls to xfs_alloc_vextent_near_bno(), too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:54 +11:00
Dave Chinner 2a7f6d41d8 xfs: use xfs_alloc_vextent_start_bno() where appropriate
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_start_bno(). Callers no long need to specify
XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and
removed.

While doing this, also pass the allocation target fsb as a parameter
rather than encoding it in args->fsbno.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner 319c9e874a xfs: use xfs_alloc_vextent_first_ag() where appropriate
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_first_ag(). This gets rid of
XFS_ALLOCTYPE_FIRST_AG as the type used within
xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we
can remove the setting of args->type from all the callers of
_first_ag() and remove the alloctype.

While doing this, pass the allocation target fsb as a parameter
rather than encoding it in args->fsbno. This starts the process
of making args->fsbno an output only variable rather than
input/output.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner 8584332709 xfs: factor xfs_bmap_btalloc()
There are several different contexts xfs_bmap_btalloc() handles, and
large chunks of the code execute independent allocation contexts.
Try to untangle this mess a bit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner 74c36a8689 xfs: use xfs_alloc_vextent_this_ag() where appropriate
Change obvious callers of single AG allocation to use
xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the
callers, too, so that callers with active references don't need
to do new lookups just for an allocation in a context that already
has a perag reference.

The only remaining caller that does single AG allocation through
xfs_alloc_vextent() is xfs_bmap_btalloc() with
XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before
it can be converted cleanly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner 4811c933ea xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent
There's a bit of a recursive conundrum around
xfs_alloc_ag_vextent(). We can't first call xfs_alloc_ag_vextent()
without preparing the AGFL for the allocation, and preparing the
AGFL calls xfs_alloc_ag_vextent() to prepare the AGFL for the
allocation. This "double allocation" requirement is not really clear
from the current xfs_alloc_fix_freelist() calls that are sprinkled
through the allocation code.

It's not helped that xfs_alloc_ag_vextent() can actually allocate
from the AGFL itself, but there's special code to prevent AGFL prep
allocations from allocating from the free list it's trying to prep.
The naming is also not consistent: args->wasfromfl is true when we
allocated _from_ the free list, but the indication that we are
allocating _for_ the free list is via checking that (args->resv ==
XFS_AG_RESV_AGFL).

So, lets make this "allocation required for allocation" situation
clear by moving it all inside xfs_alloc_ag_vextent(). The freelist
allocation is a specific XFS_ALLOCTYPE_THIS_AG allocation, which
translated directly to xfs_alloc_ag_vextent_size() allocation.

This enables us to replace __xfs_alloc_vextent_this_ag() with a call
to xfs_alloc_ag_vextent(), and we drive the freelist fixing further
into the per-ag allocation algorithm.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner 2edf06a50f xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()
The core of the per-ag iteration is effectively doing a "this ag"
allocation on one AG at a time. Use the same code to implement the
core "this ag" allocation in both xfs_alloc_vextent_this_ag()
and xfs_alloc_vextent_iterate_ags().

This means we only call xfs_alloc_ag_vextent() from one place so we
can easily collapse the call stack in future patches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner ecd788a924 xfs: rework xfs_alloc_vextent()
It's a multiplexing mess that can be greatly simplified, and really
needs to be simplified to allow active per-ag references to
propagate from initial AG selection code the the bmapi code.

This splits the code out into separate a parameter checking
function, an iterator function, and allocation completion functions
and then implements the individual policies using these functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner 76257a1587 xfs: introduce xfs_for_each_perag_wrap()
In several places we iterate every AG from a specific start agno and
wrap back to the first AG when we reach the end of the filesystem to
continue searching. We don't have a primitive for this iteration
yet, so add one for conversion of these algorithms to per-ag based
iteration.

The filestream AG select code is a mess, and this initially makes it
worse. The per-ag selection needs to be driven completely into the
filestream code to clean this up and it will be done in a future
patch that makes the filestream allocator use active per-ag
references correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:53 +11:00
Dave Chinner 7ac2ff8bb3 xfs: perags need atomic operational state
We currently don't have any flags or operational state in the
xfs_perag except for the pagf_init and pagi_init flags. And the
agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok
flags, too.

For controlling per-ag operations, we are going to need some atomic
state flags. Hence add an opstate field similar to what we already
have in the mount and log, and convert all these state flags across
to atomic bit operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner 20a5eab49d xfs: convert xfs_ialloc_next_ag() to an atomic
This is currently a spinlock lock protected rotor which can be
implemented with a single atomic operation. Change it to be more
efficient and get rid of the m_agirotor_lock. Noticed while
converting the inode allocation AG selection loop to active perag
references.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner bab8b79518 xfs: inobt can use perags in many more places than it does
Lots of code in the inobt infrastructure is passed both xfs_mount
and perags. We only need perags for the per-ag inode allocation
code, so reduce the duplication by passing only the perags as the
primary object.

This ends up reducing the code size by a bit:

	   text    data     bss     dec     hex filename
orig	1138878  323979     548 1463405  16546d (TOTALS)
patched	1138709  323979     548 1463236  1653c4 (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner dedab3e437 xfs: use active perag references for inode allocation
Convert the inode allocation routines to use active perag references
or references held by callers rather than grab their own. Also drive
the perag further inwards to replace xfs_mounts when doing
operations on a specific AG.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner 498f0adbcd xfs: convert xfs_imap() to take a perag
Callers have referenced perags but they don't pass it into
xfs_imap() so it takes it's own reference. Fix that so we can change
inode allocation over to using active references.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner 368e2d09b4 xfs: rework the perag trace points to be perag centric
So that they all output the same information in the traces to make
debugging refcount issues easier.

This means that all the lookup/drop functions no longer need to use
the full memory barrier atomic operations (atomic*_return()) so
will have less overhead when tracing is off. The set/clear tag
tracepoints no longer abuse the reference count to pass the tag -
the tag being cleared is obvious from the _RET_IP_ that is recorded
in the trace point.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:52 +11:00
Dave Chinner c4d5660afb xfs: active perag reference counting
We need to be able to dynamically remove instantiated AGs from
memory safely, either for shrinking the filesystem or paging AG
state in and out of memory (e.g. supporting millions of AGs). This
means we need to be able to safely exclude operations from accessing
perags while dynamic removal is in progress.

To do this, introduce the concept of active and passive references.
Active references are required for high level operations that make
use of an AG for a given operation (e.g. allocation) and pin the
perag in memory for the duration of the operation that is operating
on the perag (e.g. transaction scope). This means we can fail to get
an active reference to an AG, hence callers of the new active
reference API must be able to handle lookup failure gracefully.

Passive references are used in low level code, where we might need
to access the perag structure for the purposes of completing high
level operations. For example, buffers need to use passive
references because:
- we need to be able to do metadata IO during operations like grow
  and shrink transactions where high level active references to the
  AG have already been blocked
- buffers need to pin the perag until they are reclaimed from
  memory, something that high level code has no direct control over.
- unused cached buffers should not prevent a shrink from being
  started.

Hence we have active references that will form exclusion barriers
for operations to be performed on an AG, and passive references that
will prevent reclaim of the perag until all objects with passive
references have been reclaimed themselves.

This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API
for active AG reference functionality. We also need to convert the
for_each_perag*() iterators to use active references, which will
start the process of converting high level code over to using active
references. Conversion of non-iterator based code to active
references will be done in followup patches.

Note that the implementation using reference counting is really just
a development vehicle for the API to ensure we don't have any leaks
in the callers. Once we need to remove perag structures from memory
dyanmically, we will need a much more robust per-ag state transition
mechanism for preventing new references from being taken while we
wait for existing references to drain before removal from memory can
occur....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13 09:14:42 +11:00
Dave Chinner 55d5c3a386 xfs: don't assert fail on transaction cancel with deferred ops
We can error out of an allocation transaction when updating BMBT
blocks when things go wrong. This can be a btree corruption, and
unexpected ENOSPC, etc. In these cases, we already have deferred ops
queued for the first allocation that has been done, and we just want
to cancel out the transaction and shut down the filesystem on error.

In fact, we do just that for production systems - the assert that we
can't have a transaction with defer ops attached unless we are
already shut down is bogus and gets in the way of debugging
whatever issue is actually causing the transaction to be cancelled.

Remove the assert because it is causing spurious test failures to
hang test machines.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-11 04:12:06 +11:00
Dave Chinner 692b6cddeb xfs: t_firstblock is tracking AGs not blocks
The tp->t_firstblock field is now raelly tracking the highest AG we
have locked, not the block number of the highest allocation we've
made. It's purpose is to prevent AGF locking deadlocks, so rename it
to "highest AG" and simplify the implementation to just track the
agno rather than a fsbno.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-11 04:11:06 +11:00
Dave Chinner 36b6ad2d9c xfs: drop firstblock constraints from allocation setup
Now that xfs_alloc_vextent() does all the AGF deadlock prevention
filtering for multiple allocations in a single transaction, we no
longer need the allocation setup code to care about what AGs we
might already have locked.

Hence we can remove all the "nullfb" conditional logic in places
like xfs_bmap_btalloc() and instead have them focus simply on
setting up locality constraints. If the allocation fails due to
AGF lock filtering in xfs_alloc_vextent, then we just fall back as
we normally do to more relaxed allocation constraints.

As a result, any allocation that allows AG scanning (i.e. not
confined to a single AG) and does not force a worst case full
filesystem scan will now be able to attempt allocation from AGs
lower than that defined by tp->t_firstblock. This is because
xfs_alloc_vextent() allows try-locking of the AGFs and hence enables
low space algorithms to at least -try- to get space from AGs lower
than the one that we have currently locked and allocated from. This
is a significant improvement in the low space allocation algorithm.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-11 04:10:06 +11:00
Dave Chinner d5753847b2 xfs: block reservation too large for minleft allocation
When we enter xfs_bmbt_alloc_block() without having first allocated
a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
are doing something like unwritten extent conversion, the transaction
block reservation is used as the minleft value.

This works for operations like unwritten extent conversion, but it
assumes that the block reservation is only for a BMBT split. THis is
not always true, and sometimes results in larger than necessary
minleft values being set. We only actually need enough space for a
btree split, something we already handle correctly in
xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.

We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
calculate the number of blocks a BMBT split on this inode is going to
require, not use the transaction block reservation that contains the
maximum number of blocks this transaction may consume in it...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-11 04:09:06 +11:00
Dave Chinner f08f984c63 xfs: prefer free inodes at ENOSPC over chunk allocation
When an XFS filesystem has free inodes in chunks already allocated
on disk, it will still allocate new inode chunks if the target AG
has no free inodes in it. Normally, this is a good idea as it
preserves locality of all the inodes in a given directory.

However, at ENOSPC this can lead to using the last few remaining
free filesystem blocks to allocate a new chunk when there are many,
many free inodes that could be allocated without consuming free
space. This results in speeding up the consumption of the last few
blocks and inode create operations then returning ENOSPC when there
free inodes available because we don't have enough block left in the
filesystem for directory creation reservations to proceed.

Hence when we are near ENOSPC, we should be attempting to preserve
the remaining blocks for directory block allocation rather than
using them for unnecessary inode chunk creation.

This particular behaviour is exposed by xfs/294, when it drives to
ENOSPC on empty file creation whilst there are still thousands of
free inodes available for allocation in other AGs in the filesystem.

Hence, when we are within 1% of ENOSPC, change the inode allocation
behaviour to prefer to use existing free inodes over allocating new
inode chunks, even though it results is poorer locality of the data
set. It is more important for the allocations to be space efficient
near ENOSPC than to have optimal locality for performance, so lets
modify the inode AG selection code to reflect that fact.

This allows generic/294 to not only pass with this allocator rework
patchset, but to increase the number of post-ENOSPC empty inode
allocations to from ~600 to ~9080 before we hit ENOSPC on the
directory create transaction reservation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-11 04:08:06 +11:00
Dave Chinner 1dd0510f6d xfs: fix low space alloc deadlock
I've recently encountered an ABBA deadlock with g/476. The upcoming
changes seem to make this much easier to hit, but the underlying
problem is a pre-existing one.

Essentially, if we select an AG for allocation, then lock the AGF
and then fail to allocate for some reason (e.g. minimum length
requirements cannot be satisfied), then we drop out of the
allocation with the AGF still locked.

The caller then modifies the allocation constraints - usually
loosening them up - and tries again. This can result in trying to
access AGFs that are lower than the AGF we already have locked from
the failed attempt. e.g. the failed attempt skipped several AGs
before failing, so we have locks an AG higher than the start AG.
Retrying the allocation from the start AG then causes us to violate
AGF lock ordering and this can lead to deadlocks.

The deadlock exists even if allocation succeeds - we can do a
followup allocations in the same transaction for BMBT blocks that
aren't guaranteed to be in the same AG as the original, and can move
into higher AGs. Hence we really need to move the tp->t_firstblock
tracking down into xfs_alloc_vextent() where it can be set when we
exit with a locked AG.

xfs_alloc_vextent() can also check there if the requested
allocation falls within the allow range of AGs set by
tp->t_firstblock. If we can't allocate within the range set, we have
to fail the allocation. If we are allowed to to non-blocking AGF
locking, we can ignore the AG locking order limitations as we can
use try-locks for the first iteration over requested AG range.

This invalidates a set of post allocation asserts that check that
the allocation is always above tp->t_firstblock if it is set.
Because we can use try-locks to avoid the deadlock in some
circumstances, having a pre-existing locked AGF doesn't always
prevent allocation from lower order AGFs. Hence those ASSERTs need
to be removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-11 04:07:06 +11:00
Darrick J. Wong dd07bb8b6b xfs: revert commit 8954c44ff4
The name passed into __xfs_xattr_put_listent is exactly namelen bytes
long and not null-terminated.  Passing namelen+1 to the strscpy function

    strscpy(offset, (char *)name, namelen + 1);

is therefore wrong.  Go back to the old code, which works fine because
strncpy won't find a null in @name and stops after namelen bytes.  It
really could be a memcpy call, but it worked for years.

Reported-by: syzbot+898115bc6d7140437215@syzkaller.appspotmail.com
Fixes: 8954c44ff4 ("xfs: use strscpy() to instead of strncpy()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10 09:06:06 -08:00
Thomas Weißschuh 2ee8333529 xfs: make kobj_type structures constant
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10 08:59:48 -08:00
Donald Douwsma 167ce4cbfa xfs: allow setting full range of panic tags
xfs will not allow combining other panic masks with
XFS_PTAG_VERIFIER_ERROR.

 # sysctl fs.xfs.panic_mask=511
 sysctl: setting key "fs.xfs.panic_mask": Invalid argument
 fs.xfs.panic_mask = 511

Update to the maximum value that can be set to allow the full range of
masks. Do this using a mask of possible values to prevent this happening
again as suggested by Darrick.

Fixes: d519da41e2 ("xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask")
Signed-off-by: Donald Douwsma <ddouwsma@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-09 18:36:17 -08:00
Suren Baghdasaryan 1c71222e5f mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00
Dave Chinner c85007e2e3 xfs: don't use BMBT btree split workers for IO completion
When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.

A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:

1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.

2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.

3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run

4. The split work from the unwritten extent conversion must run
first.

5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.

6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.

At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....

This is a highly improbably series of events, but there it is.

There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.

This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.

The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:24 -08:00
Darrick J. Wong 01a3af226b xfs: fix confusing variable names in xfs_refcount_item.c
Variable names in this code module are inconsistent and confusing.
xfs_phys_extent describe physical mappings, so rename them "pmap".
xfs_refcount_intents describe refcount intents, so rename them "ri".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:12 -08:00
Darrick J. Wong 0b11553ec5 xfs: pass refcount intent directly through the log intent code
Pass the incore refcount intent through the CUI logging code instead of
repeatedly boxing and unboxing parameters.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:11 -08:00
Darrick J. Wong ffaa196f62 xfs: fix confusing variable names in xfs_rmap_item.c
Variable names in this code module are inconsistent and confusing.
xfs_map_extent describe file mappings, so rename them "map".
xfs_rmap_intents describe block mapping intents, so rename them "ri".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:11 -08:00
Darrick J. Wong 1534328bb4 xfs: pass rmap space mapping directly through the log intent code
Pass the incore rmap space mapping through the RUI logging code instead
of repeatedly boxing and unboxing parameters.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:11 -08:00
Darrick J. Wong 578c714b21 xfs: fix confusing xfs_extent_item variable names
Change the name of all pointers to xfs_extent_item structures to "xefi"
to make the name consistent and because the current selections ("new"
and "free") mean other things in C.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:11 -08:00
Darrick J. Wong 72ba455599 xfs: pass xfs_extent_free_item directly through the log intent code
Pass the incore xfs_extent_free_item through the EFI logging code
instead of repeatedly boxing and unboxing parameters.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:11 -08:00
Darrick J. Wong f3ebac4c94 xfs: fix confusing variable names in xfs_bmap_item.c
Variable names in this code module are inconsistent and confusing.
xfs_map_extent describe file mappings, so rename them "map".
xfs_bmap_intents describe block mapping intents, so rename them "bi".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:11 -08:00
Darrick J. Wong ddccb81b26 xfs: pass the xfs_bmbt_irec directly through the log intent code
Instead of repeatedly boxing and unboxing the incore extent mapping
structure as it passes through the BUI code, pass the pointer directly
through.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:11 -08:00
Xu Panda 8954c44ff4 xfs: use strscpy() to instead of strncpy()
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL-terminated strings.

Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05 08:48:11 -08:00
Christian Brauner 4d7ca40901
fs: port vfs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:30 +01:00
Christian Brauner c14329d39f
fs: port fs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:30 +01:00
Christian Brauner e67fe63341
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap.
Remove legacy file_mnt_user_ns() and mnt_user_ns().

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner 0dbe12f2e4
fs: port i_{g,u}id_{needs_}update() to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner 9452e93e6d
fs: port privilege checking helpers to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner f2d40141d5
fs: port inode_init_owner() to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner 700b794052
fs: port acl to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner 39f60c1cce
fs: port xattr to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner 8782a9aea3
fs: port ->fileattr_set() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:27 +01:00
Christian Brauner 13e83a4923
fs: port ->set_acl() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:27 +01:00
Christian Brauner 011e2b717b
fs: port ->tmpfile() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:27 +01:00
Christian Brauner e18275ae55
fs: port ->rename() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner 5ebb29bee8
fs: port ->mknod() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner c54bd91e9e
fs: port ->mkdir() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner 7a77db9551
fs: port ->symlink() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner 6c960e68aa
fs: port ->create() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner b74d24f7a7
fs: port ->getattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner c1632a0f11
fs: port ->setattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:02 +01:00
Andreas Gruenbacher 471859f57d iomap: Rename page_ops to folio_ops
The operations in struct page_ops all operate on folios, so rename
struct page_ops to struct folio_ops.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: port around not removing iomap_valid]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-01-18 10:44:05 -08:00
Jeff Layton 5970e15dbc filelock: move file locking definitions to separate header file
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.

Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-01-11 06:52:32 -05:00
Wengang Wang 601a27ea09 xfs: fix extent busy updating
In xfs_extent_busy_update_extent() case 6 and 7, whenever bno is modified on
extent busy, the relavent length has to be modified accordingly.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-01-05 07:34:21 -08:00
Li zeming e195605ed2 xfs: xfs_qm: remove unnecessary ‘0’ values from error
error is assigned first, so it does not need to initialize the
assignment.

Signed-off-by: Li zeming <zeming@nfschina.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-01-03 10:23:07 -08:00
Wu Guanghao 4da112513c xfs: Fix deadlock on xfs_inodegc_worker
We are doing a test about deleting a large number of files
when memory is low. A deadlock problem was found.

[ 1240.279183] -> #1 (fs_reclaim){+.+.}-{0:0}:
[ 1240.280450]        lock_acquire+0x197/0x460
[ 1240.281548]        fs_reclaim_acquire.part.0+0x20/0x30
[ 1240.282625]        kmem_cache_alloc+0x2b/0x940
[ 1240.283816]        xfs_trans_alloc+0x8a/0x8b0
[ 1240.284757]        xfs_inactive_ifree+0xe4/0x4e0
[ 1240.285935]        xfs_inactive+0x4e9/0x8a0
[ 1240.286836]        xfs_inodegc_worker+0x160/0x5e0
[ 1240.287969]        process_one_work+0xa19/0x16b0
[ 1240.289030]        worker_thread+0x9e/0x1050
[ 1240.290131]        kthread+0x34f/0x460
[ 1240.290999]        ret_from_fork+0x22/0x30
[ 1240.291905]
[ 1240.291905] -> #0 ((work_completion)(&gc->work)){+.+.}-{0:0}:
[ 1240.293569]        check_prev_add+0x160/0x2490
[ 1240.294473]        __lock_acquire+0x2c4d/0x5160
[ 1240.295544]        lock_acquire+0x197/0x460
[ 1240.296403]        __flush_work+0x6bc/0xa20
[ 1240.297522]        xfs_inode_mark_reclaimable+0x6f0/0xdc0
[ 1240.298649]        destroy_inode+0xc6/0x1b0
[ 1240.299677]        dispose_list+0xe1/0x1d0
[ 1240.300567]        prune_icache_sb+0xec/0x150
[ 1240.301794]        super_cache_scan+0x2c9/0x480
[ 1240.302776]        do_shrink_slab+0x3f0/0xaa0
[ 1240.303671]        shrink_slab+0x170/0x660
[ 1240.304601]        shrink_node+0x7f7/0x1df0
[ 1240.305515]        balance_pgdat+0x766/0xf50
[ 1240.306657]        kswapd+0x5bd/0xd20
[ 1240.307551]        kthread+0x34f/0x460
[ 1240.308346]        ret_from_fork+0x22/0x30
[ 1240.309247]
[ 1240.309247] other info that might help us debug this:
[ 1240.309247]
[ 1240.310944]  Possible unsafe locking scenario:
[ 1240.310944]
[ 1240.312379]        CPU0                    CPU1
[ 1240.313363]        ----                    ----
[ 1240.314433]   lock(fs_reclaim);
[ 1240.315107]                                lock((work_completion)(&gc->work));
[ 1240.316828]                                lock(fs_reclaim);
[ 1240.318088]   lock((work_completion)(&gc->work));
[ 1240.319203]
[ 1240.319203]  *** DEADLOCK ***
...
[ 2438.431081] Workqueue: xfs-inodegc/sda xfs_inodegc_worker
[ 2438.432089] Call Trace:
[ 2438.432562]  __schedule+0xa94/0x1d20
[ 2438.435787]  schedule+0xbf/0x270
[ 2438.436397]  schedule_timeout+0x6f8/0x8b0
[ 2438.445126]  wait_for_completion+0x163/0x260
[ 2438.448610]  __flush_work+0x4c4/0xa40
[ 2438.455011]  xfs_inode_mark_reclaimable+0x6ef/0xda0
[ 2438.456695]  destroy_inode+0xc6/0x1b0
[ 2438.457375]  dispose_list+0xe1/0x1d0
[ 2438.458834]  prune_icache_sb+0xe8/0x150
[ 2438.461181]  super_cache_scan+0x2b3/0x470
[ 2438.461950]  do_shrink_slab+0x3cf/0xa50
[ 2438.462687]  shrink_slab+0x17d/0x660
[ 2438.466392]  shrink_node+0x87e/0x1d40
[ 2438.467894]  do_try_to_free_pages+0x364/0x1300
[ 2438.471188]  try_to_free_pages+0x26c/0x5b0
[ 2438.473567]  __alloc_pages_slowpath.constprop.136+0x7aa/0x2100
[ 2438.482577]  __alloc_pages+0x5db/0x710
[ 2438.485231]  alloc_pages+0x100/0x200
[ 2438.485923]  allocate_slab+0x2c0/0x380
[ 2438.486623]  ___slab_alloc+0x41f/0x690
[ 2438.490254]  __slab_alloc+0x54/0x70
[ 2438.491692]  kmem_cache_alloc+0x23e/0x270
[ 2438.492437]  xfs_trans_alloc+0x88/0x880
[ 2438.493168]  xfs_inactive_ifree+0xe2/0x4e0
[ 2438.496419]  xfs_inactive+0x4eb/0x8b0
[ 2438.497123]  xfs_inodegc_worker+0x16b/0x5e0
[ 2438.497918]  process_one_work+0xbf7/0x1a20
[ 2438.500316]  worker_thread+0x8c/0x1060
[ 2438.504938]  ret_from_fork+0x22/0x30

When the memory is insufficient, xfs_inonodegc_worker will trigger memory
reclamation when memory is allocated, then flush_work() may be called to
wait for the work to complete. This causes a deadlock.

So use memalloc_nofs_save() to avoid triggering memory reclamation in
xfs_inodegc_worker.

Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-01-03 10:23:07 -08:00
Hironori Shiina 817644fa45 xfs: get root inode correctly at bulkstat
The root inode number should be set to `breq->startino` for getting stat
information of the root when XFS_BULK_IREQ_SPECIAL_ROOT is used.
Otherwise, the inode search is started from 1
(XFS_BULK_IREQ_SPECIAL_ROOT) and the inode with the lowest number in a
filesystem is returned.

Fixes: bf3cb39447 ("xfs: allow single bulkstat of special inodes")
Signed-off-by: Hironori Shiina <shiina.hironori@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-01-03 10:23:07 -08:00
Darrick J. Wong c0f399ff51 xfs: fix off-by-one error in xfs_btree_space_to_height
Lately I've been stress-testing extreme-sized rmap btrees by using the
(new) xfs_db bmap_inflate command to clone bmbt mappings billions of
times and then using xfs_repair to build new rmap and refcount btrees.
This of course is /much/ faster than actually FICLONEing a file billions
of times.

Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
sufficiently large for the test scenario.  For a 1TB filesystem (~67
million AG blocks, 4 AGs) the btheight command reports:

$ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda
rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node)
level 0: 4400801200 records, 52390491 blocks
level 1: 52390491 records, 1164234 blocks
level 2: 1164234 records, 25872 blocks
level 3: 25872 records, 575 blocks
level 4: 575 records, 13 blocks
level 5: 13 records, 1 block
6 levels, 53581186 blocks total

The AG is sufficiently large to build this rmap btree.  Unfortunately,
m_rmap_maxlevels is 5.  Augmenting the loop in the space->height
function to report height, node blocks, and blocks remaining produces
this:

ht 1 node_blocks 45 blockleft 67108863
ht 2 node_blocks 2025 blockleft 67108818
ht 3 node_blocks 91125 blockleft 67106793
ht 4 node_blocks 4100625 blockleft 67015668
final height: 5

The goal of this function is to compute the maximum height btree that
can be stored in the given number of ondisk fsblocks.  Starting with the
top level of the tree, each iteration through the loop adds the fanout
factor of the next level down until we run out of blocks.  IOWs, maximum
height is achieved by using the smallest fanout factor that can apply
to that level.

However, the loop setup is not correct.  Top level btree blocks are
allowed to contain fewer than minrecs items, so the computation is
incorrect because the first time through the loop it should be using a
fanout factor of 2.  With this corrected, the above becomes:

ht 1 node_blocks 2 blockleft 67108863
ht 2 node_blocks 90 blockleft 67108861
ht 3 node_blocks 4050 blockleft 67108771
ht 4 node_blocks 182250 blockleft 67104721
ht 5 node_blocks 8201250 blockleft 66922471
final height: 6

Fixes: 9ec691205e ("xfs: compute the maximum height of the rmap btree when reflink enabled")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-01-03 10:23:07 -08:00
Darrick J. Wong d4542f3145 xfs: make xfs_iomap_page_ops static
Shut up the sparse warnings about this variable that isn't referenced
anywhere else.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-12-26 10:11:18 -08:00
Darrick J. Wong 26870c3f5b xfs: don't assert if cmap covers imap after cycling lock
In xfs_reflink_fill_cow_hole, there's a debugging assertion that trips
if (after cycling the ILOCK to get a transaction) the requeried cow
mapping overlaps the start of the area being written.  IOWs, it trips if
the hole in the cow fork that it's supposed to fill has been filled.

This is trivially possible since we cycled ILOCK_EXCL.  If we trip the
assertion, then we know that cmap is a delalloc extent because @found is
false.  Fortunately, the bmapi_write call below will convert the
delalloc extent to a real unwritten cow fork extent, so all we need to
do here is remove the assertion.

It turns out that generic/095 trips this pretty regularly with alwayscow
mode enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-12-26 10:11:17 -08:00
Linus Torvalds 87be949912 New XFS code for 6.2:
- Fix a race condition w.r.t. percpu inode free counters
  - Fix a broken error return in xfs_remove
  - Print FS UUID at mount/unmount time
  - Numerous fixes to the online fsck code
  - Fix inode locking inconsistency problems when dealing with realtime
    metadata files
  - Actually merge pull requests so that we capture the cover letter
    contents
  - Fix a race between rebuilding VFS inode state and the AIL flushing
    inodes that could cause corrupt inodes to be written to the
    filesystem
  - Fix a data corruption problem resulting from a write() to an
    unwritten extent racing with writeback started on behalf of memory
    reclaim changing the extent state
  - Add debugging knobs so that we can test iomap invalidation
  - Fix the blockdev pagecache contents being stale after unmounting the
    filesystem, leading to spurious xfs_db errors and corrupt metadumps
  - Fix a file mapping corruption bug due to ilock cycling when attaching
    dquots to a file during delalloc reservation
  - Fix a refcount btree corruption problem due to the refcount
    adjustment code not handling MAXREFCOUNT correctly, resulting in
    unnecessary record splits
  - Fix COW staging extent alloctions not being classified as USERDATA,
    which results in filestreams being ignored and possible data
    corruption if the allocation was filled from the AGFL and the block
    buffer is still being tracked in the AIL
  - Fix new duplicated includes
  - Fix a race between the dquot shrinker and dquot freeing that could
    cause a UAF
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmOSEWsACgkQ+H93GTRK
 tOvpsg//Y8pgue8GFwyXq0LYEYb1yjueGIxDGz9SwkfMP9vADsdDpXxquHmes5M+
 Q9vMyFnfaizZs2oXD6Nw/+RJMyOa3ZQtNqjxJET5pTIBcWvdjsP9UGW+K+1uN7LT
 NsM7lgpxy8RfQFHjvFHpOysxGIpT70n3lz98qlwy1yIGF/EFE52pkKcArGjpIu4A
 wBdyL0hIBwXc27zLRahLxfwFaW/I40ka3D40EUYpNnAjE5Sy0YgLlsOCzrxN0UvY
 a9dlq+WFJjWDsLp6vr11ruewXAmzYG2m/3RdP2aLbmDHDvo06UkesKkPNhexlClM
 kRE/ZImmakqKlAqgtUbkxT06NbIKOxYslbcoOOLDneqb1grTcgk79J7jsMlLLU1s
 s1WyPMWR3wb0jjclgGBxd3c1nprdkvJSkBpyEOwIYLhwdPNuwqTwEVsq7TvasRLI
 dgals5/J6fBnIeTR7x2YObonQRd4FlkXFv+AVYpGVUJEI02eRgY3i7NJBZWyBKAS
 +Gcd1Bq1F387b0FRqq1iVhGD+NpoHHiP84bOQED9R9t0jP1AHj9t47f+Uuvjj2hN
 ByT7MpA0nZdbYGKU+rFyKsIvONyLdxyjL+jm6FkmrW+G25fJ1af2yhrVhZQhw7dm
 zLb1ntSnXvNTj4OopfKSDD2MPGf+2C/o2XJvAAS501pmsQefKOM=
 =plES
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS updates from Darrick Wong:
 "The highlight of this is a batch of fixes for the online metadata
  checking code as we start the loooong march towards merging online
  repair. I aim to merge that in time for the 2023 LTS.

  There are also a large number of data corruption and race condition
  fixes in this patchset. Most notably fixed are write() calls to
  unwritten extents racing with writeback, which required some late(r
  than I prefer) code changes to iomap to support the necessary
  revalidations. I don't really like iomap changes going in past -rc4,
  but Dave and I have been working on it long enough that I chose to
  push it for 6.2 anyway.

  There are also a number of other subtle problems fixed, including the
  log racing with inode writeback to write inodes with incorrect link
  count to disk; file data mapping corruptions as a result of incorrect
  lock cycling when attaching dquots; refcount metadata corruption if
  one actually manages to share a block 2^32 times; and the log
  clobbering cow staging extents if they were formerly metadata blocks.

  Summary:

   - Fix a race condition w.r.t. percpu inode free counters

   - Fix a broken error return in xfs_remove

   - Print FS UUID at mount/unmount time

   - Numerous fixes to the online fsck code

   - Fix inode locking inconsistency problems when dealing with realtime
     metadata files

   - Actually merge pull requests so that we capture the cover letter
     contents

   - Fix a race between rebuilding VFS inode state and the AIL flushing
     inodes that could cause corrupt inodes to be written to the
     filesystem

   - Fix a data corruption problem resulting from a write() to an
     unwritten extent racing with writeback started on behalf of memory
     reclaim changing the extent state

   - Add debugging knobs so that we can test iomap invalidation

   - Fix the blockdev pagecache contents being stale after unmounting
     the filesystem, leading to spurious xfs_db errors and corrupt
     metadumps

   - Fix a file mapping corruption bug due to ilock cycling when
     attaching dquots to a file during delalloc reservation

   - Fix a refcount btree corruption problem due to the refcount
     adjustment code not handling MAXREFCOUNT correctly, resulting in
     unnecessary record splits

   - Fix COW staging extent alloctions not being classified as USERDATA,
     which results in filestreams being ignored and possible data
     corruption if the allocation was filled from the AGFL and the block
     buffer is still being tracked in the AIL

   - Fix new duplicated includes

   - Fix a race between the dquot shrinker and dquot freeing that could
     cause a UAF"

* tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (50 commits)
  xfs: dquot shrinker doesn't check for XFS_DQFLAG_FREEING
  xfs: Remove duplicated include in xfs_iomap.c
  xfs: invalidate xfs_bufs when allocating cow extents
  xfs: get rid of assert from xfs_btree_islastblock
  xfs: estimate post-merge refcounts correctly
  xfs: hoist refcount record merge predicates
  xfs: fix super block buf log item UAF during force shutdown
  xfs: wait iclog complete before tearing down AIL
  xfs: attach dquots to inode before reading data/cow fork mappings
  xfs: shut up -Wuninitialized in xfsaild_push
  xfs: use memcpy, not strncpy, to format the attr prefix during listxattr
  xfs: invalidate block device page cache during unmount
  xfs: add debug knob to slow down write for fun
  xfs: add debug knob to slow down writeback for fun
  xfs: drop write error injection is unfixable, remove it
  xfs: use iomap_valid method to detect stale cached iomaps
  iomap: write iomap validity checks
  xfs: xfs_bmap_punch_delalloc_range() should take a byte range
  iomap: buffered write failure should not truncate the page cache
  xfs,iomap: move delalloc punching to iomap
  ...
2022-12-14 10:11:51 -08:00
Linus Torvalds e2ca6ba6ba MM patches for 6.2-rc1.
- More userfaultfs work from Peter Xu.
 
 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying.
 
 - Some filemap cleanups from Vishal Moola.
 
 - David Hildenbrand added the ability to selftest anon memory COW handling.
 
 - Some cpuset simplifications from Liu Shixin.
 
 - Addition of vmalloc tracing support by Uladzislau Rezki.
 
 - Some pagecache folioifications and simplifications from Matthew Wilcox.
 
 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it.
 
 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.  This series shold have been in the
   non-MM tree, my bad.
 
 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages.
 
 - DAMON cleanups and tuneups from SeongJae Park
 
 - Tony Luck fixed the handling of COW faults against poisoned pages.
 
 - Peter Xu utilized the PTE marker code for handling swapin errors.
 
 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient.
 
 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand.
 
 - zram support for multiple compression streams from Sergey Senozhatsky.
 
 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway.
 
 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations.
 
 - Vishal Moola removed the try_to_release_page() wrapper.
 
 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache.
 
 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking.
 
 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend.
 
 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range().
 
 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen.
 
 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests.  Better, but still not perfect.
 
 - Christoph Hellwig has removed the .writepage() method from several
   filesystems.  They only need .writepages().
 
 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting.
 
 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines.
 
 - Many singleton patches, as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA
 jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml
 CodAgiA51qwzId3GRytIo/tfWZSezgA=
 =d19R
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - More userfaultfs work from Peter Xu

 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying

 - Some filemap cleanups from Vishal Moola

 - David Hildenbrand added the ability to selftest anon memory COW
   handling

 - Some cpuset simplifications from Liu Shixin

 - Addition of vmalloc tracing support by Uladzislau Rezki

 - Some pagecache folioifications and simplifications from Matthew
   Wilcox

 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
   it

 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.

   This series should have been in the non-MM tree, my bad

 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages

 - DAMON cleanups and tuneups from SeongJae Park

 - Tony Luck fixed the handling of COW faults against poisoned pages

 - Peter Xu utilized the PTE marker code for handling swapin errors

 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient

 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand

 - zram support for multiple compression streams from Sergey Senozhatsky

 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway

 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations

 - Vishal Moola removed the try_to_release_page() wrapper

 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache

 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking

 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend

 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range()

 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen

 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests. Better, but still not perfect

 - Christoph Hellwig has removed the .writepage() method from several
   filesystems. They only need .writepages()

 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting

 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines

 - Many singleton patches, as usual

* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
  mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
  mm: mmu_gather: allow more than one batch of delayed rmaps
  mm: fix typo in struct pglist_data code comment
  kmsan: fix memcpy tests
  mm: add cond_resched() in swapin_walk_pmd_entry()
  mm: do not show fs mm pc for VM_LOCKONFAULT pages
  selftests/vm: ksm_functional_tests: fixes for 32bit
  selftests/vm: cow: fix compile warning on 32bit
  selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
  mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
  mm,thp,rmap: fix races between updates of subpages_mapcount
  mm: memcg: fix swapcached stat accounting
  mm: add nodes= arg to memory.reclaim
  mm: disable top-tier fallback to reclaim on proactive reclaim
  selftests: cgroup: make sure reclaim target memcg is unprotected
  selftests: cgroup: refactor proactive reclaim code to reclaim_until()
  mm: memcg: fix stale protection of reclaim target memcg
  mm/mmap: properly unaccount memory on mas_preallocate() failure
  omfs: remove ->writepage
  jfs: remove ->writepage
  ...
2022-12-13 19:29:45 -08:00
Linus Torvalds 6a518afcc2 fs.acl.rework.v6.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc
 ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt
 pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg=
 =27Wm
 -----END PGP SIGNATURE-----

Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull VFS acl updates from Christian Brauner:
 "This contains the work that builds a dedicated vfs posix acl api.

  The origins of this work trace back to v5.19 but it took quite a while
  to understand the various filesystem specific implementations in
  sufficient detail and also come up with an acceptable solution.

  As we discussed and seen multiple times the current state of how posix
  acls are handled isn't nice and comes with a lot of problems: The
  current way of handling posix acls via the generic xattr api is error
  prone, hard to maintain, and type unsafe for the vfs until we call
  into the filesystem's dedicated get and set inode operations.

  It is already the case that posix acls are special-cased to death all
  the way through the vfs. There are an uncounted number of hacks that
  operate on the uapi posix acl struct instead of the dedicated vfs
  struct posix_acl. And the vfs must be involved in order to interpret
  and fixup posix acls before storing them to the backing store, caching
  them, reporting them to userspace, or for permission checking.

  Currently a range of hacks and duct tape exist to make this work. As
  with most things this is really no ones fault it's just something that
  happened over time. But the code is hard to understand and difficult
  to maintain and one is constantly at risk of introducing bugs and
  regressions when having to touch it.

  Instead of continuing to hack posix acls through the xattr handlers
  this series builds a dedicated posix acl api solely around the get and
  set inode operations.

  Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl()
  helpers must be used in order to interact with posix acls. They
  operate directly on the vfs internal struct posix_acl instead of
  abusing the uapi posix acl struct as we currently do. In the end this
  removes all of the hackiness, makes the codepaths easier to maintain,
  and gets us type safety.

  This series passes the LTP and xfstests suites without any
  regressions. For xfstests the following combinations were tested:
   - xfs
   - ext4
   - btrfs
   - overlayfs
   - overlayfs on top of idmapped mounts
   - orangefs
   - (limited) cifs

  There's more simplifications for posix acls that we can make in the
  future if the basic api has made it.

  A few implementation details:

   - The series makes sure to retain exactly the same security and
     integrity module permission checks. Especially for the integrity
     modules this api is a win because right now they convert the uapi
     posix acl struct passed to them via a void pointer into the vfs
     struct posix_acl format to perform permission checking on the mode.

     There's a new dedicated security hook for setting posix acls which
     passes the vfs struct posix_acl not a void pointer. Basing checking
     on the posix acl stored in the uapi format is really unreliable.
     The vfs currently hacks around directly in the uapi struct storing
     values that frankly the security and integrity modules can't
     correctly interpret as evidenced by bugs we reported and fixed in
     this area. It's not necessarily even their fault it's just that the
     format we provide to them is sub optimal.

   - Some filesystems like 9p and cifs need access to the dentry in
     order to get and set posix acls which is why they either only
     partially or not even at all implement get and set inode
     operations. For example, cifs allows setxattr() and getxattr()
     operations but doesn't allow permission checking based on posix
     acls because it can't implement a get acl inode operation.

     Thus, this patch series updates the set acl inode operation to take
     a dentry instead of an inode argument. However, for the get acl
     inode operation we can't do this as the old get acl method is
     called in e.g., generic_permission() and inode_permission(). These
     helpers in turn are called in various filesystem's permission inode
     operation. So passing a dentry argument to the old get acl inode
     operation would amount to passing a dentry to the permission inode
     operation which we shouldn't and probably can't do.

     So instead of extending the existing inode operation Christoph
     suggested to add a new one. He also requested to ensure that the
     get and set acl inode operation taking a dentry are consistently
     named. So for this version the old get acl operation is renamed to
     ->get_inode_acl() and a new ->get_acl() inode operation taking a
     dentry is added. With this we can give both 9p and cifs get and set
     acl inode operations and in turn remove their complex custom posix
     xattr handlers.

     In the future I hope to get rid of the inode method duplication but
     it isn't like we have never had this situation. Readdir is just one
     example. And frankly, the overall gain in type safety and the more
     pleasant api wise are simply too big of a benefit to not accept
     this duplication for a while.

   - We've done a full audit of every codepaths using variant of the
     current generic xattr api to get and set posix acls and
     surprisingly it isn't that many places. There's of course always a
     chance that we might have missed some and if so I'm sure we'll find
     them soon enough.

     The crucial codepaths to be converted are obviously stacking
     filesystems such as ecryptfs and overlayfs.

     For a list of all callers currently using generic xattr api helpers
     see [2] including comments whether they support posix acls or not.

   - The old vfs generic posix acl infrastructure doesn't obey the
     create and replace semantics promised on the setxattr(2) manpage.
     This patch series doesn't address this. It really is something we
     should revisit later though.

  The patches are roughly organized as follows:

   (1) Change existing set acl inode operation to take a dentry
       argument (Intended to be a non-functional change)

   (2) Rename existing get acl method (Intended to be a non-functional
       change)

   (3) Implement get and set acl inode operations for filesystems that
       couldn't implement one before because of the missing dentry.
       That's mostly 9p and cifs (Intended to be a non-functional
       change)

   (4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(),
       and vfs_set_acl() including security and integrity hooks
       (Intended to be a non-functional change)

   (5) Implement get and set acl inode operations for stacking
       filesystems (Intended to be a non-functional change)

   (6) Switch posix acl handling in stacking filesystems to new posix
       acl api now that all filesystems it can stack upon support it.

   (7) Switch vfs to new posix acl api (semantical change)

   (8) Remove all now unused helpers

   (9) Additional regression fixes reported after we merged this into
       linux-next

  Thanks to Seth for a lot of good discussion around this and
  encouragement and input from Christoph"

* tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits)
  posix_acl: Fix the type of sentinel in get_acl
  orangefs: fix mode handling
  ovl: call posix_acl_release() after error checking
  evm: remove dead code in evm_inode_set_acl()
  cifs: check whether acl is valid early
  acl: make vfs_posix_acl_to_xattr() static
  acl: remove a slew of now unused helpers
  9p: use stub posix acl handlers
  cifs: use stub posix acl handlers
  ovl: use stub posix acl handlers
  ecryptfs: use stub posix acl handlers
  evm: remove evm_xattr_acl_change()
  xattr: use posix acl api
  ovl: use posix acl api
  ovl: implement set acl method
  ovl: implement get acl method
  ecryptfs: implement set acl method
  ecryptfs: implement get acl method
  ksmbd: use vfs_remove_acl()
  acl: add vfs_remove_acl()
  ...
2022-12-12 18:46:39 -08:00
Shiyang Ruan 480017957d xfs: remove restrictions for fsdax and reflink
Since the basic function for fsdax and reflink has been implemented,
remove the restrictions of them for widly test.

Link: https://lkml.kernel.org/r/1669908773-207-1-git-send-email-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11 18:12:17 -08:00
Shiyang Ruan d984648e42 fsdax,xfs: port unshare to fsdax
Implement unshare in fsdax mode: copy data from srcmap to iomap.

Link: https://lkml.kernel.org/r/1669908753-169-1-git-send-email-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11 18:12:17 -08:00
Shiyang Ruan 64e6edc185 xfs: use dax ops for zero and truncate in fsdax mode
Zero and truncate on a dax file may execute CoW.  So use dax ops which
contains end work for CoW.

Link: https://lkml.kernel.org/r/1669908730-131-1-git-send-email-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11 18:12:16 -08:00
Shiyang Ruan c6f0b395b2 fsdax,xfs: set the shared flag when file extent is shared
If a dax page is shared, mapread at different offsets can also trigger
page fault on same dax page.  So, change the flag from "cow" to "shared". 
And get the shared flag from filesystem when read.

Link: https://lkml.kernel.org/r/1669908538-55-5-git-send-email-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11 18:12:16 -08:00
Dave Chinner 52f31ed228 xfs: dquot shrinker doesn't check for XFS_DQFLAG_FREEING
Resulting in a UAF if the shrinker races with some other dquot
freeing mechanism that sets XFS_DQFLAG_FREEING before the dquot is
removed from the LRU. This can occur if a dquot purge races with
drop_caches.

Reported-by: syzbot+912776840162c13db1a3@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-12-08 08:29:58 -08:00
Yang Li 1f5619ed88 xfs: Remove duplicated include in xfs_iomap.c
./fs/xfs/xfs_iomap.c: xfs_error.h is included more than once.
./fs/xfs/xfs_iomap.c: xfs_errortag.h is included more than once.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3337
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-12-04 09:42:59 -08:00
Darrick J. Wong ddfdd530e4 xfs: invalidate xfs_bufs when allocating cow extents
While investigating test failures in xfs/17[1-3] in alwayscow mode, I
noticed through code inspection that xfs_bmap_alloc_userdata isn't
setting XFS_ALLOC_USERDATA when allocating extents for a file's CoW
fork.  COW staging extents should be flagged as USERDATA, since user
data are persisted to these blocks before being remapped into a file.

This mis-classification has a few impacts on the behavior of the system.
First, the filestreams allocator is supposed to keep allocating from a
chosen AG until it runs out of space in that AG.  However, it only does
that for USERDATA allocations, which means that COW allocations aren't
tied to the filestreams AG.  Fortunately, few people use filestreams, so
nobody's noticed.

A more serious problem is that xfs_alloc_ag_vextent_small looks for a
buffer to invalidate *if* the USERDATA flag is set and the AG is so full
that the allocation had to come from the AGFL because the cntbt is
empty.  The consequences of not invalidating the buffer are severe --
if the AIL incorrectly checkpoints a buffer that is now being used to
store user data, that action will clobber the user's written data.

Fix filestreams and yet another data corruption vector by flagging COW
allocations as USERDATA.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-12-01 09:36:16 -08:00
Guo Xuenan 8c25febf23 xfs: get rid of assert from xfs_btree_islastblock
xfs_btree_check_block contains debugging knobs. With XFS_DEBUG setting up,
turn on the debugging knob can trigger the assert of xfs_btree_islastblock,
test script as follows:

while true
do
    mount $disk $mountpoint
    fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null
    echo 1 > /sys/fs/xfs/sda/errortag/btree_chk_sblk
    sleep 10
    umount $mountpoint
done

Kick off fsstress and only *then* turn on the debugging knob. If it
happens that the knob gets turned on after the cntbt lookup succeeds
but before the call to xfs_btree_islastblock, then we *can* end up in
the situation where a previously checked btree block suddenly starts
returning EFSCORRUPTED from xfs_btree_check_block. Kaboom.

Darrick give a very detailed explanation as follows:
Looking back at commit 27d9ee577d, I think the point of all this was
to make sure that the cursor has actually performed a lookup, and that
the btree block at whatever level we're asking about is ok.

If the caller hasn't ever done a lookup, the bc_levels array will be
empty, so cur->bc_levels[level].bp pointer will be NULL.  The call to
xfs_btree_get_block will crash anyway, so the "ASSERT(block);" part is
pointless.

If the caller did a lookup but the lookup failed due to block
corruption, the corresponding cur->bc_levels[level].bp pointer will also
be NULL, and we'll still crash.  The "ASSERT(xfs_btree_check_block);"
logic is also unnecessary.

If the cursor level points to an inode root, the block buffer will be
incore, so it had better always be consistent.

If the caller ignores a failed lookup after a successful one and calls
this function, the cursor state is garbage and the assert wouldn't have
tripped anyway. So get rid of the assert.

Fixes: 27d9ee577d ("xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock")
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-12-01 09:36:16 -08:00
Darrick J. Wong 948961964b xfs: fix broken MAXREFCOUNT handling
This series fixes a bug in the refcount code where we don't merge
 records correctly if the refcount is hovering around MAXREFCOUNT.  This
 fixes regressions in xfs/179 when fsdax is enabled.  xfs/179 itself will
 be modified to exploit the bug through the pagecache path.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmOI5VUACgkQ+H93GTRK
 tOsf+A/9GsOyXhF2dAPt2zvDpoiHvR/y42/LQUoKah2cxUXOhfPIu53xoRgNgNOh
 1O3D1s9HjudI1Bo6/R+xeijq/upKGmKmOJL52vULy8UqsbDLvV8RHQ5YfHbIIghW
 HiYsD3aX5qhTKr7KHMrJzZcokSTAIGgIAT3QK7X0BrWUSXK12BN53A5D3XSFEMPu
 K1rws+XEZkOlgNti1DqTrRa6OdU+Y0CKOobNciNGTEBi6Gg4QFaudoobrBAYGNpQ
 R2icLAJG2lxlv3VRr2GqfgZ4XFh+q8TmXfb0tF+/rS/XWeDoElci+t4zpyDyeetH
 H4IdfXuiJMw9UUj8h7N299qRYlUFhfqfwMs/9FBU+W8UTV9ke6796r+/3ARcUino
 GoHwNpXEQZHatU5zVj1v2/6yVOCLapoz1WHa3pkbvwBGPKvWS2DGs0Mx5W28b/QE
 M/uRj/CL4re0lEqsnNHu0J2tMiLb5A33pkdAZTfsoOwXdjpZdiXCuUHJzD71UDb8
 hePXm5c9XBSSLIV44bpgBCDqu7acybENRUXJ2+1kBebxaKEpH8nrPKIuHRzX2Ulh
 KfCZwwMLR24AYYHcFRsb8p4QtBaSu0jF4WJQJoggSsHErQfDdNDfTr5nL86kF+3P
 zp9/8WOi4ZfRGAMOS/kTjZRFnG+hDE+EqvAzuNyf9vlytF84KZU=
 =qgYH
 -----END PGP SIGNATURE-----

Merge tag 'maxrefcount-fixes-6.2_2022-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeD

xfs: fix broken MAXREFCOUNT handling

This series fixes a bug in the refcount code where we don't merge
records correctly if the refcount is hovering around MAXREFCOUNT.  This
fixes regressions in xfs/179 when fsdax is enabled.  xfs/179 itself will
be modified to exploit the bug through the pagecache path.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'maxrefcount-fixes-6.2_2022-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: estimate post-merge refcounts correctly
  xfs: hoist refcount record merge predicates
2022-12-01 09:35:52 -08:00
Darrick J. Wong b25d1984aa xfs: estimate post-merge refcounts correctly
Upon enabling fsdax + reflink for XFS, xfs/179 began to report refcount
metadata corruptions after being run.  Specifically, xfs_repair noticed
single-block refcount records that could be combined but had not been.

The root cause of this is improper MAXREFCOUNT edge case handling in
xfs_refcount_merge_extents.  When we're trying to find candidates for a
refcount btree record merge, we compute the refcount attribute of the
merged record, but we fail to account for the fact that once a record
hits rc_refcount == MAXREFCOUNT, it is pinned that way forever.  Hence
the computed refcount is wrong, and we fail to merge the extents.

Fix this by adjusting the merge predicates to compute the adjusted
refcount correctly.

Fixes: 3172725814 ("xfs: adjust refcount of an extent of blocks in refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
2022-12-01 09:32:04 -08:00
Darrick J. Wong 9d720a5a65 xfs: hoist refcount record merge predicates
Hoist these multiline conditionals into separate static inline helpers
to improve readability and set the stage for corruption fixes that will
be introduced in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
2022-12-01 09:32:04 -08:00
Guo Xuenan 575689fc0f xfs: fix super block buf log item UAF during force shutdown
xfs log io error will trigger xlog shut down, and end_io worker call
xlog_state_shutdown_callbacks to unpin and release the buf log item.
The race condition is that when there are some thread doing transaction
commit and happened not to be intercepted by xlog_is_shutdown, then,
these log item will be insert into CIL, when unpin and release these
buf log item, UAF will occur. BTW, add delay before `xlog_cil_commit`
can increase recurrence probability.

The following call graph actually encountered this bad situation.
fsstress                    io end worker kworker/0:1H-216
                            xlog_ioend_work
                              ->xlog_force_shutdown
                                ->xlog_state_shutdown_callbacks
                                  ->xlog_cil_process_committed
                                    ->xlog_cil_committed
                                      ->xfs_trans_committed_bulk
->xfs_trans_apply_sb_deltas             ->li_ops->iop_unpin(lip, 1);
  ->xfs_trans_getsb
    ->_xfs_trans_bjoin
      ->xfs_buf_item_init
        ->if (bip) { return 0;} //relog
->xlog_cil_commit
  ->xlog_cil_insert_items //insert into CIL
                                           ->xfs_buf_ioend_fail(bp);
                                             ->xfs_buf_ioend
                                               ->xfs_buf_item_done
                                                 ->xfs_buf_item_relse
                                                   ->xfs_buf_item_free

when cil push worker gather percpu cil and insert super block buf log item
into ctx->log_items then uaf occurs.

==================================================================
BUG: KASAN: use-after-free in xlog_cil_push_work+0x1c8f/0x22f0
Write of size 8 at addr ffff88801800f3f0 by task kworker/u4:4/105

CPU: 0 PID: 105 Comm: kworker/u4:4 Tainted: G W
6.1.0-rc1-00001-g274115149b42 #136
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Workqueue: xfs-cil/sda xlog_cil_push_work
Call Trace:
 <TASK>
 dump_stack_lvl+0x4d/0x66
 print_report+0x171/0x4a6
 kasan_report+0xb3/0x130
 xlog_cil_push_work+0x1c8f/0x22f0
 process_one_work+0x6f9/0xf70
 worker_thread+0x578/0xf30
 kthread+0x28c/0x330
 ret_from_fork+0x1f/0x30
 </TASK>

Allocated by task 2145:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 __kasan_slab_alloc+0x54/0x60
 kmem_cache_alloc+0x14a/0x510
 xfs_buf_item_init+0x160/0x6d0
 _xfs_trans_bjoin+0x7f/0x2e0
 xfs_trans_getsb+0xb6/0x3f0
 xfs_trans_apply_sb_deltas+0x1f/0x8c0
 __xfs_trans_commit+0xa25/0xe10
 xfs_symlink+0xe23/0x1660
 xfs_vn_symlink+0x157/0x280
 vfs_symlink+0x491/0x790
 do_symlinkat+0x128/0x220
 __x64_sys_symlink+0x7a/0x90
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 216:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 kasan_save_free_info+0x2a/0x40
 __kasan_slab_free+0x105/0x1a0
 kmem_cache_free+0xb6/0x460
 xfs_buf_ioend+0x1e9/0x11f0
 xfs_buf_item_unpin+0x3d6/0x840
 xfs_trans_committed_bulk+0x4c2/0x7c0
 xlog_cil_committed+0xab6/0xfb0
 xlog_cil_process_committed+0x117/0x1e0
 xlog_state_shutdown_callbacks+0x208/0x440
 xlog_force_shutdown+0x1b3/0x3a0
 xlog_ioend_work+0xef/0x1d0
 process_one_work+0x6f9/0xf70
 worker_thread+0x578/0xf30
 kthread+0x28c/0x330
 ret_from_fork+0x1f/0x30

The buggy address belongs to the object at ffff88801800f388
 which belongs to the cache xfs_buf_item of size 272
The buggy address is located 104 bytes inside of
 272-byte region [ffff88801800f388, ffff88801800f498)

The buggy address belongs to the physical page:
page:ffffea0000600380 refcount:1 mapcount:0 mapping:0000000000000000
index:0xffff88801800f208 pfn:0x1800e
head:ffffea0000600380 order:1 compound_mapcount:0 compound_pincount:0
flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
raw: 001fffff80010200 ffffea0000699788 ffff88801319db50 ffff88800fb50640
raw: ffff88801800f208 000000000015000a 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88801800f280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88801800f300: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88801800f380: fc fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                             ^
 ffff88801800f400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88801800f480: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint

Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-30 09:25:46 -08:00
Guo Xuenan 1eb52a6a71 xfs: wait iclog complete before tearing down AIL
Fix uaf in xfs_trans_ail_delete during xlog force shutdown.
In commit cd6f79d1fb ("xfs: run callbacks before waking waiters in
xlog_state_shutdown_callbacks") changed the order of running callbacks
and wait for iclog completion to avoid unmount path untimely destroy AIL.
But which seems not enough to ensue this, adding mdelay in
`xfs_buf_item_unpin` can prove that.

The reproduction is as follows. To ensure destroy AIL safely,
we should wait all xlog ioend workers done and sync the AIL.

==================================================================
BUG: KASAN: use-after-free in xfs_trans_ail_delete+0x240/0x2a0
Read of size 8 at addr ffff888023169400 by task kworker/1:1H/43

CPU: 1 PID: 43 Comm: kworker/1:1H Tainted: G        W
6.1.0-rc1-00002-gc28266863c4a #137
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Workqueue: xfs-log/sda xlog_ioend_work
Call Trace:
 <TASK>
 dump_stack_lvl+0x4d/0x66
 print_report+0x171/0x4a6
 kasan_report+0xb3/0x130
 xfs_trans_ail_delete+0x240/0x2a0
 xfs_buf_item_done+0x7b/0xa0
 xfs_buf_ioend+0x1e9/0x11f0
 xfs_buf_item_unpin+0x4c8/0x860
 xfs_trans_committed_bulk+0x4c2/0x7c0
 xlog_cil_committed+0xab6/0xfb0
 xlog_cil_process_committed+0x117/0x1e0
 xlog_state_shutdown_callbacks+0x208/0x440
 xlog_force_shutdown+0x1b3/0x3a0
 xlog_ioend_work+0xef/0x1d0
 process_one_work+0x6f9/0xf70
 worker_thread+0x578/0xf30
 kthread+0x28c/0x330
 ret_from_fork+0x1f/0x30
 </TASK>

Allocated by task 9606:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 __kasan_kmalloc+0x7a/0x90
 __kmalloc+0x59/0x140
 kmem_alloc+0xb2/0x2f0
 xfs_trans_ail_init+0x20/0x320
 xfs_log_mount+0x37e/0x690
 xfs_mountfs+0xe36/0x1b40
 xfs_fs_fill_super+0xc5c/0x1a70
 get_tree_bdev+0x3c5/0x6c0
 vfs_get_tree+0x85/0x250
 path_mount+0xec3/0x1830
 do_mount+0xef/0x110
 __x64_sys_mount+0x150/0x1f0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 9662:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 kasan_save_free_info+0x2a/0x40
 __kasan_slab_free+0x105/0x1a0
 __kmem_cache_free+0x99/0x2d0
 kvfree+0x3a/0x40
 xfs_log_unmount+0x60/0xf0
 xfs_unmountfs+0xf3/0x1d0
 xfs_fs_put_super+0x78/0x300
 generic_shutdown_super+0x151/0x400
 kill_block_super+0x9a/0xe0
 deactivate_locked_super+0x82/0xe0
 deactivate_super+0x91/0xb0
 cleanup_mnt+0x32a/0x4a0
 task_work_run+0x15f/0x240
 exit_to_user_mode_prepare+0x188/0x190
 syscall_exit_to_user_mode+0x12/0x30
 do_syscall_64+0x42/0x80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The buggy address belongs to the object at ffff888023169400
 which belongs to the cache kmalloc-128 of size 128
The buggy address is located 0 bytes inside of
 128-byte region [ffff888023169400, ffff888023169480)

The buggy address belongs to the physical page:
page:ffffea00008c5a00 refcount:1 mapcount:0 mapping:0000000000000000
index:0xffff888023168f80 pfn:0x23168
head:ffffea00008c5a00 order:1 compound_mapcount:0 compound_pincount:0
flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
raw: 001fffff80010200 ffffea00006b3988 ffffea0000577a88 ffff88800f842ac0
raw: ffff888023168f80 0000000000150007 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888023169300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff888023169380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888023169400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff888023169480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff888023169500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint

Fixes: cd6f79d1fb ("xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks")
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-30 09:25:46 -08:00
Darrick J. Wong 4c6dbfd275 xfs: attach dquots to inode before reading data/cow fork mappings
I've been running near-continuous integration testing of online fsck,
and I've noticed that once a day, one of the ARM VMs will fail the test
with out of order records in the data fork.

xfs/804 races fsstress with online scrub (aka scan but do not change
anything), so I think this might be a bug in the core xfs code.  This
also only seems to trigger if one runs the test for more than ~6 minutes
via TIME_FACTOR=13 or something.
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf

I added a debugging patch to the kernel to check the data fork extents
after taking the ILOCK, before dropping ILOCK, and before and after each
bmapping operation.  So far I've narrowed it down to the delalloc code
inserting a record in the wrong place in the iext tree:

xfs_bmap_add_extent_hole_delay, near line 2691:

	case 0:
		/*
		 * New allocation is not contiguous with another
		 * delayed allocation.
		 * Insert a new entry.
		 */
		oldlen = newlen = 0;
		xfs_iunlock_check_datafork(ip);		<-- ok here
		xfs_iext_insert(ip, icur, new, state);
		xfs_iunlock_check_datafork(ip);		<-- bad here
		break;
	}

I recorded the state of the data fork mappings and iext cursor state
when a corrupt data fork is detected immediately after the
xfs_bmap_add_extent_hole_delay call in xfs_bmapi_reserve_delalloc:

ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork:
    ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1
    ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0
    ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0

Here we see that a delalloc extent was inserted into the wrong position
in the iext leaf, same as all the other times.  The extra trace data I
collected are as follows:

ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000
    ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0
    ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0
    ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0
    ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0
    ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00
    ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00
    ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00
    ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00

The first line shows that xfs_bmapi_reserve_delalloc was called with
whichfork=XFS_DATA_FORK, off=0xe6, len=0x4, prealloc=6.

The second line ("oldgot") shows the contents of @got at the beginning
of the call, which are the results of the first iext lookup in
xfs_buffered_write_iomap_begin.

Line 3 ("crapgot") is the result of duplicating the cursor at the start
of the body of xfs_bmapi_reserve_delalloc and performing a fresh lookup
at @off.

Line 4 ("freshgot") is the result of a new xfs_iext_get_extent right
before the call to xfs_bmap_add_extent_hole_delay.  Totally garbage.

Line 5 ("nowgot") is contents of @got after the
xfs_bmap_add_extent_hole_delay call.

Line 6 is the contents of @icur at the beginning fo the call.  Lines 7-9
are the contents of the iext cursors at the point where the block
mappings were sampled.

I think @oldgot is a HOLESTARTBLOCK extent because the first lookup
didn't find anything, so we filled in imap with "fake hole until the
end".  At the time of the first lookup, I suspect that there's only one
32-block unwritten extent in the mapping (hence oldicurpos==1) but by
the time we get to recording crapgot, crapicurpos==2.

Dave then added:

Ok, that's much simpler to reason about, and implies the smoke is
coming from xfs_buffered_write_iomap_begin() or
xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot
of stuff with the ILOCK_EXCL held.....

.... including calling xfs_qm_dqattach_locked().

xfs_buffered_write_iomap_begin
  ILOCK_EXCL
  look up icur
  xfs_qm_dqattach_locked
    xfs_qm_dqattach_one
      xfs_qm_dqget_inode
        dquot cache miss
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
        error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
        xfs_ilock(ip, XFS_ILOCK_EXCL);
  ....
  xfs_bmapi_reserve_delalloc(icur)

Yup, that's what is letting the magic smoke out -
xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we
can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all
goes downhill from there.

Back to Darrick now:

So.  Fix this by moving the dqattach_locked call up before we take the
ILOCK, like all the other callers in that file.

Fixes: a526c85c22 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-30 08:55:18 -08:00
Darrick J. Wong e5827a007a xfs: shut up -Wuninitialized in xfsaild_push
-Wuninitialized complains about @target in xfsaild_push being
uninitialized in the case where the waitqueue is active but there is no
last item in the AIL to wait for.  I /think/ it should never be the case
that the subsequent xfs_trans_ail_cursor_first returns a log item and
hence we'll never end up at XFS_LSN_CMP, but let's make this explicit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-30 08:55:18 -08:00
Darrick J. Wong fd5beaff25 xfs: use memcpy, not strncpy, to format the attr prefix during listxattr
When -Wstringop-truncation is enabled, the compiler complains about
truncation of the null byte at the end of the xattr name prefix.  This
is intentional, since we're concatenating the two strings together and
do _not_ want a null byte in the middle of the name.

We've already ensured that the name buffer is long enough to handle
prefix and name, and the prefix_len is supposed to be the length of the
prefix string without the null byte, so use memcpy here instead.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-30 08:55:18 -08:00
Darrick J. Wong 032e160305 xfs: invalidate block device page cache during unmount
Every now and then I see fstests failures on aarch64 (64k pages) that
trigger on the following sequence:

mkfs.xfs $dev
mount $dev $mnt
touch $mnt/a
umount $mnt
xfs_db -c 'path /a' -c 'print' $dev

99% of the time this succeeds, but every now and then xfs_db cannot find
/a and fails.  This turns out to be a race involving udev/blkid, the
page cache for the block device, and the xfs_db process.

udev is triggered whenever anyone closes a block device or unmounts it.
The default udev rules invoke blkid to read the fs super and create
symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
through the page cache.

xfs_db also uses buffered reads to examine metadata.  There is no
coordination between xfs_db and udev, which means that they can run
concurrently.  Note there is no coordination between the kernel and
blkid either.

On a system with 64k pages, the page cache can cache the superblock and
the root inode (and hence the root dir) with the same 64k page.  If
udev spawns blkid after the mkfs and the system is busy enough that it
is still running when xfs_db starts up, they'll both read from the same
page in the pagecache.

The unmount writes updated inode metadata to disk directly.  The XFS
buffer cache does not use the bdev pagecache, nor does it invalidate the
pagecache on umount.  If the above scenario occurs, the pagecache no
longer reflects what's on disk, xfs_db reads the stale metadata, and
fails to find /a.  Most of the time this succeeds because closing a bdev
invalidates the page cache, but when processes race, everyone loses.

Fix the problem by invalidating the bdev pagecache after flushing the
bdev, so that xfs_db will see up to date metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-30 08:55:18 -08:00
Darrick J. Wong 254e345928 xfs: add debug knob to slow down write for fun
Add a new error injection knob so that we can arbitrarily slow down
pagecache writes to test for race conditions and aberrant reclaim
behavior if the writeback mechanisms are slow to issue writeback.  This
will enable functional testing for the ifork sequence counters
introduced in commit 304a68b9c6 ("xfs: use iomap_valid method to
detect stale cached iomaps") that fixes write racing with reclaim
writeback.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-28 17:54:49 -08:00
Darrick J. Wong c2beff99eb xfs: add debug knob to slow down writeback for fun
Add a new error injection knob so that we can arbitrarily slow down
writeback to test for race conditions and aberrant reclaim behavior if
the writeback mechanisms are slow to issue writeback.  This will enable
functional testing for the ifork sequence counters introduced in commit
745b3f76d1 ("xfs: maintain a sequence count for inode fork
manipulations").

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-28 17:24:35 -08:00
Darrick J. Wong 7dd73802f9 xfs, iomap: fix data corruption due to stale cached iomaps
This patch series fixes a data corruption that occurs in a specific
 multi-threaded write workload. The workload combined
 racing unaligned adjacent buffered writes with low memory conditions
 that caused both writeback and memory reclaim to race with the
 writes.
 
 The result of this was random partial blocks containing zeroes
 instead of the correct data.  The underlying problem is that iomap
 caches the write iomap for the duration of the write() operation,
 but it fails to take into account that the extent underlying the
 iomap can change whilst the write is in progress.
 
 The short story is that an iomap can span mutliple folios, and so
 under low memory writeback can be cleaning folios the write()
 overlaps. Whilst the overlapping data is cached in memory, this
 isn't a problem, but because the folios are now clean they can be
 reclaimed. Once reclaimed, the write() does the wrong thing when
 re-instantiating partial folios because the iomap no longer reflects
 the underlying state of the extent. e.g. it thinks the extent is
 unwritten, so it zeroes the partial range, when in fact the
 underlying extent is now written and so it should have read the data
 from disk.  This is how we get random zero ranges in the file
 instead of the correct data.
 
 The gory details of the race condition can be found here:
 
 https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
 
 Fixing the problem has two aspects. The first aspect of the problem
 is ensuring that iomap can detect a stale cached iomap during a
 write in a race-free manner. We already do this stale iomap
 detection in the writeback path, so we have a mechanism for
 detecting that the iomap backing the data range may have changed
 and needs to be remapped.
 
 In the case of the write() path, we have to ensure that the iomap is
 validated at a point in time when the page cache is stable and
 cannot be reclaimed from under us. We also need to validate the
 extent before we start performing any modifications to the folio
 state or contents. Combine these two requirements together, and the
 only "safe" place to validate the iomap is after we have looked up
 and locked the folio we are going to copy the data into, but before
 we've performed any initialisation operations on that folio.
 
 If the iomap fails validation, we then mark it stale, unlock the
 folio and end the write. This effectively means a stale iomap
 results in a short write. Filesystems should already be able to
 handle this, as write operations can end short for many reasons and
 need to iterate through another mapping cycle to be completed. Hence
 the iomap changes needed to detect and handle stale iomaps during
 write() operations is relatively simple...
 
 However, the assumption is that filesystems should already be able
 to handle write failures safely, and that's where the second
 (first?) part of the problem exists. That is, handling a partial
 write is harder than just "punching out the unused delayed
 allocation extent". This is because mmap() based faults can race
 with writes, and if they land in the delalloc region that the write
 allocated, then punching out the delalloc region can cause data
 corruption.
 
 This data corruption problem is exposed by generic/346 when iomap is
 converted to detect stale iomaps during write() operations. Hence
 write failure in the filesytems needs to handle the fact that the
 write() in progress doesn't necessarily own the data in the page
 cache over the range of the delalloc extent it just allocated.
 
 As a result, we can't just truncate the page cache over the range
 the write() didn't reach and punch all the delalloc extent. We have
 to walk the page cache over the untouched range and skip over any
 dirty data region in the cache in that range. Which is ....
 non-trivial.
 
 That is, iterating the page cache has to handle partially populated
 folios (i.e. block size < page size) that contain data. The data
 might be discontiguous within a folio. Indeed, there might be
 *multiple* discontiguous data regions within a single folio. And to
 make matters more complex, multi-page folios mean we just don't know
 how many sub-folio regions we might have to iterate to find all
 these regions. All the corner cases between the conversions and
 rounding between filesystem block size, folio size and multi-page
 folio size combined with unaligned write offsets kept breaking my
 brain.
 
 However, if we convert the code to track the processed
 write regions by byte ranges instead of fileystem block or page
 cache index, we could simply use mapping_seek_hole_data() to find
 the start and end of each discrete data region within the range we
 needed to scan. SEEK_DATA finds the start of the cached data region,
 SEEK_HOLE finds the end of the region. These are byte based
 interfaces that understand partially uptodate folio regions, and so
 can iterate discrete sub-folio data regions directly. This largely
 solved the problem of discovering the dirty regions we need to keep
 the delalloc extent over.
 
 However, to use mapping_seek_hole_data() without needing to export
 it, we have to move all the delalloc extent cleanup to the iomap
 core and so now the iomap core can clean up delayed allocation
 extents in a safe, sane and filesystem neutral manner.
 
 With all this done, the original data corruption never occurs
 anymore, and we now have a generic mechanism for ensuring that page
 cache writes do not do the wrong thing when writeback and reclaim
 change the state of the physical extent and/or page cache contents
 whilst the write is in progress.
 
 Signed-off-by: Dave Chinner <dchinner@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmOFSzwUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h3djhAAwOf9VeLO7TW/0B1XfE3ktWGiDmEG
 ekB8mkB7CAHB9SBq7TZMHjktJIJxY81q5+Iq9qHGiW3asoVbmWvkeRSJgXljhTby
 D2KsUIT1NK/X6DhC9FhNjv/Q2GJ0nY6s65RLudUEkelYBFhGMM0kdXX+fZmtZ4yT
 T/lRYk/KBFpeQCaGRcFXK55TnB/B9muOI9FyKvh2DNWe6r0Xu3Obb3a9k+snZA9R
 EeUpAosDSrXzP4c2w2ovpU2eutUdo4eYTHIzXKGkhktbRhmCRLn4NlxvFCanoe8h
 eSS85sb8DHUh2iyaaB8yrJ6LL3MuBytOi24rNBeyd1KAyEtT21+cTUK/QAahzble
 pL8l6TA7ZXbhYcbk5uQvFEIAInR+0ffjde61uE14N55awq0Vdrym7C7D2ri60iw6
 ts45AVYKYeF61coAbwvmaJyvqvQ0tUlmVZXI4lQzN2O17Lr6004gJFMjDRsXXU7H
 eHLUt496Geq39rglw85y8G0vmxxGZ9iIGkeC1kUSSCmlvx/JfuJlbWBgyMGtNRBI
 qzv0jmk67Ft1seQSMWQJttxCZs4uOF2gwERYGAF7iUR8F4PGob/N1e2/hpg75G8q
 0S8u1N1p8Cv5u/jwybqy8FnSC2MlUZl6SQURaVQDy2DLMKHb4T1diu0jrCbiSPiF
 JKfQ7aNQxaEZIJw=
 =cv9i
 -----END PGP SIGNATURE-----

Merge tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB

xfs, iomap: fix data corruption due to stale cached iomaps

This patch series fixes a data corruption that occurs in a specific
multi-threaded write workload. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.

The result of this was random partial blocks containing zeroes
instead of the correct data.  The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.

The short story is that an iomap can span mutliple folios, and so
under low memory writeback can be cleaning folios the write()
overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk.  This is how we get random zero ranges in the file
instead of the correct data.

The gory details of the race condition can be found here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

Fixing the problem has two aspects. The first aspect of the problem
is ensuring that iomap can detect a stale cached iomap during a
write in a race-free manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.

In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.

If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations is relatively simple...

However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.

This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure in the filesytems needs to handle the fact that the
write() in progress doesn't necessarily own the data in the page
cache over the range of the delalloc extent it just allocated.

As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch all the delalloc extent. We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.

That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.

However, if we convert the code to track the processed
write regions by byte ranges instead of fileystem block or page
cache index, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.

However, to use mapping_seek_hole_data() without needing to export
it, we have to move all the delalloc extent cleanup to the iomap
core and so now the iomap core can clean up delayed allocation
extents in a safe, sane and filesystem neutral manner.

With all this done, the original data corruption never occurs
anymore, and we now have a generic mechanism for ensuring that page
cache writes do not do the wrong thing when writeback and reclaim
change the state of the physical extent and/or page cache contents
whilst the write is in progress.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: drop write error injection is unfixable, remove it
  xfs: use iomap_valid method to detect stale cached iomaps
  iomap: write iomap validity checks
  xfs: xfs_bmap_punch_delalloc_range() should take a byte range
  iomap: buffered write failure should not truncate the page cache
  xfs,iomap: move delalloc punching to iomap
  xfs: use byte ranges for write cleanup ranges
  xfs: punching delalloc extents on write failure is racy
  xfs: write page faults in iomap are not buffered writes
2022-11-28 17:23:58 -08:00
Dave Chinner 6e8af15ccd xfs: drop write error injection is unfixable, remove it
With the changes to scan the page cache for dirty data to avoid data
corruptions from partial write cleanup racing with other page cache
operations, the drop writes error injection no longer works the same
way it used to and causes xfs/196 to fail. This is because xfs/196
writes to the file and populates the page cache before it turns on
the error injection and starts failing -overwrites-.

The result is that the original drop-writes code failed writes only
-after- overwriting the data in the cache, followed by invalidates
the cached data, then punching out the delalloc extent from under
that data.

On the surface, this looks fine. The problem is that page cache
invalidation *doesn't guarantee that it removes anything from the
page cache* and it doesn't change the dirty state of the folio. When
block size == page size and we do page aligned IO (as xfs/196 does)
everything happens to align perfectly and page cache invalidation
removes the single page folios that span the written data. Hence the
followup delalloc punch pass does not find cached data over that
range and it can punch the extent out.

IOWs, xfs/196 "works" for block size == page size with the new
code. I say "works", because it actually only works for the case
where IO is page aligned, and no data was read from disk before
writes occur. Because the moment we actually read data first, the
readahead code allocates multipage folios and suddenly the
invalidate code goes back to zeroing subfolio ranges without
changing dirty state.

Hence, with multipage folios in play, block size == page size is
functionally identical to block size < page size behaviour, and
drop-writes is manifestly broken w.r.t to this case. Invalidation of
a subfolio range doesn't result in the folio being removed from the
cache, just the range gets zeroed. Hence after we've sequentially
walked over a folio that we've dirtied (via write data) and then
invalidated, we end up with a dirty folio full of zeroed data.

And because the new code skips punching ranges that have dirty
folios covering them, we end up leaving the delalloc range intact
after failing all the writes. Hence failed writes now end up
writing zeroes to disk in the cases where invalidation zeroes folios
rather than removing them from cache.

This is a fundamental change of behaviour that is needed to avoid
the data corruption vectors that exist in the old write fail path,
and it renders the drop-writes injection non-functional and
unworkable as it stands.

As it is, I think the error injection is also now unnecessary, as
partial writes that need delalloc extent are going to be a lot more
common with stale iomap detection in place. Hence this patch removes
the drop-writes error injection completely. xfs/196 can remain for
testing kernels that don't have this data corruption fix, but those
that do will report:

xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-11-29 09:09:17 +11:00
Dave Chinner 304a68b9c6 xfs: use iomap_valid method to detect stale cached iomaps
Now that iomap supports a mechanism to validate cached iomaps for
buffered write operations, hook it up to the XFS buffered write ops
so that we can avoid data corruptions that result from stale cached
iomaps. See:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

or the ->iomap_valid() introduction commit for exact details of the
corruption vector.

The validity cookie we store in the iomap is based on the type of
iomap we return. It is expected that the iomap->flags we set in
xfs_bmbt_to_iomap() is not perturbed by the iomap core and are
returned to us in the iomap passed via the .iomap_valid() callback.
This ensures that the validity cookie is always checking the correct
inode fork sequence numbers to detect potential changes that affect
the extent cached by the iomap.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-11-29 09:09:17 +11:00
Dave Chinner 7348b32233 xfs: xfs_bmap_punch_delalloc_range() should take a byte range
All the callers of xfs_bmap_punch_delalloc_range() jump through
hoops to convert a byte range to filesystem blocks before calling
xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
xfs_bmap_punch_delalloc_range() and have it do the conversion to
filesystem blocks internally.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-11-29 09:09:17 +11:00