Commit Graph

74389 Commits

Author SHA1 Message Date
Jan Kara 15fc69bbbb ext4: make sure quota gets properly shutdown on error
When we hit an error when enabling quotas and setting inode flags, we do
not properly shutdown quota subsystem despite returning error from
Q_QUOTAON quotactl. This can lead to some odd situations like kernel
using quota file while it is still writeable for userspace. Make sure we
properly cleanup the quota subsystem in case of error.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20211007155336.12493-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10 13:25:54 -05:00
Ye Bin 380a0091ca ext4: Fix BUG_ON in ext4_bread when write quota data
We got issue as follows when run syzkaller:
[  167.936972] EXT4-fs error (device loop0): __ext4_remount:6314: comm rep: Abort forced by user
[  167.938306] EXT4-fs (loop0): Remounting filesystem read-only
[  167.981637] Assertion failure in ext4_getblk() at fs/ext4/inode.c:847: '(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) || handle != NULL || create == 0'
[  167.983601] ------------[ cut here ]------------
[  167.984245] kernel BUG at fs/ext4/inode.c:847!
[  167.984882] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[  167.985624] CPU: 7 PID: 2290 Comm: rep Tainted: G    B             5.16.0-rc5-next-20211217+ #123
[  167.986823] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[  167.988590] RIP: 0010:ext4_getblk+0x17e/0x504
[  167.989189] Code: c6 01 74 28 49 c7 c0 a0 a3 5c 9b b9 4f 03 00 00 48 c7 c2 80 9c 5c 9b 48 c7 c6 40 b6 5c 9b 48 c7 c7 20 a4 5c 9b e8 77 e3 fd ff <0f> 0b 8b 04 244
[  167.991679] RSP: 0018:ffff8881736f7398 EFLAGS: 00010282
[  167.992385] RAX: 0000000000000094 RBX: 1ffff1102e6dee75 RCX: 0000000000000000
[  167.993337] RDX: 0000000000000001 RSI: ffffffff9b6e29e0 RDI: ffffed102e6dee66
[  167.994292] RBP: ffff88816a076210 R08: 0000000000000094 R09: ffffed107363fa09
[  167.995252] R10: ffff88839b1fd047 R11: ffffed107363fa08 R12: ffff88816a0761e8
[  167.996205] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000001
[  167.997158] FS:  00007f6a1428c740(0000) GS:ffff88839b000000(0000) knlGS:0000000000000000
[  167.998238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  167.999025] CR2: 00007f6a140716c8 CR3: 0000000133216000 CR4: 00000000000006e0
[  167.999987] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  168.000944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  168.001899] Call Trace:
[  168.002235]  <TASK>
[  168.007167]  ext4_bread+0xd/0x53
[  168.007612]  ext4_quota_write+0x20c/0x5c0
[  168.010457]  write_blk+0x100/0x220
[  168.010944]  remove_free_dqentry+0x1c6/0x440
[  168.011525]  free_dqentry.isra.0+0x565/0x830
[  168.012133]  remove_tree+0x318/0x6d0
[  168.014744]  remove_tree+0x1eb/0x6d0
[  168.017346]  remove_tree+0x1eb/0x6d0
[  168.019969]  remove_tree+0x1eb/0x6d0
[  168.022128]  qtree_release_dquot+0x291/0x340
[  168.023297]  v2_release_dquot+0xce/0x120
[  168.023847]  dquot_release+0x197/0x3e0
[  168.024358]  ext4_release_dquot+0x22a/0x2d0
[  168.024932]  dqput.part.0+0x1c9/0x900
[  168.025430]  __dquot_drop+0x120/0x190
[  168.025942]  ext4_clear_inode+0x86/0x220
[  168.026472]  ext4_evict_inode+0x9e8/0xa22
[  168.028200]  evict+0x29e/0x4f0
[  168.028625]  dispose_list+0x102/0x1f0
[  168.029148]  evict_inodes+0x2c1/0x3e0
[  168.030188]  generic_shutdown_super+0xa4/0x3b0
[  168.030817]  kill_block_super+0x95/0xd0
[  168.031360]  deactivate_locked_super+0x85/0xd0
[  168.031977]  cleanup_mnt+0x2bc/0x480
[  168.033062]  task_work_run+0xd1/0x170
[  168.033565]  do_exit+0xa4f/0x2b50
[  168.037155]  do_group_exit+0xef/0x2d0
[  168.037666]  __x64_sys_exit_group+0x3a/0x50
[  168.038237]  do_syscall_64+0x3b/0x90
[  168.038751]  entry_SYSCALL_64_after_hwframe+0x44/0xae

In order to reproduce this problem, the following conditions need to be met:
1. Ext4 filesystem with no journal;
2. Filesystem image with incorrect quota data;
3. Abort filesystem forced by user;
4. umount filesystem;

As in ext4_quota_write:
...
         if (EXT4_SB(sb)->s_journal && !handle) {
                 ext4_msg(sb, KERN_WARNING, "Quota write (off=%llu, len=%llu)"
                         " cancelled because transaction is not started",
                         (unsigned long long)off, (unsigned long long)len);
                 return -EIO;
         }
...
We only check handle if NULL when filesystem has journal. There is need
check handle if NULL even when filesystem has no journal.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211223015506.297766-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00
Sebastian Andrzej Siewior ab047d516d ext4: destroy ext4_fc_dentry_cachep kmemcache on module removal
The kmemcache for ext4_fc_dentry_cachep remains registered after module
removal.

Destroy ext4_fc_dentry_cachep kmemcache on module removal.

Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211110134640.lyku5vklvdndw6uk@linutronix.de
Link: https://lore.kernel.org/r/YbiK3JetFFl08bd7@linutronix.de
Link: https://lore.kernel.org/r/20211223164436.2628390-1-bigeasy@linutronix.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00
Xin Yin 9725958bb7 ext4: fast commit may miss tracking unwritten range during ftruncate
If use FALLOC_FL_KEEP_SIZE to alloc unwritten range at bottom, the
inode->i_size will not include the unwritten range. When call
ftruncate with fast commit enabled, it will miss to track the
unwritten range.

Change to trace the full range during ftruncate.

Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223032337.5198-3-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00
Xin Yin 0b5b5a62b9 ext4: use ext4_ext_remove_space() for fast commit replay delete range
For now ,we use ext4_punch_hole() during fast commit replay delete range
procedure. But it will be affected by inode->i_size, which may not
correct during fast commit replay procedure. The following test will
failed.

-create & write foo (len 1000K)
-falloc FALLOC_FL_ZERO_RANGE foo (range 400K - 600K)
-create & fsync bar
-falloc FALLOC_FL_PUNCH_HOLE foo (range 300K-500K)
-fsync foo
-crash before a full commit

After the fast_commit reply procedure, the range 400K-500K will not be
removed. Because in this case, when calling ext4_punch_hole() the
inode->i_size is 0, and it just retruns with doing nothing.

Change to use ext4_ext_remove_space() instead of ext4_punch_hole()
to remove blocks of inode directly.

Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223032337.5198-2-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:54 -05:00
Xin Yin 5e4d0eba1c ext4: fix fast commit may miss tracking range for FALLOC_FL_ZERO_RANGE
when call falloc with FALLOC_FL_ZERO_RANGE, to set an range to unwritten,
which has been already initialized. If the range is align to blocksize,
fast commit will not track range for this change.

Also track range for unwritten range in ext4_map_blocks().

Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211221022839.374606-1-yinxin.x@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2022-01-10 13:25:44 -05:00
David Howells 16f2f4e679 nfs: Implement cache I/O by accessing the cache directly
Move NFS to using fscache DIO API instead of the old upstream I/O API as
that has been removed.  This is a stopgap solution as the intention is that
at sometime in the future, the cache will move to using larger blocks and
won't be able to store individual pages in order to deal with the potential
for data corruption due to the backing filesystem being able insert/remove
bridging blocks of zeros into its extent list[1].

NFS then reads and writes cache pages synchronously and one page at a time.

The preferred change would be to use the netfs lib, but the new I/O API can
be used directly.  It's just that as the cache now needs to track data for
itself, caching blocks may exceed page size...

This code is somewhat borrowed from my "fallback I/O" patchset[2].

Changes
=======
ver #3:
 - Restore lost =n fallback for nfs_fscache_release_page()[2].

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: linux-nfs@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/YO17ZNOcq+9PajfQ@mit.edu [1]
Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2]
Link: https://lore.kernel.org/r/163189108292.2509237.12615909591150927232.stgit@warthog.procyon.org.uk/ [2]
Link: https://lore.kernel.org/r/163906981318.143852.17220018647843475985.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967184451.1823006.6450645559828329590.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021577632.640689.11069627070150063812.stgit@warthog.procyon.org.uk/ # v4
2022-01-10 11:53:25 +00:00
Dave Wysochanski a6b5a28eb5 nfs: Convert to new fscache volume/cookie API
Change the nfs filesystem to support fscache's indexing rewrite and
reenable caching in nfs.

The following changes have been made:

 (1) The fscache_netfs struct is no more, and there's no need to register
     the filesystem as a whole.

 (2) The session cookie is now an fscache_volume cookie, allocated with
     fscache_acquire_volume().  That takes three parameters: a string
     representing the "volume" in the index, a string naming the cache to
     use (or NULL) and a u64 that conveys coherency metadata for the
     volume.

     For nfs, I've made it render the volume name string as:

        "nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]"

 (3) The fscache_cookie_def is no more and needed information is passed
     directly to fscache_acquire_cookie().  The cache no longer calls back
     into the filesystem, but rather metadata changes are indicated at
     other times.

     fscache_acquire_cookie() is passed the same keying and coherency
     information as before.

 (4) fscache_enable/disable_cookie() have been removed.

     Call fscache_use_cookie() and fscache_unuse_cookie() when a file is
     opened or closed to prevent a cache file from being culled and to keep
     resources to hand that are needed to do I/O.

     If a file is opened for writing, we invalidate it with
     FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache,
     thereby making it cease caching until all currently open files are
     closed.  This should give the same behaviour as the uptream code.
     Making the cache store local modifications isn't straightforward for
     NFS, so that's left for future patches.

 (5) fscache_invalidate() now needs to be given uptodate auxiliary data and
     a file size.  It also takes a flag to indicate if this was due to a
     DIO write.

 (6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file
     to which a DIO write is made.

 (7) Call fscache_note_page_release() from nfs_release_page().

 (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
     PG_fscache to be cleared.

 (9) The functions to read and write data to/from the cache are stubbed out
     pending a conversion to use netfslib.

Changes
=======
ver #3:
 - Added missing =n fallback for nfs_fscache_release_file()[1][2].

ver #2:
 - Use gfpflags_allow_blocking() rather than using flag directly.
 - fscache_acquire_volume() now returns errors.
 - Remove NFS_INO_FSCACHE as it's no longer used.
 - Need to unuse a cookie on file-release, not inode-clear.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: linux-nfs@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/202112100804.nksO8K4u-lkp@intel.com/ [1]
Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2]
Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4
2022-01-10 11:53:25 +00:00
David Howells 93c846143d 9p: Copy local writes to the cache when writing to the server
When writing to the server from v9fs_vfs_writepage(), copy the data to the
cache object too.

To make this possible, the cookie must have its active users count
incremented when the page is dirtied and kept incremented until we manage
to clean up all the pages.  This allows the writeback to take place after
the last file struct is released.

This is done by taking a use on the cookie in v9fs_set_page_dirty() if we
haven't already done so (controlled by the I_PINNING_FSCACHE_WB flag) and
dropping the pin in v9fs_write_inode() if __writeback_single_inode() clears
all the outstanding dirty pages (conveyed by the unpinned_fscache_wb flag
in the writeback_control struct).

Inode eviction must also clear the flag after truncating away all the
outstanding pages.

In the future this will be handled more gracefully by netfslib.

Changes
=======
ver #3:
 - Canonicalise the coherency data to make it endianness-independent.

ver #2:
 - Fix an unused-var warning due to CONFIG_9P_FSCACHE=n[1].

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819667027.215744.13815687931204222995.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906978015.143852.10646669694345706328.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967180760.1823006.5831751873616248910.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021574522.640689.13849966660182529125.stgit@warthog.procyon.org.uk/ # v4
2022-01-10 11:53:25 +00:00
David Howells 24e42e32d3 9p: Use fscache indexing rewrite and reenable caching
Change the 9p filesystem to take account of the changes to fscache's
indexing rewrite and reenable caching in 9p.

The following changes have been made:

 (1) The fscache_netfs struct is no more, and there's no need to register
     the filesystem as a whole.

 (2) The session cookie is now an fscache_volume cookie, allocated with
     fscache_acquire_volume().  That takes three parameters: a string
     representing the "volume" in the index, a string naming the cache to
     use (or NULL) and a u64 that conveys coherency metadata for the
     volume.

     For 9p, I've made it render the volume name string as:

	"9p,<devname>,<cachetag>"

     where the cachetag is replaced by the aname if it wasn't supplied.

     This probably needs rethinking a bit as the aname can have slashes in
     it.  It might be better to hash the cachetag and use the hash or I
     could substitute commas for the slashes or something.

 (3) The fscache_cookie_def is no more and needed information is passed
     directly to fscache_acquire_cookie().  The cache no longer calls back
     into the filesystem, but rather metadata changes are indicated at
     other times.

     fscache_acquire_cookie() is passed the same keying and coherency
     information as before.

 (4) The functions to set/reset/flush cookies are removed and
     fscache_use_cookie() and fscache_unuse_cookie() are used instead.

     fscache_use_cookie() is passed a flag to indicate if the cookie is
     opened for writing.  fscache_unuse_cookie() is passed updates for the
     metadata if we changed it (ie. if the file was opened for writing).

     These are called when the file is opened or closed.

 (5) wait_on_page_bit[_killable]() is replaced with the specific wait
     functions for the bits waited upon.

 (6) I've got rid of some of the 9p-specific cache helper functions and
     called things like fscache_relinquish_cookie() directly as they'll
     optimise away if v9fs_inode_cookie() returns an unconditional NULL
     (which will be the case if CONFIG_9P_FSCACHE=n).

 (7) v9fs_vfs_setattr() is made to call fscache_resize() to change the size
     of the cache object.

Notes:

 (A) We should call fscache_invalidate() if we detect that the server's
     copy of a file got changed by a third party, but I don't know where to
     do that.  We don't need to do that when allocating the cookie as we
     get a check-and-invalidate when we initially bind to the cache object.

 (B) The copy-to-cache-on-writeback side of things will be handled in
     separate patch.

Changes
=======
ver #3:
 - Canonicalise the cookie key and coherency data to make them
   endianness-independent.

ver #2:
 - Use gfpflags_allow_blocking() rather than using flag directly.
 - fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819664645.215744.1555314582005286846.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906975017.143852.3459573173204394039.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967178512.1823006.17377493641569138183.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021573143.640689.3977487095697717967.stgit@warthog.procyon.org.uk/ # v4
2022-01-10 11:53:06 +00:00
Pavel Begunkov 3cc7fdb9f9 io_uring: fix not released cached task refs
tctx_task_work() may get run after io_uring cancellation and so there
will be no one to put cached in tctx task refs that may have been added
back by tw handlers using inline completion infra, Call
io_uring_drop_tctx_refs() at the end of the main tw handler to release
them.

Cc: stable@vger.kernel.org # 5.15+
Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Fixes: e98e49b2bb ("io_uring: extend task put optimisations")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/69f226b35fbdb996ab799a8bbc1c06bf634ccec1.1641688805.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-09 09:22:49 -07:00
Linus Torvalds 622e42a674 Fixes for 5.16-rc8:
- Make the old ALLOCSP ioctl behave in a consistent manner with
    newer syscalls like fallocate.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHDpfcACgkQ+H93GTRK
 tOuI9A/9HAdOYymXZMcYrFxbxKrzgjnX+2dYAJHyko9ZT2tFl512xfGE7hA50V1w
 r6FIPCOHXmuIT82kOMs+0MwqY61tJAMzkGB3N2xX1o99bl4enQ7/gqkAaMWx/0YS
 H11h619EdcnZPOZhgH9vtLkWdAW4muhvmmNHnLvsv8XHinDkIOLR4fepa5zbbY/b
 fkGH/OHuJ8an2scdvgPQzov8jd76jPmLT8fd1hu34WFNkReBmpllU1ATAz1pUHpZ
 EvIHfUmrlHTtGrGaa5Ti8bgY0ppKIfrv/7DxRmCaSmc8K8cqBtjSM4FbCOZJKeiN
 Ajn4Rru86KMD0Ker+1aLRZchY+j0S4NQH2blCqeUL3DfIcZ+qrnOEqOo8VG8slBV
 GD7GKAuGL9YPQazLnrENVnl2v3r9IR83WXpO5htDom+wCcwKkI7lXLKiyrHQOYM6
 spcLMfa+XUudyLyFkXZc4hCHh234B+CeEBOWQtcEOZsC+HhILLTowrTJ9m53H7v2
 AjCON+6CYArP/6WlW1WaeGeZhZFq0y+HNplcu46R8ccnrZhv8UnbuUH5UQC65ktj
 nWwD3T+o9toJqzsz02Kg9zseAtM4ungfrc/QMMYKIUQH7XMYAXePhr7t890NU5sZ
 HPXfGSAxzV7XefG/rqe8yu0ZLR+OtKb16d1Cl/B4Py6TkLVj4G0=
 =8ha9
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.16-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:

 - Make the old ALLOCSP ioctl behave in a consistent manner with newer
   syscalls like fallocate.

* tag 'xfs-5.16-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate
2022-01-08 10:56:47 -08:00
Matthew Wilcox (Oracle) 338f379cf7 fs: Convert vfs_dedupe_file_range_compare to folios
We still only operate on a single page of data at a time due to using
kmap().  A more complex implementation would work on each page in a folio,
but it's not clear that such a complex implementation would be worthwhile.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08 00:28:41 -05:00
Matthew Wilcox (Oracle) 51dcbdac28 mm: Convert find_lock_entries() to use a folio_batch
find_lock_entries() already only returned the head page of folios, so
convert it to return a folio_batch instead of a pagevec.  That cascades
through converting truncate_inode_pages_range() to
delete_from_page_cache_batch() and page_cache_delete_batch().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08 00:28:41 -05:00
David Howells 0770bd4187 afs: Skip truncation on the server of data we haven't written yet
Don't send a truncation RPC to the server if we're only shortening data
that's in the pagecache and is beyond the server's EOF.

Also don't automatically force writeback on setattr, but do wait to store
RPCs that are in the region to be removed on a shortening truncation.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing@auristor.com
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163819663275.215744.4781075713714590913.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906972600.143852.14237659724463048094.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967177522.1823006.15336589054269480601.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021571880.640689.1837025861707111004.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:44:56 +00:00
David Howells c7f75ef33b afs: Copy local writes to the cache when writing to the server
When writing to the server from afs_writepage() or afs_writepages(), copy
the data to the cache object too.

To make this possible, the cookie must have its active users count
incremented when the page is dirtied and kept incremented until we manage
to clean up all the pages.  This allows the writeback to take place after
the last file struct is released.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing@auristor.com
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819662333.215744.7531373404219224438.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970998.143852.674420788614608063.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967176564.1823006.16666056085593949570.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021570208.640689.9193494979708031862.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:44:52 +00:00
David Howells 523d27cda1 afs: Convert afs to use the new fscache API
Change the afs filesystem to support the new afs driver.

The following changes have been made:

 (1) The fscache_netfs struct is no more, and there's no need to register
     the filesystem as a whole.  There's also no longer a cell cookie.

 (2) The volume cookie is now an fscache_volume cookie, allocated with
     fscache_acquire_volume().  This function takes three parameters: a
     string representing the "volume" in the index, a string naming the
     cache to use (or NULL) and a u64 that conveys coherency metadata for
     the volume.

     For afs, I've made it render the volume name string as:

        "afs,<cell>,<volume_id>"

     and the coherency data is currently 0.

 (3) The fscache_cookie_def is no more and needed information is passed
     directly to fscache_acquire_cookie().  The cache no longer calls back
     into the filesystem, but rather metadata changes are indicated at
     other times.

     fscache_acquire_cookie() is passed the same keying and coherency
     information as before, except that these are now stored in big endian
     form instead of cpu endian.  This makes the cache more copyable.

 (4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
     is opened or closed to prevent a cache file from being culled and to
     keep resources to hand that are needed to do I/O.

     fscache_use_cookie() is given an indication if the cache is likely to
     be modified locally (e.g. the file is open for writing).

     fscache_unuse_cookie() is given a coherency update if we had the file
     open for writing and will update that.

 (5) fscache_invalidate() is now given uptodate auxiliary data and a file
     size.  It can also take a flag to indicate if this was due to a DIO
     write.  This is wrapped into afs_fscache_invalidate() now for
     convenience.

 (6) fscache_resize() now gets called from the finalisation of
     afs_setattr(), and afs_setattr() does use/unuse of the cookie around
     the call to support this.

 (7) fscache_note_page_release() is called from afs_release_page().

 (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
     PG_fscache to be cleared.

Render the parts of the cookie key for an afs inode cookie as big endian.

Changes
=======
ver #2:
 - Use gfpflags_allow_blocking() rather than using flag directly.
 - fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:44:47 +00:00
David Howells 9f08ebc343 fscache, cachefiles: Display stat of culling events
Add a stat counter of culling events whereby the cache backend culls a file
to make space (when asked by cachefilesd in this case) and display in
/proc/fs/fscache/stats.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819654165.215744.3797804661644212436.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906961387.143852.9291157239960289090.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967168266.1823006.14436200166581605746.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021567619.640689.4339228906248763197.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:43:18 +00:00
David Howells 3929eca769 fscache, cachefiles: Display stats of no-space events
Add stat counters of no-space events that caused caching not to happen and
display in /proc/fs/fscache/stats.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819653216.215744.17210522251617386509.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906958369.143852.7257100711818401748.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967166917.1823006.14842444049198947892.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021566184.640689.4417328329632709265.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:43:13 +00:00
David Howells ecd1a5f62e cachefiles: Allow cachefiles to actually function
Remove the block that allowed cachefiles to be compiled but prevented it
from actually starting a cache.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819649497.215744.2872504990762846767.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906956491.143852.4951522864793559189.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967165374.1823006.14248189932202373809.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021564379.640689.7921380491176827442.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:43:07 +00:00
David Howells 32e150037d fscache, cachefiles: Store the volume coherency data
Store the volume coherency data in an xattr and check it when we rebind the
volume.  If it doesn't match the cache volume is moved to the graveyard and
rebuilt anew.

Changes
=======
ver #4:
 - Remove a couple of debugging prints.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/163967164397.1823006.2950539849831291830.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021563138.640689.15851092065380543119.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:43:03 +00:00
David Howells 047487c947 cachefiles: Implement the I/O routines
Implement the I/O routines for cachefiles.  There are two sets of routines
here: preparation and actual I/O.

Preparation for read involves looking to see whether there is data present,
and how much.  Netfslib tells us what it wants us to do and we have the
option of adjusting shrinking and telling it whether to read from the
cache, download from the server or simply clear a region.

Preparation for write involves checking for space and defending against
possibly running short of space, if necessary punching out a hole in the
file so that we don't leave old data in the cache if we update the
coherency information.

Then there's a read routine and a write routine.  They wait for the cookie
state to move to something appropriate and then start a potentially
asynchronous direct I/O operation upon it.

Changes
=======
ver #2:
 - Fix a misassigned variable[1].

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/YaZOCk9zxApPattb@archlinux-ax161/ [1]
Link: https://lore.kernel.org/r/163819647945.215744.17827962047487125939.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906954666.143852.1504887120569779407.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967163110.1823006.9206718511874339672.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021562168.640689.8802250542405732391.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:58 +00:00
David Howells 7623ed6772 cachefiles: Implement cookie resize for truncate
Implement resizing an object, using truncate and/or fallocate to adjust the
object.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819646631.215744.13819016478175576761.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906952877.143852.4140962906331914859.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967162168.1823006.5941985259926902274.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021560394.640689.9972155785508094960.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:50 +00:00
David Howells 287fd61123 cachefiles: Implement begin and end I/O operation
Implement the methods for beginning and ending an I/O operation.

When called to begin an I/O operation, we are guaranteed that the cookie
has reached a certain stage (we're called by fscache after it has done a
suitable wait).

If a file is available, we paste a ref over into the cache resources for
the I/O routines to use.  This means that the object can be invalidated
whilst the I/O is ongoing without the need to synchronise as the file
pointer in the object is replaced, but the file pointer in the cache
resources is unaffected.

Ending the operation just requires ditching any refs we have and dropping
the access guarantee that fscache got for us on the cookie.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819645033.215744.2199344081658268312.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906951916.143852.9531384743995679857.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967161222.1823006.4461476204800357263.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021559030.640689.3684291785218094142.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:44 +00:00
David Howells 1f08c925e7 cachefiles: Implement backing file wrangling
Implement the wrangling of backing files, including the following pieces:

 (1) Lookup and creation of a file on disk, using a tmpfile if the file
     isn't yet present.  The file is then opened, sized for DIO and the
     file handle is attached to the cachefiles_object struct.  The inode is
     marked to indicate that it's in use by a kernel service.

 (2) Invalidation of an object, creating a tmpfile and switching the file
     pointer in the cachefiles object.

 (3) Committing a file to disk, including setting the coherency xattr on it
     and, if necessary, creating a hard link to it.

     Note that this would be a good place to use Omar Sandoval's vfs_link()
     with AT_LINK_REPLACE[1] as I may have to unlink an old file before I
     can link a tmpfile into place.

 (4) Withdrawal of open objects when a cache is being withdrawn or a cookie
     is relinquished.  This involves committing or discarding the file.

Changes
=======
ver #2:
 - Fix logging of wrong error[1].

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20211203094950.GA2480@kili/ [1]
Link: https://lore.kernel.org/r/163819644097.215744.4505389616742411239.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906949512.143852.14222856795032602080.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967158526.1823006.17482695321424642675.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021557060.640689.16373541458119269871.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:40 +00:00
David Howells 07a90e9740 cachefiles: Implement culling daemon commands
Implement the ability for the userspace daemon to try and cull a file or
directory in the cache.  Two daemon commands are implemented:

 (1) The "inuse" command.  This queries if a file is in use or whether it
     can be deleted.  It checks the S_KERNEL_FILE flag on the inode
     referred to by the specified filename.

 (2) The "cull" command.  This asks for a file or directory to be removed,
     where removal means either unlinking it or moving it to the graveyard
     directory for userspace to dismantle.

Changes
=======
ver #2:
 - Fix logging of wrong error[1].
 - Need to unmark an inode we've moved to the graveyard before unlocking.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20211203094950.GA2480@kili/ [1]
Link: https://lore.kernel.org/r/163819643179.215744.13641580295708315695.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906945705.143852.8177595531814485350.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967155792.1823006.1088936326902550910.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021555037.640689.9472627499842585255.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:35 +00:00
David Howells 169379eaef cachefiles: Mark a backing file in use with an inode flag
Use an inode flag, S_KERNEL_FILE, to mark that a backing file is in use by
the kernel to prevent cachefiles or other kernel services from interfering
with that file.

Using S_SWAPFILE instead isn't really viable as that has other effects in
the I/O paths.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819642273.215744.6414248677118690672.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906943215.143852.16972351425323967014.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967154118.1823006.13227551961786743991.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021541207.640689.564689725898537127.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/164021552299.640689.10578652796777392062.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:29 +00:00
David Howells 72b957856b cachefiles: Implement metadata/coherency data storage in xattrs
Use an xattr on each backing file in the cache to store some metadata, such
as the content type and the coherency data.

Five content types are defined:

 (0) No content stored.

 (1) The file contains a single monolithic blob and must be all or nothing.
     This would be used for something like an AFS directory or a symlink.

 (2) The file is populated with content completely up to a point with
     nothing beyond that.

 (3) The file has a map attached and is sparsely populated.  This would be
     stored in one or more additional xattrs.

 (4) The file is dirty, being in the process of local modification and the
     contents are not necessarily represented correctly by the metadata.
     The file should be deleted if this is seen on binding.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819641320.215744.16346770087799536862.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906942248.143852.5423738045012094252.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967151734.1823006.9301249989443622576.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021550471.640689.553853918307994335.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:24 +00:00
David Howells 5d439467b8 cachefiles: Implement key to filename encoding
Implement a function to encode a binary cookie key as something that can be
used as a filename.  Four options are considered:

 (1) All printable chars with no '/' characters.  Prepend a 'D' to indicate
     the encoding but otherwise use as-is.

 (2) Appears to be an array of __be32.  Encode as 'S' plus a list of
     hex-encoded 32-bit ints separated by commas.  If a number is 0, it is
     rendered as "" instead of "0".

 (3) Appears to be an array of __le32.  Encoded as (2) but with a 'T'
     encoding prefix.

 (4) Encoded as base64 with an 'E' prefix plus a second char indicating how
     much padding is involved.  A non-standard base64 encoding is used
     because '/' cannot be used in the encoded form.

If (1) is not possible, whichever of (2), (3) or (4) produces the shortest
string is selected (hex-encoding a number may be less dense than base64
encoding it).

Note that the prefix characters have to be selected from the set [DEIJST@]
lest cachefilesd remove the files because it recognise the name.

Changes
=======
ver #2:
 - Fix a short allocation that didn't allow for a string terminator[1]

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/bcefb8f2-576a-b3fc-cc29-89808ebfd7c1@linux.alibaba.com/ [1]
Link: https://lore.kernel.org/r/163819640393.215744.15212364106412961104.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906940529.143852.17352132319136117053.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967149827.1823006.6088580775428487961.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021549223.640689.14762875188193982341.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:16 +00:00
David Howells df98e87f20 cachefiles: Implement object lifecycle funcs
Implement allocate, get, see and put functions for the cachefiles_object
struct.  The members of the struct we're going to need are also added.

Additionally, implement a lifecycle tracepoint.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819639457.215744.4600093239395728232.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906939569.143852.3594314410666551982.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967148857.1823006.6332962598220464364.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021547762.640689.8422781599594931000.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:42:08 +00:00
David Howells fe2140e2f5 cachefiles: Implement volume support
Implement support for creating the directory layout for a volume on disk
and setting up and withdrawing volume caching.

Each volume has a directory named for the volume key under the root of the
cache (prefixed with an 'I' to indicate to cachefilesd that it's an index)
and then creates a bunch of hash bucket subdirectories under that (named as
'@' plus a hex number) in which cookie files will be created.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819635314.215744.13081522301564537723.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906936397.143852.17788457778396467161.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967143860.1823006.7185205806080225038.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021545212.640689.5064821392307582927.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:53 +00:00
David Howells d1065b0a6f cachefiles: Implement cache registration and withdrawal
Do the following:

 (1) Fill out cachefiles_daemon_add_cache() so that it sets up the cache
     directories and registers the cache with cachefiles.

 (2) Add a function to do the top-level part of cache withdrawal and
     unregistration.

 (3) Add a function to sync a cache.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819633175.215744.10857127598041268340.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906935445.143852.15545194974036410029.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967142904.1823006.244055483596047072.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021543872.640689.14370017789605073222.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:46 +00:00
David Howells 32759f7d7a cachefiles: Implement a function to get/create a directory in the cache
Implement a function to get/create structural directories in the cache.
This is used for setting up a cache and creating volume substructures.  The
directory in memory are marked with the S_KERNEL_FILE inode flag whilst
they're in use to tell rmdir to reject attempts to remove them.

Changes
=======
ver #3:
 - Return an indication as to whether the directory was freshly created.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819631182.215744.3322471539523262619.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906933130.143852.962088616746509062.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967141952.1823006.7832985646370603833.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021542169.640689.18266858945694357839.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:40 +00:00
David Howells 1bd9c4e4f0 vfs, cachefiles: Mark a backing file in use with an inode flag
Use an inode flag, S_KERNEL_FILE, to mark that a backing file is in use by
the kernel to prevent cachefiles or other kernel services from interfering
with that file.

Alter rmdir to reject attempts to remove a directory marked with this flag.
This is used by cachefiles to prevent cachefilesd from removing them.

Using S_SWAPFILE instead isn't really viable as that has other effects in
the I/O paths.

Changes
=======
ver #3:
 - Check for the object pointer being NULL in the tracepoints rather than
   the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819630256.215744.4815885535039369574.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906931596.143852.8642051223094013028.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967141000.1823006.12920680657559677789.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021541207.640689.564689725898537127.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:32 +00:00
David Howells 80f94f29f6 cachefiles: Provide a function to check how much space there is
Provide a function to check how much space there is.  This also flips the
state on the cache and will signal the daemon to inform it of the change
and to ask it to do some culling if necessary.

We will also need to subtract the amount of data currently being written to
the cache (cache->b_writing) from the amount of available space to avoid
hitting ENOSPC accidentally.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819629322.215744.13457425294680841213.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906930100.143852.1681026700865762069.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967140058.1823006.7781243664702837128.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021539957.640689.12477177372616805706.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:26 +00:00
David Howells 8667d434b2 cachefiles: Register a miscdev and parse commands over it
Register a misc device with which to talk to the daemon.  The misc device
holds a cache set up through it around and closing the device kills the
cache.

cachefilesd communicates with the kernel by passing it single-line text
commands.  Parse these and use them to parameterise the cache state.  This
does not implement the command to actually bring a cache online.  That's
left for later.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819628388.215744.17712097043607299608.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906929128.143852.14065207858943654011.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967139085.1823006.3514846391807454287.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021538400.640689.9172006906288062041.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:20 +00:00
David Howells 254947d479 cachefiles: Add security derivation
Implement code to derive a new set of creds for the cachefiles to use when
making VFS or I/O calls and to change the auditing info since the
application interacting with the network filesystem is not accessing the
cache directly.  Cachefiles uses override_creds() to change the effective
creds temporarily.

set_security_override_from_ctx() is called to derive the LSM 'label' that
the cachefiles driver will act with.  set_create_files_as() is called to
determine the LSM 'label' that will be applied to files and directories
created in the cache.  These functions alter the new creds.

Also implement a couple of functions to wrap the calls to begin/end cred
overriding.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819627469.215744.3603633690679962985.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906928172.143852.15886637013364286786.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967138138.1823006.7620933448261939504.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021537001.640689.4081334436031700558.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:14 +00:00
David Howells 1493bf74bc cachefiles: Add cache error reporting macro
Add a macro to report a cache I/O error and to tell fscache that the cache
is in trouble.

Also add a pointer to the fscache cache cookie from the cachefiles_cache
struct as we need that to pass to fscache_io_error().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819626562.215744.1503690975344731661.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906927235.143852.13694625647880837563.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967137158.1823006.2065038830569321335.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021536053.640689.5306822604644352548.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:07 +00:00
David Howells ecf5a6ce15 cachefiles: Add a couple of tracepoints for logging errors
Add two trace points to log errors, one for vfs operations like mkdir or
create, and one for I/O operations, like read, write or truncate.

Also add the beginnings of a struct that is going to represent a data file
and place a debugging ID in it for the tracepoints to record.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819625632.215744.17907340966178411033.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906926297.143852.18267924605548658911.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967135390.1823006.2512120406360156424.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021534029.640689.1875723624947577095.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:01 +00:00
David Howells a70f652626 cachefiles: Add some error injection support
Add support for injecting ENOSPC or EIO errors.  This needs to be enabled
by CONFIG_CACHEFILES_ERROR_INJECTION=y.  Once enabled, ENOSPC on things
like write and mkdir can be triggered by:

        echo 1 >/proc/sys/cachefiles/error_injection

and EIO can be triggered on most operations by:

        echo 2 >/proc/sys/cachefiles/error_injection

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819624706.215744.6911916249119962943.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906925343.143852.5465695512984025812.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967134412.1823006.7354285948280296595.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021532340.640689.18209494225772443698.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:40:56 +00:00
David Howells 8390fbc465 cachefiles: Define structs
Define the cachefiles_cache struct that's going to carry the cache-level
parameters and state of a cache.

Define the beginning of the cachefiles_object struct that's going to carry
the state for a data storage object.  For the moment this is just a
debugging ID for logging purposes.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819623690.215744.2824739137193655547.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906924292.143852.15881439716653984905.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967131405.1823006.4480555941533935597.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021530610.640689.846094074334176928.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:40:45 +00:00
David Howells 77443f6171 cachefiles: Introduce rewritten driver
Introduce basic skeleton of the rewritten cachefiles driver including
config options so that it can be enabled for compilation.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819622766.215744.9108359326983195047.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906923341.143852.3856498104256721447.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967130320.1823006.15791456613198441566.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021528993.640689.9069695476048171884.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:40:39 +00:00
David Howells 16a96bdf92 fscache: Provide a function to resize a cookie
Provide a function to change the size of the storage attached to a cookie,
to match the size of the file being cached when it's changed by truncate or
fallocate:

	void fscache_resize_cookie(struct fscache_cookie *cookie,
				   loff_t new_size);

This acts synchronously and is expected to run under the inode lock of the
caller.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819621839.215744.7895597119803515402.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906922387.143852.16394459879816147793.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967128998.1823006.10740669081985775576.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021527861.640689.3466382085497236267.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:40:33 +00:00
Qu Wenruo 36c86a9e1b btrfs: output more debug messages for uncommitted transaction
Print extra information about how many dirty bytes an uncommitted
has at the end of mount.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:27 +01:00
Filipe Manana c2f822635d btrfs: respect the max size in the header when activating swap file
If we extended the size of a swapfile after its header was created (by the
mkswap utility) and then try to activate it, we will map the entire file
when activating the swap file, instead of limiting to the max size defined
in the swap file's header.

Currently test case generic/643 from fstests fails because we do not
respect that size limit defined in the swap file's header.

So fix this by not mapping file ranges beyond the max size defined in the
swap header.

This is the same type of bug that iomap used to have, and was fixed in
commit 36ca7943ac ("mm/swap: consider max pages in
iomap_swapfile_add_extent").

Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:27 +01:00
Yang Li be8d1a2ab9 btrfs: fix argument list that the kdoc format and script verified
The warnings were found by running scripts/kernel-doc, which is
caused by using 'make W=1'.

fs/btrfs/extent_io.c:3210: warning: Function parameter or member
'bio_ctrl' not described in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
description in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter
'prev_bio_flags' description in 'btrfs_bio_add_page'
fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
description in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1602: warning: Function parameter or member
'fs_info' not described in 'btrfs_reserve_metadata_bytes'

Note: this is fixing only the warnings regarding parameter list, the
first line is not strictly conforming to the kdoc format as the btrfs
codebase does not stick to that and keeps the first line more free form
(because it's only for internal use).

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:27 +01:00
Su Yue 4a9e803e5b btrfs: remove unnecessary parameter type from compression_decompress_bio
btrfs_decompress_bio, the only caller of compression_decompress_bio gets
type from @cb and passes it to compression_decompress_bio.
However, compression_decompress_bio can get compression type directly
from @cb.

So remove the parameter and access it through @cb.  No functional
change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:27 +01:00
Qu Wenruo 856e47946c btrfs: selftests: dump extent io tree if extent-io-tree test failed
When code modifying extent-io-tree get modified and got that selftest
failed, it can take some time to pin down the cause.

To make it easier to expose the problem, dump the extent io tree if the
selftest failed.

This can save developers debug time, especially since the selftest we
can not use the trace events, thus have to manually add debug trace
points.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:27 +01:00
Qu Wenruo 2ae8ae3d3d btrfs: scrub: cleanup the argument list of scrub_stripe()
The argument list of btrfs_stripe() has similar problems of
scrub_chunk():

- Duplicated and ambiguous @base argument
  Can be fetched from btrfs_block_group::bg.

- Ambiguous argument @length
  It's again device extent length

- Ambiguous argument @num
  The instinctive guess would be mirror number, but in fact it's stripe
  index.

Fix it by:

- Remove @base parameter

- Rename @length to @dev_extent_len

- Rename @num to @stripe_index

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:27 +01:00
Qu Wenruo d04fbe19ae btrfs: scrub: cleanup the argument list of scrub_chunk()
The argument list of scrub_chunk() has the following problems:

- Duplicated @chunk_offset
  It is the same as btrfs_block_group::start.

- Confusing @length
  The most instinctive guess is chunk length, and one may want to delete
  it, but the truth is, it's the device extent length.

Fix this by:

- Remove @chunk_offset
  Use btrfs_block_group::start instead.

- Rename @length to @dev_extent_len
  Also rename the caller to remove the ambiguous naming.

- Rename @cache to @bg
  The "_cache" suffix for btrfs_block_group has been removed for a while.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Qu Wenruo f26c923860 btrfs: remove reada infrastructure
Currently there is only one user for btrfs metadata readahead, and
that's scrub.

But even for the single user, it's not providing the correct
functionality it needs, as scrub needs reada for commit root, which
current readahead can't provide. (Although it's pretty easy to add such
feature).

Despite this, there are some extra problems related to metadata
readahead:

- Duplicated feature with btrfs_path::reada

- Partly duplicated feature of btrfs_fs_info::buffer_radix
  Btrfs already caches its metadata in buffer_radix, while readahead
  tries to read the tree block no matter if it's already cached.

- Poor layer separation
  Metadata readahead works kinda at device level.
  This is definitely not the correct layer it should be, since metadata
  is at btrfs logical address space, it should not bother device at all.

  This brings extra chance for bugs to sneak in, while brings
  unnecessary complexity.

- Dead code
  In the very beginning of scrub.c we have #undef DEBUG, rendering all
  the debug related code useless and unable to test.

Thus here I purpose to remove the metadata readahead mechanism
completely.

[BENCHMARK]
There is a full benchmark for the scrub performance difference using the
old btrfs_reada_add() and btrfs_path::reada.

For the worst case (no dirty metadata, slow HDD), there could be a 5%
performance drop for scrub.
For other cases (even SATA SSD), there is no distinguishable performance
difference.

The number is reported scrub speed, in MiB/s.
The resolution is limited by the reported duration, which only has a
resolution of 1 second.

	Old		New		Diff
SSD	455.3		466.332		+2.42%
HDD	103.927 	98.012		-5.69%

Comprehensive test methodology is in the cover letter of the patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Qu Wenruo dcf62b204c btrfs: scrub: use btrfs_path::reada for extent tree readahead
For scrub, we trigger two readaheads for two trees, extent tree to get
where to scrub, and csum tree to get the data checksum.

For csum tree we already trigger readahead in
btrfs_lookup_csums_range(), by setting path->reada.
But for extent tree we don't have any path based readahead.

Add the readahead for extent tree as well, so we can later remove the
btrfs_reada_add() based readahead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Qu Wenruo 2522dbe86b btrfs: scrub: remove the unnecessary path parameter for scrub_raid56_parity()
In function scrub_stripe() we allocated two btrfs_path's, one @path for
extent tree search and another @ppath for full stripe extent tree search
for RAID56.

This is totally umncessary, as the @ppath usage is completely inside
scrub_raid56_parity(), thus we can move the path allocation into
scrub_raid56_parity() completely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Nikolay Borisov c122799643 btrfs: refactor unlock_up
The purpose of this function is to unlock all nodes in a btrfs path
which are above 'lowest_unlock' and whose slot used is different than 0.
As such it used slightly awkward structure of 'if' as well as somewhat
cryptic "no_skip" control variable which denotes whether we should
check the current level of skipability or no.

This patch does the following (cosmetic) refactorings:

* Renames 'no_skip' to 'check_skip' and makes it a boolean. This
  variable controls whether we are below the lowest_unlock/skip_level
  levels.

* Consolidates the 2 conditions which warrant checking whether the
  current level should be skipped under 1 common if (check_skip) branch,
  this increase indentation level but is not critical.

* Consolidates the 'skip_level < i && i >= lowest_unlock' and
  'i >= lowest_unlock && i > skip_level' condition into a common branch
  since those are identical.

* Eliminates the local extent_buffer variable as in this case it doesn't
  bring anything to function readability.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Filipe Manana 1b58ae0e4d btrfs: skip transaction commit after failure to create subvolume
At ioctl.c:create_subvol(), when we fail to create a subvolume we always
commit the transaction. In most cases this is a no-op, since all the error
paths, except for one, abort the transaction - the only exception is when
we fail to insert the new root item into the root tree, in that case we
don't abort the transaction because we didn't do anything that is
irreversible - however we end up committing the transaction which although
is not a functional problem, it adds unnecessary rotation of the backup
roots in the superblock and unnecessary work.

So change that to commit a transaction only when no error happened,
otherwise just call btrfs_end_transaction() to release our reference on
the transaction.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Naohiro Aota 82187d2ecd btrfs: zoned: fix chunk allocation condition for zoned allocator
The ZNS specification defines a limit on the number of "active"
zones. That limit impose us to limit the number of block groups which
can be used for an allocation at the same time. Not to exceed the
limit, we reuse the existing active block groups as much as possible
when we can't activate any other zones without sacrificing an already
activated block group in commit a85f05e59b ("btrfs: zoned: avoid
chunk allocation if active block group has enough space").

However, the check is wrong in two ways. First, it checks the
condition for every raid index (ffe_ctl->index). Even if it reaches
the condition and "ffe_ctl->max_extent_size >=
ffe_ctl->min_alloc_size" is met, there can be other block groups
having enough space to hold ffe_ctl->num_bytes. (Actually, this won't
happen in the current zoned code as it only supports SINGLE
profile. But, it can happen once it enables other RAID types.)

Second, it checks the active zone availability depending on the
raid index. The raid index is just an index for
space_info->block_groups, so it has nothing to do with chunk allocation.

These mistakes are causing a faulty allocation in a certain
situation. Consider we are running zoned btrfs on a device whose
max_active_zone == 0 (no limit). And, suppose no block group have a
room to fit ffe_ctl->num_bytes but some room to meet
ffe_ctl->min_alloc_size (i.e. max_extent_size > num_bytes >=
min_alloc_size).

In this situation, the following occur:

- With SINGLE raid_index, it reaches the chunk allocation checking
  code
- The check returns true because we can activate a new zone (no limit)
- But, before allocating the chunk, it iterates to the next raid index
  (RAID5)
- Since there are no RAID5 block groups on zoned mode, it again
  reaches the check code
- The check returns false because of btrfs_can_activate_zone()'s "if
  (raid_index != BTRFS_RAID_SINGLE)" part
- That results in returning -ENOSPC without allocating a new chunk

As a result, we end up hitting -ENOSPC too early.

Move the check to the right place in the can_allocate_chunk() hook,
and do the active zone check depending on the allocation flag, not on
the raid index.

CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Naohiro Aota 50475cd577 btrfs: add extent allocator hook to decide to allocate chunk or not
Introduce a new hook for an extent allocator policy. With the new
hook, a policy can decide to allocate a new block group or not. If
not, it will return -ENOSPC, so btrfs_reserve_extent() will cut the
allocation size in half and retry the allocation if min_alloc_size is
large enough.

The hook has a place holder and will be replaced with the real
implementation in the next patch.

CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Naohiro Aota 1ada69f61c btrfs: zoned: unset dedicated block group on allocation failure
Allocating an extent from a block group can fail for various reasons.
When an allocation from a dedicated block group (for tree-log or
relocation data) fails, we need to unregister it as a dedicated one so
that we can allocate a new block group for the dedicated one.

However, we are returning early when the block group in case it is
read-only, fully used, or not be able to activate the zone. As a result,
we keep the non-usable block group as a dedicated one, leading to
further allocation failure. With many block groups, the allocator will
iterate hopeless loop to find a free extent, results in a hung task.

Fix the issue by delaying the return and doing the proper cleanups.

CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Johannes Thumshirn 7367271000 btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned
REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to
check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the
bio's bio_op.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Johannes Thumshirn 554aed7da2 btrfs: zoned: sink zone check into btrfs_repair_one_zone
Sink zone check into btrfs_repair_one_zone() so we don't need to do it
in all callers.

Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
a boolean function and return false in case it got called on a non-zoned
filesystem and true on a zoned filesystem.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:26 +01:00
Johannes Thumshirn 8fdf54fe69 btrfs: zoned: simplify btrfs_check_meta_write_pointer
btrfs_check_meta_write_pointer() will always be called with a NULL
'cache_ret' argument.

As there's no need to check if we have a valid block_group passed in
remove these checks.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Johannes Thumshirn 869f4cdc73 btrfs: zoned: encapsulate inode locking for zoned relocation
Encapsulate the inode lock needed for serializing the data relocation
writes on a zoned filesystem into a helper.

This streamlines the code reading flow and hides special casing for
zoned filesystems.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Anand Jain a26d60dedf btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the device
In the case of the seed device, the fsid can be different from the mounted
sprout fsid.  The userland has to read the device superblock to know the
fsid but, that idea fails if the device is missing. So add a sysfs
interface devinfo/<devid>/fsid to show the fsid of the device.

For example:
  $ cd /sys/fs/btrfs/b10b02a5-f9de-4276-b9e8-2bfd09a578a8

  $ cat devinfo/1/fsid
  c44d771f-639d-4df3-99ec-5bc7ad2af93b
  $ cat  devinfo/3/fsid
  b10b02a5-f9de-4276-b9e8-2bfd09a578a8

Though it's related to seeding, the name of the sysfs file is plain fsid as it
matches what blkid says.  A path to the device's fsid will aid scripting.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Josef Bacik c18e323564 btrfs: reserve extra space for the free space tree
Filipe reported a problem where sometimes he'd get an ENOSPC abort when
running delayed refs with generic/619 and the free space tree enabled.
This is partly because we do not reserve space for modifying the free
space tree, nor do we have a block rsv associated with that tree.

The delayed_refs_rsv tracks the amount of space required to run delayed
refs.  This means 1 modification means 1 change to the extent root.
With the free space tree this turns into 2 changes, because modifying 1
extent means updating the extent tree and potentially updating the free
space tree to either remove that entry or add the free space.  Thus if
we have the FST enabled, simply double the reservation size for our
modification.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Josef Bacik 9506f95382 btrfs: include the free space tree in the global rsv minimum calculation
Filipe reported a problem where generic/619 was failing with an ENOSPC
abort while running delayed refs, like the following

  BTRFS: Transaction aborted (error -28)
  WARNING: CPU: 3 PID: 522920 at fs/btrfs/free-space-tree.c:1049 add_to_free_space_tree+0xe5/0x110 [btrfs]
  CPU: 3 PID: 522920 Comm: kworker/u16:19 Tainted: G        W         5.16.0-rc2-btrfs-next-106 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
  RIP: 0010:add_to_free_space_tree+0xe5/0x110 [btrfs]
  RSP: 0000:ffffa65087fb7b20 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffff9131eeaa RDI: 00000000ffffffff
  RBP: ffff8d62e26481b8 R08: ffffffff9ad97ce0 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffffe4
  R13: ffff8d61c25fe688 R14: ffff8d61ebd88800 R15: ffff8d61ebd88a90
  FS:  0000000000000000(0000) GS:ffff8d64ed400000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fa46a8b1000 CR3: 0000000148d18003 CR4: 0000000000370ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   __btrfs_free_extent+0x516/0x950 [btrfs]
   __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
   btrfs_run_delayed_refs+0x86/0x210 [btrfs]
   flush_space+0x403/0x630 [btrfs]
   ? call_rcu_tasks_generic+0x50/0x80
   ? lock_release+0x223/0x4a0
   ? btrfs_get_alloc_profile+0xb5/0x290 [btrfs]
   ? do_raw_spin_unlock+0x4b/0xa0
   btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
   process_one_work+0x24c/0x5b0
   worker_thread+0x55/0x3c0
   ? process_one_work+0x5b0/0x5b0
   kthread+0x17c/0x1a0
   ? set_kthread_struct+0x40/0x40
   ret_from_fork+0x22/0x30

There's a couple of reasons for this, but in generic/619's case the
largest reason is because it is a very small file system, ad we do not
reserve enough space for the global reserve.

With the free space tree we now have the free space tree that we need to
modify when running delayed refs.  This means we need the global reserve
to take this into account when it calculates the minimum size it needs
to be.  This is especially important for very small file systems.

Fix this by adjusting the minimum global block rsv size math to include
the size of the free space tree when calculating the size.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Qu Wenruo c9d328c0c4 btrfs: scrub: merge SCRUB_PAGES_PER_RD_BIO and SCRUB_PAGES_PER_WR_BIO
These two values were introduced in commit ff023aac31 ("Btrfs: add code
to scrub to copy read data to another disk") as an optimization.

But the truth is, block layer scheduler can do whatever it wants to
merge/split bios to improve performance.

Doing such "optimization" is not really going to affect much, especially
considering how good current block layer optimizations are doing.
Remove such old and immature optimization from our code.

Since we're here, also change BUG_ON()s using these two macros to use
ASSERT()s.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Qu Wenruo 0bb3acdc48 btrfs: update SCRUB_MAX_PAGES_PER_BLOCK
Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (minimal sectorsize) to
calculate this value.

And remove one stale comment on the value, in fact with recent subpage
support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond
BTRFS_STRIPE_LEN, just we don't use the full page.

Also since we're here, update the BUG_ON() related to
SCRUB_MAX_PAGES_PER_BLOCK to ASSERT().

As those ASSERT() are really only for developers to catch early obvious
bugs, not to let end users suffer.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Josef Bacik 8697b8f88e btrfs: do not check -EAGAIN when truncating inodes in the log root
We only throttle the btrfs_truncate_inode_items if the root is
SHAREABLE, which isn't set on the log root, which means this loop is
unnecessary.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Josef Bacik e48dac7f6f btrfs: make should_throttle loop local in btrfs_truncate_inode_items
We reset this bool on every loop through the truncate loop, make this
variable local to the loop.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Josef Bacik 0adbc6190c btrfs: combine extra if statements in btrfs_truncate_inode_items
We have

    if (del_item)
	    // do something
    else
	    // something else
    if (del_item)
	    // do yet another thing
    else
	    // something else entirely

back to back in btrfs_truncate_inode_items, collapse these two sets of
if statements into one.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Josef Bacik 376b91d570 btrfs: convert BUG() for pending_del_nr into an ASSERT
This is a logic correctness check, convert it into an ASSERT() instead
of a BUG().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Josef Bacik 56e1edb0e3 btrfs: convert BUG_ON() in btrfs_truncate_inode_items to ASSERT
We have a correctness BUG_ON() in btrfs_truncate_inode_items to make
sure that we're always using min_type == BTRFS_EXTENT_DATA_KEY if
new_size is > 0.  Convert this to an ASSERT.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:25 +01:00
Josef Bacik 71d18b5354 btrfs: add inode to truncate control
In the future we're going to want to use btrfs_truncate_inode_items
without looking up the associated inode.  In order to accommodate this
add the inode to btrfs_truncate_control and handle the case where
control->inode is NULL appropriately.  This is fairly straightforward,
we simply need to add a helper for the trace points, as the file extent
map update is controlled by a flag on btrfs_truncate_control.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 487e81d2a4 btrfs: pass the ino via truncate control
In the future we are going to want to truncate inode items without
needing to have an btrfs_inode to pass in, so add ino to the
btrfs_truncate_control and use that to look up the inode items to
truncate.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 655807b895 btrfs: use a flag to control when to clear the file extent range
We only care about updating the file extent range when we are doing a
normal truncation.  We skip this for tree logging currently, but we can
also skip this for eviction as well.  Using a flag makes it more
explicit when we want to do this work.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 5caa490ed8 btrfs: control extent reference updates with a control flag for truncate
We've had weird bugs in the past where we forgot to adjust the truncate
path to deal with the fact that we can be called by the tree log path.
Instead of checking if our root is a LOG_ROOT use a flag on the
btrfs_truncate_control to indicate that we don't want to do extent
reference updates during this truncate.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 462b728ea8 btrfs: only call inode_sub_bytes in truncate paths that care
We currently have a bunch of awkward checks to make sure we only update
the inode i_bytes if we're truncating the real inode.  Instead keep
track of the number of bytes we need to sub in the
btrfs_truncate_control, and then do the appropriate adjustment in the
truncate paths that care.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik c2ddb612a8 btrfs: only update i_size in truncate paths that care
We currently will update the i_size of the inode as we truncate it down,
however we skip this if we're calling btrfs_truncate_inode_items from
the tree log code.  However we also don't care about this in the case of
evict.  Instead keep track of this value in the btrfs_truncate_control
and then have btrfs_truncate() and the free space cache truncate path
both do the i_size update themselves.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik d9ac19c380 btrfs: add truncate control struct
I'm going to be adding more arguments and counters to
btrfs_truncate_inode_items, so add a control struct to handle all of the
extra arguments to make it easier to follow.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 7097a941bf btrfs: remove found_extent from btrfs_truncate_inode_items
We only set this if we find a normal file extent, del_item == 1, and the
file extent points to a real extent and isn't a hole extent.  We can use
del_item == 1 && extent_start != 0 to get the same information that
found_extent provides, so remove this variable and use the other
variables instead.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 2adc75d612 btrfs: move btrfs_kill_delayed_inode_items into evict
We have a special case in btrfs_truncate_inode_items() to call
btrfs_kill_delayed_inode_items() if min_type == 0, which is only called
during evict.

Instead move this out into evict proper, and add some comments because I
erroneously attempted to remove this code altogether without
understanding what we were doing.

Evict is updating the inode only because we only care about making sure
the i_nlink count has hit disk.  If we had pending deletions we don't
want to process those via the delayed inode updates, we simply want to
drop all of them and reclaim the reserved metadata space.  Then from
there the btrfs_truncate_inode_items() will do the work to remove all of
the items as appropriate.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 275312a03c btrfs: remove free space cache inode check in btrfs_truncate_inode_items
We no longer have inode cache feature, so this check is extraneous as
the only inode cache is in the tree_root, which is not marked as
SHAREABLE.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 9a4a1429ac btrfs: move extent locking outside of btrfs_truncate_inode_items
Currently we are locking the extent and dropping the extent cache for
any inodes we truncate, unless they're in the tree log.  We call this
helper from:

- truncate
- evict
- tree log
- free space cache truncation

For evict we've already dropped all of the extent cache for this inode
once we've gotten here, and we're the only one accessing this inode, so
this step is unnecessary.

For the tree log code we already skip this part.

Pull this work into the truncate path and the free space cache
truncation path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 54f03ab1e1 btrfs: move btrfs_truncate_inode_items to inode-item.c
This is an inode item related manipulation with a few vfs related
adjustments.  I'm going to remove the vfs related code from this helper
and simplify it a lot, but I want those changes to be easily seen via
git blame, so move this function now and then the simplification work
can be done.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:24 +01:00
Josef Bacik 26c2c4540d btrfs: add an inode-item.h
We have a few helpers in inode-item.c, and I'm going to make a few
changes to how we do truncate in the future, so break out these
definitions into their own header file to trim down ctree.h some and
make it easier to do the work on truncate in the future.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Filipe Manana 727e60604f btrfs: remove stale comment about locking at btrfs_search_slot()
The comment refers to the old extent buffer locking code, where we used to
have custom locks that had blocking and spinning behaviour modes. That is
not the case anymore, since we have transitioned to rw semaphores, so the
comment does not offer any value anymore. Remove it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Filipe Manana bb8e9a6080 btrfs: remove BUG_ON() after splitting leaf
After calling split_leaf() we BUG_ON() if the returned value is greater
than zero. However split_leaf() only returns 0, in case of success, or a
negative value in case of an error.

The reason for the BUG_ON() is that if we ever get a positive return
value from split_leaf(), we can not simply propagate it to the callers
of btrfs_search_slot(), as that would be interpreted as "key not found"
and not as an error. That means it could result in callers ending up
causing some potential silent corruption.

So change the BUG_ON() to an ASSERT(), and in case assertions are
disabled, produce a warning and set the return value to an error, to make
it not possible to get into a silent corruption and having the error not
noticed.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Filipe Manana 109324cfda btrfs: move leaf search logic out of btrfs_search_slot()
There's quite a significant amount of code for doing the key search for a
leaf at btrfs_search_slot(), with a couple labels and gotos in it, plus
btrfs_search_slot() is already big enough.

So move the logic that does the key search on a leaf into a new helper
function. This makes it better organized, removing the need for the labels
and the gotos, as well as reducing the indentation level and the size of
btrfs_search_slot().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Filipe Manana e5e1c1741b btrfs: remove useless condition check before splitting leaf
When inserting a key, we check if the write_lock_level is less than 1,
and if so we set it to 1, release the path and retry the tree traversal.

However that is unnecessary, because when ins_len is greater than 0, we
know that write_lock_level can never be less than 1.

The logic to retry is also buggy, because in case ins_len was decremented,
due to an exact key match and the search is not meant for item extension
(path->search_for_extension is 0), we retry without incrementing ins_len,
which would make the next retry decrement it again by the same amount.

So remove the check for write_lock_level being less than 1 and add an
assertion to assert it's always >= 1.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Filipe Manana e2e58d0f8d btrfs: try to unlock parent nodes earlier when inserting a key
When inserting a new key, we release the write lock on the leaf's parent
only after doing the binary search on the leaf. This is because if the
key ends up at slot 0, we will have to update the key at slot 0 of the
parent node. The same reasoning applies to any other upper level nodes
when their slot is 0. We also need to keep the parent locked in case the
leaf does not have enough free space to insert the new key/item, because
in that case we will split the leaf and we will need to add a new key to
the parent due to a new leaf resulting from the split operation.

However if the leaf has enough space for the new key and the key does not
end up at slot 0 of the leaf we could release our write lock on the parent
before doing the binary search on the leaf to figure out the destination
slot. That leads to reducing the amount of time other tasks are blocked
waiting to lock the parent, therefore increasing parallelism when there
are other tasks that are trying to access other leaves accessible through
the same parent. This also applies to other upper nodes besides the
immediate parent, when their slot is 0, since we keep locks on them until
we figure out if the leaf slot is slot 0 or not.

In fact, having the key ending at up slot 0 when is rare. Typically it
only happens when the key is less than or equals to the smallest, the
"left most", key of the entire btree, during a split attempt when we try
to push to the right sibling leaf or when the caller just wants to update
the item of an existing key. It's also very common that a leaf has enough
space to insert a new key, since after a split we move about half of the
keys from one into the new leaf.

So unlock the parent, and any other upper level nodes, when during a key
insertion we notice the key is greater then the first key in the leaf and
the leaf has enough free space. After unlocking the upper level nodes, do
the binary search using a low boundary of slot 1 and not slot 0, to figure
out the slot where the key will be inserted (or where the key already is
in case it exists and the caller wants to modify its item data).
This extra comparison, with the first key, is cheap and the key is very
likely already in a cache line because it immediately follows the header
of the extent buffer and we have recently read the level field of the
header (which in fact is the last field of the header).

The following fs_mark test was run on a non-debug kernel (debian's default
kernel config), with a 12 cores intel CPU, and using a NVMe device:

  $ cat run-fsmark.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-O no-holes -R free-space-tree"
  FILES=100000
  THREADS=$(nproc --all)
  FILE_SIZE=0

  echo "performance" | \
	tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

Before this change:

FSUse%        Count         Size    Files/sec     App Overhead
     0      1200000            0     165273.6          5958381
     0      2400000            0     190938.3          6284477
     0      3600000            0     181429.1          6044059
     0      4800000            0     173979.2          6223418
     0      6000000            0     139288.0          6384560
     0      7200000            0     163000.4          6520083
     1      8400000            0      57799.2          5388544
     1      9600000            0      66461.6          5552969
     2     10800000            0      49593.5          5163675
     2     12000000            0      57672.1          4889398

After this change:

FSUse%        Count         Size    Files/sec            App Overhead
     0      1200000            0     167987.3 (+1.6%)         6272730
     0      2400000            0     198563.9 (+4.0%)         6048847
     0      3600000            0     197436.6 (+8.8%)         6163637
     0      4800000            0     202880.7 (+16.6%)        6371771
     1      6000000            0     167275.9 (+20.1%)        6556733
     1      7200000            0     204051.2 (+25.2%)        6817091
     1      8400000            0      69622.8 (+20.5%)        5525675
     1      9600000            0      69384.5 (+4.4%)         5700723
     1     10800000            0      61454.1 (+23.9%)        5363754
     3     12000000            0      61908.7 (+7.3%)         5370196

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Filipe Manana fb81212c07 btrfs: allow generic_bin_search() to take low boundary as an argument
Right now generic_bin_search() always uses a low boundary slot of 0, but
in the next patch we'll want to often skip slot 0 when searching for a
key. So make generic_bin_search() have the low boundary slot specified
as an argument, and move the check for the extent buffer level from
btrfs_bin_search() to generic_bin_search() to avoid adding another
wrapper around generic_bin_search().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Josef Bacik 120de408e4 btrfs: check the root node for uptodate before returning it
Now that we clear the extent buffer uptodate if we fail to write it out
we need to check to see if our root node is uptodate before we search
down it.  Otherwise we could return stale data (or potentially corrupt
data that was caught by the write verification step) and think that the
path is OK to search down.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Nikolay Borisov a174c0a2e8 btrfs: allow device add if balance is paused
Currently paused balance precludes adding a device since they are both
considered exclusive ops and we can have at most one running at a time.
This is problematic in case a filesystem encounters an ENOSPC situation
while balance is running, in this case the only thing the user can do
is mount the fs with "skip_balance" which pauses balance and delete some
data to free up space for balance. However, it should be possible to add
a new device when balance is paused.

Fix this by allowing device add to proceed when balance is paused.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Nikolay Borisov 621a1ee1d3 btrfs: make device add compatible with paused balance in btrfs_exclop_start_try_lock
This is needed to enable device add to work in cases when a file system
has been mounted with 'skip_balance' mount option.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Nikolay Borisov efc0e69c2f btrfs: introduce exclusive operation BALANCE_PAUSED state
Current set of exclusive operation states is not sufficient to handle
all practical use cases. In particular there is a need to be able to add
a device to a filesystem that have paused balance. Currently there is no
way to distinguish between a running and a paused balance. Fix this by
introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2
occasions:

1. When a filesystem is mounted with skip_balance and there is an
   unfinished balance it will now be into BALANCE_PAUSED instead of
   simply BALANCE state.

2. When a running balance is paused.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
Filipe Manana d96b34248c btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.

The restriction between balance and send was added in commit 9e967495e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e6 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.

Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.

For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.

This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:

1) For all tree searches, send acquires a read lock on the commit root
   semaphore;

2) After each tree search, and before releasing the commit root semaphore,
   the leaf is cloned and placed in the search path (struct btrfs_path);

3) After releasing the commit root semaphore, the changed_cb() callback
   is invoked, which operates on the leaf and writes commands to the pipe
   (or file in case send/receive is not used with a pipe). It's important
   here to not hold a lock on the commit root semaphore, because if we did
   we could deadlock when sending and receiving to the same filesystem
   using a pipe - the send task blocks on the pipe because it's full, the
   receive task, which is the only consumer of the pipe, triggers a
   transaction commit when attempting to create a subvolume or reserve
   space for a write operation for example, but the transaction commit
   blocks trying to write lock the commit root semaphore, resulting in a
   deadlock;

4) Before moving to the next key, or advancing to the next change in case
   of an incremental send, check if a transaction used for relocation was
   committed (or is about to finish its commit). If so, release the search
   path(s) and restart the search, to where we were before, so that we
   don't operate on stale extent buffers. The search restarts are always
   possible because both the send and parent roots are RO, and no one can
   add, remove of update keys (change their offset) in RO trees - the
   only exception is deduplication, but that is still not allowed to run
   in parallel with send;

5) Periodically check if there is contention on the commit root semaphore,
   which means there is a transaction commit trying to write lock it, and
   release the semaphore and reschedule if there is contention, so as to
   avoid causing any significant delays to transaction commits.

This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).

Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.

A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:23 +01:00
David Howells 08276bdae6 vfs, fscache: Implement pinning of cache usage for writeback
Cachefiles has a problem in that it needs to keep the backing file for a
cookie open whilst there are local modifications pending that need to be
written to it.  However, we don't want to keep the file open indefinitely,
as that causes EMFILE/ENFILE/ENOMEM problems.

Reopening the cache file, however, is a problem if this is being done due
to writeback triggered by exit().  Some filesystems will oops if we try to
open a file in that context because they want to access current->fs or
other resources that have already been dismantled.

To get around this, I added the following:

 (1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
     inode to indicate that we have a usage count on the cookie caching
     that inode.

 (2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
     when __writeback_single_inode() clears the last dirty page from
     i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this
     flag.

     This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
     done atomically with the check of PAGECACHE_TAG_DIRTY that clears
     I_DIRTY_PAGES.

 (3) A function, fscache_set_page_dirty(), which if it is not set, sets
     I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache
     resources.

 (4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
     to unuse the cookie.

 (5) A function, fscache_clear_inode_writeback(), to be called when the
     inode is evicted, before clear_inode() is called.  This cleans up any
     lingering I_PINNING_FSCACHE_WB.

The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the cache as
well as to the server.

For the future, I'm working on write helpers for netfs lib that should
allow this facility to be removed by keeping track of the dirty regions
separately - but that's incomplete at the moment and is also going to be
affected by folios, one way or another, since it deals with pages

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells b6e16652d6 fscache: Implement higher-level write I/O interface
Provide a higher-level function than fscache_write() to perform a write
from an inode's pagecache to the cache, whilst fending off concurrent
writes by means of the PG_fscache mark on a page:

	void fscache_write_to_cache(struct fscache_cookie *cookie,
				    struct address_space *mapping,
				    loff_t start,
				    size_t len,
				    loff_t i_size,
				    netfs_io_terminated_t term_func,
				    void *term_func_priv,
				    bool caching);

If caching is false, this function does nothing except call (*term_func)()
if given.  It assumes that, in such a case, PG_fscache will not have been
set on the pages.

Otherwise, if caching is true, this function requires the source pages to
have had PG_fscache set on them before calling.  start and len define the
region of the file to be modified and i_size indicates the new file size.
The source pages are extracted from the mapping.

term_func and term_func_priv work as for fscache_write().  The PG_fscache
marks will be cleared at the end of the operation, before term_func is
called or the function otherwise returns.

There is an additonal helper function to clear the PG_fscache bits from a
range of pages:

	void fscache_clear_page_bits(struct fscache_cookie *cookie,
				     struct address_space *mapping,
				     loff_t start, size_t len,
				     bool caching);

If caching is true, the pages to be managed are expected to be located on
mapping in the range defined by start and len.  If caching is false, it
does nothing.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819614155.215744.5528123235123721230.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906916346.143852.15632773570362489926.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967123599.1823006.12946816026724657428.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021522672.640689.4381958316198807813.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells 3a11b3a863 netfs: Pass more information on how to deal with a hole in the cache
Pass more information to the cache on how to deal with a hole if it
encounters one when trying to read from the cache.  Three options are
provided:

 (1) NETFS_READ_HOLE_IGNORE.  Read the hole along with the data, assuming
     it to be a punched-out extent by the backing filesystem.

 (2) NETFS_READ_HOLE_CLEAR.  If there's a hole, erase the requested region
     of the cache and clear the read buffer.

 (3) NETFS_READ_HOLE_FAIL.  Fail the read if a hole is detected.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819612321.215744.9738308885948264476.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906914460.143852.6284247083607910189.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967119923.1823006.15637375885194297582.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021519762.640689.16994364383313159319.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells 8e7a867bb7 fscache: Provide read/write stat counters for the cache
Provide read/write stat counters for the cache backend to use.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819609532.215744.10821082637727410554.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906912598.143852.12960327989649429069.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967113830.1823006.3222957649202368162.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021517502.640689.6077928311710357342.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00