The RWF_SYNC and !RWF_SYNC arms are now exactly alike except that
the RWF_SYNC arm resets the boot verifier twice in a row. Fix that
redundancy and de-duplicate the code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The nfsd_file nf_rwsem is currently being used to separate file write
and commit instances to ensure that we catch errors and apply them to
the correct write/commit.
We can improve scalability at the expense of a little accuracy (some
extra false positives) by replacing the nf_rwsem with more careful
use of the errseq_t mechanism to track errors across the different
operations.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[ cel: rebased on zero-verifier fix ]
RFC 8881 explains the purpose of the write verifier this way:
> The final portion of the result is the field writeverf. This field
> is the write verifier and is a cookie that the client can use to
> determine whether a server has changed instance state (e.g., server
> restart) between a call to WRITE and a subsequent call to either
> WRITE or COMMIT.
But then it says:
> This cookie MUST be unchanged during a single instance of the
> NFSv4.1 server and MUST be unique between instances of the NFSv4.1
> server. If the cookie changes, then the client MUST assume that
> any data written with an UNSTABLE4 value for committed and an old
> writeverf in the reply has been lost and will need to be
> recovered.
RFC 1813 has similar language for NFSv3. NFSv2 does not have a write
verifier since it doesn't implement the COMMIT procedure.
Since commit 19e0663ff9 ("nfsd: Ensure sampling of the write
verifier is atomic with the write"), the Linux NFS server has
returned a boot-time-based verifier for UNSTABLE WRITEs, but a zero
verifier for FILE_SYNC and DATA_SYNC WRITEs. FILE_SYNC and DATA_SYNC
WRITEs are not followed up with a COMMIT, so there's no need for
clients to compare verifiers for stable writes.
However, by returning a different verifier for stable and unstable
writes, the above commit puts the Linux NFS server a step farther
out of compliance with the first MUST above. At least one NFS client
(FreeBSD) noticed the difference, making this a potential
regression.
Reported-by: Rick Macklem <rmacklem@uoguelph.ca>
Link: https://lore.kernel.org/linux-nfs/YQXPR0101MB096857EEACF04A6DF1FC6D9BDD749@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM/T/
Fixes: 19e0663ff9 ("nfsd: Ensure sampling of the write verifier is atomic with the write")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If we get back -EOPENSTALE from an NFSv4 open, then we either got some
unhandled error or the inode we got back was not the same as the one
associated with the dentry.
We really have no recourse in that situation other than to retry the
open, and if it fails to just return nfserr_stale back to the client.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The NFS client can occasionally return EREMOTEIO when signalling issues
with the server. ...map to NFSERR_IO.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that we have open file cache, it is possible that another client
deletes the file and DP will not know about it. Then IO to MDS would
fail with BADSTATEID and knfsd would start state recovery, which
should fail as well and then nfs read/write will fail with EBADF.
And it triggers a WARN() in nfserrno().
-----------[ cut here ]------------
WARNING: CPU: 0 PID: 13529 at fs/nfsd/nfsproc.c:758 nfserrno+0x58/0x70 [nfsd]()
nfsd: non-standard errno: -9
modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_connt
pata_acpi floppy
CPU: 0 PID: 13529 Comm: nfsd Tainted: G W 4.1.5-00307-g6e6579b #7
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
0000000000000000 00000000464e6c9c ffff88079085fba8 ffffffff81789936
0000000000000000 ffff88079085fc00 ffff88079085fbe8 ffffffff810a08ea
ffff88079085fbe8 ffff88080f45c900 ffff88080f627d50 ffff880790c46a48
all Trace:
[<ffffffff81789936>] dump_stack+0x45/0x57
[<ffffffff810a08ea>] warn_slowpath_common+0x8a/0xc0
[<ffffffff810a0975>] warn_slowpath_fmt+0x55/0x70
[<ffffffff81252908>] ? splice_direct_to_actor+0x148/0x230
[<ffffffffa02fb8c0>] ? fsid_source+0x60/0x60 [nfsd]
[<ffffffffa02f9918>] nfserrno+0x58/0x70 [nfsd]
[<ffffffffa02fba57>] nfsd_finish_read+0x97/0xb0 [nfsd]
[<ffffffffa02fc7a6>] nfsd_splice_read+0x76/0xa0 [nfsd]
[<ffffffffa02fcca1>] nfsd_read+0xc1/0xd0 [nfsd]
[<ffffffffa0233af2>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
[<ffffffffa03073da>] nfsd3_proc_read+0xba/0x150 [nfsd]
[<ffffffffa02f7a03>] nfsd_dispatch+0xc3/0x210 [nfsd]
[<ffffffffa0233af2>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
[<ffffffffa0232913>] svc_process_common+0x453/0x6f0 [sunrpc]
[<ffffffffa0232cc3>] svc_process+0x113/0x1b0 [sunrpc]
[<ffffffffa02f740f>] nfsd+0xff/0x170 [nfsd]
[<ffffffffa02f7310>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff810bf3a8>] kthread+0xd8/0xf0
[<ffffffff810bf2d0>] ? kthread_create_on_node+0x1b0/0x1b0
[<ffffffff817912a2>] ret_from_fork+0x42/0x70
[<ffffffff810bf2d0>] ? kthread_create_on_node+0x1b0/0x1b0
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The Linux NFS server currently responds to a zero-length NFSv3 WRITE
request with NFS3ERR_IO. It responds to a zero-length NFSv4 WRITE
with NFS4_OK and count of zero.
RFC 1813 says of the WRITE procedure's @count argument:
count
The number of bytes of data to be written. If count is
0, the WRITE will succeed and return a count of 0,
barring errors due to permissions checking.
RFC 8881 has similar language for NFSv4, though NFSv4 removed the
explicit @count argument because that value is already contained in
the opaque payload array.
The synthetic client pynfs's WRT4 and WRT15 tests do emit zero-
length WRITEs to exercise this spec requirement. Commit fdec6114ee
("nfsd4: zero-length WRITE should succeed") addressed the same
problem there with the same fix.
But interestingly the Linux NFS client does not appear to emit zero-
length WRITEs, instead squelching them. I'm not aware of a test that
can generate such WRITEs for NFSv3, so I wrote a naive C program to
generate a zero-length WRITE and test this fix.
Fixes: 8154ef2776 ("NFSD: Clean up legacy NFS WRITE argument XDR decoders")
Reported-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nbl allocated in nfsd4_lock can be released by a several ways:
directly in nfsd4_lock(), via nfs4_laundromat(), via another nfs
command RELEASE_LOCKOWNER or via nfsd4_callback.
This structure should be refcounted to be used and released correctly
in all these cases.
Refcount is initialized to 1 during allocation and is incremented
when nbl is added into nbl_list/nbl_lru lists.
Usually nbl is linked into both lists together, so only one refcount
is used for both lists.
However nfsd4_lock() should keep in mind that nbl can be present
in one of lists only. This can happen if nbl was handled already
by nfs4_laundromat/nfsd4_callback/etc.
Refcount is decremented if vfs_lock_file() returns FILE_LOCK_DEFERRED,
because nbl can be handled already by nfs4_laundromat/nfsd4_callback/etc.
Refcount is not changed in find_blocked_lock() because of it reuses counter
released after removing nbl from lists.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
NFSv4.1 supports an optional lock notification feature which notifies
the client when a lock comes available. (Normally NFSv4 clients just
poll for locks if necessary.) To make that work, we need to request a
blocking lock from the filesystem.
We turned that off for NFS in commit f657f8eef3 ("nfs: don't atempt
blocking locks on nfs reexports") [sic] because it actually blocks the
nfsd thread while waiting for the lock.
Thanks to Vasily Averin for pointing out that NFS isn't the only
filesystem with that problem.
Any filesystem that leaves ->lock NULL will use posix_lock_file(), which
does the right thing. Simplest is just to assume that any filesystem
that defines its own ->lock is not safe to request a blocking lock from.
So, this patch mostly reverts commit f657f8eef3 ("nfs: don't atempt
blocking locks on nfs reexports") [sic] and commit b840be2f00 ("lockd:
don't attempt blocking locks on nfs reexports"), and instead uses a
check of ->lock (Vasily's suggestion) to decide whether to support
blocking lock notifications on a given filesystem. Also add a little
documentation.
Perhaps someday we could add back an export flag later to allow
filesystems with "good" ->lock methods to support blocking lock
notifications.
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
[ cel: Description rewritten to address checkpatch nits ]
[ cel: Fixed warning when SUNRPC debugging is disabled ]
[ cel: Fixed NULL check ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Clean up. Trond points out that xdr_stream_decode_uint32_array()
does the same thing as nfsd4_decode_bitmap4().
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The use of the bitmaps is confusing. Add a cross-reference to make it
easier to find the existing comment. Add an updated reference with URL
to make it quicker to look up. And a bit more editorializing about the
value of this.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: The garbage_args and cant_encode tracepoints report the
same information as each other, so combine them into a single
tracepoint class to reduce code duplication and slightly reduce the
size of trace.o.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Make the old ALLOCSP ioctl behave in a consistent manner with
newer syscalls like fallocate.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHDpfcACgkQ+H93GTRK
tOuI9A/9HAdOYymXZMcYrFxbxKrzgjnX+2dYAJHyko9ZT2tFl512xfGE7hA50V1w
r6FIPCOHXmuIT82kOMs+0MwqY61tJAMzkGB3N2xX1o99bl4enQ7/gqkAaMWx/0YS
H11h619EdcnZPOZhgH9vtLkWdAW4muhvmmNHnLvsv8XHinDkIOLR4fepa5zbbY/b
fkGH/OHuJ8an2scdvgPQzov8jd76jPmLT8fd1hu34WFNkReBmpllU1ATAz1pUHpZ
EvIHfUmrlHTtGrGaa5Ti8bgY0ppKIfrv/7DxRmCaSmc8K8cqBtjSM4FbCOZJKeiN
Ajn4Rru86KMD0Ker+1aLRZchY+j0S4NQH2blCqeUL3DfIcZ+qrnOEqOo8VG8slBV
GD7GKAuGL9YPQazLnrENVnl2v3r9IR83WXpO5htDom+wCcwKkI7lXLKiyrHQOYM6
spcLMfa+XUudyLyFkXZc4hCHh234B+CeEBOWQtcEOZsC+HhILLTowrTJ9m53H7v2
AjCON+6CYArP/6WlW1WaeGeZhZFq0y+HNplcu46R8ccnrZhv8UnbuUH5UQC65ktj
nWwD3T+o9toJqzsz02Kg9zseAtM4ungfrc/QMMYKIUQH7XMYAXePhr7t890NU5sZ
HPXfGSAxzV7XefG/rqe8yu0ZLR+OtKb16d1Cl/B4Py6TkLVj4G0=
=8ha9
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.16-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
- Make the old ALLOCSP ioctl behave in a consistent manner with newer
syscalls like fallocate.
* tag 'xfs-5.16-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate
We still only operate on a single page of data at a time due to using
kmap(). A more complex implementation would work on each page in a folio,
but it's not clear that such a complex implementation would be worthwhile.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
find_lock_entries() already only returned the head page of folios, so
convert it to return a folio_batch instead of a pagevec. That cascades
through converting truncate_inode_pages_range() to
delete_from_page_cache_batch() and page_cache_delete_batch().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Change the afs filesystem to support the new afs driver.
The following changes have been made:
(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole. There's also no longer a cell cookie.
(2) The volume cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). This function takes three parameters: a
string representing the "volume" in the index, a string naming the
cache to use (or NULL) and a u64 that conveys coherency metadata for
the volume.
For afs, I've made it render the volume name string as:
"afs,<cell>,<volume_id>"
and the coherency data is currently 0.
(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.
fscache_acquire_cookie() is passed the same keying and coherency
information as before, except that these are now stored in big endian
form instead of cpu endian. This makes the cache more copyable.
(4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
is opened or closed to prevent a cache file from being culled and to
keep resources to hand that are needed to do I/O.
fscache_use_cookie() is given an indication if the cache is likely to
be modified locally (e.g. the file is open for writing).
fscache_unuse_cookie() is given a coherency update if we had the file
open for writing and will update that.
(5) fscache_invalidate() is now given uptodate auxiliary data and a file
size. It can also take a flag to indicate if this was due to a DIO
write. This is wrapped into afs_fscache_invalidate() now for
convenience.
(6) fscache_resize() now gets called from the finalisation of
afs_setattr(), and afs_setattr() does use/unuse of the cookie around
the call to support this.
(7) fscache_note_page_release() is called from afs_release_page().
(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.
Render the parts of the cookie key for an afs inode cookie as big endian.
Changes
=======
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
Print extra information about how many dirty bytes an uncommitted
has at the end of mount.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we extended the size of a swapfile after its header was created (by the
mkswap utility) and then try to activate it, we will map the entire file
when activating the swap file, instead of limiting to the max size defined
in the swap file's header.
Currently test case generic/643 from fstests fails because we do not
respect that size limit defined in the swap file's header.
So fix this by not mapping file ranges beyond the max size defined in the
swap header.
This is the same type of bug that iomap used to have, and was fixed in
commit 36ca7943ac ("mm/swap: consider max pages in
iomap_swapfile_add_extent").
Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The warnings were found by running scripts/kernel-doc, which is
caused by using 'make W=1'.
fs/btrfs/extent_io.c:3210: warning: Function parameter or member
'bio_ctrl' not described in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
description in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter
'prev_bio_flags' description in 'btrfs_bio_add_page'
fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
description in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1602: warning: Function parameter or member
'fs_info' not described in 'btrfs_reserve_metadata_bytes'
Note: this is fixing only the warnings regarding parameter list, the
first line is not strictly conforming to the kdoc format as the btrfs
codebase does not stick to that and keeps the first line more free form
(because it's only for internal use).
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_decompress_bio, the only caller of compression_decompress_bio gets
type from @cb and passes it to compression_decompress_bio.
However, compression_decompress_bio can get compression type directly
from @cb.
So remove the parameter and access it through @cb. No functional
change.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When code modifying extent-io-tree get modified and got that selftest
failed, it can take some time to pin down the cause.
To make it easier to expose the problem, dump the extent io tree if the
selftest failed.
This can save developers debug time, especially since the selftest we
can not use the trace events, thus have to manually add debug trace
points.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The argument list of btrfs_stripe() has similar problems of
scrub_chunk():
- Duplicated and ambiguous @base argument
Can be fetched from btrfs_block_group::bg.
- Ambiguous argument @length
It's again device extent length
- Ambiguous argument @num
The instinctive guess would be mirror number, but in fact it's stripe
index.
Fix it by:
- Remove @base parameter
- Rename @length to @dev_extent_len
- Rename @num to @stripe_index
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The argument list of scrub_chunk() has the following problems:
- Duplicated @chunk_offset
It is the same as btrfs_block_group::start.
- Confusing @length
The most instinctive guess is chunk length, and one may want to delete
it, but the truth is, it's the device extent length.
Fix this by:
- Remove @chunk_offset
Use btrfs_block_group::start instead.
- Rename @length to @dev_extent_len
Also rename the caller to remove the ambiguous naming.
- Rename @cache to @bg
The "_cache" suffix for btrfs_block_group has been removed for a while.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently there is only one user for btrfs metadata readahead, and
that's scrub.
But even for the single user, it's not providing the correct
functionality it needs, as scrub needs reada for commit root, which
current readahead can't provide. (Although it's pretty easy to add such
feature).
Despite this, there are some extra problems related to metadata
readahead:
- Duplicated feature with btrfs_path::reada
- Partly duplicated feature of btrfs_fs_info::buffer_radix
Btrfs already caches its metadata in buffer_radix, while readahead
tries to read the tree block no matter if it's already cached.
- Poor layer separation
Metadata readahead works kinda at device level.
This is definitely not the correct layer it should be, since metadata
is at btrfs logical address space, it should not bother device at all.
This brings extra chance for bugs to sneak in, while brings
unnecessary complexity.
- Dead code
In the very beginning of scrub.c we have #undef DEBUG, rendering all
the debug related code useless and unable to test.
Thus here I purpose to remove the metadata readahead mechanism
completely.
[BENCHMARK]
There is a full benchmark for the scrub performance difference using the
old btrfs_reada_add() and btrfs_path::reada.
For the worst case (no dirty metadata, slow HDD), there could be a 5%
performance drop for scrub.
For other cases (even SATA SSD), there is no distinguishable performance
difference.
The number is reported scrub speed, in MiB/s.
The resolution is limited by the reported duration, which only has a
resolution of 1 second.
Old New Diff
SSD 455.3 466.332 +2.42%
HDD 103.927 98.012 -5.69%
Comprehensive test methodology is in the cover letter of the patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For scrub, we trigger two readaheads for two trees, extent tree to get
where to scrub, and csum tree to get the data checksum.
For csum tree we already trigger readahead in
btrfs_lookup_csums_range(), by setting path->reada.
But for extent tree we don't have any path based readahead.
Add the readahead for extent tree as well, so we can later remove the
btrfs_reada_add() based readahead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function scrub_stripe() we allocated two btrfs_path's, one @path for
extent tree search and another @ppath for full stripe extent tree search
for RAID56.
This is totally umncessary, as the @ppath usage is completely inside
scrub_raid56_parity(), thus we can move the path allocation into
scrub_raid56_parity() completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of this function is to unlock all nodes in a btrfs path
which are above 'lowest_unlock' and whose slot used is different than 0.
As such it used slightly awkward structure of 'if' as well as somewhat
cryptic "no_skip" control variable which denotes whether we should
check the current level of skipability or no.
This patch does the following (cosmetic) refactorings:
* Renames 'no_skip' to 'check_skip' and makes it a boolean. This
variable controls whether we are below the lowest_unlock/skip_level
levels.
* Consolidates the 2 conditions which warrant checking whether the
current level should be skipped under 1 common if (check_skip) branch,
this increase indentation level but is not critical.
* Consolidates the 'skip_level < i && i >= lowest_unlock' and
'i >= lowest_unlock && i > skip_level' condition into a common branch
since those are identical.
* Eliminates the local extent_buffer variable as in this case it doesn't
bring anything to function readability.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At ioctl.c:create_subvol(), when we fail to create a subvolume we always
commit the transaction. In most cases this is a no-op, since all the error
paths, except for one, abort the transaction - the only exception is when
we fail to insert the new root item into the root tree, in that case we
don't abort the transaction because we didn't do anything that is
irreversible - however we end up committing the transaction which although
is not a functional problem, it adds unnecessary rotation of the backup
roots in the superblock and unnecessary work.
So change that to commit a transaction only when no error happened,
otherwise just call btrfs_end_transaction() to release our reference on
the transaction.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ZNS specification defines a limit on the number of "active"
zones. That limit impose us to limit the number of block groups which
can be used for an allocation at the same time. Not to exceed the
limit, we reuse the existing active block groups as much as possible
when we can't activate any other zones without sacrificing an already
activated block group in commit a85f05e59b ("btrfs: zoned: avoid
chunk allocation if active block group has enough space").
However, the check is wrong in two ways. First, it checks the
condition for every raid index (ffe_ctl->index). Even if it reaches
the condition and "ffe_ctl->max_extent_size >=
ffe_ctl->min_alloc_size" is met, there can be other block groups
having enough space to hold ffe_ctl->num_bytes. (Actually, this won't
happen in the current zoned code as it only supports SINGLE
profile. But, it can happen once it enables other RAID types.)
Second, it checks the active zone availability depending on the
raid index. The raid index is just an index for
space_info->block_groups, so it has nothing to do with chunk allocation.
These mistakes are causing a faulty allocation in a certain
situation. Consider we are running zoned btrfs on a device whose
max_active_zone == 0 (no limit). And, suppose no block group have a
room to fit ffe_ctl->num_bytes but some room to meet
ffe_ctl->min_alloc_size (i.e. max_extent_size > num_bytes >=
min_alloc_size).
In this situation, the following occur:
- With SINGLE raid_index, it reaches the chunk allocation checking
code
- The check returns true because we can activate a new zone (no limit)
- But, before allocating the chunk, it iterates to the next raid index
(RAID5)
- Since there are no RAID5 block groups on zoned mode, it again
reaches the check code
- The check returns false because of btrfs_can_activate_zone()'s "if
(raid_index != BTRFS_RAID_SINGLE)" part
- That results in returning -ENOSPC without allocating a new chunk
As a result, we end up hitting -ENOSPC too early.
Move the check to the right place in the can_allocate_chunk() hook,
and do the active zone check depending on the allocation flag, not on
the raid index.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new hook for an extent allocator policy. With the new
hook, a policy can decide to allocate a new block group or not. If
not, it will return -ENOSPC, so btrfs_reserve_extent() will cut the
allocation size in half and retry the allocation if min_alloc_size is
large enough.
The hook has a place holder and will be replaced with the real
implementation in the next patch.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allocating an extent from a block group can fail for various reasons.
When an allocation from a dedicated block group (for tree-log or
relocation data) fails, we need to unregister it as a dedicated one so
that we can allocate a new block group for the dedicated one.
However, we are returning early when the block group in case it is
read-only, fully used, or not be able to activate the zone. As a result,
we keep the non-usable block group as a dedicated one, leading to
further allocation failure. With many block groups, the allocator will
iterate hopeless loop to find a free extent, results in a hung task.
Fix the issue by delaying the return and doing the proper cleanups.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to
check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the
bio's bio_op.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sink zone check into btrfs_repair_one_zone() so we don't need to do it
in all callers.
Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
a boolean function and return false in case it got called on a non-zoned
filesystem and true on a zoned filesystem.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_check_meta_write_pointer() will always be called with a NULL
'cache_ret' argument.
As there's no need to check if we have a valid block_group passed in
remove these checks.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Encapsulate the inode lock needed for serializing the data relocation
writes on a zoned filesystem into a helper.
This streamlines the code reading flow and hides special casing for
zoned filesystems.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the case of the seed device, the fsid can be different from the mounted
sprout fsid. The userland has to read the device superblock to know the
fsid but, that idea fails if the device is missing. So add a sysfs
interface devinfo/<devid>/fsid to show the fsid of the device.
For example:
$ cd /sys/fs/btrfs/b10b02a5-f9de-4276-b9e8-2bfd09a578a8
$ cat devinfo/1/fsid
c44d771f-639d-4df3-99ec-5bc7ad2af93b
$ cat devinfo/3/fsid
b10b02a5-f9de-4276-b9e8-2bfd09a578a8
Though it's related to seeding, the name of the sysfs file is plain fsid as it
matches what blkid says. A path to the device's fsid will aid scripting.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe reported a problem where sometimes he'd get an ENOSPC abort when
running delayed refs with generic/619 and the free space tree enabled.
This is partly because we do not reserve space for modifying the free
space tree, nor do we have a block rsv associated with that tree.
The delayed_refs_rsv tracks the amount of space required to run delayed
refs. This means 1 modification means 1 change to the extent root.
With the free space tree this turns into 2 changes, because modifying 1
extent means updating the extent tree and potentially updating the free
space tree to either remove that entry or add the free space. Thus if
we have the FST enabled, simply double the reservation size for our
modification.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe reported a problem where generic/619 was failing with an ENOSPC
abort while running delayed refs, like the following
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 3 PID: 522920 at fs/btrfs/free-space-tree.c:1049 add_to_free_space_tree+0xe5/0x110 [btrfs]
CPU: 3 PID: 522920 Comm: kworker/u16:19 Tainted: G W 5.16.0-rc2-btrfs-next-106 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:add_to_free_space_tree+0xe5/0x110 [btrfs]
RSP: 0000:ffffa65087fb7b20 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff9131eeaa RDI: 00000000ffffffff
RBP: ffff8d62e26481b8 R08: ffffffff9ad97ce0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffffe4
R13: ffff8d61c25fe688 R14: ffff8d61ebd88800 R15: ffff8d61ebd88a90
FS: 0000000000000000(0000) GS:ffff8d64ed400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa46a8b1000 CR3: 0000000148d18003 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__btrfs_free_extent+0x516/0x950 [btrfs]
__btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
btrfs_run_delayed_refs+0x86/0x210 [btrfs]
flush_space+0x403/0x630 [btrfs]
? call_rcu_tasks_generic+0x50/0x80
? lock_release+0x223/0x4a0
? btrfs_get_alloc_profile+0xb5/0x290 [btrfs]
? do_raw_spin_unlock+0x4b/0xa0
btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
process_one_work+0x24c/0x5b0
worker_thread+0x55/0x3c0
? process_one_work+0x5b0/0x5b0
kthread+0x17c/0x1a0
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
There's a couple of reasons for this, but in generic/619's case the
largest reason is because it is a very small file system, ad we do not
reserve enough space for the global reserve.
With the free space tree we now have the free space tree that we need to
modify when running delayed refs. This means we need the global reserve
to take this into account when it calculates the minimum size it needs
to be. This is especially important for very small file systems.
Fix this by adjusting the minimum global block rsv size math to include
the size of the free space tree when calculating the size.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two values were introduced in commit ff023aac31 ("Btrfs: add code
to scrub to copy read data to another disk") as an optimization.
But the truth is, block layer scheduler can do whatever it wants to
merge/split bios to improve performance.
Doing such "optimization" is not really going to affect much, especially
considering how good current block layer optimizations are doing.
Remove such old and immature optimization from our code.
Since we're here, also change BUG_ON()s using these two macros to use
ASSERT()s.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (minimal sectorsize) to
calculate this value.
And remove one stale comment on the value, in fact with recent subpage
support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond
BTRFS_STRIPE_LEN, just we don't use the full page.
Also since we're here, update the BUG_ON() related to
SCRUB_MAX_PAGES_PER_BLOCK to ASSERT().
As those ASSERT() are really only for developers to catch early obvious
bugs, not to let end users suffer.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only throttle the btrfs_truncate_inode_items if the root is
SHAREABLE, which isn't set on the log root, which means this loop is
unnecessary.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We reset this bool on every loop through the truncate loop, make this
variable local to the loop.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have
if (del_item)
// do something
else
// something else
if (del_item)
// do yet another thing
else
// something else entirely
back to back in btrfs_truncate_inode_items, collapse these two sets of
if statements into one.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a logic correctness check, convert it into an ASSERT() instead
of a BUG().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a correctness BUG_ON() in btrfs_truncate_inode_items to make
sure that we're always using min_type == BTRFS_EXTENT_DATA_KEY if
new_size is > 0. Convert this to an ASSERT.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we're going to want to use btrfs_truncate_inode_items
without looking up the associated inode. In order to accommodate this
add the inode to btrfs_truncate_control and handle the case where
control->inode is NULL appropriately. This is fairly straightforward,
we simply need to add a helper for the trace points, as the file extent
map update is controlled by a flag on btrfs_truncate_control.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we are going to want to truncate inode items without
needing to have an btrfs_inode to pass in, so add ino to the
btrfs_truncate_control and use that to look up the inode items to
truncate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only care about updating the file extent range when we are doing a
normal truncation. We skip this for tree logging currently, but we can
also skip this for eviction as well. Using a flag makes it more
explicit when we want to do this work.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've had weird bugs in the past where we forgot to adjust the truncate
path to deal with the fact that we can be called by the tree log path.
Instead of checking if our root is a LOG_ROOT use a flag on the
btrfs_truncate_control to indicate that we don't want to do extent
reference updates during this truncate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently have a bunch of awkward checks to make sure we only update
the inode i_bytes if we're truncating the real inode. Instead keep
track of the number of bytes we need to sub in the
btrfs_truncate_control, and then do the appropriate adjustment in the
truncate paths that care.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently will update the i_size of the inode as we truncate it down,
however we skip this if we're calling btrfs_truncate_inode_items from
the tree log code. However we also don't care about this in the case of
evict. Instead keep track of this value in the btrfs_truncate_control
and then have btrfs_truncate() and the free space cache truncate path
both do the i_size update themselves.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I'm going to be adding more arguments and counters to
btrfs_truncate_inode_items, so add a control struct to handle all of the
extra arguments to make it easier to follow.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only set this if we find a normal file extent, del_item == 1, and the
file extent points to a real extent and isn't a hole extent. We can use
del_item == 1 && extent_start != 0 to get the same information that
found_extent provides, so remove this variable and use the other
variables instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a special case in btrfs_truncate_inode_items() to call
btrfs_kill_delayed_inode_items() if min_type == 0, which is only called
during evict.
Instead move this out into evict proper, and add some comments because I
erroneously attempted to remove this code altogether without
understanding what we were doing.
Evict is updating the inode only because we only care about making sure
the i_nlink count has hit disk. If we had pending deletions we don't
want to process those via the delayed inode updates, we simply want to
drop all of them and reclaim the reserved metadata space. Then from
there the btrfs_truncate_inode_items() will do the work to remove all of
the items as appropriate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer have inode cache feature, so this check is extraneous as
the only inode cache is in the tree_root, which is not marked as
SHAREABLE.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we are locking the extent and dropping the extent cache for
any inodes we truncate, unless they're in the tree log. We call this
helper from:
- truncate
- evict
- tree log
- free space cache truncation
For evict we've already dropped all of the extent cache for this inode
once we've gotten here, and we're the only one accessing this inode, so
this step is unnecessary.
For the tree log code we already skip this part.
Pull this work into the truncate path and the free space cache
truncation path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an inode item related manipulation with a few vfs related
adjustments. I'm going to remove the vfs related code from this helper
and simplify it a lot, but I want those changes to be easily seen via
git blame, so move this function now and then the simplification work
can be done.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few helpers in inode-item.c, and I'm going to make a few
changes to how we do truncate in the future, so break out these
definitions into their own header file to trim down ctree.h some and
make it easier to do the work on truncate in the future.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment refers to the old extent buffer locking code, where we used to
have custom locks that had blocking and spinning behaviour modes. That is
not the case anymore, since we have transitioned to rw semaphores, so the
comment does not offer any value anymore. Remove it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After calling split_leaf() we BUG_ON() if the returned value is greater
than zero. However split_leaf() only returns 0, in case of success, or a
negative value in case of an error.
The reason for the BUG_ON() is that if we ever get a positive return
value from split_leaf(), we can not simply propagate it to the callers
of btrfs_search_slot(), as that would be interpreted as "key not found"
and not as an error. That means it could result in callers ending up
causing some potential silent corruption.
So change the BUG_ON() to an ASSERT(), and in case assertions are
disabled, produce a warning and set the return value to an error, to make
it not possible to get into a silent corruption and having the error not
noticed.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's quite a significant amount of code for doing the key search for a
leaf at btrfs_search_slot(), with a couple labels and gotos in it, plus
btrfs_search_slot() is already big enough.
So move the logic that does the key search on a leaf into a new helper
function. This makes it better organized, removing the need for the labels
and the gotos, as well as reducing the indentation level and the size of
btrfs_search_slot().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a key, we check if the write_lock_level is less than 1,
and if so we set it to 1, release the path and retry the tree traversal.
However that is unnecessary, because when ins_len is greater than 0, we
know that write_lock_level can never be less than 1.
The logic to retry is also buggy, because in case ins_len was decremented,
due to an exact key match and the search is not meant for item extension
(path->search_for_extension is 0), we retry without incrementing ins_len,
which would make the next retry decrement it again by the same amount.
So remove the check for write_lock_level being less than 1 and add an
assertion to assert it's always >= 1.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a new key, we release the write lock on the leaf's parent
only after doing the binary search on the leaf. This is because if the
key ends up at slot 0, we will have to update the key at slot 0 of the
parent node. The same reasoning applies to any other upper level nodes
when their slot is 0. We also need to keep the parent locked in case the
leaf does not have enough free space to insert the new key/item, because
in that case we will split the leaf and we will need to add a new key to
the parent due to a new leaf resulting from the split operation.
However if the leaf has enough space for the new key and the key does not
end up at slot 0 of the leaf we could release our write lock on the parent
before doing the binary search on the leaf to figure out the destination
slot. That leads to reducing the amount of time other tasks are blocked
waiting to lock the parent, therefore increasing parallelism when there
are other tasks that are trying to access other leaves accessible through
the same parent. This also applies to other upper nodes besides the
immediate parent, when their slot is 0, since we keep locks on them until
we figure out if the leaf slot is slot 0 or not.
In fact, having the key ending at up slot 0 when is rare. Typically it
only happens when the key is less than or equals to the smallest, the
"left most", key of the entire btree, during a split attempt when we try
to push to the right sibling leaf or when the caller just wants to update
the item of an existing key. It's also very common that a leaf has enough
space to insert a new key, since after a split we move about half of the
keys from one into the new leaf.
So unlock the parent, and any other upper level nodes, when during a key
insertion we notice the key is greater then the first key in the leaf and
the leaf has enough free space. After unlocking the upper level nodes, do
the binary search using a low boundary of slot 1 and not slot 0, to figure
out the slot where the key will be inserted (or where the key already is
in case it exists and the caller wants to modify its item data).
This extra comparison, with the first key, is cheap and the key is very
likely already in a cache line because it immediately follows the header
of the extent buffer and we have recently read the level field of the
header (which in fact is the last field of the header).
The following fs_mark test was run on a non-debug kernel (debian's default
kernel config), with a 12 cores intel CPU, and using a NVMe device:
$ cat run-fsmark.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
FILES=100000
THREADS=$(nproc --all)
FILE_SIZE=0
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this change:
FSUse% Count Size Files/sec App Overhead
0 1200000 0 165273.6 5958381
0 2400000 0 190938.3 6284477
0 3600000 0 181429.1 6044059
0 4800000 0 173979.2 6223418
0 6000000 0 139288.0 6384560
0 7200000 0 163000.4 6520083
1 8400000 0 57799.2 5388544
1 9600000 0 66461.6 5552969
2 10800000 0 49593.5 5163675
2 12000000 0 57672.1 4889398
After this change:
FSUse% Count Size Files/sec App Overhead
0 1200000 0 167987.3 (+1.6%) 6272730
0 2400000 0 198563.9 (+4.0%) 6048847
0 3600000 0 197436.6 (+8.8%) 6163637
0 4800000 0 202880.7 (+16.6%) 6371771
1 6000000 0 167275.9 (+20.1%) 6556733
1 7200000 0 204051.2 (+25.2%) 6817091
1 8400000 0 69622.8 (+20.5%) 5525675
1 9600000 0 69384.5 (+4.4%) 5700723
1 10800000 0 61454.1 (+23.9%) 5363754
3 12000000 0 61908.7 (+7.3%) 5370196
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now generic_bin_search() always uses a low boundary slot of 0, but
in the next patch we'll want to often skip slot 0 when searching for a
key. So make generic_bin_search() have the low boundary slot specified
as an argument, and move the check for the extent buffer level from
btrfs_bin_search() to generic_bin_search() to avoid adding another
wrapper around generic_bin_search().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we clear the extent buffer uptodate if we fail to write it out
we need to check to see if our root node is uptodate before we search
down it. Otherwise we could return stale data (or potentially corrupt
data that was caught by the write verification step) and think that the
path is OK to search down.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently paused balance precludes adding a device since they are both
considered exclusive ops and we can have at most one running at a time.
This is problematic in case a filesystem encounters an ENOSPC situation
while balance is running, in this case the only thing the user can do
is mount the fs with "skip_balance" which pauses balance and delete some
data to free up space for balance. However, it should be possible to add
a new device when balance is paused.
Fix this by allowing device add to proceed when balance is paused.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is needed to enable device add to work in cases when a file system
has been mounted with 'skip_balance' mount option.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current set of exclusive operation states is not sufficient to handle
all practical use cases. In particular there is a need to be able to add
a device to a filesystem that have paused balance. Currently there is no
way to distinguish between a running and a paused balance. Fix this by
introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set in 2
occasions:
1. When a filesystem is mounted with skip_balance and there is an
unfinished balance it will now be into BALANCE_PAUSED instead of
simply BALANCE state.
2. When a running balance is paused.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e6 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cachefiles has a problem in that it needs to keep the backing file for a
cookie open whilst there are local modifications pending that need to be
written to it. However, we don't want to keep the file open indefinitely,
as that causes EMFILE/ENFILE/ENOMEM problems.
Reopening the cache file, however, is a problem if this is being done due
to writeback triggered by exit(). Some filesystems will oops if we try to
open a file in that context because they want to access current->fs or
other resources that have already been dismantled.
To get around this, I added the following:
(1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
inode to indicate that we have a usage count on the cookie caching
that inode.
(2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
when __writeback_single_inode() clears the last dirty page from
i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this
flag.
This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
done atomically with the check of PAGECACHE_TAG_DIRTY that clears
I_DIRTY_PAGES.
(3) A function, fscache_set_page_dirty(), which if it is not set, sets
I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache
resources.
(4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
to unuse the cookie.
(5) A function, fscache_clear_inode_writeback(), to be called when the
inode is evicted, before clear_inode() is called. This cleans up any
lingering I_PINNING_FSCACHE_WB.
The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the cache as
well as to the server.
For the future, I'm working on write helpers for netfs lib that should
allow this facility to be removed by keeping track of the dirty regions
separately - but that's incomplete at the moment and is also going to be
affected by folios, one way or another, since it deals with pages
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4
Provide a higher-level function than fscache_write() to perform a write
from an inode's pagecache to the cache, whilst fending off concurrent
writes by means of the PG_fscache mark on a page:
void fscache_write_to_cache(struct fscache_cookie *cookie,
struct address_space *mapping,
loff_t start,
size_t len,
loff_t i_size,
netfs_io_terminated_t term_func,
void *term_func_priv,
bool caching);
If caching is false, this function does nothing except call (*term_func)()
if given. It assumes that, in such a case, PG_fscache will not have been
set on the pages.
Otherwise, if caching is true, this function requires the source pages to
have had PG_fscache set on them before calling. start and len define the
region of the file to be modified and i_size indicates the new file size.
The source pages are extracted from the mapping.
term_func and term_func_priv work as for fscache_write(). The PG_fscache
marks will be cleared at the end of the operation, before term_func is
called or the function otherwise returns.
There is an additonal helper function to clear the PG_fscache bits from a
range of pages:
void fscache_clear_page_bits(struct fscache_cookie *cookie,
struct address_space *mapping,
loff_t start, size_t len,
bool caching);
If caching is true, the pages to be managed are expected to be located on
mapping in the range defined by start and len. If caching is false, it
does nothing.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819614155.215744.5528123235123721230.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906916346.143852.15632773570362489926.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967123599.1823006.12946816026724657428.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021522672.640689.4381958316198807813.stgit@warthog.procyon.org.uk/ # v4
Provide a function to begin a read operation:
int fscache_begin_read_operation(
struct netfs_cache_resources *cres,
struct fscache_cookie *cookie)
This is primarily intended to be called by network filesystems on behalf of
netfslib, but may also be called to use the I/O access functions directly.
It attaches the resources required by the cache to cres struct from the
supplied cookie.
This holds access to the cache behind the cookie for the duration of the
operation and forces cache withdrawal and cookie invalidation to perform
synchronisation on the operation. cres->inval_counter is set from the
cookie at this point so that it can be compared at the end of the
operation.
Note that this does not guarantee that the cache state is fully set up and
able to perform I/O immediately; looking up and creation may be left in
progress in the background. The operations intended to be called by the
network filesystem, such as reading and writing, are expected to wait for
the cookie to move to the correct state.
This will, however, potentially sleep, waiting for a certain minimum state
to be set or for operations such as invalidate to advance far enough that
I/O can resume.
Also provide a function for the cache to call to wait for the cache object
to get to a state where it can be used for certain things:
bool fscache_wait_for_operation(struct netfs_cache_resources *cres,
enum fscache_want_stage stage);
This looks at the cache resources provided by the begin function and waits
for them to get to an appropriate stage. There's a choice of wanting just
some parameters (FSCACHE_WANT_PARAM) or the ability to do I/O
(FSCACHE_WANT_READ or FSCACHE_WANT_WRITE).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819603692.215744.146724961588817028.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906910672.143852.13856103384424986357.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967110245.1823006.2239170567540431836.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021513617.640689.16627329360866150606.stgit@warthog.procyon.org.uk/ # v4
Provide a pair of functions to count the number of users of a cookie (open
files, writeback, invalidation, resizing, reads, writes), to obtain and pin
resources for the cookie and to prevent culling for the whilst there are
users.
The first function marks a cookie as being in use:
void fscache_use_cookie(struct fscache_cookie *cookie,
bool will_modify);
The caller should indicate the cookie to use and whether or not the caller
is in a context that may modify the cookie (e.g. a file open O_RDWR).
If the cookie is not already resourced, fscache will ask the cache backend
in the background to do whatever it needs to look up, create or otherwise
obtain the resources necessary to access data. This is pinned to the
cookie and may not be culled, though it may be withdrawn if the cache as a
whole is withdrawn.
The second function removes the in-use mark from a cookie and, optionally,
updates the coherency data:
void fscache_unuse_cookie(struct fscache_cookie *cookie,
const void *aux_data,
const loff_t *object_size);
If non-NULL, the aux_data buffer and/or the object_size will be saved into
the cookie and will be set on the backing store when the object is
committed.
If this removes the last usage on a cookie, the cookie is placed onto an
LRU list from which it will be removed and closed after a couple of seconds
if it doesn't get reused. This prevents resource overload in the cache -
in particular it prevents it from holding too many files open.
Changes
=======
ver #2:
- Fix fscache_unuse_cookie() to use atomic_dec_and_lock() to avoid a
potential race if the cookie gets reused before it completes the
unusement.
- Added missing transition to LRU_DISCARDING state.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819600612.215744.13678350304176542741.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906907567.143852.16979631199380722019.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967106467.1823006.6790864931048582667.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021511674.640689.10084988363699111860.stgit@warthog.procyon.org.uk/ # v4
Add a number of helper functions to manage access to a cookie, pinning the
cache object in place for the duration to prevent cache withdrawal from
removing it:
(1) void fscache_init_access_gate(struct fscache_cookie *cookie);
This function initialises the access count when a cache binds to a
cookie. An extra ref is taken on the access count to prevent wakeups
while the cache is active. We're only interested in the wakeup when a
cookie is being withdrawn and we're waiting for it to quiesce - at
which point the counter will be decremented before the wait.
The FSCACHE_COOKIE_NACC_ELEVATED flag is set on the cookie to keep
track of the extra ref in order to handle a race between
relinquishment and withdrawal both trying to drop the extra ref.
(2) bool fscache_begin_cookie_access(struct fscache_cookie *cookie,
enum fscache_access_trace why);
This function attempts to begin access upon a cookie, pinning it in
place if it's cached. If successful, it returns true and leaves a the
access count incremented.
(3) void fscache_end_cookie_access(struct fscache_cookie *cookie,
enum fscache_access_trace why);
This function drops the access count obtained by (2), permitting
object withdrawal to take place when it reaches zero.
A tracepoint is provided to track changes to the access counter on a
cookie.
Changes
=======
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819595085.215744.1706073049250505427.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906895313.143852.10141619544149102193.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095980.1823006.1133648159424418877.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021503063.640689.8870918985269528670.stgit@warthog.procyon.org.uk/ # v4
Add a pair of helper functions to manage access to a volume, pinning the
volume in place for the duration to prevent cache withdrawal from removing
it:
bool fscache_begin_volume_access(struct fscache_volume *volume,
enum fscache_access_trace why);
void fscache_end_volume_access(struct fscache_volume *volume,
enum fscache_access_trace why);
The way the access gate on the volume works/will work is:
(1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE),
then we return false to indicate access was not permitted.
(2) If the cache tests as live, then we increment the volume's n_accesses
count and then recheck the cache liveness, ending the access if it
ceased to be live.
(3) When we end the access, we decrement the volume's n_accesses and wake
up the any waiters if it reaches 0.
(4) Whilst the cache is caching, the volume's n_accesses is kept
artificially incremented to prevent wakeups from happening.
(5) When the cache is taken offline, the state is changed to prevent new
accesses, the volume's n_accesses is decremented and we wait for it to
become 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819594158.215744.8285859817391683254.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906894315.143852.5454793807544710479.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095028.1823006.9173132503876627466.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021501546.640689.9631510472149608443.stgit@warthog.procyon.org.uk/ # v4
Add functions to the fscache API to allow data file cookies to be acquired
and relinquished by the network filesystem. It is intended that the
filesystem will create such cookies per-inode under a volume.
To request a cookie, the filesystem should call:
struct fscache_cookie *
fscache_acquire_cookie(struct fscache_volume *volume,
u8 advice,
const void *index_key,
size_t index_key_len,
const void *aux_data,
size_t aux_data_len,
loff_t object_size)
The filesystem must first have created a volume cookie, which is passed in
here. If it passes in NULL then the function will just return a NULL
cookie.
A binary key should be passed in index_key and is of size index_key_len.
This is saved in the cookie and is used to locate the associated data in
the cache.
A coherency data buffer of size aux_data_len will be allocated and
initialised from the buffer pointed to by aux_data. This is used to
validate cache objects when they're opened and is stored on disk with them
when they're committed. The data is stored in the cookie and will be
updateable by various functions in later patches.
The object_size must also be given. This is also used to perform a
coherency check and to size the backing storage appropriately.
This function disallows a cookie from being acquired twice in parallel,
though it will cause the second user to wait if the first is busy
relinquishing its cookie.
When a network filesystem has finished with a cookie, it should call:
void
fscache_relinquish_cookie(struct fscache_volume *volume,
bool retire)
If retire is true, any backing data will be discarded immediately.
Changes
=======
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[1].
- Add a check to see if the cookie is still hashed at the point of
freeing.
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
- Remove the unused cookie pointer field from the fscache_acquire
tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/163819590658.215744.14934902514281054323.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906891983.143852.6219772337558577395.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967088507.1823006.12659006350221417165.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021498432.640689.12743483856927722772.stgit@warthog.procyon.org.uk/ # v4
Add functions to the fscache API to allow volumes to be acquired and
relinquished by the network filesystem. A volume is an index of data
storage cache objects. A volume is represented by a volume cookie in the
API. A filesystem would typically create a volume for a superblock and
then create per-inode cookies within it.
To request a volume, the filesystem calls:
struct fscache_volume *
fscache_acquire_volume(const char *volume_key,
const char *cache_name,
const void *coherency_data,
size_t coherency_len)
The volume_key is a printable string used to match the volume in the cache.
It should not contain any '/' characters. For AFS, for example, this would
be "afs,<cellname>,<volume_id>", e.g. "afs,example.com,523001".
The cache_name can be NULL, but if not it should be a string indicating the
name of the cache to use if there's more than one available.
The coherency data, if given, is an arbitrarily-sized blob that's attached
to the volume and is compared when the volume is looked up. If it doesn't
match, the old volume is judged to be out of date and it and everything
within it is discarded.
Acquiring a volume twice concurrently is disallowed, though the function
will wait if an old volume cookie is being relinquishing.
When a network filesystem has finished with a volume, it should return the
volume cookie by calling:
void
fscache_relinquish_volume(struct fscache_volume *volume,
const void *coherency_data,
bool invalidate)
If invalidate is true, the entire volume will be discarded; if false, the
volume will be synced and the coherency data will be updated.
Changes
=======
ver #4:
- Removed an extraneous param from kdoc on fscache_relinquish_volume()[3].
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[2].
- Make the coherency data an arbitrary blob rather than a u64, but don't
store it for the moment.
ver #2:
- Fix error check[1].
- Make a fscache_acquire_volume() return errors, including EBUSY if a
conflicting volume cookie already exists. No error is printed now -
that's left to the netfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20211203095608.GC2480@kili/ [1]
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [2]
Link: https://lore.kernel.org/r/20211220224646.30e8205c@canb.auug.org.au/ [3]
Link: https://lore.kernel.org/r/163819588944.215744.1629085755564865996.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906890630.143852.13972180614535611154.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967086836.1823006.8191672796841981763.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021495816.640689.4403156093668590217.stgit@warthog.procyon.org.uk/ # v4
Inodes aren't supposed to have a project id of -1U (aka 4294967295) but
the kernel hasn't always validated FSSETXATTR correctly. Flag this as
something for the sysadmin to check out.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Online fsck depends on callers holding ILOCK_EXCL from the time they
decide to update a block mapping until after they've updated the reverse
mapping records to guarantee the stability of both mapping records.
Unfortunately, the quota code drops ILOCK_EXCL at the first transaction
roll in the dquot allocation process, which breaks that assertion. This
leads to sporadic failures in the online rmap repair code if the repair
code grabs the AGF after bmapi_write maps a new block into the quota
file's data fork but before it can finish the deferred rmap update.
Fix this by rewriting the function to hold the ILOCK until after the
transaction commit like all other bmap updates do, and get rid of the
dqread wrapper that does nothing but complicate the codebase.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
mp is being initialized to log->l_mp but this is never read
as record is overwritten later on. Remove the redundant
assignment.
Cleans up the following clang-analyzer warning:
fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during
its initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Oh, let me count the ways that the kvmalloc API sucks dog eggs.
The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:
- 49.73% xlog_cil_commit
- 31.62% kvmalloc_node
- 29.96% __kmalloc_node
- 29.38% kmalloc_large_node
- 29.33% __alloc_pages
- 24.33% __alloc_pages_slowpath.constprop.0
- 18.35% __alloc_pages_direct_compact
- 17.39% try_to_compact_pages
- compact_zone_order
- 15.26% compact_zone
5.29% __pageblock_pfn_to_page
3.71% PageHuge
- 1.44% isolate_migratepages_block
0.71% set_pfnblock_flags_mask
1.11% get_pfnblock_flags_mask
- 0.81% get_page_from_freelist
- 0.59% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 3.24% try_to_free_pages
- 3.14% shrink_node
- 2.94% shrink_slab.constprop.0
- 0.89% super_cache_count
- 0.66% xfs_fs_nr_cached_objects
- 0.65% xfs_reclaim_inodes_count
0.55% xfs_perag_get_tag
0.58% kfree_rcu_shrink_count
- 2.09% get_page_from_freelist
- 1.03% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 4.88% get_page_from_freelist
- 3.66% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 1.63% __vmalloc_node
- __vmalloc_node_range
- 1.10% __alloc_pages_bulk
- 0.93% __alloc_pages
- 0.92% get_page_from_freelist
- 0.89% rmqueue_bulk
- 0.69% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
13.73% memcpy_erms
- 2.22% kvfree
On this workload, that's almost a dozen CPUs all trying to compact
and reclaim memory inside kvmalloc_node at the same time. Yet it is
regularly falling back to vmalloc despite all that compaction, page
and shrinker reclaim that direct reclaim is doing. Copying all the
metadata is taking far less CPU time than allocating the storage!
Direct reclaim should be considered extremely harmful.
This is a high frequency, high throughput, CPU usage and latency
sensitive allocation. We've got memory there, and we're using
kvmalloc to allow memory allocation to avoid doing lots of work to
try to do contiguous allocations.
Except it still does *lots of costly work* that is unnecessary.
Worse: the only way to avoid the slowpath page allocation trying to
do compaction on costly allocations is to turn off direct reclaim
(i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).
Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
GFP_KERNEL allocation context, so you only get kmalloc!". This
cuts off the vmalloc fallback, and this leads to almost instant OOM
problems which ends up in filesystems deadlocks, shutdowns and/or
kernel crashes.
I want some basic kvmalloc behaviour:
- kmalloc for a contiguous range with fail fast semantics - no
compaction direct reclaim if the allocation enters the slow path.
- run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
The really, really stupid part about this is these kvmalloc() calls
are run under memalloc_nofs task context, so all the allocations are
always reduced to GFP_NOFS regardless of the fact that kvmalloc
requires GFP_KERNEL to be passed in. IOWs, we're already telling
kvmalloc to behave differently to the gfp flags we pass in, but it
still won't allow vmalloc to be run with anything other than
GFP_KERNEL.
So, this patch open codes the kvmalloc() in the commit path to have
the above described behaviour. The result is we more than halve the
CPU time spend doing kvmalloc() in this path and transaction commits
with 64kB objects in them more than doubles. i.e. we get ~5x
reduction in CPU usage per costly-sized kvmalloc() invocation and
the profile looks like this:
- 37.60% xlog_cil_commit
16.01% memcpy_erms
- 8.45% __kmalloc
- 8.04% kmalloc_order_trace
- 8.03% kmalloc_order
- 7.93% alloc_pages
- 7.90% __alloc_pages
- 4.05% __alloc_pages_slowpath.constprop.0
- 2.18% get_page_from_freelist
- 1.77% wake_all_kswapds
....
- __wake_up_common_lock
- 0.94% _raw_spin_lock_irqsave
- 3.72% get_page_from_freelist
- 2.43% _raw_spin_lock_irqsave
- 5.72% vmalloc
- 5.72% __vmalloc_node_range
- 4.81% __get_vm_area_node.constprop.0
- 3.26% alloc_vmap_area
- 2.52% _raw_spin_lock
- 1.46% _raw_spin_lock
0.56% __alloc_pages_bulk
- 4.66% kvfree
- 3.25% vfree
- __vfree
- 3.23% __vunmap
- 1.95% remove_vm_area
- 1.06% free_vmap_area_noflush
- 0.82% _raw_spin_lock
- 0.68% _raw_spin_lock
- 0.92% _raw_spin_lock
- 1.40% kfree
- 1.36% __free_pages
- 1.35% __free_pages_ok
- 1.02% _raw_spin_lock_irqsave
It's worth noting that over 50% of the CPU time spent allocating
these shadow buffers is now spent on spinlocks. So the shadow buffer
allocation overhead is greatly reduced by getting rid of direct
reclaim from kmalloc, and could probably be made even less costly if
vmalloc() didn't use global spinlocks to protect it's structures.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the xfs sysfs code to use default_groups field which has
been the preferred way since aa30f47cf6 ("kobject: Add support for
default attribute groups to kobj_type") so that we can soon get rid of
the obsolete default_attrs field.
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When the kernel is locked down the kernel allows reading only debugfs
files with mode 444. Mode 400 is also valid but is not allowed.
Make the 444 into a mask.
Fixes: 5496197f9b ("debugfs: Restrict debugfs when the kernel is locked down")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Link: https://lore.kernel.org/r/20220104170505.10248-1-msuchanek@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Move the request list macros to the header file that defines that struct
they operate on.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220105170518.3181469-2-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Get rid of old erofs_get_meta_page() within zmap operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Finally, erofs_get_meta_page() is useless. Get rid of it!
Link: https://lore.kernel.org/r/20220102040017.51352-6-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within xattr operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102040017.51352-5-hsiangkao@linux.alibaba.com
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within super operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102081317.109797-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within inode operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102040017.51352-3-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
In order to support subpage and folio for all uncompressed files,
introduce meta buffer descriptors, which can be effectively stored
on stack, in place of meta page operations.
This converts the uncompressed data path to meta buffers.
Link: https://lore.kernel.org/r/20220102040017.51352-2-hsiangkao@linux.alibaba.com
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
This patch prints the cluster node address if a non-cluster node
(according to the dlm config setting) tries to connect. The current
hexdump call will print in a different loglevel and only available if
dynamic debug is enabled. Additional we using the ip address format
strings to print an IETF ip4/6 string represenation.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
btrfs_free_space_ctl::private is either unset or it always points to
struct btrfs_block_group when it is set. So there's no point in keeping
the unhelpful 'private' name and keeping it an untyped pointer. Change
both the type and name to be self-describing. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in the function taking an fs_info and a
btrfs_free_space because the ctl passed always belongs to the block
group. Furthermore fs_info can be referenced from the block group. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only difference between the two is whether btrfs_free_space::bytes
is adjusted. Instead of having 2 separate functions control this
behavior via an additional parameter and make them one function instead.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only difference is the former adjusts btrfs_free_space::bytes
member. Consolidate the two function into 1 and add a bool parameter
which controls whether the adjustment is made or not. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we are going to have multiple copies of these trees. To
facilitate this we need a way to lookup the different roots we are
looking for. Handle this by adding a global root rb tree that is
indexed on the root->root_key. Then instead of loading the roots at
mount time with individually targeted keys, simply search the tree_root
for anything with the specific objectid we want. This will make it
straightforward to support both old style and new style file systems.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't set SHAREABLE on the extent root, we don't need to have this
safety check here.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have multiple free space roots in the future, so adjust
all the users of the free space root to use a helper to access the root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiple csum roots in the future, so convert all
users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
->_csum_root so we can easily find remaining users in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few places where we skip doing csums if we mounted with one of
the rescue options that ignores bad csum roots. In the future when
there are multiple csum roots it'll be costly to check and see if there
are any missing csum roots, so simply add a flag to indicate the fs
should skip loading csums in case of errors.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we may have multiple csum roots, so simply check the
objectid is for a csum root instead of checking against ->csum_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we start having multiple extent roots we'll need to use a helper to
get to the correct extent_root. Rename fs_info->extent_root to
_extent_root and convert all of the users of the extent root to using
the btrfs_extent_root() helper. This will allow us to easily clean up
the remaining direct accesses in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>