Since all the recovery paths have been migrated to the new error bitmap
based system, we can remove the old stripe number based system.
This cleanup involves one behavior change:
- Rebuild rbio can no longer be merged
Previously a rebuild rbio (caused by retry after data csum mismatch)
can be merged, if the error happens in the same stripe.
But with the new error bitmap based solution, it's much harder to
compare error bitmaps.
So here we just don't merge rebuild rbio at all.
This may introduce some performance impact at extreme corner cases,
but we're willing to take it.
Other than that, this patch will cleanup the following members:
- rbio::faila
- rbio::failb
They will be replaced by per-vertical stripe check, which is more
accurate.
- rbio::error
It will be replace by per-vertical stripe error bitmap check.
- Allow get_rbio_vertical_errors() to accept NULL pointers for
@faila and @failb
Some call sites only want to check if we have errors beyond the
tolerance.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have rbio::error_bitmap to indicate exactly where the errors
are (including read error and csum mismatch error), we can make recovery
path more accurate.
For example:
0 32K 64K
Data 1 |XXXXXXXX| |
Data 2 | |XXXXXXXXX|
Parity | | |
1) Get csum mismatch when reading data 1 [0, 32K)
2) Mark corresponding range error
The old code will mark the whole data 1 stripe as error.
While the new code will only mark data 1 [0, 32K) as error.
3) Recovery path
The old code will recover data 1 [0, 64K), all using Data 2 and
parity.
This means, Data 1 [32K, 64K) will be corrupted data, as data 2
[32K, 64K) is already corrupted.
While the new code will only recover data 1 [0, 32K), as only
that range has error so far.
This new behavior can avoid populating rbio cache with incorrect data.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs raid56 uses btrfs_raid_bio::faila and failb to indicate
which stripe(s) had IO errors.
But that has some problems:
- If one sector failed csum check, the whole stripe where the corruption
is will be marked error.
This can reduce the chance we do recover, like this:
0 4K 8K
Data 1 |XX| |
Data 2 | |XX|
Parity | | |
In above case, 0~4K in data 1 should be recovered using data 2 and
parity, while 4K~8K in data 2 should be recovered using data 1 and
parity.
Currently if we trigger read on 0~4K of data 1, we will also recover
4K~8K of data 1 using corrupted data 2 and parity, causing wrong
result in rbio cache.
- Harder to expand for future M-N scheme
As we're limited to just faila/b, two corruptions.
- Harder to expand to handle extra csum errors
This can be problematic if we start to do csum verification.
This patch will introduce an extra @error_bitmap, where one bit
represents error that happened for that sector.
The choice to introduce a new error bitmap other than reusing
sector_ptr, is to avoid extra search between rbio::stripe_sectors[] and
rbio::bio_sectors[].
Since we can submit bio using sectors from both sectors, doing proper
search on both array will more complex.
Although the new bitmap will take extra memory, later we can remove
things like @error and faila/b to save some memory.
Currently the new error bitmap and failab mechanism coexists, the error
bitmap is only updated at endio time and recover entrance.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is mostly using internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is mostly using internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The async_chunk::inode structure is for internal interfaces so we should
use the btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent_io_tree::private_data was meant to be a preparatory work for
the metadata inode rework but that never materialized. Now it's used
only for an inode so it's better to change the appropriate type and
rename it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers except one pass NULL, so the parameter can be dropped and
the inode::io_tree initialization can be open coded.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_writepage_fixup structure is for internal interfaces so we
should use the btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_dio_private structure is for internal interfaces so we should
use the btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The async bio submit is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After previous patches the unused parameters can be removed from
btree_submit_bio_start and btrfs_submit_bio_start as they don't need to
conform to the extent_submit_bio_start_t typedef.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a callback function parameter for btrfs_wq_submit_bio that can
be one of: metadata, buffered data, direct io data. The callback
abstraction is unnecessary as we have all functions available.
Replace the parameter with a command that leads to a direct call in
run_one_async_start. The called functions can be then simplified and we
can also remove the extent_submit_bio_start_t typedef.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Compression and direct io don't work together so the compression
parameter can be dropped after previous patch that changed the call
to direct.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a function pointer passed to btrfs_repair_one_sector that will
submit the right bio for repair. However there are only two callbacks,
for buffered and for direct IO. This can be simplified to a bool-based
switch and call either function, indirect calls in this case is an
unnecessary abstraction. This allows to remove the submit_bio_hook_t
typedef.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In zoned mode the sequential status of zone can be also tracked in the
runtime flags of block group.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have flags in block group to track various status bits,
convert needs_free_space as well and reduce size of btrfs_block_group.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a 64bit compatible helper to check if a value is a power of two,
use it instead of open coding it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The copy_page helper may use an optimized version for full page copy
(eg. on s390 there's a special instruction for that), there's one more
left to convert.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the previous patchset which is comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
we have now much better performance when doing backref walking in the send
code, so we can increase the current limit from 64 to 1024 references.
This limit is still a bit conservative because there are still edge cases
where backref walking will be too slow and spend a lot of cpu time, some IO
reading b+tree nodes/leaves and memory. The goal is to eventually get rid
of any limit, but for now bump it as it benefits users with extents shared
more than 64 times and up to 1024 times, allowing for more deduplication
at the destination without having to run a dedupe tool after a receive.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing backref walking to determine a source range to clone from, it
is worthless to collect and resolve our own data backref, as we can't
obviously use it as a clone source and it represents the range we want to
clone into. Collecting the backref implies doing the extra work to resolve
it, doing the search for a file extent item in a subvolume tree, etc.
Skipping the data backref is valid as long as we only have the send root
as the single clone root, otherwise the leaf with the file extent item may
be accessible from another clone root due to shared subtrees created by
snapshots, and therefore we have to collect the backref and resolve it.
So add a callback to the backref walking code to guide it to skip data
backrefs.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
The following test was run on non-debug kernel (Debian's default kernel
config) before and after applying the patchset:
$ cat test-send-many-shared-extents.sh
#!/bin/bash
DEV=/dev/sdh
MNT=/mnt/sdh
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT
num_files=50000
num_clones_per_file=50
for ((i = 1; i <= $num_files; i++)); do
xfs_io -f -c "pwrite 0 64K" $MNT/file_$i > /dev/null
echo -ne "\r$i files created..."
done
echo
btrfs subvolume snapshot -r $MNT $MNT/snap1
cloned=0
for ((i = 1; i <= $num_clones_per_file; i++)); do
for ((j = 1; j <= $num_files; j++)); do
cp --reflink=always $MNT/file_$j $MNT/file_${j}_clone_${i}
cloned=$((cloned + 1))
echo -ne "\r$cloned / $((num_files * num_clones_per_file)) clone operations"
done
done
echo
btrfs subvolume snapshot -r $MNT $MNT/snap2
# Unmount and mount again to clear all cached metadata (and data).
umount $DEV
mount $DEV $MNT
start=$(date +%s%N)
btrfs send $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000000 ))
echo -e "\nFull send took $dur seconds"
# Unmount and mount again to clear all cached metadata (and data).
umount $DEV
mount $DEV $MNT
start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)
dur=$(( (end - start) / 1000000000 ))
echo -e "\nIncremental send took $dur seconds"
umount $MNT
Before applying the patchset:
(...)
Full send took 1108 seconds
(...)
Incremental send took 1135 seconds
After applying the whole patchset:
(...)
Full send took 268 seconds (-75.8%)
(...)
Incremental send took 316 seconds (-72.2%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:
1) Once with a call to extent_from_logical();
2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.
The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.
The first call is there since the send code was introduced and it
accomplishes two things:
1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;
2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.
So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.
Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.
This change is part of a patchset comprised of the following patches:
01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source
Performance test results are in the changelog of patch 17/17.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>