[BUG]
There are several reported runaway balance, that balance is flooding the
log with "found X extents" where the X never changes.
[CAUSE]
Commit d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after
merge_reloc_roots") introduced BTRFS_ROOT_DEAD_RELOC_TREE bit to
indicate that one subvolume has finished its tree blocks swap with its
reloc tree.
However if balance is canceled or hits ENOSPC halfway, we didn't clear
the BTRFS_ROOT_DEAD_RELOC_TREE bit, leaving that bit hanging forever
until unmount.
Any subvolume root with that bit, would cause backref cache to skip this
tree block, as it has finished its tree block swap. This would cause
all tree blocks of that root be ignored by balance, leading to runaway
balance.
[FIX]
Fix the problem by also clearing the BTRFS_ROOT_DEAD_RELOC_TREE bit for
the original subvolume of orphan reloc root.
Add an umount check for the stale bit still set.
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a snapshot, ordered extents need to be flushed and this
can take a long time.
In create_snapshot there are two locks held when this happens:
1. Destination directory inode lock
2. Global subvolume semaphore
This will unnecessarily block other operations like subvolume destroy,
create, or setflag until the snapshot is created.
We can fix that by moving the flush outside the locked section as this
does not depend on the aforementioned locks. The code factors out the
snapshot related work from create_snapshot to btrfs_mksnapshot.
__btrfs_ioctl_snap_create
btrfs_mksubvol
create_subvol
btrfs_mksnapshot
<flush>
btrfs_mksubvol
create_snapshot
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
SHAREABLE flag is set for subvolumes because users can create snapshot
for subvolumes, thus sharing tree blocks of them.
But data reloc tree is not exposed to user space, as it's only an
internal tree for data relocation, thus it doesn't need the full path
replacement handling at all.
This patch will make data reloc tree a non-shareable tree, and add
btrfs_fs_info::data_reloc_root for data reloc tree, so relocation code
can grab it from fs_info directly.
This would slightly improve tree relocation, as now data reloc tree
can go through regular COW routine to get relocated, without bothering
the complex tree reloc tree routine.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a lot of root owner checks in btrfs_truncate_inode_items()
like:
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) ||
root == fs_info->tree_root)
But considering that, only these trees can have INODE_ITEMs:
- tree root (for v1 space cache)
- subvolume trees
- tree reloc trees
- data reloc tree
- log trees
And since subvolume/tree reloc/data reloc trees all have SHAREABLE bit,
and we're checking tree root manually, so above check is just excluding
log trees.
This patch will replace two of such checks to a simpler one:
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
This would merge btrfs_drop_extent_cache() and lock_extent_bits() call
into the same if branch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
In fact, that bit can only be set to those trees:
- Subvolume roots
- Data reloc root
- Reloc roots for above roots
All other trees won't get this bit set. So just by the result, it is
obvious that, roots with this bit set can have tree blocks shared with
other trees. Either shared by snapshots, or by reloc roots (an special
snapshot created by relocation).
This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
make it easier to understand, and update all comment mentioning
"reference counted" to follow the rename.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit dccdb07bc9 ("btrfs: kill btrfs_fs_info::volume_mutex") removed
the last use of the volume_mutex, forgetting to update the comment.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fallback path calls helper write_extent_buffer to do write of the
data spanning two extent buffer pages. As the size is known, we can do
the write directly in two steps. This removes one function call and
compiler can optimize memcpy as the sizes are known at compile time. The
cached token address is set to the second page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper write_extent_buffer is called to do write of the data
spanning two extent buffer pages. As the size is known, we can do the
write directly in two steps. This removes one function call and
compiler can optimize memcpy as the sizes are known at compile time.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fallback path calls helper read_extent_buffer to do read of the data
spanning two extent buffer pages. As the size is known, we can do the
read directly in two steps. This removes one function call and compiler
can optimize memcpy as the sizes are known at compile time. The cached
token address is set to the second page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper read_extent_buffer is called to do read of the data spanning
two extent buffer pages. As the size is known, we can do the read
directly in two steps. This removes one function call and compiler can
optimize memcpy as the sizes are known at compile time.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Helpers that iterate over extent buffer pages set up several variables,
one of them is finding out offset of the extent buffer start within a
page. Right now we have extent buffers aligned to page sizes so this is
effectively storing zero. This makes the code harder the follow and can
be simplified.
The same change is done in all the helpers:
* remove: size_t start_offset = offset_in_page(eb->start);
* simplify code using start_offset
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many helpers around extent buffers, found in extent_io.h and
ctree.h. Most of them can be converted to take constified eb as there
are no changes to the extent buffer structure itself but rather the
pages.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All uses of map_private_extent_buffer have been replaced by more
effective way. The set/get helpers have their own bounds checker.
The function name was confusing since the non-private helper was removed
in a65917156e ("Btrfs: stop using highmem for extent_buffers") many
years ago.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bin search jumps over the extent buffer item keys, comparing
directly the bytes if the key is in one page, or storing it in a
temporary buffer in case it spans two pages.
The mapping start and length are obtained from map_private_extent_buffer,
which is heavy weight compared to what we need. We know the key size and
can find out the eb page in a simple way. For keys spanning two pages
the fallback read_extent_buffer is used.
The temporary variables are reduced and moved to the scope of use.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The set/get token helpers either use the cached address in the token or
unconditionally call map_private_extent_buffer to get the address of
page containing the requested offset plus the mapping start and length.
Depending on the return value, the fast path uses unaligned put to write
data within a page, or fall back to write_extent_buffer that can handle
writes spanning more pages.
This is all wasteful. We know the number of bytes to write, 1/2/4/8 and
can find out the page. Then simply check if it's contained or the
fallback is needed. The token address is updated to the page, or the on
the next index, expecting that the next write will use that.
This saves one function call to map_private_extent_buffer and several
unnecessary temporary variables.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers unconditionally call map_private_extent_buffer to get the
address of page containing the requested offset plus the mapping start
and length. Depending on the return value, the fast path uses unaligned
put to write data within a page, or fall back to write_extent_buffer
that can handle writes spanning more pages.
This is all wasteful. We know the number of bytes to write, 1/2/4/8 and
can find out the page. Then simply check if it's contained or the
fallback is needed.
This saves one function call to map_private_extent_buffer and several
unnecessary temporary variables.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The set/get token helpers either use the cached address in the token or
unconditionally call map_private_extent_buffer to get the address of
page containing the requested offset plus the mapping start and length.
Depending on the return value, the fast path uses unaligned read to get
data within a page, or fall back to read_extent_buffer that can handle
reads spanning more pages.
This is all wasteful. We know the number of bytes to read, 1/2/4/8 and
can find out the page. Then simply check if it's contained or the
fallback is needed. The token address is updated to the page, or the on
the next index, expecting that the next read will use that.
This saves one function call to map_private_extent_buffer and several
unnecessary temporary variables.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers unconditionally call map_private_extent_buffer to get the
address of page containing the requested offset plus the mapping start
and length. Depending on the return value, the fast path uses unaligned
read to get data within a page, or fall back to read_extent_buffer that
can handle reads spanning more pages.
This is all wasteful. We know the number of bytes to read, 1/2/4/8 and
can find out the page. Then simply check if it's contained or the
fallback is needed.
This saves one function call to map_private_extent_buffer and several
unnecessary temporary variables.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bounds checking is now done in map_private_extent_buffer but that
will be removed in following patches and some sanity checks should still
be done.
There are two separate checks to see the kind of out of bounds access:
partial (start offset is in the buffer) or complete (both start and end
are out).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All the set/get helpers first check if the token contains a cached
address. After first use the address is always valid, but the extra
check is done for each call.
The token initialization can optimistically set it to the first extent
buffer page, that we know always exists. Then the condition in all
btrfs_token_*/btrfs_set_token_* can be simplified by removing the
address check from the condition, but for development the assertion
still makes sure it's valid.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The token is supposed to cache the last page used by the set/get
helpers. In leaf_space_used the first and last items are accessed, it's
not likely they'd be on the same page so there's some overhead caused
updating the token address but not using it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The set/get token is supposed to cache the last page that was accessed
so it speeds up subsequential access to the eb. It does not make sense
to use that for just one change, which is the case of inode size in
overwrite_item.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all set/get helpers use the eb from the token, we don't need to
pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some
stack space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The token stores a copy of the extent buffer pointer but does not make
any use of it besides sanity checks. We can use it and drop the eb
parameter from several functions, this patch only switches the use
inside the set/get helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
disk-io.h is included more than once in block-group.c, remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: David Sterba <dsterba@suse.com>
The name of this function contains the word "cache", which is left from
the times where btrfs_block_group was called btrfs_block_group_cache.
Now this "cache" doesn't match anything, and we have better namings for
functions like read/insert/remove_block_group_item().
Rename it to update_block_group_item().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the block group item insert is pretty straight forward, fill
the block group item structure and insert it into extent tree.
However the incoming skinny block group feature is going to change this,
so this patch will refactor insertion into a new function,
insert_block_group_item(), to make the incoming feature easier to add.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When deleting a block group item, it's pretty straight forward, just
delete the item pointed by the key. However it will not be that
straight-forward for incoming skinny block group item.
So refactor the block group item deletion into a new function,
remove_block_group_item(), also to make the already lengthy
btrfs_remove_block_group() a little shorter.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Structure btrfs_block_group has the following members which are
currently read from on-disk block group item and key:
- length - from item key
- used
- flags - from block group item
However for incoming skinny block group tree, we are going to read those
members from different sources.
This patch will refactor such read by:
- Don't initialize btrfs_block_group::length at allocation
Caller should initialize them manually.
Also to avoid possible (well, only two callers) missing
initialization, add extra ASSERT() in btrfs_add_block_group_cache().
- Refactor length/used/flags initialization into one function
The new function, fill_one_block_group() will handle the
initialization of such members.
- Use btrfs_block_group::length to replace key::offset
Since skinny block group item would have a different meaning for its
key offset.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Regular block group items in extent tree are scattered inside the huge
tree, thus forward readahead makes no sense.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever a chown is executed, all capabilities of the file being touched
are lost. When doing incremental send with a file with capabilities,
there is a situation where the capability can be lost on the receiving
side. The sequence of actions bellow shows the problem:
$ mount /dev/sda fs1
$ mount /dev/sdb fs2
$ touch fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_init
$ btrfs send fs1/snap_init | btrfs receive fs2
$ chgrp adm fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_complete
$ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
$ btrfs send fs1/snap_complete | btrfs receive fs2
$ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
At this point, only a chown was emitted by "btrfs send" since only the
group was changed. This makes the cap_sys_nice capability to be dropped
from fs2/snap_incremental/foo.bar
To fix that, only emit capabilities after chown is emitted. The current
code first checks for xattrs that are new/changed, emits them, and later
emit the chown. Now, __process_new_xattr skips capabilities, letting
only finish_inode_if_needed to emit them, if they exist, for the inode
being processed.
This behavior was being worked around in "btrfs receive" side by caching
the capability and only applying it after chown. Now, xattrs are only
emmited _after_ chown, making that workaround not needed anymore.
Link: https://github.com/kdave/btrfs-progs/issues/202
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When scrubbing a stripe, whenever we find an extent we lookup for its
checksums in the checksum tree. However we do it even for metadata extents
which don't have checksum items stored in the checksum tree, that is
only for data extents.
So make the lookup for checksums only if we are processing with a data
extent.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
used to be named btrfs_get_block_group_trimming() and
btrfs_put_block_group_trimming() respectively.
At the time they were added to free-space-cache.c, by commit e33e17ee10
("btrfs: add missing discards when unpinning extents with -o discard")
because all the trimming related functions were in free-space-cache.c.
Now that the helpers were renamed and are used in scrub context as well,
move them to block-group.c, a much more logical location for them.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Back in 2014, commit 04216820fe ("Btrfs: fix race between fs trimming
and block group remove/allocation"), I added the 'trimming' member to the
block group structure. Its purpose was to prevent races between trimming
and block group deletion/allocation by pinning the block group in a way
that prevents its logical address and device extents from being reused
while trimming is in progress for a block group, so that if another task
deletes the block group and then another task allocates a new block group
that gets the same logical address and device extents while the trimming
task is still in progress.
After the previous fix for scrub (patch "btrfs: fix a race between scrub
and block group removal/allocation"), scrub now also has the same needs that
trimming has, so the member name 'trimming' no longer makes sense.
Since there is already a 'pinned' member in the block group that refers
to space reservations (pinned bytes), rename the member to 'frozen',
add a comment on top of it to describe its general purpose and rename
the helpers to increment and decrement the counter as well, to match
the new member name.
The next patch in the series will move the helpers into a more suitable
file (from free-space-cache.c to block-group.c).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent references v0 have been superseded long time go, there are
some unused declarations of access helpers. We can safely remove them
now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct
btrfs_extent_item_v0 is still part of a backward compatibility check in
relocation.c and thus not removed.
Signed-off-by: David Sterba <dsterba@suse.com>
There's no callers in-tree anymore since
commit d24ee97b96 ("btrfs: use new helpers to set uuids in eb")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For the following operation, qgroup is guaranteed to be screwed up due
to snapshot adding to a new qgroup:
# mkfs.btrfs -f $dev
# mount $dev $mnt
# btrfs qgroup en $mnt
# btrfs subv create $mnt/src
# xfs_io -f -c "pwrite 0 1m" $mnt/src/file
# sync
# btrfs qgroup create 1/0 $mnt/src
# btrfs subv snapshot -i 1/0 $mnt/src $mnt/snapshot
# btrfs qgroup show -prce $mnt/src
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16.00KiB 16.00KiB none none --- ---
0/257 1.02MiB 16.00KiB none none --- ---
0/258 1.02MiB 16.00KiB none none 1/0 ---
1/0 0.00B 0.00B none none --- 0/258
^^^^^^^^^^^^^^^^^^^^
[CAUSE]
The problem is in btrfs_qgroup_inherit(), we don't have good enough
check to determine if the new relation would break the existing
accounting.
Unlike btrfs_add_qgroup_relation(), which has proper check to determine
if we can do quick update without a rescan, in btrfs_qgroup_inherit() we
can even assign a snapshot to multiple qgroups.
[FIX]
Fix it by manually marking qgroup inconsistent for snapshot inheritance.
For subvolume creation, since all its extents are exclusively owned, we
don't need to rescan.
In theory, we should call relation check like quick_update_accounting()
when doing qgroup inheritance and inform user about qgroup accounting
inconsistency.
But we don't have good mechanism to relay that back to the user in the
snapshot creation context, thus we can only silently mark the qgroup
inconsistent.
Anyway, user shouldn't use qgroup inheritance during snapshot creation,
and should add qgroup relationship after snapshot creation by 'btrfs
qgroup assign', which has a much better UI to inform user about qgroup
inconsistent and kick in rescan automatically.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When mounting, we handle deleted subvolume and orphan items. First,
find add orphan roots, then add them to fs_root radix tree. Second, in
tree-root, process each orphan item, skip if it is dead root.
The original algorithm is based on the list of dead_roots, one by one to
visit and check whether the objectid is consistent, the time complexity
is O (n ^ 2). When processing 50000 deleted subvols, it takes about
120s.
Because btrfs_find_orphan_roots has already ran before us, and added
deleted subvol to fs_roots radix tree.
The fs root will only be removed from the fs_roots radix tree after the
cleaner process is started, and the cleaner will only start execution
after the mount is complete.
btrfs_orphan_cleanup can be called during the whole filesystem mount
lifetime, but only "tree root" will be used in this section of code, and
only mount time will be brought into tree root.
So we can quickly check whether the orphan item is dead root through the
fs_roots radix tree.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I've grepped logs for 'errno=.*unknown' and found -95, -117 and -122,
now added to the table. The wording is adjusted so it makes sense in
context of filesystem.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When an old device has new fsid through 'btrfs device add -f <dev>' our
fs_devices list has an alien device in one of the fs_devices lists.
By having an alien device in fs_devices, we have two issues so far
1. missing device does not not show as missing in the userland
2. degraded mount will fail
Both issues are caused by the fact that there's an alien device in the
fs_devices list. (Alien means that it does not belong to the filesystem,
identified by fsid, or does not contain btrfs filesystem at all, eg. due
to overwrite).
A device can be scanned/added through the control device ioctls
SCAN_DEV, DEVICES_READY or by ADD_DEV.
And device coming through the control device is checked against the all
other devices in the lists, but this was not the case for ADD_DEV.
This patch fixes both issues above by removing the alien device.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
the bdev with greatest device::generation number. For a typical-missing
device the generation number is zero so fs_devices::latest_bdev will
never point to it.
But if the missing device is due to alienation [1], then
device::generation is not zero and if it is greater or equal to the rest
of device generations in the list, then fs_devices::latest_bdev ends up
pointing to the missing device and reports the error like [2].
[1] We maintain devices of a fsid (as in fs_device::fsid) in the
fs_devices::devices list, a device is considered as an alien device
if its fsid does not match with the fs_device::fsid
Consider a working filesystem with raid1:
$ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
$ mount /dev/sda /mnt-raid1
$ umount /mnt-raid1
While mnt-raid1 was unmounted the user force-adds one of its devices to
another btrfs filesystem:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt-single
$ btrfs dev add -f /dev/sda /mnt-single
Now the original mnt-raid1 fails to mount in degraded mode, because
fs_devices::latest_bdev is pointing to the alien device.
$ mount -o degraded /dev/sdb /mnt-raid1
[2]
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
kernel: BTRFS error (device sdb): failed to read devices
kernel: BTRFS error (device sdb): open_ctree failed
Fix the root cause by checking if the device is not missing before it
can be considered for the fs_devices::latest_bdev.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use crypto_shash_digest() instead of crypto_shash_init() +
crypto_shash_update() + crypto_shash_final(). This is more efficient.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need of goto out in open_fs_devices() as there is nothing
special done there.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_log_prealloc_extents() we are checking if copy_items() returns a
value greater than 0. That used to happen in the past to signal the caller
that the path given to it was released and reused for other searches, but
as of commit 0e56315ca1 ("Btrfs: fix missing hole after hole punching
and fsync when using NO_HOLES"), the copy_items() function does not have
that behaviour anymore and always returns 0 or a negative value. So just
remove that check at btrfs_log_prealloc_extents(), which the previously
mentioned commit forgot to remove.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, direct I/O has its own versions of bio_readpage_error() and
btrfs_check_repairable() (dio_read_error() and
btrfs_check_dio_repairable(), respectively). The main difference is that
the direct I/O version doesn't do read validation. The rework of direct
I/O repair makes it possible to do validation, so we can get rid of
btrfs_check_dio_repairable() and combine bio_readpage_error() and
dio_read_error() into a new helper, btrfs_submit_read_repair().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>