Currently the code assumes that there's an implied barrier by the
sequence of code preceding the wakeup, namely the mutex unlock.
As Nikolay pointed out:
I think this is wrong (not your code) but the original assumption that
the RELEASE semantics provided by mutex_unlock is sufficient.
According to memory-barriers.txt:
Section 'LOCK ACQUISITION FUNCTIONS' states:
(2) RELEASE operation implication:
Memory operations issued before the RELEASE will be completed before the
RELEASE operation has completed.
Memory operations issued after the RELEASE *may* be completed before the
RELEASE operation has completed.
(I've bolded the may portion)
The example given there:
As an example, consider the following:
*A = a;
*B = b;
ACQUIRE
*C = c;
*D = d;
RELEASE
*E = e;
*F = f;
The following sequence of events is acceptable:
ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
So if we assume that *C is modifying the flag which the waitqueue is checking,
and *E is the actual wakeup, then those accesses can be re-ordered...
IMHO this code should be considered broken...
---
To be on the safe side, add the barriers. The synchronization logic
around log using the mutexes and several other threads does not make it
easy to reason for/against the barrier.
CC: Nikolay Borisov <nborisov@suse.com>
Link: https://lkml.kernel.org/r/6ee068d8-1a69-3728-00d1-d86293d43c9f@suse.com
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add convenience wrappers for the waitqueue management that involves
memory barriers to prevent deadlocks. The helpers will let us remove
barriers and the necessary comments in several places.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Under the following case, qgroup rescan can double account cowed tree
blocks:
In this case, extent tree only has one tree block.
-
| transid=5 last committed=4
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| | transid = 5
| |- qgroup_rescan_leaf()
| |- btrfs_search_slot_for_read() on extent tree
| Get the only extent tree block from commit root (transid = 4).
| Scan it, set qgroup_rescan_progress to the last
| EXTENT/META_ITEM + 1
| now qgroup_rescan_progress = A + 1.
|
| fs tree get CoWed, new tree block is at A + 16K
| transid 5 get committed
-
| transid=6 last committed=5
| btrfs_qgroup_rescan_worker()
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| | transid = 5
| |- qgroup_rescan_leaf()
| |- btrfs_search_slot_for_read() on extent tree
| Get the only extent tree block from commit root (transid = 5).
| scan it using qgroup_rescan_progress (A + 1).
| found new tree block beyong A, and it's fs tree block,
| account it to increase qgroup numbers.
-
In above case, tree block A, and tree block A + 16K get accounted twice,
while qgroup rescan should stop when it already reach the last leaf,
other than continue using its qgroup_rescan_progress.
Such case could happen by just looping btrfs/017 and with some
possibility it can hit such double qgroup accounting problem.
Fix it by checking the path to determine if we should finish qgroup
rescan, other than relying on next loop to exit.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing qgroup rescan using the following script (modified from
btrfs/017 test case), we can sometimes hit qgroup corruption.
------
umount $dev &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f -n 64k $dev
mount $dev $mnt
extent_size=8192
xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
btrfs subvolume snapshot $mnt $mnt/snap
xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/unll
btrfs quota enable $mnt
# -W is the new option to only wait rescan while not starting new one
btrfs quota rescan -W $mnt
btrfs qgroup show -prce $mnt
umount $mnt
# Need to patch btrfs-progs to report qgroup mismatch as error
btrfs check $dev || _fail
------
For fast machine, we can hit some corruption which missed accounting
tree blocks:
------
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 8.00KiB 0.00B none none --- ---
0/257 8.00KiB 0.00B none none --- ---
------
This is due to the fact that we're always searching commit root for
btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
from current transaction, not commit root.
And if our tree blocks get modified in current transaction, we won't
find any owner in commit root, thus causing the corruption.
Fix it by searching commit root for extent tree for
qgroup_rescan_leaf().
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[spotted while going through ->d_fsdata handling around d_splice_alias();
don't really care which tree that goes through]
The only thing even looking at ->d_fsdata in there (since 2012)
had been kfree(dentry->d_fsdata) in btrfs_dentry_delete(). Which,
incidentally, is all btrfs_dentry_delete() does.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are already 2 reports about strangely corrupted super blocks,
where csum still matches but extra garbage gets slipped into super block.
The corruption would looks like:
------
superblock: bytenr=65536, device=/dev/sdc1
---------------------------------------------------------
csum_type 41700 (INVALID)
csum 0x3b252d3a [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x5b22400000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x5b22400000000000 )
...
------
Or
------
superblock: bytenr=65536, device=/dev/mapper/x
---------------------------------------------------------
csum_type 35355 (INVALID)
csum_size 32
csum 0xf0dbeddd [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x176d200000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x176d200000000000 )
------
Obviously, csum_type and incompat_flags get some garbage, but its csum
still matches, which means kernel calculates the csum based on corrupted
super block memory.
And after manually fixing these values, the filesystem is completely
healthy without any problem exposed by btrfs check.
Although the cause is still unknown, at least detect it and prevent further
corruption.
Both reports have same symptoms, there's an overwrite on offset 192 of
the superblock, by 4 bytes. The superblock structure is not allocated or
freed and stays in the memory for the whole filesystem lifetime, so it's
not a use-after-free kind of error on someone else's leaked page.
As a vague point for the problable cause is mentioning of other system
freezing related to graphic card drivers.
Reported-by: Ken Swenson <flat@imo.uto.moe>
Reported-by: Ben Parsons <9parsonsb@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add brief analysis of the reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor btrfs_check_super_valid:
1) Rename it to btrfs_validate_mount_super()
Now it's more obvious when the function should be called.
2) Extract core check routine into validate_super()
Later write time check can reuse it, and if needed, we could also
use validate_super() to check each super block.
3) Add more comments about btrfs_validate_mount_super()
Mostly about what it doesn't check and when it should be called.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to validate_super ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move btrfs_check_super_valid() before its single caller to avoid forward
declaration.
Though such code motion is not recommended as it pollutes git history,
in this case the following patches would need to add new forward
declarations for static functions that we want to avoid.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which already contains a
reference to the fs_info. So use it and remove the extra function
argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function alreay takes a transaction handle which holds a reference
to the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which holds a reference to
fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which already has a reference
to the fs_info. Use it and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which references the
fs_info structure. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction which has a reference to the
fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which has a reference
to the fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to fs_info. So use that and kill the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which contains a
reference to fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a trans handle which contains a reference to
the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function also takes a btrfs_block_group_cache which contains a
referene to the fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes trans handle from where fs_info can be
referenced. Remove the redundant parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which contains a
reference to fs_info.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which has a reference
to the fs_info.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We also pass in a transaction handle which has a reference to the
fs_info. Just remove the extraneous argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will be necessary for future cleanups which remove the fs_info
argument from some freespace tree functions.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The invariant is that when nr_delalloc_inodes is 0 then the root
mustn't have any inodes on its delalloc inodes list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when checking if a directory can be deleted, we always check
if all its children have been processed.
Example: A directory with 2,000,000 files was deleted
original: 1994m57.071s
patch: 1m38.554s
[FIX]
Instead of checking all children on all calls to can_rmdir(), we keep
track of the directory index offset of the child last checked in the
last call to can_rmdir(), and then use it as the starting point for
future calls to can_rmdir().
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move the allocation after the search when it's clear that the new entry
will be added.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
add_delayed_ref_head really performed 2 independent operations -
initialisting the ref head and adding it to a list. Now that the init
part is in a separate function let's complete the separation between
both operations. This results in a lot simpler interface for
add_delayed_ref_head since the function now deals solely with either
adding the newly initialised delayed ref head or merging it into an
existing delayed ref head. This results in vastly simplified function
signature since 5 arguments are dropped. The only other thing worth
mentioning is that due to this split the WARN_ON catching reinit of
existing. In this patch the condition is extended such that:
qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved
is added. This is done because the two qgroup_* prefixed member are
set only if both ref_root and reserved are passed. So functionally
it's equivalent to the old WARN_ON and allows to remove the two args
from add_delayed_ref_head.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced function when initialising the head_ref in
add_delayed_ref_head. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_delayed_ref_head implements the logic to both initialize a head_ref
structure as well as perform the necessary operations to add it to the
delayed ref machinery. This has resulted in a very cumebrsome interface
with loads of parameters and code, which at first glance, looks very
unwieldy. Begin untangling it by first extracting the initialization
only code in its own function. It's more or less verbatim copy of the
first part of add_delayed_ref_head.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the initialization part and the critical section code have been
split it's a lot easier to open code add_delayed_data_ref. Do so in the
following manner:
1. The common init function is put immediately after memory-to-be-initialized
is allocated, followed by the specific data ref initialization.
2. The only piece of code that remains in the critical section is
insert_delayed_ref call.
3. Tracing and memory freeing code is moved outside of the critical
section.
No functional changes, just an overall shorter critical section.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the initialization part and the critical section code have been
split it's a lot easier to open code add_delayed_tree_ref. Do so in the
following manner:
1. The comming init code is put immediately after memory-to-be-initialized
is allocated, followed by the ref-specific member initialization.
2. The only piece of code that remains in the critical section is
insert_delayed_ref call.
3. Tracing and memory freeing code is put outside of the critical
section as well.
The only real change here is an overall shorter critical section when
dealing with delayed tree refs. From functional point of view - the code
is unchanged.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced helper and remove the duplicate code. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced common helper. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
THe majority of the init code for struct btrfs_delayed_ref_node is
duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out
the common bits in init_delayed_ref_common. This function is going to be
used in future patches to clean that up. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not good to overwrite -ENOMEM using -EINVAL when failing from mount
option parsing, so just return original error code.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info is always available from the context so we don't need to
store it in the structure.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging quota rescan race, some times btrfs rescan could account
some old (committed) leaf and then re-account newly committed leaf
in next generation.
This race needs extra transid to locate, so add @transid for
trace_btrfs_qgroup_account_extent() for such debug.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Trivial fix to spelling mistake of function name in btrfs_err message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is used in only one place and devid argument is always
passed 0. So just remove it, similarly to how it was removed in the
userspace code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Origin trace_qgroup_update_counters() only records qgroup id and its
reference count change.
It's good enough to debug qgroup accounting change, but when rescan race
is involved, it's pretty hard to distinguish which modification belongs
to which rescan.
So add old_rfer and old_excl trace output to help distinguishing
different rescan instance.
(Different rescan instance should reset its qgroup->rfer to 0)
For trace event parameter, it just changes from u64 qgroup_id to struct
btrfs_qgroup *qgroup, so number of parameters is not changed at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only in inode.c so makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only this file.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When allocating a delalloc work we are always setting the delayed_iput
to 0. So remove the delay_iput member of btrfs_delalloc_work, as a
result also remove it as a parameter from btrfs_alloc_delalloc_work
since it's not used anymore.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 so remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ rename to start_delalloc_inodes ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0, so just remove it and collapse the constant value
to the only function we are passing it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This parameter was introduced alongside the function in
eb73c1b7ce ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95da ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_shrink_device, before btrfs_search_slot, path->reada is set to
READA_FORWARD. But I think READA_BACK is correct.
Since:
1. key.offset is set to (u64)-1
2. after btrfs_search_slot, btrfs_previous_item is called
So, for readahead previous items, READA_BACK is the correct one.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will add the following trace events:
1) btrfs_remove_block_group
For btrfs_remove_block_group() function.
Triggered when a block group is really removed.
2) btrfs_add_unused_block_group
Triggered which block group is added to unused_bgs list.
3) btrfs_skip_unused_block_group
Triggered which unused block group is not deleted.
These trace events is pretty handy to debug case related to block group
auto remove.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info can be extracted from btrfs_block_group_cache, and all
btrfs_block_group_cache is created by btrfs_create_block_group_cache()
with fs_info initialized, no need to worry about NULL pointer
dereference.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>