The actual snapshot creation is delayed until transaction commit. If we
cannot get enough memory for the root item there, we have to fail the
whole transaction commit which is bad. So we'll allocate the memory at
the ioctl call and pass it along with the pending_snapshot struct. The
potential ENOMEM will be returned to the caller of snapshot ioctl.
Signed-off-by: David Sterba <dsterba@suse.com>
The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:
* overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
* better packing in a slab page +6 objects
* the whole structure now fits to 2 cachelines
* slight decrease in code size:
text data bss dec hex filename
938731 43670 23144 1005545 f57e9 fs/btrfs/btrfs.ko.before
938203 43670 23144 1005017 f55d9 fs/btrfs/btrfs.ko.after
(and the generated assembly does not change much)
The main purpose is to decrease the size of the structure without
affecting performance. The byte access is usually well behaving accross
arches, the locks are not accessed frequently and sometimes just
compared to zero.
Note for further size reduction attempts: the slots could be made u16
but this might generate worse code on some arches (non-byte and non-int
access). Also the range of operations on slots is wider compared to
locks and the potential performance drop should be evaluated first.
Signed-off-by: David Sterba <dsterba@suse.com>
The level is 0..7, we can use smaller type. The size of btrfs_path is now
136 bytes from 144, which is +2 objects that fit into a 4k slab.
Signed-off-by: David Sterba <dsterba@suse.com>
The possible values for reada are all positive and bounded, we can later
save some bytes by storing it in u8.
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the integers by enums for better readability. The value 2 does
not have any meaning since a717531942
"Btrfs: do less aggressive btree readahead" (2009-01-22).
Signed-off-by: David Sterba <dsterba@suse.com>
There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work for making btrfs_free_space_op constant. In
test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
own version thus preventing constification. We can rework it so we
replace the whole structure with the correct function pointers.
Signed-off-by: David Sterba <dsterba@suse.com>
Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.
The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.
The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.
Signed-off-by: David Sterba <dsterba@suse.com>
We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.
A crafted or corrupted image crashes at mount time:
BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
Stack:
637af890 60062489 602aeb2e 604192ba
60387961 00000011 637af8a0 6038a835
637af9c0 6038776b 634ef32b 00000000
Call Trace:
[<6001c86d>] show_stack+0xfe/0x15b
[<6038a835>] dump_stack+0x2a/0x2c
[<6038776b>] panic+0x13e/0x2b3
[<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
[<601cfbbe>] open_ctree+0x192d/0x27af
[<6019c2c1>] btrfs_mount+0x8f5/0xb9a
[<600bc9a7>] mount_fs+0x11/0xf3
[<600d5167>] vfs_kern_mount+0x75/0x11a
[<6019bcb0>] btrfs_mount+0x2e4/0xb9a
[<600bc9a7>] mount_fs+0x11/0xf3
[<600d5167>] vfs_kern_mount+0x75/0x11a
[<600d710b>] do_mount+0xa35/0xbc9
[<600d7557>] SyS_mount+0x95/0xc8
[<6001e884>] handle_syscall+0x6b/0x8e
Reported-by: Jiri Slaby <jslaby@suse.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
and has 8 unused bytes. Reducing the level type to u8 makes it possible
to squeeze it to the padding byte after key. The bitfields were switched
to bool as there's space to store the full byte without increasing the
whole structure, besides that the generated assembly is smaller.
struct btrfs_delayed_extent_op {
struct btrfs_disk_key key; /* 0 17 */
u8 level; /* 17 1 */
bool update_key; /* 18 1 */
bool update_flags; /* 19 1 */
bool is_data; /* 20 1 */
/* XXX 3 bytes hole, try to pack */
u64 flags_to_set; /* 24 8 */
/* size: 32, cachelines: 1, members: 6 */
/* sum members: 29, holes: 1, sum holes: 3 */
/* last cacheline: 32 bytes */
};
The final size is 32 bytes which gives +26 object per slab page.
text data bss dec hex filename
938811 43670 23144 1005625 f5839 fs/btrfs/btrfs.ko.before
938747 43670 23144 1005561 f57f9 fs/btrfs/btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.com>
Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.
The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lost track of the count if the
inode is put into the delayed iputs again while it's processed.
Signed-off-by: David Sterba <dsterba@suse.com>
Since we will add support for -d dup for non-mixed filesystem,
kernel need to support converting to this raid-type.
This patch remove limitation of above case.
Tested by following script:
(combination of dup conversion with fsck):
export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'
do_dup_test()
{
local m_from="$1"
local d_from="$2"
local m_to="$3"
local d_to="$4"
echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
umount "$TEST_DIR" &>/dev/null
./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
mount "$TEST_DEV" "$TEST_DIR" || return 1
cp -a /sbin/* "$TEST_DIR"
[[ "$m_from" != "$m_to" ]] && {
./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
}
[[ "$d_from" != "$d_to" ]] && {
local opt=()
[[ "$d_to" == single ]] && opt+=("-f")
./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
}
umount "$TEST_DIR" || return 1
./btrfsck "$TEST_DEV" || return 1
echo
return 0
}
test_all()
{
for m_from in single dup; do
for d_from in single dup; do
for m_to in single dup; do
for d_to in single dup; do
do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
done
done
done
done
}
test_all
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit this panic on a few of our boxes this week where we have an
ordered_extent with an NULL inode. We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages. If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic. Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from Chris Mason:
"A couple of small fixes"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: check prepare_uptodate_page() error code earlier
Btrfs: check for empty bitmap list in setup_cluster_bitmaps
btrfs: fix misleading warning when space cache failed to load
Btrfs: fix transaction handle leak in balance
Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
prepare_pages() may end up calling prepare_uptodate_page() twice if our
write only spans a single page. But if the first call returns an error,
our page will be unlocked and its not safe to call it again.
This bug goes all the way back to 2011, and it's not something commonly
hit.
While we're here, add a more explicit check for the page being truncated
away. The bare lock_page() alone is protected only by good thoughts and
i_mutex, which we're sure to regret eventually.
Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Dave Jones found a warning from kasan in setup_cluster_bitmaps()
==================================================================
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr ffff88039bef6828
Read of size 8 by task nfsd/1009
page:ffffea000e6fbd80 count:0 mapcount:0 mapping: (null)
index:0x0
flags: 0x8000000000000000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: G W
4.4.0-rc3-backup-debug+ #1
ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
Call Trace:
[<ffffffffa680a43e>] dump_stack+0x4b/0x6d
[<ffffffffa62638d1>] kasan_report_error+0x501/0x520
[<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
[<ffffffffa6263948>] kasan_report+0x58/0x60
[<ffffffffa6814b00>] ? rb_last+0x10/0x40
[<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
[<ffffffffa6262ead>] __asan_load8+0x5d/0x70
[<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
[<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
[<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
[<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
[<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
[<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
[<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520
Andrey noticed this was because we were doing list_first_entry on a list
that might be empty. Rework the tests a bit so we don't do that.
Signed-off-by: Chris Mason <clm@fb.com>
Reprorted-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: Dave Jones <dsj@fb.com>
When an inconsistent space cache is detected during loading we log a
warning that users frequently mistake as instruction to invalidate the
cache manually, even though this is not required. Fix the message to
indicate that the cache will be rebuilt automatically.
Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Acked-by: Filipe Manana <fdmanana@suse.com>
If we fail to allocate a new data chunk, we were jumping to the error path
without release the transaction handle we got before. Fix this by always
releasing it before doing the jump.
Fixes: 2c9fe83552 ("btrfs: Fix lost-data-profile caused by balance bg")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
As of my previous change titled "Btrfs: fix scrub preventing unused block
groups from being deleted", the following warning at
extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
filesysten with "-o discard":
10263 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
10264 {
(...)
10405 if (trimming) {
10406 WARN_ON(!list_empty(&block_group->bg_list));
10407 spin_lock(&trans->transaction->deleted_bgs_lock);
10408 list_move(&block_group->bg_list,
10409 &trans->transaction->deleted_bgs);
10410 spin_unlock(&trans->transaction->deleted_bgs_lock);
10411 btrfs_get_block_group(block_group);
10412 }
(...)
This happens because scrub can now add back the block group to the list of
unused block groups (fs_info->unused_bgs). This is dangerous because we
are moving the block group from the unused block groups list to the list
of deleted block groups without holding the lock that protects the source
list (fs_info->unused_bgs_lock).
The following diagram illustrates how this happens:
CPU 1 CPU 2
cleaner_kthread()
btrfs_delete_unused_bgs()
sees bg X in list
fs_info->unused_bgs
deletes bg X from list
fs_info->unused_bgs
scrub_enumerate_chunks()
searches device tree using
its commit root
finds device extent for
block group X
gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())
sets bg X to RO (again)
scrub_chunk(bg X)
sets bg X back to RW mode
adds bg X to the list
fs_info->unused_bgs again,
since it's still unused and
currently not in that list
sets bg X to RO mode
btrfs_remove_chunk(bg X)
--> discard is enabled and bg X
is in the fs_info->unused_bgs
list again so the warning is
triggered
--> we move it from that list into
the transaction's delete_bgs
list, but we can have another
task currently manipulating
the first list (fs_info->unused_bgs)
Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
the list of unused block groups and the list of deleted block groups. This
makes it safe and there's not much worry for more lock contention, as this
lock is seldom used and only the cleaner kthread adds elements to the list
of deleted block groups. The warning goes away too, as this was previously
an impossible case (and would have been better a BUG_ON/ASSERT) but it's
not impossible anymore.
Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Pull btrfs fixes from Chris Mason:
"This has Mark Fasheh's patches to fix quota accounting during subvol
deletion, which we've been working on for a while now. The patch is
pretty small but it's a key fix.
Otherwise it's a random assortment"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix balance range usage filters in 4.4-rc
btrfs: qgroup: account shared subtree during snapshot delete
Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
btrfs: qgroup: fix quota disable during rescan
Btrfs: fix race between cleaner kthread and space cache writeout
Btrfs: fix scrub preventing unused block groups from being deleted
Btrfs: fix race between scrub and block group deletion
btrfs: fix rcu warning during device replace
btrfs: Continue replace when set_block_ro failed
btrfs: fix clashing number of the enhanced balance usage filter
Btrfs: fix the number of transaction units needed to remove a block group
Btrfs: use global reserve when deleting unused block group after ENOSPC
Btrfs: tests: checking for NULL instead of IS_ERR()
btrfs: fix signed overflows in btrfs_sync_file
There's a regression in 4.4-rc since commit bc3094673f
(btrfs: extend balance filter usage to take minimum and maximum) in that
existing (non-ranged) balance with -dusage=x no longer works; all chunks
are skipped.
After staring at the code for a while and wondering why a non-ranged
balance would even need min and max thresholds (..which then were not
set correctly, leading to the bug) I realized that the only problem
was the fact that the filter functions were named wrong, thanks to
patching copypasta. Simply renaming both functions lets the existing
btrfs-progs call balance with -dusage=x and now the non-ranged filter
function is invoked, properly using only a single chunk limit.
Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Fixes: bc3094673f ("btrfs: extend balance filter usage to take minimum and maximum")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.
Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in 1152651 (btrfs: qgroup:
account shared subtrees during snapshot delete).
The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.
The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
The backref code will look up the fs_root we're trying to resolve our indirect
refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT
if the ref is 0. This isn't helpful for the qgroup stuff with snapshot delete
as it won't be able to search down the snapshot we are deleting, which will
cause us to miss roots. So use btrfs_get_fs_root and send false for check_ref
so we can always get the root we're looking for. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
There's a race condition that leads to a NULL pointer dereference if you
disable quotas while a quota rescan is running. To fix this, we just need
to wait for the quota rescan worker to actually exit before tearing down
the quota structures.
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Chris Mason <clm@fb.com>
When a block group becomes unused and the cleaner kthread is currently
running, we can end up getting the current transaction aborted with error
-ENOENT when we try to commit the transaction, leading to the following
trace:
[59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
[59779.272594] BTRFS: Transaction aborted (error -2)
(...)
[59779.291137] Call Trace:
[59779.291621] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[59779.292543] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[59779.293435] [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[59779.295000] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[59779.296138] [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
[59779.297663] [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[59779.299141] [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs]
[59779.300359] [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs]
[59779.301805] [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[59779.302893] [<ffffffff81196634>] sync_filesystem+0x7f/0x93
(...)
[59779.318186] ---[ end trace 577e2daff90da33a ]---
The following diagram illustrates a sequence of steps leading to this
problem:
CPU 1 CPU 2
<at transaction N>
adds bg A to list
fs_info->unused_bgs
adds bg B to list
fs_info->unused_bgs
<transaction kthread
commits transaction N
and wakes up the
cleaner kthread>
cleaner kthread
delete_unused_bgs()
sees bg A in list
fs_info->unused_bgs
btrfs_start_transaction()
<transaction N + 1 starts>
deletes bg A
update_block_group(bg C)
--> adds bg C to list
fs_info->unused_bgs
deletes bg B
sees bg C in the list
fs_info->unused_bgs
btrfs_remove_chunk(bg C)
btrfs_remove_block_group(bg C)
--> checks if the block group
is in a dirty list, and
because it isn't now, it
does nothing
--> the block group item
is deleted from the
extent tree
--> adds bg C to list
transaction->dirty_bgs
some task calls
btrfs_commit_transaction(t N + 1)
commit_cowonly_roots()
btrfs_write_dirty_block_groups()
--> sees bg C in cur_trans->dirty_bgs
--> calls write_one_cache_group()
which returns -ENOENT because
it did not find the block group
item in the extent tree
--> transaction aborte with -ENOENT
because write_one_cache_group()
returned that error
So fix this by adding a block group to the list of dirty block groups
before adding it to the list of unused block groups.
This happened on a stress test using fsstress plus concurrent calls to
fallocate 20G and truncate (releasing part of the space allocated with
fallocate).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently scrub can race with the cleaner kthread when the later attempts
to delete an unused block group, and the result is preventing the cleaner
kthread from ever deleting later the block group - unless the block group
becomes used and unused again. The following diagram illustrates that
race:
CPU 1 CPU 2
cleaner kthread
btrfs_delete_unused_bgs()
gets block group X from
fs_info->unused_bgs and
removes it from that list
scrub_enumerate_chunks()
searches device tree using
its commit root
finds device extent for
block group X
gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())
sets bg X to RO
sees the block group is
already RO and therefore
doesn't delete it nor adds
it back to unused list
So fix this by making scrub add the block group again to the list of
unused block groups if the block group is still unused when it finished
scrubbing it and it hasn't been removed already.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Scrub can race with the cleaner kthread deleting block groups that are
unused (and with relocation too) leading to a failure with error -EINVAL
that gets returned to user space.
The following diagram illustrates how it happens:
CPU 1 CPU 2
cleaner kthread
btrfs_delete_unused_bgs()
gets block group X from
fs_info->unused_bgs
sets block group to RO
btrfs_remove_chunk(bg X)
deletes device extents
scrub_enumerate_chunks()
searches device tree using
its commit root
finds device extent for
block group X
gets block group X from the tree
fs_info->block_group_cache_tree
(via btrfs_lookup_block_group())
sets bg X to RO (again)
btrfs_remove_block_group(bg X)
deletes block group from
fs_info->block_group_cache_tree
removes extent map from
fs_info->mapping_tree
scrub_chunk(offset X)
searches fs_info->mapping_tree
for extent map starting at
offset X
--> doesn't find any such
extent map
--> returns -EINVAL and scrub
errors out to userspace
with -EINVAL
Fix this by dealing with an extent map lookup failure as an indicator of
block group deletion.
Issue reproduced with fstest btrfs/071.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The test btrfs/011 triggers a rcu warning
Reviewed-by: Anand Jain <anand.jain@oracle.com>
===============================
[ INFO: suspicious RCU usage. ]
4.4.0-rc1-default+ #286 Tainted: G W
-------------------------------
fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
4 locks held by btrfs/28786:
0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs]
1: (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs]
2: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs]
3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs]
stack backtrace:
CPU: 0 PID: 28786 Comm: btrfs Tainted: G W 4.4.0-rc1-default+ #286
Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010
0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001
0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78
ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220
Call Trace:
[<ffffffff8141d47b>] dump_stack+0x4f/0x74
[<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140
[<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs]
[<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80
[<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs]
[<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs]
[<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs]
[<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs]
[<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs]
[<ffffffff81196aa5>] ? __might_fault+0x95/0xa0
[<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs]
[<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
[<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs]
[<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs]
[<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
[<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0
[<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0
[<ffffffff811e3336>] ? __fget_light+0x86/0xb0
[<ffffffff811e3369>] ? __fdget+0x9/0x20
[<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80
[<ffffffff811d7483>] SyS_ioctl+0x53/0x80
[<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
This is because of unprotected use of rcu_dereference in
btrfs_scratch_superblocks. We can't add rcu locks around the whole
function because we read the superblock.
The fix will use the rcu string buffer directly without the rcu locking.
Thi is safe as the device will not go away in the meantime. We're
holding the device list mutexes.
Restructuring the code to narrow down the rcu section turned out to be
impossible, we need to call filp_open (through update_dev_time) on the
buffer and this could call kmalloc/__might_sleep. We could call kstrdup
with GFP_ATOMIC but it's not absolutely necessary.
Fixes: 12b1c2637b (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks)
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
xfstests/011 failed in node with small_size filesystem.
Can be reproduced by following script:
DEV_LIST="/dev/vdd /dev/vde"
DEV_REPLACE="/dev/vdf"
do_test()
{
local mkfs_opt="$1"
local size="$2"
dmesg -c >/dev/null
umount $SCRATCH_MNT &>/dev/null
echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
mount "${DEV_LIST[0]}" $SCRATCH_MNT
echo -n "Writing big files"
dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
for ((i = 1; i <= size; i++)); do
echo -n .
/bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
done
echo
echo Start replace
btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
dmesg
return 1
}
return 0
}
# Set size to value near fs size
# for example, 1897 can trigger this bug in 2.6G device.
#
./do_test "-d raid1 -m raid1" 1897
System will report replace fail with following warning in dmesg:
[ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
[ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
[ 135.543505] ------------[ cut here ]------------
[ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
[ 135.545276] Modules linked in:
[ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
[ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
[ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
[ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
[ 135.550758] Call Trace:
[ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55
[ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
[ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
[ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
[ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
[ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
[ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
[ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
[ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
[ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
[ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80
[ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
Reason:
When big data writen to fs, the whole free space will be allocated
for data chunk.
And operation as scrub need to set_block_ro(), and when there is
only one metadata chunk in system(or other metadata chunks
are all full), the function will try to allocate a new chunk,
and failed because no space in device.
Fix:
When set_block_ro failed for metadata chunk, it is not a problem
because scrub_lock paused commit_trancaction in same time, and
metadata are always cowed, so the on-the-fly writepages will not
write data into same place with scrub/replace.
Let replace continue in this case is no problem.
Tested by above script, and xfstests/011, plus 100 times xfstests/070.
Changelog v1->v2:
1: Add detail comments in source and commit-message.
2: Add dmesg detail into commit-message.
3: Limit return value of -ENOSPC to be passed.
All suggested by: Filipe Manana <fdmanana@gmail.com>
Suggested-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
I've accidentally picked an already used number for the enhanced usage
filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with
BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase,
no backward compatibility issues.
Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: bc3094673f ("btrfs: extend balance filter usage to take minimum and maximum")
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were using only 1 transaction unit when attempting to delete an unused
block group but in reality we need 3 + N units, where N corresponds to the
number of stripes. We were accounting only for the addition of the orphan
item (for the block group's free space cache inode) but we were not
accounting that we need to delete one block group item from the extent
tree, one free space item from the tree of tree roots and N device extent
items from the device tree.
While one unit is not enough, it worked most of the time because for each
single unit we are too pessimistic and assume an entire tree path, with
the highest possible heigth (8), needs to be COWed with eventual node
splits at every possible level in the tree, so there was usually enough
reserved space for removing all the items and adding the orphan item.
However after adding the orphan item, writepages() can by called by the VM
subsystem against the btree inode when we are under memory pressure, which
causes writeback to start for the nodes we COWed before, this forces the
operation to remove the free space item to COW again some (or all of) the
same nodes (in the tree of tree roots). Even without writepages() being
called, we could fail with ENOSPC because these items are located in
multiple trees and one of them might have a higher heigth and require
node/leaf splits at many levels, exhausting all the reserved space before
removing all the items and adding the orphan.
In the kernel 4.0 release, commit 3d84be7991 ("Btrfs: fix BUG_ON in
btrfs_orphan_add() when delete unused block group"), we attempted to fix
a BUG_ON due to ENOSPC when trying to add the orphan item by making the
cleaner kthread reserve one transaction unit before attempting to remove
the block group, but this was not enough. We had a couple user reports
still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
a 4.2-rc6 kernel for example:
http://www.spinics.net/lists/linux-btrfs/msg46070.html
So fix this by reserving all the necessary units of metadata.
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: 3d84be7991 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's possible to reach a state where the cleaner kthread isn't able to
start a transaction to delete an unused block group due to lack of enough
free metadata space and due to lack of unallocated device space to allocate
a new metadata block group as well. If this happens try to use space from
the global block group reserve just like we do for unlink operations, so
that we don't reach a permanent state where starting a transaction for
filesystem operations (file creation, renames, etc) keeps failing with
-ENOSPC. Such an unfortunate state was observed on a machine where over
a dozen unused data block groups existed and the cleaner kthread was
failing to delete them due to ENOSPC error when attempting to start a
transaction, and even running balance with a -dusage=0 filter failed with
ENOSPC as well. Also unmounting and mounting again the filesystem didn't
help. Allowing the cleaner kthread to use the global block reserve to
delete the unused data block groups fixed the problem.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_alloc_dummy_root() return an error pointer on failure, it never
returns NULL.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The calculation of range length in btrfs_sync_file leads to signed
overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin.
https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
The fsync call passes 0 and LLONG_MAX, the range length does not fit to
loff_t and overflows, but the value is converted to u64 so it silently
works as expected.
The minimal fix is a typecast to u64, switching functions to take
(start, end) instead of (start, len) would be more intrusive.
Coccinelle script found that there's one more opencoded calculation of
the length.
<smpl>
@@
loff_t start, end;
@@
* end - start
</smpl>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes and cleanups from Chris Mason:
"Some of this got cherry-picked from a github repo this week, but I
verified the patches.
We have three small scrub cleanups and a collection of fixes"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: Use fs_info directly in btrfs_delete_unused_bgs
btrfs: Fix lost-data-profile caused by balance bg
btrfs: Fix lost-data-profile caused by auto removing bg
btrfs: Remove len argument from scrub_find_csum
btrfs: Reduce unnecessary arguments in scrub_recheck_block
btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum
btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum
btrfs: scrub: setup all fields for sblock_to_check
btrfs: scrub: set error stats when tree block spanning stripes
Btrfs: fix race when listing an inode's xattrs
Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
Btrfs: fix race leading to incorrect item deletion when dropping extents
Btrfs: fix sleeping inside atomic context in qgroup rescan worker
Btrfs: fix race waiting for qgroup rescan worker
btrfs: qgroup: exit the rescan worker during umount
Btrfs: fix extent accounting for partial direct IO writes
No need to use root->fs_info in btrfs_delete_unused_bgs(),
use fs_info directly instead.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs balance start -dusage=0 $TEST_DIR
btrfs filesystem usage $TEST_DIR
dd if=/dev/zero of="$TEST_DIR"/file count=100
btrfs filesystem usage $TEST_DIR
Result:
We can see "no data chunk" in first "btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 1.07GiB
And "data chunks changed from raid1 to single" in second
"btrfs filesystem usage":
# btrfs filesystem usage $TEST_DIR
Overall:
...
Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh 256.00MiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg 122.88MiB
/dev/vdh 122.88MiB
System,single: Size:4.00MiB, Used:0.00B
/dev/vdg 4.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg 8.00MiB
/dev/vdh 8.00MiB
Unallocated:
/dev/vdg 1.06GiB
/dev/vdh 841.92MiB
Reason:
btrfs balance delete last data chunk in case of no data in
the filesystem, then we can see "no data chunk" by "fi usage"
command.
And when we do write operation to fs, the only available data
profile is 0x0, result is all new chunks are allocated single type.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reproduce:
(In integration-4.3 branch)
TEST_DEV=(/dev/vdg /dev/vdh)
TEST_DIR=/mnt/tmp
umount "$TEST_DEV" >/dev/null
mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
umount "$TEST_DEV"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs filesystem usage $TEST_DIR
We can see the data chunk changed from raid1 to single:
# btrfs filesystem usage $TEST_DIR
Data,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
#
Reason:
When a empty filesystem mount with -o nospace_cache, the last
data blockgroup will be auto-removed in umount.
Then if we mount it again, there is no data chunk in the
filesystem, so the only available data profile is 0x0, result
is all new chunks are created as single type.
Fix:
Don't auto-delete last blockgroup for a raid type.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We don't need pass so many arguments for recheck sblock now,
this patch cleans them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can use existing scrub_checksum_data() and scrub_checksum_tree_block()
for scrub_recheck_block_checksum(), instead of write duplicated code.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should reset sblock->xxx_error stats before calling
scrub_recheck_block_checksum().
Current code run correctly because all sblock are allocated by
k[cz]alloc(), and the error stats are not got changed.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
scrub_setup_recheck_block() isn't setup all necessary fields for
sblock_to_check because history reason.
So current code need more arguments in severial functions,
and more local variables, just to passing these lacked values to
necessary place.
This patch setup above fields to sblock_to_check in
scrub_setup_recheck_block(), for:
1: more cleanup for function arg, local variable
2: to make sblock_to_check complete, then we can use sblock_to_check
without concern about some uninitialized member.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is better to show error stats to user when we found tree block
spanning stripes.
On a btrfs created by old version of btrfs-convert:
Before patch:
# btrfs scrub start -B /dev/vdh
scrub done for 8b342d35-2904-41ab-b3cb-2f929709cf47
scrub started at Tue Aug 25 21:19:09 2015 and finished after 00:00:00
total bytes scrubbed: 53.54MiB with 0 errors
# dmesg
...
[ 128.711434] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
[ 128.712744] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
...
After patch:
# btrfs scrub start -B /dev/vdh
scrub done for ff7f844b-7a4e-4b1a-88a9-8252ab25be1b
scrub started at Tue Aug 25 21:42:29 2015 and finished after 00:00:00
total bytes scrubbed: 53.60MiB with 2 errors
error details:
corrected errors: 0, uncorrectable errors: 2, unverified errors: 0
ERROR: There are uncorrectable errors.
# dmesg
...omit...
#
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>