Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of dealloc bytes in the btrfs
inode. The later covers bytes that both fall within allocated regions
of the file and holes.
Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
$ sync
$ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
# The following should have reported 64K...
$ du -h /mnt/sdc/foo1
128K /mnt/sdc/foo1
$ sync
# After flushing the buffered write, it now reports the correct value.
$ du -h /mnt/sdc/foo1
64K /mnt/sdc/foo1
$ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
$ sync
$ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
wrote 65536/65536 bytes at offset 65536
64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
# The following should have reported 128K...
$ du -h /mnt/sdc/foo2
192K /mnt/sdc/foo2
$ sync
# After flushing the buffered write, it now reports the correct value.
$ du -h /mnt/sdc/foo2
128K /mnt/sdc/foo2
So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.
Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use instead this new counter when reporting
the number of used blocks for an inode.
Another different problem that exists is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimizes it a lot by shortening that time
window, as the new dealloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:
$ mkfs.btrfs -f /dev/sdb
$ mount -o compress /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
$ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7b
("btrfs: add missing memset while reading compressed inline extents"))
except trigger an unexpected case in the incremental send code path
that makes us issue an operation to write a hole when it's not needed,
resulting in more writes at the receiver and wasting space at the
receiver.
So teach the incremental send code to deal with this particular case.
The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
extent map structure. The failure can happen either due to an -ENOMEM
condition or, when quotas are enabled, due to -EDQUOT for example.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[BUG]
If run_delalloc_range() returns error and there is already some ordered
extents created, btrfs will be hanged with the following backtrace:
Call Trace:
__schedule+0x2d4/0xae0
schedule+0x3d/0x90
btrfs_start_ordered_extent+0x160/0x200 [btrfs]
? wake_atomic_t_function+0x60/0x60
btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
process_one_work+0x2af/0x720
? process_one_work+0x22b/0x720
worker_thread+0x4b/0x4f0
kthread+0x10f/0x150
? process_one_work+0x720/0x720
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x2e/0x40
[CAUSE]
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>| |<---------- cleanup range --------->|
||
\_=> First page handled by end_extent_writepage() in __extent_writepage()
The problem is caused by error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.
However after run_delalloc_range() returns error, __extent_writepage()
won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
except the first page, and btrfs_finish_ordered_io() won't be triggered
for created ordered extents either.
So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.
[FIX]
Introduce btrfs_cleanup_ordered_extents() function to cleanup created
ordered extents and finish them manually.
The function is based on existing
btrfs_endio_direct_write_update_ordered() function, and modify it to
act just like btrfs_writepage_endio_hook() but handles specified range
other than one page.
After fix, delalloc error will be handled like:
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<-------- ----------->|<------ old error handler --------->|
|| ||
|| \_=> Cleaned up by cleanup_ordered_extents()
\_=> First page handled by end_extent_writepage() in __extent_writepage()
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().
BTRFS info (device vdb5): relocating block group 12582912 flags data
BTRFS info (device vdb5): found 1 extents
assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.
a) Create one data block group with one 4K extent in it.
To avoid the bug that hangs btrfs due to ordered extent which never
finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group
[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:
(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<----------- cleanup range --------------->|
|<----------- ----------->|
\/
btrfs_finish_ordered_io() range
So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.
[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.
|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<----------- ----------->|<---------- cleanup range --------->|
\/
btrfs_finish_ordered_io() range
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
KASAN reports that there is a use-after-free case of bio in btrfs_map_bio.
If we need to submit IOs to several disks at a time, the original bio
would get cloned and mapped to the destination disk, but we really should
use the original bio instead of a cloned bio to do the sanity check
because cloned bios are likely to be freed by its endio.
Reported-by: Diego <diegocg@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 2dabb32484 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this bug during iterating bio pages in dio read's endio hook,
and it could end up with segment fault of the dio reading task.
So the reason is 'if (nr_sectors--)', and it makes the code assume that
there is one more block in the same page, so page offset is increased and
the bio which is created to repair the bad block then has an incorrect
bvec.bv_offset, and a later access of the page content would throw a
segmentation fault.
This also adds ASSERT to check page offset against page size.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing directIO repair, we have this oops:
[ 1458.532816] general protection fault: 0000 [#1] SMP
...
[ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
[ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
[ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
...
[ 1458.543261] Call Trace:
[ 1458.543958] ? rcu_read_lock_sched_held+0xc4/0xd0
[ 1458.544374] bio_endio+0xed/0x100
[ 1458.544750] end_workqueue_fn+0x3c/0x40 [btrfs]
[ 1458.545257] normal_work_helper+0x9f/0x900 [btrfs]
[ 1458.545762] btrfs_endio_repair_helper+0x12/0x20 [btrfs]
[ 1458.546224] process_one_work+0x34d/0xb70
[ 1458.546570] ? process_one_work+0x29e/0xb70
[ 1458.546938] worker_thread+0x1cf/0x960
[ 1458.547263] ? process_one_work+0xb70/0xb70
[ 1458.547624] kthread+0x17d/0x180
[ 1458.547909] ? kthread_create_on_node+0x70/0x70
[ 1458.548300] ret_from_fork+0x31/0x40
It turns out that btrfs_retry_endio is trying to get inode from a directIO
page.
This fixes the problem by using the saved inode pointer, done->inode.
btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.
Also cleanup unused @start (which is too trivial for a separate patch).
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The opposite case was already handled right in the very next switch entry.
And also when turning on nossd, drop ssd_spread.
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This isn't super serious because you need CAP_ADMIN to run this code.
I added this integer overflow check last year but apparently I am
rubbish at writing integer overflow checks... There are two issues.
First, access_ok() works on unsigned long type and not u64 so on 32 bit
systems the access_ok() could be checking a truncated size. The other
issue is that we should be using a stricter limit so we don't overflow
the kzalloc() setting ctx->clone_roots later in the function after the
access_ok():
alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
Fixes: f5ecec3ce2 ("btrfs: send: silence an integer overflow warning")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Using an int value is causing qg->reserved to become negative and
exclusive -EDQUOT to be reached prematurely.
This affects exclusive qgroups only.
TEST CASE:
DEVICE=/dev/vdb
MOUNTPOINT=/mnt
SUBVOL=$MOUNTPOINT/tmp
umount $SUBVOL
umount $MOUNTPOINT
mkfs.btrfs -f $DEVICE
mount /dev/vdb $MOUNTPOINT
btrfs quota enable $MOUNTPOINT
btrfs subvol create $SUBVOL
umount $MOUNTPOINT
mount /dev/vdb $MOUNTPOINT
mount -o subvol=tmp $DEVICE $SUBVOL
btrfs qgroup limit -e 3G $SUBVOL
btrfs quota rescan /mnt -w
for i in `seq 1 44000`; do
dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
if [[ $? > 0 ]]; then
btrfs qgroup show -pcref $SUBVOL
exit 1
fi
done
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[ add reproducer to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 20a7db8ab3 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.
This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest. We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is a story about 4 distinct (and very old) btrfs bugs.
Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).
Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1: uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read. The fix
was to add a memset in btrfs_get_extent to zero out the hole.
Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2: compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases. This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.
There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record. This memset doesn't cover the region between the end of the
decompressed data and the end of the page. It has also moved around a
few times over the years, so there's no single patch to refer to.
This patch fixes bug #3: compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same: zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.
The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code. In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.
A fast reproducer for bug #0 is presented below. A few offending extents
are also created in the wild during large rsync transfers with the -S
flag. A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well. Once an offending
file is created, it can present different content to userspace each
time it is read.
Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y
kernel back to v3.5 has this behavior. There are fossil records of this
bug's effects in commits all the way back to v2.6.32. I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.
It is not clear whether bug #0 is worth fixing. A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.
Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak. So enough about
bug #0, let's get back to bug #3 (this patch).
An example of on-disk structure leading to data corruption found in
the wild:
item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
inode generation 50 transid 50 size 47424 nbytes 49141
block group 0 mode 100644 links 1 uid 0 gid 0
rdev 0 flags 0x0(none)
item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
inode ref index 3 namelen 10 name: DB_File.so
item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
inline extent data size 1341 ram 4085 compress(zlib)
item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
extent data disk byte 5367308288 nr 20480
extent data offset 0 nr 45056 ram 45056
extent compression(zlib)
Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096. The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.
Here is a reproducer from Liu Bo, which demonstrates another method
of creating the same inline extent and hole pattern:
Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:
# touch foo
# chattr +c foo
# xfs_io -f -c "pwrite -W 0 1000" foo
# xfs_io -f -c "falloc 4 8188" foo
# od -x foo
# echo 3 >/proc/sys/vm/drop_caches
# od -x foo
This produce the following on my box:
Correct output: file contains 1000 data bytes followed
by zeros:
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
0001760 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000
Actual output: the data after the first 1000 bytes
will be different each run:
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
(...)
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The bug is a regression after commit
(da2c7009f6 "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
and commit
(76c0021db8 "Btrfs: use helper to simplify lock/unlock pages").
So if the dirty pages which are under writeback got truncated partially
before we lock the dirty pages, we couldn't find all pages mapping to the
delalloc range, and the bug didn't return an error so it kept going on and
found that the delalloc range got truncated and got to unlock the dirty
pages, and then the ASSERT could caught the error, and showed
-----------------------------------------------------------------------------
assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
-----------------------------------------------------------------------------
This fixes the bug by returning the proper -EAGAIN.
Cc: David Sterba <dsterba@suse.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra trade off does
not hurt.
Signed-off-by: David Sterba <dsterba@suse.com>
We know that eadpage_end_io_hook, submit_bio_hook and merge_bio_hook are
always defined so we can drop the checks before we call them.
Signed-off-by: David Sterba <dsterba@suse.com>
Some of the callbacks defined in btree_extent_io_ops and
btrfs_extent_io_ops do always exist so we don't need to check the
existence before each call. This patch just reorders the definition and
documents which are mandatory/optional.
Signed-off-by: David Sterba <dsterba@suse.com>
There's no error path in any of the instances, always return 0.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The space check in btrfs_insert_xattr_item is duplicated in it's caller
(do_setxattr) so we won't hit the BUG_ON. Continuing without any check
could be disasterous so turn it to a proper error handling.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers dereference the 'tm' parameter before it gets to this
function, the NULL check does not make much sense here.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of max_out can be calculated from the parameters passed to the
compressors, which is number of pages and the page size, and we don't
have to needlessly pass it around.
Signed-off-by: David Sterba <dsterba@suse.com>
Move the buffer limit definitions out of compress_file_range.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter saying how many pages can be allocated at maximum can be
merged with the output page counter, to save some stack space. The
compression implementation will sink the parameter to a local variable
so everything works as before.
The nr_pages variables can also be simply merged in compress_file_range
into one.
Signed-off-by: David Sterba <dsterba@suse.com>
The length parameter is basically duplicated for input and output in the
top level caller of the compress_pages chain. We can simply use one
variable for that and reduce stack consumption. The compression
implementation will sink the parameter to a local variable so everything
works as before.
Signed-off-by: David Sterba <dsterba@suse.com>
In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>