btrfs_submit_data_write_bio special cases the reloc root because the
checksums are preloaded, but only does so for the !sync case. The sync
case can't happen for data relocation, but just handling it more generally
significantly simplifies the logic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Transfer the bio counter reference acquired by btrfs_submit_bio to
raid56_parity_write and raid56_parity_recovery together with the bio
that the reference was acquired for instead of acquiring another
reference in those helpers and dropping the original one in
btrfs_submit_bio.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.
Also use the proper bool type for the generic_io argument.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches
what the block layer submission does and avoids any confusion on who
needs to handle errors.
As this requires touching all the callers, rename the function to
btrfs_submit_bio, which describes the functionality much better.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
For profiles other than RAID56, __btrfs_map_block() returns @map_length
as min(stripe_end, logical + *length), which is also the same result
from btrfs_get_io_geometry().
But for RAID56, __btrfs_map_block() returns @map_length as stripe_len.
This strange behavior is going to hurt incoming bio split at
btrfs_map_bio() time, as we will use @map_length as bio split size.
Fix this behavior by returning @map_length by the same calculation as
for other profiles.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there
are functions passing it as arguments, this is not necessary. The fixed
value has been used for a long time and though the stripe length should
be configurable by super block member stripesize, this hasn't been
implemented and would require more changes so we don't need to keep this
code around until then.
Partially based on a patch from Qu Wenruo.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode cache feature was removed in kernel 5.11, and we no longer have
any code that reads from or writes to inode caches. We may still mount a
filesystem that has inode caches, but they are ignored.
Remove the check for an inode cache from btrfs_is_free_space_inode(),
since we no longer have code to trigger reads from an inode cache or
writes to an inode cache. The check at send.c is still needed, because
in case we find a filesystem with an inode cache, we must ignore it.
Also leave the checks at tree-checker.c, as they are sanity checks.
This eliminates a dead branch and reduces the amount of code since it's
in an inline function.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1620662 189240 29032 1838934 1c0f56 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag has been merged in 3.10 and is effectively always-on. Its
status depends on the host page size so there's another way to guarantee
compatibility with old kernels.
Due to a bug introduced in 6f93e834fa ("btrfs: fix upper limit for
max_inline for page size 64K") the flag is not persisted among features
in the superblock so it's not reliable.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
This feature has been the default for about 13 year. At this point it's
safe to consider it an indispensable feature of BTRFS as such there's
no need to advertise it in sysfs. Remove the global sysfs feature file,
the per-filesystem feature file has never been there.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Skinny extents have been a default mkfs feature since version 3.18 i
(introduced in btrfs-progs commit 6715de04d9a7 ("btrfs-progs: mkfs:
make skinny-metadata default") ). It really doesn't bring any value to
users to simply remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added in commit 727011e07c ("Btrfs: allow metadata blocks larger than
the page size") in 2010 and it's been default for mkfs since 3.12
(2013). The message doesn't really convey any useful information to
users. Remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The chained assignments may be convenient to write, but make readability
a bit worse as it's too easy to overlook that there are several values
set on the same line while this is rather an exception. Making it
consistent everywhere avoids surprises.
The pattern where inode times are initialized reuses the first value and
the order is mtime, ctime. In other blocks the assignments are expanded
so the order of variables is similar to the neighboring code.
Signed-off-by: David Sterba <dsterba@suse.com>
Use the same expression for stripe_nr for RAID0 (map->sub_stripes is 1)
and RAID10 (map->sub_stripes is 2), with equivalent results.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a sequence of hard coded values for RAID1 profiles that are
already stored in the raid_attr table that should be used instead.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 6f93e834fa seemingly inadvertently moved the code responsible
for flagging the filesystem as having BIG_METADATA to a place where
setting the flag was essentially lost. This means that
filesystems created with kernels containing this bug (starting with 5.15)
can potentially be mounted by older (pre-3.4) kernels. In reality
chances for this happening are low because there are other incompat
flags introduced in the mean time. Still the correct behavior is to set
INCOMPAT_BIG_METADATA flag and persist this in the superblock.
Fixes: 6f93e834fa ("btrfs: fix upper limit for max_inline for page size 64K")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Per user request, print the checksum type and implementation at mount
time among the messages. The checksum is user configurable and the
actual crypto implementation is useful to see for performance reasons.
The same information is also available after mount in
/sys/fs/FSID/checksum file.
Example:
[25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm
Link: https://github.com/kdave/btrfs-progs/issues/483
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If you try to force a chunk allocation, but you race with another chunk
allocation, you will end up waiting on the chunk allocation that just
occurred and then allocate another chunk. If you have many threads all
doing this at once you can way over-allocate chunks.
Fix this by resetting force to NO_FORCE, that way if we think we need to
allocate we can, otherwise we don't force another chunk allocation if
one is already happening.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS
and later from XFLAGS interfaces, now commonly found under the
'fileattr' API. This corresponds to the individual inode bits and that's
part of the on-disk format, so this is suitable for the protocol. The
other interfaces contain a lot of cruft or bits that btrfs does not
support yet.
Currently the value is u64 and matches btrfs_inode_item. Not all the
bits can be set by ioctls (like NODATASUM or READONLY), but we can send
them over the protocol and leave it up to the receiving side what and
how to apply.
As some of the flags, eg. IMMUTABLE, can prevent any further changes,
the receiving side needs to understand that and apply the changes in the
right order, or possibly with some intermediate steps. This should be
easier, future proof and simpler on the protocol layer than implementing
in kernel.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When send v1 was introduced the otime (inode creation time) was not
available, however the attribute in btrfs send protocol exists. Though
it would be possible to add it for v1 too as the attribute would be
ignored by v1 receive, let's not change the layout of v1 and only add
that to v2+. The otime cannot be changed and is only informative.
Signed-off-by: David Sterba <dsterba@suse.com>
When handling a real world transid mismatch image, it's hard to know
which copy is corrupted, as the error messages just look like this:
BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
We don't even know if the retry is caused by btrfs or the VFS retry.
To make things a little easier to read, add mirror number for all
related tree block read errors.
So the above messages would look like this:
BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update messages, add "logical" ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'goto out' in cow_file_range() in the exit block are not necessary
and jump back. Replace them with return, while still keeping 'goto out'
in the main code.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep goto in the main code, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
When cow_file_range() fails in the middle of the allocation loop, it
unlocks the pages but leaves the ordered extents intact. Thus, we need
to call btrfs_cleanup_ordered_extents() to finish the created ordered
extents.
Also, we need to call end_extent_writepage() if locked_page is available
because btrfs_cleanup_ordered_extents() never processes the region on
the locked_page.
Furthermore, we need to set the mapping as error if locked_page is
unavailable before unlocking the pages, so that the errno is properly
propagated to the user space.
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
is not usable for submit_uncompressed_range() which can have NULL
locked_page.
Add support supports locked_page == NULL case. Also, it rewrites
redundant "page_offset(locked_page)".
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a hung_task report on zoned btrfs like below.
https://github.com/naota/linux/issues/59
[726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
[726.329839] Not tainted 5.16.0-rc1+ #1
[726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[726.331603] task:rocksdb:high0 state:D stack: 0 pid:11085 ppid: 11082 flags:0x00000000
[726.331608] Call Trace:
[726.331611] <TASK>
[726.331614] __schedule+0x2e5/0x9d0
[726.331622] schedule+0x58/0xd0
[726.331626] io_schedule+0x3f/0x70
[726.331629] __folio_lock+0x125/0x200
[726.331634] ? find_get_entries+0x1bc/0x240
[726.331638] ? filemap_invalidate_unlock_two+0x40/0x40
[726.331642] truncate_inode_pages_range+0x5b2/0x770
[726.331649] truncate_inode_pages_final+0x44/0x50
[726.331653] btrfs_evict_inode+0x67/0x480
[726.331658] evict+0xd0/0x180
[726.331661] iput+0x13f/0x200
[726.331664] do_unlinkat+0x1c0/0x2b0
[726.331668] __x64_sys_unlink+0x23/0x30
[726.331670] do_syscall_64+0x3b/0xc0
[726.331674] entry_SYSCALL_64_after_hwframe+0x44/0xae
[726.331677] RIP: 0033:0x7fb9490a171b
[726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
[726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
[726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
[726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
[726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
[726.331693] </TASK>
While we debug the issue, we found running fstests generic/551 on 5GB
non-zoned null_blk device in the emulated zoned mode also had a
similar hung issue.
Also, we can reproduce the same symptom with an error injected
cow_file_range() setup.
The hang occurs when cow_file_range() fails in the middle of
allocation. cow_file_range() called from do_allocation_zoned() can
split the give region ([start, end]) for allocation depending on
current block group usages. When btrfs can allocate bytes for one part
of the split regions but fails for the other region (e.g. because of
-ENOSPC), we return the error leaving the pages in the succeeded regions
locked. Technically, this occurs only when @unlock == 0. Otherwise, we
unlock the pages in an allocated region after creating an ordered
extent.
Considering the callers of cow_file_range(unlock=0) won't write out
the pages, we can unlock the pages on error exit from
cow_file_range(). So, we can ensure all the pages except @locked_page
are unlocked on error case.
In summary, cow_file_range now behaves like this:
- page_started == 1 (return value)
- All the pages are unlocked. IO is started.
- unlock == 1
- All the pages except @locked_page are unlocked in any case
- unlock == 0
- On success, all the pages are locked for writing out them
- On failure, all the pages except @locked_page are unlocked
Fixes: 42c0110009 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export commit stats in file
/sys/fs/btrfs/UUID/commit_stats
with example output like:
commits 123
last_commit_ms 11
max_commit_ms 150
total_commit_ms 2000
The values are in one file so reading them at a single time will give a
more consistent view. The stats are internally tracked in nanoseconds so
the cumulative values should not suffer from rounding errors.
Writing 0 to the file 'commit_stats' will reset max_commit_ms.
Initial values are set at first mount of the filesystem.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Track several stats about transaction commit, to be later exported via
sysfs:
- number of commits so far
- duration of the last commit in ns
- maximum commit duration seen so far in ns
- total duration for all commits so far in ns
The update of the commit stats occurs after the commit thread has gone
through all the logic that checks if there is another thread committing
at the same time. This means that we only account for actual commit work
in the commit stats we report and not the time the thread spends waiting
until it is ready to do the commit work.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Same as in commit 21b4ee7029 ("xfs: drop ->writepage completely"): we
can remove the callback as it's only used in one place - single page
writeback from memory reclaim and is not called for cgroup writeback at
all.
We only allow such writeback from kswapd, not from direct memory
reclaim, and so it is rarely used. When it comes from kswapd, it is
effectively random dirty page shoot-down, which is horrible for IO
patterns. We can rely on background writeback to clean all dirty pages
in an efficient way and not let it be interrupted by kswapd.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The whole send operation is restartable and handling properly a buffer
write may not be easy. We can't know what caused that and if a short
delay and retry will fix it or how many retries should be performed in
case it's a temporary condition.
The error value is returned to the ioctl caller so in case it's
transient problem, the user would be notified about the reason. Remove
the TODO note as there's no plan to handle ERESTARTSYS.
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need this ifdef as the header file is not shared, the protocol
definition used by userspace should be from libbtrfs or libbtrfsutil.
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs currently limits direct I/O reads to a single sector, which goes
back to commit c329861da4 ("Btrfs: don't allocate a separate csums
array for direct reads") from Josef. That commit changes the direct I/O
code to ".. use the private part of the io_tree for our csums.", but ten
years later that isn't how checksums for direct reads work, instead they
use a csums allocation on a per-btrfs_dio_private basis (which have their
own performance problem for small I/O, but that will be addressed later).
There is no fundamental limit in btrfs itself to limit the I/O size
except for the size of the checksum array that scales linearly with
the number of sectors in an I/O. Pick a somewhat arbitrary limit of
256 limits, which matches what the buffered reads typically see as
the upper limit as the limit for direct I/O as well.
This significantly improves direct read performance. For example a fio
run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
throughput:
Baseline:
READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
With this patch:
READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msc
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a small workload which will always fail with recent kernel:
(A simplified version from btrfs/125 test case)
mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
mount $dev1 $mnt
xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
sync
umount $mnt
btrfs dev scan -u $dev3
mount -o degraded $dev1 $mnt
xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
umount $mnt
btrfs dev scan
mount $dev1 $mnt
btrfs balance start --full-balance $mnt
umount $mnt
The failure is always failed to read some tree blocks:
BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
...
[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:
56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
Before we enter raid56_parity_recover(), we have triggered some metadata
write for the full stripe 38928384, this leads to us to read all the
sectors from disk.
Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
avoid unnecessary read.
This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.
Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
which has data stripes" we haven't submitted all the corrupted P/Q to disk.
When we really need to recover certain range, aka in
raid56_parity_recover(), we will use the cached rbio, along with its
cached sectors (the full stripe is all cached).
This explains why we have no event raid56_scrub_read_recover()
triggered.
Since we have the cached P/Q which is calculated using the stale data,
the recovered one will just be stale.
In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.
[BTRFS DESTRUCTIVE RMW PROBLEM]
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:
0 32K 64K
Data1: | Good | Good |
Data2: | Bad | Bad |
Parity: | Good | Good |
In above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recovery
Data2, thus Data2 is lost forever.
This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs specific behaviors making the case even worse:
- Btrfs will cache sectors for unrelated vertical stripes.
In above example, if we're only writing into 0~32K range, btrfs will
still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
This behavior is to cache sectors for later update.
Incidentally commit d4e28d9b5f ("btrfs: raid56: make steal_rbio()
subpage compatible") has a bug which makes RAID56 to never trust the
cached sectors, thus slightly improve the situation for recovery.
Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in
steal_rbio" will revert the behavior back to the old one.
- Btrfs raid56 partial write will update all P/Q sectors and cache them
This means, even if data at (64K ~ 96K) of Data2 is free space, and
only (96K ~ 128K) of Data2 is really stale data.
And we write into that (96K ~ 128K), we will update all the parity
sectors for the full stripe.
This unnecessary behavior will completely kill the chance of recovery.
Thankfully, an unrelated optimization "btrfs: only write the sectors
in the vertical stripe which has data stripes" will prevent
submitting the write bio for untouched vertical sectors.
That optimization will keep the on-disk P/Q untouched for a chance for
later recovery.
[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit the
damage.
With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" now we won't really submit the P/Q of unrelated
vertical stripes, so the on-disk P/Q should still be fine.
Now we really need to do is just drop all the cached sectors when doing
recovery.
By this, we have a chance to read the original P/Q from disk, and have a
chance to recover the stale data, while still keep the cache to speed up
regular write path.
In fact, just dropping all the cache for recovery path is good enough to
allow the test case btrfs/125 along with the small script to pass
reliably.
The lack of metadata write after the degraded mount, and forced metadata
COW is saving us this time.
So this patch will fix the behavior by not trust any cache in
__raid56_parity_recover(), to solve the problem while still keep the
cache useful.
But please note that this test pass DOES NOT mean we have solved the
destructive RMW problem, we just do better damage control a little
better.
Related patches:
- btrfs: only write the sectors in the vertical stripe
- d4e28d9b5f ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
finish_func is always set to finish_ordered_fn, so remove it and also
the now pointless and somewhat confusingly named
__endio_write_update_ordered wrapper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
With Filipe's recent rework of the delayed inode code one aspect which
isn't batched is the release of the reserved metadata of delayed inode's
delete items. With this patch on top of Filipe's rework and running the
same test as provided in the description of a patch titled
"btrfs: improve batch deletion of delayed dir index items" I observe
the following change of the number of calls to btrfs_block_rsv_release:
Before this change:
- block_rsv_release: 1004
- btrfs_delete_delayed_items_total_time: 14602
- delete_batches: 505
After:
- block_rsv_release: 510
- btrfs_delete_delayed_items_total_time: 13643
- delete_batches: 507
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs on-disk format has reserved the first 1MiB for the primary super
block (at 64KiB offset) and bootloaders may also use this space.
This behavior is only introduced since v4.1 btrfs-progs release,
although kernel can ensure we never touch the reserved range of super
blocks, it's better to inform the end users, and a balance will resolve
the problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update changelog and message ]
Signed-off-by: David Sterba <dsterba@suse.com>
There's a reserved space on each device of size 1MiB that can be used by
bootloaders or to avoid accidental overwrite. Use a symbolic constant
with the explaining comment instead of hard coding the value and
multiple comments.
Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for
the primary super block (at offset 64KiB), until then the range could
have been used by mistake. Kernel has been always respecting the 1MiB
range for writes.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one function we pass to iterate_inodes_from_logical as
iterator, so we can drop the indirection and call it directly, after
moving the function to backref.c
Signed-off-by: David Sterba <dsterba@suse.com>
The inode reference iterator interface takes parameters that are derived
from the context parameter, but as it's a void* type the values are
passed individually.
Change the ctx type to inode_fs_path as it's the only thing we pass and
drop any parameters that are derived from that.
Signed-off-by: David Sterba <dsterba@suse.com>
The functions for iterating inode reference take a function parameter
but there's only one value, inode_to_path(). Remove the indirection and
call the function. As paths_from_inode would become just an alias for
iterate_irefs(), merge the two into one function.
Signed-off-by: David Sterba <dsterba@suse.com>
For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies
directly, only for RAID5 and RAID6 we need some extra handling as
there's no table value for that.
For RAID10 there's a change from sub_stripes to ncopies. The values are
the same but semantically we want to use number of copies, as this is
what btrfs_num_copies does.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the raid table instead of hard coded values and rename the helper as
it is exported. This could make later extension on RAID56 based
profiles easier.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __btrfs_map_block() we have an assignment to @max_errors using
nr_parity_stripes().
Although it works for RAID56 it's confusing. Replace it with
btrfs_chunk_max_errors().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For scrub_stripe() we can easily calculate the dev extent length as we
have the full info of the chunk.
Thus there is no need to pass @dev_extent_len from the caller, and we
introduce a helper, btrfs_calc_stripe_length(), to do the calculation
from extent_map structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify helper to return only next and prev pointers, we don't need all
the node/parent/prev/next pointers of __etree_search as there are now
other specialized helpers. Rename parameters so they follow the naming.
Signed-off-by: David Sterba <dsterba@suse.com>
With a slight extension of tree_search_for_insert (fill the return node
and parent return parameters) we can avoid calling __etree_search from
tree_search, that could be removed eventually in followup patches.
Signed-off-by: David Sterba <dsterba@suse.com>
The call chain from
tree_search
tree_search_for_insert
__etree_search
can be open coded and allow further simplifications, here we need a tree
search with fallback to the next node in case it's not found. This is
represented as __etree_search parameters next_ret=valid, prev_ret=NULL.
Signed-off-by: David Sterba <dsterba@suse.com>
In two cases the exact location where to insert the extent state is
known at the call time so we don't need to pass it to insert_state that
takes the fast path.
Signed-off-by: David Sterba <dsterba@suse.com>
The bits are passed to all extent state helpers for no apparent reason,
the value only read and never updated so remove the indirection and pass
it directly. Also unify the type to u32 where needed.
Signed-off-by: David Sterba <dsterba@suse.com>
Let callers of insert_state to set up the extent state to allow further
simplifications of the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
The rbtree search is a known pattern and can be open coded, allowing to
remove the tree_insert and further cleanups.
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work to remove tree_insert from extent_io.c, the rbtree
search loop is a known and simple so it can be open coded.
Signed-off-by: David Sterba <dsterba@suse.com>
Originally it's iterating all the sectors which has dbitmap sector for
the vertical stripe.
It can be easily converted to sector bytenr iteration with an test_bit()
call.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function doesn't even utilize full stripe skip, just iterate all
the data sectors is definitely enough.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The double loop is just checking if the page for the vertical stripe
is allocated.
We can easily convert it to single loop and get rid of @stripe variable.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The double for loop can be easily converted to single for loop as we're
really iterating the sectors in their bytenr order.
The only exception is the full stripe skip, however that can also easily
be done inside the loop. Add an ASSERT() along with a comment for that
specific case.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can easily calculate the stripe number and sector number inside the
stripe. Thus there is not much need for a double for loop.
For the only case we want to skip the whole stripe, we can manually
increase @total_sector_nr.
This is not a recommended behavior, thus every time the iterator gets
modified there will be a comment along with an ASSERT() for it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we will return 1 or -EAGAIN if we decide we need to commit
the transaction rather than sync the log. In practice this doesn't
really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to
commit the transaction. However this makes it hard to figure out what
the correct thing to do is.
Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the
places where we want to force the transaction to be committed.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging a reference counting issue with ordered extents, I've found
we're lacking a lot of tracepoint coverage in the ordered extent code.
Close these gaps by adding tracepoints after every refcount_inc() in the
ordered extent code.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've hidden the zoned support in sysfs under debug config for the first
releases but now the stability is reasonable, though not all features
have been implemented.
Signed-off-by: David Sterba <dsterba@suse.com>
Mapping block for discard doesn't really share any code with the regular
block mapping case. Split it out into an entirely separate helper
that just returns an array of btrfs_discard_stripe structures and the
number of stripes.
This removes the need for the length field in the btrfs_io_context
structure, so remove tht.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All the bios that index_one_bio operates on are the bios submitted by the
upper layer. These are never resubmitted to an actual device by the
raid56 code, and thus the iter never changes from the initial state.
Thus we can always just use bi_iter directly as it will be the same as
the saved copy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we have a btrfs image with dirty log, along with an unsupported RO
compatible flag:
log_root 30474240
...
compat_flags 0x0
compat_ro_flags 0x40000003
( FREE_SPACE_TREE |
FREE_SPACE_TREE_VALID |
unknown flag: 0x40000000 )
Then even if we can only mount it RO, we will still cause metadata
update for log replay:
BTRFS info (device dm-1): flagging fs with big metadata feature
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): has skinny extents
BTRFS info (device dm-1): start tree-log replay
This is definitely against RO compact flag requirement.
[CAUSE]
RO compact flag only forces us to do RO mount, but we will still do log
replay for plain RO mount.
Thus this will result us to do log replay and update metadata.
This can be very problematic for new RO compat flag, for example older
kernel can not understand v2 cache, and if we allow metadata update on
RO mount and invalidate/corrupt v2 cache.
[FIX]
Just reject the mount unless rescue=nologreplay is provided:
BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
We don't want to set rescue=nologreply directly, as this would make the
end user to read the old data, and cause confusion.
Since the such case is really rare, we're mostly fine to just reject the
mount with an error message, which also includes the proper workaround.
CC: stable@vger.kernel.org #4.9+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using "btrfs inspect-internal dump-super" to inspect an fs with
dirty log, it always shows the log_root_transid as 0:
log_root 30474240
log_root_transid 0 <<<
log_root_level 0
It turns out that, btrfs_super_block::log_root_transid is never really
utilized (even no read for it).
This can date back to the introduction of btrfs into upstream kernel.
In fact, when reading log tree root, we always use
btrfs_super_block::generation + 1 as the expected generation.
So here we're completely safe to mark this member deprecated.
In theory we can easily reuse this member for other purposes, but to be
extra safe, here we follow the leafsize way, by adding "__unused_" for
log_root_transid.
And we can safely remove the accessors, since there is no such callers
from the very beginning.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
submit_one_bio always works on the bio and compression flags from a
btrfs_bio_ctrl structure. Pass the explicitly and clean up the
calling conventions by handling a NULL bio in submit_one_bio, and
using the btrfs_bio_ctrl to pass the mirror number as well.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge end_write_bio and flush_write_bio into a single submit_write_bio
helper, that either submits the bio or ends it if a negative errno was
passed in. This consolidates a lot of duplicated checks in the callers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
submit_one_bio is only used for page cache I/O, so the inode can be
trivially derived from the first page in the bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two separate checks in the bounds checker, the first one being
a special case of the second. As this function is performance critical
due to checking access to any eb member, reducing the size can slightly
improve performance.
On a release build on x86_64 the helper is completely inlined so the
function call overhead is also gone.
There was a report of 5% performance drop on metadata heavy workload,
that disappeared after disabling asserts. The most significant part of
that is the bounds checker.
https://lore.kernel.org/linux-btrfs/20200724164147.39925-1-josef@toxicpanda.com/
After the analysis, the optimized code removes the worst overhead which
is the function call and the performance was restored.
https://lore.kernel.org/linux-btrfs/20200730110943.GE3703@twin.jikos.cz/
1. baseline, asserts on, setget check on
run time: 46s
run time with perf: 48s
2. asserts on, comment out setget check
run time: 44s
run time with perf: 47s
So this is confirms the 5% difference
3. asserts on, optimized seget check
run time: 44s
run time with perf: 47s
The optimizations are reducing the number of ifs to 1 and inlining the
hot path. Low-level stuff, gets the performance back. Patch below.
4. asserts off, no setget check
run time: 44s
run time with perf: 45s
This verifies that asserts other than the setget check have negligible
impact on performance and it's not harmful to keep them on.
Analysis where the performance is lost:
* check_setget_bounds is short function, but it's still a function call,
changing the flow of instructions and given how many times it's
called the overhead adds up
* there are two conditions, one to check if the range is
completely outside (member_offset > eb->len) or partially inside
(member_offset + size > eb->len)
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The use of kmap() is being deprecated in favor of kmap_local_page() where
it is feasible. With kmap_local_page(), the mapping is per thread, CPU
local and not globally visible.
Therefore, use kmap_local_page() / kunmap_local() in lzo.c wherever the
mappings are per thread and not globally visible.
Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled.
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The use of kmap() is being deprecated in favor of kmap_local_page() where
it is feasible. With kmap_local_page(), the mapping is per thread, CPU
local and not globally visible.
Therefore, use kmap_local_page() / kunmap_local() in inode.c wherever the
mappings are per thread and not globally visible.
Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled.
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bios submitted from btrfs_map_bio don't really interact with the
rest of btrfs and the only btrfs_bio member actually used in the
low-level bios is the pointer to the btrfs_io_context used for endio
handler.
Use a union in struct btrfs_io_stripe that allows the endio handler to
find the btrfs_io_context and remove the spurious ->device assignment
so that a plain fs_bio_set bio can be used for the low-level bios
allocated inside btrfs_map_bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Move all per-stripe handling into submit_stripe_bio and use a label to
cleanup instead of duplicating the logic.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
All reads bio that go through btrfs_map_bio need to be completed in
user context. And read I/Os are the most common and timing critical
in almost any file system workloads.
Embed a work_struct into struct btrfs_bio and use it to complete all
read bios submitted through btrfs_map, using the REQ_META flag to decide
which workqueue they are placed on.
This removes the need for a separate 128 byte allocation (typically
rounded up to 192 bytes by slab) for all reads with a size increase
of 24 bytes for struct btrfs_bio. Future patches will reorganize
struct btrfs_bio to make use of this extra space for writes as well.
(All sizes are based a on typical 64-bit non-debug build)
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Set REQ_META in btrfs_submit_metadata_bio instead of the various callers.
We'll start relying on this flag inside of btrfs in a bit, and this
ensures it is always set correctly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Compressed write bio completion is the only user of btrfs_bio_wq_end_io
for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal
here as we only real need user context for the final completion of a
compressed_bio structure, and not every single bio completion.
Add a work_struct to struct compressed_bio instead and use that to call
finish_compressed_bio_write. This allows to remove all handling of
write bios in the btrfs_bio_wq_end_io infrastructure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio completion handler of the bio used for the compressed data is
already run in a workqueue using btrfs_bio_wq_end_io, so don't schedule
the completion of the original bio to the same workqueue again but just
execute it directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of attaching an extra allocation an indirect call to each
low-level bio issued by the RAID code, add a work_struct to struct
btrfs_raid_bio and only defer the per-rbio completion action. The
per-bio action for all the I/Os are trivial and can be safely done
from interrupt context.
As a nice side effect this also allows sharing the boilerplate code
for the per-bio completions
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Split btrfs_submit_data_bio into one helper for reads and one for writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no exit block and cleanup and the function is reasonably short
so we can use inline return and not the goto. This makes the function
more straight forward.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Assign ->mirror_num and ->bi_status in btrfs_end_bioc instead of
duplicating the logic in the callers. Also remove the bio argument as
it always must be bioc->orig_bio and the now pointless bioc_error that
did nothing but assign bi_sector to the same value just sampled in the
caller.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the new support is implemented, allow the ioctl to accept v2
and the compressed flag, and update the version in sysfs.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all of the pieces are in place, we can use the ENCODED_WRITE
command to send compressed extents when appropriate.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For encoded writes in send v2, we will get the encoded data with
btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
pages. To avoid extra buffers and copies, we should read directly into
the send buffer. Therefore, we need the raw pages for the send buffer.
We currently allocate the send buffer with kvmalloc(), which may return
a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
However, the buffer size we use (144K) is not a power of two, which in
theory is not guaranteed to return a page-aligned buffer, and in
practice would waste a lot of memory due to rounding up to the next
power of two. 144K is large enough that it usually gets allocated with
vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
and save the pages in an array.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The length field of the send stream TLV header is 16 bits. This means
that the maximum amount of data that can be sent for one write is 64K
minus one. However, encoded writes must be able to send the maximum
compressed extent (128K) in one command, or more. To support this, send
stream version 2 encodes the DATA attribute differently: it has no
length field, and the length is implicitly up to the end of containing
command (which has a 32bit length field). Although this is necessary
for encoded writes, normal writes can benefit from it, too.
Also add a check to enforce that the DATA attribute is last. It is only
strictly necessary for v2, but we might as well make v1 consistent with
it.
For v2, let's bump up the send buffer to the maximum compressed extent
size plus 16K for the other metadata (144K total). Since this will most
likely be vmalloc'd (and always will be after the next commit), we round
it up to the next page since we might as well use the rest of the page
on systems with >16K pages.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds the definitions of the new commands for send stream version 2
and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a.
chattr), and encoded writes. It also documents two changes to the send
stream format in v2: the receiver shouldn't assume a maximum command
size, and the DATA attribute is encoded differently to allow for writes
larger than 64k. These will be implemented in subsequent changes, and
then the ioctl will accept the new version and flag.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit e77fbf9903 ("btrfs: send: prepare for v2 protocol") added
_BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
version plus 1, but as written this creates gaps in the number space.
The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
23 and 24 are valid commands.
Instead, let's explicitly number all of the commands, attributes, and
sentinel MAX constants.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We collect these statistics but have never exposed them in any way. I
also didn't find any patches that ever attempted to make use of them.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adds write-only trigger to force new chunk allocation for a given block
group type. It is at
/sys/fs/btrfs/<uuid>/allocation/<type>/force_chunk_alloc
Note: this is now only for debugging and testing and is enabled with the
CONFIG_BTRFS_DEBUG configuration option. The transaction is
started from sysfs context and can be problematic in some cases.
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ Changes from the original submission:
- update changelog
- drop unnecessary error messages
- switch value to bool and use kstrtobool
- move BTRFS_ATTR_W definition
- add comment for using transaction
]
Signed-off-by: David Sterba <dsterba@suse.com>
Add new sysfs knob
/sys/fs/btrfs/<uuid>/allocation/<type>/chunk_size.
This allows to query the chunk size and also set the chunk size.
Constraints:
- can be changed by root only
- system chunk size can't be set
- maximum chunk size is 10% of the filesystem size
- final value is rounded down to a multiple of 256M
- cannot be set on zoned filesystem
Note, that rounding and the 10% clamp will result to a different value
on filesystems smaller than 10G, typically 768M.
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ Changes to original submission:
- document setting constraints
- drop read-only requirement
- drop unnecessary error messages
- fix return values of _store callback
- use memparse for the value
- fix rounding down to 256M
]
Signed-off-by: David Sterba <dsterba@suse.com>
The chunk size is stored in the btrfs_space_info structure. It is
initialized at the start and is then used.
A new API is added to update the current chunk size. This API is used
to be able to expose the chunk_size as a sysfs setting.
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename and merge helpers, switch atomic type to u64, style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
While running generic/475 in a loop I got the following error
BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
<snip>
item 65 key (263 96 517) itemoff 14132 itemsize 33
item 66 key (263 96 523) itemoff 14099 itemsize 33
item 67 key (263 96 525) itemoff 14066 itemsize 33
item 68 key (263 96 531) itemoff 14033 itemsize 33
item 69 key (263 96 524) itemoff 14000 itemsize 33
As you can see here we have 3 dir index keys with the dir index value of
523, 524, and 525 inserted between 517 and 524. This occurs because our
dir index insertion code will bulk insert all dir index items on the
node regardless of their actual key value.
This makes sense on a normally running system, because if there's a gap
in between the items there was a deletion before the item was inserted,
so there's not going to be an overlap of the dir index items that need
to be inserted and what exists on disk.
However during log replay this isn't necessarily true, we could have any
number of dir indexes in the tree already.
Fix this by seeing if we're replaying the log, and if we are simply skip
batching if there's a gap in the key space.
This file system was left broken from the fstest, I tested this patch
against the broken fs to make sure it replayed the log properly, and
then btrfs checked the file system after the log replay to verify
everything was ok.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we want to create a new dir index item (when creating an inode,
create a hard link, rename a file) we reserve 1 unit of metadata space
for it in a transaction (that's 256K for a node/leaf size of 16K), and
then create a delayed insertion item for it to be added later to the
subvolume's tree. That unit of metadata is kept until the delayed item
is inserted into the subvolume tree, which may take a while to happen
(in the worst case, it's done only when the transaction commits). If we
have multiple dir index items to insert for the same directory, say N
index items, and they all fit in a single leaf of metadata, then we are
holding N units of reserved metadata space when all we need is 1 unit.
This change addresses that, whenever a new delayed dir index item is
added, we release the unit of metadata the caller has reserved when it
started the transaction if adding that new dir index item does not
result in touching one more metadata leaf, otherwise the reservation
is kept by transferring it from the transaction block reserve to the
delayed items block reserve, just like before. Given that with a leaf
size of 16K we can have a few hundred dir index items in a single leaf
(the exact value depends on file name lengths), this reduces pressure on
metadata reservation by releasing unnecessary space much sooner.
The following fs_mark test showed some improvement when creating many
files in parallel on machine running a non debug kernel (debian's default
kernel config) with 12 cores:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before:
FSUse% Count Size Files/sec App Overhead
2 1200000 0 225991.3 5465891
4 2400000 0 345728.1 5512106
4 3600000 0 346959.5 5557653
8 4800000 0 329643.0 5587548
8 6000000 0 312657.4 5606717
8 7200000 0 281707.5 5727985
12 8400000 0 88309.8 5020422
12 9600000 0 85835.9 5207496
16 10800000 0 81039.2 5404964
16 12000000 0 58548.6 5842468
After:
FSUse% Count Size Files/sec App Overhead
2 1200000 0 230604.5 5778375
4 2400000 0 348908.3 5508072
4 3600000 0 357028.7 5484337
6 4800000 0 342898.3 5565703
6 6000000 0 314670.8 5751555
8 7200000 0 282548.2 5778177
12 8400000 0 90844.9 5306819
12 9600000 0 86963.1 5304689
16 10800000 0 89113.2 5455248
16 12000000 0 86693.5 5518933
The "after" results are after applying this patch and all the other
patches in the same patchset, which is comprised of the following
changes:
btrfs: balance btree dirty pages and delayed items after a rename
btrfs: free the path earlier when creating a new inode
btrfs: balance btree dirty pages and delayed items after clone and dedupe
btrfs: add assertions when deleting batches of delayed items
btrfs: deal with deletion errors when deleting delayed items
btrfs: refactor the delayed item deletion entry point
btrfs: improve batch deletion of delayed dir index items
btrfs: assert that delayed item is a dir index item when adding it
btrfs: improve batch insertion of delayed dir index items
btrfs: do not BUG_ON() on failure to reserve metadata for delayed item
btrfs: set delayed item type when initializing it
btrfs: reduce amount of reserved metadata for delayed item insertion
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we set the type of a delayed item only after successfully
inserting it into its respective rbtree. This is fine, as the type
is not used anywhere before that point, but for the next patch in the
series, there will be the need to check the type of a delayed item
before inserting it into a rbtree.
So set the type of a delayed item immediately after allocating it.
This also makes the trivial wrappers for adding insertion and deletion
useless, so it removes them as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_insert_delayed_dir_index(), we don't expect the metadata
reservation for the delayed dir index item insertion to fail, because the
caller is supposed to have reserved 1 unit of metadata space for that.
All callers are able to deal with an error in case that happens, so there
is no need for something so drastic as a BUG_ON() in case of failure.
Instead just emit a warning, so that's easily noticed during development
(fstests in particular), and return the error to the caller.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we group delayed dir index items for insertion as a single batch
(a single btree operation) as long as their keys are sequential in the key
space.
For example we have delayed index items for the following index keys:
10, 11, 12, 15, 16, 20, 21
We end up building three batches:
1) First one for index keys 10, 11 and 12;
2) Second one for index keys 15 and 16;
3) Third one for index keys 20 and 21.
However, since the dir index numbers come from a monotonically increasing
counter and are never reused, we could group all these items into a single
batch. The existence of holes in the sequence happens only when we had
delayed dir index items for insertion that got deleted before they were
flushed to the subvolume's tree.
The delayed items are stored in a rbtree based on their key order, so
we can just group items into a batch as long as they all fit in a leaf,
and ignore if there's a gap (key offset, index number) between two
consecutive items. This is more efficient and reduces the amount of
time spent when running delayed items if there are gaps between dir
index items.
For example running the following test script:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
NUM_FILES=100
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
# Now delete every other file, to create gaps in the dir index keys.
for ((i = 1; i <= $NUM_FILES; i += 2)); do
rm -f $MNT/testdir/file_$i
done
start=$(date +%s%N)
sync
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo -e "\nsync took $dur milliseconds"
umount $MNT
While having the following bpftrace script running in another shell:
$ cat bpf-delayed-items-inserts.sh
#!/usr/bin/bpftrace
/* Must add 'noinline' to btrfs_insert_delayed_items(). */
k:btrfs_insert_delayed_items
{
@start_insert_delayed_items[tid] = nsecs;
}
k:btrfs_insert_empty_items
/@start_insert_delayed_items[tid]/
{
@insert_batches = count();
}
kr:btrfs_insert_delayed_items
/@start_insert_delayed_items[tid]/
{
$dur = (nsecs - @start_insert_delayed_items[tid]) / 1000;
@btrfs_insert_delayed_items_total_time = sum($dur);
delete(@start_insert_delayed_items[tid]);
}
Before this change:
@btrfs_insert_delayed_items_total_time: 576
@insert_batches: 51
After this change:
@btrfs_insert_delayed_items_total_time: 174
@insert_batches: 2
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All delayed items are for dir index items, we don't support any other item
types at the moment. So simplify __btrfs_add_delayed_item() and add an
assertion for checking the item's key type. This also allows the next
change to be simpler and avoid to check key types. In case we add support
for different item types in the future, then we'll hit the assertion
during development and be able to adjust any code that is assuming delayed
items are always associated to dir index items.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we group delayed dir index items for deletion in a single batch
(single btree operation) as long as they all exist in the same leaf and as
long as their keys are sequential in the key space. For example if we have
a leaf that has dir index items with offsets:
2, 3, 4, 6, 7, 10
And we have delayed dir index items for deleting all these indexes, and
no delayed items for any other index keys in between, then we end up
deleting in 3 batches:
1) First batch for indexes 2, 3 and 4;
2) Second batch for indexes 6 and 7;
3) Third batch for index 10.
This is a waste because we can delete all the index keys in a single
batch. What matters is that each consecutive delayed index key matches
each consecutive dir index key in a leaf.
So update the logic at btrfs_batch_delete_items() to check only for a
key match between delayed dir index items and dir index items in a leaf.
Also avoid the useless first iteration on comparing the key of the
first slot to delete with the key of the first delayed item, as it's
silly since they always match, as the delayed item's key was used for
the btree search that gave us the path we have.
This is more efficient and reduces runtime of running delayed items, as
well as lock contention on the subvolume's tree.
For example, the following test script:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
NUM_FILES=1000
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
# Now delete every other file, to create gaps in the dir index keys.
for ((i = 1; i <= $NUM_FILES; i += 2)); do
rm -f $MNT/testdir/file_$i
done
# Sync to force any delayed items to be flushed to the tree.
sync
start=$(date +%s%N)
rm -fr $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo -e "\nrm -fr took $dur milliseconds"
umount $MNT
Running that test script while having the following bpftrace script
running in another shell:
$ cat bpf-measure.sh
#!/usr/bin/bpftrace
/* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */
k:btrfs_delete_delayed_items
{
@start_delete_delayed_items[tid] = nsecs;
}
k:btrfs_del_items
/@start_delete_delayed_items[tid]/
{
@delete_batches = count();
}
kr:btrfs_delete_delayed_items
/@start_delete_delayed_items[tid]/
{
$dur = (nsecs - @start_delete_delayed_items[tid]) / 1000;
@btrfs_delete_delayed_items_total_time = sum($dur);
delete(@start_delete_delayed_items[tid]);
}
Before this change:
@btrfs_delete_delayed_items_total_time: 9563
@delete_batches: 1001
After this change:
@btrfs_delete_delayed_items_total_time: 7328
@delete_batches: 509
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The delayed item deletion entry point, btrfs_delete_delayed_items(), is a
bit convoluted for a few reasons:
1) It's really a loop disguised with labels and goto statements;
2) There's a 'delete_fail' label which isn't only for error cases, we can
jump to that label even if no error happened, if we simply don't have
more delayed items to delete;
3) Unnecessarily keeps track of the current and previous items for no
good reason, as after getting the next item and releasing the current
one, it just jumps to the 'again' label just to look again for the
first delayed item;
4) When a delayed item is not in the tree (because it was already deleted
before), it releases the item while holding a path locked, which is
not necessary and adds more contention to the tree, specially taking
into account that the path came from a deletion search, meaning we have
write locks for nodes at levels 2, 1 and 0. And releasing the item is
not computationally trivial (rb tree deletion, a kfree() and some
trivial things).
So refactor it to use a while loop and add some comments to make it more
obvious why we can have delayed items without a matching item in the tree
as well as why not keep the delayed node locked all the time when running
all its deletion items. This is also a preparation for some upcoming work
involving delayed items.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, btrfs_delete_delayed_items() ignores any errors returned from
btrfs_batch_delete_items(). This looks fishy but it's not a problem at
the moment because:
1) Two of the errors returned from btrfs_batch_delete_items() are for
impossible cases, cases where a delayed item does not match any item
in the leaf the path points to - btrfs_delete_delayed_items() always
calls btrfs_batch_delete_items() with a path that points to a leaf
that contains an item matching a delayed item;
2) btrfs_batch_delete_items() may return an error from btrfs_del_items(),
in which case it does not release the delayed items of the batch.
At the moment this is harmless because btrfs_del_items() actually is
always able to delete items, even if it returns an error - when it
returns an error it's because it ended up with a leaf mostly empty
(less than 1/3 full) and failed to migrate items from that leaf into
its neighbour leaves - this is not critical, as all the items were
deleted, we just left the tree a bit unbalanced, but it's still a
valid tree and causes no harm, and future operations on the tree will
eventually balance it.
So even if we get an error from btrfs_del_items(), the delayed items
will not be released but the next time we run delayed items we will
find out, at btrfs_delete_delayed_items(), that they are not present
in the tree anymore and then release them.
This is all a bit subtle, and it's certainly prone to be a disaster in
case btrfs_del_items() changes one day and may return errors before being
able to delete all the requested items, in which case we could leave the
filesystem in an inconsistent state as we would commit a transaction
despite a failure from deleting items from the tree.
So make btrfs_delete_delayed_items() check for any errors from the call
to btrfs_batch_delete_items().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a few impossible cases that btrfs_batch_delete_items() tries to
deal with:
1) Getting a path pointing to a NULL leaf;
2) The leaf slot is pointing beyond the last item in the leaf;
3) We can't find a single item to delete.
The first case is impossible because the given path was returned by a
successful call to btrfs_search_slot(). Replace the BUG_ON() with an
ASSERT for this.
The second case is impossible because we are always called when a delayed
item matches an item in the given leaf. So add an ASSERT() for that and
if that condition is not satisfied, trigger a warning and return an error.
The third case is impossible exactly because of the same reason as the
second case. The given delayed item matches one item in the leaf, so we
know that our batch always has at least one item. Add an ASSERT to check
that, trigger a warning if that expectation fails and return an error.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When reflinking extents (clone and deduplication), we need to touch the
btree of the destination inode's subvolume, as well as potentially
create a delayed inode for the destination inode (if it was not created
before). However we are neither balancing the btree dirty pages nor the
delayed items after such operations, so if we have a task that is doing
a long series of clone or deduplication operations, it can result in
accumulation of too many btree dirty pages and delayed items.
So just call btrfs_btree_balance_dirty() after clone and deduplication,
just like we do for every other system call that results on modifying a
btree and adding delayed items.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating an inode, through btrfs_create_new_inode(), we release the
path we allocated before once we don't need it anymore. But we keep it
allocated until we return from that function, which is wasteful because
after we release the path we do several things that can allocate yet
another path: inheriting properties, setting the xattrs used by ACLs and
secutiry modules, adding an orphan item (O_TMPFILE case) or adding a
dir item (for the non-O_TMPFILE case).
So instead of releasing the path once we don't need it anymore, free it
instead. This way we avoid having two paths allocated until we return
from btrfs_create_new_inode().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A rename operation modifies a subvolume's btree, to remove the old dir
item, add the new dir item, remove an inode ref and add a new inode ref.
It can also create the delayed inode for the inodes involved in the
operation, and it creates two delayed dir index items, one to delete
the old name and another one to add the new name.
However we are neither balancing the btree dirty pages nor the delayed
items after a rename, which can result in accumulation of too many
btree dirty pages and delayed items, specially if a task is doing a
series of rename operations (for example it can happen for package
installations/upgrades through the zypper tool).
So just call btrfs_btree_balance_dirty() after a rename, just like we
do for every other system call that results on modifying a btree and
adding delayed items.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add tracepoint for better insight to how the RAID56 data are submitted.
The output looks like this: (trace event header and UUID skipped)
raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
The above debug output is from a 32K data write into an empty RAID56
data chunk.
Some explanation on the event output:
full_stripe: the logical bytenr of the full stripe
devid: btrfs devid
type: raid stripe type.
DATA1: the first data stripe
DATA2: the second data stripe
PQ1: the P stripe
PQ2: the Q stripe
offset: the offset inside the stripe.
opf: the bio op type
physical: the physical offset the bio is for
len: the length of the bio
The first two lines are from partial RMW read, which is reading the
remaining data stripes from disks.
The last two lines are for full stripe RMW write, which is writing the
involved two 16K stripes (one for DATA1 stripe, one for P stripe).
The stripe for DATA2 doesn't need to be written.
There are 5 types of trace events:
- raid56_read_partial
Read remaining data for regular read/write path.
- raid56_write_stripe
Write the modified stripes for regular read/write path.
- raid56_scrub_read_recover
Read remaining data for scrub recovery path.
- raid56_scrub_write_stripe
Write the modified stripes for scrub path.
- raid56_scrub_read
Read remaining data for scrub path.
Also, since the trace events are included at super.c, we have to export
needed structure definitions to 'raid56.h' and include the header in
super.c, or we're unable to access those members.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With added debugging, it turns out the following write sequence would
cause extra read which is unnecessary:
# xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
-c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
$mnt/file
The debug message looks like this (btrfs header skipped):
partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
^^^^
Still partial read, even 389152768 is already cached by the first.
write.
full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
^^^^
Still partial read for 298844160.
full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
This means every 32K writes, even they are in the same full stripe,
still trigger read for previously cached data.
This would cause extra RAID56 IO, making the btrfs raid56 cache useless.
[CAUSE]
Commit d4e28d9b5f ("btrfs: raid56: make steal_rbio() subpage
compatible") tries to make steal_rbio() subpage compatible, but during
that conversion, there is one thing missing.
We no longer rely on PageUptodate(rbio->stripe_pages[i]), but
rbio->stripe_nsectors[i].uptodate to determine if a sector is uptodate.
This means, previously if we switch the pointer, everything is done,
as the PageUptodate flag is still bound to that page.
But now we have to manually mark the involved sectors uptodate, or later
raid56_rmw_stripe() will find the stolen sector is not uptodate, and
assemble the read bio for it, wasting IO.
[FIX]
We can easily fix the bug, by also update the
rbio->stripe_sectors[].uptodate in steal_rbio().
With this fixed, now the same write pattern no longer leads to the same
unnecessary read:
partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
^^^ No more partial read, directly into the write path.
full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
Fixes: d4e28d9b5f ("btrfs: raid56: make steal_rbio() subpage compatible")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both memzero_page and memcpy_to_page already call flush_dcache_page so
we can remove the calls from btrfs code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have only 8K partial write at the beginning of a full RAID56
stripe, we will write the following contents:
0 8K 32K 64K
Disk 1 (data): |XX| | |
Disk 2 (data): | | |
Disk 3 (parity): |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
|X| means the sector will be written back to disk.
Note that, although we won't write any sectors from disk 2, but we will
write the full 64KiB of parity to disk.
This behavior is fine for now, but not for the future (especially for
RAID56J, as we waste quite some space to journal the unused parity
stripes).
So here we will also utilize the btrfs_raid_bio::dbitmap, anytime we
queue a higher level bio into an rbio, we will update rbio::dbitmap to
indicate which vertical stripes we need to writeback.
And at finish_rmw(), we also check dbitmap to see if we need to write
any sector in the vertical stripe.
So after the patch, above example will only lead to the following
writeback pattern:
0 8K 32K 64K
Disk 1 (data): |XX| | |
Disk 2 (data): | | |
Disk 3 (parity): |XX| | |
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we use "unsigned long *" for those two bitmaps.
But since we only support fixed stripe length (64KiB, already checked in
tree-checker), "unsigned long *" is really a waste of memory, while we
can just use "unsigned long".
This saves us 8 bytes in total for scrub_parity.
To be extra safe, add an ASSERT() making sure calclulated @nsectors is
always smaller than BITS_PER_LONG.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previsouly we use "unsigned long *" for those two bitmaps.
But since we only support fixed stripe length (64KiB, already checked in
tree-checker), "unsigned long *" is really a waste of memory, while we
can just use "unsigned long".
This saves us 8 bytes in total for btrfs_raid_bio.
To be extra safe, add an ASSERT() making sure calculated
@stripe_nsectors is always smaller than BITS_PER_LONG.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This eliminates 2 labels and makes the code generally more streamlined.
Also rename the 'out_bargs' label to 'out_unlock' since bargs is going
to be freed under the 'out' label. This also fixes a memory leak since
bargs wasn't correctly freed in one of the condition which are now moved
in btrfs_try_lock_balance.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function contains the factored out locking sequence of
btrfs_ioctl_balance. Having this piece of code separate helps to
simplify btrfs_ioctl_balance which has too complicated. This will be
used in the next patch to streamline the logic in btrfs_ioctl_balance.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the new btrfs_bio_for_each_sector iterator to simplify
btrfs_check_read_dio_bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a helper that works similar to __bio_for_each_segment, but instead of
iterating over PAGE_SIZE chunks it iterates over each sector.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: split from a larger patch, and iterate over the offset instead of
the offset bits]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add parameter comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add a helper to find the csum for a byte offset into the csum buffer.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Untangle the goto and move the code it jumps to so it goes in the order
of the most likely states first.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add a helper to end I/O on a single sector, which will come in handy
with the new read repair code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function submit_data_read_repair() is only called for buffered data
read path, thus those members can be calculated using bvec directly:
- start
start = page_offset(bvec->bv_page) + bvec->bv_offset;
- end
end = start + bvec->bv_len - 1;
- page
page = bvec->bv_page;
- pgoff
pgoff = bvec->bv_offset;
Thus we can safely replace those 4 parameters with just one bio_vec.
Also remove the unused return value.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: also remove the return value]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have several data csum verification code, we never have a
function really just to verify checksum for one sector.
Function check_data_csum() do extra work for error reporting, thus it
requires a lot of extra things like file offset, bio_offset etc.
Function btrfs_verify_data_csum() is even worse, it will utilize page
checked flag, which means it can not be utilized for direct IO pages.
Here we introduce a new helper, btrfs_check_sector_csum(), which really
only accept a sector in page, and expected checksum pointer.
We use this function to implement check_data_csum(), and export it for
incoming patch.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: keep passing the csum array as an arguments, as the callers want
to print it, rename per request]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following functions do special handling for RAID56 chunks:
- btrfs_is_parity_mirror()
Check if the range is in RAID56 chunks.
- btrfs_full_stripe_len()
Either return sectorsize for non-RAID56 profiles or full stripe length
for RAID56 chunks.
But if a filesystem without any RAID56 chunks, it will not have RAID56
incompat flags, and we can skip the chunk tree looking up completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's
use it instead of IS_ALIGNED and passing PAGE_SIZE directly.
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the comment to represent the actual logic used for sb_write_pointer
- Empty[0] && In use[1] should be an invalid state instead of returning
zone 0 wp
- Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp
- In use[0] && Empty[1] should be returning zone 0 wp instead of being an
invalid state
- In use[0] && Full[1] should be returning zone 0 wp instead of returning
zone 1 wp
- Full[0] && Empty[1] should be returning zone 1 wp instead of returning
zone 0 wp
- Full[0] && In use[1] should be returning zone 1 wp instead of returning
zone 0 wp
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLRpPgACgkQxWXV+ddt
WDtu/BAAnfx7CXKIfWKpz6FZEio9Qb3mUHVOglyKzqR0qB72OdrC1dQMvEWPJc6h
N65di6+8tTNmRIlaFBMU0MDHODR2aDRpDtlR9eUzUuidTc4iOp1fi31uBwl31r7b
k8mCZBc/IAdfH13lBtcfkb2HGid7rik5ZC6Kx/glMcqh647QkSMAleupUsIYHKsK
IgcUWuN3wFIUK2WVgsja7+ljlwIHBHKRp9yrEYw+ef/B0NCNKvOnrIOPJzO7nxMP
1FbqJ6F7u7HjoMFcMwn5rbV/BoIwSSvXyKRqOW+EhGeQR/imVmkH9jXJ7wXdblSz
IvSqaZ0DaWWSvivdMpwbr8Z0Cu4iIYhVY6PSA0hukR63qB5GwKKJ6j1L0zoYoz8C
IDWJPW03FNRIu5ZOduvUQ3qG7jcJQZ3WPCCfrDST1cO2xHT/7f65Tjz4k0hvp4za
edITetC1mEv310CHeGsJaLxGYPNrRe38VZYPxgJ7yFpteGYjh0ZwsuyUHb4MH1no
JWwgElNW+m1BatdWSUBYk6xhqod1s2LOFPNqo7jNlv8I27hPViqCbBA2i9FlkXf+
FwL5kWyJXs69gfjUIj59381Z0U1VdA1tvU8GP2m2+JvIDS6ooAcZj7yEQ69mCZxi
2RFJIU0NFbnc/5j2ARSzOTGs9glDD0yffgXJM+cK+TWsQ3AC31I=
=/47A
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs reverts from David Sterba:
"Due to a recent report [1] we need to revert the radix tree to xarray
conversion patches.
There's a problem with sleeping under spinlock, when xa_insert could
allocate memory under pressure. We use GFP_NOFS so this is a real
problem that we unfortunately did not discover during review.
I'm sorry to do such change at rc6 time but the revert is IMO the
safer option, there are patches to use mutex instead of the spin locks
but that would need more testing. The revert branch has been tested on
a few setups, all seem ok.
The conversion to xarray will be revisited in the future"
Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1]
* tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: turn delayed_nodes_tree into an XArray"
Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"
Revert "btrfs: turn fs_info member buffer_radix into XArray"
Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
This reverts commit 253bf57555.
Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.
Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.
[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 4076942021.
Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.
Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.
[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 8ee922689d.
Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.
Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.
[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 48b36a602a.
Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.
Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.
[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags.
Acked-by: David Sterba <dsterba@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-51-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLMiQQACgkQxWXV+ddt
WDvBQg/+I1ebfW2DFY8kBwy7c1qKZWIhNx1VVk2AegIXvrW/Tos7wp5O6fi7p/jL
d6k8zO/zFLlfiI4Ckmz3gt7cxaMTNXxr6+GQpNNm1b92Wdcy1a+3gquzcehT9Q10
ZB4ecPWzEDXgORvdBYG2eD2Z8PrsF0Wu88XRDiiJOBQLjZ+k2sVp8QvJlOllLDoC
m7rPoq98jC6VpZwFJ+fGk2jC7y4+1QXrOuQMy7LRTe59Thp6wUFDDPtkKfr5scDC
UxkctlUdInD7A6DVvPzwaBFNoT8UeEByGHcMd3KjjrTdmqSWW6k8FiF4ckZwA3zJ
oPdJVzdC5a2W7t6BHw+t7VNmkKd+swnr2sVSGQ8eIzF7z3/JSqyYVwziOD1YzAdU
QUmawWm4/SFvsbO8aoLrEKNbUiTgQwVbKzJh4Dhu9VJ43jeCwCX7pa/uZI4evgyG
T0tuwm58bWCk4y1o1fcFYgf4JcVgK23F2vKckUFZeHoV3Q8R0DnPCCGTqs1qT5vY
irZ9AIawmaR09JptMjjsAEjDA9qb16Ut/J6/anukyCgL610EyYZG7zb1WH1cUD1o
zNXY6O/iKyNdiXj7V1fTMiG/M8hGDcFu4pOpBk3hFjHEXX9BefoVC0J5YzvCecPz
isqboD5Lt1I4mrzac1X+serMYfVbFH6+tsEPBQBZf/o/a0u43jI=
=Cxvn
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A more fixes that seem to me to be important enough to get merged
before release:
- in zoned mode, fix leak of a structure when reading zone info, this
happens on normal path so this can be significant
- in zoned mode, revert an optimization added in 5.19-rc1 to finish a
zone when the capacity is full, but this is not reliable in all
cases
- try to avoid short reads for compressed data or inline files when
it's a NOWAIT read, applications should handle that but there are
two, qemu and mariadb, that are affected"
* tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: drop optimization of zone finish
btrfs: zoned: fix a leaked bioc in read_zone_info
btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only
when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we
wrote fully into the zone.
The assumption is determined by "alloc_offset == capacity". This condition
won't work if the last ordered extent is canceled due to some errors. In
that case, we consider the zone is deactivated without sending the finish
command while it's still active.
This inconstancy results in activating another block group while we cannot
really activate the underlying zone, which causes the active zone exceeds
errors like below.
BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0
nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
Fix the issue by removing the optimization for now.
Fixes: 8376d9e1ed ("btrfs: zoned: finish superblock zone once no space left for new SB")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bioc would leak on the normal completion path and also on the RAID56
check (but that one won't happen in practice due to the invalid
combination with zoned mode).
Fixes: 7db1c5d14d ("btrfs: zoned: support dev-replace in zoned filesystems")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent) so that we fallback to
buffered IO. This however is not ideal in case we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics for buffered IO, so if we need to
fallback to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.
This behaviour can also result in short reads being returned to user
space, which although it's not incorrect and user space should be able
to deal with partial reads, it's somewhat surprising and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).
The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent we
return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Than at filemap_read() if we get a page fault error
when accessing the read buffer, we return a partial read instead of an
-EFAULT error, because the number of bytes previously read is greater
than zero.
So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.
This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.
In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.
The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40
[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmK4dV4ACgkQxWXV+ddt
WDs4uQ/7B0XqPK05NJntJfwnuIoT/yOreKf47wt/6DyFV3CDMFte/qzaZwthwu6P
F0GMpSYAlVszLlML5elvF9VXymlV+e+QROtbD6QCNLNW1IwHA7ZiF5fV/a1Rj930
XSuaDyVFPAK7892RR6yMQ20IeMBuvqiAhXWEzaIJ2tIcAHn+fP+VkY8Nc0aZj3iC
mI+ep4n93karDxmnHVGUxJTxAe0l/uNopx+fYBWQDj7HuoMLo0Cu+rAdv0gRIxi2
RWUBkR4e4PBwV1OFScwNCsljjt6bHdUHrtdB3fo5Hzu9cO5hHdL7NEsKB1K2w7rV
bgNuNqfj6Y4xUBchAfQO5CCJ9ISci5KoJ4RBpk6EprZR3QN40kN8GPlhi2519K7w
F3d8jolDDHlkqxIsqoe47MYOcSepNEadVNsiYKb0rM6doilfxyXiu6dtTFMrC8Vy
K2HDCdTyuIgw+TnwqT1puaUwxiIL8DFJf1CVyjwGuQ4UgaIEkHXKIsCssyyJ76Jh
QkWX1aeRldbfkVArJWHQWqDQopx9pFBz1gjlws0YjAsU5YijOOXva464P9Rxg+Gq
4pRlgnO48joQam9bRirP2Z6yhqa4O6jkzKDOXSYduAUYD7IMfpsYnz09wKS95jj+
QCrR7VmKnpQdsXg5a/mqyacfIH30ph002VywRxPiFM89Syd25yo=
=rUrf
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- zoned relocation fixes:
- fix critical section end for extent writeback, this could lead
to out of order write
- prevent writing to previous data relocation block group if space
gets low
- reflink fixes:
- fix race between reflinking and ordered extent completion
- proper error handling when block reserve migration fails
- add missing inode iversion/mtime/ctime updates on each iteration
when replacing extents
- fix deadlock when running fsync/fiemap/commit at the same time
- fix false-positive KCSAN report regarding pid tracking for read locks
and data race
- minor documentation update and link to new site
* tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Documentation: update btrfs list of features and link to readthedocs.io
btrfs: fix deadlock with fsync+fiemap+transaction commit
btrfs: don't set lock_owner when locking extent buffer for reading
btrfs: zoned: fix critical section of relocation inode writeback
btrfs: zoned: prevent allocation from previous data relocation BG
btrfs: do not BUG_ON() on failure to migrate space when replacing extents
btrfs: add missing inode updates on each iteration when replacing extents
btrfs: fix race between reflinking and ordered extent completion
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKxvkkACgkQxWXV+ddt
WDsQYhAAofZGaOdBwSDvGA4srB2ieDIFoMeNb1NYp2P5vafPo3Q5AAvgGAeKhp5x
g2C7W/8q2GMJ+B9SjyiBkVufuQmCWbFKxStQM3QysYoj/EyKyp7SXtO4YMWHz2T3
nfMMlPo2aNpr7Z2s+tcjhthq/hIvVFi6kweRFNvacM2bb/17IxgAdqLpQBqK5xe9
/IGSUTw75jSd2sZSyzBqrqshKDonmJ7u4qCV2X5hTPi8w4AUDERJrm0bOnikNXHx
4LnNDmSIA0BEXybHwEAShoK0ge66z1kP1UspQNB7pKriJcyroNPjgm/fMZJiRKIc
zEYEMSzTYQa5eDwhXCz5PCaPqY4y/ovfYCsmySVXt1a7wgplVl+vsOaesE2NFVCX
FE36d58L+4I8iTJhpVCNmEU9N/spfvAr3mBAcKCkbp9WKyGJ9/2yJpRThkV8Pw2Y
bzhFIYRs1CJvkK7P4Cp+FSfzJx6tvYAqblvE97VUt83PuqS1Fb49lKdr5DZnbplV
vDkewmvXSmHH9Ic5xBeTJXJZ+yeibk/0LSNEKczWva6f60h0ubF0OI6BzmS+NZbN
HyitKerX0ZyFi5VUOZ+PKzXfR3ZlX3SmjAcHrl9BjZjFOJkpxAx6yWBzdnkitb+O
fYyT68H4IetxwkghPVBv8qFCkuNy/i9NsEILcAAXd8CHGQlfwDA=
=eORM
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- print more error messages for invalid mount option values
- prevent remount with v1 space cache for subpage filesystem
- fix hang during unmount when block group reclaim task is running
* tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add error messages to all unrecognized mount options
btrfs: prevent remounting to v1 space cache for subpage mount
btrfs: fix hang during unmount when block group reclaim task is running
We are hitting the following deadlock in production occasionally
Task 1 Task 2 Task 3 Task 4 Task 5
fsync(A)
start trans
start commit
falloc(A)
lock 5m-10m
start trans
wait for commit
fiemap(A)
lock 0-10m
wait for 5m-10m
(have 0-5m locked)
have btrfs_need_log_full_commit
!full_sync
wait_ordered_extents
finish_ordered_io(A)
lock 0-5m
DEADLOCK
We have an existing dependency of file extent lock -> transaction.
However in fsync if we tried to do the fast logging, but then had to
fall back to committing the transaction, we will be forced to call
btrfs_wait_ordered_range() to make sure all of our extents are updated.
This creates a dependency of transaction -> file extent lock, because
btrfs_finish_ordered_io() will need to take the file extent lock in
order to run the ordered extents.
Fix this by stopping the transaction if we have to do the full commit
and we attempted to do the fast logging. Then attach to the transaction
and commit it if we need to.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In 196d59ab9c "btrfs: switch extent buffer tree lock to rw_semaphore"
the functions for tree read locking were rewritten, and in the process
the read lock functions started setting eb->lock_owner = current->pid.
Previously lock_owner was only set in tree write lock functions.
Read locks are shared, so they don't have exclusive ownership of the
underlying object, so setting lock_owner to any single value for a
read lock makes no sense. It's mostly harmless because write locks
and read locks are mutually exclusive, and none of the existing code
in btrfs (btrfs_init_new_buffer and print_eb_refs_lock) cares what
nonsense is written in lock_owner when no writer is holding the lock.
KCSAN does care, and will complain about the data race incessantly.
Remove the assignments in the read lock functions because they're
useless noise.
Fixes: 196d59ab9c ("btrfs: switch extent buffer tree lock to rw_semaphore")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to
write out to the relocation inode. That critical section must include all
the IO submission for the inode. However, flush_write_bio() in
extent_writepages() is out of the critical section, causing an IO
submission outside of the lock. This leads to an out of the order IO
submission and fail the relocation process.
Fix it by extending the critical section.
Fixes: 35156d8527 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit 5f0addf7b8 ("btrfs: zoned: use dedicated lock for data
relocation"), we observe IO errors on e.g, btrfs/232 like below.
[09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs]
<snip>
[09.9][T4038707] Call Trace:
[09.5][T4038707] <TASK>
[09.3][T4038707] run_delalloc_nocow+0x7f1/0x11a0 [btrfs]
[09.6][T4038707] ? test_range_bit+0x174/0x320 [btrfs]
[09.2][T4038707] ? fallback_to_cow+0x980/0x980 [btrfs]
[09.3][T4038707] ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs]
[09.5][T4038707] btrfs_run_delalloc_range+0x445/0x1320 [btrfs]
[09.2][T4038707] ? test_range_bit+0x320/0x320 [btrfs]
[09.4][T4038707] ? lock_downgrade+0x6a0/0x6a0
[09.2][T4038707] ? orc_find.part.0+0x1ed/0x300
[09.5][T4038707] ? __module_address.part.0+0x25/0x300
[09.0][T4038707] writepage_delalloc+0x159/0x310 [btrfs]
<snip>
[09.4][ C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[09.5][ C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current]
[09.9][ C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command
[09.5][ C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00
[09.4][ C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
[09.9][ C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
The IO errors occur when we allocate a regular extent in previous data
relocation block group.
On zoned btrfs, we use a dedicated block group to relocate a data
extent. Thus, we allocate relocating data extents (pre-alloc) only from
the dedicated block group and vice versa. Once the free space in the
dedicated block group gets tight, a relocating extent may not fit into
the block group. In that case, we need to switch the dedicated block
group to the next one. Then, the previous one is now freed up for
allocating a regular extent. The BG is already not enough to allocate
the relocating extent, but there is still room to allocate a smaller
extent. Now the problem happens. By allocating a regular extent while
nocow IOs for the relocation is still on-going, we will issue WRITE IOs
(for relocation) and ZONE APPEND IOs (for the regular writes) at the
same time. That mixed IOs confuses the write pointer and arises the
unaligned write errors.
This commit introduces a new bit 'zoned_data_reloc_ongoing' to the
btrfs_block_group. We set this bit before releasing the dedicated block
group, and no extent are allocated from a block group having this bit
set. This bit is similar to setting block_group->ro, but is different from
it by allowing nocow writes to start.
Once all the nocow IO for relocation is done (hooked from
btrfs_finish_ordered_io), we reset the bit to release the block group for
further allocation.
Fixes: c2707a2556 ("btrfs: zoned: add a dedicated data relocation block group")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_replace_file_extents(), if we fail to migrate reserved metadata
space from the transaction block reserve into the local block reserve,
we trigger a BUG_ON(). This is because it should not be possible to have
a failure here, as we reserved more space when we started the transaction
than the space we want to migrate. However having a BUG_ON() is way too
drastic, we can perfectly handle the failure and return the error to the
caller. So just do that instead, and add a WARN_ON() to make it easier
to notice the failure if it ever happens (which is particularly useful
for fstests, and the warning will trigger a failure of a test case).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When replacing file extents, called during fallocate, hole punching,
clone and deduplication, we may not be able to replace/drop all the
target file extent items with a single transaction handle. We may get
-ENOSPC while doing it, in which case we release the transaction handle,
balance the dirty pages of the btree inode, flush delayed items and get
a new transaction handle to operate on what's left of the target range.
By dropping and replacing file extent items we have effectively modified
the inode, so we should bump its iversion and update its mtime/ctime
before we update the inode item. This is because if the transaction
we used for partially modifying the inode gets committed by someone after
we release it and before we finish the rest of the range, a power failure
happens, then after mounting the filesystem our inode has an outdated
iversion and mtime/ctime, corresponding to the values it had before we
changed it.
So add the missing iversion and mtime/ctime updates.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While doing a reflink operation, if an ordered extent for a file range
that does not overlap with the source and destination ranges of the
reflink operation happens, we can end up having a failure in the reflink
operation and return -EINVAL to user space.
The following sequence of steps explains how this can happen:
1) We have the page at file offset 315392 dirty (under delalloc);
2) A reflink operation for this file starts, using the same file as both
source and destination, the source range is [372736, 409600) (length of
36864 bytes) and the destination range is [208896, 245760);
3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
and destination ranges, and wait for any ordered extents in those range
to complete;
4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
the inode, but we neither wait for it to complete nor any ordered
extents to complete. This results in starting delalloc for the page at
file offset 315392 and creating an ordered extent for that single page
range;
5) We then move to btrfs_clone() and enter the loop to find file extent
items to copy from the source range to destination range;
6) In the first iteration we end up at last file extent item stored in
leaf A:
(...)
item 131 key (143616 108 315392) itemoff 5101 itemsize 53
extent data disk bytenr 1903988736 nr 73728
extent data offset 12288 nr 61440 ram 73728
This represents the file range [315392, 376832), which overlaps with
the source range to clone.
@datal is set to 61440, key.offset is 315392 and @next_key_min_offset
is therefore set to 376832 (315392 + 61440).
@off (372736) is > key.offset (315392), so @new_key.offset is set to
the value of @destoff (208896).
@new_key.offset == @last_dest_end (208896) so @drop_start is set to
208896 (@new_key.offset).
@datal is adjusted to 4096, as @off is > @key.offset.
So in this iteration we call btrfs_replace_file_extents() for the range
[208896, 212991] (a single page, which is
[@drop_start, @new_key.offset + @datal - 1]).
@last_dest_end is set to 212992 (@new_key.offset + @datal =
208896 + 4096 = 212992).
Before the next iteration of the loop, @key.offset is set to the value
376832, which is @next_key_min_offset;
7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
but this time pointing beyond the last slot of leaf A, as that's where
a key with offset 376832 should be at if it existed. So end up calling
btrfs_next_leaf();
8) btrfs_next_leaf() releases the path, but before it searches again the
tree for the next key/leaf, the ordered extent for the single page
range at file offset 315392 completes. That results in trimming the
file extent item we processed before, adjusting its key offset from
315392 to 319488, reducing its length from 61440 to 57344 and inserting
a new file extent item for that single page range, with a key offset of
315392 and a length of 4096.
Leaf A now looks like:
(...)
item 132 key (143616 108 315392) itemoff 4995 itemsize 53
extent data disk bytenr 1801666560 nr 4096
extent data offset 0 nr 4096 ram 4096
item 133 key (143616 108 319488) itemoff 4942 itemsize 53
extent data disk bytenr 1903988736 nr 73728
extent data offset 16384 nr 57344 ram 73728
9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
at slot 133, since it's the first key that follows what was the last
key we saw (143616 108 315392). In fact it's the same item we processed
before, but its key offset was changed, so it counts as a new key;
10) So now we have:
@key.offset == 319488
@datal == 57344
@off (372736) is > key.offset (319488), so @new_key.offset is set to
208896 (@destoff value).
@new_key.offset (208896) != @last_dest_end (212992), so @drop_start
is set to 212992 (@last_dest_end value).
@datal is adjusted to 4096 because @off > @key.offset.
So in this iteration we call btrfs_replace_file_extents() for the
invalid range of [212992, 212991] (which is
[@drop_start, @new_key.offset + @datal - 1]).
This range is empty, the end offset is smaller than the start offset
so btrfs_replace_file_extents() returns -EINVAL, which we end up
returning to user space and fail the reflink operation.
This all happens because the range of this file extent item was
already processed in the previous iteration.
This scenario can be triggered very sporadically by fsx from fstests, for
example with test case generic/522.
So fix this by having btrfs_clone() skip file extent items that cover a
file range that we have already processed.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb). Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... instead of messing with iocb flags
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Almost none of the errors stemming from a valid mount option but wrong
value prints a descriptive message which would help to identify why
mount failed. Like in the linked report:
$ uname -r
v4.19
$ mount -o compress=zstd /dev/sdb /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on
/dev/sdb, missing codepage or helper program, or other error.
$ dmesg
...
BTRFS error (device sdb): open_ctree failed
Errors caused by memory allocation failures are left out as it's not a
user error so reporting that would be confusing.
Link: https://lore.kernel.org/linux-btrfs/9c3fec36-fc61-3a33-4977-a7e207c3fa4e@gmx.de/
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Upstream commit 9f73f1aef9 ("btrfs: force v2 space cache usage for
subpage mount") forces subpage mount to use v2 cache, to avoid
deprecated v1 cache which doesn't support subpage properly.
But there is a loophole that user can still remount to v1 cache.
The existing check will only give users a warning, but does not really
prevent to do the remount.
Although remounting to v1 will not cause any problems since the v1 cache
will always be marked invalid when mounted with a different page size,
it's still better to prevent v1 cache at all for subpage mounts.
Fixes: 9f73f1aef9 ("btrfs: force v2 space cache usage for subpage mount")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we start an unmount, at close_ctree(), if we have the reclaim task
running and in the middle of a data block group relocation, we can trigger
a deadlock when stopping an async reclaim task, producing a trace like the
following:
[629724.498185] task:kworker/u16:7 state:D stack: 0 pid:681170 ppid: 2 flags:0x00004000
[629724.499760] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[629724.501267] Call Trace:
[629724.501759] <TASK>
[629724.502174] __schedule+0x3cb/0xed0
[629724.502842] schedule+0x4e/0xb0
[629724.503447] btrfs_wait_on_delayed_iputs+0x7c/0xc0 [btrfs]
[629724.504534] ? prepare_to_wait_exclusive+0xc0/0xc0
[629724.505442] flush_space+0x423/0x630 [btrfs]
[629724.506296] ? rcu_read_unlock_trace_special+0x20/0x50
[629724.507259] ? lock_release+0x220/0x4a0
[629724.507932] ? btrfs_get_alloc_profile+0xb3/0x290 [btrfs]
[629724.508940] ? do_raw_spin_unlock+0x4b/0xa0
[629724.509688] btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
[629724.510922] process_one_work+0x252/0x5a0
[629724.511694] ? process_one_work+0x5a0/0x5a0
[629724.512508] worker_thread+0x52/0x3b0
[629724.513220] ? process_one_work+0x5a0/0x5a0
[629724.514021] kthread+0xf2/0x120
[629724.514627] ? kthread_complete_and_exit+0x20/0x20
[629724.515526] ret_from_fork+0x22/0x30
[629724.516236] </TASK>
[629724.516694] task:umount state:D stack: 0 pid:719055 ppid:695412 flags:0x00004000
[629724.518269] Call Trace:
[629724.518746] <TASK>
[629724.519160] __schedule+0x3cb/0xed0
[629724.519835] schedule+0x4e/0xb0
[629724.520467] schedule_timeout+0xed/0x130
[629724.521221] ? lock_release+0x220/0x4a0
[629724.521946] ? lock_acquired+0x19c/0x420
[629724.522662] ? trace_hardirqs_on+0x1b/0xe0
[629724.523411] __wait_for_common+0xaf/0x1f0
[629724.524189] ? usleep_range_state+0xb0/0xb0
[629724.524997] __flush_work+0x26d/0x530
[629724.525698] ? flush_workqueue_prep_pwqs+0x140/0x140
[629724.526580] ? lock_acquire+0x1a0/0x310
[629724.527324] __cancel_work_timer+0x137/0x1c0
[629724.528190] close_ctree+0xfd/0x531 [btrfs]
[629724.529000] ? evict_inodes+0x166/0x1c0
[629724.529510] generic_shutdown_super+0x74/0x120
[629724.530103] kill_anon_super+0x14/0x30
[629724.530611] btrfs_kill_super+0x12/0x20 [btrfs]
[629724.531246] deactivate_locked_super+0x31/0xa0
[629724.531817] cleanup_mnt+0x147/0x1c0
[629724.532319] task_work_run+0x5c/0xa0
[629724.532984] exit_to_user_mode_prepare+0x1a6/0x1b0
[629724.533598] syscall_exit_to_user_mode+0x16/0x40
[629724.534200] do_syscall_64+0x48/0x90
[629724.534667] entry_SYSCALL_64_after_hwframe+0x44/0xae
[629724.535318] RIP: 0033:0x7fa2b90437a7
[629724.535804] RSP: 002b:00007ffe0b7e4458 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[629724.536912] RAX: 0000000000000000 RBX: 00007fa2b9182264 RCX: 00007fa2b90437a7
[629724.538156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000555d6cf20dd0
[629724.539053] RBP: 0000555d6cf20ba0 R08: 0000000000000000 R09: 00007ffe0b7e3200
[629724.539956] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[629724.540883] R13: 0000555d6cf20dd0 R14: 0000555d6cf20cb0 R15: 0000000000000000
[629724.541796] </TASK>
This happens because:
1) Before entering close_ctree() we have the async block group reclaim
task running and relocating a data block group;
2) There's an async metadata (or data) space reclaim task running;
3) We enter close_ctree() and park the cleaner kthread;
4) The async space reclaim task is at flush_space() and runs all the
existing delayed iputs;
5) Before the async space reclaim task calls
btrfs_wait_on_delayed_iputs(), the block group reclaim task which is
doing the data block group relocation, creates a delayed iput at
replace_file_extents() (called when COWing leaves that have file extent
items pointing to relocated data extents, during the merging phase
of relocation roots);
6) The async reclaim space reclaim task blocks at
btrfs_wait_on_delayed_iputs(), since we have a new delayed iput;
7) The task at close_ctree() then calls cancel_work_sync() to stop the
async space reclaim task, but it blocks since that task is waiting for
the delayed iput to be run;
8) The delayed iput is never run because the cleaner kthread is parked,
and no one else runs delayed iputs, resulting in a hang.
So fix this by stopping the async block group reclaim task before we
park the cleaner kthread.
Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>